IROS2025

Abstract:
Enabling embodied agents to complete complex human instructions from natural language is crucial to autonomous systems in household services. Conventional methods can only accomplish human instructions in a known environment where all interactive objects are provided to the embodied agent, and directly deploying existing approaches in an unknown environment usually generates infeasible plans that manipulate non-existent objects. In contrast, we propose an embodied instruction following (EIF) method for complex tasks in unknown environments, where the agent efficiently explores the environment to generate feasible plans with existing objects to accomplish abstract instructions. Specifically, we build a hierarchical embodied instruction following framework including a high-level task planner and a low-level exploration controller with multimodal large language models. We then construct a semantic representation map of the scene with dynamic region attention to represent the known visual clues, where the goals of task planning and scene exploration are aligned with the human instruction. For the task planner, we generate feasible step-by-step plans for human goal accomplishment according to the task completion process and the known visual clues. For the exploration controller, the optimal navigation or object interaction policy is predicted based on the generated step-wise plans and the known visual clues. The experimental results demonstrate that our method achieves a 45.09% success rate on 204 complex human instructions such as making breakfast and tidying rooms in large house-level scenes. Code and supplementary material are available at https://gary3410.github.io/eif_unknown/.

Abstract:
Language-driven grasp detection has the potential to revolutionize human-robot interaction by allowing robots to understand and execute grasping tasks based on natural language commands. However, existing approaches face two key challenges. First, they often struggle to interpret complex text instructions or operate ineffectively in densely cluttered environments. Second, most methods require a training or fine-tuning step to adapt to new domains, limiting their generalization in real-world applications. In this paper, we introduce GraspMAS, a new multi-agent system framework for language-driven grasp detection. GraspMAS is designed to reason through ambiguities and improve decision-making in real-world scenarios. Our framework consists of three specialized agents: Planner, responsible for strategizing complex queries; Coder, which generates and executes source code; and Observer, which evaluates the outcomes and provides feedback. Extensive experiments on two large-scale datasets demonstrate that GraspMAS significantly outperforms existing baselines. Additionally, robot experiments conducted in both simulation and real-world settings further validate the effectiveness of our approach. Our project page is available at https://zquang2202.github.io/GraspMAS.

Abstract:
Navigation in dynamic environments requires autonomous systems to reason about uncertainties in the behavior of other agents. In this paper, we introduce a unified framework that combines trajectory planning with multimodal predictions and active probing to enhance decision-making under uncertainty. We develop a novel risk metric that seamlessly integrates multimodal prediction uncertainties through mixture models. When these uncertainties follow a Gaussian mixture distribution, we prove that our risk metric admits a closed-form solution, and is always finite, thus ensuring analytical tractability. To reduce prediction ambiguity, we incorporate an active probing mechanism that strategically selects actions to improve its estimates of behavioral parameters of other agents, while simultaneously handling multimodal uncertainties. We extensively evaluate our framework in autonomous navigation scenarios using the MetaDrive simulation environment. Results demonstrate that our active probing approach successfully navigates complex traffic scenarios with uncertain predictions. Additionally, our framework shows robust performance across diverse traffic agent behavior models, indicating its broad applicability to real-world autonomous navigation challenges.
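
The closed-form claim in the abstract above can be illustrated with a small worked example. The sketch below is not the paper's risk metric; it assumes a hypothetical Gaussian-shaped proximity cost exp(-||x - p||^2 / (2 s^2)) and isotropic mixture components, for which the expectation under a Gaussian mixture has a closed form that is always finite. The function name and parameters are illustrative assumptions.

```python
import math


def expected_gaussian_cost(point, components, s):
    """Closed-form E[exp(-||x - point||^2 / (2 s^2))] for x drawn from a
    Gaussian mixture with isotropic components (illustrative assumption).

    components: list of (weight, mean, sigma), covariance = sigma^2 * I.
    """
    d = len(point)
    total = 0.0
    for weight, mean, sigma in components:
        # Convolving two Gaussians adds their variances.
        var = s * s + sigma * sigma
        sq_dist = sum((p - m) ** 2 for p, m in zip(point, mean))
        total += weight * (s * s / var) ** (d / 2) * math.exp(-sq_dist / (2 * var))
    return total


# Two predicted modes for another agent's position (hypothetical numbers).
modes = [(0.7, [2.0, 0.0], 0.5), (0.3, [0.0, 3.0], 1.0)]
risk_at_origin = expected_gaussian_cost([0.0, 0.0], modes, s=1.0)
```

Because every term is a weight times a factor in (0, 1], the result stays between 0 and the sum of the weights, which is the kind of boundedness the abstract refers to.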

Abstract:
A prior global topological map (e.g., OpenStreetMap, OSM) can boost the performance of autonomous mapping by a ground mobile robot. However, the prior map is usually incomplete because some paths are unlabeled. To solve this problem, this paper proposes an OSM maker using airborne sensors carried by low-altitude aircraft, where the core of the OSM maker is a novel efficient pathfinder approach based on LiDAR and camera data, i.e., a binary dual-stream road segmentation model. Specifically, multi-scale feature extraction based on the UNet architecture is implemented for images and point clouds. To reduce the effect of point cloud sparsity, an attention-guided gated block is designed to integrate image and point-cloud features. To optimize the model for edge deployment by significantly reducing its storage footprint and computational demands, we propose a binarization streamline for each model component, including a variant of the vision transformer (ViT) architecture as the encoder of the image branch, and new focal and perception losses to optimize the model training. The experimental results on two datasets demonstrate that our pathfinder method achieves SOTA accuracy with high efficiency in finding paths from the low-altitude airborne sensors, and we can create complete OSM prior maps based on the segmented road skeletons. Code and data are available at: https://github.com/IMRL/Pathfinder.

Abstract:
Category-level object pose estimation, which predicts the pose of objects within a known category without prior knowledge of individual instances, is essential in applications like warehouse automation and manufacturing. Existing methods relying on RGB images or point cloud data often struggle with object occlusion and generalization across different instances and categories. This paper proposes a multimodal-based keypoint learning framework (MK-Pose) that integrates RGB images, point clouds, and category-level textual descriptions. The model uses a self-supervised keypoint detection module enhanced with attention-based query generation, soft heatmap matching and graph-based relational modeling. Additionally, a graph-enhanced feature fusion module is designed to integrate local geometric information and global context. MK-Pose is evaluated on the CAMERA25 and REAL275 datasets, and is further tested for cross-dataset capability on the HouseCat6D dataset. The results demonstrate that MK-Pose outperforms existing state-of-the-art methods in both IoU and average precision without shape priors. Code will be released at https://github.com/yangyifanYYF/MK-Pose.

Abstract:
Warehouse robotic systems equipped with vacuum grippers must reliably grasp a diverse range of objects from densely packed shelves. However, these environments present significant challenges, including occlusions, diverse object orientations, stacked and obstructed items, and surfaces that are difficult to suction. We introduce TetraGrip, a novel vacuum-based grasping strategy featuring four suction cups mounted on linear actuators. Each actuator is equipped with an optical time-of-flight (ToF) proximity sensor, enabling reactive grasping. We evaluate TetraGrip in a warehouse-style setting, demonstrating its ability to manipulate objects in stacked and obstructed configurations. Our results show that our RL-based policy improves picking success in stacked-object scenarios by 22.9% compared to a single-suction gripper. Additionally, we demonstrate that TetraGrip can successfully grasp objects in scenarios where a single-suction gripper fails due to physical limitations, specifically in two cases: (1) picking an object occluded by another object and (2) retrieving an object in a complex scenario. These findings highlight the advantages of multi-actuated, suction-based grasping in unstructured warehouse environments. The project website is available at: https://tetragrip.github.io/.

Abstract:
In real-world scenarios, environment changes caused by human or agent activities make it extremely challenging for robots to perform various long-term tasks. Recent works typically struggle to effectively understand and adapt to dynamic environments due to the inability to update their environment representations in memory in response to environment changes and lack of fine-grained reconstruction of the environments. To address these challenges, we propose DynamicGSG, a dynamic, high-fidelity, open-vocabulary scene graph construction system leveraging Gaussian Splatting. DynamicGSG builds hierarchical scene graphs using advanced vision language models to represent the spatial hierarchy and semantic relationships between objects in the environments, utilizes a joint feature loss to supervise Gaussian instance grouping while optimizing the Gaussian maps, and locally updates the Gaussian scene graphs according to real environment changes for long-term environment adaptation. Experiments and ablation studies demonstrate the performance and efficacy of our proposed method in terms of semantic segmentation, language-guided object retrieval, and reconstruction quality. In addition, we validate the dynamic updating capabilities of our system within real-world laboratory settings. The source code and supplementary materials will be available at: https://github.com/GeLuzhou/Dynamic-Gsg.

Abstract:
Depth map enhancement using paired high-resolution RGB images offers a cost-effective solution for improving low-resolution depth data from lightweight ToF sensors. Nevertheless, naively adopting a depth estimation pipeline to fuse the two modalities requires ground-truth depth maps for supervision. To address this, we propose a self-supervised learning framework, SelfToF, which generates detailed and scale-aware depth maps. Starting from an image-based self-supervised depth estimation pipeline, we add low-resolution depth as inputs, design a new depth consistency loss, propose a scale-recovery module, and finally obtain a large performance boost. Furthermore, since the ToF signal sparsity varies in real-world applications, we upgrade SelfToF with submanifold convolution and guided feature fusion. Consequently, SelfToF maintains robust performance across varying sparsity levels in ToF data. Overall, our proposed method is both efficient and effective, as verified by extensive experiments on the NYU and ScanNet datasets. The code is available at https://github.com/denyingmxd/selftof.

Abstract:
Accurate pain expression synthesis is essential for improving clinical training and human-robot interaction. Current Robotic Patient Simulators (RPSs) lack realistic pain facial expressions, limiting their effectiveness in medical training. In this work, we introduce PainDiffusion, a generative model that synthesizes naturalistic facial pain expressions. Unlike traditional heuristic or autoregressive methods, PainDiffusion operates in a continuous latent space, ensuring smoother and more natural facial motion while supporting indefinite-length generation via diffusion forcing. Our approach incorporates intrinsic characteristics such as pain expressiveness and emotion, allowing for personalized and controllable pain expression synthesis. We train and evaluate our model using the BioVid HeatPain Database. Additionally, we integrate PainDiffusion into a robotic system to assess its applicability in real-time rehabilitation exercises. Qualitative studies with clinicians reveal that PainDiffusion produces realistic pain expressions, with a 31.2% ± 4.8% preference rate against ground-truth recordings. Our results suggest that PainDiffusion can serve as a viable alternative to real patients in clinical training and simulation, bridging the gap between synthetic and naturalistic pain expression. Code and videos are available at: https://damtien444.github.io/paindf/.

Abstract:
Performing robotic grasping from a cluttered bin based on human instructions is a challenging task, as it requires understanding both the nuances of free-form language and the spatial relationships between objects. Vision-Language Models (VLMs) trained on web-scale data, such as GPT-4o, have demonstrated remarkable reasoning capabilities across both text and images. But can they truly be used for this task in a zero-shot setting? And what are their limitations? In this paper, we explore these research questions via the free-form language-based robotic grasping task and propose a novel method, FreeGrasp, leveraging the pre-trained VLMs’ world knowledge to reason about human instructions and object spatial arrangements. Our method detects all objects as keypoints and uses these keypoints to annotate marks on images, aiming to facilitate GPT-4o’s zero-shot spatial reasoning. This allows our method to determine whether a requested object is directly graspable or if other objects must be grasped and removed first. Since no existing dataset is specifically designed for this task, we introduce a synthetic dataset FreeGraspData by extending the MetaGraspNetV2 dataset with human-annotated instructions and ground-truth grasping sequences. We conduct extensive analyses with both FreeGraspData and real-world validation with a gripper-equipped robotic arm, demonstrating state-of-the-art performance in grasp reasoning and execution. Project website: https://tev-fbk.github.io/FreeGrasp/.

Abstract:
Diffusion Policies have demonstrated impressive performance in robotic manipulation tasks. However, their long inference time, resulting from an extensive iterative denoising process, and the need to execute an action chunk before the next prediction to maintain consistent actions limit their applicability to latency-critical tasks or simple tasks with a short cycle time. While recent methods have explored distillation or alternative policy structures to accelerate inference, these often demand additional training, which can be resource-intensive for large robotic models. In this paper, we introduce a novel approach inspired by the Real-Time Iteration (RTI) Scheme, a method from optimal control that accelerates optimization by leveraging solutions from previous time steps as initial guesses for subsequent iterations. We explore the application of this scheme in diffusion inference and propose a scaling-based method to effectively handle discrete actions, such as grasping, in robotic manipulation. The proposed scheme significantly reduces runtime computational costs without the need for distillation or policy redesign. This enables seamless integration into many pre-trained diffusion-based models, in particular resource-demanding large models. We also provide theoretical conditions for contractivity, which could be useful for estimating the initial denoising step. Quantitative results from extensive simulation experiments show a substantial reduction in inference time, with overall performance comparable to Diffusion Policy using full-step denoising. Our project page with additional resources is available at: https://rti-dp.github.io/
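
The warm-starting idea behind the RTI scheme described above can be shown with a toy example. This is a sketch under strong assumptions, not the paper's method: the "denoiser" is a hypothetical scalar contraction toward a slowly drifting target, and all names and constants are illustrative. It only shows why reusing the previous solution as the initial guess lets a fixed, small iteration budget reach a much lower error than restarting from a fixed far-away initial value.

```python
def denoise(x, target, steps, rate=0.5):
    """Toy stand-in for iterative denoising: contract x toward target."""
    for _ in range(steps):
        x = x + rate * (target - x)
    return x


def run(targets, steps_per_call, warm_start):
    """Solve once per time step and return the absolute error at each step."""
    errors = []
    prev = None
    for target in targets:
        # Cold start: always begin from a fixed, far-away initial value.
        # Warm start: begin from the previous time step's solution.
        init = prev if (warm_start and prev is not None) else 10.0
        prev = denoise(init, target, steps_per_call)
        errors.append(abs(prev - target))
    return errors


# A slowly drifting target, as when consecutive action chunks are similar.
targets = [1.0, 1.1, 1.2, 1.3]
cold = run(targets, steps_per_call=2, warm_start=False)
warm = run(targets, steps_per_call=2, warm_start=True)
```

With the same two iterations per call, the warm-started error shrinks over successive time steps, while the cold-started error stays large because each call begins from scratch.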

Abstract:
Token merging has emerged as an effective strategy to accelerate Vision Transformers (ViT) by reducing computational costs. However, existing methods primarily rely on the visual token’s feature similarity for token merging, overlooking the potential of integrating spatial information, which can serve as a reliable criterion for token merging in the early layers of ViT, where the visual tokens only possess weak visual information. In this paper, we propose ToSA, a novel token merging method that combines both semantic and spatial awareness to guide the token merging process. ToSA leverages the depth image as input to generate pseudo spatial tokens, which serve as auxiliary spatial information for the visual token merging process. With the introduced spatial awareness, ToSA achieves a more informed merging strategy that better preserves critical scene structure. Experimental results demonstrate that ToSA outperforms previous token merging methods across multiple benchmarks on visual and embodied question answering while largely reducing the runtime of the ViT, making it an efficient solution for ViT acceleration. The code will be available at: https://github.com/hsiangwei0903/ToSA.
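
The effect of adding a spatial criterion to token merging can be sketched as follows. This is a minimal illustration, not the ToSA implementation: it assumes each token has a hypothetical 3D position (for example, back-projected from depth) and scores pairs with a weighted sum of feature cosine similarity and a distance-based affinity. The weight `alpha`, the affinity function, and the averaging merge are all assumptions.

```python
import math


def cosine(a, b):
    """Cosine similarity between two feature vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def spatial_affinity(p, q, scale=1.0):
    """Affinity in (0, 1] that decays with Euclidean distance."""
    return math.exp(-math.dist(p, q) / scale)


def merge_most_similar(features, positions, alpha=0.5):
    """Merge the single best-scoring token pair by averaging.

    alpha = 1.0 uses feature similarity only; lower values add spatial cues.
    Returns (new_features, new_positions, merged_pair).
    """
    n = len(features)
    best_score, pair = -float("inf"), None
    for i in range(n):
        for j in range(i + 1, n):
            score = alpha * cosine(features[i], features[j]) + (
                1 - alpha
            ) * spatial_affinity(positions[i], positions[j])
            if score > best_score:
                best_score, pair = score, (i, j)
    i, j = pair
    merged_f = [(a + b) / 2 for a, b in zip(features[i], features[j])]
    merged_p = [(a + b) / 2 for a, b in zip(positions[i], positions[j])]
    keep = [k for k in range(n) if k not in (i, j)]
    new_features = [features[k] for k in keep] + [merged_f]
    new_positions = [positions[k] for k in keep] + [merged_p]
    return new_features, new_positions, pair


# Tokens 0 and 2 look identical but are far apart; tokens 0 and 1 are neighbors.
features = [[1.0, 0.0], [0.9, 0.1], [1.0, 0.0]]
positions = [[0.0, 0.0, 1.0], [0.1, 0.0, 1.0], [0.0, 0.0, 5.0]]
_, _, semantic_pair = merge_most_similar(features, positions, alpha=1.0)
_, _, combined_pair = merge_most_similar(features, positions, alpha=0.5)
```

In this example the feature-only criterion merges the two distant look-alike tokens, while the combined criterion merges the two neighboring tokens.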

Abstract:
The rapid iteration of autonomous vehicle (AV) deployments leads to increasing needs for building realistic and scalable multi-agent traffic simulators for efficient evaluation. Recent advances in this area focus on closed-loop simulators that enable generating diverse and interactive scenarios. This paper introduces Neural Interactive Agents (NIVA), a probabilistic framework for multi-agent simulation driven by a hierarchical Bayesian model that enables closed-loop, observation-conditioned simulation through autoregressive sampling from a latent, finite mixture of Gaussian distributions. We demonstrate how NIVA unifies preexisting sequence-to-sequence trajectory prediction models and emerging closed-loop simulation models trained on Next-token Prediction (NTP) from a Bayesian inference perspective. Experiments on the Waymo Open Motion Dataset demonstrate that NIVA attains competitive performance compared to existing methods while providing enhanced control over intentions and driving styles.

Abstract:
We study policy distillation under privileged information, where a student policy with only partial observations must learn from a teacher with full-state access. A key challenge is information asymmetry: the student cannot directly access the teacher’s state space, leading to distributional shifts and policy degradation. Existing approaches either modify the teacher to produce realizable but sub-optimal demonstrations or rely on the student to explore missing information independently, both of which are inefficient. Our key insight is that the student should strategically interact with the teacher, querying only when necessary and resetting from recovery states, to stay on a recoverable path within its own observation space. We introduce two methods: (i) an imitation learning approach that adaptively determines when the student should query the teacher for corrections, and (ii) a reinforcement learning approach that selects where to initialize training for efficient exploration. We validate our methods in both simulated and real-world robotic tasks, demonstrating significant improvements over standard teacher-student baselines in training efficiency and final performance. The project website is available here.

Abstract:
Object rearrangement, which involves arranging objects step-by-step to achieve tidy states, is critical in robotic applications. Progress in this area is often constrained by issues such as high-cost data collection and physically infeasible trajectory prediction. To address these challenges, we propose the Data-Bootstrapped, Physics-Informed Rearrangement (DPR) framework, which leverages a transformer for sequential decision making. Specifically, DPR integrates Enhanced Data Generation with a Physics Reward Feedback Transformer. Enhanced Data Generation consists of Random Trajectory Reverse for producing high-quality training data and Bootstrapped Trajectory Synthesis, which leverages the transformer’s sequence modeling to diversify training trajectories. To ensure the feasibility of the generated trajectories and to improve the transformer’s performance, we incorporate a Physical Reward Feedback mechanism into the transformer. Experiments on ball and room rearrangement tasks show that DPR significantly outperforms existing methods in terms of both efficiency and effectiveness. Code will be released soon.

Abstract:
Localizing a person from a moving monocular camera is critical for Human-Robot Interaction (HRI). To estimate the 3D human position from a 2D image, existing methods either depend on the geometric assumption of a fixed camera or use a position regression model trained on datasets containing little camera ego-motion. These methods are vulnerable to severe camera ego-motion, resulting in inaccurate person localization. We consider person localization as a part of a pose estimation problem. By representing a human with a four-point model, our method jointly estimates the 2D camera attitude and the person’s 3D location through optimization. Evaluations on both public datasets and real robot experiments demonstrate that our method outperforms baselines in person localization accuracy. Our method is further integrated into a person-following system and deployed on an agile quadruped robot.

Abstract:
This study presents Flower Pose Estimation (FloPE), a real-time flower pose estimation framework for computationally constrained robotic pollination systems. Robotic pollination has been proposed to supplement natural pollination to ensure global food security, given the declining population of natural pollinators. However, flower pose estimation for pollination is challenging because of natural variability, flower clusters, and the high accuracy required by the flowers’ fragility during pollination. This method leverages 3D Gaussian Splatting to generate photorealistic synthetic datasets with precise pose annotations, enabling effective knowledge distillation from a high-capacity teacher model to a lightweight student model for efficient inference. The approach was evaluated on both single and multi-arm robotic platforms, achieving a mean pose estimation error of 0.6 cm and 19.14 degrees at a low computational cost. Our experiments validate the effectiveness of FloPE, achieving up to 78.75% pollination success rate and outperforming prior robotic pollination techniques.

Abstract:
Self-improvement requires robotic systems to initially learn from human-provided data and then gradually enhance their capabilities through interaction with the environment. This is similar to how humans improve their skills through continuous practice. However, achieving effective self-improvement is challenging, primarily because robots tend to repeat their existing abilities during interactions, often failing to generate new, valuable data for learning. In this paper, we identify the key to successful self-improvement: modal-level exploration and data selection. By incorporating a modal-level exploration mechanism during policy execution, the robot can produce more diverse and multi-modal interactions. At the same time, we select the most valuable trials and high-quality segments from these interactions for learning. We successfully demonstrate effective robot self-improvement on both simulation benchmarks and real-world experiments. The capability for self-improvement will enable us to develop more robust and high-success-rate robotic control strategies at a lower cost. Our code and experiment scripts are available at ericjin2002.github.io/SIME.

Abstract:
Sequentially grasping multiple objects with multi-fingered hands is common in daily life, where humans can fully leverage the dexterity of their hands to enclose multiple objects. However, the diversity of object geometries and the complex contact interactions required for high-DOF hands to grasp one object while enclosing another make sequential multi-object grasping challenging for robots. In this paper, we propose SeqMultiGrasp, a system for sequentially grasping objects with a four-fingered Allegro Hand. We focus on sequentially grasping two objects, ensuring that the hand fully encloses one object before lifting it and then grasps the second object without dropping the first. Our system first synthesizes single-object grasp candidates, where each grasp is constrained to use only a subset of the hand’s links. These grasps are then validated in a physics simulator to ensure stability and feasibility. Next, we merge the validated single-object grasp poses to construct multi-object grasp configurations. For real-world deployment, we train a diffusion model conditioned on point clouds to propose grasp poses, followed by a heuristic-based execution strategy. We test our system using 8 × 8 object combinations in simulation and 6 × 3 object combinations in the real world. Our diffusion-based grasp model obtains an average success rate of 65.8% over 1,600 simulation trials and 56.7% over 90 real-world trials, suggesting that it is a promising approach for sequential multi-object grasping with multi-fingered hands. Supplementary material is available on our project website: https://hesic73.github.io/SeqMultiGrasp.

Abstract:
Many robotic systems require extended deployments in complex, dynamic environments. In such deployments, parts of the environment may change between subsequent robot observations. Most robotic mapping or environment modeling algorithms are incapable of representing dynamic features in a way that enables predicting their future state. Instead, they opt to filter certain state observations, either by removing them or by some form of weighted averaging. This paper introduces Perpetua, a method for modeling the dynamics of semi-static features. Perpetua is able to incorporate prior knowledge about the dynamics of the feature if it exists, track multiple hypotheses, and adapt over time to enable prediction of future feature states. Specifically, we chain together mixtures of "persistence" and "emergence" filters to model the probability that features will disappear or reappear in a formal Bayesian framework. The approach is an efficient, scalable, general, and robust method for estimating the states of features in an environment, both in the present as well as at arbitrary future times. Through experiments on simulated and real-world data, we find that Perpetua yields better accuracy than similar approaches while also being online adaptable and robust to missing observations.
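
The idea of tracking whether a semi-static feature disappears or reappears can be sketched as a Bayesian filter. The code below is a heavily simplified illustration, not Perpetua itself: it assumes a discrete-time two-state model (present or absent) with hypothetical per-step vanish and appear probabilities and a noisy binary detector, rather than the chained mixtures of persistence and emergence filters described in the abstract. All parameter names and values are assumptions.

```python
def predict_presence(belief, steps, p_vanish=0.05, p_appear=0.02):
    """Propagate P(feature present) forward in time with no observations."""
    for _ in range(steps):
        belief = belief * (1 - p_vanish) + (1 - belief) * p_appear
    return belief


def update_presence(belief, detected, p_vanish=0.05, p_appear=0.02,
                    p_miss=0.1, p_false=0.05):
    """One predict-then-update step of the two-state Bayes filter.

    detected: True if the detector reported the feature at this step.
    p_miss:   P(no detection | feature present).
    p_false:  P(detection | feature absent).
    """
    prior = predict_presence(belief, 1, p_vanish, p_appear)
    like_present = (1 - p_miss) if detected else p_miss
    like_absent = p_false if detected else (1 - p_false)
    numerator = like_present * prior
    return numerator / (numerator + like_absent * (1 - prior))


# One detection followed by two missed detections (hypothetical sequence).
belief = 0.5
after_hit = update_presence(belief, True)
after_misses = update_presence(update_presence(after_hit, False), False)

# Long-horizon forecast without observations approaches the stationary value.
forecast = predict_presence(0.9, 1000)
```

A single detection raises the belief sharply, two consecutive misses pull it back down, and the unobserved forecast settles at p_appear / (p_appear + p_vanish), which shows how such a model can answer queries about arbitrary future times.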

Abstract:
When planning motions in a configuration space that has underlying symmetries (e.g. when manipulating one or multiple symmetric objects), the ideal planning algorithm should take advantage of those symmetries to produce shorter trajectories. However, finite symmetries lead to complicated changes to the underlying topology of configuration space, preventing the use of standard algorithms. We demonstrate how the key primitives used for sampling-based planning can be efficiently implemented in spaces with finite symmetries. A rigorous theoretical analysis, building upon a study of the geometry of the configuration space, shows improvements in the sample complexity of several standard algorithms. Furthermore, a comprehensive slate of experiments demonstrates the practical improvements in both path length and runtime.
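
Two of the planning primitives mentioned above, distance and interpolation, can be illustrated for the simplest finite symmetry. This is a sketch under stated assumptions, not the paper's implementation: it considers a single planar orientation angle of an object with hypothetical n-fold rotational symmetry, so that angles differing by 2π/n describe the same physical configuration. Function names are illustrative.

```python
import math


def symmetric_angle_diff(a, b, n_fold):
    """Shortest signed rotation from a to b, treating orientations that
    differ by a multiple of 2*pi/n_fold as the same configuration."""
    period = 2 * math.pi / n_fold
    diff = (b - a) % period  # lies in [0, period)
    if diff > period / 2:
        diff -= period  # rotating the other way is shorter
    return diff


def distance(a, b, n_fold):
    """Symmetry-aware distance between two orientations."""
    return abs(symmetric_angle_diff(a, b, n_fold))


def interpolate(a, b, t, n_fold):
    """Point at fraction t along the shortest symmetry-aware path from a."""
    return a + t * symmetric_angle_diff(a, b, n_fold)


# A square object (4-fold symmetry) rotated by 1.5 rad is almost back at a
# configuration equivalent to the start, so the goal is very close.
plain = distance(0.0, 1.5, n_fold=1)
square = distance(0.0, 1.5, n_fold=4)
midpoint = interpolate(0.0, 1.5, 0.5, n_fold=4)
```

Ignoring the symmetry gives a distance of 1.5 rad, while the symmetry-aware distance is about 0.07 rad, which is the source of the shorter trajectories the abstract describes.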

Abstract:
Understanding lane topology relationships accurately is critical for safe autonomous driving. However, existing two-stage methods suffer from inefficiencies due to error propagation and increased computational overhead. To address these challenges, we propose a one-stage architecture that simultaneously predicts traffic elements, lane centerlines and topology relationships, improving both the accuracy and inference speed of lane topology understanding for autonomous driving. Our key innovation lies in reusing intermediate attention resources within distinct transformer decoders. This approach effectively leverages the inherent relational knowledge within the element detection module to enable the modeling of topology relationships among traffic elements and lanes without requiring additional computationally expensive graph networks. Furthermore, we are the first to demonstrate that knowledge can be distilled from models that utilize standard definition (SD) maps to those that operate without SD maps, enabling superior performance even in the absence of SD maps. Extensive experiments on the OpenLane-V2 dataset show that our approach outperforms baseline methods in both accuracy and efficiency, achieving superior results in lane detection, traffic element identification, and topology reasoning. Our code is available at https://github.com/Yang-Li-2000/one-stage.git.

Abstract:
Point cloud prediction (PCP) aims to forecast future 3D point clouds of scenes by leveraging sequential historical LiDAR scans, offering a promising avenue to enhance the perceptual capabilities of autonomous systems. However, existing methods mostly adopt an end-to-end approach without explicitly modeling moving instances, limiting their effectiveness in dynamic real-world environments. In this paper, we propose IMPNet, a novel instance motion-aware network for future point cloud scene prediction. Unlike prior works, IMPNet explicitly incorporates motion and instance-level information to enhance PCP accuracy. Specifically, we extract appearance and motion features from range images and residual images using a dual-branch convolutional network and fuse them via a motion attention block. Our framework further integrates a motion head for identifying moving objects and an instance-assisted training strategy to improve instance-wise point cloud predictions. Extensive experiments on multiple datasets demonstrate that our proposed network achieves state-of-the-art (SOTA) performance in PCP with superior predictive accuracy and robust generalization across diverse driving scenarios. Our method has been released at https://github.com/nubot-nudt/IMPNet.

Abstract:
Diffusion models demonstrate superior performance in capturing complex distributions from large-scale datasets, providing a promising solution for quadrupedal locomotion control. However, the robustness of the diffusion planner is inherently dependent on the diversity of the pre-collected datasets. To mitigate this issue, we propose a two-stage learning framework to enhance the capability of the diffusion planner under a limited, reward-agnostic dataset. In the offline stage, the diffusion planner learns the joint distribution of state-action sequences from expert datasets without using reward labels. Subsequently, we perform online interaction in the simulation environment based on the trained offline planner, which significantly diversifies the original behavior and thus improves robustness. Specifically, we propose a novel weak preference labeling method that requires neither ground-truth rewards nor human preferences. The proposed method exhibits superior stability and velocity tracking accuracy in pacing, trotting, and bounding gaits at different speeds and can perform zero-shot transfer to real Unitree Go1 robots. The project website for this paper is at https://shangjaven.github.io/preference-aligned-diffusion-legged/.

Abstract:
Fisheye cameras, renowned for their panoramic field of view (FOV) of 360°, are crucial for surround-view perception in autonomous driving. However, research on object perception in fisheye images lags behind that of standard images. To address this gap, we propose a feature-aligned fisheye object detection network specifically tailored for autonomous driving. Current fisheye perception algorithms often overlook the misalignment issues that typically arise in object detectors. To tackle these challenges in the feature pyramid network (FPN), we introduce a feature-aligned pyramid module (FaPM), which learns pixel transformation offsets to contextually align feature maps. Additionally, we present a location-aligned detection head (LaDH) to align the spatial distribution of classification and regression localization. Integrating these modules into a detection framework results in a novel feature-aligned fisheye object detector. Our method undergoes extensive evaluation on the WoodScape dataset, achieving a mean average precision (mAP) of 32.2%, surpassing the performance of existing methods.

Abstract:
Real-time high-accuracy optical flow estimation is critical for a variety of real-world robotic applications. However, current learning-based methods often struggle to balance accuracy and computational efficiency: methods that achieve high accuracy typically demand substantial processing power, while faster approaches tend to sacrifice precision. These fast approaches specifically falter in their generalization capabilities and do not perform well across diverse real-world scenarios. In this work, we revisit the limitations of the SOTA methods and present NeuFlow-V2, a novel method that offers both high accuracy on real-world datasets and low computational overhead. In particular, we introduce a novel light-weight backbone and a fast refinement module to keep computational demands tractable while delivering accurate optical flow. Experimental results on synthetic and real-world datasets demonstrate that NeuFlow-V2 provides similar accuracy to SOTA methods while achieving 10x-70x speedups. It is capable of running at over 20 FPS on 512x384 resolution images on a Jetson Orin Nano. The full training and evaluation code is available at https://github.com/neufieldrobotics/NeuFlow_v2.

Abstract:
Imitation learning has emerged as a powerful paradigm in robot manipulation, yet its generalization capability remains constrained by object-specific dependencies in limited expert demonstrations. To address this challenge, we propose knowledge-driven imitation learning, a framework that leverages external structural semantic knowledge to abstract object representations within the same category. We introduce a novel semantic keypoint graph as a knowledge template and develop a coarse-to-fine template-matching algorithm that optimizes both structural consistency and semantic similarity. Evaluated on three real-world robotic manipulation tasks, our method achieves superior performance, surpassing image-based diffusion policies with only one-quarter of the expert demonstrations. Extensive experiments further demonstrate its robustness across novel objects, backgrounds, and lighting conditions. This work pioneers a knowledge-driven approach to data-efficient robotic learning in real-world settings. Code and more materials are available on knowledge-driven.github.io.

Abstract:
Recent advances in autonomous driving are moving towards mapless approaches, where High-Definition (HD) maps are generated online directly from sensor data, reducing the need for expensive labeling and maintenance. However, the reliability of these online-generated maps remains uncertain. While incorporating map uncertainty into downstream trajectory prediction tasks has shown potential for performance improvements, current strategies provide limited insights into the specific scenarios where this uncertainty is beneficial. In this work, we first analyze the driving scenarios in which mapping uncertainty has the greatest positive impact on trajectory prediction and identify a critical, previously overlooked factor: the agent’s kinematic state. Building on these insights, we propose a novel Proprioceptive Scenario Gating that adaptively integrates map uncertainty into trajectory prediction based on forecasts of the ego vehicle’s future kinematics. This lightweight, self-supervised approach enhances the synergy between online mapping and trajectory prediction, providing interpretability around where uncertainty is advantageous and outperforming previous integration methods. Additionally, we introduce a Covariance-based Map Uncertainty approach that better aligns with map geometry, further improving trajectory prediction. Extensive ablation studies confirm the effectiveness of our approach, achieving up to 23.6% improvement in mapless trajectory prediction performance over the state-of-the-art method using the real-world nuScenes driving dataset. Our code, data, and models are publicly available at https://github.com/Ethan-Zheng136/Map-Uncertainty-for-Trajectory-Prediction.

Abstract:
Cooperative perception systems play a vital role in enhancing the safety and efficiency of vehicular autonomy. Although recent studies have highlighted the efficacy of vehicle-to-everything (V2X) communication techniques in autonomous driving, a significant challenge persists: how to efficiently integrate multiple high-bandwidth features across an expanding network of connected agents such as vehicles and infrastructure. In this paper, we introduce CoMamba, a novel cooperative 3D detection framework designed to leverage state-space models for real-time onboard vehicle perception. Compared to prior state-of-the-art transformer-based models, CoMamba is a more scalable 3D model built on bidirectional state-space models, which bypass the quadratic complexity of attention mechanisms. Through extensive experimentation on V2X/V2V datasets, CoMamba achieves superior performance compared to existing methods while maintaining real-time processing capabilities. The proposed framework not only enhances object detection accuracy but also significantly reduces processing time, making it a promising solution for next-generation cooperative perception systems in intelligent transportation networks.
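
To illustrate why a state-space scan scales linearly with sequence length, the following is a minimal sketch of a bidirectional linear recurrence. It assumes a scalar, time-invariant model with hypothetical parameters `a`, `b`, and `c`, and it sums the two directions; none of these choices are taken from CoMamba, whose actual architecture is not described in the abstract.

```python
def ssm_scan(xs, a, b, c):
    """Run the recurrence h_t = a * h_{t-1} + b * x_t and emit y_t = c * h_t.

    One pass over the sequence, so the cost grows linearly with len(xs),
    whereas full self-attention compares every pair of positions.
    """
    h = 0.0
    ys = []
    for x in xs:
        h = a * h + b * x
        ys.append(c * h)
    return ys


def bidirectional_scan(xs, a, b, c):
    """Combine a forward scan and a backward scan over the same sequence.

    Summing the two directions is an illustrative assumption; other
    fusion rules (concatenation, gating) are equally plausible.
    """
    forward = ssm_scan(xs, a, b, c)
    backward = ssm_scan(xs[::-1], a, b, c)[::-1]
    return [f + g for f, g in zip(forward, backward)]


if __name__ == "__main__":
    # An impulse at the first position decays forward through the sequence;
    # the backward scan only sees it at the last step of its own pass.
    print(bidirectional_scan([1.0, 0.0, 0.0], a=0.5, b=1.0, c=1.0))
```

Each output position receives context from both earlier and later positions, while the work per position stays constant.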

Abstract:
Teleoperation is a crucial tool for collecting human demonstrations, but controlling robots with bimanual dexterous hands remains a challenge. Existing teleoperation systems struggle to handle the complexity of coordinating two hands for intricate manipulations. We introduce Bunny-VisionPro, a real-time bimanual dexterous teleoperation system that leverages a VR headset. Unlike previous vision-based teleoperation systems, we design novel low-cost devices to provide haptic feedback to the operator, enhancing immersion. Our system prioritizes safety by incorporating collision and singularity avoidance while maintaining real-time performance through innovative designs. Bunny-VisionPro outperforms prior systems on a standard task suite, achieving higher success rates and reduced task completion times. Moreover, the high-quality teleoperation demonstrations improve downstream imitation learning performance, leading to better generalizability. Notably, Bunny-VisionPro enables imitation learning with challenging multi-stage, long-horizon dexterous manipulation tasks, which have rarely been addressed in previous work. Our system’s ability to handle bimanual manipulations while prioritizing safety and real-time performance makes it a powerful tool for advancing dexterous manipulation and imitation learning. Our web page is available at https://dingry.github.io/projects/bunny_visionpro.

Abstract:
Interactive navigation is crucial in scenarios where proactively interacting with objects can yield shorter paths, thus significantly improving traversal efficiency. Existing methods primarily focus on using the robot body to relocate obstacles during navigation. However, they prove ineffective in narrow or constrained spaces where the robot’s dimensions restrict its manipulation capabilities. This paper introduces a novel interactive navigation framework for legged manipulators, featuring an active arm-pushing mechanism that enables the robot to reposition movable obstacles in space-constrained environments. To this end, we develop a reinforcement learning-based arm-pushing controller with a two-stage reward strategy for object manipulation. Specifically, this strategy first directs the manipulator to a designated pushing zone to achieve a kinematically feasible contact configuration. Then, the end effector is guided to maintain its position at appropriate contact points for stable object displacement while preventing toppling. The simulations validate the robustness of the arm-pushing controller, showing that the two-stage reward strategy improves policy convergence and long-term performance. Real-world experiments further demonstrate the effectiveness of the proposed navigation framework, which achieves shorter paths and reduced traversal time. The open-source project can be found at https://zhihaibi.github.io/interactive-push.github.io/.

Abstract:
Vision-language-action (VLA) models present a promising paradigm by training policies directly on real robot datasets like Open X-Embodiment. However, the high cost of real-world data collection hinders further data scaling, thereby restricting the generalizability of VLAs. In this paper, we introduce ReBot, a novel real-to-sim-to-real approach for scaling real robot datasets and adapting VLA models to target domains, which is the last-mile deployment challenge in robot manipulation. Specifically, ReBot replays real-world robot trajectories in simulation to diversify manipulated objects (real-to-sim), and integrates the simulated movements with inpainted real-world background to synthesize physically realistic and temporally consistent robot videos (sim-to-real). Our approach has several advantages: 1) it enjoys the benefit of real data to minimize the sim-to-real gap; 2) it leverages the scalability of simulation; and 3) it can generalize a pretrained VLA to a target domain with fully automated data pipelines. Extensive experiments in both simulation and real-world environments show that ReBot significantly enhances the performance and robustness of VLAs. For example, in SimplerEnv with the WidowX robot, ReBot improved the in-domain performance of Octo by 7.2% and OpenVLA by 21.8%, and out-of-domain generalization by 19.9% and 9.4%, respectively. For real-world evaluation with a Franka robot, ReBot increased the success rates of Octo by 17% and OpenVLA by 20%. More information can be found at our project page.

Abstract:
Quadrupeds have advanced rapidly in their ability to traverse complex terrains. The adoption of deep Reinforcement Learning (RL), transformers and various knowledge transfer techniques can greatly reduce the sim-to-real gap. However, the classical teacher-student framework commonly used in existing locomotion policies requires a pre-trained teacher and leverages privileged information to guide the student policy. With the implementation of large-scale models in robotics controllers, especially transformer-based ones, this knowledge distillation technique starts to show its weakness in efficiency, due to the requirement of multiple supervised stages. In this paper, we propose Unified Locomotion Transformer (ULT), a new transformer-based framework to unify the processes of knowledge transfer and policy optimization in a single network while still taking advantage of privileged information. The policies are optimized with reinforcement learning, next state-action prediction, and action imitation, all in just one training stage, to achieve zero-shot deployment. Evaluation results demonstrate that with ULT, optimal teacher and student policies can be obtained at the same time, greatly easing the difficulty in knowledge transfer, even with complex transformer-based models.

Abstract:
We introduce Geometric Retargeting (GeoRT), an ultrafast and principled neural hand retargeting algorithm for teleoperation, developed as part of our recent Dexterity Gen (DexGen) system [1]. GeoRT converts human finger keypoints to robot hand keypoints at 1 kHz, achieving state-of-the-art speed and accuracy with significantly fewer hyperparameters. This high-speed capability enables flexible postprocessing, such as leveraging a foundational controller for action correction like DexGen. GeoRT is trained in an unsupervised manner, eliminating the need for manual annotation of hand pairs. The core of GeoRT lies in novel geometric objective functions that capture the essence of retargeting: preserving motion fidelity, ensuring configuration space (C-space) coverage, maintaining uniform response through high flatness, preserving pinch correspondence, and preventing self-collisions. This approach is free from intensive test-time optimization, offering a more scalable and practical solution for real-time hand retargeting.

Abstract:
Three-dimensional local descriptors are crucial for encoding geometric surface properties, making them essential for various point cloud understanding tasks. Among these descriptors, GeDi has demonstrated strong zero-shot 6D pose estimation capabilities but remains computationally impractical for real-world applications due to its expensive inference process. Can we retain GeDi’s effectiveness, while significantly improving its efficiency? In this paper, we explore this question by introducing a knowledge distillation framework that trains an efficient student model to regress local descriptors from a GeDi teacher. Our key contributions include: an efficient large-scale training procedure that ensures robustness to occlusions and partial observations while operating under compute and storage constraints, and a novel loss formulation that handles weak supervision from non-distinctive teacher descriptors. We validate our approach on five BOP Benchmark datasets and demonstrate a significant reduction in inference time while maintaining competitive performance with existing methods, bringing zero-shot 6D pose estimation closer to real-time feasibility. Project website: https://tev-fbk.github.io/dGeDi.

Abstract:
This paper addresses the challenges of Rhythmic Insertion Tasks (RIT), where a robot must repeatedly perform high-precision insertions, such as screwing a nut onto a bolt with a wrench. The inherent difficulty of RIT lies in achieving millimeter-level accuracy and maintaining consistent performance over multiple repetitions, particularly when factors like nut rotation and friction introduce additional complexity. We propose a sim-to-real framework that integrates a reinforcement learning-based insertion policy with a failure forecasting module. By representing the wrench’s pose in the nut’s coordinate frame rather than the robot’s frame, our approach significantly enhances sim-to-real transferability. The insertion policy, trained in simulation, leverages real-time 6D pose tracking to execute precise alignment, insertion, and rotation maneuvers. Simultaneously, a neural network predicts potential execution failures, triggering a simple recovery mechanism that lifts the wrench and retries the insertion. Extensive experiments in both simulated and real-world environments demonstrate that our method not only achieves a high one-time success rate but also robustly maintains performance over long-horizon repetitive tasks. For more information, please refer to the website: jaysparrow.github.io/rit.
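
The idea of expressing the wrench pose in the nut's coordinate frame can be sketched with plain rigid-body algebra: given both poses in a common world frame, the relative rotation is R_nut^T R_wrench and the relative translation is R_nut^T (t_wrench - t_nut). The sketch below is illustrative only; the helper names, the yaw-only rotations, and the 5 cm offset are assumptions, not details from the paper.

```python
import math


def rot_z(theta):
    """3x3 rotation matrix for a rotation of theta radians about the z axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0], [s, c, 0.0], [0.0, 0.0, 1.0]]


def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def matvec(m, v):
    return [sum(m[i][k] * v[k] for k in range(3)) for i in range(3)]


def pose_in_frame(R_ref, t_ref, R_obj, t_obj):
    """Express an object pose (given in the world frame) in a reference frame."""
    R_ref_T = transpose(R_ref)
    R_rel = matmul(R_ref_T, R_obj)
    t_rel = matvec(R_ref_T, [o - r for o, r in zip(t_obj, t_ref)])
    return R_rel, t_rel


def wrench_pose_in_nut_frame(nut_position, nut_yaw):
    """Hypothetical scene: the wrench sits 5 cm above the nut along the nut's
    z axis and is rotated 0.1 rad about it, wherever the nut is in the world."""
    R_nut, t_nut = rot_z(nut_yaw), nut_position
    R_wrench = matmul(R_nut, rot_z(0.1))
    offset_world = matvec(R_nut, [0.0, 0.0, 0.05])
    t_wrench = [t + d for t, d in zip(t_nut, offset_world)]
    return pose_in_frame(R_nut, t_nut, R_wrench, t_wrench)


if __name__ == "__main__":
    # Two different world placements of the nut give the same relative pose.
    print(wrench_pose_in_nut_frame([0.0, 0.0, 0.0], 0.0)[1])
    print(wrench_pose_in_nut_frame([0.4, -0.2, 0.1], 1.2)[1])
```

Because the relative pose does not change when the nut is moved or rotated in the world, a policy that consumes it sees the same input regardless of where the task is set up, which is one plausible reading of why this representation helps transfer.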

Abstract:
The significant advancements in embodied vision navigation have raised concerns about its susceptibility to adversarial attacks exploiting deep neural networks. Investigating the adversarial robustness of embodied vision navigation is crucial, especially given the threat of 3D physical attacks that could pose risks to human safety. However, existing attack methods for embodied vision navigation often lack physical feasibility due to challenges in transferring digital perturbations into the physical world. Moreover, current physical attacks for object detection struggle to achieve both multi-view effectiveness and visual naturalness in navigation scenarios. To address this, we propose a practical attack method for embodied navigation by attaching adversarial patches to objects, where both opacity and textures are learnable. Specifically, to ensure effectiveness across varying viewpoints, we employ a multi-view optimization strategy based on object-aware sampling, which optimizes the patch’s texture based on feedback from the vision-based perception model used in navigation. To make the patch inconspicuous to human observers, we introduce a two-stage opacity optimization mechanism, in which opacity is fine-tuned after texture optimization. Experimental results demonstrate that our adversarial patches decrease the navigation success rate by an average of 22.39%, outperforming previous methods in practicality, effectiveness, and naturalness. Code is available at: github.com/chen37058/Physical-Attacks-in-Embodied-Nav.

Abstract:
Recent works have shown great potential of Large Language Models (LLMs) in robot task and motion planning (TAMP). Current LLM approaches generate text- or code-based reasoning chains with sub-goals and action plans. However, they do not fully leverage LLMs’ symbolic computing and code generation capabilities. Many robot TAMP tasks involve complex optimization under multiple constraints, where pure textual reasoning is insufficient. While augmenting LLMs with predefined solvers and planners improves performance, it lacks generalization across tasks. Given LLMs’ growing coding proficiency, we enhance their TAMP capabilities by steering them to generate code as symbolic planners for optimization and constraint verification. Unlike prior work that uses code to interface with robot action modules or pre-designed planners, we steer LLMs to generate code as solvers, planners, and checkers for TAMP tasks requiring symbolic computing, while still leveraging textual reasoning to incorporate common sense. With a multi-round guidance and answer evolution framework, the proposed Code-as-Symbolic-Planner improves success rates by an average of 24.1% over the best baseline methods across seven typical TAMP tasks and three popular LLMs. Code-as-Symbolic-Planner shows strong effectiveness and generalizability across discrete and continuous environments, 2D/3D simulations and real-world settings, as well as single- and multi-robot tasks with diverse requirements. See our project website for prompts, videos, and code.

Abstract:
Eye-gaze stands out as an intuitive interface for hands-free control of robotic devices due to its brief training time, fast calibration, low invasiveness, and reduced complexity and cost. However, current approaches are limited by available screen space, excessive wait times, frequent context switching, inconsistent gaze tracker accuracy, and the trade-off between feature-richness and usability. This article presents Diegetic Graphical User Interfaces, a novel, intuitive, and computationally inexpensive approach for gaze-controlled interfaces applied to a robotic arm for precision pick-and-place tasks. By using customizable symbols paired with fiducial markers, interactive buttons are defined and embedded into the robot, which users can trigger via gaze. Twenty-one participants completed the Yale-CMU-Berkeley (YCB) Block Pick and Place Protocol, reporting good usability and user experience, while achieving comparable workload to similar systems. The resulting system is fast to learn, does not restrain the user’s head, and mitigates context switching, while demonstrating intuitive continuous Cartesian control of a robot arm in precision tasks.

Abstract:
Bimanual manipulation, fundamental to human daily activities, remains a challenging task due to its inherent complexity of coordinated control. Recent advances have enabled zero-shot learning of single-arm manipulation skills through agent-agnostic visual representations derived from human videos; however, these methods overlook crucial agent-specific information necessary for bimanual coordination, such as end-effector positions. We propose Ag2x2, a computational framework for bimanual manipulation through coordination-aware visual representations that jointly encode object states and hand motion patterns while maintaining agent-agnosticism. Extensive experiments demonstrate that Ag2x2 achieves a 73.5% success rate across 13 diverse bimanual tasks from Bi-DexHands and PerAct2, including challenging scenarios with deformable objects like ropes. This performance outperforms baseline methods and even surpasses the success rate of policies trained with expert-engineered rewards. Furthermore, we show that representations learned through Ag2x2 can be effectively leveraged for imitation learning, establishing a scalable pipeline for skill acquisition without expert supervision. By maintaining robust performance across diverse tasks without human demonstrations or engineered rewards, Ag2x2 represents a step toward scalable learning of complex bimanual robotic skills.

Abstract:
Affordance refers to the functional properties that an agent perceives and utilizes from its environment, and is key perceptual information required for robots to perform actions. This information is rich and multimodal in nature. Existing multimodal affordance methods face limitations in extracting useful information, mainly due to simple structural designs, basic fusion methods, and large model parameters, making it difficult to meet the performance requirements for practical deployment. To address these issues, this paper proposes the BiT-Align image-depth-text affordance mapping framework. The framework includes a Bypass Prompt Module (BPM) and a Text Feature Guidance (TFG) attention selection mechanism. BPM integrates the auxiliary modality depth image directly as a prompt to the primary modality RGB image, embedding it into the primary modality encoder without introducing additional encoders. This reduces the model’s parameter count and effectively improves functional region localization accuracy. The TFG mechanism guides the selection and enhancement of attention heads in the image encoder using textual features, improving the understanding of affordance characteristics. Experimental results demonstrate that the proposed method achieves significant performance improvements on the public AGD20K and HICO-IIF datasets. On the AGD20K dataset, compared with the current state-of-the-art method, we achieve a 6.0% improvement in the KLD metric, while reducing model parameters by 88.8%, demonstrating its value for practical applications. The source code will be made publicly available at https://github.com/DAWDSE/BiT-Align.

Abstract:
Open Semantic Mapping (OSM) is a key technology in robotic perception, combining semantic segmentation and SLAM techniques. This paper introduces a dynamically configurable and highly automated LLM/LVLM-powered pipeline for evaluating OSM solutions called OSMa-Bench (Open Semantic Mapping Benchmark). The study focuses on evaluating state-of-the-art semantic mapping algorithms under varying indoor lighting conditions, a critical challenge in indoor environments. We introduce a novel dataset with simulated RGB-D sequences and ground truth 3D reconstructions, facilitating the rigorous analysis of mapping performance across different lighting conditions. Through experiments on leading models such as ConceptGraphs [1], BBQ [2], and OpenScene [3], we evaluate the semantic fidelity of object recognition and segmentation. Additionally, we introduce a scene graph evaluation method to analyze the ability of models to interpret semantic structure. The results provide insights into the robustness of these models, informing future research directions for developing resilient and adaptable robotic systems. Our code is available at https://be2rlab.github.io/OSMa-Bench/.

Abstract:
Efficient and safe trajectory planning plays a critical role in the application of quadrotor unmanned aerial vehicles. Currently, the inherent trade-off between constraint compliance and computational efficiency enhancement in UAV trajectory optimization problems has not been sufficiently addressed. To enhance the performance of UAV trajectory optimization, we propose a spatial-temporal iterative optimization framework. Firstly, B-splines are utilized to represent UAV trajectories, with rigorous safety assurance achieved through strict enforcement of constraints on control points. Subsequently, a set of QP-LP subproblems via spatial-temporal decoupling and constraint linearization is derived. Finally, an iterative optimization strategy incorporating guidance gradients is employed to obtain high-performance UAV trajectories in different scenarios. Both simulation and real-world experimental results validate the efficiency and high performance of the proposed optimization framework in generating safe and fast trajectories. Our source code will be released for community reference.
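
Enforcing constraints on B-spline control points works because of the convex hull property: the derivative of a B-spline is itself a B-spline whose control points are scaled differences of the original ones, so bounding those difference points bounds the velocity everywhere. The following is a minimal one-dimensional sketch under stated assumptions (uniform cubic B-spline, constant knot spacing `dt`, hypothetical function names); it is not the paper's QP-LP formulation.

```python
def velocity_control_points(ctrl, dt):
    """Control points of the derivative of a uniform B-spline with knot spacing dt."""
    return [(b - a) / dt for a, b in zip(ctrl, ctrl[1:])]


def within_limit(values, limit):
    """True if every value satisfies |value| <= limit."""
    return all(abs(v) <= limit for v in values)


def sampled_velocity(ctrl, dt, samples_per_segment=20):
    """Sample the velocity of a uniform cubic B-spline on every segment.

    The velocity is a uniform quadratic B-spline over the velocity control
    points, evaluated here with the standard quadratic basis on u in [0, 1].
    """
    vel = velocity_control_points(ctrl, dt)
    samples = []
    for i in range(len(vel) - 2):
        v0, v1, v2 = vel[i:i + 3]
        for k in range(samples_per_segment + 1):
            u = k / samples_per_segment
            samples.append(0.5 * ((1 - u) ** 2 * v0
                                  + (-2 * u * u + 2 * u + 1) * v1
                                  + u * u * v2))
    return samples


if __name__ == "__main__":
    ctrl = [0.0, 0.5, 1.5, 3.0, 4.0, 4.5]  # toy control points
    v_max = 1.5
    vel_pts = velocity_control_points(ctrl, dt=1.0)
    # Checking a handful of control points is enough to certify the
    # whole continuous trajectory against the velocity limit.
    print(within_limit(vel_pts, v_max))
    print(max(abs(v) for v in sampled_velocity(ctrl, dt=1.0)) <= v_max)
```

The check on control points is conservative: the true peak velocity is usually lower than the largest velocity control point, which is the price paid for a constraint that is linear in the decision variables.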

Abstract:
Egocentric pose estimation is a fundamental capability for multi-robot collaborative perception in connected autonomy, such as connected autonomous vehicles. During multi-robot operations, a robot needs to know the relative pose between itself and its teammates with respect to its own coordinates. However, different robots usually observe completely different views that contain similar objects, which leads to wrong pose estimation. In addition, it is unrealistic to allow robots to share their raw observations to detect overlap due to the limited communication bandwidth. In this paper, we introduce a novel method for Non-Overlap-Aware Egocentric Pose Estimation (NOPE), which performs egocentric pose estimation in a multi-robot team while identifying the non-overlapping views and satisfying the communication bandwidth constraint. NOPE is built upon a unified hierarchical learning framework that integrates two levels of robot learning: (1) high-level deep graph matching for correspondence identification, which determines whether two views overlap, and (2) low-level position-aware cross-attention graph learning for egocentric pose estimation. To evaluate NOPE, we conduct extensive experiments in both high-fidelity simulation and real-world scenarios. Experimental results demonstrate that NOPE enables the novel capability of non-overlap-aware egocentric pose estimation and achieves state-of-the-art performance compared with existing methods.

Abstract:
Unmanned aerial vehicle object detection (UAV-OD) has been widely used in various scenarios. However, most existing UAV-OD algorithms rely on manually designed components, which require extensive tuning. End-to-end models that do not depend on such manually designed components are mainly designed for natural images, which are less effective for UAV imagery. To address such challenges, this paper proposes an efficient detection transformer (DETR) framework tailored for UAV imagery, i.e., UAV-DETR. The framework includes a multi-scale feature fusion with frequency enhancement module, which captures both spatial and frequency information at different scales. In addition, a frequency-focused downsampling module is presented to retain critical spatial details during downsampling. A semantic alignment and calibration module is developed to align and fuse features from different fusion paths. Experimental results demonstrate the effectiveness and generalization of our approach across various UAV imagery datasets. On the VisDrone dataset, our method improves AP by 3.1% and AP50 by 4.2% over the baseline. Similar enhancements are observed on the UAVVaste dataset. The project page is available at https://github.com/ValiantDiligent/UAV-DETR.

Abstract:
3D Gaussian Splatting (3DGS) achieves remarkable results in the field of surface reconstruction. However, when Gaussian normal vectors are aligned within the single-view projection plane, while the geometry appears reasonable in the current view, biases may emerge upon switching to nearby views. To address the distance and global matching challenges in multi-view scenes, we design multi-view normal and distance-guided Gaussian splatting. This method achieves geometric depth unification and high-accuracy reconstruction by constraining nearby depth maps and aligning 3D normals. Specifically, for the reconstruction of small indoor and outdoor scenes, we propose a multi-view distance reprojection regularization module that achieves multi-view Gaussian alignment by computing the distance loss between two nearby views and the same Gaussian surface. Additionally, we develop a multi-view normal enhancement module, which ensures consistency across views by matching the normals of pixel points in nearby views and calculating the loss. Extensive experimental results demonstrate that our method outperforms the baseline in both quantitative and qualitative evaluations, significantly enhancing the surface reconstruction capability of 3DGS.

Abstract:
Recent advancements in control of prosthetic hands have focused on increasing autonomy through the use of cameras and other sensory inputs. These systems aim to reduce the cognitive load on the user by automatically controlling certain degrees of freedom. In robotics, imitation learning has emerged as a promising approach for learning grasping and complex manipulation tasks while simplifying data collection. Its application to the control of prosthetic hands remains, however, largely unexplored. Bridging this gap could enhance dexterity restoration and enable prosthetic devices to operate in more unconstrained scenarios, where tasks are learned from demonstrations rather than relying on manually annotated sequences. To this end, we present HannesImitationPolicy, an imitation learning-based method to control the Hannes prosthetic hand, enabling object grasping in unstructured environments. Moreover, we introduce the HannesImitationDataset comprising grasping demonstrations in table, shelf, and human-to-prosthesis handover scenarios. We leverage such data to train a single diffusion policy and deploy it on the prosthetic hand to predict the wrist orientation and hand closure for grasping. Experimental evaluation demonstrates successful grasps across diverse objects and conditions. Finally, we show that the policy outperforms a segmentation-based visual servo controller in unstructured scenarios. Additional material is provided on our project page: https://hsp-iit.github.io/HannesImitation.

Abstract:
Novel Instance Detection and Segmentation (NIDS) aims at detecting and segmenting novel object instances given a few examples of each instance. We propose a unified, simple, yet effective framework (NIDS-Net) comprising object proposal generation, embedding creation for both instance templates and proposal regions, and embedding matching for instance label assignment. Leveraging recent advancements in large vision methods, we utilize Grounding DINO and Segment Anything Model (SAM) to obtain object proposals with accurate bounding boxes and masks. Central to our approach is the generation of high-quality instance embeddings. We utilize foreground feature averages of patch embeddings from the DINOv2 ViT backbone, followed by refinement through a weight adapter mechanism that we introduce. We show experimentally that our weight adapter can adjust the embeddings locally within their feature space and effectively limit overfitting in the few-shot setting. Furthermore, the weight adapter optimizes weights to enhance the distinctiveness of instance embeddings during similarity computation. This methodology enables a straightforward matching strategy that results in significant performance gains. Our framework surpasses current state-of-the-art methods, demonstrating notable improvements in four detection datasets. In the segmentation tasks on seven core datasets of the BOP challenge, our method outperforms the leading published RGB methods and remains competitive with the best RGB-D method. We have also verified our method using real-world images from a Fetch robot and a RealSense camera.
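
The embedding steps described above (foreground averaging of patch embeddings, then matching proposals to instance templates) can be sketched as follows. This is an illustrative toy using cosine similarity and a hypothetical acceptance `threshold`; the weight adapter is omitted, and the function names and the matching rule are assumptions rather than the NIDS-Net implementation.

```python
import math


def foreground_average(patch_embeddings, mask):
    """Average only the patch embeddings whose mask entry is truthy."""
    foreground = [e for e, m in zip(patch_embeddings, mask) if m]
    if not foreground:
        raise ValueError("mask selects no foreground patches")
    dim = len(foreground[0])
    return [sum(e[d] for e in foreground) / len(foreground) for d in range(dim)]


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def assign_labels(proposal_embeddings, template_embeddings, threshold=0.5):
    """Give each proposal the label of its most similar template.

    Proposals whose best similarity does not exceed the (hypothetical)
    threshold are left unlabeled (None).
    """
    labels = []
    for proposal in proposal_embeddings:
        best_label, best_score = None, threshold
        for label, template in template_embeddings.items():
            score = cosine_similarity(proposal, template)
            if score > best_score:
                best_label, best_score = label, score
        labels.append(best_label)
    return labels


if __name__ == "__main__":
    templates = {"mug": [1.0, 0.0, 0.0], "box": [0.0, 1.0, 0.0]}
    # Three patches; the last one is background and is masked out.
    patches = [[0.9, 0.1, 0.0], [1.1, -0.1, 0.0], [0.0, 0.0, 5.0]]
    proposal = foreground_average(patches, mask=[1, 1, 0])
    print(assign_labels([proposal, [0.1, 0.9, 0.0], [0.0, 0.0, 1.0]], templates))
```

Masking before averaging keeps background patches from diluting the instance embedding, which is what makes a simple nearest-template rule workable in this toy setting.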

Abstract:
Robotic manipulation in dynamic environments often requires seamless transitions between different grasp types to maintain stability and efficiency. However, achieving smooth and adaptive grasp transitions remains a challenge, particularly when dealing with external forces and complex motion constraints. Existing grasp transition strategies often fail to account for varying external forces and do not optimize motion performance effectively. In this work, we propose an Imitation-Guided Bimanual Planning Framework that integrates efficient grasp transition strategies and motion performance optimization to enhance stability and dexterity in robotic manipulation. Our approach introduces Strategies for Sampling Stable Intersections in Grasp Manifolds for seamless transitions between uni-manual and bi-manual grasps, reducing computational costs and regrasping inefficiencies. Additionally, a Hierarchical Dual-Stage Motion Architecture combines an Imitation Learning-based Global Path Generator with a Quadratic Programming-driven Local Planner to ensure real-time motion feasibility, obstacle avoidance, and superior manipulability. The proposed method is evaluated through a series of force-intensive tasks, demonstrating significant improvements in grasp transition efficiency and motion performance. A video demonstrating our simulation results can be viewed at https://youtu.be/3DhbUsv4eDo.

Abstract:
Omnidirectional depth perception is essential for mobile robotics applications that require scene understanding across a full 360° field of view. Camera-based setups offer a cost-effective option by using stereo depth estimation to generate dense, high-resolution depth maps without relying on expensive active sensing. However, existing omnidirectional stereo matching approaches achieve only limited depth accuracy across diverse environments, depth ranges, and lighting conditions, due to the scarcity of real-world data. We present DFI-OmniStereo, a novel omnidirectional stereo matching method that leverages a large-scale pre-trained foundation model for relative monocular depth estimation within an iterative optimization-based stereo matching architecture. We introduce a dedicated two-stage training strategy to utilize the relative monocular depth features for our omnidirectional stereo matching before scale-invariant fine-tuning. DFI-OmniStereo achieves state-of-the-art results on the real-world Helvipad dataset, reducing disparity MAE by approximately 16% compared to the previous best omnidirectional stereo method.

Abstract:
This paper introduces MCTrack, a new 3D multi-object tracking method that achieves state-of-the-art (SOTA) performance across KITTI, nuScenes, and Waymo datasets. Addressing the gap in existing tracking paradigms, which often perform well on specific datasets but lack generalizability, MCTrack offers a unified solution. Additionally, we have standardized the format of perceptual results across various datasets, termed BaseVersion, allowing researchers in the field of multi-object tracking (MOT) to concentrate on core algorithmic development without the undue burden of data preprocessing. Finally, recognizing the limitations of current evaluation metrics, we introduce a novel set of metrics designed to evaluate the output of motion information, including velocity and acceleration, which are essential for subsequent tasks. The source code of the proposed method is available at https://github.com/megvii-research/MCTrack

Abstract:
Vision-and-Language Navigation (VLN) in continuous environments requires agents to interpret natural language instructions while navigating unconstrained 3D spaces. Existing VLN-CE frameworks rely on a two-stage approach: a waypoint predictor to generate waypoints and a navigator to execute movements. However, current waypoint predictors struggle with spatial awareness, while navigators lack historical reasoning and backtracking capabilities, limiting adaptability. We propose a zero-shot VLN-CE framework integrating an enhanced waypoint predictor with a Multi-modal Large Language Model (MLLM)-based navigator. Our predictor employs a stronger vision encoder, masked cross-attention fusion, and an occupancy-aware loss for better waypoint quality. The navigator incorporates history-aware reasoning and adaptive path planning with backtracking, improving robustness. Experiments on R2R-CE and MP3D benchmarks show our method achieves state-of-the-art (SOTA) performance in zero-shot settings, demonstrating competitive results compared to fully supervised methods. Real-world validation on Turtlebot 4 further highlights its adaptability.

Abstract:
Existing LiDAR-Inertial Odometry (LIO) systems typically use sensor-specific or environment-dependent measurement covariances during state estimation, leading to laborious parameter tuning and suboptimal performance in challenging conditions (e.g., sensor degeneracy and noisy observations). Therefore, we propose an Adaptive Kalman Filter (AKF) framework that dynamically estimates time-varying noise covariances of LiDAR and Inertial Measurement Unit (IMU) measurements, enabling context-aware confidence weighting between sensors. During LiDAR degeneracy, the system prioritizes IMU data while suppressing contributions from unreliable inputs like moving objects or noisy point clouds. Furthermore, a compact Gaussian-based map representation is introduced to model environmental planarity and spatial noise. A correlated registration strategy ensures accurate plane normal estimation via pseudo-merge, even in unstructured environments like forests. Extensive experiments validate the robustness of the proposed system across diverse environments, including dynamic scenes and geometrically degraded scenarios. Our method achieves reliable localization results across all MARS-LVIG sequences and ranks 8th on the KITTI Odometry Benchmark. The code will be released at https://github.com/xpxie/AKF-LIO.git.
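
The idea of estimating a time-varying measurement noise covariance online can be illustrated with a generic scalar sketch. This is not the AKF-LIO implementation: it assumes a one-dimensional random-walk state and a standard innovation-based estimate (windowed innovation variance minus the predicted state variance), and the class name and all parameters are hypothetical.

```python
from collections import deque

class AdaptiveScalarKF:
    """Scalar Kalman filter whose measurement noise r is re-estimated online."""

    def __init__(self, x0, p0, q, r0, window=4, r_min=1e-6):
        self.x, self.p = x0, p0          # state estimate and its variance
        self.q, self.r = q, r0           # process noise and measurement noise
        self.r_min = r_min               # floor that keeps r positive
        self.innovations = deque(maxlen=window)

    def step(self, z):
        # Predict under a random-walk model (assumption): x stays, p grows by q.
        p_pred = self.p + self.q
        innovation = z - self.x
        self.innovations.append(innovation)
        # Once the window is full, estimate r from the innovation statistics:
        # E[v^2] = p_pred + r  =>  r = mean(v^2) - p_pred.
        if len(self.innovations) == self.innovations.maxlen:
            c = sum(v * v for v in self.innovations) / len(self.innovations)
            self.r = max(self.r_min, c - p_pred)
        # Standard Kalman update with the adapted r.
        k = p_pred / (p_pred + self.r)
        self.x += k * innovation
        self.p = (1.0 - k) * p_pred
        return self.x

# Toy usage: measurements that jump between +5 and -5 inflate the estimated r,
# so the filter trusts them less than the initial r0 = 0.01 would suggest.
kf = AdaptiveScalarKF(x0=0.0, p0=1.0, q=0.01, r0=0.01)
for i in range(8):
    kf.step(5.0 if i % 2 == 0 else -5.0)
print(kf.r > 1.0)
```

A larger estimated `r` lowers the Kalman gain, which is the scalar analogue of down-weighting an unreliable sensor.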

Abstract:
We propose a tightly-coupled LiDAR/Polarization Vision/Inertial/Magnetometer/Optical Flow Odometry via Smoothing and Mapping (LPVIMO-SAM) framework, which integrates LiDAR, polarization vision, inertial measurement unit, magnetometer, and optical flow in a tightly-coupled fusion. It enables high-precision and robust real-time state estimation and map construction in challenging environments, such as LiDAR-degraded, low-texture, and feature-scarce areas. The LPVIMO-SAM comprises two subsystems: a polarized vision-inertial system and a LiDAR/Inertial/Magnetometer/Optical Flow system. The polarized vision enhances the robustness of the Visual/Inertial odometry in low-feature and low-texture scenarios by extracting the polarization information of the scene. The magnetometer acquires the heading angle, and the optical flow obtains the speed and height to reduce the accumulated error. A magnetometer heading prior factor, an optical flow speed observation factor, and a height observation factor are designed to eliminate the cumulative errors of the LiDAR/Inertial odometry through factor graph optimization. Meanwhile, the LPVIMO-SAM can maintain stable positioning even when one of the two subsystems fails, further expanding its applicability in LiDAR-degraded, low-texture, and low-feature environments. Our comparative experiments against existing algorithms reveal that LPVIMO-SAM achieves a 43.4% reduction in localization root mean square error relative to LVI-SAM, demonstrating significant precision improvements.

Abstract:
Developing autonomous vehicles that can navigate complex environments with human-level safety and efficiency is a central goal in self-driving research. A common approach to achieving this is imitation learning, where agents are trained to mimic human expert demonstrations collected from real-world driving scenarios. However, discrepancies between human perception and the self-driving car's sensors can introduce an imitation gap, leading to imitation learning failures. In this work, we introduce IGDrivSim, a benchmark built on top of the Waymax simulator, designed to investigate the effects of the imitation gap in learning autonomous driving policy from human expert demonstrations. Our experiments show that this perception gap between human experts and self-driving agents can hinder the learning of safe and effective driving behaviors. We further show that combining imitation with reinforcement learning, using a simple penalty reward for prohibited behaviors, effectively mitigates these failures. All code developed for this work is released as open source.

Abstract:
Deformable object manipulation in robotics presents significant challenges due to uncertainties in component properties, diverse configurations, visual interference, and ambiguous prompts. These factors complicate both perception and control tasks. To address these challenges, we propose a novel method for One-Shot Affordance Grounding of Deformable Objects (OS-AGDO) in egocentric organizing scenes, enabling robots to recognize previously unseen deformable objects with varying colors and shapes using minimal samples. Specifically, we first introduce the Deformable Object Semantic Enhancement Module (DefoSEM), which enhances hierarchical understanding of the internal structure and improves the ability to accurately identify local features, even under conditions of weak component information. Next, we propose the ORB-Enhanced Keypoint Fusion Module (OEKFM), which optimizes feature extraction of key components by leveraging geometric constraints and improves adaptability to diversity and visual interference. Additionally, we propose an instance-conditional prompt based on image data and task context, which effectively mitigates the issue of region ambiguity caused by prompt words. To validate these methods, we construct a diverse real-world dataset, AGDDO15, which includes 15 common types of deformable objects and their associated organizational actions. Experimental results demonstrate that our approach significantly outperforms state-of-the-art methods, achieving improvements of 6.2%, 3.2%, and 2.9% in KLD, SIM, and NSS metrics, respectively, while exhibiting high generalization performance. Source code and benchmark dataset are made publicly available at https://github.com/Dikay1/OS-AGDO.

Abstract:
This paper presents a new, highly efficient method for anomaly detection in automated systems with time- and compute-sensitive requirements. As systems like autonomous driving become increasingly popular, ensuring their safety has become more important than ever. With this motivation, this paper focuses on how to quickly and effectively detect various anomalies in the aforementioned systems. Many detection systems have been developed with great success under spatial contexts. However, there is still significant room for improvement when it comes to temporal context. While there is substantial work regarding this task, there is minimal work done regarding the efficiency of models and their ability to be applied to scenarios that require real-time inference. To address this gap, we propose STEAD (Spatio-Temporal Efficient Anomaly Detection), whose backbone is developed using (2+1)D Convolutions and Performer Linear Attention, which ensures computational efficiency without sacrificing performance. When evaluated on the UCF-Crime benchmark, our base model achieves an AUC of 91.34%, outperforming the previous SOTA (state of the art), and our fast version achieves an AUC of 88.87%, while having 99.70% fewer parameters and outperforming the previous SOTA as well. The code and pretrained models are made publicly available at https://github.com/agao8/STEAD.

Abstract:
Grasp detection is a fundamental robotic task critical to the success of many industrial applications. However, current language-driven models for this task often struggle with cluttered images, lengthy textual descriptions, or slow inference speed. We introduce GraspMamba, a new language-driven grasp detection method that employs hierarchical feature fusion with Mamba vision to tackle these challenges. By leveraging rich visual features of the Mamba-based backbone alongside textual information, our approach effectively enhances the fusion of multimodal features. GraspMamba represents the first Mamba-based grasp detection model to extract vision and language features at multiple scales, delivering robust performance and rapid inference time. Intensive experiments show that GraspMamba outperforms recent methods by a clear margin. We validate our approach through real-world robotic experiments, highlighting its fast inference speed.

Abstract:
3D occupancy prediction (3DOcc) is a rapidly rising and challenging perception task in the field of autonomous driving. Existing 3D occupancy networks (OccNets) are both computationally heavy and label-hungry. In terms of model complexity, OccNets are commonly composed of heavy Conv3D modules or transformers at the voxel level. Moreover, OccNets are supervised with expensive large-scale dense voxel labels. Model and label inefficiencies, caused by excessive network parameters and label annotation requirements, severely hinder the onboard deployment of OccNets. This paper proposes an EFFicient Occupancy learning framework, EFFOcc, that targets minimal network complexity and label requirements while achieving state-of-the-art accuracy. We first propose an efficient fusion-based OccNet that only uses simple 2D operators and improves accuracy to the state-of-the-art on three large-scale benchmarks: Occ3D-nuScenes, Occ3D-Waymo, and OpenOccupancy-nuScenes. On the Occ3D-nuScenes benchmark, the fusion-based model with ResNet-18 as the image backbone has 21.35M parameters and achieves 51.49 in terms of mean Intersection over Union (mIoU). Furthermore, we propose a multi-stage occupancy-oriented distillation to efficiently transfer knowledge to vision-only OccNet. Extensive experiments on occupancy benchmarks show state-of-the-art precision for both fusion-based and vision-based OccNets. For the demonstration of learning with limited labels, we achieve 94.38% of the performance (mIoU = 28.38) of a 100% labeled vision OccNet (mIoU = 30.07) using the same OccNet trained with only 40% labeled sequences and distillation from the fusion-based OccNet. Code is available at https://github.com/synsin0/EFFOcc.
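
The limited-label figure quoted above follows directly from the two reported mIoU values. A short check, using only the numbers stated in the abstract (the helper name is hypothetical):

```python
def relative_performance(partial_miou, full_miou):
    """Percentage of the fully supervised mIoU retained by the partial-label model."""
    return 100.0 * partial_miou / full_miou

# 40%-labeled model with distillation (28.38) vs. 100%-labeled model (30.07).
print(f"{relative_performance(28.38, 30.07):.2f}%")
```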

Abstract:
Tracking any point based on image frames is constrained by frame rates, leading to instability in high-speed scenarios and limited generalization in real-world applications. To overcome these limitations, we propose an image-event fusion point tracker, FE-TAP, which combines the contextual information from image frames with the high temporal resolution of events, achieving high frame rate and robust point tracking under various challenging conditions. Specifically, we designed an Evolution Fusion module (EvoFusion) to model the image generation process guided by events. This module can effectively integrate valuable information from both modalities operating at different frequencies. To achieve smoother point trajectories, we employed a transformer-based refinement strategy that updates the point’s trajectories and features iteratively. Extensive experiments demonstrate that our method outperforms state-of-the-art approaches, particularly improving expected feature age by 24% on EDS datasets. Finally, we qualitatively validated the robustness of our algorithm in real driving scenarios using our custom-designed image-event synchronization device.

Abstract:
Multi-Agent Reinforcement Learning (MARL) plays a crucial role in robotic coordination and control, yet existing simulation environments often lack the fidelity and scalability needed for real-world applications. In this work, we extend Isaac Lab to support efficient training of both homogeneous and heterogeneous multi-agent robotic policies in high-fidelity physics simulations. Our contributions include the development of diverse MARL environments tailored for robotic coordination, integration of Heterogeneous Agent Reinforcement Learning (HARL) algorithms, and a scalable GPU-accelerated framework optimized for large-scale training. We evaluate our framework using two state-of-the-art MARL algorithms—Multi-Agent Reinforcement Learning with Proximal Policy Optimization (MAPPO) and Heterogeneous Agent Reinforcement Learning with Proximal Policy Optimization (HAPPO)—across several robotic tasks. Our results confirm the feasibility of training heterogeneous agents in high-fidelity environments while maintaining the scalability and performance benefits of Isaac Lab. By advancing realistic multi-agent learning at scale, our work lays a foundation for more MARL research in physics-driven robotics. The source code and demonstration videos are available at https://some45bucks.github.io/IsaacLab-HARL/.

Abstract:
Traffic Atomic Activity, which describes traffic patterns for topological intersection dynamics, is a crucial topic for the advancement of intelligent driving systems. However, existing atomic activity datasets are collected from an egocentric view, which cannot support the scenarios where traffic activities in an entire intersection are required. Moreover, existing datasets only provide video-level atomic activity annotations, which require exhausting efforts to manually trim the videos for recognition and limit their applications to untrimmed videos. To bridge this gap, we introduce the Aerial Traffic Atomic Activity Recognition and Segmentation (ATARS) dataset, the first aerial dataset designed for multi-label atomic activity analysis. We offer atomic activity labels for each frame, which accurately record the intervals for traffic activities. Moreover, we propose a novel task, Multi-label Temporal Atomic Activity Recognition, enabling the study of accurate temporal localization for atomic activity and easing the burden of manual video trimming for recognition. We conduct extensive experiments to evaluate existing state-of-the-art models on both atomic activity recognition and temporal atomic activity segmentation. The results highlight the unique challenges of our ATARS dataset, such as recognizing extremely small objects’ activities. We further provide a comprehensive discussion analyzing these challenges and offer valuable insights for future directions to improve recognition of atomic activity in an aerial view. Our source code and dataset are available at https://github.com/magecliff96/ATARS/.

Abstract:
Vehicle-to-everything (V2X) communication plays a crucial role in autonomous driving, enabling cooperation between vehicles and infrastructure. While simulation has significantly contributed to various autonomous driving tasks, its potential for data generation and augmentation in V2X scenarios remains underexplored. In this paper, we introduce CRUISE, a comprehensive reconstruction-and-synthesis framework designed for V2X driving environments. CRUISE employs decomposed Gaussian Splatting to accurately reconstruct real-world scenes while supporting flexible editing. By decomposing dynamic traffic participants into editable Gaussian representations, CRUISE allows for seamless modification and augmentation of driving scenes. Furthermore, the framework renders images from both ego-vehicle and infrastructure views, enabling large-scale V2X dataset augmentation for training and evaluation. Our experimental results demonstrate that: 1) CRUISE reconstructs real-world V2X driving scenes with high fidelity; 2) using CRUISE improves 3D detection across ego-vehicle, infrastructure, and cooperative views, as well as cooperative 3D tracking on the V2X-Seq benchmark; and 3) CRUISE effectively generates challenging corner cases. The code will be publicly available at https://github.com/SainingZhang/CRUISE.

Abstract:
We introduce a novel grasp representation named the Unified Gripper Coordinate Space (UGCS) for grasp synthesis and grasp transfer. Our representation leverages spherical coordinates to create a shared coordinate space across different robot grippers, enabling it to synthesize and transfer grasps for both novel objects and previously unseen grippers. The strength of this representation lies in the ability to map palm and fingers of a gripper in the unified coordinate space. Grasp synthesis is formulated as predicting the unified spherical coordinates on object surface points via a conditional variational autoencoder. The predicted unified gripper coordinates establish exact correspondences between the gripper and object points, which are used to optimize grasp pose and joint values. Grasp transfer is facilitated through the point-to-point correspondence between any two (potentially unseen) grippers and solved via a similar optimization. Extensive simulation and real-world experiments showcase the efficacy of the unified grasp representation for grasp synthesis in generating stable and diverse grasps. Similarly, we showcase real-world grasp transfer from human demonstrations across different objects.

Abstract:
Skeleton-based Human Action Recognition (HAR) is a vital technology in robotics and human–robot interaction. However, most existing methods concentrate primarily on full-body movements and often overlook subtle hand motions that are critical for distinguishing fine-grained actions. Recent work leverages a unified graph representation that combines body, hand, and foot keypoints to capture detailed body dynamics. Yet, these models often blur fine hand details due to the disparity between body and hand action characteristics and the loss of subtle features during the spatial-pooling. In this paper, we propose BHaRNet (Body–Hand action Recognition Network), a novel framework that augments a typical body-expert model with a hand-expert model. Our model jointly trains both streams with an ensemble loss that fosters cooperative specialization, functioning in a manner reminiscent of a Mixture-of-Experts (MoE). Moreover, cross-attention is employed via an expertized branch method and a pooling-attention module to enable feature-level interactions and selectively fuse complementary information. Inspired by MMNet, we also demonstrate the applicability of our approach to multi-modal tasks by leveraging RGB information, where body features guide RGB learning to capture richer contextual cues. Experiments on large-scale benchmarks (NTU RGB+D 60, NTU RGB+D 120, PKU-MMD, and Northwestern-UCLA) demonstrate that BHaRNet achieves SOTA accuracies—improving from 86.4% to 93.0% in hand-intensive actions—while maintaining fewer GFLOPs and parameters than the relevant unified methods.

Abstract:
This paper presents a novel tightly coupled filter-based monocular visual-inertial-wheel odometry (VIWO) system for ground robots, designed to deliver accurate and robust localization in long-term complex outdoor navigation scenarios. As an external sensor, the camera enhances localization performance by introducing visual constraints. However, obtaining a sufficient number of effective visual features is often challenging, particularly in dynamic or low-texture environments. To address this issue, we incorporate line features for additional geometric constraints. Unlike traditional approaches that treat point and line features independently, our method exploits the geometric relationships between points and lines in 2D images, enabling fast and robust line matching and triangulation. Additionally, we introduce a Motion Consistency Check (MCC) to filter out potential dynamic points, ensuring the effectiveness of point feature updates. The proposed system was evaluated on publicly available datasets and benchmarked against state-of-the-art methods. Experimental results demonstrate superior performance in terms of accuracy, robustness, and efficiency. The source code is publicly available at: https://github.com/Happy-ZZX/PL-VIWO.

Abstract:
Manipulation of thin materials is critical for many everyday tasks and remains a significant challenge for robots. While existing research has made strides in tasks like material smoothing and folding, many studies struggle with common failure modes (crumpled corners/edges, incorrect grasp configurations) that a preliminary step of layer detection could solve. We present a novel method for classifying the number of grasped material layers using a custom gripper equipped with DenseTact 2.0 optical tactile sensors. After grasping, the gripper performs an anthropomorphic rubbing motion while collecting optical flow, 6-axis wrench, and joint state data. Using this data in a transformer-based network achieves a test accuracy of 98.21% in classifying the number of grasped cloth layers, and 81.25% accuracy in classifying layers of grasped paper, showing the effectiveness of our dynamic rubbing method. Evaluating different inputs and model architectures highlights the usefulness of tactile sensor information and a transformer model for this task. A comprehensive dataset of 568 labeled trials (368 for cloth and 200 for paper) was collected and made open-source along with this paper.

Abstract:
Accurate 6D pose estimation is key for robotic manipulation, enabling precise object localization for tasks like grasping. We present RAG-6DPose, a retrieval-augmented approach that leverages 3D CAD models as a knowledge base by integrating both visual and geometric cues. Our RAG-6DPose roughly contains three stages: 1) Building a Multi-Modal CAD Knowledge Base by extracting 2D visual features from multi-view CAD rendered images and also attaching 3D points; 2) Retrieving relevant CAD features from the knowledge base based on the current query image via our ReSPC module; and 3) Incorporating retrieved CAD information to refine pose predictions via retrieval-augmented decoding. Experimental results on standard benchmarks and real-world robotic tasks demonstrate the effectiveness and robustness of our approach, particularly in handling occlusions and novel viewpoints. Supplementary material is available on our project website: https://sressers.github.io/RAG-6DPose.

Abstract:
Minimally invasive procedures have advanced rapidly thanks to robotic laparoscopic surgery, which greatly assists surgeons in sophisticated and precise operations with reduced invasiveness. Nevertheless, it remains safety-critical to be aware of even the slightest tissue deformation during instrument-tissue interactions, especially in 3D space. To address this, recent works rely on NeRF to render 2D videos from different perspectives and eliminate occlusions. However, most of these methods fail to robustly predict accurate 3D shapes and the associated deformation estimates. In contrast, we propose Tracking-Aware Deformation Field (TADF), a novel framework which reconstructs the 3D mesh along with the 3D tissue deformation simultaneously. It first tracks the key points of soft tissue by a foundation vision model, providing an accurate 2D deformation field. Then, the 2D deformation field is smoothly incorporated with a neural implicit reconstruction network to obtain tissue deformation in the 3D space. Finally, we experimentally demonstrate that the proposed method provides more accurate deformation estimation compared with other 3D neural reconstruction methods on two public datasets. Our demo is available at https://kasumigaoka-utaha.github.io/TADF-web/. Our code is available at https://github.com/Zing110/TADF.

Abstract:
Effective monitoring of underwater ecosystems is crucial for tracking environmental changes, guiding conservation efforts, and ensuring long-term ecosystem health. However, automating underwater ecosystem management with robotic platforms remains challenging due to the complexities of underwater imagery, which pose significant difficulties for traditional visual localization methods. We propose an integrated pipeline that combines Visual Place Recognition (VPR), feature matching, and image segmentation on images extracted from video sequences. This method enables robust identification of revisited areas, estimation of rigid transformations, and downstream analysis of ecosystem changes. Furthermore, we introduce the SQUIDLE+ VPR Benchmark, the first large-scale underwater VPR benchmark designed to leverage an extensive collection of unstructured data from multiple robotic platforms, spanning time intervals from days to years. The dataset encompasses diverse trajectories with varying overlap and diverse seafloor types captured under different environmental conditions, including differences in depth, lighting, and turbidity. Our code is available at: https://github.com/bev-gorry/underloc.

Abstract:
Quadcopter attitude control involves two tasks: smooth attitude tracking and aggressive stabilization from arbitrary states. Although both can be formulated as tracking problems, their distinct state spaces and control strategies complicate a unified reward function. We propose a multitask deep reinforcement learning framework that leverages parallel simulation with IsaacGym and a Graph Convolutional Network (GCN) policy to address both tasks effectively. Our multitask Soft Actor-Critic (SAC) approach achieves faster, more reliable learning and higher sample efficiency than single-task methods. We validate its real-world applicability by deploying the learned policy—a compact two-layer network with 24 neurons per layer—on a Pixhawk flight controller, achieving 400 Hz control without extra computational resources. We provide our code at https://github.com/robot-perception-group/GraphMTSAC_UAV/.

Abstract:
Tendon-driven mechanisms are useful from the perspectives of variable stiffness, redundant actuation, and lightweight design, and they are widely used, particularly in hands, wrists, and waists of robots. The design of these wire arrangements has traditionally been done empirically, but it becomes extremely challenging when dealing with complex structures. Various studies have attempted to optimize wire arrangement, but many of them have oversimplified the problem by imposing conditions such as restricting movements to a 2D plane, keeping the moment arm constant, or neglecting wire crossings. Therefore, this study proposes a three-dimensional wire arrangement optimization that takes wire crossings into account. We explore wire arrangements through a multi-objective black-box optimization method that ensures wires do not cross while providing sufficient joint torque along a defined target trajectory. For a 3D link structure, we optimize the wire arrangement under various conditions, demonstrate its effectiveness, and discuss the obtained design solutions.

Abstract:
Operating in unstructured environments like households requires robotic policies that are robust to out-of-distribution conditions. Although much work has been done in evaluating robustness for visuomotor policies, the robustness evaluation of a multisensory approach that includes force-torque sensing remains largely unexplored. This work introduces a novel, factor-based evaluation framework with the goal of assessing the robustness of multisensory policies in a peg-in-hole assembly task. To this end, we develop a multisensory policy framework utilizing the Perceiver IO architecture to learn the task. We investigate which factors pose the greatest generalization challenges in object assembly and explore a simple multisensory data augmentation technique to enhance out-of-distribution performance. We provide a simulation environment enabling controlled evaluation of these factors. Our results reveal that multisensory variations such as Grasp Pose present the most significant challenges for robustness, and naive unisensory data augmentation applied independently to each sensory modality proves insufficient to overcome them. Additionally, we find force-torque sensing to be the most informative modality for our contact-rich assembly task, with vision being the least informative. Finally, we briefly discuss supporting real-world experimental results. For additional experiments and qualitative results, we refer to the project webpage https://rpm-lab-umn.github.io/auginsert/.

Abstract:
Robot navigation in dynamic, crowded environments poses a significant challenge due to the inherent uncertainties in the obstacle model. In this work, we propose a risk-adaptive approach based on the Conditional Value-at-Risk Barrier Function (CVaR-BF), where the risk level is automatically adjusted to accept the minimum necessary risk, achieving a good performance in terms of safety and optimization feasibility under uncertainty. Additionally, we introduce a dynamic zone-based barrier function which characterizes the collision likelihood by evaluating the relative state between the robot and the obstacle. By integrating risk adaptation with this new function, our approach adaptively expands the safety margin, enabling the robot to proactively avoid obstacles in highly dynamic environments. Comparisons and ablation studies demonstrate that our method outperforms existing social navigation approaches, and validate the effectiveness of our proposed framework.
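
For readers unfamiliar with Conditional Value-at-Risk, the quantity underlying the approach above is the expected loss in the worst tail of a distribution. The sketch below is the standard sample-based estimate, not the paper's barrier-function formulation; the function name and the toy loss values are illustrative assumptions.

```python
import math

def empirical_cvar(losses, alpha):
    """Mean of the worst (1 - alpha) fraction of sampled losses.

    alpha is the confidence level in [0, 1); a higher alpha focuses on a
    smaller, more extreme tail and is therefore more risk-averse.
    """
    if not losses:
        raise ValueError("losses must be non-empty")
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must satisfy 0 <= alpha < 1")
    ordered = sorted(losses, reverse=True)   # largest (worst) losses first
    # Round before ceil to avoid floating-point artifacts such as 3.0000000004.
    k = max(1, math.ceil(round((1.0 - alpha) * len(ordered), 9)))
    tail = ordered[:k]
    return sum(tail) / len(tail)

# Toy usage: ten sampled losses; raising alpha concentrates on the worst cases.
samples = [float(v) for v in range(1, 11)]
print(empirical_cvar(samples, 0.0))   # plain mean of all samples
print(empirical_cvar(samples, 0.8))   # mean of the two largest losses
```

Adjusting `alpha` trades conservatism against feasibility, which is the knob a risk-adaptive scheme would tune.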

Abstract:
Robot exploration aims to reconstruct unknown environments, and it is important to do so with short paths. Traditional methods focus on optimizing the visiting order of frontiers based on current observations, which may lead to locally optimal results. Recently, exploration efficiency has been further improved by predicting the structure of the unseen environment. However, in a cluttered environment, the randomness of obstacles weakens this predictive ability, and the resulting inaccuracy limits the improvement in exploration. Therefore, we propose FPUNet, which efficiently predicts the layout of noisy indoor environments. We then segment rooms and construct their topological connectivity based on the predicted map. The visiting order of these predicted rooms is optimized to provide high-level guidance for exploration. A comparison of FPUNet with other network architectures demonstrates that it is the SOTA method for this task. Extensive experiments in simulation show that our method shortens the path length by 2.18% to 34.60% compared to the baselines.

Abstract:
Wheeled-legged robots combine the efficiency of wheels with the versatility of legs, but face significant energy optimization challenges when navigating diverse environments. In this work, we present a hierarchical control framework that integrates predictive power modeling with residual reinforcement learning to optimize omnidirectional locomotion efficiency for wheeled quadrupedal robots. Our approach employs a novel power prediction network that forecasts energy consumption across different gait patterns over a 1-second horizon, enabling intelligent selection of the most energy-efficient nominal gait. A reinforcement learning policy then generates residual adjustments to this nominal gait, fine-tuning the robot’s actions to balance energy efficiency with performance objectives. Comparative analysis shows our method reduces energy consumption by up to 35% compared to fixed-gait approaches while maintaining comparable velocity tracking performance. We validate our framework through extensive simulations and real-world experiments on a modified Unitree Go1 platform, demonstrating robust performance even under external disturbances. Videos and implementation details are available at https://sites.google.com/view/switching-wpg.

Abstract:
Fusion of LiDAR and RGB data has the potential to enhance outdoor 3D object detection accuracy, and it has started gaining traction as a way to address real-world challenges in this task. However, effective integration of these modalities for precise object detection tasks still remains a largely open problem. To address that, we propose a MultiStream Detection (MuStD) network, which meticulously extracts task-relevant information from both data modalities. The network follows a three-stream structure. Its LiDAR-PillarNet stream extracts sparse 2D pillar features from the LiDAR input while the LiDAR-Height Compression stream computes Bird’s-Eye View features. An additional 3D Multimodal stream combines RGB and LiDAR features using UV mapping and polar coordinate indexing. Eventually, the features containing comprehensive spatial, textural, and geometric information are carefully fused and fed to a detection head for 3D object detection. We evaluate our method on the challenging KITTI Object Detection Benchmark, with results available on the official evaluation server. Our approach achieves strong performance, with an average precision (AP) of 85.39% in 3D detection, 91.34% in Bird’s Eye View (BEV) detection, and 96.39% in 2D detection. These results match or surpass existing state-of-the-art methods. In the difficult "Hard" category, our method attains 80.78% AP in 3D detection and 94.04% AP in 2D detection, highlighting its robustness in challenging scenarios. Furthermore, our method runs at 67 ms, demonstrating efficiency and real-time capability. Our code will be released through the MuStD GitHub repository at https://github.com/IbrahimUWA/MuStD.

Abstract:
As end-to-end autonomous driving advances toward real-world deployment, ensuring the safety of autonomous vehicles (AVs) has become a critical requirement for their commercial viability. While rule-based AVs have traditionally undergone rigorous testing in both real-world and simulated environments before deployment, data-driven autonomous models are typically trained on real-world datasets, limiting their generalization to simulation environments. This poses a significant challenge for the development and testing of end-to-end autonomous driving. To address this issue, we propose Retrieval-Augmented Learning for Autonomous Driving (RALAD), a novel framework designed to bridge the real-to-sim gap in a cost-effective manner. RALAD consists of three key components: (1) domain adaptation via an enhanced Optimal Transport (OT) method, which retrieves the most similar scenarios between real and simulated environments; (2) feature fusion across similar scenarios, enabling the construction of a feature mapping between real-world and simulated domains; and (3) feature extraction freezing with fine-tuning on the fused features, allowing the model to learn simulation-specific characteristics through feature mapping. We evaluate RALAD on three monocular 3D object detection models, and the results demonstrate that our approach significantly improves model accuracy in simulation. Additionally, we test on a real autonomous vehicle in real-world scenarios and build simulated scenes that resemble them for further testing; both illustrate the effectiveness of our method.

Abstract:
Tactile sensors have a wide range of applications, from utilization in robotic grippers to human motion measurement. If tactile sensors could be fabricated and integrated more easily, their applicability would further expand. In this study, we propose a tactile sensor, M3D-skin, that can be easily fabricated with high versatility by leveraging the infill patterns of a multi-material fused deposition modeling (FDM) 3D printer as the sensing principle. This method employs conductive and non-conductive flexible filaments to create a hierarchical structure with a specific infill pattern. The flexible hierarchical structure deforms under pressure, leading to a change in electrical resistance, enabling the acquisition of tactile information. We measure the changes in characteristics of the proposed tactile sensor caused by modifications to the hierarchical structure. Additionally, we demonstrate the fabrication and use of a multi-tile sensor. Furthermore, as applications, we implement motion pattern measurement on the sole of a foot, integration with a robotic hand, and tactile-based robotic operations. Through these experiments, we validate the effectiveness of the proposed tactile sensor.

Abstract:
LiDAR scene generation is critical for mitigating real-world LiDAR data collection costs and enhancing the robustness of downstream perception tasks in autonomous driving. However, existing methods commonly struggle to capture geometric realism and global topological consistency. Recent LiDAR Diffusion Models (LiDMs) predominantly embed LiDAR points into the latent space for improved generation efficiency, which limits their ability to model detailed geometric structures interpretably and to preserve global topological consistency. To address these challenges, we propose TopoLiDM, a novel framework that integrates graph neural networks (GNNs) with diffusion models under topological regularization for high-fidelity LiDAR generation. Our approach first trains a topology-preserving VAE to extract latent graph representations by graph construction and multiple graph convolutional layers. Then we freeze the VAE and generate novel latent topological graphs through the latent diffusion models. We also introduce 0-dimensional persistent homology (PH) constraints, ensuring the generated LiDAR scenes adhere to real-world global topological structures. Extensive experiments on the KITTI-360 dataset demonstrate TopoLiDM’s superiority over state-of-the-art methods, achieving a 22.6% lower Fréchet Range Image Distance (FRID) and a 9.2% lower Minimum Matching Distance (MMD). Notably, our model also generates quickly, averaging 1.68 samples/s at inference, showcasing its scalability for real-world applications. We will release the related codes at https://github.com/IRMVLab/TopoLiDM.
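
For readers unfamiliar with 0-dimensional persistent homology, the sketch below computes it for a small point set under a distance-based (Vietoris-Rips style) filtration. Every point starts as its own connected component at scale 0, and a component "dies" when an edge first merges it into another, which is exactly what Kruskal's minimum-spanning-tree procedure tracks. This illustrates the concept only; how TopoLiDM turns it into a training constraint is not described in the abstract, and the function name is an assumption.

```python
import math
from itertools import combinations


def zero_dim_persistence(points):
    """Return the finite death scales of connected components (all births are 0)."""
    n = len(points)
    parent = list(range(n))

    def find(i):
        # Union-find root lookup with path halving.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    # Candidate edges sorted by length, i.e. by the scale at which they appear.
    edges = sorted(
        (math.dist(points[i], points[j]), i, j)
        for i, j in combinations(range(n), 2)
    )
    deaths = []
    for length, i, j in edges:
        root_i, root_j = find(i), find(j)
        if root_i != root_j:
            parent[root_i] = root_j  # two components merge, so one of them dies here
            deaths.append(length)
    return deaths  # n - 1 values; the last surviving component never dies


if __name__ == "__main__":
    print(zero_dim_persistence([(0.0, 0.0), (1.0, 0.0), (5.0, 0.0)]))
```

Large death values indicate well-separated clusters, so comparing these values between generated and real scans is one plausible way to express a global connectivity prior.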

Abstract:
Grasp-based manipulation tasks are fundamental to robots interacting with their environments, yet gripper state ambiguity significantly reduces the robustness of imitation learning policies for these tasks. Data-driven solutions face the challenge of high real-world data costs, while simulation data, despite its low costs, is limited by the sim-to-real gap. We identify the root cause of gripper state ambiguity as the lack of tactile feedback. To address this, we propose a novel approach employing pseudo-tactile as feedback, inspired by the idea of using a force-controlled gripper as a tactile sensor. This method enhances policy robustness without additional data collection and hardware involvement, while providing a noise-free binary gripper state observation for the policy and thus facilitating pure simulation learning to unleash the power of simulation. Experimental results across three real-world grasp-based tasks demonstrate the necessity, effectiveness, and efficiency of our approach. Videos are available on Project Page.

Abstract:
While large multimodal models (LMMs) have demonstrated strong performance across various Visual Question Answering (VQA) tasks, certain challenges require complex multi-step reasoning to reach accurate answers. One particularly challenging task is autonomous driving, which demands thorough cognitive processing before decisions can be made. In this domain, a sequential and interpretive understanding of visual cues is essential for effective perception, prediction, and planning. Nevertheless, common VQA benchmarks often focus on the accuracy of the final answer while overlooking the reasoning process that enables the generation of accurate responses. Moreover, existing methods lack a comprehensive framework for evaluating step-by-step reasoning in realistic driving scenarios. To address this gap, we propose DriveLMM-o1, a new dataset and benchmark specifically designed to advance step-wise visual reasoning for autonomous driving. Our benchmark features over 18k VQA examples in the training set and more than 4k in the test set, covering diverse questions on perception, prediction, and planning, each enriched with step-by-step reasoning to ensure logical inference in autonomous driving scenarios. We further introduce a large multimodal model that is fine-tuned on our reasoning dataset, demonstrating robust performance in complex driving scenarios. In addition, we benchmark various open-source and closed-source methods on our proposed dataset, systematically comparing their reasoning capabilities for autonomous driving tasks. Our model achieves a +7.49% gain in final answer accuracy, along with a 3.62% improvement in reasoning score over the previous best open-source model. Our framework, dataset, and model are available at https://github.com/ayesha-ishaq/DriveLMM-o1.

Abstract:
Digital twins are fundamental to the development of autonomous driving and embodied artificial intelligence. However, achieving high-granularity surface reconstruction and high-fidelity rendering remains a challenge. Gaussian splatting offers efficient photorealistic rendering but struggles with geometric inconsistencies due to fragmented primitives and sparse observational data in robotics applications. Existing regularization methods, which rely on render-derived constraints, often fail in complex environments. Moreover, effectively integrating sparse LiDAR data with Gaussian splatting remains challenging. We propose a unified LiDAR-visual system that synergizes Gaussian splatting with a neural signed distance field. The accurate LiDAR point clouds enable a trained neural signed distance field to offer a manifold geometry field. This motivates us to offer an SDF-based Gaussian initialization for physically grounded primitive placement and a comprehensive geometric regularization for geometrically consistent rendering and reconstruction. Experiments demonstrate superior reconstruction accuracy and rendering quality across diverse trajectories. To benefit the community, the codes are released at https://github.com/hku-mars/GS-SDF.

Abstract:
We introduce FruitNeRF++, a novel fruit-counting approach that combines contrastive learning with neural radiance fields to count fruits from unstructured input photographs of orchards. Our work is based on FruitNeRF [6], which employs a neural semantic field combined with a fruit-specific clustering approach. The requirement for adaptation for each fruit type limits the applicability of the method, and makes it difficult to use in practice. To lift this limitation, we design a shape-agnostic multi-fruit counting framework, that complements the RGB and semantic data with instance masks predicted by a vision foundation model. The masks are used to encode the identity of each fruit as instance embeddings into a neural instance field. By volumetrically sampling the neural fields, we extract a point cloud embedded with the instance features, which can be clustered in a fruit-agnostic manner to obtain the fruit count. We evaluate our approach using a synthetic dataset containing apples, plums, lemons, pears, peaches, and mangoes, as well as a real-world benchmark apple dataset. Our results demonstrate that FruitNeRF++ is easier to control and compares favorably to other state-of-the-art methods.

Abstract:
Recent advancements in 3D scene understanding have made significant strides in enabling interaction with scenes using open-vocabulary queries, particularly for VR/AR and robotic applications. Nevertheless, existing methods are hindered by rigid offline pipelines and the inability to provide precise 3D object-level understanding given open-ended queries. In this paper, we present OpenGS-Fusion, an innovative open-vocabulary dense mapping framework that improves semantic modeling and refines object-level understanding. OpenGS-Fusion combines 3D Gaussian representation with a Truncated Signed Distance Field to facilitate lossless fusion of semantic features on-the-fly. Furthermore, we introduce a novel multimodal language-guided approach named MLLM-Assisted Adaptive Thresholding, which refines the segmentation of 3D objects by adaptively adjusting similarity thresholds, achieving a 17% improvement in 3D mIoU compared to the fixed threshold strategy. Extensive experiments demonstrate that our method outperforms existing methods in 3D object understanding and scene reconstruction quality, and showcase its effectiveness in language-guided scene interaction. The code is available at https://young-bit.github.io/opengs-fusion.github.io/.

Abstract:
Relative localization is a crucial capability for multi-robot systems operating in GPS-denied environments. Existing approaches for multi-robot relative localization often depend on costly or short-range sensors like cameras and LiDARs. Consequently, these approaches face challenges such as high computational overhead (e.g., map merging) and difficulties in disjoint environments. To address this limitation, this paper introduces MGPRL, a novel distributed framework for multi-robot relative localization using convex-hull of multiple Wi-Fi access points (AP). To accomplish this, we employ co-regionalized multi-output Gaussian Processes for efficient Radio Signal Strength Indicator (RSSI) field prediction and perform uncertainty-aware multi-AP localization, which is further coupled with weighted convex hull-based alignment for robust relative pose estimation. Each robot predicts the RSSI field of the environment by an online scan of APs in its environment, which are utilized for position estimation of multiple APs. To perform relative localization, each robot aligns the convex hull of its predicted AP locations with that of the neighbor robots. This approach is well-suited for devices with limited computational resources and operates solely on widely available Wi-Fi RSSI measurements without necessitating any dedicated pre-calibration or offline fingerprinting. We rigorously evaluate the performance of the proposed MGPRL in ROS simulations and demonstrate it with real-world experiments, comparing it against multiple state-of-the-art approaches. The results showcase that MGPRL outperforms existing methods in terms of localization accuracy and computational efficiency.
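
The geometric primitive behind the alignment step is the convex hull of the estimated access-point positions. The sketch below computes a 2D hull with Andrew's monotone chain algorithm. It shows only the hull construction. The weighting and the hull-to-hull alignment used by MGPRL are not specified in the abstract and are not attempted here, and the coordinates in the demo are made up.

```python
def cross(o, a, b):
    """Z-component of (a - o) x (b - o); positive when o -> a -> b turns counter-clockwise."""
    return (a[0] - o[0]) * (b[1] - o[1]) - (a[1] - o[1]) * (b[0] - o[0])


def convex_hull(points):
    """Andrew's monotone chain: hull vertices in counter-clockwise order."""
    pts = sorted(set(points))
    if len(pts) <= 2:
        return pts

    def half_hull(sequence):
        chain = []
        for p in sequence:
            # Drop the last vertex while it would create a clockwise or collinear turn.
            while len(chain) >= 2 and cross(chain[-2], chain[-1], p) <= 0:
                chain.pop()
            chain.append(p)
        return chain

    lower = half_hull(pts)
    upper = half_hull(reversed(pts))
    # The last point of each half is the first point of the other, so omit it.
    return lower[:-1] + upper[:-1]


if __name__ == "__main__":
    # Hypothetical estimated AP positions in one robot's local frame (metres).
    ap_estimates = [(0, 0), (2, 0), (2, 2), (0, 2), (1, 1)]
    print(convex_hull(ap_estimates))
```

Because each robot only needs to exchange a handful of hull vertices rather than a map, this kind of representation fits the low-bandwidth, low-compute setting the abstract targets.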

Abstract:
Interpreting object-referential language and grounding objects in 3D with spatial relations and attributes is essential for robots operating alongside humans. However, this task is often challenging due to the diversity of scenes, large number of fine-grained objects, and complex free-form nature of language references. Furthermore, in the 3D domain, obtaining large amounts of natural language training data is difficult. Thus, it is important for methods to learn from little data and zero-shot generalize to new environments. To address these challenges, we propose SORT3D, an approach that utilizes rich object attributes from 2D data and merges a heuristics-based spatial reasoning toolbox with the ability of large language models (LLMs) to perform sequential reasoning. Importantly, our method does not require text-to-3D data for training and can be applied zero-shot to unseen environments. We show that SORT3D achieves state-of-the-art zero-shot performance on complex view-dependent grounding tasks on two benchmarks. We also implement the pipeline to run real-time on two autonomous vehicles and demonstrate that our approach can be used for object-goal navigation on previously unseen real-world environments. All source code for the system pipeline is publicly released.

Abstract:
We introduce ET-Former, a novel end-to-end algorithm for semantic scene completion using a single monocular camera. Our approach generates a semantic occupancy map from a single RGB observation while simultaneously providing uncertainty estimates for semantic predictions. By designing a triplane-based deformable attention mechanism, our approach improves geometric understanding of the scene over other SOTA approaches and reduces noise in semantic predictions. Additionally, through the use of a Conditional Variational AutoEncoder (CVAE), we estimate the uncertainties of these predictions. The generated semantic and uncertainty maps will help formulate navigation strategies that facilitate safe and permissible decision making in the future. Evaluated on the SemanticKITTI dataset, ET-Former achieves the highest Intersection over Union (IoU) and mean IoU (mIoU) scores while maintaining the lowest GPU memory usage, surpassing state-of-the-art (SOTA) methods. It improves the SOTA scores of IoU from 44.71 to 51.49 and mIoU from 15.04 to 16.30 on the SemanticKITTI test set, with a notably low training memory consumption of 10.9 GB, achieving at least a 25% reduction compared to previous methods. Project page: https://github.com/amazon-science/ET-Former.
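
The two metrics quoted above can be sketched as follows for flattened voxel labels. This is a generic definition, not the benchmark's official evaluation script: the function names and the toy labels are assumptions, and details such as ignored or unknown voxels are left out.

```python
def class_iou(pred, target, cls):
    """Intersection over union for one class; None if the class is absent from both."""
    intersection = sum(1 for p, t in zip(pred, target) if p == cls and t == cls)
    union = sum(1 for p, t in zip(pred, target) if p == cls or t == cls)
    return intersection / union if union else None


def mean_iou(pred, target, classes):
    """Average per-class IoU over the classes that appear in prediction or ground truth."""
    scores = [class_iou(pred, target, c) for c in classes]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores)


if __name__ == "__main__":
    # Toy flattened voxel labels (hypothetical class ids 0, 1, 2).
    predicted = [0, 1, 1, 2]
    ground_truth = [0, 1, 2, 2]
    print(mean_iou(predicted, ground_truth, classes=[0, 1, 2]))
```

In scene completion benchmarks, the plain IoU figure is typically computed on occupied-versus-empty voxels, while mIoU averages over semantic classes, which is why the two numbers in the abstract differ so much in magnitude.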

Abstract:
Image-event joint depth estimation methods leverage complementary modalities for robust perception, yet face challenges in generalizability stemming from two factors: 1) limited annotated image-event-depth datasets causing insufficient cross-modal supervision, and 2) inherent frequency mismatches between static images and dynamic event streams with distinct spatiotemporal patterns, leading to ineffective feature fusion. To address this dual challenge, we propose Frequency-decoupled Unified Self-supervised Encoder (FUSE) with two synergistic components: The Parameter-efficient Self-supervised Transfer (PST) leverages image foundation models for cross-modal knowledge transfer, effectively mitigating data scarcity by enabling joint encoding without depth ground truth. Complementing this, the Frequency-Decoupled Fusion module (FreDFuse) resolves modality-specific frequency mismatches by decoupling features into high- and low-frequency bands and then performing a guided cross-attention fusion, where the modality dominant in each band steers the integration. This combined approach enables FUSE to construct a universal image-event encoder that only requires lightweight decoder adaptation for target datasets. Extensive experiments demonstrate state-of-the-art performance with 14% and 24.9% improvements in Abs.Rel on MVSEC and DENSE datasets. The framework exhibits remarkable robustness and generalization in challenging scenarios, including extreme lighting and motion blur, significantly advancing its real-world deployment capabilities. The source code for our method is publicly available at: https://github.com/sunpihai-up/FUSE.
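
The idea of decoupling a signal into low- and high-frequency bands can be shown on a 1D sequence: a moving average serves as the low-pass component and the residual as the high-pass component. This is only a toy stand-in. The abstract does not say how FreDFuse performs the split, it operates on learned feature maps rather than raw signals, and the window size and function names here are assumptions.

```python
def moving_average(signal, window=3):
    """Low-pass filter: mean over a centred window, truncated at the borders."""
    half = window // 2
    smoothed = []
    for i in range(len(signal)):
        start, stop = max(0, i - half), min(len(signal), i + half + 1)
        smoothed.append(sum(signal[start:stop]) / (stop - start))
    return smoothed


def split_bands(signal, window=3):
    """Return (low, high) such that low[i] + high[i] reconstructs signal[i]."""
    low = moving_average(signal, window)
    high = [s - l for s, l in zip(signal, low)]
    return low, high


if __name__ == "__main__":
    # A flat signal with one sharp spike: the spike lands mostly in the high band.
    low_band, high_band = split_bands([1.0, 1.0, 1.0, 10.0, 1.0, 1.0, 1.0])
    print(low_band)
    print(high_band)
```

Slowly varying content (closer to what a static image carries) stays in the low band, while abrupt changes (closer to what event streams capture) move to the high band, which matches the motivation given for letting a different modality steer fusion in each band.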

Abstract:
In this paper, we introduce Haptic-Informed ACT, an advanced robotic system for pseudo oocyte manipulation, integrating multimodal information and Action Chunking with Transformers (ACT). Traditional automation methods for oocyte transfer rely heavily on visual perception, often requiring human supervision due to biological variability and environmental disturbances. Haptic-Informed ACT enhances ACT by incorporating haptic feedback, enabling real-time grasp failure detection and adaptive correction. Additionally, we introduce a 3D-printed TPU soft gripper to facilitate delicate manipulations. Experimental results demonstrate that Haptic-Informed ACT improves the task success rate, robustness, and adaptability compared to conventional ACT, particularly in dynamic environments. These findings highlight the potential of multimodal learning in robotics for biomedical automation.

Abstract:
In order to expand the operational range and payload capacity of robots, wire-driven robots that leverage the external environment have been proposed. They can exert forces and operate in spaces far beyond those dictated by their own structural limits. However, for practical use, robots must autonomously attach multiple wires to the environment based on environmental recognition, an operation so difficult that many wire-driven robots remain restricted to specialized, pre-designed environments. In this study, we propose a robot that autonomously connects multiple wires to the environment by employing a system of multiple small flying anchors, as well as an RGB-D camera-based control and environmental recognition method. Each flying anchor is a drone with an anchoring mechanism at the wire tip, allowing the robot to attach wires by flying into position. Using the robot’s RGB-D camera to identify suitable attachment points and a flying anchor position, the system can connect wires in environments that are not specially prepared, and can also attach multiple wires simultaneously. Through this approach, a wire-driven robot can autonomously attach its wires to the environment, thereby realizing the benefits of wire-driven operation at any location.

Abstract:
The social robot’s open API allows users to customize open-domain interactions. However, it remains inaccessible to those without programming experience. We introduce AutoMisty, the first LLM-powered multi-agent framework that converts natural-language commands into executable Misty robot code by decomposing high-level instructions, generating sub-task code, and integrating everything into a deployable program. Each agent employs a two-layer optimization mechanism: first, a self-reflective loop that instantly validates and automatically executes the generated code, regenerating whenever errors emerge; second, human review for refinement and final approval, ensuring alignment with user preferences and preventing error propagation. To evaluate AutoMisty’s effectiveness, we designed a benchmark task set spanning four levels of complexity and conducted experiments in a real Misty robot environment. Extensive evaluations demonstrate that AutoMisty not only consistently generates high-quality code but also enables precise code control, significantly outperforming direct reasoning with ChatGPT-4o and ChatGPT-o1. All code, optimized APIs, and experimental videos will be publicly released through the webpage: AutoMisty.

Abstract:
Navigation is a fundamental capacity for mobile robots, enabling them to operate autonomously in complex and dynamic environments. Conventional approaches use probabilistic models to localize robots and build maps simultaneously using sensor observations. Recent approaches employ human-inspired learning, such as imitation and reinforcement learning, to navigate robots more effectively. However, these methods suffer from high computational costs, global map inconsistency, and poor generalization to unseen environments. This paper presents a novel method inspired by how humans perceive and navigate effectively in novel environments. Specifically, we first build local frames that mimic how humans represent essential spatial information in the short term. Points in local frames are hybrid representations, including spatial information and learned features, so-called spatial-implicit local frames. Then, we integrate spatial-implicit local frames into the global topological map represented as a factor graph. Lastly, we develop a novel navigation algorithm based on Rapidly-exploring Random Tree Star (RRT*) that leverages spatial-implicit local frames and the topological map to navigate effectively in environments. To validate our approach, we conduct extensive experiments in real-world datasets and in-lab environments. We open our source code at https://github.com/tuantdang/simn.

Abstract:
Grasping is fundamental to robotic manipulation, and recent advances in large-scale grasping datasets have provided essential training data and evaluation benchmarks, accelerating the development of learning-based methods for robust object grasping. However, most existing datasets exclude deformable bodies due to the lack of scalable, robust simulation pipelines, limiting the development of generalizable models for compliant grippers and soft manipulands. To address these challenges, we present GRIP, a General Robotic Incremental Potential contact simulation dataset for universal grasping. GRIP leverages an optimized Incremental Potential Contact (IPC)-based simulator for multi-environment data generation, achieving up to 48× speedup while ensuring efficient, intersection- and inversion-free simulations for compliant grippers and deformable objects. Our fully automated pipeline generates and evaluates diverse grasp interactions across 1,200 objects and 100,000 grasp poses, incorporating both soft and rigid grippers. The GRIP dataset enables applications such as neural grasp generation and stress field prediction. We release GRIP to advance research in robotic manipulation, soft-gripper control, and physics-driven simulation at: https://bell0o.github.io/GRIP/.

Abstract:
This paper presents GeoFlow-SLAM, a robust and effective Tightly-Coupled RGBD-Inertial and Legged Odometry Fusion SLAM for legged robotics undergoing aggressive and high-frequency motions. By integrating geometric consistency, legged odometry constraints, and dual-stream optical flow (GeoFlow), our method addresses three critical challenges: feature matching and pose initialization failures during fast locomotion and visual feature scarcity in texture-less scenes. Specifically, in rapid motion scenarios, feature matching is notably enhanced by leveraging dual-stream optical flow, which combines prior map points and poses. Additionally, we propose a robust pose initialization method for fast locomotion and IMU error in legged robots, integrating IMU/Legged odometry, inter-frame Perspective-n-Point (PnP), and Generalized Iterative Closest Point (GICP). Furthermore, a novel optimization framework that tightly couples depth-to-map and GICP geometric constraints is first introduced to improve the robustness and accuracy in long-duration, visually texture-less environments. The proposed algorithms achieve state-of-the-art (SOTA) on collected legged robots and open-source datasets. To further promote research and development, the open-source datasets and code will be made publicly available at https://github.com/HorizonRobotics/GeoFlowSlam.

Abstract:
In recent years, advancements in hardware have enabled quadruped robots to operate with high power and speed, while robust locomotion control using reinforcement learning (RL) has also been realized. As a result, expectations are rising for the automation of tasks such as material transport and exploration in unknown environments. However, autonomous locomotion in rough terrains with significant height variations requires vertical movement, and robots capable of performing such movements stably, along with their control methods, have not yet been fully established. In this study, we developed the quadruped robot KLEIYN, which features a waist joint, and aimed to expand quadruped locomotion by enabling chimney climbing through RL. To facilitate the learning of vertical motion, we introduced Contact-Guided Curriculum Learning (CGCL). As a result, KLEIYN successfully climbed walls ranging from 800 mm to 1000 mm in width at an average speed of 150 mm/s, 50 times faster than conventional robots. Furthermore, we demonstrated that the introduction of a waist joint improves climbing performance, particularly enhancing tracking ability on narrow walls.

Abstract:
Online coordination of multi-robot systems in open and unknown environments faces significant challenges, particularly when semantic features detected during operation dynamically trigger new tasks. Recent large language model (LLM)-based approaches for scene reasoning and planning primarily focus on one-shot, end-to-end solutions in known environments, lacking both dynamic adaptation capabilities for online operation and explainability in the processes of planning. To address these issues, we propose DEXTER-LLM, a novel framework for dynamic task planning in unknown environments that integrates four modules: (i) a mission comprehension module that resolves partial ordering of tasks specified by natural language or linear temporal logic (LTL) formulas; (ii) an online subtask generator based on LLMs that improves the accuracy and explainability of task decomposition via multi-stage reasoning; (iii) an optimal subtask assigner and scheduler that allocates subtasks to robots via search-based optimization; and (iv) a dynamic adaptation and human-in-the-loop verification module that implements multi-rate, event-based updates for both subtasks and their assignments, to cope with new features and tasks detected online. The framework effectively combines LLMs’ open-world reasoning capabilities with the optimality of model-based assignment methods, simultaneously addressing the critical issue of online adaptability and explainability. Experimental evaluations demonstrate exceptional performance, with 100% success rates across all scenarios, 160 tasks and 480 subtasks completed on average (3 times the baselines), 62% fewer queries to LLMs during adaptation, and superior plan quality (2 times higher) for compound tasks. Project page at https://tcxm.github.io/DEXTER-LLM/.
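
Resolving a partial ordering of tasks into an executable sequence, as module (i) is described as doing, can be illustrated with a topological sort. The sketch uses Python's standard `graphlib` module. The task names and the precedence relation are hypothetical, and the abstract does not say which algorithm DEXTER-LLM itself uses.

```python
from graphlib import CycleError, TopologicalSorter


def order_tasks(precedence):
    """Return one valid execution order.

    `precedence` maps each task to the set of tasks that must finish before it.
    Raises ValueError if the constraints are contradictory (contain a cycle).
    """
    try:
        return list(TopologicalSorter(precedence).static_order())
    except CycleError as error:
        raise ValueError(f"precedence constraints contain a cycle: {error.args[1]}") from error


if __name__ == "__main__":
    # Hypothetical mission: each task lists its prerequisites.
    mission = {
        "deliver_sample": {"pick_up_sample"},
        "pick_up_sample": {"locate_sample"},
        "locate_sample": set(),
    }
    print(order_tasks(mission))
```

A partial order usually admits many valid sequences. Tasks with no mutual constraint can be handed to different robots in parallel, which is where an assignment and scheduling module such as (iii) would take over.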

Abstract:
We present a novel convex formulation that weakly couples the Material Point Method (MPM) with rigid body dynamics through frictional contact, optimized for efficient GPU parallelization. Our approach features an asynchronous time-splitting scheme to integrate MPM and rigid body dynamics under different time step sizes. We develop a globally convergent quasi-Newton solver tailored for massive parallelization, achieving up to 500× speedup over previous convex formulations without sacrificing stability. Our method enables interactive-rate simulations of robotic manipulation tasks with diverse deformable objects including granular materials and cloth, with strong convergence guarantees. We detail key implementation strategies to maximize performance and validate our approach through rigorous experiments, demonstrating superior speed, accuracy, and stability compared to state-of-the-art MPM simulators for robotics. We make our method available in the open-source robotics toolkit, Drake.

Abstract:
The perception capability of robotic systems relies on the richness of the dataset. Although Segment Anything Model 2 (SAM2), trained on large datasets, demonstrates strong potential in perception tasks, its inherent training paradigm prevents it from being suitable for RGB-T tasks. To address these challenges, we propose SHIFNet, a novel SAM2-driven Hybrid Interaction Paradigm that unlocks the potential of SAM2 with linguistic guidance for efficient RGB-Thermal perception. Our framework consists of two key components: (1) a Semantic-Aware Cross-modal Fusion (SACF) module that dynamically balances modality contributions through text-guided affinity learning, overcoming SAM2’s inherent RGB bias; (2) a Heterogeneous Prompting Decoder (HPD) that enhances global semantic information through a semantic enhancement module and then combines it with category embeddings to amplify cross-modal semantic consistency. With 32.27M trainable parameters, SHIFNet achieves state-of-the-art segmentation performance on public benchmarks, reaching 89.8% on PST900 and 67.8% on FMB, respectively. The framework facilitates the adaptation of pre-trained large models to RGB-T segmentation tasks, effectively mitigating the high costs associated with data collection while endowing robotic systems with comprehensive perception capabilities. The source code will be made publicly available at https://github.com/iAsakiT3T/SHIFNet.

Abstract:
Due to the significant effort required for data collection and annotation in 3D perception tasks, mixed sample data augmentation (MSDA) has been widely studied to generate diverse training samples by mixing existing data. Among these methods, MixUp is a prominent approach that generates new samples by linearly combining two existing ones, using a mix ratio sampled from a Beta distribution. This simple yet powerful method has inspired numerous variations and applications in 2D and 3D data domains. Recently, many MSDA techniques have been developed for point clouds, but they mainly target LiDAR data, leaving their application to radar point clouds largely unexplored. In this paper, we examine the feasibility of applying existing MSDA methods to radar point clouds and identify several challenges in adapting these techniques. These obstacles stem from the radar’s irregular angular distribution, deviations from a single-sensor polar layout in multi-radar setups, and point sparsity. To address these issues, we propose Class-Aware PillarMix (CAPMix), a novel MSDA approach that applies MixUp at the pillar level in 3D point clouds, guided by class labels. Unlike methods that apply a single mix ratio to the entire sample, CAPMix assigns an independent ratio to each pillar, boosting sample diversity. To account for the density of different classes, we use class-specific distributions: for dense objects (e.g., large vehicles), we skew ratios to favor points from another sample, while for sparse objects (e.g., pedestrians), we sample more points from the original. This class-aware mixing retains critical details and enriches each sample with new information, ultimately generating more diverse training data. Experimental results demonstrate that our method not only significantly boosts performance but also outperforms existing MSDA approaches across two datasets (Bosch Street and K-Radar). We believe that this straightforward yet effective approach will spark further investigation into MSDA techniques for radar data.
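
The per-pillar, class-aware mixing described above can be illustrated with a minimal sketch. It is one plausible reading of the abstract, not the authors' implementation: the Beta parameters in `CLASS_BETA`, the class names, the pillar size, and the rule of keeping each point with a probability equal to the pillar's ratio are all assumptions made for illustration.

```python
import random
from collections import defaultdict

# Hypothetical class-specific Beta parameters (alpha, beta); the abstract gives no values.
# The sampled ratio is the probability of keeping a point from the original sample.
CLASS_BETA = {
    "large_vehicle": (2.0, 5.0),  # dense class: ratio skewed low, favours the other sample
    "pedestrian": (5.0, 2.0),     # sparse class: ratio skewed high, favours the original
    "background": (2.0, 2.0),     # symmetric default
}


def pillar_index(point, pillar_size):
    """Map a point (x, y, z) to the index of the vertical pillar that contains it."""
    return (int(point[0] // pillar_size), int(point[1] // pillar_size))


def group_by_pillar(points, pillar_size):
    """Group points into a dict keyed by pillar index."""
    pillars = defaultdict(list)
    for p in points:
        pillars[pillar_index(p, pillar_size)].append(p)
    return pillars


def pillar_mix(points_a, points_b, pillar_class, pillar_size=1.0, rng=None):
    """Mix two point clouds pillar by pillar, drawing an independent ratio per pillar.

    points_a is the original sample and points_b the other sample.
    pillar_class maps a pillar index to a class name (assumed to come from the
    labels of the original sample); unknown pillars default to "background".
    """
    rng = rng or random.Random()
    pillars_a = group_by_pillar(points_a, pillar_size)
    pillars_b = group_by_pillar(points_b, pillar_size)
    mixed = []
    for idx in sorted(set(pillars_a) | set(pillars_b)):
        alpha, beta = CLASS_BETA[pillar_class.get(idx, "background")]
        ratio = rng.betavariate(alpha, beta)  # independent ratio for this pillar
        mixed += [p for p in pillars_a.get(idx, []) if rng.random() < ratio]
        mixed += [p for p in pillars_b.get(idx, []) if rng.random() >= ratio]
    return mixed
```

Passing a seeded `random.Random` makes the augmentation reproducible, which helps when comparing it against a whole-sample MixUp baseline.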

Abstract:
Dynamic jumping on high platforms and over gaps differentiates legged robots from wheeled counterparts. Compared to walking on rough terrains, dynamic locomotion on abrupt surfaces requires fusing proprioceptive and exteroceptive perception for explosive movements. In this paper, we propose SF-TIM (Simple Framework combining Terrain Imagination and Measurement), a single-policy method that enhances the jumping agility of quadrupedal robots while preserving their fundamental blind walking capabilities. In addition, we introduce a terrain-guided reward design specifically to assist quadrupedal robots in high jumping, improving their performance in this task. To narrow the simulation-to-reality gap in quadrupedal robot learning, we introduce a stable and high-speed elevation map generation framework, enabling zero-shot simulation-to-reality transfer of locomotion ability. Our algorithm has been deployed and validated on both small- and large-size quadrupedal robots, demonstrating its effectiveness in real-world applications: the robot has successfully traversed various high platforms and gaps, showing the robustness of our proposed approach. A demo video has been made available at https://flysoaryun.github.io/SF-TIM.

Abstract:
We introduce J-ORA, a novel multimodal dataset that bridges the gap in robot perception by providing detailed object attribute annotations within Japanese human-robot dialogue scenarios. J-ORA is designed to support three critical perception tasks (object identification, reference resolution, and next-action prediction) by leveraging a comprehensive template of attributes (e.g., category, color, shape, size, material, and spatial relations). Extensive evaluations with both proprietary and open-source Vision Language Models (VLMs) reveal that incorporating detailed object attributes substantially improves multimodal perception performance compared with omitting them. Despite the improvement, we find that there still exists a gap between proprietary and open-source VLMs. In addition, our analysis of object affordances demonstrates varying abilities in understanding object functionality and contextual relationships across different VLMs. These findings underscore the importance of rich, context-sensitive attribute annotations in advancing robot perception in dynamic environments. Code and data are available at https://github.com/jatuhurrra/J-ORA.

Abstract:
3D Gaussian Splatting (3DGS) has emerged as a key rendering pipeline for digital asset creation due to its balance between efficiency and visual quality. To address the issues of unstable pose estimation and scene representation distortion caused by geometric texture inconsistency in large outdoor scenes with weak or repetitive textures, we approach the problem from two aspects: pose estimation and scene representation. For pose estimation, we leverage LiDAR-IMU Odometry to provide prior poses for cameras in large-scale environments. These prior pose constraints are incorporated into COLMAP’s triangulation process, with pose optimization performed via bundle adjustment. Ensuring consistency between pixel data association and prior poses helps maintain both robustness and accuracy. For scene representation, we introduce normal vector constraints and effective rank regularization to enforce consistency in the direction and shape of Gaussian primitives. These constraints are jointly optimized with the existing photometric loss to enhance the map quality. We evaluate our approach using both public and self-collected datasets. In terms of pose optimization, our method requires only one-third of the time while maintaining accuracy and robustness across both datasets. In terms of scene representation, the results show that our method significantly outperforms conventional 3DGS pipelines. Notably, on self-collected datasets characterized by weak or repetitive textures, our approach demonstrates enhanced visualization capabilities and achieves superior overall performance. Codes and data will be publicly available at https://github.com/justinyeah/normaljshape.git.

Abstract:
3D Gaussian Splats (3DGSs) are 3D object models derived from multi-view images. Such “digital twins” are useful for simulations, virtual reality, E-commerce, robot policy fine-tuning, and part inspection. 3D object scanning usually requires multi-camera arrays, precise laser scanners, or robot wrist-mounted cameras, which have restricted workspaces. We propose Omni-Scan, a pipeline for producing high-quality 3D Gaussian Splat models using a bi-manual robot that grasps an object with one gripper and rotates the object with respect to one stationary camera. The object is then re-grasped by a second gripper to expose surfaces that were occluded by the first gripper. We present the Omni-Scan robot pipeline using DepthAnything, Segment Anything, as well as RAFT optical flow models to identify and isolate objects held by a robot gripper while removing the gripper and the background. We then modify the 3DGS training pipeline to support concatenated datasets with gripper occlusion, producing an omni-directional (360°) model of the object. We apply Omni-Scan to part defect inspection, finding that it can identify visual or geometric defects in 12 different industrial and household objects with an average accuracy of 83.3%. More details and interactive videos of Omni-Scan 3DGS models can be found at https://berkeleyautomation.github.io/omni-scan/.

Abstract:
Dynamic manipulation, such as robot tossing or throwing objects, has recently gained attention as a novel paradigm to speed up logistic operations. However, the focus has predominantly been on the object's landing location, irrespective of its final orientation. In this work, we present a method enabling a robot to accurately "throw-flip" objects to a desired landing pose (position and orientation). Conventionally, objects thrown by revolute robots suffer from parasitic rotation, resulting in highly restricted and uncontrollable landing poses. Our approach is based on two key design choices: first, leveraging the impulse-momentum principle, we design a family of throwing motions that effectively decouple the parasitic rotation, significantly expanding the feasible set of landing poses. Second, we combine a physics-based model of free flight with regression-based learning methods to account for unmodeled effects. Real robot experiments demonstrate that our framework can learn to throw-flip objects to a pose target within (±5 cm, ±45 degree) threshold in dozens of trials. Thanks to data assimilation, incorporating projectile dynamics reduces sample complexity by an average of 40% when throw-flipping to unseen poses compared to end-to-end learning methods. Additionally, we show that past knowledge on in-hand object spinning can be effectively reused, accelerating learning by 70% when throwing a new object with a Center of Mass (CoM) shift. A video summarizing the proposed method and the hardware experiments is available at https://youtu.be/txYc9b1oflU.
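
The physics-based free-flight component mentioned above can be sketched for the planar, drag-free case. This is a textbook projectile model written as an illustration, not the authors' model: the function name, the planar simplification, the constant spin rate during flight, and the flat landing surface at height zero are assumptions, and the learned regression correction for unmodeled effects is not included.

```python
import math


def free_flight_landing(x0, z0, vx, vz, theta0, omega, g=9.81):
    """Predict the landing pose of an object released in planar, drag-free flight.

    x0, z0: release position in metres (z0 >= 0 is the height above the landing surface).
    vx, vz: release velocity in m/s.
    theta0: release orientation in radians; omega: constant angular velocity in rad/s.
    Returns (landing x, landing orientation, flight time).
    """
    # Positive root of z0 + vz * t - 0.5 * g * t**2 = 0.
    t = (vz + math.sqrt(vz * vz + 2.0 * g * z0)) / g
    # With no torque in flight, the orientation advances linearly with time.
    return x0 + vx * t, theta0 + omega * t, t
```

For example, a release from the surface with `vz = 4.905` m/s gives a flight time of one second, so an object spinning at `math.pi` rad/s lands after half a turn.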

Abstract:
Learning-based Visual Odometry (VO) has seen significant advancements over the past decades. However, all the existing methods rely on the six degrees of freedom (6-DoF) representation for pose prediction, which is sparse and less conducive to neural network learning. In this work, we introduce a novel dense and distributed representation by modeling VO as ray bundles, referred to as RayVO. This richly parameterized representation is tightly coupled with corresponding spatial features, making it highly effective for neural learning. Additionally, the ray-based approach enables simultaneous prediction of both intrinsic and extrinsic parameters. To prove its effectiveness against the traditional 6-DoF representation, we propose three specialized loss functions for training with rays: a ray-based loss, a 6-DoF-based loss, and a hybrid loss. We extensively evaluate RayVO on both indoor and outdoor benchmark datasets and show that it outperforms the state-of-the-art VO methods.

Abstract:
MAV-capturing-MAV (MCM) is one of the few effective methods for physically countering misused or malicious MAVs. This paper presents a vision-based cooperative MCM system, where multiple pursuer MAVs equipped with onboard vision systems detect, localize, and pursue a target MAV. To enhance robustness, a distributed state estimation and control framework enables the pursuer MAVs to autonomously coordinate their actions. Pursuer trajectories are optimized using Model Predictive Control (MPC) and executed via a low-level SO(3) controller, ensuring smooth and stable pursuit. Once the capture conditions are satisfied, the pursuer MAVs automatically deploy a flying net to intercept the target. These capture conditions are determined based on the predicted motion of the net. To enable real-time decision-making, we propose a lightweight computational method to approximate the net’s motion, avoiding the prohibitive cost of solving the full net dynamics. The effectiveness of the proposed system is validated through simulations and real-world experiments. In real-world tests, our approach successfully captures a moving target traveling at 4 m/s with an acceleration of 1 m/s², achieving a success rate of 64.7%.

Abstract:
In this paper, we propose a collaborative centralized 3D mapping and localization framework that harnesses the capabilities of both SLAM (Simultaneous Localization And Mapping) and XR (eXtended Reality). On one hand, our framework allows for integrating local maps generated by a multitude of heterogeneous agents (e.g. robots) into a unified map. On the other hand, it allows human intervention at multiple levels: first, humans can inspect and intervene in the mapping process in situ to produce 3D maps, overlay virtual assets, and add annotations, all of which can contribute towards enhanced autonomy and navigation. Second, beyond the mapping aspect, a human can also intervene in the localization task of any collaborating robot by inspecting and correcting its generated paths, and, if necessary, enforcing a desired trajectory. Experiments inside two real settings demonstrated the superiority of the proposed system.

Abstract:
Depth from focus (DFF) is a well-established method for measuring depth in vision systems. However, its efficacy and accuracy are limited by the slow speed required to capture high-quality focal stacks. We address this limitation by leveraging emerging hardware technologies: the event camera and liquid lens. In this paper, we introduce an innovative approach called Event-Based Depth from Focus (EDFF). We present a prototype system and propose Event Cancellation Score (ECS) as a novel metric to efficiently detect event data focus. To validate the effectiveness of our system, we have curated the first EDFF dataset, which comprises event recordings of focal sweeps performed on 3D-printed test targets. Comparative analysis against existing event focus detection algorithms demonstrates the superior performance of our algorithm in the EDFF task.

Abstract:
Transformer-based sequence models have proven effective in offline reinforcement learning for modeling agent trajectories using large-scale datasets. However, applying these models directly to multi-agent offline reinforcement learning introduces additional challenges, especially in managing complex inter-agent dynamics that arise as multiple agents interact with both their environment and each other. To overcome these issues, we propose the context-aware multi-agent trajectory transformer (COMAT), a novel model designed for offline multi-agent reinforcement learning tasks which predicts the future trajectory of each agent by incorporating the history of adjacent agents—referred to as context—into its sequence modeling. COMAT consists of three key modules: the transformer module to process input trajectories, the context encoder to extract relevant information from adjacent agents’ histories, and the context aggregator to integrate this information into the agent’s trajectory prediction process. Built upon these modules, COMAT predicts the agents’ future trajectories and actively leverages this capability as a tool for planning, enabling the search for optimal actions in multi-agent environments. We evaluate COMAT on multi-agent MuJoCo and StarCraft Multi-Agent Challenge tasks, on which it demonstrates superior performance compared to existing baselines.

Abstract:
LiDAR-Inertial Odometry (LIO) is widely used for autonomous navigation, but its deployment on Size, Weight, and Power (SWaP)-constrained platforms remains challenging due to the computational cost of processing dense point clouds. Conventional LIO frameworks rely on a single onboard processor, leading to computational bottlenecks and high memory demands, making real-time execution difficult on embedded systems. To address this, we propose QLIO, a multi-processor distributed quantized LIO framework that reduces computational load and bandwidth consumption while maintaining localization accuracy. QLIO introduces a quantized state estimation pipeline, where a co-processor pre-processes LiDAR measurements, compressing point-to-plane residuals before transmitting only essential features to the host processor. Additionally, an rQ-vector-based adaptive resampling strategy intelligently selects and compresses key observations, further reducing computational redundancy. Real-world evaluations demonstrate that QLIO achieves a 14.1× reduction in per-observation residual data while preserving localization accuracy. Furthermore, we release an open-source implementation to facilitate further research and real-world deployment. These results establish QLIO as an efficient and scalable solution for real-time autonomous systems operating under computational and bandwidth constraints.

Abstract:
A main bottleneck of learning-based robotic scene understanding methods is the heavy reliance on extensive annotated training data, which often limits their generalization ability. In LiDAR panoptic segmentation, this challenge becomes even more pronounced due to the need to simultaneously address both semantic and instance segmentation from complex, high-dimensional point cloud data. In this work, we address the challenge of LiDAR panoptic segmentation with very few labeled samples by leveraging recent advances in label-efficient vision panoptic segmentation. To this end, we propose a novel method, Limited-Label LiDAR Panoptic Segmentation (L3PS), which requires only a minimal amount of labeled data. Our approach first utilizes a label-efficient 2D network to generate panoptic pseudo-labels from a small set of annotated images, which are subsequently projected onto point clouds. We then introduce a novel 3D refinement module that capitalizes on the geometric properties of point clouds. By incorporating clustering techniques, sequential scan accumulation, and ground point separation, this module significantly enhances the accuracy of the pseudo-labels, improving segmentation quality by up to +10.6 PQ and +7.9 mIoU. We demonstrate that these refined pseudo-labels can be used to effectively train off-the-shelf LiDAR segmentation networks. Through extensive experiments, we show that L3PS not only outperforms existing methods but also substantially reduces the annotation burden. We release the code of our work at https://l3ps.cs.uni-freiburg.de.
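
The step of projecting 2D panoptic pseudo-labels onto point clouds can be sketched with a pinhole camera model. This is a minimal illustration, not the L3PS code: it assumes the points are already expressed in the camera frame (z pointing forward), that the intrinsics `fx`, `fy`, `cx`, `cy` are known, and that a point takes the label of the pixel it falls into; the function name and the `unlabeled` value are hypothetical.

```python
import math


def project_labels(points_cam, label_image, fx, fy, cx, cy, unlabeled=-1):
    """Assign each 3D point the label of the image pixel it projects to.

    points_cam: list of (x, y, z) points in the camera frame, z pointing forward.
    label_image: 2D list indexed as label_image[row][col] holding per-pixel labels.
    Points behind the camera or outside the image receive `unlabeled`.
    """
    height = len(label_image)
    width = len(label_image[0])
    labels = []
    for x, y, z in points_cam:
        if z <= 0.0:
            labels.append(unlabeled)  # behind the image plane, not visible
            continue
        u = math.floor(fx * x / z + cx)  # pixel column
        v = math.floor(fy * y / z + cy)  # pixel row
        if 0 <= u < width and 0 <= v < height:
            labels.append(label_image[v][u])
        else:
            labels.append(unlabeled)
    return labels
```

Labels transferred this way are noisy near object boundaries and occlusions, which is the kind of error a geometric refinement stage such as the one described in the abstract is meant to reduce.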

Abstract:
Static balancing is essential for effective control of robots. The typical approach to obtain the gravity compensation terms is to construct an ad-hoc trigonometric model of the robot. This paper proposes a dual-quaternion representation of centroids that enables streamlined calculation of gravity compensation terms on rigid robots with arbitrary kinematic trees and arbitrary base orientations, valid for revolute and prismatic joints in kinematic chains with rigid links. The method was successfully tested on a six degrees-of-freedom manipulator model and evaluated against third-party software.

Abstract:
Modular robotic structures simplify robot design and manufacturing by using standardized modules, enhancing flexibility and adaptability. However, the need for manual input in design and assembly limits their potential. Current methods to automate this process still require significant human effort and technical expertise. This paper introduces a novel approach that employs Large Language Models (LLMs) as intelligent agents to automate the creation of modular robotic structures. We decompose the modular robot creation task and develop two LLM-based agents to plan and assemble the modular robots from text prompts. By inputting a textual description, users can generate robot designs that are validated in both simulated and real-world environments. This method reduces the need for manual intervention and lowers the technical barrier to creating complex robotic systems.

Abstract:
Driver gaze monitoring is crucial for road safety, but existing methods rely on expensive, cumbersome technologies like wearable eye trackers or fixed-camera setups. To address this, we propose a low-cost approach using dashcams to capture driver gaze data. We introduce DashGaze, a large-scale dataset for training appearance-based gaze estimation models, featuring over 900,000 frames collected over 10 hours with 28 unique drivers. DashGaze includes synchronized views of the road, driver, and driver’s egocentric perspective, along with the driver’s gaze in both the driver and ego views. We also present DashGazeNet, a baseline model that generalizes well to unseen drivers and diverse conditions, achieving gaze angle errors within 8.5° and gaze location errors within 225 pixels. Our code and data are available at https://github.com/ThrupthiAnn/DashGaze.

Affiliations: School of Electronic and Information Engineering, South China Normal University, Foshan, China; College of Electronics & Information Engineering, Shanghai Research Institute for Intelligent Autonomous Systems, the State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Tongji University, Shanghai, China; College of Big Data and Internet, Shenzhen Technology University, Shenzhen, China

Abstract:
Motion Object Segmentation (MOS) is crucial for autonomous driving, as it enhances localization, path planning, map construction, scene flow estimation, and future state prediction. While existing methods achieve strong performance, balancing accuracy and real-time inference remains a challenge. To address this, we propose a logits-based knowledge distillation framework for MOS, aiming to improve accuracy while maintaining real-time efficiency. Specifically, we adopt a Bird’s Eye View (BEV) projection-based model as the student and a non-projection model as the teacher. To handle the severe imbalance between moving and non-moving classes, we decouple them and apply tailored distillation strategies, allowing the teacher model to better learn key motion-related features. This approach significantly reduces false positives and false negatives. Additionally, we introduce dynamic upsampling, optimize the network architecture, and achieve a 7.69% reduction in parameter count, mitigating overfitting. Our method achieves a notable IoU of 78.8% on the hidden test set of the SemanticKITTI-MOS dataset and delivers competitive results on the Apollo dataset. The KDMOS implementation is available at https://github.com/SCNU-RISLAB/KDMOS.

Affiliations: Zhejiang University, Hangzhou, China; Huawei Cloud Computing Technologies Co., Ltd., Shenzhen, China; State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, College of Electronics & Information Engineering, Shanghai Institute of Intelligent Science and Technology, Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, Shanghai, China

Abstract:
Open-vocabulary panoptic reconstruction is a challenging task for simultaneous scene reconstruction and understanding. Recently, methods have been proposed for 3D scene understanding based on Gaussian splatting. However, these methods are multi-staged, suffering from accumulated errors and a dependence on hand-designed components. To streamline the pipeline and achieve global optimization, we propose PanopticSplatting, an end-to-end system for open-vocabulary panoptic reconstruction. Our method introduces query-guided Gaussian segmentation with local cross attention, lifting 2D instance masks without cross-frame association in an end-to-end way. The local cross attention within the view frustum effectively reduces the training memory, making our model more accessible to large scenes with more Gaussians and objects. In addition, to address the challenge of noisy labels in 2D pseudo masks, we propose label blending to promote consistent 3D segmentation with fewer noisy floaters, as well as label warping on 2D predictions, which enhances multi-view coherence and segmentation accuracy. Our method demonstrates strong performance in 3D scene panoptic reconstruction on the ScanNet-V2 and ScanNet++ datasets, compared with both NeRF-based and Gaussian-based panoptic reconstruction methods. Moreover, PanopticSplatting can be easily generalized to numerous variants of Gaussian splatting, and we demonstrate its robustness on different Gaussian base models.

Abstract:
Goal-conditioned policies, such as those learned via imitation learning, provide an easy way for humans to influence what tasks robots accomplish. However, these robot policies are not guaranteed to execute safely or to succeed when faced with out-of-distribution goal requests. In this work, we enable robots to know when they can confidently execute a user’s desired goal, and automatically suggest safe alternatives when they cannot. Our approach is inspired by control-theoretic safety filtering, wherein a safety filter minimally adjusts a robot’s candidate action to be safe. Our key idea is to pose alternative suggestion as a safe control problem in goal space, rather than in action space. Offline, we use reachability analysis to compute a goal-parameterized reach-avoid value network which quantifies the safety and liveness of the robot’s pre-trained policy. Online, our robot uses the reach-avoid value network as a safety filter, monitoring the human’s given goal and actively suggesting alternatives that are similar but meet the safety specification. We demonstrate our Safe ALTernatives (SALT) framework in simulation experiments with Franka Panda tabletop manipulation. We find that SALT is able to learn to predict successful and failed closed-loop executions, is a less pessimistic monitor than open-loop uncertainty quantification, and proposes alternatives that consistently align with those that people find acceptable.
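
The idea of filtering in goal space rather than action space can be sketched as follows. This is a simplified illustration, not the SALT implementation: `value_fn` stands in for the learned reach-avoid value network, the convention that a value at or above `threshold` means "safe and reachable" is an assumption, and the alternative is picked from a finite candidate set by Euclidean distance to the requested goal.

```python
import math


def suggest_goal(state, requested_goal, candidate_goals, value_fn, threshold=0.0):
    """Goal-space safety filter: keep the requested goal if it passes, else suggest
    the closest candidate goal that does.

    value_fn(state, goal) is a stand-in for a reach-avoid value function; values at
    or above `threshold` are treated as safe and reachable (assumed sign convention).
    Returns (goal, was_changed); goal is None when no candidate passes.
    """
    if value_fn(state, requested_goal) >= threshold:
        return requested_goal, False
    safe = [g for g in candidate_goals if value_fn(state, g) >= threshold]
    if not safe:
        return None, True
    # Minimal adjustment: the passing goal closest to what the user asked for.
    return min(safe, key=lambda g: math.dist(g, requested_goal)), True


def toy_value(state, goal, reach=1.0, obstacle=(0.5, 0.0), radius=0.2):
    """Hypothetical value: positive only if the goal is within reach and outside the obstacle."""
    return min(reach - math.dist(state, goal), math.dist(goal, obstacle) - radius)
```

With the toy value function, a goal requested at the obstacle centre `(0.5, 0.0)` is rejected, and among the candidates `(0.5, 0.3)`, `(2.0, 0.0)`, and `(0.0, 0.5)` the filter suggests `(0.5, 0.3)`, the nearest one that passes.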

Abstract:
In robotic systems, the performance of reinforcement learning depends on the rationality of predefined reward functions. However, manually designed reward functions often lead to policy failures due to inaccuracies. Inverse Reinforcement Learning (IRL) addresses this problem by inferring implicit reward functions from expert demonstrations. Nevertheless, existing methods rely heavily on large amounts of expert demonstrations to accurately recover the reward function. The high cost of collecting expert demonstrations in robotic applications, particularly in multi-robot systems, severely hinders the practical deployment of IRL. Consequently, improving sample efficiency has emerged as a critical challenge in multi-agent inverse reinforcement learning (MIRL). Inspired by the symmetry inherent in multi-agent systems, this work theoretically demonstrates that leveraging symmetry enables the recovery of more accurate reward functions. Building upon this insight, we propose a universal framework that integrates symmetry into existing multi-agent adversarial IRL algorithms, thereby significantly enhancing sample efficiency. Experimental results from multiple challenging tasks have demonstrated the effectiveness of this framework. Further validation in physical multi-robot systems has shown the practicality of our method.

Abstract:
Learning from demonstration (LfD) is a technique that allows expert teachers to teach task-oriented skills to robotic systems. However, the most effective way of guiding novice teachers to approach expert-level demonstrations quantitatively for specific teaching tasks remains an open question. To this end, this paper investigates the use of machine teaching (MT) to guide novice teachers to improve their teaching skills based on reinforcement learning from demonstration (RLfD). The paper reports an experiment in which novices receive MT-derived guidance to train their ability to teach a given motor skill with only 8 demonstrations and generalise this to previously unseen ones. Results indicate that the MT-guidance not only enhances robot learning performance by 89% on the training skill but also causes a 70% improvement in robot learning performance on skills not seen by subjects during training. These findings highlight the effectiveness of MT-guidance in upskilling human teaching behaviours, ultimately improving demonstration quality in RLfD.

Abstract:
Many manipulation tasks pose a challenge since they depend on non-visual environmental information that can only be determined after sustained physical interaction has already begun. This is particularly relevant for effort-sensitive, dynamics-dependent tasks such as tightening a valve. To perform these tasks safely and reliably, robots must be able to quickly adapt in response to unexpected changes during task execution, and should also learn from past experience to better inform future decisions. Humans can intuitively respond and adapt their manipulation strategy to suit such problems, but representing and implementing such behaviors for robots remains a challenge. In this work we show how this can be achieved within the framework of behavior trees. We present the adaptive behavior tree, a scalable and generalizable behavior tree design that enables a robot to quickly adapt to and learn from both visual and non-visual observations during task execution, preempting task failure or switching to a different manipulation strategy. The adaptive behavior tree selects the manipulation strategy that is predicted to optimize task performance, and learns from past experience to improve these predictions for future attempts. We test our approach on a variety of tasks commonly found in industry; the adaptive behavior tree demonstrates safety, robustness (100% success rate) and efficiency in task completion (up to 36% task speedup from the baseline).

Abstract:
One of the largest challenges in the deployment of legged robots in the real world is deriving effective general gaits. In this paper, we present BeeTLe, a framework that enables terrain-aware locomotion without the need for dedicated terrain sensors. BeeTLe is realised as a multi-expert policy Reinforcement Learning (RL) algorithm. This enables multiple gaits, applicable to different surface types, to be stored and shared in a single policy. Sensor-free terrain awareness is incorporated using a Recurrent Neural Network (RNN) to infer surface type purely from actuator positions over time. The RNN achieves an accuracy of 94% in terrain identification out of 8 possible options. We demonstrate that BeeTLe achieves greater performance than the baselines across a series of challenges including: the traversal of a flat plane, a tilted plane, a sequence of tilted planes and geometry modelling a natural hilly terrain. This is despite not seeing the sequence of tilted planes and the natural hilly terrain during training. The code, policy and simulated environments are available at: https://gitlab.surrey.ac.uk/rf00350/BeeTLe

Abstract:
Preference-aligned robot navigation in human environments is typically achieved through learning-based approaches, utilizing user feedback or demonstrations for personalization. However, personal preferences are subject to change and might even be context-dependent. Yet traditional reinforcement learning (RL) approaches with static reward functions often fall short in adapting to evolving user preferences, inevitably reflecting demonstrations once training is completed. This paper introduces a structured framework that combines demonstration-based learning with multi-objective reinforcement learning (MORL). To ensure real-world applicability, our approach allows for dynamic adaptation of the robot navigation policy to changing user preferences without retraining. It smoothly modulates how strongly the policy reflects the demonstration data relative to other preference-related objectives. Through rigorous evaluations, including a baseline comparison and sim-to-real transfer on two robots, we demonstrate our framework’s capability to adapt to user preferences accurately while achieving high navigational performance in terms of collision avoidance and goal pursuance.

Abstract:
Visual-inertial odometry (VIO) has made significant progress in various applications. However, one of the key challenges in VIO is the efficient and robust fusion of visual and inertial measurements, particularly while mitigating the impact of sensor failures. To address this challenge, we propose a new learning-based VIO system, i.e., DW-VIO, which is able to integrate multiple sensors and provide robust state estimations. To this end, we design a novel deep learning-based data-fusion approach that dynamically associates information from multiple sensors to predict sensor weights for optimization. Moreover, in order to improve the efficiency, we present several real-time optimization techniques including a fast patch graph constructor and an efficient GPU-accelerated multi-factor bundle adjustment layer. Experimental results show that DW-VIO outperforms most state-of-the-art (SOTA) methods on the EuRoC MAV, ETH3D-SLAM, and KITTI-360 benchmarks across various challenging sequences. Additionally, it maintains a minimum of 20 frames per second (FPS) on a single RTX 3060 GPU with high-resolution input, highlighting its efficiency.

Abstract:
Modular Self-Reconfigurable Robot (MSRR) systems are a class of robots capable of forming higher-level robotic systems by altering the topological relationships between modules, offering enhanced adaptability and robustness in various environments. This paper presents a novel MSRR called MODUR, featuring dual-level reconfiguration capabilities designed to integrate reconfigurable mechanisms into MSRR. Specifically, MODUR can perform high-level self-reconfiguration among modules to create different configurations, while each module is also able to change its shape to execute basic motions. The design of MODUR primarily includes a compact connector and scissor linkage groups that provide actuation, forming a parallel mechanism capable of achieving both connector motion decoupling and adjacent position migration capabilities. Furthermore, the workspace, considering the interdependent connectors, is comprehensively analyzed, laying a theoretical foundation for the design of the module’s basic motion. Finally, the motion of MODUR is validated through a series of experiments.

Abstract:
Efficient, accurate, and flexible relative localization is crucial in air-ground collaborative tasks. However, current approaches for robot relative localization are primarily realized in the form of distributed multi-robot SLAM systems with the same sensor configuration, which are tightly coupled with the state estimation of all robots, limiting both flexibility and accuracy. To this end, we fully leverage the high capacity of Unmanned Ground Vehicle (UGV) to integrate multiple sensors, enabling a semi-distributed cross-modal air-ground relative localization framework. In this work, both the UGV and the Unmanned Aerial Vehicle (UAV) independently perform SLAM while extracting deep learning-based keypoints and global descriptors, which decouples the relative localization from the state estimation of all agents. The UGV employs a local Bundle Adjustment (BA) with LiDAR, camera, and an IMU to rapidly obtain accurate relative pose estimates. The BA process adopts sparse keypoint optimization and is divided into two stages: First, optimizing camera poses interpolated from LiDAR-Inertial Odometry (LIO), followed by estimating the relative camera poses between the UGV and UAV. Additionally, we implement an incremental loop closure detection algorithm using deep learning-based descriptors to maintain and retrieve keyframes efficiently. Experimental results demonstrate that our method achieves outstanding performance in both accuracy and efficiency. Unlike traditional multi-robot SLAM approaches that transmit images or point clouds, our method only transmits keypoint pixels and their descriptors, effectively constraining the communication bandwidth under 0.3 Mbps. Codes and data will be publicly available on https://github.com/Ascbpiac/cross-model-relative-localization.git.

Abstract:
We propose ComDrive: the first comfort-oriented end-to-end autonomous driving system to generate temporally consistent and comfortable trajectories. Recent studies have demonstrated that imitation learning-based planners and learning-based trajectory scorers can effectively generate and select safe trajectories that closely mimic expert demonstrations. However, such trajectory planners and scorers face the challenge of generating temporally inconsistent and uncomfortable trajectories. To address these issues, ComDrive first extracts 3D spatial representations through sparse perception, which then serve as conditional inputs. These inputs are used by a Conditional Denoising Diffusion Probabilistic Model (DDPM)-based motion planner to generate temporally consistent multi-modal trajectories. A dual-stream adaptive trajectory scorer subsequently selects the most comfortable trajectory from these candidates to control the vehicle. Experiments demonstrate that ComDrive achieves state-of-the-art performance in both comfort and safety, outperforming UniAD by 17% in driving comfort and reducing collision rates by 25% compared to SparseDrive. More results are available on our project page: https://jmwang0117.github.io/ComDrive/.

Abstract:
The soft-argmax operation is widely adopted in neural network-based stereo matching methods to enable differentiable regression of disparity. However, networks trained with soft-argmax tend to predict multimodal probability distributions due to the absence of explicit constraints on the shape of the distribution. Previous methods leveraged Laplacian distributions and cross-entropy for training but failed to effectively improve accuracy and even increased the network’s processing time. In this paper, we propose a novel method called Sampling-Gaussian as a substitute for soft-argmax. It improves accuracy without increasing inference time. We innovatively interpret the training process as minimizing the distance in vector space and propose a combined loss of L1 loss and cosine similarity loss. We leveraged the normalized discrete Gaussian distribution for supervision. Moreover, we identified two issues in previous methods and proposed extending the disparity range and employing bilinear interpolation as solutions. We have conducted comprehensive experiments to demonstrate the superior performance of our Sampling-Gaussian method. The experimental results prove that we have achieved better accuracy on five baseline methods across four datasets. Moreover, we have achieved significant improvements on small datasets and models with weaker generalization capabilities. Our method is easy to implement, and the code is available online.
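
The ingredients named in the abstract above can be illustrated with a short Python sketch: a soft-argmax disparity estimate, a normalized discrete Gaussian target, and a combined L1 plus cosine-similarity loss between the two distributions. The function names, the standard deviation `sigma`, the weight `lam`, and the eight-bin disparity range are assumptions chosen for illustration; the paper's exact formulation (including its extended disparity range and bilinear interpolation) may differ.

```python
import numpy as np

def soft_argmax(logits):
    """Return the expected disparity under a softmax over disparity bins, plus the distribution."""
    p = np.exp(logits - logits.max())
    p /= p.sum()
    return float((p * np.arange(len(logits))).sum()), p

def discrete_gaussian(d_gt, num_bins, sigma=1.0):
    """Discrete Gaussian centred on the ground-truth disparity, normalized to sum to 1."""
    bins = np.arange(num_bins)
    g = np.exp(-0.5 * ((bins - d_gt) / sigma) ** 2)
    return g / g.sum()

def combined_loss(p, target, lam=1.0):
    """L1 distance plus lam * (1 - cosine similarity) between two distributions (hypothetical weighting)."""
    l1 = np.abs(p - target).sum()
    cos = p @ target / (np.linalg.norm(p) * np.linalg.norm(target))
    return l1 + lam * (1.0 - cos)

# A bimodal prediction with peaks at disparities 2 and 6.
logits = np.array([0.0, 0.0, 5.0, 0.0, 0.0, 0.0, 5.0, 0.0])
estimate, p = soft_argmax(logits)
target = discrete_gaussian(d_gt=2, num_bins=len(logits))
print("soft-argmax estimate:", estimate)
print("combined loss:", combined_loss(p, target))
```

In this toy case the soft-argmax estimate lands close to 4, between the two peaks and far from the true disparity of 2, which is the multimodality problem the abstract describes. Supervising the whole distribution penalizes the spurious second peak directly.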

Abstract:
The in-context learning ability of Transformer models has brought new possibilities to visual navigation. In this paper, we focus on a novel video navigation setting, where an in-context navigation policy needs to be learned purely from videos in an offline manner, without access to the actual environment. For this setting, we propose Navigate Only Look Once (NOLO), a method for learning a navigation policy that possesses the in-context ability and adapts to new scenes by taking corresponding context videos as input without finetuning or re-training. To enable learning from videos, we first propose a pseudo action labeling procedure using optical flow to recover the action label from egocentric videos. Then, offline reinforcement learning is applied to learn the navigation policy. Through extensive experiments on different scenes both in simulation and the real world, we show that our algorithm outperforms baselines by a large margin, which demonstrates the effectiveness of the learned policy in in-context learning. For videos and more information, visit our project page.

Abstract:
Recent progress in robot autonomy and safety has significantly improved human-robot interactions, enabling robots to work alongside humans on various tasks. However, complex assembly tasks still present significant challenges due to inherent task variability and the need for precise operations. This work explores deploying robots in an assistive role for such tasks, where the robot assists by fetching parts while the skilled worker provides high-level guidance and performs the assembly. We introduce GEAR, a gaze-enabled system designed to enhance human-robot collaboration by allowing robots to respond to the user’s gaze. We evaluate GEAR against a touch-based interface where users interact with the robot through a touchscreen. The experimental study involved 30 participants working on two distinct assembly scenarios of varying complexity. Results demonstrated that GEAR enabled participants to accomplish the assembly with reduced physical demand and effort compared to the touchscreen interface, especially for complex tasks, while maintaining great performance and receiving objects effectively. Participants also reported enhanced user experience while performing assembly tasks. Project page: sites.google.com/view/gear-hri

Abstract:
Robotic surgery offers enhanced precision and adaptability, paving the way for automation in surgical interventions. Cholecystectomy, the gallbladder removal, is particularly well-suited for automation due to its standardized procedural steps and distinct anatomical boundaries. A key challenge in automating this procedure is dissecting with accuracy and adaptability. This paper presents a vision-based autonomous robotic dissection architecture that integrates real-time segmentation, keypoint detection, grasping and stretching the gallbladder with the left arm, and dissecting with the other arm. We introduce an improved segmentation dataset based on videos of robotic cholecystectomy performed by various surgeons, incorporating a new "liver bed" class to enhance boundary tracking after multiple rounds of dissection. Our system employs state-of-the-art segmentation models and an adaptive boundary extraction method that maintains accuracy despite tissue deformations and visual variations. Moreover, we implemented an automated grasping and pulling strategy to optimize tissue tension before dissection upon our previous work. Ex vivo evaluations on porcine livers demonstrate that our framework significantly improves dissection precision and consistency, marking a step toward fully autonomous robotic cholecystectomy.

Abstract:
This paper presents a tightly-coupled LiDAR-Visual-Inertial Odometry (LIVO) system that integrates both LIO and VIO subsystems. The system jointly estimates the state by fusing LiDAR or visual data with Inertial Measurement Units (IMUs). It employs point-to-mesh tracking to optimize LiDAR poses and leverages intensity information from LiDAR point clouds to refine camera pose estimation. The optimized camera pose, derived from VIO, plays a crucial role in texture mapping and 3D geometry synthesis (3DGS) rendering. Our experiments demonstrate a significant improvement in average Peak Signal-to-Noise Ratio (PSNR) compared to existing methods, including R3LIVE, SR-LIVO, and FAST-LIVO. Furthermore, the system features a real-time mapping module implemented on the GPU, utilizing Truncated Signed Distance Function (TSDF) fields for global map maintenance and the Marching Cubes algorithm for mesh extraction. This approach ensures rapid and precise tracking and reconstruction capabilities. Additionally, our system supports real-time remeshing of the global map upon detecting loop closures, thereby enhancing the robustness and accuracy of the overall SLAM process.

Abstract:
Recent advances in skeleton-based action recognition have been primarily driven by Graph Convolutional Networks (GCNs) and skeleton transformers. While conventional approaches focus on modeling joint co-occurrences through skeletal connections, they overlook the inherent positional information in 3D coordinates. Although Hyper-Graphs partially address the limitation of pairwise aggregation in capturing higher-order kinematic dependencies, challenges remain in their topological definitions. To solve these problems, this paper proposes a skeleton-to-point network (Skeleton2Point) to model joints’ position relationships directly in three-dimensional space without fixed topology limitation, which is the first to regard skeleton recognition as point clouds. However, simply considering the raw 3D coordinates would result in the loss of the anatomical identity of each keypoint and its temporal position in the sequence. To address this limitation, we augment the three-dimensional spatial coordinates with two additional dimensions: the anatomical index of each keypoint and its corresponding frame number with a proposed Information Transform Module (ITM). This transformation extends the representation from a three-dimensional to a five-dimensional feature space. Furthermore, we propose a Cluster-Dispatch-Based Interaction module (CDI) to enhance the discrimination of local-global information. In comparison with existing methods on NTU-RGB+D 60 and NTU-RGB+D 120 datasets, Skeleton2Point has demonstrated state-of-the-art performance on both joint modality and stream fusion. Especially, on the challenging NTU-RGB+D 120 dataset under the X-Sub and X-Set setting, the accuracies reach 90.63% and 91.92%.
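
The five-dimensional representation described above can be sketched in a few lines of Python. This is only an illustration of appending the joint index and frame number to each 3D coordinate; the function name `skeleton_to_points` and the use of raw, unnormalized indices are assumptions, and the paper's Information Transform Module may encode these dimensions differently.

```python
import numpy as np

def skeleton_to_points(seq):
    """Flatten a (T, J, 3) skeleton sequence into a (T*J, 5) point set.

    Each output row is (x, y, z, joint_index, frame_index).
    """
    T, J, _ = seq.shape
    # frame_idx[t, j] == t and joint_idx[t, j] == j
    frame_idx, joint_idx = np.meshgrid(np.arange(T), np.arange(J), indexing="ij")
    extra = np.stack([joint_idx, frame_idx], axis=-1).astype(seq.dtype)
    return np.concatenate([seq, extra], axis=-1).reshape(T * J, 5)

# Example: 4 frames of a 25-joint skeleton with random coordinates.
rng = np.random.default_rng(0)
points = skeleton_to_points(rng.normal(size=(4, 25, 3)))
print(points.shape)
```

Because every point carries its own joint and frame index, the set can be shuffled or sub-sampled like an ordinary point cloud without losing the anatomical or temporal identity of each keypoint.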

Abstract:
Miniature amphibious robots are capable of performing various tasks in complex terrestrial and aquatic environments due to their superior adaptability. However, the mobility of existing miniature amphibious robots in such environments is limited by their complex locomotion systems and single mode of motion. This work presents a novel insect-scale amphibious robot, powered by a single piezoelectric actuator. The prototype of the robot is fabricated and preliminarily tested. By exploiting the different vibration modes of the piezoelectric actuator, the robot achieves movement in an amphibious environment. The robot employs the acoustic flow generated by the higher-order mode to achieve rapid motion at the water surface. In addition, the robot attains forward and backward motion on the ground by means of friction force between the driving feet and the ground. The findings of this study offer significant insights into the development of amphibious robots that exhibit enhanced flexibility and adaptability. These insights lay the foundation for the future applications of such robots in narrow amphibious settings.

Abstract:
This paper introduces a wheel-claw quadruped robot named Lywal-X, which is capable of omnidirectional movement as well as grasping actions. Firstly, the mechanical structure of Lywal-X is designed with a three-degree-of-freedom leg transformation mechanism and a two-degree-of-freedom wheel-claw structure. Then, movement strategies for different modes such as climbing and grasping are developed. Finally, the mobility performance of Lywal-X is analyzed, and physical experiments are conducted to verify the robot’s ability to pick up and transport target objects in both single-claw and double-claw modes.

Abstract:
This paper presents a soft electrohydraulic actuator integrated with electrically controlled adhesion. Soft electrohydraulic actuators are a type of soft actuation technology known for their versatility and promising features, enabling the creation of diverse soft robotic systems. Integrating electroadhesion functionality into this actuation technology is expected to further enhance its versatility by making it multifunctional. In the actuator proposed in this study, electroadhesion is incorporated by modifying a partial domain of the electrode to have an interdigitated shape, which generates not only actuation but also electrostatic attractive forces to nearby objects simultaneously. Additionally, the geometry of the pouch is modified from a rectangular to a non-rectangular shape to stabilize actuated deformation. The experimental results clarified the actuation performance and electroadhesion forces of the proposed actuator, while a 23% improvement in holding force was observed in the form of a gripper, demonstrating the effectiveness of the actuator with intrinsic electroadhesion.

Abstract:
During the past decade, DNA origami has emerged as a promising technology to construct DNA nanorobots with programmable configurations and excellent biocompatibility. However, existing DNA origami-based nanorobots exhibit weak stability in complex biological environments and lack efficient delivery capabilities when serving as drug carriers. This study introduces a reconfigurable tetrahedral DNA nanorobot, whose conformational transition pathways are validated by multi-resolution molecular dynamics simulations. Building on this, we systematically analyzed the structural stability of the tetrahedral nanorobot using multiple simulation methods. We then fabricated the specific molecule-triggered tetrahedral DNA nanorobot with high structural stability and efficient drug delivery capacity. The proposed nanorobot was further employed for the recognition and inhibition of circulating tumor cells. These results highlight the application potential of the proposed DNA origami-engineered nanorobot in biomedicine and nanosensing.

Abstract:
Recently, centralized receding horizon online multi-robot coverage path planning algorithms have shown remarkable scalability in thoroughly exploring large, complex, unknown workspaces with many robots. In a horizon, the path planning and the path execution interleave, meaning when the path planning occurs for robots with no paths, the robots with outstanding paths do not execute, and subsequently, when the robots with new or outstanding paths execute to reach respective goals, path planning does not occur for those robots yet to get new paths, leading to wastage of both the robotic and the computation resources. As a remedy, we propose a centralized algorithm that is not horizon-based. It plans paths at any time for a subset of robots with no paths, i.e., who have reached their previously assigned goals, while the rest execute their outstanding paths, thereby enabling concurrent planning and execution. We formally prove that the proposed algorithm ensures complete coverage of an unknown workspace and analyze its time complexity. To demonstrate scalability, we evaluate our algorithm to cover eight large 2D grid benchmark workspaces with up to 512 aerial and ground robots, respectively. A comparison with two state-of-the-art horizon-based algorithms shows its superiority in completing the coverage with up to 1.6× speedup. For validation, we perform ROS + Gazebo simulations in six 2D grid benchmark workspaces with 10 Quadcopters and TurtleBots, respectively. We also successfully conducted one outdoor experiment with three quadcopters and one indoor with two TurtleBots.

Abstract:
Most, if not all, robot navigation systems employ a decomposed planning framework that includes global and local planning. To trade off onboard computation and plan quality, current systems have to limit all robot dynamics considerations only within the local planner, while leveraging an extremely simplified robot representation (e.g., a point-mass holonomic model without dynamics) at the global level. However, such an artificial decomposition based on either full or zero consideration of robot dynamics can lead to gaps between the two levels, e.g., a global path based on a holonomic point-mass model may not be realizable by a non-holonomic robot, especially in highly constrained obstacle environments. Motivated by such a limitation, we propose a novel paradigm, Decremental Dynamics Planning (DDP), that integrates dynamic constraints into the entire planning process, with a focus on high-fidelity dynamics modeling at the beginning and a gradual fidelity reduction as the planning progresses. To validate the effectiveness of this paradigm, we augment three different planners with DDP and show overall improved planning performance. We also develop a new DDP-based navigation system, which achieves second place in both the simulation phase and real-world phase of the 2025 BARN Challenge. Both simulated and physical experiments validate DDP’s hypothesized benefits.

Abstract:
A critical use case of SLAM for mobile robots is to support localization during task-directed navigation. Current SLAM benchmarks overlook the importance of repeatability (precision) despite its impact on real-world deployments. TaskSLAM-Bench, a task-driven approach to SLAM benchmarking, addresses this gap. It employs precision as a key metric, accounts for SLAM’s mapping capabilities, and has easy-to-meet requirements. Simulated and real-world evaluations of SLAM methods provide insights into the navigation performance of modern visual and LiDAR SLAM solutions. The outcomes show that passive stereo SLAM precision may match that of 2D LiDAR SLAM in indoor environments. TaskSLAM-Bench complements existing benchmarks and offers a richer assessment of SLAM performance in navigation-focused scenarios. Publicly available code permits in-situ SLAM testing in custom environments with properly equipped robots.

Abstract:
The facial expression generation capability of humanoid social robots is critical for achieving natural and human-like interactions, playing a vital role in enhancing the fluidity of human-robot interactions and the accuracy of emotional expression. Currently, facial expression generation in humanoid social robots still relies on pre-programmed behavioral patterns, which are manually coded at high human and time costs. To enable humanoid robots to autonomously acquire generalized expressive capabilities, they need to develop the ability to learn human-like expressions through self-training. To address this challenge, we have designed a highly biomimetic robotic face with physical-electronic animated facial units and developed an end-to-end learning framework based on KAN (Kolmogorov-Arnold Network) and attention mechanisms. Unlike previous humanoid social robots, we have also meticulously designed an automated data collection system based on expert strategies of facial motion primitives to construct the dataset. Notably, to the best of our knowledge, this is the first open-source facial dataset for humanoid social robots. Comprehensive evaluations indicate that our approach achieves accurate and diverse facial mimicry across different test subjects.

Abstract:
Dogs can climb onto tables using their front legs for support, enabling them to retrieve objects and significantly expand their workspace by leveraging the external environment. However, the ability of quadrupedal robots to perform similar skills remains largely unexplored. In this work, we introduce a unified, learning-based loco-manipulation framework for quadrupedal robots, allowing them to utilize the external environment as support to extend their workspace and enhance their manipulation capabilities. Specifically, our method proposes a unified policy that takes limited onboard sensors and proprioception as input, generating whole-body actions that enable the robot to manipulate objects. To guide the policy learning for environment-in-the-loop manipulation, we design a set of rewards that address challenges such as imprecise perception and center-of-mass shifts. Additionally, we employ curriculum learning to train both teacher and student policies, ensuring effective skill transfer in complex tasks. We train the policy in simulation and conduct extensive experiments, demonstrating that our approach allows robots to manipulate previously inaccessible objects, opening up new possibilities for enhancing quadrupedal robot capabilities without the need for hardware modifications or additional costs. The project page is available at https://sites.google.com/view/env-mani.

Abstract:
With multi-agent systems increasingly deployed autonomously at scale in complex environments, ensuring safety of the data-driven policies is critical. Control Barrier Functions have emerged as an effective tool for enforcing safety constraints, yet existing learning-based methods often lack in scalability, generalization and sampling efficiency as they overlook inherent geometric structures of the system. To address this gap, we introduce symmetries-infused distributed CBFs, enforcing the satisfaction of intrinsic symmetries on learnable graph-based safety certificates. We theoretically motivate the need for equivariant parametrization of CBFs and policies, and propose a simple, yet efficient and adaptable methodology for constructing such equivariant group-modular networks via the compatible group actions. This approach encodes safety constraints in a distributed data-efficient manner, enabling zero-shot generalization to larger and denser swarms. Through extensive simulations on multi-robot navigation tasks, we demonstrate that our method outperforms state-of-the-art baselines in terms of safety, scalability, and task success rates, highlighting the importance of embedding symmetries in safe distributed neural policies.

Abstract:
Multi-Robot Systems (MRS) are increasingly utilized in applications such as surveillance, environmental monitoring, and search and rescue, where maximizing mission rewards under budget constraints is critical. The Team Orienteering Problem (TOP) provides a framework for optimizing task coverage and resource allocation in such scenarios. However, traditional TOP formulations often overlook real-world constraints, such as limited communication ranges and the necessity of persistent connectivity among robots. These constraints are particularly relevant in environments like disaster zones and remote areas, where communication infrastructure is unreliable or absent. To address this gap, we propose a multi-objective formulation that balances task coverage, communication quality and energy expenditure under a fixed budget. Our approach accommodates teams of any size and heterogeneous vehicles with varying velocities and constant thrust. We validate our approach through extensive experiments across diverse scenarios and team configurations.

Abstract:
Prostate cancer is a major global health concern, requiring advancements in robotic surgery and diagnostics to improve patient outcomes. A phantom is a specially designed object that simulates human tissues or organs. It can be used for calibrating and testing a medical process, as well as for training and research purposes. Existing prostate phantoms fail to simulate dynamic scenarios. This paper presents a pneumatically actuated prostate phantom with multiple independently controlled chambers, allowing for precise volumetric adjustments to replicate asymmetric and symmetric benign prostatic hyperplasia (BPH). The phantom is designed based on shape analysis of magnetic resonance imaging (MRI) datasets, modeled with finite element method (FEM), and validated through 3D reconstruction. The simulation results showed strong agreement with physical measurements, achieving average errors of 3.47% in forward modeling and 1.41% in inverse modeling. These results demonstrate the phantom’s potential as a platform for validating robotic-assisted systems and for further development toward realistic simulation-based medical training.

Abstract:
This paper presents the development and integration of a vision-guided loco-manipulation pipeline for Northeastern University’s snake robot, COBRA. The system leverages a YOLOv8-based object detection model and depth data from an onboard stereo camera to estimate the 6-DOF pose of target objects in real time. We introduce a framework for autonomous detection and control, enabling closed-loop loco-manipulation for transporting objects to specified goal locations. Additionally, we demonstrate open-loop experiments in which COBRA successfully performs real-time object detection and loco-manipulation tasks.

Abstract:
In this work, we highlight vulnerabilities in robotic systems integrating large language models (LLMs) and vision-language models (VLMs) due to input modality sensitivities. While LLM/VLM-controlled robots show impressive performance across various tasks, their reliability under slight input variations remains underexplored yet critical. These models are highly sensitive to instruction or perceptual input changes, which can trigger misalignment issues, leading to execution failures with severe real-world consequences. To study this issue, we analyze the misalignment-induced vulnerabilities within LLM/VLM-controlled robotic systems and present a mathematical formulation for failure modes arising from variations in input modalities. We propose empirical perturbation strategies to expose these vulnerabilities and validate their effectiveness through experiments on multiple robot manipulation tasks. Our results show that simple input perturbations reduce task execution success rates by 22.2% and 14.6% in two representative LLM/VLM-controlled robotic systems. These findings underscore the importance of input modality robustness and motivate further research to ensure the safe and reliable deployment of advanced LLM/VLM-controlled robotic systems.

Abstract:
Social robots offer a promising solution for autonomously guiding patients through physiotherapy exercise sessions, but effective deployment requires advanced decision-making to adapt to patient needs. A key challenge is the scarcity of patient behavior data for developing robust policies. To address this, we engaged 33 expert healthcare practitioners as patient proxies, using their interactions with our robot to inform a patient behavior model capable of generating exercise performance metrics and subjective scores on perceived exertion. We trained a reinforcement learning-based policy in simulation, demonstrating that it can adapt exercise instructions to individual exertion tolerances and fluctuating performance, while also being applicable to patients at different recovery stages with varying exercise plans.

Abstract:
We consider manipulation problems in constrained and cluttered settings, which require several regrasps at unknown locations. We propose to inform an optimization-based task and motion planning (TAMP) solver with possible regrasp areas and grasp sequences to speed up the search. Our main idea is to use a state space abstraction, a regrasp map, capturing the combinations of available grasps in different parts of the configuration space, and allowing us to provide the solver with guesses for the mode switches and additional constraints for the object placements. By interleaving the creation of regrasp maps, their adaptation based on failed refinements, and solving TAMP (sub)problems, we are able to provide a robust search method for challenging regrasp manipulation problems.

Abstract:
Osmosis-driven actuation offers a promising strategy for developing untethered, environmentally responsive soft and shape-shifting mechanisms and robots. In this work, we explore the use of superabsorbent polymer (SAP) pellets as large-scale, shape-morphing actuators. Upon exposure to water, these approximately 2 mm diameter spherical pellets undergo a dramatic volumetric expansion, up to 300 times their initial volume, generating actuation forces of approximately 10 N under constrained conditions. We further demonstrate reversible cyclic actuation via controlled swelling-deswelling using ethanol-water solutions. Finally, we integrate these systems into a shape-morphing wheel design to enable adaptive locomotion that passively transitions between terrestrial and aquatic environments. Our findings demonstrate SAP-based osmotic actuators as an environmentally driven solution for soft robotics and shape-shifting soft hybrid mechanisms.
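
To put the reported volumetric expansion in terms of size, the short Python calculation below converts a volume ratio into a diameter ratio. It assumes, purely for illustration, that a pellet stays spherical while swelling freely and reaches the full 300-fold volume increase; the helper name `swollen_diameter_mm` is hypothetical.

```python
def swollen_diameter_mm(initial_diameter_mm, volume_ratio):
    """Diameter of a sphere after its volume grows by volume_ratio.

    Sphere volume scales with the cube of the diameter, so the
    diameter scales with the cube root of the volume ratio.
    """
    return initial_diameter_mm * volume_ratio ** (1.0 / 3.0)

# Figures taken from the abstract: about 2 mm initial diameter, up to 300x volume.
print(round(swollen_diameter_mm(2.0, 300.0), 1), "mm")
```

Under these assumptions the diameter grows by a factor of about 6.7, so a 2 mm pellet would swell to roughly 13.4 mm.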

Affiliations: School of Artificial Intelligence and Robotics, and the National Engineering Research Center for Robot Visual Perception and Control Technology, Hunan University, Changsha, China; College of Electrical and Information Engineering and the National Engineering Research Center for Robot Visual Perception and Control Technology, Hunan University, Changsha, China; Department of Computer Science, University of Liverpool, Liverpool, U.K.

Abstract:
Since its inception, impedance control has emerged as a fundamental framework for robotic interaction control. Recent advancements in geometric impedance control have demonstrated certain advantages over traditional Cartesian impedance control. However, existing geometric impedance control approaches generally lack force regulation capabilities or rigorous stability guarantees. In this paper, we propose a safety-aware geometric force-impedance controller that addresses these limitations. By incorporating an energy tank mechanism, the proposed approach enables precise force tracking while preserving full compatibility with the impedance behavior. Furthermore, an energy injection and freezing mechanism is introduced, allowing dynamic regulation of energy exchange between the tank and the robotic system. Notably, the proposed method eliminates the need for an offline estimation of the initial energy stored in the tank, facilitating real-time adjustments of force controller parameters. To validate the effectiveness of the proposed framework, we conduct extensive polishing experiments on a real robotic platform. The results demonstrate the capability of the proposed controller to achieve stable and precise force regulation.

Abstract:
Thermal imaging can greatly enhance the application of intelligent unmanned aerial vehicles (UAVs) in challenging environments. However, the inherent low resolution of thermal sensors leads to insufficient details and blurred boundaries. Super-resolution (SR) offers a promising solution to this issue, but most existing SR methods are designed for fixed-scale SR. They are computationally expensive and inflexible in practical applications. To address the above issues, this work proposes a novel any-scale thermal SR method (AnyTSR) for UAVs within a single model. Specifically, a new image encoder is proposed to explicitly assign specific feature codes, enabling more accurate and flexible representation. Additionally, by effectively embedding coordinate offset information into the local feature ensemble, an innovative any-scale upsampler is proposed to better understand spatial relationships and reduce artifacts. Moreover, a novel dataset (UAV-TSR), covering both land and water scenes, is constructed for thermal SR tasks. Experimental results demonstrate that the proposed method consistently outperforms state-of-the-art methods across all scaling factors and generates more accurate and detailed high-resolution images. The code is located at https://github.com/vision4robotics/AnyTSR.

Abstract:
This work presents a flexible and effective micromixer based on optoelectronic tweezers (OET), which leverages both asymmetric induced-charge electro-osmosis (ICEO) and dielectrophoresis (DEP) phenomena on microscale anisotropic NdFeB particles. The asymmetric ICEO phenomenon is generated by symmetry breaking in the induced charge distributions of geometrically anisotropic NdFeB particles under AC electric field polarization. The DEP forces exerted on NdFeB particles are induced by the light-generated non-uniform electric field. Under the combined action of hydrodynamic forces from asymmetric ICEO vortices and positive DEP forces, NdFeB particles can be attracted into light-induced "virtual" electrodes and precisely track along light-defined trajectories. Experimental results demonstrate that the maximum motion speed of the NdFeB particles exceeds 300 μm/s, with the motion speed exhibiting a positive correlation with the applied voltage. Dynamically controlled virtual electrodes enable accurate capture and relocation of microparticles to arbitrary target positions. The stirring and mixing capability of the NdFeB particles is demonstrated by driving yeast cell motion.

Abstract:
Most conventional semantic SLAM approaches concentrate on maintaining 3D semantic consistency while overlooking their reliance on predefined semantic categories, ultimately limiting flexibility in scene understanding. We propose Open-Vocabulary Semantic Gaussian Splatting SLAM (OVSG-SLAM), an approach that integrates multi-modal perception and 3D Gaussian splatting into a semantic SLAM framework. By combining the advantages of Segment Anything (SAM) for open-vocabulary 2D scene understanding with the powerful feature extraction capabilities of vision-language models, our method eliminates the reliance on predefined closed-set categories. Although Vision-Language Models (VLMs) provide open-vocabulary reasoning, integrating them with 3D semantic SLAM poses challenges such as embedding ambiguity and computational overhead. To address these challenges, we present a feature embedding strategy called differentiable identity-aware encoding, which reduces computational cost while ensuring accurate semantic mapping. Furthermore, instead of using a traditional semantic loss, we optimize the scene representation through an identity loss. Extensive experimental evaluations on the Replica and ScanNet datasets demonstrate that the proposed method achieves state-of-the-art performance in mapping, tracking and 3D semantic segmentation tasks.

Abstract:
Few-shot object detection is especially interesting for applications with mobile robots and becomes even more challenging when task-related classes are very similar. This work focuses on such a scenario: detecting different types of household and industrial tools. Such tools can be rare and specific and are usually not covered by existing large datasets, except for common ones such as screwdrivers. Additionally, the target classes might change frequently depending on the robot’s missions. Therefore, we propose DE-fine-ViT, a fine-grained few-shot object detection model that does not require fine-tuning. We build our architecture on top of the elaborate DE-ViT model, extending it with specialized components to improve the fine-grained detection capabilities. The user can construct class and part prototypes tailored to the task in an interactive preparation phase. During inference, our proposed reevaluation module leverages the multi-granularity of prototypes for fine-grained class differentiation. We evaluate our model in multiple realistic experiments, including a specifically created fine-grained dataset, demonstrating its efficacy and suitability for scenarios with little data and low inter-class variance.

Abstract:
Black box neural networks are an indispensable part of modern robots. Nevertheless, deploying such high-stakes systems in real-world scenarios poses significant challenges when the stakeholders, such as engineers and legislative bodies, lack insights into the neural networks’ decision-making process. Presently, explainable AI is primarily tailored to natural language processing and computer vision, falling short in two critical aspects when applied in robots: grounding in decision-making tasks and the ability to assess trustworthiness of their explanations. In this paper, we introduce a trustworthy explainable robotics technique based on human-interpretable, high-level concepts that attribute to the decisions made by the neural network. Our proposed technique provides explanations with associated uncertainty scores by matching the neural network’s activations with human-interpretable visualizations. To validate our approach, we conducted a series of experiments with various simulated and real-world robot decision-making models, demonstrating the effectiveness of the proposed approach as a post-hoc, human-friendly robot diagnostic tool. Code: https://github.com/aditya-taparia/BaTCAVe

Abstract:
Vision-Language-Action (VLA) models have emerged as a prominent research area, showcasing significant potential across a variety of applications. However, their performance in endoscopy robotics, particularly endoscopy capsule robots that perform actions within the digestive system, remains unexplored. The integration of VLA models into endoscopy robots allows more intuitive and efficient interactions between human operators and medical devices, improving both diagnostic accuracy and treatment outcomes. In this work, we design CapsDT, a Diffusion Transformer model for capsule robot manipulation in the stomach. By processing interleaved visual inputs and textual instructions, CapsDT can infer corresponding robotic control signals to facilitate endoscopy tasks. In addition, we developed a capsule endoscopy robot system, a capsule robot controlled by a robotic arm-held magnet, addressing different levels of four endoscopy tasks and creating corresponding capsule robot datasets within the stomach simulator. Comprehensive evaluations on various robotic tasks indicate that CapsDT can serve as a robust vision-language generalist, achieving state-of-the-art performance in various levels of endoscopy tasks while achieving a 26.25% success rate in real-world simulation manipulation.

Abstract:
We propose an online motion planner for legged robot locomotion with the primary objective of achieving energy efficiency. The conceptual idea is to leverage a placement set of footstep positions based on the robot's body position to determine when and how to execute steps. In particular, the proposed planner uses virtual placement sets beneath the hip joints of the legs and executes a step when the foot is outside its placement set. Furthermore, we propose a parameter design framework that considers both energy-efficiency and robustness measures to optimize the gait by changing the shape of the placement set along with other parameters, such as step height and swing time, as a function of walking speed. We show that the planner produces trajectories that have a low Cost of Transport (CoT) and a high robustness measure, and evaluate our approach against model-free Reinforcement Learning (RL) and motion imitation using biological dog motion priors as the reference. Overall, within the low to medium velocity range, we show a 50.4% improvement in CoT and improved robustness over model-free RL, our best performing baseline. Finally, we show the ability to handle slippery surfaces, gait transitions, and disturbances in simulation and hardware with the Unitree A1 robot.
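
As a rough illustration of the two quantities this abstract relies on, the sketch below computes the standard dimensionless Cost of Transport, CoT = E / (m g d), and a step trigger that fires when a foot leaves a placement set under its hip. The elliptical shape of the set, the function names, and the parameter values are assumptions made for this example; the abstract does not specify them.

```python
def cost_of_transport(energy_j, mass_kg, distance_m, g=9.81):
    """Dimensionless Cost of Transport: energy used per unit weight per unit distance."""
    return energy_j / (mass_kg * g * distance_m)


def should_step(foot_xy, hip_xy, half_axes):
    """Return True when the foot lies outside the placement set under the hip.

    Assumption: the placement set is an axis-aligned ellipse centred at the
    hip's ground projection, with half-axes (a, b) in metres.
    """
    dx = foot_xy[0] - hip_xy[0]
    dy = foot_xy[1] - hip_xy[1]
    a, b = half_axes
    return (dx / a) ** 2 + (dy / b) ** 2 > 1.0


if __name__ == "__main__":
    # Hypothetical numbers: a 12 kg robot spending 300 J to walk 5 m.
    print(cost_of_transport(300.0, 12.0, 5.0))
    # Foot has drifted 12 cm behind the hip; the set extends 10 cm fore and aft.
    print(should_step((-0.12, 0.0), (0.0, 0.0), (0.10, 0.05)))
```

In such a scheme, changing the half-axes as a function of walking speed would change how often steps are triggered, which is one way to read the abstract's statement that the shape of the placement set is a tunable gait parameter.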

Abstract:
The growing complexity of robotic teleoperation systems necessitates the integration of multiple feedback modalities, including video, audio, force, tactile, and temperature feedback. The concept of supermedia is utilized to describe the aggregation of these feedback streams. By integrating multiple media forms, supermedia can offer a more comprehensive interactive experience for robot teleoperation systems. However, existing transmission protocols struggle to maintain synchronization among these diverse feedback streams, particularly in demanding network environments. In this paper, we present the Tele-Robotic Control Protocol (TRCP), a novel network transmission protocol specifically designed for supermedia-enhanced robotic teleoperation systems. TRCP incorporates an event reference mechanism that coordinates multiple feedback streams based on robot state rather than traditional time-based sampling. It also employs multi-queue management for the independent handling of different feedback types and integrates an adaptive adjustment mechanism that optimizes transmission parameters in response to real-time network conditions. The effectiveness of TRCP is demonstrated through a cross-continental teleoperation experiment between the University of Glasgow and the University of Hong Kong. TRCP achieves superior feedback synchronization and real-time responsiveness, significantly enhancing both task success rates and operator performance.

Abstract:
3D Gaussian Splatting (3DGS) has recently gained popularity as a faster alternative to Neural Radiance Fields (NeRFs) in 3D reconstruction and view synthesis methods. Leveraging the spatial information encoded in 3DGS, this work proposes FOCI (Field Overlap Collision Integral), an algorithm that is able to optimize trajectories directly on the Gaussians themselves. FOCI leverages a novel and interpretable collision formulation for 3DGS using the notion of the overlap integral between Gaussians. Contrary to other approaches, which represent the robot with conservative bounding boxes that underestimate the traversability of the environment, we propose to represent the environment and the robot as Gaussian Splats. This not only has desirable computational properties, but also allows for orientation-aware planning, allowing the robot to pass through very tight and narrow spaces. We extensively test our algorithm in both synthetic and real Gaussian Splats, showcasing that collision-free trajectories for the ANYmal legged robot can be computed in a few seconds, even with hundreds of thousands of Gaussians making up the environment. The project page and code are available at https://rffr.leggedrobotics.com/works/foci/
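
The overlap integral between two Gaussians has a closed form, which is presumably part of what makes the formulation computationally attractive. For normalized densities, the integral of N(x; mu1, S1) N(x; mu2, S2) over x equals N(mu1; mu2, S1 + S2). The sketch below evaluates this identity for one robot Gaussian and one environment Gaussian in 3D. It is a minimal illustration of the identity only: the abstract does not say whether FOCI uses normalized densities, opacity weights, or additional terms, and all function names here are made up for the example.

```python
import math


def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)


def inv3(m):
    """Inverse of a 3x3 matrix via the adjugate formula."""
    (a, b, c), (d, e, f), (g, h, i) = m
    det = det3(m)
    adj = [
        [e * i - f * h, c * h - b * i, b * f - c * e],
        [f * g - d * i, a * i - c * g, c * d - a * f],
        [d * h - e * g, b * g - a * h, a * e - b * d],
    ]
    return [[x / det for x in row] for row in adj]


def gaussian_overlap(mu1, cov1, mu2, cov2):
    """Closed-form overlap integral of two normalized 3D Gaussian densities."""
    s = [[cov1[r][c] + cov2[r][c] for c in range(3)] for r in range(3)]
    s_inv = inv3(s)
    d = [mu1[k] - mu2[k] for k in range(3)]
    # Mahalanobis term d^T (S1 + S2)^-1 d
    q = sum(d[r] * s_inv[r][c] * d[c] for r in range(3) for c in range(3))
    norm = math.sqrt((2.0 * math.pi) ** 3 * det3(s))
    return math.exp(-0.5 * q) / norm


if __name__ == "__main__":
    eye = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    print(gaussian_overlap([0, 0, 0], eye, [0, 0, 0], eye))  # coincident
    print(gaussian_overlap([0, 0, 0], eye, [3, 0, 0], eye))  # well separated
```

Because the expression is smooth in the means and covariances, a cost built from such terms is differentiable with respect to the robot pose, including its orientation through the rotated robot covariance.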

Affiliations: Department of Industrial and Manufacturing Systems Engineering, Emerging Technologies Institute, The University of Hong Kong, Hong Kong, China; School of Mechanical Engineering and Automation and School of Materials Science and Engineering, Northeastern University, Shenyang, China; Naval Architecture and Ocean Engineering College of Dalian Maritime University, Dalian, China; State Key Laboratory of Robotics, Shenyang Institute of Automation, Chinese Academy of Sciences, Shenyang, China

Abstract:
Soft artificial muscle actuators have gained attention in robotics for their remote control, fast response, and high compliance. However, replicating the intricate and efficient motions of natural muscles remains a challenge. Existing designs often lack the hierarchical and anisotropic properties of muscle sarcomeres, limiting their ability to achieve biomimetic movements. We developed a novel Biomimetic Magnetic Artificial Actuator (BMAA) inspired by muscle sarcomeres. Using a soft magnetic composite material arranged in a hierarchical structure, the actuator mimics the arrangement of actin and myosin filaments. External magnetic fields enable precise control of contraction and relaxation, emulating natural muscle motion. The actuator achieves muscle-like motion characteristics, and its actuation capability is verified in a reptile experiment and an elbow experiment. The actuator demonstrates significant deformation, fast response, and excellent controllability, enabling complex and precise movements. This research advances the development of biomimetic soft actuators, offering potential applications in soft robotics, biomedical devices, and artificial muscles, and paving the way for more versatile and intelligent machines.

Abstract:
UAV-view geo-localization is crucial in many applications, such as material transportation and security inspection, particularly in GPS-denied urban environments. However, most existing methods assume a known drone flight altitude and divide satellite maps into tiles that approximate the scale of drone images, an approach that is often inapplicable to real-world UAV scenarios where flight altitudes vary. In this paper, we propose a novel UAV-view geo-localization method, termed MM-Geo, to address the aforementioned issue. In particular, we partition the satellite imagery map into tiles of uniform size and retrieve the matching tiles in real time using online drone images of smaller field-of-view (FOV) at different altitudes. To address the multi-scale problem due to the varying altitudes, we design the patch vote rerank with match attention, and to tackle the multi-positive sample issue in the continuous, the normalized InfoNCE loss is incorporated to provide finer supervision during contrastive learning. The proposed MM-Geo is extensively validated on our own large-scale urban dataset MT-UAV as well as the public dataset UAV-VisLoc, outperforming the state-of-the-art (SOTA) approaches and achieving remarkable performance in practical drone delivery operations. To benefit the community, we will release the VisLoc-related code at: https://github.com/MM-Geo-2025/MM-Geo.
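
The abstract does not define its normalized InfoNCE loss. As a point of reference only, the sketch below implements the standard InfoNCE objective averaged over several positives, which is one common way to handle a query (a drone image) that legitimately matches more than one candidate (overlapping satellite tiles). The averaging scheme, the function name, and the temperature value are assumptions and should not be read as the paper's formulation.

```python
import math


def multi_positive_info_nce(sims, positive_idx, temperature=0.07):
    """InfoNCE averaged over several positives.

    sims: similarity scores between one query and every candidate tile.
    positive_idx: indices of the candidates treated as correct matches.
    """
    logits = [s / temperature for s in sims]
    # log-sum-exp with max subtraction for numerical stability
    m = max(logits)
    log_z = m + math.log(sum(math.exp(l - m) for l in logits))
    # Mean negative log-probability assigned to the positives
    return -sum(logits[i] - log_z for i in positive_idx) / len(positive_idx)


if __name__ == "__main__":
    sims = [0.9, 0.8, 0.1, -0.2]  # hypothetical cosine similarities
    print(multi_positive_info_nce(sims, positive_idx=[0, 1]))
```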

Abstract:
The control of robots for manipulation tasks generally relies on visual input. Recent advances in vision-language models (VLMs) enable the use of natural language instructions to condition visual input and control robots in a wider range of environments. However, existing methods require a large amount of data to fine-tune VLMs for operating in unseen environments. In this paper, we present a framework that learns object-arrangement tasks from just a few demonstrations. We propose a two-stage framework that divides object-arrangement tasks into a target localization stage, for picking the object, and a region determination stage for placing the object. We present an instance-level semantic fusion module that aligns the instance-level image crops with the text embedding, enabling the model to identify the target objects defined by the natural language instructions. We validate our method on both simulation and real-world robotic environments. Our method, fine-tuned with a few demonstrations, improves generalization capability and demonstrates zero-shot ability in real-robot manipulation scenarios.

Abstract:
Change detection has long been used for various tasks. With advancements in robotic systems and computer vision, change detection techniques can be further explored for diverse applications. Current state-of-the-art methods primarily use either satellite images or street-level images to detect changes. However, the techniques used for these two types of images differ substantially, though their core objective remains identical. We introduce a spectral-temporal attention network capable of detecting changes in both satellite and street-level images across multiple temporal instances. Additionally, we present an indoor environmental dataset featuring significantly more frequent changes. We analyze the impact of temporal and spatial domain shifts on the performance of various methods and demonstrate that performing attention in the spectral domain not only enhances overall performance but also increases robustness against spatial domain shifts.

Abstract:
Torque-restricted control remains a significant challenge in robotics, often necessitating precise modeling or large amounts of data for effective controller design. To address this problem, we introduce a novel training method that utilizes a Reservoir Computing (RC) framework to serve as a model-free controller that can effectively control a nonlinear robot using minimal training. This paper explores the application of the proposed framework to a torque-restricted single pendulum and achieves similar control performance to that of model-free reinforcement learning controllers while utilising just 0.5% of the data and a simple passive data collection method. We analyze 1,000 unique successful reservoir structures, examining their internal connectivity and memory properties, and identify key structural features that enhance control performance. Finally, this paper also explores our proposed controller’s robustness to changes in pendulum dimensionality and torque limit with successful control achieved for a large range of varying properties without any additional training.

Abstract:
Urban intersections with diverse vehicle types, from small cars to large semi-trailers, pose significant challenges for traffic control. This study explores how robot vehicles (RVs) can enhance heterogeneous traffic flow, particularly at unsignalized intersections where traditional methods fail during power outages. Using reinforcement learning (RL) and real-world data, we simulate mixed traffic at complex intersections with RV penetration rates ranging from 10% to 90%. Results show that average waiting times drop by up to 86% and 91% compared to signalized and unsignalized intersections, respectively. We observe a "rarity advantage," where less frequent vehicles benefit the most (up to 87%). Although CO2 emissions and fuel consumption increase with RV penetration, they remain well below those of traditional signalized traffic. Decreased space headways also indicate more efficient road usage. These findings highlight RVs’ potential to improve traffic efficiency and reduce environmental impact in complex, heterogeneous settings.

Abstract:
We consider the uncertain multi-robot motion planning (MRMP) problem with cooperative localization (CL-MRMP), under both motion and measurement noise, where each robot can act as a sensor for its nearby teammates. We formalize CL-MRMP as a chance-constrained motion planning problem, and propose a safety-guaranteed algorithm that explicitly accounts for robot-robot correlations. Our approach extends a sampling-based planner to solve CL-MRMP while preserving probabilistic completeness. To improve efficiency, we introduce novel biasing techniques. We evaluate our method across diverse benchmarks, demonstrating its effectiveness in generating motion plans, with significant performance gains from biasing strategies.

Abstract:
The robustness certification of deep neural networks (DNNs) is crucial in many safety-critical domains. Randomized Smoothing (RS) has emerged as the current state-of-the-art method for DNN robustness verification that successfully scales on large DNNs used in practice, has achieved excellent results, and has been extended for a large variety of adversarial perturbation scenarios. However, an important cost in RS is incurred during inference, since it requires passing tens or hundreds of thousands of perturbed samples through the DNN to perform the verification. In this work we aim to address this, and explore what happens to the certified radius as we decrease the number of samples by orders of magnitude. Surprisingly, we find that the performance reduction in terms of average certified radius is not too large, even if we decrease the number of samples by two orders of magnitude, or more. Moreover, we find that the resulting certified radius reduction can be mitigated using off-the-shelf methods designed to improve RS performance. This can pave the way for dramatically faster robustness certification, unlocking the possibility of performing it in real-time, which we demonstrate. We perform a detailed analysis, both theoretically and experimentally, and show promising results on the standard CIFAR-10 and ImageNet datasets.
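
To give a feel for why the certified radius degrades slowly with the sample count, the sketch below uses the widely used L2 certificate R = sigma * Phi^-1(p_lower), where p_lower is a lower confidence bound on the top-class probability under noise. It considers only the best case in which all n noisy samples vote for the same class, where the one-sided Clopper-Pearson lower bound reduces to alpha^(1/n). This is a simplified illustration under those assumptions, not the analysis carried out in the paper; the function name and default values are chosen for the example.

```python
from statistics import NormalDist


def best_case_certified_radius(n_samples, sigma, alpha=0.001):
    """L2 radius when all n noisy samples agree on the predicted class.

    With k = n successes, the one-sided Clopper-Pearson lower bound on the
    top-class probability solves p**n = alpha, i.e. p_lower = alpha**(1/n).
    """
    p_lower = alpha ** (1.0 / n_samples)
    if p_lower <= 0.5:
        return 0.0  # cannot certify: abstain
    return sigma * NormalDist().inv_cdf(p_lower)


if __name__ == "__main__":
    for n in (100_000, 10_000, 1_000, 100):
        print(n, best_case_certified_radius(n, sigma=0.25))
```

Because the radius depends on n only through the inverse normal CDF of a probability that is already close to 1, cutting n by a factor of 100 shrinks this best-case radius by well under half rather than by a factor of 100.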

Abstract:
Learning-based methods for dexterous manipulation have made notable progress in recent years, and they can now produce solutions to complex tasks. However, learned policies often still lack reliability and exhibit limited robustness to important factors of variation. One failure pattern that can be observed across many settings is that policies idle, i.e., they cease to move beyond a small region of states, often indefinitely, when they reach certain states. This policy idling is often a reflection of the training data. For instance, it can occur when the data contains small actions in areas where the robot needs to perform high-precision motions, e.g., when preparing to grasp or insert an object. Prior works have tried to mitigate this phenomenon, e.g., by filtering the training data or modifying the control frequency. However, these approaches can negatively impact policy performance in other ways. As an alternative, we investigate how to leverage the detectability of idling behavior to inform exploration and policy improvement. Our approach, Pause-Induced Perturbations (PIP), applies perturbations at detected idling states, thus helping the policy escape problematic basins of attraction. On a range of challenging simulated dual-arm tasks, we find that this simple approach can already noticeably improve test-time performance, with no additional supervision or training. Furthermore, since the robot tends to idle at critical points in a movement, we also find that learning from the resulting episodes leads to better iterative policy improvement compared to prior approaches. Our perturbation strategy also leads to a 15-35% improvement in absolute success rate on a real-world insertion task that requires complex multi-finger manipulation.
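
The idea of perturbing a policy only when it is detected to be idling can be sketched in a few lines. The detector below flags idling when the last few states all stay within a small box around the first state in the window, and the wrapper then adds uniform noise to the policy's action. The window length, tolerance, noise scale, and noise distribution are all assumptions for illustration; the abstract does not state how PIP detects idling or what form its perturbations take.

```python
import random


def is_idling(states, window=20, eps=1e-3):
    """True if the last `window` states stay within `eps` (max-norm) of the window's first state."""
    if len(states) < window:
        return False
    tail = states[-window:]
    ref = tail[0]
    return all(
        max(abs(s_i - r_i) for s_i, r_i in zip(s, ref)) <= eps for s in tail
    )


def perturb(action, scale=0.05, rng=random):
    """Add independent uniform noise in [-scale, scale] to each action dimension."""
    return [a + rng.uniform(-scale, scale) for a in action]


def act_with_perturbation(policy, state, history):
    """Query the policy; perturb its action only while idling is detected."""
    history.append(list(state))
    action = policy(state)
    if is_idling(history):
        action = perturb(action)
    return action


if __name__ == "__main__":
    stuck_policy = lambda s: [0.0, 0.0]  # hypothetical policy that has stalled
    history = []
    for _ in range(25):
        a = act_with_perturbation(stuck_policy, [0.5, 0.5], history)
    print(a)  # now a small random nudge instead of exactly [0.0, 0.0]
```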

Abstract:
High-performance Transformer trackers have exhibited excellent results, yet they often bear a heavy computational load. Observing that a smaller input can immediately and conveniently reduce computations without changing the model, an easy solution is to adopt a low-resolution input for efficient Transformer tracking. Albeit faster, this substantially hurts tracking accuracy due to the information loss in low-resolution tracking. In this paper, we aim to mitigate such information loss to boost performance of low-resolution Transformer tracking via dual knowledge distillation from a frozen high-resolution (but not a larger) Transformer tracker. The core lies in two simple yet effective distillation modules, including query-key-value knowledge distillation (QKV-KD) and discrimination knowledge distillation (Disc-KD), across resolutions. The former, from the global view, allows the low-resolution tracker to inherit features and interactions from the high-resolution tracker, while the latter, from the target-aware view, enhances the target-background distinguishing capacity via imitating discriminative regions from its high-resolution counterpart. With dual knowledge distillation, our Low-Resolution Transformer Tracker, dubbed LoReTrack, enjoys not only high efficiency owing to reduced computation but also enhanced accuracy by distilling knowledge from the high-resolution tracker. In extensive experiments, LoReTrack with a 256² resolution consistently improves on the baseline with the same resolution, and shows competitive or better results compared to the 384² high-resolution Transformer tracker, while running 52% faster and saving 56% MACs. Moreover, LoReTrack is resolution-scalable. With a 128² resolution, it runs at 25 fps on a CPU with SUC scores of 64.9%/46.4% on LaSOT/LaSOT_ext, surpassing other CPU real-time trackers. Code is released at https://github.com/ShaohuaDong2021/LoReTrack.

Abstract:
We propose SemGauss-SLAM, a dense semantic SLAM system utilizing 3D Gaussian representation, that enables accurate 3D semantic mapping, robust camera tracking, and high-quality rendering simultaneously. In this system, we incorporate semantic feature embedding into 3D Gaussian representation, which effectively encodes semantic information within the spatial layout of the environment for precise semantic scene representation. Furthermore, we propose feature-level loss for updating 3D Gaussian representation, enabling higher-level guidance for 3D Gaussian optimization. In addition, to reduce cumulative drift in tracking and improve semantic reconstruction accuracy, we introduce semantic-informed bundle adjustment. By leveraging multi-frame semantic associations, this strategy enables joint optimization of 3D Gaussian representation and camera poses, resulting in low-drift tracking and accurate semantic mapping. Our SemGauss-SLAM demonstrates superior performance over existing radiance field-based SLAM methods in terms of mapping and tracking accuracy on Replica and ScanNet datasets, while also showing excellent capabilities in high-precision semantic segmentation and dense semantic mapping. Code will be available at https://github.com/IRMVLab/SemGauss-SLAM.

Abstract:
With a growing number of robots being deployed across diverse applications, robust multimodal anomaly detection becomes increasingly important. In robotic manipulation, failures typically arise from (1) robot-driven anomalies due to an insufficient task model or hardware limitations, and (2) environment-driven anomalies caused by dynamic environmental changes or external interferences. Conventional anomaly detection methods focus either on the first by low-level statistical modeling of proprioceptive signals or the second by deep learning-based visual environment observation, each with different computational and training data requirements. To effectively capture anomalies from both sources, we propose a mixture-of-experts framework that integrates the complementary detection mechanisms with a visual-language model for environment monitoring and a Gaussian-mixture regression-based detector for tracking deviations in interaction forces and robot motions. We introduce a confidence-based fusion mechanism that dynamically selects the most reliable detector for each situation. We evaluate our approach on both household and industrial tasks using two robotic systems, demonstrating a 60% reduction in detection delay while improving frame-wise anomaly detection performance compared to individual detectors.

Abstract:
Wearable robots offer a promising solution for quantitatively monitoring gait and providing systematic, adaptive assistance to promote patient independence and improve gait. However, due to significant interpersonal and intrapersonal variability in walking patterns, it is important to design robot controllers that can adapt to the unique characteristics of each individual. This paper investigates the potential of human-in-the-loop optimisation (HILO) to deliver personalised assistance in gait training. The Covariance Matrix Adaptation Evolution Strategy (CMA-ES) was employed to continuously optimise an assist-as-needed controller of a lower-limb exoskeleton. Six healthy individuals participated over a two-day experiment. Our results suggest that while the CMA-ES appears to converge to a unique set of stiffnesses for each individual, no measurable impact on the subjects’ performance was observed during the validation trials. These findings highlight the impact of human-robot co-adaptation and human behaviour variability, whose effect may be greater than potential benefits of personalising rule-based assistive controllers. Our work contributes to understanding the limitations of current personalisation approaches in exoskeleton-assisted gait rehabilitation and identifies key challenges for effective implementation of human-in-the-loop optimisation in this domain.

Abstract:
Robots equipped with reinforcement learning (RL) have the potential to learn a wide range of skills solely from a reward signal. However, obtaining a robust and dense reward signal for general manipulation tasks remains a challenge. Existing learning-based approaches require significant data, such as human demonstrations of success and failure, to learn task-specific reward functions. Recently, there has also been a growing adoption of large multi-modal foundation models for robotics that can perform visual reasoning in physical contexts and generate coarse robot motions for manipulation tasks. Motivated by this range of capability, in this work, we present Keypoint-based Affordance Guidance for Improvements (KAGI), a method leveraging rewards shaped by vision-language models (VLMs) for autonomous RL. State-of-the-art VLMs have demonstrated impressive zero-shot reasoning about affordances through keypoints, and we use these to define dense rewards that guide autonomous robotic learning. On diverse real-world manipulation tasks specified by natural language descriptions, KAGI improves the sample efficiency of autonomous RL and enables successful task completion in 30K online fine-tuning steps. Additionally, we demonstrate the robustness of KAGI to reductions in the number of in-domain demonstrations used for pre-training, reaching similar performance in 45K online fine-tuning steps.

Abstract:
Behavior cloning (BC) is a widely used method for learning from expert demonstrations due to its simplicity and efficiency. However, the reliability and stability of BC decline when facing data distribution shifts, especially in single-arm robots with limited fields of view. This study introduces a Geometrically and Historically Constrained Behavior Cloning (GHCBC) method, where an HCBC module utilizes visual and action histories to capture temporal dependencies, maximizing the use of available information, and a GCBC module incorporates high-level perceptual data, such as the relative poses of joints and end-effectors, to enhance BC performance. Experiments demonstrate that the GHCBC outperforms current SOTA BC methods, achieving a 31.5% improvement in simulation success rates and 48.4% in real-robot scenarios respectively. To the best of our knowledge, this is the first time that the GHCBC has been introduced in robotic BC where great potential is demonstrated for long-term tasks in real world environments.

Abstract:
Behaviour trees (BTs) are popular within robotics due to their reactivity, reusability, and modularity. BTs are often designed by hand using expert domain knowledge. However, robot environments contain sources of uncertainty which affect robot behaviour. It is challenging for human designers to reason over the effects of uncertainty up to the task horizon, limiting robot performance. For example, the chance of an unexpected blockage late along a robot’s route should encourage the robot to take an alternate path. Therefore, in this paper we refine the task-level behaviour encoded in a BT through planning under uncertainty. The refinement process modifies when action nodes are executed by reasoning over the effects of uncertainty, improving task performance. We first extract a state space from the BT and learn a set of Bayesian networks (BNs) which model the stochastic dynamics of robot actions. We then use the extracted state space and BNs to construct and solve a Markov decision process which captures robot execution. This produces a policy which describes the refined behaviour. We empirically demonstrate how our approach reduces the completion time for robot navigation and search tasks.

Abstract:
The aging and increasing complexity of infrastructures make efficient inspection planning more critical in ensuring safety. Thanks to sampling-based motion planning, many inspection planners are fast. However, they often require huge memory. This is particularly true when the structure under inspection is large and complex, consisting of many struts and pillars of various geometries and sizes. Such structures can be represented efficiently using implicit models, such as neural Signed Distance Functions (SDFs). However, most primitive computations used in sampling-based inspection planners have been designed to work efficiently with explicit environment models, which in turn requires the planner to use explicit environment models or perform frequent transformations between implicit and explicit environment models during planning. This paper proposes a set of primitive computations, called Inspection Planning Primitives with Implicit Models (IPIM), that enable sampling-based inspection planners to entirely use neural SDF representations during planning. Evaluation on three scenarios, including inspection of a complex real-world structure with over 92M triangular mesh faces, indicates that even a rudimentary sampling-based planner with IPIM can generate inspection trajectories of similar quality to those generated by the state-of-the-art planner, while using up to 70× less memory than the state-of-the-art inspection planner.
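
To make concrete what a primitive computation on an implicit model can look like, the sketch below runs a ray query directly against an SDF using sphere tracing, with no conversion to a mesh. It is an illustration only and not IPIM itself: an analytic sphere SDF stands in for a trained neural SDF, and the function names and parameters (`sphere_sdf`, `sphere_trace`, `eps`, `max_dist`) are hypothetical.

```python
import math


def sphere_sdf(p, center=(0.0, 0.0, 5.0), radius=1.0):
    """Signed distance from point p to a sphere (stand-in for a neural SDF)."""
    return math.dist(p, center) - radius


def sphere_trace(origin, direction, sdf, max_dist=20.0, eps=1e-4, max_steps=128):
    """March along a unit-length ray; return the hit distance, or None if no hit."""
    t = 0.0
    for _ in range(max_steps):
        point = tuple(o + t * d for o, d in zip(origin, direction))
        dist = sdf(point)
        if dist < eps:
            return t  # close enough to the surface to count as a hit
        t += dist  # safe step: no surface lies closer than `dist`
        if t > max_dist:
            return None
    return None


# A ray aimed at the sphere hits its near surface; a sideways ray misses.
hit = sphere_trace((0.0, 0.0, 0.0), (0.0, 0.0, 1.0), sphere_sdf)
miss = sphere_trace((0.0, 0.0, 0.0), (1.0, 0.0, 0.0), sphere_sdf)
print(hit, miss)
```

The same loop can serve as a line-of-sight test between a candidate viewpoint and a surface point, which is one kind of query an inspection planner needs.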

Abstract:
The current paradigm for motion planning generates solutions from scratch for every new problem, which consumes significant amounts of time and computational resources. For complex, cluttered scenes, motion planning approaches can often take minutes to produce a solution, while humans are able to accurately and safely reach any goal in seconds by leveraging their prior experience. We seek to do the same by applying data-driven learning at scale to the problem of motion planning. Our approach builds a large number of complex scenes in simulation, collects expert data from a motion planner, then distills it into a reactive neural policy. We then combine this with lightweight optimization to obtain a safe path for real-world deployment. We perform a thorough evaluation of our method on 64 motion planning tasks across four diverse environments with randomized poses, scenes and obstacles, in the real world, demonstrating an improvement of 23%, 17% and 79% in motion planning success rate over state-of-the-art sampling-based, optimization-based and learning-based planning methods, respectively. All code, models and datasets will be released on acceptance. Video results are available at mihdalal.github.io/neuralmotionplanner.

Abstract:
The tracking module of a visual-inertial SLAM system processes incoming image frames and IMU data to estimate the position of the frame in relation to the map. It is important for the tracking to complete in a timely manner for each frame to avoid poor localization or tracking loss. We therefore present a new approach which leverages GPU computing power to accelerate time-consuming components of tracking in order to improve its performance. These components include stereo feature matching and local map tracking. We implement our design inside the ORB-SLAM3 tracking process using CUDA. Our evaluation demonstrates an overall improvement in tracking performance of up to 2.8× on a desktop and Jetson Xavier NX board in stereo-inertial mode, using the well-known SLAM datasets EuRoC and TUM-VI.

Abstract:
For optical cooperative localization, which employs optical beacons with prior features as cooperative targets, a fundamental prerequisite is to ensure that the beacons are always captured by the vision sensors during the entire localization process. In other words, the applicability of optical cooperative localization depends on the relative range between beacons and vision sensors, yet a corresponding analysis method has so far been missing. In this work, we propose a general applicability analysis method for optical cooperative localization to fill this gap. We translate this problem into constructing a multi-constraint model that incorporates geometric and radiometric constraints to describe the relationship between optical sensor parameters and relative range or depth. For parameterized beacons and vision sensors, the geometric constraint is related to the imaging quantities and the radiometric constraint is determined by the radiation properties. Numerical evaluations are performed based on the range of parameters in practice, and real-world experiments are conducted to validate the effectiveness of the proposed applicability analysis. The results are instructive for real-world deployment of optical cooperative localization.

Abstract:
Robotic teleoperation over long communication distances poses challenges due to delays in commands and feedback from network latency. One simple yet effective strategy to reduce errors and increase performance under delay is to downscale the relative motion between the operating surgeon and the robot. The question remains as to what is the optimal scaling factor, and how this value changes depending on the level of latency as well as operator tendencies. We present user studies investigating the relationship between latency, scaling factor, and performance. The results of our studies demonstrate a statistically significant difference in performance between users and across scaling factors for certain levels of delay. These findings indicate that the optimal scaling factor for a given level of delay is specific to each user, motivating the need for personalized models for optimal performance. We present techniques to model the user-specific mapping of latency level to scaling factor for optimal performance, leading to an efficient and effective solution to optimizing performance of robotic teleoperation and specifically telesurgery under large communication delay.

Abstract:
Taking inspiration from the principle of locality in the field theories of physics [1], we aim to generate an impedance-related velocity field on SE(3) that characterizes the local interaction behaviors associated with specific tasks. Thanks to its locality, the desired impedance at each point of SE(3) can be effectively rendered by following the velocity field, without invoking the geometrical inconsistency of active stiffness [3]. First, we will introduce a nonlinear impedance equipped with a guided wrench and a velocity output to portray the global interaction behaviors. The guided wrench will be conceived to ensure that the velocity output at each point of SE(3) accurately reflects the local interaction behaviors. Next, we will employ model-matching approaches to regulate the robot’s end-effector based on the velocity output. Finally, we will conduct a case study on the Peg-in-Hole (PiH) task to demonstrate this impedance control technique.

Abstract:
In this paper, we present a transformer-based method to spatio-temporally associate apple fruitlets in stereo-images collected on different days and from different camera poses. State-of-the-art association methods in agriculture are dedicated towards matching larger crops using either high-resolution point clouds or temporally stable features, which are both difficult to obtain for smaller fruit in the field. To address these challenges, we propose a transformer-based architecture that encodes the shape and position of each fruitlet, and propagates and refines these features through a series of transformer encoder layers with alternating self and cross-attention. We demonstrate that our method is able to achieve an F1-score of 92.4% on data collected in a commercial apple orchard and outperforms all baselines and ablations. The code and data can be found at https://kantor-lab.github.io/fruitassociator/

Abstract:
Shared autonomy holds promise for improving the usability and accessibility of assistive robotic arms, but current methods often rely on costly expert demonstrations and remain static after pretraining, limiting their ability to handle real-world variations. Even with extensive training data, unforeseen challenges—especially those that fundamentally alter task dynamics, such as unexpected obstacles or spatial constraints—can cause assistive policies to break down, leading to ineffective or unreliable assistance. To address this, we propose ILSA, an Incrementally Learned Shared Autonomy framework that continuously refines its assistive policy through user interactions, adapting to real-world challenges beyond the scope of pre-collected data. At the core of ILSA is a structured fine-tuning mechanism that enables continual improvement with each interaction by effectively integrating limited new interaction data while helping to preserve prior knowledge, aiming for a balance between adaptation and generalization. A user study with 20 participants demonstrates ILSA’s effectiveness, showing faster task completion and improved user experience compared to static alternatives. Code and videos are available at https://ilsa-robo.github.io/.

Abstract:
Motion control of magnetic microswarms has attracted extensive attention due to its significance in microrobot-based biomedical applications such as targeted drug delivery. However, such reconfigurable microswarms are subject to complex interactions between individuals and environments, which make accurate modeling challenging. These complexities pose challenges for precise motion control, as traditional controllers often rely on precise mathematical models and manual parameter tuning, which limits their scalability and efficiency. Learning-based methods, such as Deep Reinforcement Learning (DRL), offer an alternative but require large datasets (usually on the order of millions) and extensive exploration, which may destabilize the microswarms in physical environments due to unreasonable actions during early training and therefore result in a sim-to-real gap. Moreover, traditional DRL focuses on instantaneous state-action mappings, neglecting the sequential dependencies critical for accurate motion control, leading to low tracking accuracy in complex scenarios. To address these challenges, we propose a Learning from Demonstration (LfD)-based motion control framework, which inherently encodes compensatory behaviors and task-specific adaptability into neural networks, enabling adaptive performance even under unmodeled disturbances. Furthermore, the neural networks consider a time series of microswarm states to determine the future control actions, enabling the system to learn sequential dependencies and transitions between states so as to ensure smooth and accurate motion control. Simulations and comparative experiments validate our framework’s effectiveness and demonstrate superior control accuracy and adaptability to changes in microswarm shape.

Abstract:
This paper presents an uncertainty-aware shared control and calibration method for micromanipulation using a digital microscope and a tool-mounted, multi-joint robotic arm, integrating real-time human intervention with a visual-motor policy. Our calibration algorithm leverages co-manipulation control to calibrate the hand-eye transformation without requiring knowledge of the kinematics of the microtool mounted on the robot while remaining robust to camera intrinsics errors. Experimental results show that the proposed calibration method achieves a 39.6% improvement in accuracy over established methods. Additionally, our control structure and calibration method reduce the time required to reach single-point targets from 5.74 s (best conventional method) to 1.91 s, and decrease trajectory tracking errors from 392 μm to 40 μm. These findings establish our method as a robust solution for improving reliability in high-precision biomedical micromanipulation.

Abstract:
Teleoperated robotic characters can perform expressive interactions with humans, relying on the operators’ experience and social intuition. In this work, we propose to create autonomous interactive robots, by training a model to imitate operator data. Our model is trained on a dataset of human-robot interactions, where an expert operator is asked to vary the interactions and mood of the robot, while the operator commands as well as the pose of the human and robot are recorded. Our approach learns to predict continuous operator commands through a diffusion process and discrete commands through a classifier, all unified within a single transformer architecture. We evaluate the resulting model in simulation and with a user study on the real system. We show that our method enables simple autonomous human-robot interactions that are comparable to the expert-operator baseline, and that users can recognize the different robot moods as generated by our model. Finally, we demonstrate a zero-shot transfer of our model onto a different robotic platform with the same operator interface.

Abstract:
Mobile robots rely on object detectors for perception and object localization in indoor environments. However, standard closed-set methods struggle to handle the diverse objects and dynamic conditions encountered in real homes and labs. Open-vocabulary object detection (OVOD), driven by Vision Language Models (VLMs), extends beyond fixed labels but still struggles with domain shifts in indoor environments. We introduce a Source-Free Domain Adaptation (SFDA) approach that adapts a pre-trained model without accessing source data. We refine pseudo labels via temporal clustering, employ multi-scale threshold fusion, and apply a Mean Teacher framework with contrastive learning. Our Embodied Domain Adaptation for Object Detection (EDAOD) benchmark evaluates adaptation under sequential changes in lighting, layout, and object diversity. Our experiments show significant gains in zero-shot detection performance and flexible adaptation to dynamic indoor conditions.

Abstract:
Fisheye cameras, with their ultra-wide field of view, offer significant benefits for depth estimation in applications such as autonomous navigation, robotics, and immersive imaging by capturing more scene content from a single viewpoint. However, their strong radial distortion and varying spatial resolution across the image pose substantial challenges for accurate depth prediction. We present a deep learning–based framework for fisheye depth estimation that addresses these challenges while leveraging the wide coverage advantage. During training, rectified and synchronized stereo image pairs are used, with the right image and an estimated initial depth map reconstructing the left image. A refined spatial consistency loss is formulated by combining Structural Similarity Index Measure (SSIM) and L1 loss, with gradient-based weighting to emphasize disparity edges. To overcome the limitations of photometric loss in disparity learning, we normalize pixel intensities to better correlate disparity with appearance features. A fisheye-specific depth refinement module incorporates an uncertainty map derived from an inconsistency mask and a distortion distribution map, mitigating the effects of occlusion and high-distortion regions. This uncertainty map is used to weight the temporal warping loss, enhancing robustness against distortion-prone areas. During inference, only a single fisheye image is required to produce an accurate depth map. Experimental results demonstrate that our method improves reconstruction fidelity and robustness, making it well-suited for real-world fisheye-based depth estimation tasks.
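
The SSIM-plus-L1 term mentioned in this abstract is often written, in self-supervised depth estimation, as α(1 − SSIM)/2 + (1 − α)·L1 between the target image and its reconstruction. The sketch below illustrates that general form only. It is not the paper's loss: the gradient-based edge weighting and the uncertainty map are omitted, SSIM is computed from whole-patch statistics rather than local windows, and `alpha = 0.85` and the function names are assumptions.

```python
from statistics import fmean


def ssim_global(x, y, c1=0.01 ** 2, c2=0.03 ** 2):
    """SSIM from whole-patch statistics; x and y are flat lists of intensities in [0, 1]."""
    mx, my = fmean(x), fmean(y)
    var_x = fmean((a - mx) ** 2 for a in x)
    var_y = fmean((b - my) ** 2 for b in y)
    cov = fmean((a - mx) * (b - my) for a, b in zip(x, y))
    numerator = (2 * mx * my + c1) * (2 * cov + c2)
    denominator = (mx ** 2 + my ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator


def photometric_loss(target, recon, alpha=0.85):
    """Weighted sum of a structural term (from SSIM) and a mean absolute error term."""
    l1 = fmean(abs(a - b) for a, b in zip(target, recon))
    return alpha * (1.0 - ssim_global(target, recon)) / 2.0 + (1.0 - alpha) * l1


# A perfect reconstruction has zero loss; a degraded one is penalised.
left = [0.1, 0.4, 0.8, 0.3, 0.6, 0.9]
perfect = list(left)
degraded = [0.2, 0.3, 0.5, 0.5, 0.4, 0.7]
print(photometric_loss(left, perfect), photometric_loss(left, degraded))
```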

Abstract:
In this study, we focus on enhancing the policy of a musculoskeletal arm to develop grasping abilities for objects of varying weights. The agent is modeled using MyoSuite, a platform with realistic biomechanics where muscles drive skeletal movement. We observed that optimizing only the control policy is insufficient for handling heavy object grasping, highlighting the limitations of traditional control-focused approaches. To address this issue, we shift our focus to muscle development by optimizing the arm’s muscle parameters. However, this remains challenging for two main reasons. First, the high dimensionality of the muscle parameter space makes it difficult to find optimal designs. Second, evaluating new muscle configurations requires training a control policy, leading to high computational costs. To tackle these challenges, we adopt two strategies. First, we simplify the problem by optimizing only the stiffness parameters, as they have the greatest impact on grasping performance. Second, we apply the Bayesian Morphology Optimization Method (BMO) to efficiently search the parameter space. Compared to genetic algorithms (GA), BMO finds better solutions with fewer evaluations. Experimental results show that BMO achieves similar rewards with 20% fewer iterations than GA and improves the success rate by 10%. In summary, muscle optimization provides an effective solution for grasping tasks, and BMO demonstrates efficient, robust, and generalizable performance in optimizing muscle parameters for such tasks.

Abstract:
Shape memory alloy (SMA) is widely employed in developing actuators. However, the lack of sensing capabilities limits its application. This study presents a sensing-actuation integrated device based on SMA and triboelectric nanogenerator (TENG), achieving tactile sensing while maintaining the actuation performance. The proposed core-shell structure not only repurposes the SMA spring as a key component of actuation and sensing, but also effectively isolates the actuation current to prevent interference with the sensing signal. The aerogel-modified silicone composite layer is applied to the SMA to reduce temperature rise by 30.56%, ensuring the sensing performance. With a rapid response time of less than 31 ms and stable sensing performance exceeding 2000 cycles, the SMA-TENG actuator reliably detects dynamically varying forces and bending. Additionally, it generates a maximum actuation force of 3.21 N, which represents a 12.2% increase compared to a standard SMA spring, due to the pre-stress introduced by the composite layer. Moreover, it can actuate a displacement of 7.7 cm and exhibits a power density of 7.15 × 10³ W/m³ (at 0.84 V, 6 A). Finally, we validate its haptic sensing capability during actuation, demonstrating its potential towards interactive robotic systems.

Abstract:
Real-world evaluation of perception-based planning models for robotic systems, such as autonomous vehicles, can be safely and inexpensively conducted offline, i.e., by computing model prediction error over a pre-collected validation dataset with ground-truth annotations. However, extrapolating from offline model performance to online settings remains a challenge. In these settings, seemingly minor errors can compound and result in test-time infractions or collisions. This relationship is understudied, particularly across diverse closed-loop metrics and complex urban maneuvers. In this work, we revisit this undervalued question in policy evaluation through an extensive set of experiments across diverse conditions and metrics. Based on analysis in simulation, we find an even worse correlation between offline and online settings than reported by prior studies, casting doubts on the validity of current evaluation practices and metrics for driving policies. Next, we bridge the gap between offline and online evaluation. We investigate an offline metric based on epistemic uncertainty, which aims to capture events that are likely to cause errors in closed-loop settings. The resulting metric achieves over 13% improvement in correlation compared to previous offline metrics. We further validate the generalization of our findings beyond the simulation environment in real-world settings, where even greater gains are observed.

Abstract:
In this paper, we present a 3D reconstruction and rendering framework termed Mesh-Learner that is natively compatible with traditional rasterization pipelines. It integrates mesh and spherical harmonic (SH) Texture (i.e., texture filled with SH coefficients) into the learning process to learn each mesh’s view-dependent radiance end-to-end. Images are rendered by interpolating surrounding SH Texels at each pixel’s sampling point using a novel interpolation method. Conversely, gradients from each pixel are back-propagated to the related SH Texels in SH Textures. Mesh-Learner exploits graphics features of the rasterization pipeline (texture sampling, deferred rendering) to render, which makes Mesh-Learner naturally compatible with tools (e.g., Blender) and tasks (e.g., 3D reconstruction, scene rendering, reinforcement learning for robotics) that are based on rasterization pipelines. Our system can train vast, unlimited scenes because we transfer only the SH Textures within the frustum to the GPU for training. At other times, the SH Textures are stored in CPU RAM, which results in moderate GPU memory usage. The rendering results on interpolation and extrapolation sequences in the Replica and FAST-LIVO2 datasets achieve state-of-the-art performance compared to existing methods (e.g., 3D Gaussian Splatting and M2-Mapping). To benefit the community, the code will be available at https://github.com/hku-mars/Mesh-Learner.

Abstract:
Accurate lane detection is critical for autonomous driving safety. In recent years, anchor-based detection methods have made significant progress. However, existing frameworks struggle in complex scenarios such as nighttime or dazzle light environments. Additionally, these methods exhibit limited geometric modeling and extrapolation capabilities for curvature variations in curved lanes. To tackle these challenges, we propose LaneMind, an innovative framework that combines human visual perception principles with advanced geometric modeling. Our approach features a dual-path architecture with a cross-path attention mechanism, enabling simultaneous local feature extraction and global structure modeling. The network outputs a confidence heatmap, followed by a skeleton-guided regression module that extracts medial-axis skeletons from high-probability lane regions to precisely localize lanes while maintaining topological continuity. Experimental results demonstrate that LaneMind achieves competitive performance across various benchmarks, particularly excelling in challenging curved lane scenarios and adverse lighting conditions. The framework’s robust performance and accurate detection quality highlight its potential for real-world autonomous driving applications.

Abstract:
We introduce Massively Multi-Task Model-Based Policy Optimization (M3PO), a scalable model-based reinforcement learning (MBRL) framework designed to address the challenges of sample efficiency in single-task settings and generalization in multi-task domains. Existing model-based approaches like DreamerV3 rely on generative world models that prioritize pixel-level reconstruction, often at the cost of control-centric representations, while model-free methods such as PPO suffer from high sample complexity and limited exploration. M3PO integrates an implicit world model, trained to predict task outcomes without reconstructing observations, with a hybrid exploration strategy that combines model-based planning and model-free uncertainty-driven bonuses. This approach eliminates the bias-variance trade-off inherent in prior methods (e.g., POME’s exploration bonuses) by using the discrepancy between model-based and model-free value estimates to guide exploration while maintaining stable policy updates via a trust-region optimizer. M3PO is introduced as an advanced alternative to existing model-based policy optimization methods.

Abstract:
Nowadays, laparoscopic surgery procedures face a trade-off between expensive, complex robotic systems and manual instruments with limited functionality. Fully robotic solutions offer precision but lack portability and intuitive control, while manual tools rely solely on the surgeon’s dexterity, limiting maneuverability and depth perception in confined spaces. To bridge this gap, we propose a Human-Guided Robotic-Assistance Handheld Continuum Medical Robot System (HRHC). This system simulates intuitive manual operation with robotic precision, extending the surgeon’s capabilities while maintaining portability. Additionally, a stereo vision system enhances real-time depth perception, improving spatial awareness in minimally invasive procedures.

Abstract:
Collective decision-making is a key function of autonomous robot swarms, enabling them to reach a consensus on actions based on environmental features. Existing strategies require the participation of all robots in the decision-making process, which is resource-intensive and prevents the swarm from allocating the robots to any other tasks. We propose Subset-Based Collective Decision-Making (SubCDM), which enables decisions using only a swarm subset. The construction of the subset is dynamic and decentralized, relying solely on local information. Our method allows the swarm to adaptively determine the size of the subset for accurate decision-making, depending on the difficulty of reaching a consensus. Simulation results using one hundred robots show that our approach achieves accuracy comparable to using the entire swarm while reducing the number of robots required to perform collective decision-making, making it a resource-efficient solution for collective decision-making in swarm robotics.

Abstract:
While recent advances in suction grasping have shown remarkable progress, significant challenges persist, particularly in cluttered and complex parcel handling scenarios. Current approaches are limited by (1) the lack of comprehensive parcel-specific suction grasp datasets and (2) poor adaptability to diverse object properties, including size, geometry, and texture. We address these challenges through two main contributions. Firstly, we introduce the Parcel-Suction-Dataset, a large-scale synthetic dataset containing 25 thousand cluttered scenes with 410 million precision-annotated suction grasp poses, generated via our novel geometric sampling algorithm. Secondly, we propose Diffusion-Suction, a framework that innovatively reformulates suction grasp prediction as a conditional generation task using denoising diffusion probabilistic models. Our method iteratively refines random noise into suction grasping scores through visual-conditioned guidance from point cloud observations, effectively learning spatial point-wise affordances from our synthetic dataset. Extensive experiments demonstrate that the simple yet efficient Diffusion-Suction achieves new state-of-the-art performance compared to previous models on both Parcel-Suction-Dataset and the public SuctionNet-1Billion benchmark. This work provides a robust foundation for advancing automated parcel handling systems in real-world applications.

Abstract:
This paper develops a scheduling protocol for a team of autonomous robots that operate on long-term persistent tasks. The proposed framework, called meSch, accounts for the limited battery capacity of the robots and ensures that the robots return to charge their batteries one at a time at the single charging station. The protocol is applicable to general nonlinear robot models under certain assumptions, does not require robots to be deployed at different times, and can handle robots with different discharge rates. We further consider the case when the charging station is mobile and its state information is subject to uncertainty. The feasibility of the algorithm in terms of ensuring persistent charging is given under certain assumptions, while the efficacy of meSch is validated through simulation and hardware experiments.

Abstract:
We consider the problem of building the map of an unknown environment using multiple mobile robots that have physical limitations arising from dynamics and a limited onboard battery. We consider the setting where the unknown environment has a set of charging stations that the robots must discover and visit often to recharge their battery during the map building process. We propose an iterative approach to solve the resulting energy-constrained multi-robot exploration problem. Our approach uses a combination of frontier-based exploration, graph-based path planning, and multi-robot task assignment. We show that our algorithm admits a computationally inexpensive implementation that enables rapid replanning, and propose sufficient conditions for recursive feasibility and finite-time termination. We validate our approach in several Gazebo-based realistic simulations.
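
Frontier-based exploration, one of the components named in this abstract, treats every known-free cell that borders unexplored space as a candidate goal. The sketch below shows that detection step on a small occupancy grid. It assumes a 4-connected grid and a hypothetical cell encoding (`FREE`, `OCCUPIED`, `UNKNOWN`), and it does not cover the paper's energy constraints, path planning, or task assignment.

```python
# Hypothetical cell encoding for a 2D occupancy grid.
FREE, OCCUPIED, UNKNOWN = 0, 1, -1


def find_frontiers(grid):
    """Return (row, col) of free cells with at least one 4-connected unknown neighbour."""
    rows, cols = len(grid), len(grid[0])
    frontiers = []
    for r in range(rows):
        for c in range(cols):
            if grid[r][c] != FREE:
                continue  # only explored, traversable cells can be frontiers
            for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
                nr, nc = r + dr, c + dc
                if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == UNKNOWN:
                    frontiers.append((r, c))
                    break  # one unknown neighbour is enough
    return frontiers


grid = [
    [FREE, FREE, UNKNOWN],
    [FREE, OCCUPIED, UNKNOWN],
    [FREE, FREE, FREE],
]
frontiers = find_frontiers(grid)
print(frontiers)  # [(0, 1), (2, 2)]
```

In an energy-constrained setting, each frontier would additionally have to pass a feasibility check before being assigned, for example that the robot can reach it and still get to a charging station.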

Abstract:
Quadruped robots are proliferating in industrial environments where they carry sensor payloads and serve as autonomous inspection platforms. Despite the advantages of legged robots over their wheeled counterparts on rough and uneven terrain, they are still unable to reliably negotiate a ubiquitous feature of industrial infrastructure: ladders. Inability to traverse ladders prevents quadrupeds from inspecting dangerous locations, puts humans in harm’s way, and reduces industrial site productivity. In this paper, we learn quadrupedal ladder climbing via a reinforcement learning-based control policy and a complementary hooked end effector. We evaluate the robustness in simulation across different ladder inclinations, rung geometries, and inter-rung spacings. On hardware, we demonstrate zero-shot transfer with an overall 90% success rate at ladder angles ranging from 70° to 90°, consistent climbing performance during unmodeled perturbations, and climbing speeds 232× faster than the state of the art. This work expands the scope of industrial quadruped robot applications beyond inspection on nominal terrains to challenging infrastructural features in the environment, highlighting synergies between robot morphology and control policy when performing complex skills. More information can be found at the project website: https://sites.google.com/leggedrobotics.com/climbingladders.

Abstract:
This paper proposes a novel method for multi-lane convoy formation control that uses large language models (LLMs) to tackle coordination challenges in dynamic highway environments. Each connected and autonomous vehicle in the convoy uses a knowledge-driven approach to make real-time adaptive decisions based on various scenarios. Our method enables vehicles to dynamically perform tasks, including obstacle avoidance, convoy joining/leaving, and escort formation switching, all while maintaining the overall convoy structure. We design an interlaced formation control strategy based on locally dynamic distributed graphs, ensuring the convoy remains stable and flexible. We conduct extensive experiments in the SUMO simulation platform across multiple traffic scenarios, and the results demonstrate that the proposed method is effective, robust, and adaptable to dynamic environments. The code is available at: https://github.com/chuduanfeng/ConvoyLLM.

Abstract:
By combining classical planning methods with large language models (LLMs), recent research such as LLM+P has enabled agents to plan for general tasks given in natural language. However, scaling these methods to general-purpose service robots remains challenging: (1) classical planning algorithms generally require a detailed and consistent specification of the environment, which is not always readily available; and (2) existing frameworks mainly focus on isolated planning tasks, whereas robots are often meant to serve in long-term continuous deployments, and therefore must maintain a dynamic memory of the environment which can be updated with multi-modal inputs and extracted as planning knowledge for future tasks. To address these two issues, this paper introduces L3M+P (Lifelong LLM+P), a framework that uses an external knowledge graph as a representation of the world state. The graph can be updated from multiple sources of information, including sensory input and natural language interactions with humans. L3M+P enforces rules for the expected format of the absolute world state graph to maintain consistency between graph updates. At planning time, given a natural language description of a task, L3M+P retrieves context from the knowledge graph and generates a problem definition for classical planners. Evaluated on household robot simulators and on a real-world service robot, L3M+P achieves significant improvement over baseline methods both on accurately registering natural language state changes and on correctly generating plans, thanks to the knowledge graph retrieval and verification.
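
The last planning step in this abstract, turning retrieved world-state knowledge into a problem definition for a classical planner, can be pictured with a small template-based sketch. This is not how L3M+P generates its problem files. The triple format, the predicate and object names, the `household` domain, and the `graph_to_pddl_problem` helper are all hypothetical, and the output is untyped PDDL.

```python
def graph_to_pddl_problem(name, domain, triples, goal):
    """Render (subject, relation, object) triples as an untyped PDDL problem string."""
    # Every node that appears in the graph becomes a PDDL object.
    objects = sorted({s for s, _, _ in triples} | {o for _, _, o in triples})
    # Every edge becomes one fact in the initial state.
    init = "\n    ".join(f"({rel} {subj} {obj})" for subj, rel, obj in triples)
    return (
        f"(define (problem {name})\n"
        f"  (:domain {domain})\n"
        f"  (:objects {' '.join(objects)})\n"
        f"  (:init\n    {init})\n"
        f"  (:goal {goal}))"
    )


# Hypothetical world state, e.g. after hearing "the cup is on the table".
triples = [
    ("cup", "on", "table"),
    ("robot", "at", "kitchen"),
    ("counter", "in", "kitchen"),
]
problem = graph_to_pddl_problem("serve", "household", triples, "(on cup counter)")
print(problem)
```

Keeping the world state in a graph and rendering the problem only at planning time means that an update such as moving the cup changes one edge, while the problem text is regenerated from the current graph.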

Abstract:
Learning from human demonstrations is an emerging trend for designing intelligent robotic systems. However, previous methods typically regard videos as instructions, simply dividing videos into action sequences for robotic repetition, which hinders generalization to diverse tasks or object instances. In this paper, we propose a different perspective, considering human demonstration videos not as mere instructions, but as a source of knowledge for robots. Motivated by this perspective and the remarkable comprehension and generalization capabilities exhibited by large language models (LLMs), we propose DigKnow, a method that DIstills Generalizable KNOWledge with a hierarchical structure. Specifically, DigKnow begins by converting human demonstration video frames into observation knowledge. This knowledge is then subjected to analysis to extract human action knowledge and further distilled into pattern knowledge that comprises task and object instances, resulting in the acquisition of generalizable knowledge with a hierarchical structure. In settings with different tasks or object instances, DigKnow retrieves relevant knowledge for the current task and object instances. Subsequently, the LLM-based planner conducts planning based on the retrieved knowledge, and the policy executes actions in line with the plan to achieve the designated task. Utilizing the retrieved knowledge, we validate and rectify planning and execution outcomes, resulting in a substantial enhancement of the success rate. Experimental results across a range of tasks and scenes demonstrate the effectiveness of this approach in facilitating real-world robots to accomplish tasks with the knowledge derived from human demonstrations.

Abstract:
Humanoid robots capable of autonomous operation in diverse environments have long been a goal for roboticists. However, autonomous manipulation by humanoid robots has largely been restricted to one specific scene, primarily due to the difficulty of acquiring generalizable skills and the high cost of in-the-wild humanoid robot data. In this work, we build a real-world robotic system to address this challenging problem. Our system is mainly an integration of 1) a whole-upper-body robotic teleoperation system to acquire human-like robot data, 2) a 25-DoF humanoid robot platform with a height-adjustable cart and a 3D LiDAR sensor, and 3) an improved 3D Diffusion Policy learning algorithm for humanoid robots to learn from noisy human data. We run more than 2000 episodes of policy rollouts on the real robot for rigorous policy evaluation. Empowered by this system, we show that using only data collected in one single scene and with only onboard computing, a full-sized humanoid robot can autonomously perform skills in diverse real-world scenarios. Videos are available at humanoid-manipulation.github.io.

Abstract:
Learning robust and generalizable manipulation skills from few demonstrations remains a key challenge in robotics, with broad applications in industrial automation and service robotics. Although recent imitation learning methods have achieved impressive results, they often require a large amount of demonstration data and struggle to generalize across different spatial variants. In this work, we propose a framework that learns 3D manipulation policies from only 10 demonstrations while achieving robust generalization to unseen spatial configurations through semantic-guided perception and spatial-equivariant policy learning. Our framework consists of two key modules: a Semantic Guided Perception module that extracts task-aware 3D representations from RGB-D inputs using semantic priors and a Spatial Generalized Decision module implementing a diffusion-based policy that preserves spatial equivariance through denoising. Central to our framework is a spatially equivariant training strategy, which adapts 2D data augmentation principles to 3D manipulation by maintaining gripper-object spatial relationships during trajectory augmentation. We validate our framework through extensive experiments on both simulation benchmarks and real-world robotic systems. Our method demonstrates a significant improvement in success rates over state-of-the-art approaches on a series of challenging tasks, particularly under significant object pose variations. This work shows significant potential to advance efficient and generalizable manipulation skill learning in real-world applications.

Abstract:
This paper proposes a novel force feedback system based on visual processing and Moiré patterns. The system uses a force sensor with a simple and efficient structure to eliminate the need for cables or other electronic components. A brass flexure plate was employed as the primary elastic element, leveraging the Moiré fringe principle for force measurements. High-speed cameras are used to capture real-time images that are then processed using advanced algorithms to accurately extract force data. Force feedback experiments were conducted to evaluate the performance of the proposed system, and the results were compared with those taken from conventional force sensors. The experimental results indicated that the proposed device consistently delivered precise target force outputs with outcomes that closely matched those obtained using traditional sensors. In addition, the system demonstrated robust performance in high-frequency measurements at 500 Hz, achieving an average force-feedback convergence time of approximately 0.1 s.

Abstract:
Plant roots typically exhibit a highly complex and dense architecture, incorporating numerous slender lateral roots and branches, which significantly hinders the precise capture and modeling of the entire root system. Additionally, roots often lack sufficient texture and color information, making it difficult to identify and track root traits using visual methods. Previous research on roots has been largely confined to 2D studies; however, exploring the 3D architecture of roots is crucial in botany. Since roots grow in real 3D space, 3D phenotypic information is more critical for studying genetic traits and their impact on root development. We have introduced a 3D root skeleton extraction method that efficiently derives the 3D architecture of plant roots from a few images. This method includes the detection and matching of lateral roots, triangulation to extract the skeletal structure of lateral roots, and the integration of lateral and primary roots. We developed a highly complex root dataset and tested our method on it. The extracted 3D root skeletons showed considerable similarity to the ground truth, validating the effectiveness of the model. This method can play a significant role in automated breeding robots. Through precise 3D root structure analysis, breeding robots can better identify plant phenotypic traits, especially root structure and growth patterns, helping practitioners select seeds with superior root systems. This automated approach not only improves breeding efficiency but also reduces manual intervention, making the breeding process more intelligent and efficient, thus advancing modern agriculture.

Abstract:
Reinforcement learning combined with sim-to-real transfer offers a general framework for developing locomotion controllers for legged robots. To facilitate successful deployment in the real world, smoothing techniques, such as low-pass filters and smoothness rewards, are often employed to develop policies with smooth behaviors. However, because these techniques are non-differentiable and usually require tedious tuning of a large set of hyperparameters, they tend to require extensive manual tuning for each robotic platform. To address this challenge and establish a general technique for enforcing smooth behaviors, we propose a simple and effective method that imposes a Lipschitz constraint on a learned policy, which we refer to as Lipschitz-Constrained Policies (LCP). We show that the Lipschitz constraint can be implemented in the form of a gradient penalty, which provides a differentiable objective that can be easily incorporated with automatic differentiation frameworks. We demonstrate that LCP effectively replaces the need for smoothing rewards or low-pass filters and can be easily integrated into training frameworks for many distinct humanoid robots. We extensively evaluate LCP in both simulation and real-world humanoid robots, producing smooth and robust locomotion controllers. All simulation and deployment code, along with complete checkpoints, is available on our project page: https://lipschitz-constrained-policy.github.io.
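
The gradient-penalty idea mentioned in the abstract above can be illustrated with a minimal sketch. It is not the authors' implementation: the one-layer `tanh` policy, the weight values, and the use of central finite differences in place of an automatic differentiation framework are all assumptions made for illustration, and the abstract does not specify the exact quantity being penalized. The sketch penalizes the squared norm of the Jacobian of the policy output with respect to the observation, which is one generic way to discourage large local Lipschitz constants.

```python
import math


def policy(obs, weights):
    """Hypothetical one-layer policy: action_i = tanh(sum_j W[i][j] * obs[j])."""
    return [math.tanh(sum(w * o for w, o in zip(row, obs))) for row in weights]


def input_jacobian(obs, weights, eps=1e-5):
    """Central finite-difference estimate of d(action)/d(obs).

    Returns jac where jac[j][i] approximates d action_i / d obs_j.
    """
    jac = []
    for j in range(len(obs)):
        plus = list(obs)
        minus = list(obs)
        plus[j] += eps
        minus[j] -= eps
        a_plus = policy(plus, weights)
        a_minus = policy(minus, weights)
        jac.append([(p - m) / (2 * eps) for p, m in zip(a_plus, a_minus)])
    return jac


def gradient_penalty(batch, weights):
    """Mean squared Frobenius norm of the input Jacobian over a batch."""
    total = 0.0
    for obs in batch:
        jac = input_jacobian(obs, weights)
        total += sum(v * v for column in jac for v in column)
    return total / len(batch)


if __name__ == "__main__":
    batch = [[0.0, 0.0], [0.3, -0.2]]
    sharp = [[2.0, 0.0]]   # assumed weights: output reacts strongly to obs[0]
    smooth = [[0.5, 0.0]]  # assumed weights: output reacts weakly to obs[0]
    print(gradient_penalty(batch, sharp), gradient_penalty(batch, smooth))
```

In a training loop, a term like this would typically be added to the reinforcement learning loss with a scalar coefficient, so that policies whose actions change sharply with small observation changes are penalized.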

Abstract:
Omnidirectional aerial robots offer full 6-DoF independent control over position and orientation, making them popular for aerial manipulation. Despite advancements in robotic autonomy, human operation remains essential in complex aerial environments. Existing teleoperation approaches for multirotors fail to fully leverage the additional DoFs provided by omnidirectional rotation. Additionally, the dexterity of human fingers should be exploited for more engaged interaction. In this work, we propose an aerial teleoperation system that brings the rotational flexibility of human hands into the unbounded aerial workspace. Our system includes two motion-tracking marker sets (one on the shoulder and one on the hand) along with a data glove to capture hand gestures. Using these inputs, we design four interaction modes for different tasks, including Spherical Mode and Cartesian Mode for long-range moving, Operation Mode for precise manipulation, as well as Locking Mode for temporary pauses, where the hand gestures are utilized for seamless mode switching. We evaluate our system on a vertically mounted valve-turning task in the real world, demonstrating how each mode contributes to effective aerial manipulation. This interaction framework bridges human dexterity with aerial robotics, paving the way for enhanced aerial teleoperation in unstructured environments.

Abstract:
Workspace limitations restrict the operational capabilities and range of motion for systems with robotic arms. Maximizing workspace utilization has the potential to provide better solutions for aerial manipulation tasks, increasing the system’s flexibility and operational efficiency. In this paper, we introduce a novel planning framework for aerial grasping that maximizes workspace utilization. We formulate an optimization problem to optimize the aerial manipulator’s trajectory, incorporating task constraints to achieve efficient manipulation. To address the challenge of incorporating the delta arm’s non-convex workspace into optimization constraints, we leverage a Multilayer Perceptron (MLP) to map the point positions to feasibility probabilities. Furthermore, we employ Reversible Residual Networks (RevNet) to approximate the complex forward kinematics of the delta arm, utilizing its efficient model gradients to further eliminate workspace constraints. We validate our methods in simulations and real-world experiments to demonstrate their effectiveness.

Abstract:
Motivated by human-chains in rescue missions, this paper proposes a scalable path planning algorithm for multiple mobile robots that are tethered to one another in a chain topology with finite-length tethers. Specifically, our approach trades off optimality for scalability and computational tractability by adding some simplifying, yet realistic constraints that can significantly reduce computation. In particular, by maintaining the existence of tether configurations that coincide with collision-free, feasible paths for the robots, we remove the need to check that the tether configurations are collision-free, which is often a bottleneck since the tethers are infinite-dimensional. Our proposed path planning framework for tethered robot chains builds upon sampling-based algorithms such as RRT, BIT, and ABIT. Finally, we prove the probabilistic completeness of the approach, ensuring reliable path generation, and demonstrate the effectiveness of our approach in simulation experiments.

Abstract:
Intelligent devices for supporting persons with vision impairment are becoming more widespread, but they lag behind the advances in intelligent driver assistance systems. As a first step forward, this work discusses the integration of the risk model technology, previously used in autonomous driving and advanced driver assistance systems, into an assistance device for persons with vision impairment. The risk model computes a probabilistic collision risk given object trajectories, which has previously been shown to give better indications of an object's collision potential than distance or time-to-contact measures in vehicle scenarios. In this work, we show that the risk model is also superior in warning persons with vision impairment about dangerous objects. Our experiments demonstrate that the warning accuracy of the risk model is 67% while both distance and time-to-contact measures reach only 51% accuracy for real-world data.

Abstract:
Microscale droplet-based robotic systems have emerged as a promising platform for targeted drug delivery, minimally invasive surgery, and lab-on-a-chip applications. Here, we report a novel microrobotic swarm based on microdroplets of silicone oil-based ferrofluid, which exhibits excellent biocompatibility and chemical inertness. By modulating three-dimensional magnetic fields, we achieved reconfigurable self-organized patterns of an aggregated state, a dispersed state, and a chain state. We established a dynamic model and reproduced the three states via numerical simulations. Furthermore, we discovered two locomotion modes: sliding and rolling. Utilizing the sliding mode, we navigated the swarm through narrow and complex channels and accomplished directional transport of bubbles, enabling both translational and rotational movements.

Abstract:
High-resolution tactile sensing and advanced computational models have accelerated progress in robotic grasping; however, real-time, stable manipulation of smooth and fragile objects still lags behind. The challenges are twofold: first, the robot must detect incipient slip at sub-millimeter scales in real time; second, the system must issue millisecond-level early warnings before true instability occurs so that the controller has sufficient time to react. To address these challenges, we propose PANDAS (Prediction AND Detection of Accurate Slippage), a framework that integrates a physics-informed, multimodal spatiotemporal network for slip detection with a probabilistic temporal reasoning module for forecasting near-future risk. Experimental results demonstrate that the proposed method achieves a slip sensitivity of 94.6%, a response latency of 28 ms, and an early-warning lead time of 32 ms. Moreover, under 5 dB Gaussian noise, it maintains a high F1-score of 92.3%, validating its robustness, predictive capability, and suitability for edge deployment in dynamic, high-noise environments.

Abstract:
We present MoE-Loco, a Mixture of Experts (MoE) framework for multitask locomotion for legged robots. Our method enables a single policy to handle diverse terrains, including bars, pits, stairs, slopes, and baffles, while supporting quadrupedal and bipedal gaits. Using MoE, we mitigate the gradient conflicts that typically arise in multitask reinforcement learning, improving both training efficiency and performance. Our experiments demonstrate that different experts naturally specialize in distinct locomotion behaviors, which can be leveraged for task migration and skill composition. We further validate our approach in both simulation and real-world deployment, showcasing its robustness and adaptability.
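
As a rough illustration of how a Mixture of Experts policy combines its experts, the sketch below blends expert outputs with softmax gate weights. It is a generic forward pass under stated assumptions, not the MoE-Loco architecture: the linear gate, the two toy experts, and all names (`moe_action`, `gate_weights`) are hypothetical, and the abstract does not describe the actual gating network or expert structure.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_action(obs, gate_weights, experts):
    """Blend expert actions using softmax gate weights.

    gate_weights: one weight vector per expert (assumed linear gate).
    experts: list of callables mapping an observation to an action list.
    Returns the blended action and the gate distribution.
    """
    logits = [sum(w * o for w, o in zip(row, obs)) for row in gate_weights]
    gates = softmax(logits)
    outputs = [expert(obs) for expert in experts]
    action = [
        sum(g * out[i] for g, out in zip(gates, outputs))
        for i in range(len(outputs[0]))
    ]
    return action, gates


if __name__ == "__main__":
    # Two toy experts standing in for, e.g., different locomotion behaviors.
    experts = [lambda o: [1.0, 0.0], lambda o: [0.0, 1.0]]
    gate_weights = [[10.0, 0.0], [0.0, 0.0]]  # assumed values
    action, gates = moe_action([1.0, 0.0], gate_weights, experts)
    print(action, gates)
```

Inspecting the returned gate distribution per observation is one simple way to see which expert dominates on which input, which relates to the expert specialization the abstract reports.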

Abstract:
Humanoid robots are engineered to navigate terrains akin to those encountered by humans, which necessitates human-like locomotion and perceptual abilities. Currently, the most reliable controllers for humanoid motion rely exclusively on proprioception, a reliance that becomes both dangerous and unreliable when coping with rugged terrain. Although the integration of height maps into perception can enable proactive gait planning, robust utilization of this information remains a significant challenge, especially when exteroceptive perception is noisy. To surmount these challenges, we propose a solution based on a teacher-student distillation framework. In this paradigm, an oracle policy accesses noise-free data to establish an optimal reference policy, while the student policy not only imitates the teacher’s actions but also simultaneously trains a world model with a variational information bottleneck for sensor denoising and state estimation. Extensive evaluations demonstrate that our approach markedly enhances performance in scenarios characterized by unreliable terrain estimations. Moreover, we conducted rigorous testing in both challenging urban settings and off-road environments, where the model successfully traversed 2 km of varied terrain without external intervention.

Abstract:
An open problem in mobile manipulation is how to represent objects and scenes in a unified manner so that robots can use both for navigation and manipulation. The latter requires capturing intricate geometry while understanding fine-grained semantics, whereas the former involves capturing the complexity inherent at an expansive physical scale. In this work, we present GeFF (Generalizable Feature Fields), a scene-level generalizable neural feature field that acts as a unified representation for both navigation and manipulation that performs in real-time. To do so, we treat generative novel view synthesis as a pre-training task, and then align the resulting rich scene priors with natural language via CLIP feature distillation. We demonstrate the effectiveness of this approach by deploying GeFF on a quadrupedal robot equipped with a manipulator. We quantitatively evaluate GeFF’s ability for open-vocabulary object-/part-level manipulation and show that GeFF outperforms point-based baselines in runtime and storage-accuracy trade-offs, with qualitative examples of semantics-aware navigation and articulated object manipulation.

Abstract:
Imitation learning enables robots to learn new tasks from human examples. One fundamental limitation while learning from humans is causal confusion. Causal confusion occurs when the robot’s observations include both task-relevant and extraneous information: for instance, a robot’s camera might see not only the intended goal, but also clutter and changes in lighting within its environment. Because the robot does not know which aspects of its observations are important a priori, it often misinterprets the human’s examples and fails to learn the desired task. To address this issue, we highlight that — while the robot learner may not know what to focus on — the human teacher does. In this paper we propose that the human proactively marks key parts of their task with small, lightweight beacons. Under our framework (RECON) the human attaches these beacons to task-relevant objects before providing demonstrations: as the human shows examples of the task, beacons track the position of marked objects. We then harness this offline beacon data to train a task-relevant state embedding. Specifically, we embed the robot’s observations to a latent state that is correlated with the measured beacon readings: in practice, this causes the robot to autonomously filter out extraneous observations and make decisions based on features learned from the beacon data. Our simulations and a real robot experiment suggest that this framework for human-placed beacons mitigates causal confusion. Indeed, we find that using RECON significantly reduces the number of demonstrations needed to convey the task, lowering the overall time required for human teaching. See videos here: https://youtu.be/oy85xJvtLSU

Abstract:
While most optical tactile sensors rely on measuring surface displacement, insights from continuum mechanics suggest that measuring shear strain provides key information for tactile sensing. In this work, we introduce an optical tactile sensing principle based on shear strain detection. A silicone rubber layer, dyed with color inks, is used to quantify the shear magnitude of the sensing layer. This principle was validated using the NUSense camera-based tactile sensor. The wide-angle camera captures the elongation of the soft pad under mechanical load, a phenomenon attributed to the Poisson effect. We tested the robustness of the sensor by subjecting the outermost layer to multiple load cycles (8 N) using a ball-head indenter with a 5 mm radius. The physical and optical properties of the inked pad proved essential and remained stable over time, exhibiting only low variance.

Abstract:
Conventional crop production, which is essential for providing food, feed, fuel, and fiber for our society, relies heavily on harmful herbicides to control weeds. Instead, agricultural robots could remove weeds more sustainably. However, these robots require a generalizable perception system that can locate weeds, enabling automatic removal of weeds. Specifically, they need to perform crop-weed semantic segmentation, which locates and distinguishes between the crop and the weed plants with pixel-level resolution. However, most existing crop-weed semantic segmentation methods are fully supervised and require expensive and labor-intensive pixel-wise labeling of the training data. To avoid the costly labeling process, we address the problem of unsupervised crop-weed segmentation in this paper. Unlike previous approaches, we leverage the idea that weeds are "weird" plants that occur less frequently and are highly variable in appearance, and reframe the problem as an anomaly segmentation problem. We propose an approach to segment weeds as anomalous plants by categorizing plants in the feature space of a pretrained foundation model. Our approach curates a bag-of-features representation of crop features and models the manifold of crop plants as hyperspheres. During inference, it classifies vegetation segments of the image with features within this manifold as crop plants and all other plants as weeds. Our experiments show that our zero-shot anomaly segmentation method can perform crop-weed segmentation on several datasets from real crop fields.
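
The hypersphere-based decision rule described in the abstract above can be sketched as follows. This is a simplified illustration, not the authors' method: it assumes that each stored crop feature is the center of a hypersphere with one shared radius, and that Euclidean distance is used. The feature vectors, the radius, and the function names are hypothetical; the abstract does not state how the hyperspheres are constructed or which foundation model supplies the features.

```python
import math


def is_crop(feature, crop_bag, radius):
    """True if the feature lies inside any hypersphere around a stored crop feature."""
    return any(math.dist(feature, center) <= radius for center in crop_bag)


def classify_segments(segment_features, crop_bag, radius):
    """Label each vegetation segment as 'crop' (inside the manifold) or 'weed'."""
    return [
        "crop" if is_crop(feature, crop_bag, radius) else "weed"
        for feature in segment_features
    ]


if __name__ == "__main__":
    # Toy 2-D features standing in for high-dimensional foundation-model features.
    crop_bag = [[0.0, 0.0], [10.0, 0.0]]
    segments = [[0.5, 0.0], [5.0, 0.0], [10.0, 0.9]]
    print(classify_segments(segments, crop_bag, radius=1.0))
```

Because only crop features are stored, no weed labels are needed: anything that falls outside every hypersphere is treated as anomalous vegetation.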

Abstract:
This paper studies coordinated behaviors which arise when a team of robots must traverse hazardous environments in the presence of an adversary. We formulate the scenario as a novel non-cooperative stochastic game in which the "blue" team of robots moves in an environment modeled by a time-varying graph, attempting to reach some goal with minimum cost, while the "red" player controls how the graph changes to maximize the cost. In addition to a numerical method to compute the Nash equilibrium, we also present novel theoretical analysis on security strategies that provides performance bounds in a more computationally efficient way. Through numerical simulations, we demonstrate the emergence of beneficial coordinated behavior, where the robots split up and/or synchronize to traverse risky edges.
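
To make the notion of a security strategy concrete, the sketch below computes a pure security strategy for the cost-minimizing player in a one-shot matrix game. This is a deliberate simplification and an assumption: the paper studies a stochastic game on a time-varying graph, whose security strategies and bounds are not given in the abstract. The cost values and the row/column interpretation are hypothetical.

```python
def security_strategy(cost):
    """Pure security strategy for the minimizing (blue) player.

    cost[i][j] is blue's cost when blue plays row i and red plays column j.
    Returns the row with the smallest worst-case cost and that cost, which
    upper-bounds blue's cost for that row no matter what red plays.
    """
    worst_case = [max(row) for row in cost]
    best_row = min(range(len(cost)), key=lambda i: worst_case[i])
    return best_row, worst_case[best_row]


if __name__ == "__main__":
    # Hypothetical costs: row 0 = robots travel together, row 1 = robots split up;
    # columns are two ways the adversary can change the graph.
    cost = [[3.0, 7.0],
            [4.0, 5.0]]
    print(security_strategy(cost))
```

Computing such a worst-case guarantee only requires a maximum per row and a minimum over rows, which is why security strategies can give performance bounds more cheaply than solving for an equilibrium.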

Abstract:
While agricultural robotics has made great strides in recent years, manipulation of plants for tasks such as staking and harvesting remains highly challenging due to the high variability in dynamics and deformable nature of plants. To address the challenges created by dynamics uncertainty, we develop a system applying an occupancy-belief planning concept to plant manipulation for staking. We first train a dynamics model that predicts a per-pixel probability that the plant occupies the corresponding slice in space after a drag action using a large set of simulators. This model is then used to plan a manipulation action that maximizes the probability that areas swept by the stake tying tool’s operating region are occupied by the plant, and minimizes the probability that areas swept by the non-operating side regions of the tool are occupied. We demonstrate our method both in simulation and with zero-shot sim-to-real transfer to a physical implementation. We show that adding consideration of belief through use of occupancy-belief allows our method to outperform both the visual foresight type approaches it is based on and other baselines and ablations, especially in the real-world case.

Abstract:
Equipped with Large Language Models (LLMs), human-centered robots are now capable of performing a wide range of tasks that were previously deemed challenging or unattainable. However, merely completing tasks is insufficient for cognitive robots, who should learn and apply human preferences to future scenarios. In this work, we propose a framework that combines human preferences with physical constraints, requiring robots to complete tasks while considering both. Firstly, we developed a benchmark of everyday household activities, which are often evaluated based on specific preferences. We then introduced In-Context Learning from Human Feedback (ICLHF), where human feedback comes from direct instructions and adjustments made intentionally or unintentionally in daily life. Extensive sets of experiments, testing the ICLHF to generate task plans and balance physical constraints with preferences, have demonstrated the efficiency of our approach.

Abstract:
Similarly to aerial drones, small-scale underwater robots are prone to external wrenches resulting from disturbances such as water currents or collisions. Estimating the external wrench acting on an underwater robot is challenging due to non-linear hydrodynamic effects and the bottleneck of being limited to onboard sensing. We build on a model-based approach for aerial wrench estimation and extend it to the underwater domain. Various modifications are applied, such as capturing hydrodynamic effects, and new sensory information is integrated, for example, via Doppler velocity log (DVL). We evaluate the performance of the proposed approach through a series of experiments. Moreover, we assess the effect of fusing various sensor configurations and their respective influence on the wrench estimate, including low-cost vs. high-end IMU and DVL. Our adapted approach from the aerial domain delivers good results in estimating external wrenches on underwater robots. While the IMU quality is found to be less important, considering the underwater domain-specific damping terms is critical.

Abstract:
Multi-agent collaborative perception enhances each agent’s perceptual capabilities by sharing sensing information to cooperatively perform robot perception tasks. This approach has proven effective in addressing challenges such as sensor deficiencies, occlusions, and long-range perception. However, existing representative collaborative perception systems transmit intermediate feature maps, such as bird’s-eye view (BEV) representations, which contain a significant amount of non-critical information, leading to high communication bandwidth requirements. To enhance communication efficiency while preserving perception capability, we introduce CoCMT, an object-query-based collaboration framework that optimizes communication bandwidth by selectively extracting and transmitting essential features. Within CoCMT, we introduce the Efficient Query Transformer (EQFormer) to effectively fuse multi-agent object queries and implement a synergistic deep supervision to enhance the positive reinforcement between stages, leading to improved overall performance. Experiments on OPV2V and V2V4Real datasets show CoCMT outperforms state-of-the-art methods while drastically reducing communication needs. On V2V4Real, our model (Top-50 object queries) requires only 0.416 Mb bandwidth—83 times less than SOTA methods—while improving AP@70 by 1.1%. This efficiency breakthrough enables practical collaborative perception deployment in bandwidth-constrained environments without sacrificing detection accuracy. The code and models are open-sourced through the following link: https://github.com/taco-group/COCMT.

Abstract:
In recent years, capsule robots have gained wide acceptance among doctors and patients for the examination of gastrointestinal diseases due to their non-invasive, safe, and painless advantages. However, the image resolution captured by capsule robots is limited by space size and power, which hinders doctors' ability to accurately assess patients' stomach conditions and real-time control of the capsule robot. This paper proposes the design of two super-resolution networks for capsule robot videos. The first network, EndoVSR, is a high-performance offline video super-resolution network based on a generative adversarial network. It is designed to enhance the resolution of captured videos during offline processing. The second network, Bi-RUN, is a real-time video super-resolution network based on recurrent neural networks. It is designed to enhance the resolution of videos in real-time, enabling doctors to have a clearer view of the stomach condition during the examination. Extensive training and verification of these networks have been conducted using different datasets. All the performance indicators achieved leading positions. Furthermore, simulation experiments were carried out on pig stomachs in vitro to further validate the performance of the proposed networks in practical applications.

Abstract:
Robotic systems are routinely used in the logistics industry to enhance operational efficiency, but the design of robot workspaces remains a complex and manual task, which limits the system’s flexibility to changing demands. This paper aims to automate robot workspace design by proposing a computational framework to generate a budget-minimizing layout by selectively placing stationary robots on a floor grid to sort packages from given input and output locations. Finding a good layout that minimizes the hardware budget while ensuring motion feasibility is a challenging combinatorial problem with nonconvex motion constraints. We propose a new optimization-based approach that models layout planning as a subgraph optimization problem subject to network flow constraints. Our core insight is to abstract away motion constraints from the layout optimization by precomputing a kinematic reachability graph and then extract the optimal layout on this ground graph. We validate the motion feasibility of our approach by proposing a simple task assignment and motion planning technique. We benchmark our algorithm on problems with various grid resolutions and number of outputs and show improvements in memory efficiency over a heuristic search algorithm. In addition, we demonstrate that our algorithm can be extended to handle various types of robot manipulators and conveyor belts, box payload constraints, and cost assignments.

Affiliations: Insect Robotics Group - Institute for Perception, Action and Behaviour, School of Informatics, University of Edinburgh. Informatics Forum, Edinburgh, United Kingdom; Research Office, Mohamed bin Zayed University of Artificial Intelligence, Abu Dhabi, United Arab Emirates; School of Engineering, Coventry University, Cairo, Egypt; Robotics Group, Faculty of Science and Technology, Norwegian University of Life Sciences (NMBU), Norway

Abstract:
In this paper, we present a novel prototype for harvesting table-top grown strawberries that is minimalist in its footprint when interacting with the fruit. In our methodology, a smooth trapper manipulates the stem into a precise groove location at which a distant laser beam is focused. The tool reaches temperatures as high as 188 °C, thereby killing germs and preventing the spread of local plant diseases. The burnt stem wound preserves water content and, in turn, the fruit's shelf life. Cycle and cut times of 5.56 and 2.88 seconds, respectively, are achieved in a successful indoor harvesting demonstration. Extensive experiments are performed to optimize the laser spot diameter and lateral speed against the cutting time.

Abstract:
Vision-based control relies on accurate perception to achieve robustness. However, image distribution changes caused by sensor noise, adverse weather, and dynamic lighting can degrade perception, leading to suboptimal control decisions. Existing approaches, including domain adaptation and adversarial training, improve robustness but struggle to generalize to unseen corruptions while introducing computational overhead. To address this challenge, we propose a real-time image repair module that restores corrupted images before they are used by the controller. Our method leverages generative adversarial models, specifically CycleGAN and pix2pix, for image repair. CycleGAN enables unpaired image-to-image translation to adapt to novel corruptions, while pix2pix exploits paired image data when available to improve the quality. To ensure alignment with control performance, we introduce a control-focused loss function that prioritizes perceptual consistency in repaired images. We evaluated our method in a simulated autonomous racing environment with various visual corruptions. The results show that our approach significantly improves performance compared to baselines, mitigating distribution shift and enhancing controller reliability.

Abstract:
This paper presents ETA-IK, a novel Execution-Time-Aware Inverse Kinematics method tailored for dual-arm robotic systems. The primary goal is to optimize motion execution time by leveraging the redundancy of the entire system, specifically in tasks where only the relative pose of the robots is constrained, such as dual-arm scanning of unknown objects. Unlike traditional IK methods using surrogate metrics, our approach directly optimizes execution time while implicitly considering collisions. A neural network based execution time approximator is employed to predict time-efficient joint configurations while accounting for potential collisions. Through experimental evaluation on a system composed of a UR5 and a KUKA iiwa robot, we demonstrate significant reductions in execution time. The proposed method outperforms conventional approaches, showing improved motion efficiency without sacrificing positioning accuracy.

Abstract:
The specification and validation of robotics applications require bridging the gap between formulating requirements and systematic testing. This often involves manual and error-prone tasks that become more complex as requirements, design, and implementation evolve. To address this challenge systematically, we propose extending behaviour-driven development (BDD) to define and verify acceptance criteria for robotic systems. In this context, we use domain-specific modelling and represent composable BDD models as knowledge graphs for robust querying and manipulation, facilitating the generation of executable testing models. A domain-specific language helps to efficiently specify robotic acceptance criteria. We explore the potential for automated generation and execution of acceptance tests through a software architecture that integrates a BDD framework, Isaac Sim, and model transformations, focusing on acceptance criteria for pick-and-place applications. We tested this architecture with an existing pick-and-place implementation and evaluated the execution results, which show how this application behaves and fails differently when tested against variations of the agent and environment. This research advances the rigorous and automated evaluation of robotic systems, contributing to their reliability and trustworthiness.

Abstract:
Learning robotic manipulation skills from vision is a promising approach for developing robotics applications that can generalize broadly to real-world scenarios. As such, many approaches to enable this vision have been explored with fruitful results. Particularly, object-centric representation methods have been shown to provide better inductive biases for skill learning, leading to improved performance and generalization. Nonetheless, we show that object-centric methods can struggle to learn simple manipulation skills in multi-object environments. Thus, we propose DOCIR, an object-centric framework that introduces a disentangled representation for objects of interest, obstacles, and robot embodiment. We show that this approach leads to state-of-the-art performance for learning pick and place skills from visual inputs in multi-object environments and generalizes at test time to changing objects of interest and distractors in the scene. Furthermore, we show its efficacy both in simulation and zero-shot transfer to the real world.

Abstract:
Vision and touch are two fundamental sensory modalities for robots, offering complementary information that enhances perception and manipulation tasks. Previous research has attempted to jointly learn visual-tactile representations to extract more meaningful information. However, these approaches often rely on direct combination, such as feature addition and concatenation, for modality fusion, which tends to result in poor feature integration. In this paper, we propose ConViTac, a visual-tactile representation learning network designed to enhance the alignment of features during fusion using contrastive representations. Our key contribution is a Contrastive Embedding Conditioning (CEC) mechanism that leverages a contrastive encoder pretrained through self-supervised contrastive learning to project visual and tactile inputs into unified latent embeddings. These embeddings are used to couple visual-tactile feature fusion through cross-modal attention, aiming at aligning the unified representations and enhancing performance on downstream tasks. We conduct extensive experiments to demonstrate the superiority of ConViTac in the real world over current state-of-the-art methods and the effectiveness of our proposed CEC mechanism, which improves accuracy by up to 12.0% in material classification and grasping prediction tasks. More details are on our project website.

Abstract:
How can robots learn dexterous grasping skills efficiently and apply them adaptively based on user instructions? This work tackles two key challenges: efficient skill acquisition from limited human demonstrations and context-driven skill selection. We introduce AdaDexGrasp, a framework that learns a library of grasping skills from a single human demonstration per skill and selects the most suitable one using a vision-language model (VLM). To improve sample efficiency, we propose a trajectory following reward that guides reinforcement learning (RL) toward states close to a human demonstration while allowing flexibility in exploration. To learn beyond the single demonstration, we employ curriculum learning, progressively increasing object pose variations to enhance robustness. At deployment, a VLM retrieves the appropriate skill based on user instructions, bridging low-level learned skills with high-level intent. We evaluate AdaDexGrasp in both simulation and real-world settings, showing that our approach significantly improves RL efficiency and enables learning human-like grasp strategies across varied object configurations. Finally, we demonstrate zero-shot transfer of our learned policies to a real-world PSYONIC Ability Hand, with a 90% success rate across objects, significantly outperforming the baseline.

Abstract:
Generating high-quality motion plans for multiple robot arms is challenging due to the high dimensionality of the system and the potential for inter-arm collisions. Traditional motion planning methods often produce motions that are suboptimal in terms of smoothness and execution time for multi-arm systems. Post-processing via shortcutting is a common approach to improve motion quality for efficient and smooth execution. However, in multi-arm scenarios, optimizing one arm’s motion must not introduce collisions with other arms. Although existing multi-arm planning works often use some form of shortcutting techniques, their exact methodology and impact on performance are often vaguely described. In this work, we present a comprehensive study quantitatively comparing existing shortcutting methods for multi-arm trajectories across diverse simulated scenarios. We carefully analyze the pros and cons of each shortcutting method and propose two simple strategies for combining these methods to achieve the best performance-runtime tradeoff. Video, code, and dataset are available at https://philip-huang.github.io/mr-shortcut/.
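
As a rough illustration of the shortcutting idea described above, the following sketch applies randomized shortcutting to a composite joint-space path (the joint values of all arms stacked into one configuration). It is not one of the methods compared in the paper: `shortcut_path`, `is_segment_free`, the fixed-resolution collision check, and the user-supplied `is_free` predicate are all hypothetical names and simplifications introduced here. Because the check runs on the composite configuration, a shortcut is accepted only if no arm collides with another arm or the environment along the new segment.

```python
import random
from typing import Callable, List, Sequence

Config = Sequence[float]  # composite configuration: joints of all arms stacked


def interpolate(a: Config, b: Config, t: float) -> List[float]:
    """Linear interpolation between two composite configurations."""
    return [x + t * (y - x) for x, y in zip(a, b)]


def is_segment_free(a: Config, b: Config,
                    is_free: Callable[[Config], bool],
                    steps: int = 20) -> bool:
    """Check a straight segment at a fixed resolution (assumption: this
    resolution is fine enough for the scene at hand)."""
    return all(is_free(interpolate(a, b, i / steps)) for i in range(steps + 1))


def shortcut_path(path: List[Config],
                  is_free: Callable[[Config], bool],
                  iterations: int = 200,
                  seed: int = 0) -> List[List[float]]:
    """Repeatedly try to replace a sub-path by a straight segment."""
    rng = random.Random(seed)
    result = [list(q) for q in path]
    for _ in range(iterations):
        if len(result) < 3:
            break  # nothing left to shortcut
        i, j = sorted(rng.sample(range(len(result)), 2))
        if j - i < 2:
            continue  # adjacent waypoints: no intermediate point to remove
        if is_segment_free(result[i], result[j], is_free):
            # Drop the waypoints strictly between i and j.
            result = result[: i + 1] + result[j:]
    return result
```

The start and goal configurations are never removed, and the path length in waypoints can only decrease. Time parameterization and smoothness, which the study also evaluates, are outside the scope of this sketch.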

Abstract:
Diffusion models are increasingly employed in visuomotor policies to achieve promising performance of behavior cloning. However, the slow inference caused by iterative denoising is a notorious disadvantage, which greatly limits their application in resource-limited and real-time interactive robot systems. The prevailing strategy for this problem is distillation, but it still requires considerable resources to retrain a student model. To this end, we take another training-free view to develop a novel Fast Policy (termed FP), which can be regarded as a powerful and accelerated alternative to Diffusion Policy for learning visuomotor robot control. Specifically, our comprehensive study of the UNet encoder shows that its features change little during inference, prompting us to reuse encoder features in non-critical denoising steps. In addition, we design strategies based on Fourier energy to screen critical and non-critical steps dynamically according to different tasks. Importantly, to mitigate performance degradation caused by the repeated use of non-critical steps, we further introduce a noise correction strategy. Our FP is evaluated on multiple simulation benchmarks, and the comparison results with existing speed-up methods demonstrate its effectiveness and superiority with state-of-the-art success rates in visuomotor inference speed. The code is available at https://github.com/xwccchong/Fast-Policy

Abstract:
Reinforcement Learning (RL) offers a promising solution to enable evolutionary automated driving. However, conventional RL methods often struggle with risk performance, as updated policies may fail to enhance performance or even lead to deterioration. To address this challenge, this research introduces a High Confidence Policy Improvement Reinforcement Learning-based (HCPI-RL) planner, designed to achieve the monotonic evolution of automated driving. The HCPI-RL planner features a novel RL policy update paradigm, ensuring that each newly learned policy outperforms previous policies, achieving monotonic performance enhancement. Hence, the proposed HCPI-RL planner has the following features: i) Evolutionary automated driving with guaranteed monotonic performance enhancement; ii) Capability of handling emergency scenarios; iii) Enhanced decision-making optimality. Experimental results demonstrate that the proposed HCPI-RL planner enhances policy return by at least 20.1% and driving efficiency by at least 15.6%, compared to the conventional RL-based planners.
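
The general idea behind a high-confidence policy improvement check can be sketched as follows: a candidate policy replaces the current one only if a lower confidence bound on its expected return exceeds the current policy's return. The abstract does not state the actual HCPI-RL update rule, so this is only an assumed, simplified gate; the function names, the one-sided normal approximation, and the default `z = 1.645` (about 95% one-sided confidence) are choices made here for illustration.

```python
import math
from statistics import mean, stdev
from typing import Sequence


def lower_confidence_bound(returns: Sequence[float], z: float = 1.645) -> float:
    """One-sided lower bound on the mean return (normal approximation,
    which is an assumption of this sketch)."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two evaluation returns")
    return mean(returns) - z * stdev(returns) / math.sqrt(n)


def accept_candidate(candidate_returns: Sequence[float],
                     baseline_return: float,
                     z: float = 1.645) -> bool:
    """Accept the candidate policy only if it is better with high confidence."""
    return lower_confidence_bound(candidate_returns, z) > baseline_return
```

Under this gate, a candidate with a higher average return but high variance across evaluation episodes is rejected, which is what keeps the sequence of deployed policies from regressing.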

Abstract:
A long-cherished vision of drones is to autonomously traverse through clutter to reach every corner of the world using onboard sensing and computation. In this paper, we combine onboard 3D lidar sensing and sim-to-real reinforcement learning (RL) to enable autonomous flight in cluttered environments. Compared to vision sensors, lidars appear to be more straightforward and accurate for geometric modeling of surroundings, which is one of the most important cues for successful obstacle avoidance. On the other hand, the sim-to-real RL approach facilitates the realization of low-latency control, without the hierarchy of trajectory generation and tracking. We demonstrate that, with design choices of practical significance, we can effectively combine the advantages of 3D lidar sensing and RL to control a quadrotor through a low-level control interface at 50Hz. The key to successfully learning the policy in a lightweight way lies in a specialized surrogate of the lidar’s raw point clouds, which simplifies learning while retaining a fine-grained perception to detect narrow free space and thin obstacles. Simulation statistics demonstrate the advantages of the proposed system over alternatives, such as performing easier maneuvers and higher success rates at different speed constraints. With lightweight simulation techniques, the policy trained in the simulator can control a physical quadrotor, where the system can dodge thin obstacles and safely traverse randomly distributed obstacles.

Abstract:
We consider the problem of optimizing neural implicit surfaces for 3D reconstruction using acoustic images collected with drifting sensor poses. The accuracy of current state-of-the-art 3D acoustic modeling algorithms is highly dependent on accurate pose estimation; small errors in sensor pose can lead to severe reconstruction artifacts. In this paper, we propose an algorithm that jointly optimizes the neural scene representation and sonar poses. Our algorithm does so by parameterizing the 6DoF poses as learnable parameters and backpropagating gradients through the neural renderer and implicit representation. We validated our algorithm on both real and simulated datasets. It produces high-fidelity 3D reconstructions even under significant pose drift.

Abstract:
We present Look-Back and Look-Ahead Adaptive Model Predictive Control (LLA-MPC), a real-time adaptive control framework for autonomous racing that addresses the challenge of rapidly changing tire-surface interactions. Unlike existing approaches requiring substantial data collection or offline training, LLA-MPC employs parallelization over a bank of models for rapid adaptation with no training. It integrates two key mechanisms: a look-back window that uses recent vehicle behavior to optimize the model used in a look-ahead stage for trajectory optimization and control. The optimized model and its associated parameters are then incorporated into an adaptive path planner to optimize reference racing paths in real time. Experiments across diverse racing scenarios demonstrate that LLA-MPC outperforms state-of-the-art methods in adaptation speed and handling, even during sudden friction transitions. Its learning-free, computationally efficient design enables rapid adaptation, making it ideal for high-speed autonomous racing in multi-surface environments.
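
The look-back step can be pictured as scoring every model in a bank against a short window of recent measurements and handing the best one to the look-ahead controller. The sketch below does this for a deliberately simple longitudinal model in which the friction coefficient `mu` caps the achievable acceleration. The toy dynamics, the names `predict_speed` and `select_model`, and the squared-error score are assumptions made for illustration; they are not the vehicle model or selection rule used in LLA-MPC.

```python
from typing import List, Sequence, Tuple

G = 9.81  # gravitational acceleration in m/s^2

# One look-back sample: (speed, commanded acceleration, measured next speed)
Sample = Tuple[float, float, float]


def predict_speed(v: float, a_cmd: float, mu: float, dt: float = 0.05) -> float:
    """Toy model: the tire can deliver at most mu * G of acceleration."""
    a_max = mu * G
    a = max(-a_max, min(a_max, a_cmd))
    return v + dt * a


def select_model(window: Sequence[Sample],
                 bank: List[float],
                 dt: float = 0.05) -> float:
    """Return the friction coefficient from the bank that best explains
    the recent window (smallest sum of squared prediction errors)."""
    def error(mu: float) -> float:
        return sum((predict_speed(v, a, mu, dt) - v_next) ** 2
                   for v, a, v_next in window)
    return min(bank, key=error)
```

Each bank entry is scored independently, so the loop inside `select_model` is the part that would be parallelized in a real-time setting. No training is involved: adapting to a new surface only requires that some bank entry explains the latest window better than the current one.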

Abstract:
Depth estimation has been widely studied and serves as the fundamental step of 3D perception for robotics and autonomous driving. Though significant progress has been made in monocular depth estimation in the past decades, these attempts are mainly conducted on the KITTI benchmark with only front-view cameras, which ignores the correlations across surround-view cameras. In this paper, we propose an Adjacent-View Transformer for Supervised Surround-view Depth estimation (AVT-SSDepth), to jointly predict the depth maps across multiple surrounding cameras. Specifically, we employ a global-to-local feature extraction module that combines CNN with transformer layers for enriched representations. Further, the adjacent-view attention mechanism is proposed to enable the intra-view and inter-view feature propagation. The former is achieved by the self-attention module within each view, while the latter is realized by the adjacent attention module, which computes the attention across multi-cameras to exchange the multi-scale representations across surround-view feature maps. In addition, AVT-SSDepth has strong cross-dataset generalization. Extensive experiments show that our method achieves superior performance over existing state-of-the-art methods on both DDAD and nuScenes datasets. Code is available at https://github.com/XiandaGuo/SSDepth.

Abstract:
Real-time open-vocabulary scene understanding is essential for efficient 3D perception in applications such as vision-language navigation, embodied intelligence, and augmented reality. However, existing methods suffer from imprecise instance segmentation, static semantic updates, and limited handling of complex queries. To address these issues, we present OpenFusion++, a TSDF-based real-time 3D semantic-geometric reconstruction system. Our approach refines 3D point clouds by fusing confidence maps from foundational models, dynamically updates global semantic labels via an adaptive cache based on instance area, and employs a dual-path encoding framework that integrates object attributes with environmental context for precise query responses. Experiments on the ICL, Replica, ScanNet, and ScanNet++ datasets demonstrate that OpenFusion++ significantly outperforms the baseline in both semantic accuracy and query responsiveness.

Abstract:
Simultaneous localization and mapping (SLAM) technology has recently achieved photorealistic mapping capabilities thanks to the real-time, high-fidelity rendering enabled by 3D Gaussian Splatting (3DGS). However, due to the static representation of scenes, current 3DGS-based SLAM encounters issues with pose drift and failure to reconstruct accurate maps in dynamic environments. To address this problem, we present D4DGS-SLAM, the first SLAM method based on 4DGS map representation for dynamic environments. By incorporating the temporal dimension into scene representation, D4DGS-SLAM enables high-quality reconstruction of dynamic scenes. Utilizing the dynamics-aware InfoModule, we can obtain the dynamics, visibility, and reliability of scene points, and filter out unstable dynamic points for tracking accordingly. When optimizing Gaussian points, we apply different isotropic regularization terms to Gaussians with varying dynamic characteristics. Experimental results on real-world dynamic scene datasets demonstrate that our method outperforms state-of-the-art approaches in both camera pose tracking and map quality.

Abstract:
Light Detection and Ranging (LiDAR) sensors have become the sensor of choice for many robotic state estimation tasks. Because of this, in recent years there has been significant work done to find the most accurate method to perform state estimation using these sensors. In each of these prior works, an explosion of possible technique combinations has occurred, with each work comparing LiDAR Odometry (LO) "pipelines" to prior "pipelines". Unfortunately, little work to date has performed the extensive ablation studies needed to compare the various building blocks of an LO pipeline. In this work, we summarize the various techniques that go into defining an LO pipeline and empirically evaluate these LO components on an expansive number of datasets across environments, LiDAR types, and vehicle motions. Finally, we make empirically-backed recommendations for the design of future LO pipelines to provide the most accurate and reliable performance.

Abstract:
This paper introduces PlugAndFilter, a framework designed to enhance the performance of multi-modal image registration, particularly for real-time video registration tasks. The improvements provided by PlugAndFilter include not only better registration quality for individual image pairs but also the transformation of the image registration methods into a more robust video registration system. These enhancements are made possible by three proposed contributions: Spatial and Temporal outlier detection, along with Confidence-based keypoint accumulation. PlugAndFilter is compatible with a wide range of thermal-visible registration models, and any registration method capable of producing keypoint matches can be integrated. The proposed implementation is optimized for real-time video registration on edge devices, with key design decisions highlighted to support this goal.

Abstract:
Tele-ultrasound has the potential to greatly improve health equity for countless remote communities. However, practical scenarios involve potentially large time delays which cause current implementations of telerobotic ultrasound (US) to fail. Using a local model of the remote environment to provide haptics to the expert operator can decrease teleoperation instability, but the delayed visual feedback remains problematic. This paper introduces a robotic tele-US system in which the local model is not only haptic, but also visual, by re-slicing and rendering a pre-acquired US sweep in real time to provide the operator a preview of what the delayed image will resemble. A prototype system is presented and tested with 15 volunteer operators. It is found that visual-haptic model-mediated teleoperation (MMT) compensates completely for time delays up to 1000 ms round trip in terms of operator effort and completion time while conventional MMT does not. Visual-haptic MMT also significantly outperforms MMT for longer time delays in terms of motion accuracy and force control. This proof-of-concept study suggests that visual-haptic MMT may facilitate remote robotic tele-US.

Abstract:
Traditional magnetic tactile sensors are highly susceptible to external magnetic field interference, limiting their reliability in practical applications. To address this challenge, we propose a dual-modal soft magnetic skin capable of simultaneously acquiring magnetic and force tactile information across spatiotemporal domains, inspired by the sensory mechanisms of human skin. The system integrates a Convolutional Neural Network-Convolutional Neural Network-Multilayer Perceptron (CNN-CNN-MLP) architecture to fuse these dual-modal signals effectively. Furthermore, we introduce a novel Dynamic Weighting Coefficient Layer (DWCL) to dynamically optimize fusion weights for each modality based on real-time input characteristics, thereby enhancing robustness against magnetic interference. The DWCL leverages temporal discrepancies between modalities during pre-contact sensing and quantifies the magnetic field strength of target objects to autonomously adjust fusion ratios, prioritizing the more reliable modality under varying interference conditions. Extensive experimental evaluations demonstrate that the proposed DWCL significantly improves interference resistance compared to conventional fusion methods, advancing the feasibility of magnetic tactile sensing in real-world environments.

Abstract:
With the rapid development of AI chips, underwater snake robots hold significant promise for navigating complex underwater environments, offering unique advantages in exploration, monitoring, and inspection tasks due to their flexible body and high mobility. However, existing underwater snake robots predominantly employ bulky mechanical configurations with high manufacturing costs, resulting in excessive power consumption and limited operational endurance with standard batteries, which impede their widespread adoption and limit their operational flexibility. Moreover, most path following methods used in underwater snake robots inadequately account for the dynamic changes in path curvature, leading to serious tracking errors in scenarios involving sharp turns or complex environments, and thus fail to meet the demands of more intricate trajectories. To address the above issues, this work introduces the U-Snake, a small-sized smart underwater snake robot with a simple lightweight structure and a highly maneuverable controller, adapted to various complicated path following tasks. In particular, each joint of U-Snake is designed to be small and lightweight to achieve higher spatial utilization, and is covered by a convenient and efficient 3D printed waterproof casing to achieve robust water resistance. In addition, a path following method based on curvature of the path is designed to achieve high performance in various complicated trajectories. Furthermore, an integrated controller combines the method with the kinematics and dynamics models of U-Snake, enabling precise following of straight and curved paths. The experimental results demonstrate that the proposed control structure effectively guides U-Snake to follow the desired path.

Abstract:
Legged robots with closed-loop kinematic chains are increasingly prevalent due to their increased mobility and efficiency. Yet, most motion generation methods rely on serial-chain approximations, sidestepping their specific constraints and dynamics. This leads to suboptimal motions and limits the adaptability of these methods to diverse kinematic structures. We propose a comprehensive motion generation method that explicitly incorporates closed-loop kinematics and their associated constraints in an optimal control problem (OCP), integrating kinematic closure conditions and their analytical derivatives. This allows the solver to leverage the non-linear transmission effects inherent to closed-chain mechanisms, reducing peak actuator efforts and expanding their effective operating range. Unlike previous methods, our framework does not require serial approximations, enabling more accurate and efficient motion strategies. We are also able to generate the motion of more complex robots for which an approximate serial chain does not exist. We validate our approach through simulations and experiments, demonstrating superior performance in complex tasks such as rapid locomotion and stair negotiation. This method enhances the capabilities of current closed-loop robots and broadens the design space for future kinematic architectures.

Abstract:
We study a pursuit-evasion game between two players with car-like dynamics and sensing limitations by formalizing it as a partially observable stochastic zero-sum game. The partial observability caused by the sensing constraints is particularly challenging. As an example, in a situation where the agents have no visibility of each other, they would need to extract information from their sensor coverage history to reason about potential locations of their opponents. However, keeping historical information greatly increases the size of the state space. To mitigate the challenges encountered with such partially observable problems, we develop a new learning-based method that encodes historical information to a belief state and uses it to generate agent actions. Through experiments we show that the learned strategies improve over existing multi-agent RL baselines by up to 16% in terms of capture rate for the pursuer. Additionally, we present experimental results showing that learned belief states are strong state estimators for extending existing game theory solvers and demonstrate our method’s competitiveness for problems where existing fully observable game theory solvers are computationally feasible. Finally, we deploy the learned policies on physical robots for a game between the F1TENTH and JetRacer platforms moving as fast as 2 m/s in indoor environments, showing that they can be executed on real robots.

Abstract:
To address the intricate challenges of cooperative scheduling and motion planning in Autonomous Mobility-on-Demand (AMoD) systems, this paper introduces LMMCoDrive, a novel cooperative driving framework that leverages a Large Multimodal Model (LMM) to improve traffic efficiency and passenger experience in dynamic urban environments. This framework seamlessly integrates scheduling and motion planning processes to ensure the effective operation of Cooperative Autonomous Vehicles (CAVs). The spatial relationship between CAVs and passenger requests is abstracted into a Bird’s-Eye View (BEV) image to fully exploit the potential of the multimodal understanding ability of LMMs. Besides, trajectories are cautiously refined for each CAV while ensuring collision avoidance through safety constraints. A decentralized optimization strategy, facilitated by the Alternating Direction Method of Multipliers (ADMM) within the LMM framework, is proposed to drive the graph evolution of CAVs. Simulation results in diverse urban scenarios demonstrate the pivotal role and significant impact of LMM in optimizing CAV scheduling and seamlessly serving a decentralized cooperative optimization process for each CAV. This marks a substantial stride towards practical, efficient, and safe AMoD systems that are poised to revolutionize urban transportation. The code is available at https://github.com/henryhcliu/LMMCoDrive.

Abstract:
In-context imitation learning (ICIL) is a new paradigm that enables robots to generalize from demonstrations to unseen tasks without retraining. A well-structured action representation is the key to capturing demonstration information effectively, yet action tokenization (the process of discretizing and encoding actions) remains largely unexplored in ICIL. In this work, we first systematically evaluate existing action tokenizer methods in ICIL and reveal a critical limitation: while they effectively encode action trajectories, they fail to preserve temporal smoothness, which is crucial for stable robotic execution. To address this, we propose LipVQ-VAE, a variational autoencoder that enforces the Lipschitz condition in the latent action space via weight normalization. By propagating smoothness constraints from raw action inputs to a quantized latent codebook, LipVQ-VAE generates smoother actions. When integrated into ICIL, LipVQ-VAE improves performance by more than 5.3% in high-fidelity simulators, with real-world experiments confirming its ability to produce smoother, more reliable trajectories. Code and checkpoints are available at https://action-tokenizer-matters.github.io/.
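
To make the link between weight normalization and a Lipschitz bound concrete, the sketch below rescales the rows of a linear layer so that its infinity-norm is at most `c`, which guarantees that outputs move by at most `c` times the input change (in the infinity-norm). This is one common way to bound a layer's Lipschitz constant. The abstract does not specify which normalization LipVQ-VAE uses, so the scheme and the names `normalize_rows`, `linear`, and `inf_norm` are assumptions for illustration.

```python
from typing import List

Matrix = List[List[float]]
Vector = List[float]


def normalize_rows(W: Matrix, c: float = 1.0) -> Matrix:
    """Shrink each row so its absolute sum is at most c.

    The infinity-norm of a matrix is its largest absolute row sum, so the
    result has infinity-norm <= c. Rows already within the bound are kept.
    """
    out = []
    for row in W:
        s = sum(abs(w) for w in row)
        scale = min(1.0, c / s) if s > 0 else 1.0
        out.append([w * scale for w in row])
    return out


def linear(W: Matrix, x: Vector) -> Vector:
    """Bias-free linear layer y = W x."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]


def inf_norm(v: Vector) -> float:
    return max(abs(a) for a in v)
```

With such a bound in place, two nearby raw actions are mapped to nearby latent vectors, which is the property the abstract relies on to carry smoothness through to the codebook. Composing several normalized layers with 1-Lipschitz activations keeps the overall bound at the product of the per-layer constants.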

Abstract:
The ability to perform generalizable and precise grasping on functional object parts is a prerequisite for robotic manipulation in open environments. Recent foundation models have demonstrated promising semantic correspondence capabilities in guiding robots to grasp similar parts across objects with resembling shapes and poses. However, existing works struggle to generalize precise grasp poses when the target objects exhibit substantial geometric and positional variations. To tackle this challenge, we present PartGrasp, a method that achieves precise part grasping through hierarchical integration of highly generalizable semantic correspondence and precise geometric registration. Specifically, we first build a grasp knowledge bank by extracting grasp poses and object meshes from demonstrations. Upon retrieving a reference from this bank, we initially perform a coarse alignment using semantic correspondence, followed by a fine registration that adapts to geometric variations. This approach achieves fine-grained generalization of part grasping that is robust to both shape and pose variations. Extensive experiments demonstrate the efficacy of our method in terms of both generalization capability and accuracy. Videos and more details are available on our project site: https://part-grasp.github.io/partgrasp/.

Abstract:
In Multi-Agent Pickup and Delivery (MAPD), a team of agents must find collision-free paths to service an online stream of tasks, which are composed of pickup and delivery locations that have to be visited sequentially. This paper addresses the novel problem of MAPD with mobile pickups, which involves two types of agents, the suppliers and the deliverers. Suppliers are large robots that can transport many items, but cannot navigate tight spaces or manipulate objects, while deliverers can navigate to rooms to deliver items, but can only carry one item at a time. Deliverers have to collect items from the suppliers, and bring them to the assigned delivery locations. This introduces a new challenge which is not tackled in classical MAPD: deciding where and when the exchange of items should happen. We propose Token Passing with Exchange Locations (TP-EL), an extension of the widely used Token Passing (TP) algorithm with a task allocation mechanism that considers which supplier to pick items from, and when and where to do so. We experiment in several simulated domains, demonstrating the superiority of TP-EL over baselines that do not consider mobile pickups or use alternative methods to decide pickup locations.
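
The "where should the exchange happen" decision can be illustrated with a simple scoring rule: for each candidate exchange location, the hand-over can start only once both the supplier and the deliverer have arrived, after which the deliverer travels to the delivery location. The sketch picks the candidate with the smallest estimated completion time. It is not the TP-EL allocation mechanism; the Manhattan-distance travel estimate, unit speeds, and the names `manhattan` and `choose_exchange_location` are assumptions introduced here, and collision-free path planning is ignored.

```python
from typing import List, Tuple

Cell = Tuple[int, int]


def manhattan(a: Cell, b: Cell) -> int:
    """Travel-time estimate on a grid with unit speed (assumption)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def choose_exchange_location(supplier: Cell,
                             deliverer: Cell,
                             delivery: Cell,
                             candidates: List[Cell]) -> Cell:
    """Pick the exchange cell that minimizes estimated completion time."""
    def completion_time(e: Cell) -> int:
        # Both agents must be present before the item can change hands.
        meet = max(manhattan(supplier, e), manhattan(deliverer, e))
        return meet + manhattan(e, delivery)
    return min(candidates, key=completion_time)
```

A fuller treatment would restrict `candidates` to cells the large supplier can actually reach and would replace the distance estimate with planned path lengths.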

Abstract:
Visual-language models (VLMs) have recently been introduced in robotic mapping using the latent representations, i.e., embeddings, of the VLMs to represent semantics in the map. They allow moving from a limited set of human-created labels toward open-vocabulary scene understanding, which is very useful for robots when operating in complex real-world environments and interacting with humans. While there is anecdotal evidence that maps built this way support downstream tasks, such as navigation, rigorous analysis of the quality of the maps using these embeddings is missing. In this paper, we propose a way to analyze the quality of maps created using VLMs. We investigate two critical properties of map quality: queryability and distinctness. The evaluation of queryability addresses the ability to retrieve information from the embeddings. We investigate intra-map distinctness to study the ability of the embeddings to represent abstract semantic classes and inter-map distinctness to evaluate the generalization properties of the representation. We propose metrics to evaluate these properties and evaluate two state-of-the-art mapping methods, VLMaps and OpenScene, using two encoders, LSeg and OpenSeg, using real-world data from the Matterport3D data set. Our findings show that while 3D features improve queryability, they are not scale invariant, whereas image-based embeddings generalize to multiple map resolutions. This allows the image-based methods to maintain smaller map sizes, which can be crucial for using these methods in real-world deployments. Furthermore, we show that the choice of the encoder has an effect on the results. The results imply that properly thresholding open-vocabulary queries is an open problem.

Abstract:
Reliable human-robot communication is essential for effective underwater human-robot interaction (U-HRI), yet traditional methods such as acoustic signaling and predefined gesture-based models suffer from limitations in adaptability and robustness. In this work, we propose One-Shot Gesture Recognition (OSG), a novel method that enables real-time, pose-based, temporal gesture recognition underwater from a single demonstration, eliminating the need for extensive dataset collection or model retraining. OSG leverages shape-based classification techniques, including Hu moments, Zernike moments, and Fourier descriptors, to robustly recognize gestures in visually-challenging underwater environments. Our system achieves high accuracy on real-world underwater video data and operates efficiently on embedded hardware commonly found on autonomous underwater vehicles (AUVs), demonstrating its feasibility for deployment on-board robots. Compared to deep learning approaches, OSG is lightweight, computationally efficient, and highly adaptable, making it ideal for diver-to-robot communication. We evaluate OSG’s performance on an augmented gesture dataset and real-world underwater video data, comparing its accuracy against deep learning methods. Our results show OSG’s potential to enhance U-HRI by enabling the immediate deployment of user-defined gestures without the constraints of predefined gesture languages.
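
As an illustration of the shape descriptors mentioned above, here is a minimal sketch that computes the first two Hu moment invariants of a binary mask in plain Python. It is not the OSG pipeline: the function name `hu_first_two` and the toy masks are assumptions, and a real system would more likely rely on an image-processing library.

```python
def hu_first_two(mask):
    """First two Hu invariants of a binary mask given as a list of rows of 0/1.

    On a pixel grid these values are exactly invariant to translation and to
    90-degree rotations; for other rotations and for scaling they are only
    approximately invariant.
    """
    pts = [(x, y) for y, row in enumerate(mask) for x, v in enumerate(row) if v]
    m00 = len(pts)
    if m00 == 0:
        raise ValueError("mask contains no foreground pixels")
    cx = sum(x for x, _ in pts) / m00
    cy = sum(y for _, y in pts) / m00
    # Central moments of order two.
    mu20 = sum((x - cx) ** 2 for x, _ in pts)
    mu02 = sum((y - cy) ** 2 for _, y in pts)
    mu11 = sum((x - cx) * (y - cy) for x, y in pts)
    # Scale normalisation: eta_pq = mu_pq / m00 ** (1 + (p + q) / 2), here p + q = 2.
    n20, n02, n11 = mu20 / m00 ** 2, mu02 / m00 ** 2, mu11 / m00 ** 2
    phi1 = n20 + n02
    phi2 = (n20 - n02) ** 2 + 4.0 * n11 ** 2
    return phi1, phi2


# Toy shapes (assumed): an L-shape, the same L rotated by 90 degrees, and a bar.
l_shape = [[1, 1, 1, 0], [1, 0, 0, 0], [1, 0, 0, 0]]
l_rotated = [list(row) for row in zip(*l_shape[::-1])]
bar = [[1, 1, 1, 1, 1]]
```

A one-shot classifier in this spirit could store the descriptor of a single demonstrated shape and assign new shapes to the nearest stored descriptor. How OSG combines its descriptors is not specified in the abstract.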

Abstract:
This paper investigates the importance and design implications for use of rectangular blocks in collective robotic construction systems with distributed control. Specifically, we introduce an automated solver for optimizing the overlaps in user-specified structures; a new robot design capable of manipulating, fastening, and climbing over blocks as wide as the robot; detailed analysis of robot primitives and demonstration of rectilinear, curved, cantilever, and corbeled arch structures; and results from a physics simulator showing how overlaps improve structural integrity when the depositions are noisy. This work represents an important step towards efficient and versatile large-scale robotic construction.

Abstract:
Ultra-wideband (UWB)-vision fusion localization has achieved extensive applications in the domain of multiagent relative localization. The challenging matching problem between robots and visual detection renders existing methods highly dependent on identity-encoded hardware or delicate tuning algorithms. Overconfident yet erroneous matches may bring about irreversible damage to the localization system. To address this issue, we introduce Mr. Virgil, an end-to-end learning multi-robot visual-range relative localization framework, consisting of a graph neural network for data association between UWB rangings and visual detections, and a differentiable pose graph optimization (PGO) back-end. The graph-based front-end supplies robust matching results, accurate initial position predictions, and credible uncertainty estimates, which are subsequently integrated into the PGO back-end to elevate the accuracy of the final pose estimation. Additionally, a decentralized system is implemented for real-world applications. Experiments spanning varying robot numbers, simulation and real-world, occlusion and non-occlusion conditions showcase the stability and exactitude under various scenes compared to conventional methods. Our code is available at: https://github.com/HiOnes/Mr-Virgil.

Abstract:
Locating human-object interaction (HOI) actions within video serves as the foundation for multiple downstream tasks, such as human behavior analysis and human-robot skill transfer. Current temporal action localization methods typically rely on annotated action and object categories of interactions for optimization, which leads to domain bias and low deployment efficiency. Although some recent works have achieved zero-shot temporal action localization (ZS-TAL) with large vision-language models (VLMs), their coarse-grained estimations and open-loop pipelines hinder further performance improvements for temporal interaction localization (TIL). To address these issues, we propose a novel zero-shot TIL approach dubbed EgoLoc to locate the timings of grasp actions for human-object interaction in egocentric videos. EgoLoc introduces a self-adaptive sampling strategy to generate reasonable visual prompts for VLM reasoning. By absorbing both 2D and 3D observations, it directly samples high-quality initial guesses around the possible contact/separation timestamps of HOI according to 3D hand velocities, leading to high inference accuracy and efficiency. In addition, EgoLoc generates closed-loop feedback from visual and dynamic cues to further refine the localization results. Comprehensive experiments on the publicly available dataset and our newly proposed benchmark demonstrate that EgoLoc achieves better temporal interaction localization for egocentric videos compared to state-of-the-art baselines. We will release our code and relevant data as open-source at https://github.com/IRMVLab/EgoLoc.

Abstract:
This paper presents DNAct, a language-conditioned multi-task policy framework that integrates neural rendering pre-training and diffusion training to enforce multi-modality learning in action sequence spaces. To learn a generalizable multi-task policy with few demonstrations, the pre-training phase of DNAct leverages neural rendering to distill 2D semantic features from foundation models such as Stable Diffusion to a 3D space, which provides a comprehensive semantic understanding regarding the scene. Consequently, it allows various applications for challenging robotic tasks requiring rich 3D semantics and accurate geometry. Furthermore, we introduce a novel approach utilizing diffusion training to learn a vision and language feature that encapsulates the inherent multi-modality in the multi-task demonstrations. By reconstructing the action sequences from different tasks via the diffusion process, the model is capable of distinguishing different modalities and thus improving the robustness and the generalizability of the learned representation. DNAct significantly surpasses SOTA NeRF-based multi-task manipulation approaches with over 30% improvement in success rate. Videos are available on dnact.github.io

Abstract:
Efficient and accurate motion prediction is crucial for ensuring safety and informed decision-making in autonomous driving, particularly under dynamic real-world conditions that necessitate multi-modal forecasts. We introduce TrajFlow, a novel flow matching-based motion prediction framework that addresses the scalability and efficiency challenges of existing generative trajectory prediction methods. Unlike conventional generative approaches that employ i.i.d. sampling and require multiple inference passes to capture diverse outcomes, TrajFlow predicts multiple plausible future trajectories in a single pass, significantly reducing computational overhead while maintaining coherence across predictions. Moreover, we propose a ranking loss based on the Plackett-Luce distribution to improve uncertainty estimation of predicted trajectories. Additionally, we design a self-conditioning training technique that reuses the model’s own predictions to construct noisy inputs during a second forward pass, thereby improving generalization and accelerating inference. Extensive experiments on the large-scale Waymo Open Motion Dataset (WOMD) demonstrate that TrajFlow achieves state-of-the-art performance across various key metrics, underscoring its effectiveness for safety-critical autonomous driving applications. The code and other details are available on the project website https://traj-flow.github.io/.
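
For readers unfamiliar with the Plackett-Luce model, the sketch below computes the negative log-likelihood of a ranking given one score per candidate. It shows only the standard Plackett-Luce formula. How TrajFlow obtains the target ranking (for example, by ordering candidates by closeness to the ground-truth trajectory) is an assumption here, and the function name `plackett_luce_nll` is hypothetical.

```python
import math


def plackett_luce_nll(scores, ranking):
    """Negative log-likelihood of `ranking` under a Plackett-Luce model.

    scores  -- one real-valued score per candidate (higher means preferred)
    ranking -- candidate indices ordered from best to worst
    """
    nll = 0.0
    for pos in range(len(ranking)):
        # Scores of the candidates that have not been placed yet.
        remaining = [scores[i] for i in ranking[pos:]]
        # Numerically stable log-sum-exp over the remaining candidates.
        top = max(remaining)
        log_z = top + math.log(sum(math.exp(s - top) for s in remaining))
        # Log-probability that the candidate at `pos` is picked next.
        nll -= scores[ranking[pos]] - log_z
    return nll


# Toy example (assumed values): three candidate trajectories.
scores = [2.0, 1.0, 0.0]
loss_consistent = plackett_luce_nll(scores, [0, 1, 2])
loss_reversed = plackett_luce_nll(scores, [2, 1, 0])
```

The loss is lower when the scores agree with the target ranking, so minimising it pushes the model to score better candidates higher.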

Affiliations: School of Artificial Intelligence and Automation, Huazhong University of Science and Technology (HUST), Wuhan, China; School of Electronic Information and Communications, Huazhong University of Science and Technology (HUST), Wuhan, China; Department of Computer Science, The University of Hong Kong (HKU), Pokfulam, Hong Kong; School of Computer Science and Technology, Huazhong University of Science and Technology (HUST), Wuhan, China

Abstract:
Transformer-based methods have demonstrated remarkable capabilities in 3D semantic segmentation through their powerful attention mechanisms, but the quadratic complexity limits their modeling of long-range dependencies in large-scale point clouds. While recent Mamba-based approaches offer efficient processing with linear complexity, they struggle with feature representation when extracting 3D features. However, effectively combining these complementary strengths remains an open challenge in this field. In this paper, we propose HybridTM, the first hybrid architecture that integrates Transformer and Mamba for 3D semantic segmentation. In addition, we propose the Inner Layer Hybrid Strategy, which combines attention and Mamba at a finer granularity, enabling simultaneous capture of long-range dependencies and fine-grained local features. Extensive experiments demonstrate the effectiveness and generalization of our HybridTM on diverse indoor and outdoor datasets. Furthermore, our HybridTM achieves state-of-the-art performance on ScanNet, ScanNet200, and nuScenes benchmarks. The code will be made available at https://github.com/deepinact/HybridTM.

Abstract:
In this article, we address the problem of collaborative task assignment, sequencing, and multi-agent pathfinding (TSPF), where a team of agents must visit a set of task locations without collisions while minimizing flowtime. TSPF incorporates agent-task compatibility constraints and ensures that all tasks are completed. We propose Conflict-Based Search with Task Sequencing (CBS-TS), an optimal and complete algorithm that alternates between finding new task sequences and resolving conflicts in the paths of current sequences. CBS-TS uses a mixed-integer linear program (MILP) to optimize task sequencing and employs Conflict-Based Search (CBS) with Multi-Label A* (MLA*) for collision-free path planning within a search forest. By invoking MILP for the next-best sequence only when needed, CBS-TS efficiently limits the search space, enhancing computational efficiency while maintaining optimality. We compare the performance of our CBS-TS against Conflict-based Steiner Search (CBSS), a baseline method that, with minor modifications, can address the TSPF problem. Experimental results demonstrate that CBS-TS outperforms CBSS in most testing scenarios, achieving higher success rates and consistently optimal solutions, whereas CBSS achieves near-optimal solutions in some cases. The supplementary video is available at https://youtu.be/QT8BYgvefmU.

Abstract:
Grasping large and flat objects (e.g., a book or a pan) is often regarded as an ungraspable task, which poses significant challenges due to the unreachable grasping poses. Prior research has exploited environmental interactions through Extrinsic Dexterity, utilizing external structures such as walls or table edges to facilitate object grasping. However, these approaches are confined to task-specific policies while neglecting semantic perception and planning to identify optimal pre-grasp configurations. This limits their operational versatility, impeding effective adaptation to varied extrinsic dexterity constraints. In this work, we present ExDiff, a robot manipulation approach for extrinsic dexterity grasping in unrestricted environments. It utilizes Vision-Language Models (VLMs) to perceive the environmental state and generate instructions, followed by a Goal-Conditioned Action Diffusion (GCAD) model to predict the sequence of low-level actions. This diffusion model learns the low-level policy, conditioned on high-level instructions and cumulative rewards, which improves the generation of robot actions. Simulation experiments and real-world deployment results demonstrate that ExDiff effectively performs ungraspable tasks and generalizes to previously unseen target objects and scenes. Videos are available at https://exdiff.github.io/index.html

Abstract:
Electrosurgery is a surgical technique that can improve tissue cutting by reducing cutting force and bleeding. However, electrosurgery adds a risk of thermal injury to surrounding tissue. Expert surgeons estimate desirable cutting velocities based on experience but have no quantifiable reference to indicate if a particular velocity is optimal. Furthermore, prior demonstrations of autonomous electrosurgery have primarily used constant tool velocity, which is not robust to changes in electrosurgical tissue characteristics, power settings, or tool type. Thermal imaging feedback provides information that can be used to reduce thermal injury while balancing cutting force by controlling tool velocity. We introduce Thermography for Electrosurgical Rate Modulation via Optimization (ThERMO) to autonomously reduce thermal injury while balancing cutting force by intelligently controlling tool velocity. We demonstrate ThERMO in tissue phantoms and compare its performance to the constant velocity approach. Overall, ThERMO improves cut success rate by a factor of three and can reduce peak cutting force by a factor of two. ThERMO responds to varying environmental disturbances, reduces damage to tissue, and completes cutting tasks that would otherwise result in catastrophic failure for the constant velocity approach.

Abstract:
We propose a framework to compute the optimal routes for multiple drones to minimize the delivery time in the last-mile delivery service. We mainly focus on a notion of the landing exclusion zone that appears during the landing phase; an area around the drop-off site is blocked until a drop-off is completed. Such zones affect the delivery time as other drones need to detour or hover around the site unnecessarily. We formulate a Mixed-Integer Linear Programming (MILP) problem by explicitly modeling the landing phase. Then, we present a heuristic algorithm that iteratively solves a sequence of single-drone delivery problems according to the delivery priorities. A delivery priority is determined according to the spatiotemporal occupancy that quantifies the significance of the size of the landing exclusion zone and its blocking period. We designed the experiment for 48 urban delivery scenarios with varying density and distribution of delivery destinations, departure points, and order quantities. Our experimental results show that the heuristic computes the routes significantly faster than the original MILP, with a delivery time that is 5% higher than the optimal solution (lower bound) and 60% lower than the general requirement of a single package per round trip (upper bound).

Abstract:
Recent robotic task planning frameworks have integrated large multimodal models (LMMs) such as GPT-4o. To address grounding issues of such models, it has been suggested to split the pipeline into perceptional state grounding and subsequent state-based planning. As we show in this work, the state grounding ability of LMM-based approaches is still limited by weaknesses in granular, structured, domain-specific scene understanding. To address this shortcoming, we develop a more structured state grounding framework that features a domain-conditioned scene graph as its scene representation. We show that such representation is actionable in nature as it is directly mappable to a symbolic state in planning languages such as the Planning Domain Definition Language (PDDL). We provide an instantiation of our state grounding framework where the domain-conditioned scene graph generation is implemented with a lightweight vision-language approach that classifies domain-specific predicates on top of domain-relevant object detections. Evaluated across three domains, our approach achieves significantly higher state grounding accuracy and task planning success rates compared to LMM-based approaches. https://github.com/Vision-Kek/DC-SGG

Abstract:
Open-vocabulary scene understanding is critical for robotics, yet existing 3D Gaussian Splatting (3DGS) methods rely on compressed feature embeddings, compromising semantic fidelity and fine-grained interpretation. Although utilizing uncompressed high-dimensional features offers a potential solution, their direct integration imposes prohibitive memory and computational costs. To address this challenge, we propose OpenMIGS, a novel 3DGS-based framework for multi-granularity, information-preserving open-vocabulary understanding across both object and part levels. Specifically, OpenMIGS first constructs object-level Gaussian fields as structured carriers where a two-stage clustering strategy ensures global consistency in object labeling, and a codebook subsequently associates these object labels with their uncompressed high-dimensional features. Building on this, a lightweight implicit field processes the geometric coordinates of object Gaussians to regress part-level high-dimensional features, enabling multi-granularity understanding. Experimental results on multiple datasets show that OpenMIGS outperforms existing methods in open-vocabulary understanding and retrieval tasks. It also supports multi-granularity scene editing for flexible semantic manipulation. The code is available at https://github.com/jingyuzhao1010/OpenMIGS.

Abstract:
In this paper, we unleash the potential of the powerful monodepth model in camera-LiDAR calibration and propose CLAIM, a novel method of aligning data from the camera and LiDAR. Given the initial guess and pairs of images and LiDAR point clouds, CLAIM utilizes a coarse-to-fine searching method to find the optimal transformation minimizing a patched Pearson correlation-based structure loss and a mutual information-based texture loss. These two losses serve as good metrics for camera-LiDAR alignment results and require no complicated steps of data processing, feature extraction, or feature matching like most methods, rendering our method simple and adaptive to most scenes. We validate CLAIM on public KITTI, Waymo, and MIAS-LCEC datasets, and the experimental results demonstrate its superior performance compared with the state-of-the-art methods. The code is available at https://github.com/Tompson11/claim.
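
The idea of a patch-wise Pearson correlation loss can be sketched as follows. This is an illustrative reading, not CLAIM's actual loss: the function names, the non-overlapping patch layout, and the "1 minus mean correlation" form are all assumptions. One reason such a loss may suit monocular depth is that the Pearson correlation is unchanged by a positive scale and shift of either input.

```python
import math


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences (0.0 if either is constant)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        return 0.0
    return cov / math.sqrt(var_x * var_y)


def patched_pearson_loss(grid_a, grid_b, patch):
    """1 minus the mean Pearson correlation over non-overlapping square patches.

    grid_a, grid_b -- equally sized 2D lists, e.g. a predicted depth map and a
                      depth map rendered from projected LiDAR points (assumed)
    patch          -- patch side length in pixels
    """
    rows, cols = len(grid_a), len(grid_a[0])
    scores = []
    for r0 in range(0, rows - patch + 1, patch):
        for c0 in range(0, cols - patch + 1, patch):
            cells = [(r, c) for r in range(r0, r0 + patch) for c in range(c0, c0 + patch)]
            xs = [grid_a[r][c] for r, c in cells]
            ys = [grid_b[r][c] for r, c in cells]
            scores.append(pearson(xs, ys))
    if not scores:
        raise ValueError("patch size is larger than the grid")
    return 1.0 - sum(scores) / len(scores)


# Toy grids (assumed): b is a scaled and shifted copy of a, c is a negated copy.
a = [[float(r * 4 + c) for c in range(4)] for r in range(4)]
b = [[2.0 * v + 3.0 for v in row] for row in a]
c = [[-v for v in row] for row in a]
```

Under these assumptions, a coarse-to-fine search would evaluate the loss for each candidate transformation and keep the one with the lowest value.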

Abstract:
Accurate vehicle trajectory prediction is critical for safe and efficient autonomous driving, especially in mixed traffic environments where both human-driven and autonomous vehicles co-exist. However, uncertainties introduced by inherent driving behaviors (such as acceleration, deceleration, and left and right maneuvers) pose significant challenges for reliable trajectory prediction. We introduce a Maneuver-Intention-Aware Transformer (MIAT) architecture, which integrates a maneuver intention awareness control mechanism with spatiotemporal interaction modeling to enhance long-horizon trajectory predictions. We systematically investigate the impact of varying awareness of maneuver intention on both short- and long-horizon trajectory predictions. Evaluated on the real-world NGSIM dataset and benchmarked against various transformer- and LSTM-based methods, our approach achieves an improvement of up to 4.7% in short-horizon predictions and 1.6% in long-horizon predictions compared to other intention-aware benchmark methods. Moreover, by leveraging the intention awareness control mechanism, MIAT realizes an 11.1% performance boost in long-horizon predictions, with a modest drop in short-horizon performance. The source code and datasets are available at https://github.com/cpraskoti/MIAT.

Abstract:
Road curbs are considered one of the crucial and ubiquitous traffic features, which are essential for ensuring the safety of autonomous vehicles. Current methods for detecting curbs primarily rely on camera imagery or LiDAR point clouds. Image-based methods are vulnerable to fluctuations in lighting conditions and exhibit poor robustness, while methods based on point clouds circumvent the issues associated with lighting variations. However, point-cloud-based methods typically incur significant processing delays due to the large number of 3D points contained in each frame of the point cloud data. Furthermore, the inherently unstructured characteristics of point clouds pose challenges for integrating the latest deep learning advancements into point cloud data applications. To address these issues, this work proposes an annotation-free curb detection method leveraging Altitude Difference Image (ADI) (as shown in Fig. 1), which effectively mitigates the aforementioned challenges. Given that methods based on deep learning generally demand extensive, manually annotated datasets, which are both expensive and labor-intensive to create, we present an Automatic Curb Annotator (ACA) module. This module utilizes a deterministic curb detection algorithm to automatically generate a vast quantity of training data. Consequently, it facilitates the training of the curb detection model without necessitating any manual annotation of data. Finally, by incorporating a post-processing module, we manage to achieve state-of-the-art results on the KITTI 3D curb dataset [1] with considerably reduced processing delays compared to existing methods, which underscores the effectiveness of our approach in curb detection tasks. Our code and data will be open-sourced at: https://sites.google.com/view/adi-curb-detection.

Abstract:
The mammalian hippocampal formation plays a critical role in efficient and flexible navigation. Hippocampal place cells exhibit spatial tuning, characterized by increased firing rates when an animal occupies specific locations in its environment. The mechanisms underlying the encoding of spatial information by hippocampal place cells are not yet fully resolved. Evidence suggests that spatial preferences are shaped by multimodal sensory inputs. Yet, existing hippocampal-inspired models typically rely on a single sensory information source. Here, we developed a hippocampus-inspired model that combines motivational and spatial encoding and is based on the fundamental principle of biological autonomy that behavior serves a purpose. That is, in foraging tasks, an agent’s trajectories must be deployed considering the fact that the reward value of environmental stimuli is tied to the agent’s motivational state. In this paper, we introduce a "motivational hippocampal autoencoder" (MoHA) that integrates both interoceptive (motivational) and exteroceptive (visual) information. The MoHA model reproduces hippocampal firing correlates for different motivational states. We show that the representations of MoHA allow a synthetic agent to learn and deploy efficient trajectories in a foraging task, laying the foundation for self-regulated multipurpose reinforcement learning.

Abstract:
Accurate ego-motion estimation is a critical component of any autonomous system. Conventional ego-motion sensors, such as cameras and LiDARs, may be compromised in adverse environmental conditions, such as fog, heavy rain, or dust. Automotive radars, known for their robustness to such conditions, present themselves as complementary sensors or a promising alternative within the ego-motion estimation frameworks. In this paper we propose a novel Radar-Inertial Odometry (RIO) system that integrates an automotive radar and an inertial measurement unit. The key contribution is the integration of online temporal delay calibration within the factor graph optimization framework that compensates for potential time offsets between radar and IMU measurements. To validate the proposed approach we have conducted thorough experimental analysis on real-world radar and IMU data. The results show that, even without scan matching or target tracking, integration of online temporal calibration significantly reduces localization error compared to systems that disregard time synchronization, thus highlighting the important role of, often neglected, accurate temporal alignment in radar-based sensor fusion systems for autonomous navigation. Project website: https://rio-online-t.github.io/.

Abstract:
In recent years, vision-language models (VLMs) have advanced open-vocabulary mapping, enabling mobile robots to simultaneously achieve environmental reconstruction and high-level semantic understanding. While integrated object cognition helps mitigate semantic ambiguity in point-wise feature maps, efficiently obtaining rich semantic understanding and robust incremental reconstruction at the instance-level remains challenging. To address these challenges, we introduce OpenVox, a real-time incremental open-vocabulary probabilistic instance voxel representation. In the front-end, we design an efficient instance segmentation and comprehension pipeline that enhances language reasoning through encoding captions. In the back-end, we implement probabilistic instance voxels and formulate the cross-frame incremental fusion process into two subtasks: instance association and live map evolution, ensuring robustness to sensor and segmentation noise. Extensive evaluations across multiple datasets demonstrate that OpenVox achieves state-of-the-art performance in zero-shot instance segmentation, semantic segmentation, and open-vocabulary retrieval. The project page of OpenVox is available at https://open-vox.github.io/.

Abstract:
With the rapid integration of robotics across diverse sectors, human interaction with these technologies is becoming inevitable. Ensuring safety is increasingly crucial to prevent injuries and maintain effective interactions. Accurate force estimation enables robots to sense contact forces and respond appropriately. This paper presents a vision-based estimation method for multi-contact physical human-robot interaction. Utilizing an RGB-D sensor, it detects 3D hand positions to identify contact points and employs a generalized momentum observer to distinguish joint torques from external wrenches. A long short-term memory network compensates for uncertainties arising from unmodelled dynamics. Addressing challenges like wrench null space and Jacobian singularities, the approach identifies computable external wrench components. The method achieves a 0.9 N estimation error in complex, multi-contact interactions, enhancing safety and responsiveness. Key contributions include a novel wrench identification method leveraging robot configuration and contact points, derived from a vision-based system, to enhance real-time estimation.

Abstract:
Many existing visual SLAM methods can achieve high localization accuracy in dynamic environments by leveraging deep learning to mask moving objects. However, these methods incur significant computational overhead as the camera tracking needs to wait for the deep neural network to generate a mask at each frame, and they typically require GPUs for real-time operation, which restricts their practicality in real-world robotic applications. Therefore, this paper proposes a real-time dynamic SLAM system that runs exclusively on a CPU. Our approach incorporates a mask propagation mechanism that decouples camera tracking and deep learning-based masking for each frame. We also introduce a hybrid tracking strategy that integrates ORB features with optical flow methods, enhancing both robustness and efficiency by selectively allocating computational resources to input frames. Compared to previous methods, our system maintains high localization accuracy in dynamic environments while achieving a tracking frame rate of 60 FPS on a laptop CPU. These results demonstrate the feasibility of utilizing deep learning for dynamic SLAM without GPU support. Since most existing dynamic SLAM systems are not open-source, we make our code publicly available at: https://github.com/yuhaozhang7/NGD-SLAM

Abstract:
We investigate the sampling-based optimal path planning problem for robotics in complex and dynamic environments. Most existing sampling-based algorithms neglect environmental information or the information from previous samples. Yet, these pieces of information are highly informative, as leveraging them can provide better heuristics when sampling the next state. In this paper, we propose a novel sampling-based planning algorithm, called RRTformer, which integrates the standard RRT algorithm with a Transformer network in a novel way. Specifically, the Transformer is used to extract features from the environment and leverage information from previous samples to better guide the sampling process. Our extensive experiments demonstrate that, compared to existing sampling-based approaches such as RRT, Neural RRT, and their variants, our algorithm achieves considerable improvements in both the optimality of the path and sampling efficiency. The code for our implementation is available on https://github.com/fengmingyang666/RRTformer.

Abstract:
Task-oriented grasping (TOG) involves grasping specific parts of an object based on a task instruction. Existing methods generally integrate semantic analysis with grasp detection, resulting in low efficiency and poor generalization when dealing with new tasks and hardware. To address these problems, we propose SemSegGrasp, which reformulates TOG as a semantic segmentation problem based on point cloud and text matching. By decomposing TOG into semantic segmentation and grasp detection, SemSegGrasp can significantly enhance both the efficiency and generalization performance of TOG. Moreover, it can be combined with any off-the-shelf grasp detection algorithms in a plug-and-play manner. For semantic segmentation, SemSegGrasp first utilizes a Vision-Language Model (VLM) to generate local geometric descriptions of the target object. These descriptions are then fed into a Large Language Model (LLM) along with user instructions to obtain operational guidance. Subsequently, we separately encode the input point cloud and the operational guidance and obtain their features. Leveraging multi-head cross-attention, we conduct a matching process between these two types of features to predict the probability of each point serving as a TOG grasp point, i.e. semantic segmentation. Finally, the grasp pose is determined by fusing the segmentation results with the candidate poses generated by an existing grasp detection algorithm. Experimental results on the publicly available TaskGrasp dataset and a real-world setting show that our SemSegGrasp method achieves state-of-the-art performance, outperforming existing methods by at least 5% and 10% on new tasks, respectively.

Abstract:
Hybrid aerial-underwater vehicles (HAUVs) are attracting significant interest for their unique capability to operate across both air and water. However, achieving a lightweight design coupled with efficient cross-domain performance remains a formidable challenge. This paper introduces Nezha-T, a novel, ultra-lightweight HAUV featuring a dual-stable floating state capability. This is achieved through an innovative center-of-gravity (CoG) arrangement method, which enables seamless transitions between upright and horizontal floating postures. This ability to float stably in either orientation is crucial for stable water entry and exit, mitigating impact forces on the vehicle and its payload. Furthermore, to counteract the residual buoyancy inherent in this design, a zero-lift pitch angle is incorporated into the control system, improving depth-keeping and pitch control performance. The proposed design was rigorously validated through computational fluid dynamics (CFD) simulations, pool experiments, and open-water field tests. The results confirm the feasibility of the dual-stable floating states, demonstrate stable cross-domain traversal, verify the effectiveness of the depth-keeping control system, and validate the vehicle’s fixed-wing flight capability.

Abstract:
This study proposes a novel solution for transferring the working environment of remotely operated vehicle (ROV) operators from the support ship at sea to land, including the establishment of a satellite-based communication link between the ocean and the land-based control center (LCC), which is used to transfer information efficiently. To alleviate the cognitive burden of latency on operators on land, a cross-domain underwater intervention hierarchical control architecture is designed to assist operators by introducing a shared control strategy. The effectiveness of the designed system and control strategy in realizing cross-domain underwater interventions is verified through field experiments.

Abstract:
Soft robots and smart materials have seen rapid advancements in recent years, with significant potential applications in medical devices. Liquid crystal elastomers (LCEs) exhibit unique attributes of large deformations and diverse actuation modes, facilitating controllable bending in soft medical catheters and thereby enhancing their maneuverability during medical procedures. However, LCEs exhibit strong hysteresis, which makes their modeling and control challenging. In this paper, we develop a dynamic model of a light-stimulated LCE to describe its nonlinear time-dependent behavior. We first derive the relationship between the input laser power and the resulting temperature change of the LCE actuator, and then analyze the viscoelastic behavior by taking advantage of a spring-dashpot frame. For both the linear contraction actuator and the bending actuator, the dynamic equations can describe their behavior with acceptable errors. In the future, we will further test the LCE-based bending actuator of optimal design, and then perform real-time control of soft catheters with the assistance of LCE actuators.

Abstract:
Accurate trajectory prediction is crucial for autonomous driving, yet uncertainty in agent behavior and perception noise makes it inherently challenging. While multi-modal trajectory prediction models generate multiple plausible future paths with associated probabilities, effectively quantifying uncertainty remains an open problem. In this work, we propose a novel multi-modal trajectory prediction approach based on evidential deep learning that estimates both positional and mode probability uncertainty in real time. Our approach leverages a Normal Inverse Gamma distribution for positional uncertainty and a Dirichlet distribution for mode uncertainty. Unlike sampling-based methods, it infers both types of uncertainty in a single forward pass, significantly improving efficiency. Additionally, we experimented with uncertainty-driven importance sampling to improve training efficiency by prioritizing underrepresented high-uncertainty samples over redundant ones. We perform extensive evaluations of our method on the Argoverse 1 and Argoverse 2 datasets, demonstrating that it provides reliable uncertainty estimates while maintaining high trajectory prediction accuracy.
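
The way a single forward pass can yield both kinds of uncertainty can be illustrated with the closed-form moments commonly used in evidential deep learning. The sketch below is a minimal illustration, not the authors' implementation: the names `NIG`, `nig_uncertainty`, and `dirichlet_modes` are hypothetical, and the abstract does not state which parameterization or uncertainty scores the paper actually uses.

```python
from dataclasses import dataclass


@dataclass
class NIG:
    """Normal Inverse Gamma parameters for one predicted coordinate (hypothetical container)."""
    gamma: float  # predicted mean
    nu: float     # virtual observation count for the mean (> 0)
    alpha: float  # shape (> 1 so the variances below are finite)
    beta: float   # scale (> 0)


def nig_uncertainty(p: NIG) -> tuple[float, float, float]:
    """Return (prediction, aleatoric variance, epistemic variance) in closed form."""
    aleatoric = p.beta / (p.alpha - 1.0)           # E[sigma^2]
    epistemic = p.beta / (p.nu * (p.alpha - 1.0))  # Var[mu]
    return p.gamma, aleatoric, epistemic


def dirichlet_modes(evidence: list[float]) -> tuple[list[float], float]:
    """Map non-negative per-mode evidence to mode probabilities and a vacuity score."""
    alphas = [e + 1.0 for e in evidence]  # assumed mapping: alpha_k = evidence_k + 1
    strength = sum(alphas)
    probs = [a / strength for a in alphas]
    vacuity = len(alphas) / strength      # equals 1.0 when there is no evidence at all
    return probs, vacuity
```

Because every quantity is a simple function of the network outputs, no sampling or ensembling is needed at inference time, which is the efficiency argument made in the abstract.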

Abstract:
Performance evaluation is critical for ensuring the accuracy, efficiency, and reliability of industrial robotic arms. Traditional measurement methods, including contact-based techniques (e.g., coordinate measuring machines and ball-bar systems) and non-contact systems (e.g., laser trackers and optical coordinate measuring machines), offer high precision but are often costly, complex to install, and constrained by environmental factors. To address these limitations, this study proposes a SLAM-based performance evaluation method that leverages LiDAR to track robotic motion without requiring external calibration references. This approach provides a cost-effective and flexible alternative to conventional metrology techniques. However, integrating SLAM into the ISO 9283 framework presents challenges related to accuracy, stability, and measurement consistency. To assess its feasibility, this study evaluates the SLAM-based system by analyzing key performance parameters, ensuring its alignment with industrial requirements. The results demonstrate that the LiDAR-based SLAM system achieves an RMSE of 0.0353 mm in trajectory estimation, confirming its precision and stability. These findings validate the system’s capability as a reliable benchmarking tool for robotic arm performance assessment.
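
As a rough illustration of the metrics involved, the sketch below computes a trajectory RMSE and the positional accuracy and repeatability figures in the form usually quoted from ISO 9283 (barycentre offset, and mean radius plus three standard deviations). It is a sketch under assumptions: the function names are hypothetical, and the abstract does not say which ISO 9283 criteria or which RMSE definition the study evaluates.

```python
import math

Point = tuple[float, float, float]


def rmse(estimated: list[Point], reference: list[Point]) -> float:
    """Root-mean-square of point-wise Euclidean errors between two equal-length paths."""
    squared = [math.dist(e, r) ** 2 for e, r in zip(estimated, reference)]
    return math.sqrt(sum(squared) / len(squared))


def pose_accuracy_and_repeatability(attained: list[Point], commanded: Point) -> tuple[float, float]:
    """Positional accuracy (barycentre offset) and repeatability (mean + 3 sigma of radii)."""
    n = len(attained)
    # Barycentre of the attained positions over repeated visits to one commanded pose.
    centre = tuple(sum(p[i] for p in attained) / n for i in range(3))
    accuracy = math.dist(centre, commanded)
    radii = [math.dist(p, centre) for p in attained]
    mean_r = sum(radii) / n
    # Sample standard deviation of the radii (needs at least two attained positions).
    std_r = math.sqrt(sum((r - mean_r) ** 2 for r in radii) / (n - 1))
    return accuracy, mean_r + 3.0 * std_r
```

In this setting the attained positions would come from the SLAM pose estimates rather than from a laser tracker, so any drift in the SLAM output enters both figures directly.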

Abstract:
Piano-playing tasks, which effectively demonstrate bimanual coordination capabilities in humanoid robots, are increasingly becoming a research focus. However, prior research has predominantly focused on Cartesian space trajectory planning without adequately addressing real-world obstacle avoidance constraints and manipulator acceleration limits. This paper proposes a hierarchical trajectory planning framework that systematically incorporates both obstacle avoidance and acceleration constraints. Firstly, discrete Cartesian path points are generated using a dynamic programming approach; secondly, joint space path points are derived considering obstacle avoidance and joint limit constraints through dynamic programming; thirdly, the joint space trajectory is interpolated using a Jacobian inverse-based method; finally, the trajectory is refined using Model Predictive Control (MPC). Experimental results demonstrate that the proposed method produces trajectories satisfying both obstacle avoidance and acceleration constraints, enabling fluent piano piece execution in real-world environments.

Abstract:
Developing 3D semantic occupancy prediction models often relies on dense 3D annotations for supervised learning, a process that is both labor- and resource-intensive, underscoring the need for label-efficient or even label-free approaches. To address this, we introduce MinkOcc, a multi-modal 3D semantic occupancy prediction framework for cameras and LiDARs that proposes a two-step semi-supervised training procedure. Here, a small dataset of explicit 3D annotations warm-starts the training process; then, the supervision is continued by simpler-to-annotate accumulated LiDAR sweeps and images, semantically labelled through vision foundation models. MinkOcc effectively utilizes these sensor-rich supervisory cues and reduces reliance on manual labeling by 90% while maintaining competitive accuracy. In addition, the proposed model incorporates information from LiDAR and camera data through early fusion and leverages sparse convolution networks for real-time prediction. With its efficiency in both supervision and computation, we aim to extend MinkOcc beyond curated datasets, enabling broader real-world deployment of 3D semantic occupancy prediction in autonomous driving.

Abstract:
Fully autonomous vehicles promise enhanced safety and efficiency. However, ensuring reliable operation in challenging corner cases requires control algorithms capable of performing at the vehicle limits. We address this requirement by considering the task of autonomous racing and propose solving it by learning a racing policy using Reinforcement Learning (RL). Our approach leverages domain randomization, actuator dynamics modeling, and policy architecture design to enable reliable and safe zero-shot deployment on a real platform. Evaluated on the F1TENTH race car, our RL policy not only surpasses a state-of-the-art Model Predictive Control (MPC), but, to the best of our knowledge, also represents the first instance of an RL policy outperforming expert human drivers in RC racing. This work identifies the key factors driving this performance improvement, providing critical insights for the design of robust RL-based control strategies for autonomous vehicles.

Abstract:
Quadrotor Morpho-Transition, or the act of transitioning from air to ground through mid-air transformation, involves complex aerodynamic interactions and a need to operate near actuator saturation, complicating controller design. In recent work, morpho-transition has been studied from a model-based control perspective, but these approaches remain limited due to unmodeled dynamics and the requirement for planning through contacts. Here, we train an end-to-end Reinforcement Learning (RL) controller to learn a morpho-transition policy and demonstrate successful transfer to hardware. We find that the RL control policy achieves agile landing, but only transfers to hardware if motor dynamics and observation delays are taken into account. On the other hand, a baseline MPC controller transfers out-of-the-box without knowledge of the actuator dynamics and delays, at the cost of reduced recovery from disturbances in the event of unknown actuator failures. Our work opens the way for more robust control of agile in-flight quadrotor maneuvers that require mid-air transformation.

Abstract:
Industrial part reorientation remains a critical challenge in automated manufacturing workflows, particularly with parallel-jaw grippers lacking the dexterity for complex manipulations. This paper presents a systematic flipping strategy for structured environments. A quasi-static force equilibrium model is developed to characterize multi-contact manipulation systems, and stability criteria are derived through wrench space analysis, enabling A*-based optimal trajectory generation within the derived stable configuration space. To ensure persistent fingertip-object contact, adaptive impedance control dynamically adjusts gripper stiffness based on real-time force thresholds, preventing unintended detachment. Experimental validation demonstrates robust performance in two representative scenarios: 1) cube flipping on a compliant surface (84.3 g, 90% success over 50 trials), and 2) vision-free continuous pivoting of an irregular part on a rigid substrate (56 g, 88% success over 50 trials). The methodology requires neither environmental modification nor expensive tactile sensing, showing promise for practical deployment in structured manufacturing systems.

Abstract:
Existing Vehicle-to-Everything (V2X) cooperative perception methods rely on accurate multi-agent 3D annotations. Nevertheless, it is time-consuming and expensive to collect and annotate real-world data, especially for V2X systems. In this paper, we present CooPre, a self-supervised learning framework for V2X cooperative perception, which utilizes the vast amount of unlabeled 3D V2X data to enhance the perception performance. Specifically, multi-agent sensing information is aggregated to form a holistic view and a novel proxy task is formulated to reconstruct the LiDAR point clouds across multiple connected agents to better reason about multi-agent spatial correlations. Besides, we develop a V2X bird's-eye-view (BEV) guided masking strategy which effectively allows the model to pay attention to 3D features across heterogeneous V2X agents (i.e., vehicles and infrastructure) in the BEV space. Such a masking strategy effectively pretrains the 3D encoder with a multi-agent LiDAR point cloud reconstruction objective and is compatible with mainstream cooperative perception backbones. Our approach, validated through extensive experiments on representative datasets (i.e., V2X-Real, V2V4Real, and OPV2V) and multiple state-of-the-art cooperative perception methods (i.e., AttFuse, F-Cooper, and V2X-ViT), leads to a performance boost across all V2X settings. Notably, CooPre achieves a 4% mAP improvement on the V2X-Real dataset and surpasses baseline performance using only 50% of the training data, highlighting its data efficiency. Additionally, we demonstrate the framework’s powerful performance in cross-domain transferability and robustness under challenging scenarios. The code will be made publicly available at https://github.com/ucla-mobility/CooPre.

Abstract:
We present MPDG-SLAM, a novel 3D Gaussian point cloud rendering SLAM method based on Motion Probability (MP) for dynamic interference handling. Current 3DGS-SLAM approaches for dynamic environments often rely on optical flow estimation masks. However, these deep learning-based optical flow models are computationally intensive and limited by processing speed, posing challenges for deployment on mobile devices in real-world scenarios. Moreover, existing systems depend on precise mask segmentation and corresponding loss functions for artifact removal, yet the pixel accuracy of optical flow estimation is constrained by real-world lighting conditions. To address these issues, we introduce a mobile-deployable YOLO and a mathematically derived Motion Probability (MP) attribute to label Gaussian points, which are then inversely mapped to the front-end feature tracking system to correct for dynamic object influences. By incorporating an MP-based penalty term, dynamic Gaussians corresponding to moving entities are explicitly removed to minimize their effect. Additionally, we design an edge warp loss based on MP estimation, enabling accurate artifact removal even with coarse segmentation masks. The experiments show that our approach notably improves the reconstruction quality of dynamic scenes, surpassing baseline methods and reaching speeds over 30 FPS on high-end GPUs, which suggests its potential for real-time use on mobile platforms after further optimization.

Abstract:
The primary objective of this paper is to introduce a new monocular depth estimation (MDE) model targeting under-represented environments using a novel dataset combining synthetic and real images. The proposed model is small and fast to allow use for UAS navigation and data collection in rainforest environments. Prior works on MDEs target outdoor environments while focusing on urban, ground-level viewpoints due to interest in self-driving or autonomous package delivery applications and data availability. However, under-represented environments, such as rainforests, can benefit from targeted, environment-specific MDEs because existing general MDEs cannot adapt to extreme environmental differences, leading to high error rates. Our model is trained using a distinct rainforest dataset that combines images generated using a synthetic dataset pipeline and depth images collected from aerial robot deployments in the Children’s Eternal Rainforest in Costa Rica. The proposed model will allow for improved rainforest navigation without using expensive LIDAR sensors and can improve the navigation of a UAS in rainforest environments by providing more accurate and useful measurements for object manipulation, such as leaf sampling. Our model outperforms MiDaS across the board and has over a 75% improvement, specifically in the relative error metrics, while maintaining a low runtime. The resulting model matches the performance of state-of-the-art monocular depth estimation models designed for common environments, i.e., urban and indoor environments, and outperforms them when used in a rainforest environment.

Abstract:
Recent advancements in 3D Gaussian Splatting (3DGS) have made a significant impact on rendering and reconstruction techniques. Current research predominantly focuses on improving rendering performance and reconstruction quality using high-performance desktop GPUs, largely overlooking applications for embedded platforms like micro air vehicles (MAVs). These devices, with their limited computational resources and memory, often face a trade-off between system performance and reconstruction quality. In this paper, we improve existing methods in terms of GPU memory usage while enhancing rendering quality. Specifically, to address redundant 3D Gaussian primitives in SLAM, we propose merging them in voxel space based on geometric similarity. This reduces GPU memory usage without impacting system runtime performance. Furthermore, rendering quality is improved by initializing 3D Gaussian primitives via Patch-Grid (PG) point sampling, enabling more accurate modeling of the entire scene. Quantitative and qualitative evaluations on publicly available datasets demonstrate the effectiveness of our improvements.

Abstract:
Realistic animatable human avatars from monocular videos are crucial for advancing human-robot interaction and enhancing immersive virtual experiences. While recent research on 3DGS-based human avatars has made progress, it still struggles with accurately representing detailed features of non-rigid objects (e.g., clothing deformations) and dynamic regions (e.g., rapidly moving limbs). To address these challenges, we present STG-Avatar, a 3DGS-based framework for high-fidelity animatable human avatar reconstruction. Specifically, our framework introduces a rigid-nonrigid coupled deformation framework that synergistically integrates Spacetime Gaussians (STG) with linear blend skinning (LBS). In this hybrid design, LBS enables real-time skeletal control by driving global pose transformations, while STG complements it through spacetime-adaptive optimization of 3D Gaussians. Furthermore, we employ optical flow to identify high-dynamic regions and guide the adaptive densification of 3D Gaussians in these regions. Experimental results demonstrate that our method consistently outperforms state-of-the-art baselines in both reconstruction quality and operational efficiency, achieving superior quantitative metrics while retaining real-time rendering capabilities. Our code is available at https://github.com/jiangguangan/STG-Avatar
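
For readers unfamiliar with linear blend skinning, the rigid part of such a hybrid design can be sketched in a few lines: each point is moved by a weighted sum of per-bone rigid transforms. The example below is a 2D toy under stated assumptions; `rigid_2d` and `blend_skin` are hypothetical names, and it does not reproduce how STG-Avatar applies LBS to 3D Gaussians or how the Spacetime Gaussian term is added on top.

```python
import math

Transform = tuple[tuple[float, float, float], tuple[float, float, float]]


def rigid_2d(angle: float, tx: float, ty: float) -> Transform:
    """Build a 2x3 rigid transform (rotation by `angle`, then translation)."""
    c, s = math.cos(angle), math.sin(angle)
    return ((c, -s, tx), (s, c, ty))


def blend_skin(vertex: tuple[float, float], weights: list[float],
               bone_transforms: list[Transform]) -> tuple[float, float]:
    """Linear blend skinning: v' = sum_i w_i * (T_i v), with weights summing to 1."""
    x, y = vertex
    out_x = out_y = 0.0
    for w, ((a, b, tx), (c, d, ty)) in zip(weights, bone_transforms):
        out_x += w * (a * x + b * y + tx)
        out_y += w * (c * x + d * y + ty)
    return out_x, out_y
```

A point weighted equally between a static bone and a bone translated by two units moves by one unit, which is the smooth interpolation across joints that makes LBS cheap enough for real-time skeletal control.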

Abstract:
Event-based cameras have emerged as a promising paradigm for robot perception, offering advantages with high temporal resolution, high dynamic range, and robustness to motion blur. However, existing deep learning-based event processing methods often fail to fully leverage the sparse nature of event data, complicating their integration into resource-constrained edge applications. While neuromorphic computing provides an energy-efficient alternative, spiking neural networks struggle to match the performance of state-of-the-art models in complex event-based vision tasks, like object detection and optical flow. Moreover, achieving high activation sparsity in neural networks is still difficult and often demands careful manual tuning of sparsity-inducing loss terms. Here, we propose Context-aware Sparse Spatiotemporal Learning (CSSL), a novel framework that introduces context-aware thresholding to dynamically regulate neuron activations based on the input distribution, naturally reducing activation density without explicit sparsity constraints. Applied to event-based object detection and optical flow estimation, CSSL achieves comparable or superior performance to state-of-the-art methods while maintaining extremely high neuronal sparsity. Our experimental results highlight CSSL’s crucial role in enabling efficient event-based vision for neuromorphic processing.

Abstract:
Metric-semantic 3D mapping is the process of creating class-labeled 3D maps by fusing the information from images captured by a moving camera. The memory usage required by standard solutions grows linearly with the number of semantic classes being considered, which can pose a bottleneck in large and many-class scenes. This paper proposes two novel methods for compressing the memory used by semantic fusion: calibrated top-k histogram and encoded fusion. The first method maintains, for each voxel, only the counts of the k most likely classes, while the second method uses a neural network to encode all-class probability vectors into a k-dimensional latent space in which per-voxel fusion is performed. The fused result is then decoded, at query time, using another neural network. Experiments show that both methods preserve map accuracy and calibration even at low values of k, and per-voxel memory usage is linear in k. The proposed methods can achieve real-time semantic fusion with 150 classes on commodity GPUs in building-scale scenes where prior approaches run out of memory.
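
The memory argument for the first method can be made concrete with a bounded per-voxel histogram. The sketch below is an illustration under assumptions, not the paper's algorithm: `fuse_top_k` and `most_likely` are hypothetical names, the eviction rule is borrowed from the Space-Saving frequent-items algorithm, and the calibration step mentioned in the abstract is not described there and is not reproduced here.

```python
def fuse_top_k(counts: dict[int, int], label: int, k: int) -> None:
    """Update one voxel's bounded class histogram in place with a new observed label."""
    if label in counts:
        counts[label] += 1
    elif len(counts) < k:
        counts[label] = 1
    else:
        # Histogram is full: evict the weakest class, Space-Saving style, so the
        # newcomer inherits that count plus one (an over-estimate by design).
        weakest = min(counts, key=counts.get)
        counts[label] = counts.pop(weakest) + 1


def most_likely(counts: dict[int, int]) -> int:
    """Return the class with the largest stored count."""
    return max(counts, key=counts.get)
```

Each voxel stores at most k (class, count) pairs regardless of how many classes the segmentation network can output, which is why per-voxel memory grows with k rather than with the number of classes.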

Abstract:
Vision-based tactile sensing, an economical and widely utilized methodology, has the potential to offer crucial contact geometry information for localizing objects even in cases of visual occlusion. However, this kind of fingertip sensor is limited. When a person picks up a relatively small object placed on a flat surface with two fingers, they may not only use the pads of their fingers depending on the size of the object but also use their fingernails for small or thin objects. Fingers with nail structures have been shown to be effective in picking up objects like this in robot hands as well. Moreover, in actual work, accidental contact between sensors and surrounding objects such as tables often occurs. Sensors with fingernails can avoid this situation in advance by having the fingernails touch the object before the fingertip touches the object. In this work, we present NailTact, which can detect the force applied to both the fingertip part and the nail part from the same camera image using a single camera. Using the prototype robot finger, we verify the sensor response characteristics to the load on the nail, the sensor response when grasping an object with the nail, and the response when the finger makes contact with a table. We also present a simple model that illustrates the relationship between the force applied to the nail and the movement of the marker. In the card-grasping experiment, we not only successfully grasped a very thin object but also measured the grasping force.

Abstract:
In this paper, we introduce a sensor designed for robotic fingers which can provide information on the displacements induced by external forces. Our sensor uses LEDs to sense the displacement between two plates connected by a transparent elastomer; when a force is applied to the finger, the elastomer displaces and the LED signals change. We show that using LEDs as both light emitters and receivers in this context provides high sensitivity, allowing such emitter-receiver pairs to detect very small displacements. We characterize the standalone performance of the sensor by testing the ability of a supervised learning model to predict complete force and torque data from its raw signals, and obtain a mean error between 0.05 and 0.07 N across the three directions of force applied to the finger. Our method allows for compact packaging (fitting at the base of a finger) with no amplification electronics, low-cost manufacturing, easy integration into a complete hand, and high overload shear forces and bending torques, suggesting future applicability to complete manipulation tasks.

Abstract:
Simulating the complex interactions between soft tissues and rigid anatomy is critical for applications in surgical training, planning, and robotic-assisted interventions. Traditional Finite Element Method (FEM)-based simulations, while accurate, are computationally expensive and impractical for real-time scenarios. Learning-based approaches have shown promise in accelerating predictions but have fallen short in modeling soft-rigid interactions effectively. We introduce MIXPINN, a physics-informed Graph Neural Network (GNN) framework for mixed-material simulations, explicitly capturing soft-rigid interactions using graph-based augmentations. Our approach integrates Virtual Nodes (VNs) and Virtual Edges (VEs) to enhance rigid body constraint satisfaction while preserving computational efficiency. By leveraging a graph-based representation of biomechanical structures, MIXPINN learns high-fidelity deformations from FEM-generated data and achieves real-time inference with sub-millimeter accuracy. We validate our method in a realistic clinical scenario, demonstrating superior performance compared to baseline GNN models and traditional FEM methods. Our results show that MIXPINN reduces computational cost by an order of magnitude while maintaining high physical accuracy, making it a viable solution for real-time surgical simulation and robotic-assisted procedures.

Abstract:
Existing robotic trajectory planning frameworks typically approximate the robot’s geometry and environmental constraints. While this improves computational efficiency, it sacrifices the solution space and frequently encounters failure in confined environments. However, attaining a precise geometric representation and a continuous collision-free trajectory usually necessitates greater computational expenditure. This paper proposes a methodology that utilizes the concept of swept volume to address the identified limitations. The implementation of an efficient Swept Volume Signed Distance Field computation algorithm and a B-spline trajectory representation results in a significant increase in computational efficiency while maintaining strict safety guarantees. The proposed method combines the advantages of efficiency and maximal exploitation of the solution space. Additionally, it ensures continuous obstacle avoidance, achieving real-time 10Hz replanning performance on i50000 NUC11TNK for arbitrarily shaped rigid objects in complex, unstructured environments.
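
The swept-volume idea can be illustrated with a brute-force version: the signed distance from a query point to the volume swept by a moving body is approximated by the minimum, over time, of the body's own signed distance evaluated in the body frame. The sketch below is a 2D toy under assumptions (a box-shaped body, uniform time sampling, hypothetical names `box_sdf` and `swept_sdf`); it is not the efficient computation algorithm the paper proposes, and it ignores the B-spline trajectory representation.

```python
import math
from typing import Callable


def box_sdf(px: float, py: float, half_w: float, half_h: float) -> float:
    """Signed distance from a point to an axis-aligned box centred at the origin."""
    dx, dy = abs(px) - half_w, abs(py) - half_h
    outside = math.hypot(max(dx, 0.0), max(dy, 0.0))
    inside = min(max(dx, dy), 0.0)
    return outside + inside


def swept_sdf(px: float, py: float,
              pose_at: Callable[[float], tuple[float, float, float]],
              half_w: float, half_h: float, samples: int = 200) -> float:
    """Approximate swept-volume SDF: minimum body SDF over sampled times in [0, 1]."""
    best = math.inf
    for i in range(samples + 1):
        x, y, theta = pose_at(i / samples)
        # Express the query point in the body frame at this time sample.
        c, s = math.cos(theta), math.sin(theta)
        lx = c * (px - x) + s * (py - y)
        ly = -s * (px - x) + c * (py - y)
        best = min(best, box_sdf(lx, ly, half_w, half_h))
    return best
```

For points outside the swept region the minimum over time matches the true distance up to the time-sampling error; a positive value for every obstacle point therefore indicates that the whole motion, not just its waypoints, stays collision-free.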

Abstract:
Most current LiDAR-based odometry methods use point-to-local plane registration to constrain poses, ignoring the explicit plane structure in the environment. Due to noise interference and uneven distribution of the point cloud, local planes are prone to tilt, resulting in registration errors. Therefore, we propose MSPA-LIO, a LiDAR-Inertial odometry with multi-scale plane adjustment, which uses geometric constraints and plane adjustment at both local voxel plane scale and large plane scale to improve odometry accuracy and enhance map consistency. In order to make full use of the planar structure in the environment, we propose a voxel-based explicit large plane extraction method. We use large planes to correct the direction of the associated voxel planes, thereby overcoming the misregistration problem caused by local plane tilt. To further improve the odometry accuracy, we perform plane adjustments at the voxel plane scale and the large plane scale to make the pose and map more consistent. Experiments conducted on the VECtor Dataset and the Newer College Dataset demonstrate that our proposed algorithm outperforms four state-of-the-art algorithms.

Abstract:
General-purpose object placement is a fundamental capability of an intelligent generalist robot: being capable of rearranging objects following precise human instructions even in novel environments. This work is dedicated to achieving general-purpose object placement with "something something" instructions. Specifically, we break the entire process down into three parts, including object localization, goal imagination and robot control, and propose a method named SPORT. SPORT leverages a pre-trained large vision model for broad semantic reasoning about objects, and learns a diffusion-based pose estimator to ensure physically-realistic results in 3D space. Only object types (movable or reference) are communicated between these two parts, which brings two benefits. One is that we can fully leverage the powerful ability of open-set object recognition and localization since no specific fine-tuning is needed for the robotic scenario. Moreover, the diffusion-based estimator only needs to "imagine" the object poses after the placement, with no need for their semantic information. Thus the training burden is greatly reduced and no massive training is required. The training data for the goal pose estimation is collected in simulation and annotated by using GPT-4. Experimental results demonstrate the effectiveness of our approach. SPORT can not only generate promising 3D goal poses for unseen simulated objects, but also be seamlessly applied to real-world settings.

Abstract:
Tactile perception is essential for real-world manipulation tasks, yet the high cost and fragility of tactile sensors can limit their practicality. In this work, we explore BeadSight (a low-cost, open-source tactile sensor) alongside a tactile pre-training approach, an alternative method to precise, pre-calibrated sensors. By pre-training with the tactile sensor and then disabling it during downstream tasks, we aim to enhance robustness and reduce costs in manipulation systems. We investigate whether tactile pre-training, even with a low-fidelity sensor like BeadSight, can improve the performance of an imitation learning agent on complex manipulation tasks. Through visuo-tactile pre-training on both similar and dissimilar tasks, we analyze its impact on a longer-horizon downstream task. Our experiments show that visuo-tactile pre-training improved performance on a USB cable plugging task by up to 65% with vision-only inference. Additionally, on a longer-horizon drawer pick-and-place task, pre-training — whether on a similar, dissimilar, or identical task — consistently improved performance, highlighting the potential for a large-scale visuo-tactile pre-trained encoder. Code for this project is available at: https://github.com/selamie/beadsight.

Abstract:
Current methods to learn controllers for autonomous vehicles (AVs) focus on behavioural cloning. Being trained only on exact historic data, the resulting agents often generalize poorly to novel scenarios. Simulators provide the opportunity to go beyond offline datasets, but they are still treated as complicated black boxes, only used to update the global simulation state. As a result, these RL algorithms are slow, sample-inefficient, and prior-agnostic. In this work, we leverage a differentiable simulator and design an analytic policy gradients (APG) approach to training AV controllers on the large-scale Waymo Open Motion Dataset. Our proposed framework brings the differentiable simulator into an end-to-end training loop, where gradients of the environment dynamics serve as a useful prior to help the agent learn a more grounded policy. We combine this setup with a recurrent architecture that can efficiently propagate temporal information across long simulated trajectories. This APG method allows us to learn robust, accurate, and fast policies, while only requiring widely-available expert trajectories, instead of scarce expert actions. We compare to behavioural cloning and find significant improvements in performance and robustness to noise in the dynamics, as well as overall more intuitive human-like handling.
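
The core idea of analytic policy gradients, differentiating the imitation loss through the simulator step itself rather than estimating gradients from sampled returns, can be shown on a toy problem. The sketch below is an assumption-laden illustration: the one-dimensional dynamics, the single-parameter proportional policy, and the name `rollout_loss_and_grad` are all hypothetical and unrelated to the paper's simulator, recurrent architecture, or the Waymo Open Motion Dataset.

```python
def rollout_loss_and_grad(gain: float, expert: list[float], dt: float = 0.1) -> tuple[float, float]:
    """Simulate x' = x + u*dt with u = gain*(target - x); return loss and dloss/dgain.

    `expert` is an expert state trajectory (states only, no expert actions).
    """
    x, dx_dgain = expert[0], 0.0
    loss, grad = 0.0, 0.0
    for target in expert[1:]:
        error = target - x
        # Differentiate the simulator step analytically with respect to the gain:
        # d(x + dt*gain*(target - x))/dgain = dx_dgain + dt*(error - gain*dx_dgain).
        dx_dgain = dx_dgain + dt * (error - gain * dx_dgain)
        x = x + dt * gain * error
        loss += (x - target) ** 2
        grad += 2.0 * (x - target) * dx_dgain
    return loss, grad
```

The gradient is carried forward through every simulated step, so a single rollout yields an exact derivative of the trajectory loss; in practice an automatic differentiation framework would replace the hand-derived sensitivity update used here.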

Abstract:
A deformable agent can continuously adjust its morphology during training, allowing it to discover more suitable structures and outperform fixed-morphology counterparts in terrain-specific tasks. This adaptability is achieved through a joint optimization process consisting of two stages: the Skeleton Transform stage which modifies the agent’s morphology and the Execution stage which optimizes the control policy. However, enabling a deformable agent to continuously learn new policies for different terrains without forgetting previous tasks remains a major challenge. Continuous terrain changes can easily disrupt previously learned strategies, making it difficult to adapt to new tasks while maintaining performance on earlier ones. In this work, we focus on lifelong morphology learning for deformable agents that must adaptively traverse a sequence of diverse terrains. We propose Ske-Ex, a lifelong learning framework where both the Skeleton Transform and Execution stages are designed for lifelong adaptation. Unlike existing methods that optimize only control policies under fixed morphologies, Ske-Ex supports joint adaptation of structure and control, making it better suited for deformable agents. We adopt a regularization-based approach as our lifelong learning strategy, as it avoids the need to store large amounts of prior task data. Experimental results show that Ske-Ex exhibits strong resistance to forgetting and superior generalization, and that the joint optimization of both modules outperforms using either stage alone. Additionally, we introduce a flexible MuJoCo terrain benchmark to facilitate future research on lifelong learning for deformable agents. Our demonstration videos are available at https://johncenavsbatista.github.io/Ske-Ex/

Abstract:
Cameras and LiDAR are essential sensors for autonomous vehicles. The fusion of camera and LiDAR data addresses the limitations of individual sensors but relies on precise extrinsic calibration. Recently, numerous end-to-end calibration methods have been proposed; however, most predict extrinsic parameters in a single step and lack iterative optimization capabilities. To address the increasing demand for higher accuracy, we propose a versatile iterative framework based on surrogate diffusion. This framework can enhance the performance of any calibration method without requiring architectural modifications. Specifically, the initial extrinsic parameters undergo iterative refinement through a denoising process, in which the original calibration method serves as a surrogate denoiser to estimate the final extrinsics at each step. For comparative analysis, we selected four state-of-the-art calibration methods as surrogate denoisers and compared the results of our diffusion process with those of two other iterative approaches. Extensive experiments demonstrate that when integrated with our diffusion model, all calibration methods achieve higher accuracy, improved robustness, and greater stability compared to other iterative techniques and their single-step counterparts.

Abstract:
Microrobots have garnered significant attention due to their vast potential applications across various fields. Among various types of microrobots, piezoelectric robots stand out due to their exceptional motion accuracy, low power consumption, and simple structural design. This work introduces a novel piezoelectric microrobot, the Dual-Modal Piezoelectric Robot (DMPBot), which is fabricated with an innovative carbon fiber substrate through a heat-pressing process with a compact size of 6 mm × 9 mm × 1.1 mm and a weight of only 0.05 g. DMPBot can achieve both high-speed and high-precision motion in non-resonant mode, as well as omnidirectional movement by integrating non-resonant and resonant modes. In non-resonant mode, the robot can reach a speed of 33 mm/s (3.67 body lengths per second) and a sub-micron resolution of 0.4 μm by adjusting the applied signal. This work presents an analysis of the design, fabrication, and performance of DMPBot, focusing on its dynamic response, motion mechanisms, high-speed and high-precision motion, and omnidirectional movement capabilities. Experimental results validate the ability of DMPBot to perform high-speed, high-precision, and omnidirectional motion, demonstrating its promising potential in the field of micromanipulation.

Abstract:
Traditional single-cell mechanical characterization techniques (e.g., atomic force microscopy) often face limitations in throughput, require invasive labeling, or fail to replicate physiological microenvironments, impeding their clinical utility for rapid cancer cell analysis. To address these limitations for automated characterization of cellular mechanical properties, this study proposes a novel method using microchannels with narrow geometric structures to measure cellular mechanical characteristics. A dynamic mechanical characterization technique with serially connected microchannels simulates malignant tumor cell deformation and migration in vivo, enabling precise identification of three malignant tumor cell lines and three normal cell lines through consecutive compressions. High-speed imaging combined with computer vision and image processing techniques facilitates rapid and accurate automated analysis for tumor cells. Furthermore, this study reveals that the mechanical properties of the cell nucleus determine the overall cellular mechanics, with the differences between tumor and normal cells attributed to variations in nucleus mechanics. This approach shows promise for early cancer diagnosis.

Abstract:
Flapping-wing robots, which mimic the natural flight of birds or insects, exhibit numerous advantages in flight performance. However, autonomous takeoff remains a significant challenge for large-sized bird-like flapping-wing robots. To address this challenge, we design a jumping mechanism based on a bow-like carbon fiber spring. This mechanism is capable of repeated self-compression and release and can be readily integrated into flapping-wing robots to assist their jumping takeoff. We then conduct a mechanical analysis of the motion states during the jumping takeoff process. Additionally, we propose a jump-flapping coupling control method based on sensor data to ensure a seamless transition and smooth coordination between the jumping and flapping actions, thus enabling a smooth takeoff. Experimental results validate the effectiveness of the proposed jumping mechanism and its collaborative control strategy. This study provides support for advancing flapping-wing robots toward autonomous multi-modal locomotion and further deepens research in this field.

Abstract:
Micromanipulation techniques struggle to achieve three-dimensional rotational control at the microscale without compromising biocompatibility or spatial flexibility. Conventional methods based on mechanical contact, optical forces, or confined microfluidics constrain dynamic reconfiguration and surgical accessibility. Here, we introduce a dual-bubble acoustic micromanipulator that enables multidirectional rotation through controlled hydrodynamic fields. By placing oscillating microbubbles at the tips of micropipettes, this system creates adjustable vortex patterns: a single microbubble generates toroidal flows for out-of-plane rotation, while two microbubbles produce shear forces for in-plane spinning. This approach uses simple mechanical adjustments to control rotational axes in open fluid environments, without needing frequency modulation or phase synchronization. Flow-field simulations and experiments with polystyrene microspheres confirm deterministic orientation control, and tests with shrimp embryos demonstrate rotation at clinically relevant speeds. The open architecture integrates seamlessly with standard microscopy and robotic injection systems, offering a non-contact, precise tool for applications such as polar body alignment, intracellular surgery, and 3-D imaging.

Abstract:
Fueled by the rapid evolution of robotics, the demand for intelligent and lightweight robotic systems continues to grow across industries. However, conventional designs often separate sensing and actuation, resulting in structural complexity and diminished reliability. While integrated sensor-actuator systems offer a promising solution, they face significant challenges in manufacturing and scalability. Liquid crystal elastomers (LCEs) are widely utilized in actuators for their thermally responsive deformation and programmability, while Neodymium-Iron-Boron (NdFeB) nanoparticles provide exceptional magnetic properties for sensing. This paper introduces a novel Self-Sensing LCE (SS-LCE) actuator, seamlessly combining LCE and NdFeB to enable simultaneous actuation and self-sensing capabilities. Under thermal stimulation, the actuator executes complex motions while delivering real-time feedback through magnetic field variations. Its programmability and adaptable fabrication process support diverse motion modes, unlocking broad application potential. By enhancing integration, reliability, and flexibility, this self-sensing actuator represents a pivotal advancement in the development of lightweight, intelligent robotic systems with significant research and industrial implications.

Abstract:
This paper presents the design of a reconfigurable manipulator capable of performing Schönflies and Remote Center of Motion (RCM) operations. The Schönflies mode handles planar objects efficiently, like a SCARA robot, while the RCM mode enables remote center operation, similar to a Da Vinci robot. Through kinematic reconfiguration, the manipulator achieves multimodal operations without component replacement, and the 1R1T module based on a spline lead screw mechanism ensures a compact structure. The kinematic reconfiguration is analyzed, and the mapping rules from joint space to Cartesian space are established for each operation mode. The concise kinematic expression without actuation redundancy simplifies control in both modes. Experimental results confirm the manipulator’s versatility in planar object handling and remote center operations, highlighting its effectiveness across diverse applications.

Abstract:
Soft eversion robots have demonstrated significant advantages in navigating within confined spaces with minimal friction, making them promising candidates for various intraluminal applications in medical, industrial, and exploratory domains. While this type of growing robot enables frictionless movement within hollow structures, no existing soft robotic actuation mechanism can grow along the outer surface of an environment without generating friction. A robot with these capabilities could open new possibilities, such as endoscopic vein harvesting for coronary artery bypass graft surgery. This paper introduces a novel growing robotic manipulator based on an outside-in material feeding mechanism: the inversion robot. Unlike conventional eversion robots, which expand by feeding material from the inside out, the inversion robot draws material from the outside to the inside, encapsulating its external environment within an inner sleeve to achieve frictionless movement. We present the design, implementation, and experimental validation of this inversion robot, investigating its growth behavior under varying pressure values and with different diameters, its ability to navigate along defined trajectories, and its operation with tools mounted to its tip. This inversion robot could enable vein dissection while preserving the surrounding fat layer, making it a promising innovation for minimally invasive vascular surgery and beyond.

Abstract:
Cooperative transport, the simultaneous movement of an object by multiple agents, has been widely observed in biological systems such as ant colonies, where it improves efficiency and adaptability in dynamic environments. Inspired by these natural phenomena, we present a novel acoustic robotic system for the contactless transport of objects in mid-air. Our system utilizes onboard phased ultrasonic transducers and a robotic control system to generate localized acoustic pressure fields, enabling the precise manipulation of airborne particles and robots. We categorize contactless object-transport strategies into independent transport (uncoordinated) and forward-facing cooperative transport (coordinated), drawing parallels with biological systems to optimize efficiency and robustness. The proposed system is experimentally validated by evaluating levitation stability using a microphone in the measurements lab, transport efficiency through a phase-space motion capture system, and clock synchronization accuracy using an oscilloscope. The results demonstrate the feasibility of both independent and cooperative airborne object transport. This research contributes to the field of acoustophoretic robotics, with potential applications in contactless material handling, microassembly, and biomedicine.

Abstract:
Search and rescue (SAR) operations require optimal solutions to maximize efficiency and minimize risks for human responders. Quadruped robots have emerged as viable agents due to their locomotion and agility capabilities. This work presents a navigation framework designed to enhance stability and adaptability of legged robots in complex environments. A high-fidelity simulation environment was developed using NVIDIA IsaacSim, incorporating realistic disaster conditions such as unstructured terrain, fire, and smoke. Multiple stochastic routes were generated and analyzed in simulation in terms of energy consumption, stability, traversal time, and fall occurrences, to categorize them and determine the safest and most efficient path before real-world deployment, reducing the likelihood of failures. A fuzzy logic controller was proposed to regulate speed and improve locomotion adaptability. The proposed approach was validated in both simulation and real-world experiments. The results demonstrate the effectiveness of the proposed strategy in enabling safe and efficient navigation for quadruped robots in SAR missions. The code is publicly available at Robcib-GIT/IsaacSim_LeggedRobot. Supplementary Video: https://youtu.be/cxEP5YVy8Qo
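
As a rough illustration of how a fuzzy logic controller can regulate speed, the sketch below maps two normalized inputs to a speed command with a zero-order Sugeno rule base. The inputs (`roughness`, `tilt`), the membership shapes, the rule table, and the speed levels are all assumptions made for this example; the abstract does not specify the controller's actual inputs or rules.

```python
def tri(x, a, b, c):
    """Triangular membership with feet at a and c and peak at b."""
    if x <= a or x >= c:
        return 0.0
    if x <= b:
        return (x - a) / (b - a)
    return (c - x) / (c - b)


def memberships(x):
    """Degrees of membership in (low, medium, high) for x clamped to [0, 1]."""
    x = min(max(x, 0.0), 1.0)
    return [tri(x, -0.5, 0.0, 0.5), tri(x, 0.0, 0.5, 1.0), tri(x, 0.5, 1.0, 1.5)]


# Hypothetical output levels in m/s for "fast", "moderate", "slow".
SPEED_LEVELS = [1.0, 0.6, 0.2]


def fuzzy_speed(roughness, tilt):
    """Weighted-average (zero-order Sugeno) speed command.

    Assumed rule table: the worse of the two inputs decides the speed level,
    and each rule fires with strength min(mu_roughness, mu_tilt).
    """
    mu_r = memberships(roughness)
    mu_t = memberships(tilt)
    num = 0.0
    den = 0.0
    for i, wr in enumerate(mu_r):
        for j, wt in enumerate(mu_t):
            w = min(wr, wt)
            num += w * SPEED_LEVELS[max(i, j)]
            den += w
    return num / den


if __name__ == "__main__":
    for r, t in [(0.0, 0.0), (0.25, 0.0), (0.5, 0.0), (0.0, 1.0)]:
        print(r, t, round(fuzzy_speed(r, t), 3))
```

Because neighbouring membership functions overlap, the commanded speed changes smoothly between levels rather than switching abruptly, which is the usual motivation for choosing a fuzzy controller over hard thresholds.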

Abstract:
Human-robot shared visual servoing systems can combine the precise control ability of the robot and the human decision-making ability. However, integrating human input into such systems remains a challenging endeavor. This work studies the robot visual servoing control system in the human-robot shared environment and proposes a human-robot shared visual servoing framework based on game theory. Game theory is used to model the relationship between humans and robots. According to the observation of human input, the human intention is adaptively estimated using a radial basis function neural network (RBFNN), and the robot control objective is dynamically adjusted to realize human-robot coordination. The Lyapunov theory is used to prove the stability of the system. Experiments are conducted to verify the effectiveness of the proposed method.

Abstract:
Forecasting how human hands would move around target objects on egocentric videos can provide prior knowledge to enhance the path planning capabilities of service robots and assistive wearable devices. During the hand-object interaction process, head movements always occur concurrently to provide observations for the interaction scene from different egocentric views. Although some prior works have successfully integrated head motion information into hand trajectory prediction (HTP), they basically overlook the multi-view consistency (MVC) inherent in headset camera egomotion. We argue that multi-view consistency reveals geometric and semantic relationships during hand-object interaction, and can be regarded as additional supervision signals for predicting more realistic hand trajectories. Therefore, in this work, we propose a novel learning scheme dubbed EER to improve diffusion-based 2D hand trajectory prediction methods, which involves exploiting the geometric consistency, enhancing the multi-canvas consistency, and reconstructing the semantic consistency inherent in MVC. The experimental results show that our proposed EER scheme significantly improves the prediction accuracy of existing diffusion-based 2D HTP methods on the publicly available datasets. We will release the code as open-source at https://github.com/IRMVLab/EER-HTP.

Abstract:
Morphological symmetry is a fundamental characteristic of legged animals and robots. Most existing Deep Reinforcement Learning approaches for legged locomotion neglect to exploit this inherent symmetry, often producing unnatural and suboptimal behaviors such as dominant legs or non-periodic gaits. To address this limitation, we propose a novel learning-based framework to systematically optimize symmetry by state distribution symmetrization. First, we introduce the degree of asymmetry (DoA), a quantitative metric that measures the discrepancy between original and mirrored state distributions. Second, we develop an efficient computation method for DoA using gradient ascent with a trained discriminator network. This metric is then incorporated into a reinforcement learning framework by introducing it to the reward function, explicitly encouraging symmetry during policy training. We validate our framework with extensive experiments on quadrupedal and humanoid robots in simulated and real-world environments. Results demonstrate the efficacy of our approach for improving policy symmetry and overall locomotion performance.

Abstract:
In this study, in order to address the robotic auditory perception problem, we propose a novel framework for object material recognition of common containers, which combines deep learning with active auditory perception to achieve breakthrough results. We developed a modular robotic system for acoustic data acquisition that employs a hybrid mechanism of vertical translation and horizontal rotation that is capable of performing full-scale tapping in three dimensions. The system is capable of creating an acoustic dataset consisting of 50 containers made of five materials, which improves the data acquisition efficiency by 93.9% compared to manual operations. In addition, we propose an end-to-end transfer learning model, TBAP, which is trained on a crawler-generated pre-training dataset and 50 real scene samples, and achieves a recognition accuracy of 91.0% for unseen materials. To improve reliability, we design a dynamic confidence assessment mechanism that generates confidence indices through probability distribution analysis and feature stability assessment to support robust robot decision-making. Experimental results show that the framework greatly improves data acquisition efficiency while maintaining high recognition accuracy, providing a valuable tool for advancing acoustic perception research.

Abstract:
The deployment of autonomous robots in human environments requires an understanding of social interactions and the factors that influence them. Human-robot proxemics is an important factor that impacts interactions, and modeling personalized proxemic behavior has always been a challenge, as it depends on multiple user attributes, including gender, age, and height. In this paper, we propose a novel approach that uses rubber-sheet transformation models to represent human-robot proxemics. We do this by collecting human-robot interpersonal distance data from 20 users and modeling it with respect to their height, age, gender, and the angle at which the robot approaches. We evaluate the model, and the results show a promising approximation of proxemic distances based on different user attributes. Finally, we provide a coefficient table for rubber-sheet models to lay the foundation for personalized human-robot proxemics and outline future research directions.

Abstract:
As mobile robots become increasingly common in human-centric environments, social navigation, which means adhering to unwritten social norms rather than merely avoiding pedestrians, has drawn growing attention. Existing methods, from hand-crafted techniques to learning-based approaches, often overlook the nuanced context and scene understanding that humans naturally exhibit. Inspired by studies indicating the critical role of language in cognition and reasoning, we propose a new approach to bridge robot perception and socially aware actions through human-like language reasoning. We introduce Social robot Navigation via Explainable Interactions (SNEI), a human-annotated vision-language dataset comprising over 40K Visual Question Answering (VQA) pairs across 2K unique social scenarios, drawn from diverse, unstructured public spaces. SNEI contains perception, prediction, chain-of-thought reasoning, action, and explanation, thereby allowing robots to interpret social contexts in human language. We fine-tune a Vision-Language Model, Social-LLaVA, on SNEI to demonstrate the potential of language-guided reasoning for high-level navigation tasks. Experimental evaluations, both quantitative and qualitative, demonstrate that Social-LLaVA can outperform state-of-the-art models.

Abstract:
Prior knowledge is abundant in the automation industry, especially for tasks like pick-and-place, where simple programmatic demonstrations with online generation ability can be acquired easily. How to learn a policy faster, with higher flexibility and generalization ability, from these demonstrations remains an open question. End-to-end target learning and imitation learning are widely discussed in previous works. Here, we focus on the online generation ability of the demonstration and propose a demo injection method based on actor-critic off-policy reinforcement learning (RL) for the interaction and policy optimization phase. We conduct experiments and an ablation study based on four research questions around a two-finger reactive grasping task with a Panda robot. The results show that our proposed injection method increases training stability, strongly reduces the time to convergence, and benefits sim-2-real transfer with smooth motion.

Abstract:
This paper addresses the challenge of assigning heterogeneous sensors (i.e., robots with varying sensing capabilities) for multi-target tracking. We classify robots into two categories: (1) sufficient sensing robots, equipped with range and bearing sensors, capable of independently tracking targets, and (2) limited sensing robots, which are equipped with only range or bearing sensors and need to at least form a pair to collaboratively track a target. Our objective is to optimize tracking quality by minimizing uncertainty in target state estimation through efficient robot-to-target assignment. By leveraging matroid theory, we propose a greedy assignment algorithm that dynamically allocates robots to targets to maximize tracking quality. The algorithm guarantees constant-factor approximation bounds of 1/3 for arbitrary tracking quality functions and 1/2 for submodular functions, while maintaining polynomial-time complexity. Extensive simulations demonstrate the algorithm’s effectiveness in accurately estimating and tracking targets over extended periods. Furthermore, numerical results confirm that the algorithm’s performance is close to that of the optimal assignment, highlighting its robustness and practical applicability.
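
To make the idea of greedy robot-to-target assignment concrete, here is a minimal sketch under stated assumptions: each robot may be assigned to at most one target (a partition-matroid constraint), and the tracking quality is replaced by a toy submodular function. The names `greedy_assign` and `make_toy_quality`, and the per-pair scores, are hypothetical; the sketch does not model the paper's tracking metric or the pairing requirement for limited sensing robots.

```python
from typing import Callable, Dict, FrozenSet, Iterable, Optional, Set, Tuple

Pair = Tuple[str, str]  # (robot, target)


def greedy_assign(
    robots: Iterable[str],
    targets: Iterable[str],
    quality: Callable[[FrozenSet[Pair]], float],
) -> Set[Pair]:
    """Repeatedly add the (robot, target) pair with the largest marginal gain.

    Each robot is used at most once; the loop stops when no pair improves
    the quality or when every robot has been assigned.
    """
    targets = list(targets)
    chosen: Set[Pair] = set()
    free = sorted(robots)
    current = quality(frozenset(chosen))
    while free:
        best_gain = 0.0
        best_pair: Optional[Pair] = None
        for r in free:
            for t in targets:
                gain = quality(frozenset(chosen | {(r, t)})) - current
                if gain > best_gain:
                    best_gain, best_pair = gain, (r, t)
        if best_pair is None:
            break
        chosen.add(best_pair)
        free.remove(best_pair[0])
        current = quality(frozenset(chosen))
    return chosen


def make_toy_quality(p: Dict[Pair, float]) -> Callable[[FrozenSet[Pair]], float]:
    """Toy submodular quality: per target, 1 - product of (1 - p) over its robots.

    p[(robot, target)] is an assumed score in [0, 1], not a quantity from the paper.
    """
    def quality(pairs: FrozenSet[Pair]) -> float:
        miss: Dict[str, float] = {}
        for r, t in pairs:
            miss[t] = miss.get(t, 1.0) * (1.0 - p[(r, t)])
        return sum(1.0 - m for m in miss.values())
    return quality


if __name__ == "__main__":
    scores = {("a", "t1"): 0.9, ("a", "t2"): 0.5, ("b", "t1"): 0.8, ("b", "t2"): 0.6}
    print(sorted(greedy_assign(["a", "b"], ["t1", "t2"], make_toy_quality(scores))))
```

In the toy run, robot `a` takes target `t1` first, after which sending `b` to the uncovered target `t2` yields a larger marginal gain than doubling up on `t1`. This diminishing-returns behaviour is what submodularity captures.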

Abstract:
This study presents a methodology to safely manipulate branches to aid various agricultural tasks. Humans in a real agricultural environment often manipulate branches to perform agricultural tasks effectively, but current agricultural robots lack this capability. This proposed strategy to manipulate branches can aid in different precision agriculture tasks, such as fruit picking in dense foliage, pollinating flowers under occlusion, and moving overhanging vines and branches for navigation. The proposed method modifies RRT to plan a path that satisfies the branch geometric constraints and obeys branch deformable characteristics. Re-planning is done to obtain a path that helps the robot exert force within a desired range so that branches are not damaged during manipulation. Experimentally, this method achieved a success rate of 78% across 50 trials, successfully moving a branch from different starting points to a target region.

Abstract:
Multi-step cloth manipulation is a challenging problem for robots due to the high-dimensional state spaces and the dynamics of cloth. Despite recent significant advances in end-to-end imitation learning for multi-step cloth manipulation skills, these methods fail to generalize to unseen tasks. Our insight in tackling the challenge of generalizable multi-step cloth manipulation is decomposition. We propose a novel pipeline that autonomously learns basic skills from long demonstrations and composes learned basic skills to generalize to unseen tasks. Specifically, our method first discovers and learns basic skills from the existing long demonstration benchmark with the commonsense knowledge of a large language model (LLM). Then, leveraging a high-level LLM-based task planner, these basic skills can be composed to complete unseen tasks. Experimental results demonstrate that our method outperforms baseline methods in learning multi-step cloth manipulation skills for both seen and unseen tasks. Project website: https://sites.google.com/view/gen-cloth

Abstract:
Underwater simulators offer support for building robust underwater perception solutions. Significant work has recently been done to develop new simulators and to advance the performance of existing underwater simulators. Still, there remains room for improvement on physics-based underwater sensor modeling and rendering efficiency. In this paper, we propose OceanSim, a high-fidelity GPU-accelerated underwater simulator to address this research gap. We propose advanced physics-based rendering techniques to reduce the sim-to-real gap for underwater image simulation. We develop OceanSim to fully leverage the computing advantages of GPUs and achieve real-time imaging sonar rendering and fast synthetic data generation. We evaluate the capabilities and realism of OceanSim using real-world data to provide qualitative and quantitative results. The code and detailed documentation are made available on the project website to support the marine robotics community: https://umfieldrobotics.github.io/OceanSim.

Abstract:
Humanoid robots, designed to operate in human-centric environments, serve as a fundamental platform for a broad range of tasks. Although humanoid robots have been extensively studied for decades, a majority of existing humanoid robots still heavily rely on complex modular frameworks, leading to inflexibility and potential compounded errors from independent sensing, planning, and acting components. In response, we propose an end-to-end humanoid sense-plan-act walking system, enabling vision-based obstacle avoidance and footstep planning for whole body balancing simultaneously. We designed two imperative learning (IL)-based bilevel optimizations for model-predictive step planning and whole body balancing, respectively, to achieve self-supervised learning for humanoid robot walking. This enables the robot to learn from arbitrary unlabeled data, improving its adaptability and generalization capabilities. We refer to our method as iWalker and demonstrate its effectiveness in both simulated and real-world environments, representing a significant advancement toward autonomous humanoid robots.

Abstract:
Autonomous drone racing presents a challenging control problem, requiring real-time decision-making and robust handling of nonlinear system dynamics. While iterative learning model predictive control (LMPC) offers a promising framework for iterative performance improvement, its direct application to drone racing faces challenges like real-time compatibility or the trade-off between time-optimal and safe traversal. In this paper, we enhance LMPC with three key innovations: (1) an adaptive cost function that dynamically weights time-optimal tracking against centerline adherence, (2) a shifted local safe set to prevent excessive shortcutting and enable more robust iterative updates, and (3) a Cartesian-based formulation that accommodates safety constraints without the singularities or integration errors associated with Frenet-frame transformations. Results from extensive simulation and real-world experiments demonstrate that our improved algorithm can optimize initial trajectories generated by a wide range of controllers with varying levels of tuning for a maximum improvement in lap time by 60.85%. Even applied to the most aggressively tuned state-of-the-art model-based controller, MPCC++, on a real drone, a 6.05% improvement is still achieved. Overall, the proposed method pushes the drone toward faster traversal and avoids collisions in simulation and real-world experiments, making it a practical solution to improve the peak performance of drone racing.

Abstract:
Quadrupedal animals can perform agile and playful tasks while interacting with real-world objects. For instance, a trained dog can track and catch a flying frisbee before it touches the ground, while a cat left alone at home may leap to grasp the door handle. Successfully grasping an object during high-dynamic locomotion requires highly precise perception and control. However, due to hardware limitations, agility and precision are usually a trade-off in robotics problems. In this work, we employ a perception-control decoupled system based on Reinforcement Learning (RL), aiming to explore the level of precision a quadrupedal robot can achieve while interacting with objects during high-dynamic locomotion. Our experiments show that our quadrupedal robot, mounted with a passive gripper in front of the robot’s chassis, can perform both tracking and catching tasks similar to a real trained dog. The robot can follow a mid-air ball moving at speeds of up to 3 m/s, and it can leap and successfully catch a small object hanging above it at a height of 1.05 m in simulation and 0.8 m in the real world.

Abstract:
Our goal is to enable social robots to interact autonomously with humans in a realistic, engaging, and expressive manner. The 12 Principles of Animation are a well-established framework animators use to create movements that make characters appear convincing, dynamic, and emotionally expressive. This paper proposes a novel approach that leverages Dynamic Movement Primitives (DMPs) to implement key animation principles, providing a learnable, explainable, modulable, online adaptable and composable model for automatic expressive motion generation. DMPs, originally developed for general imitation learning in robotics and grounded in a spring-damper system design, offer mathematical properties that make them particularly suitable for this task. Specifically, they enable modulation of the intensities of individual principles and facilitate the decomposition of complex, expressive motion sequences into learnable and parametrizable primitives. We present the mathematical formulation of the parameterized animation principles and demonstrate the effectiveness of our framework through experiments and application on three robotic platforms with different kinematic configurations, in simulation, on actual robots and in a user study. Our results show that the approach allows for creating diverse and nuanced expressions using a single base model.
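
The spring-damper structure of a DMP can be sketched with the standard one-dimensional discrete formulation: a critically damped system pulls the state toward the goal, while a phase-gated forcing term shapes the path. The code below is an illustrative sketch of that textbook form, not the paper's parameterization of the animation principles. The `gain` argument is a hypothetical knob standing in for intensity modulation, and all numeric defaults are assumptions.

```python
import math


def rollout_dmp(y0, g, weights, tau=1.0, dt=0.001, duration=2.0,
                alpha_z=25.0, alpha_x=4.0, gain=1.0):
    """Integrate a 1-D discrete DMP with explicit Euler steps.

    tau * dz/dt = alpha_z * (beta_z * (g - y) - z) + f(x)
    tau * dy/dt = z
    tau * dx/dt = -alpha_x * x          (canonical phase, x goes 1 -> 0)
    """
    beta_z = alpha_z / 4.0  # critical damping of the spring-damper system
    n = len(weights)
    # Gaussian basis centres spread along the phase variable x.
    centres = [math.exp(-alpha_x * i / max(n - 1, 1)) for i in range(n)]
    widths = [n ** 1.5 / c for c in centres]  # common width heuristic
    y, z, x = y0, 0.0, 1.0
    trajectory = [y]
    for _ in range(int(round(duration / dt))):
        psi = [math.exp(-h * (x - c) ** 2) for h, c in zip(widths, centres)]
        total = sum(psi)
        forcing = 0.0
        if total > 1e-12:
            shape = sum(p * w for p, w in zip(psi, weights)) / total
            # The forcing term is gated by x, so it vanishes as the phase decays.
            forcing = gain * shape * x * (g - y0)
        dz = (alpha_z * (beta_z * (g - y) - z) + forcing) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory


if __name__ == "__main__":
    plain = rollout_dmp(0.0, 1.0, [])
    shaped = rollout_dmp(0.0, 1.0, [50.0, -30.0, 20.0, 0.0, 10.0], gain=1.5)
    print(round(plain[-1], 4), round(shaped[-1], 4))
```

With an empty weight list the rollout is a plain point-to-point reach. Non-zero weights bend the path on the way, and because the forcing term decays with the phase, the motion still settles near the goal.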

Abstract:
In multi-robot systems (MRS), cooperative localization is a crucial task for enhancing system robustness and scalability, especially in GPS-denied or communication-limited environments. However, adversarial attacks, such as sensor manipulation and communication jamming, pose significant challenges to the performance of traditional localization methods. In this paper, we propose a novel distributed fault-tolerant cooperative localization framework to enhance resilience against sensor and communication disruptions in adversarial environments. We introduce an adaptive event-triggered communication strategy that dynamically adjusts communication thresholds based on real-time sensing and communication quality. This strategy ensures optimal performance even in the presence of sensor degradation or communication failure. Furthermore, we conduct a rigorous analysis of the convergence and stability properties of the proposed algorithm, demonstrating its resilience against bounded adversarial zones and its ability to maintain accurate state estimation. Robotarium-based experiment results show that our proposed algorithm significantly outperforms traditional methods in terms of localization accuracy and communication efficiency, particularly in adversarial settings. Our approach offers improved scalability, reliability, and fault tolerance for MRS, making it suitable for large-scale deployments in real-world, challenging environments.

Abstract:
Online evaluation is increasingly adopted in robotics research, providing an efficient approach to collect data from large and diverse populations. However, there have been ongoing debates about online studies as a proxy for in-person studies, especially where a participant passively observes video of robot behaviours or interaction. We conduct an online video comparison study (N=178) evaluating three robot handover policies in a collaborative assembly task, namely an adaptive autonomous policy, a non-adaptive scripted policy, and teleoperation. Participants watched three sets of videos in third-person view, each consisting of 9 sequential handovers executing one of the policies. Compared to in-person participants in two previous studies who evaluated handovers as users, online participants were observant of different robot behaviours and human-robot collaboration contexts, with 76.4% and 71.9% recognising the adaptive handovers exhibited by the teleoperated and autonomous robot, respectively. However, as observers, online participants showed more critical subjective perceptions compared to the in-person participants with a user’s perspective. They valued efficiency over adaptation, rating twice as many autonomous handovers as being too late compared to scripted handovers. Our work highlights the need to consider user contexts when evaluating human-robot collaboration.

Abstract:
The pursuit-evasion (PE) problem is a critical challenge in multi-robot systems (MRS). While reinforcement learning (RL) has shown promise in addressing PE tasks, research has primarily focused on single-target pursuit, with limited exploration of multi-target encirclement, particularly in large-scale settings. This paper proposes a Transformer-Enhanced Reinforcement Learning (TERL) framework for large-scale multi-target encirclement. By integrating a transformer-based policy network with target selection, TERL enables robots to adaptively prioritize targets and coordinate safely with one another. Results show that TERL outperforms existing RL-based methods in terms of encirclement success rate and task completion time, while maintaining good performance in large-scale scenarios. Notably, TERL, trained on small-scale scenarios (15 pursuers, 4 targets), generalizes effectively to large-scale settings (80 pursuers, 20 targets) without retraining, achieving a 100% success rate. The code and demonstration video are available at https://github.com/ApricityZ/TERL.

Abstract:
The task of co-optimizing the body and behaviour of agents has been a long-standing problem in the fields of evolutionary robotics and embodied AI. Previous work has largely focused on the development of learning methods exploiting massive parallelization of agent evaluations with large population sizes, a paradigm which is applicable to simulated agents but cannot be transferred to the real world due to the costs associated with the production of embodiments and robots. Furthermore, recent data-efficient approaches utilizing reinforcement learning can suffer from distributional shifts in transition dynamics as well as in state and action spaces when experiencing new body morphologies. In this work, we propose a new co-adaptation method combining reinforcement learning and State-Aligned Self-Imitation Learning to co-design embodiment and behavioural policies within a handful of design iterations. We show that the integration of a self-imitation signal improves the data-efficiency of the co-adaptation process as well as the behavioural recovery when adapting morphological parameters.

Abstract:
In the oil and gas industry, scale accumulation on radiant coils within furnaces significantly reduces heat-transfer efficiency, leading to increased energy consumption. This paper introduces the REFINE-bot, a robotic system developed to improve the descaling process and operational efficiency in fired heaters. Unlike existing solutions, which are mainly designed for specific tube sizes and positions and focused on inspection, the REFINE-bot integrates an adaptable clamping mechanism that adapts to both vertical and horizontal tubes of varying diameters (3"–8"), even in complex environments with narrow tube-to-tube and wall-to-tube gaps. An adaptive force control is also developed to adjust the position of the cleaning tool online relative to the tube surface to address uneven scale heights. We evaluated three different cleaning tools (a Knot End Brush, a Wire Cup Brush, and Sandpaper) under simulated hard scale conditions in a lab environment. This evaluation revealed the cleaning tools’ limitations and helped to identify optimal safety parameters to prevent tube damage. The results showed that the 1-inch Wire Cup Brush, removing 431.1 μm of scale, achieved the highest descaling rate among the tested tools. The robot was successfully deployed in a real furnace setting to test its clamping and cleaning mechanisms on the actual scale. The real-world results demonstrated superior cleaning performance on the radiant coils of a furnace compared to traditional manual descaling methods, as evaluated by measured reductions in scale thickness and infrared thermal imaging. Furthermore, ultrasonic thickness measurements (UTM) were performed and indicated that there was no significant loss in wall thickness after the on-site experiments.

Abstract:
In the field of robotics, researchers face a critical challenge in ensuring reliable and efficient task planning. Verifying high-level task plans before execution significantly reduces errors and enhances the overall performance of these systems. In this paper, we propose an architecture for automatically verifying high-level task plans before their execution in simulator or real-world environments. Leveraging Large Language Models (LLMs), our approach consists of two key steps: first, the conversion of natural language instructions into Linear Temporal Logic (LTL), followed by a comprehensive analysis of action sequences. The module uses the reasoning capabilities of the LLM to evaluate logical coherence and identify potential gaps in the plan. Rigorous testing on datasets of varying complexity demonstrates the broad applicability of the module to household tasks. Our work contributes to improving the reliability and efficiency of task planning and addresses the critical need for robust pre-execution verification in autonomous systems. The project page is available at https://verifyllm.github.io.

Abstract:
Vision-language navigation (VLN) is a pivotal area within embodied intelligence, where agents must navigate based on natural language instructions. While traditional VLN research has focused on enhancing environmental comprehension and decision-making policy, these methods often reveal substantial performance gaps when agents are deployed in novel environments. This issue primarily arises from the lack of diverse training data. Expanding datasets to encompass a broader range of environments is impractical and costly. To address this challenge, we propose Vision-Language Navigation with Continuous Learning (VLNCL), a framework that allows agents to learn from new environments while preserving previous knowledge incrementally. We introduce a novel dual-loop scenario replay method (Dual-SR) inspired by brain memory mechanisms integrated with VLN agents. This approach helps consolidate past experiences and improves generalization across novel tasks. As a result, the agent exhibits enhanced adaptability to new environments and mitigates catastrophic forgetting. Our experiment demonstrates that VLN agents with Dual-SR effectively resist forgetting and adapt to unfamiliar environments. Combining VLN with continual learning significantly boosts the performance of otherwise average models, achieving SOTA results.

Abstract:
This paper presents a framework for the real-time initialization of unknown Ultra-Wideband (UWB) anchors in UWB-aided navigation systems. The method is designed for localization solutions where UWB modules act as supplementary sensors. Our approach enables the automatic detection and calibration of previously unknown anchors during operation, removing the need for manual setup. By combining an online Positional Dilution of Precision (PDOP) estimation, a lightweight outlier detection method, and an adaptive robust kernel for non-linear optimization, our approach significantly improves robustness and suitability for real-world applications compared to state-of-the-art. In particular, we show that our metric which triggers an initialization decision is more conservative than current ones commonly based on initial linear or non-linear initialization guesses. This allows for better initialization geometry and subsequently lower initialization errors. We demonstrate the proposed approach on two different mobile robots: an autonomous forklift and a quadcopter equipped with a UWB-aided Visual-Inertial Odometry (VIO) framework. The results highlight the effectiveness of the proposed method with robust initialization and low positioning error. We open-source our code in a C++ library including a ROS wrapper.
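To make the PDOP criterion mentioned above concrete, here is a minimal sketch of the textbook Positional Dilution of Precision for a single unknown anchor observed by range measurements from several robot positions. It is not the paper's implementation: the function name `pdop`, the pure range-only geometry, and the rank-deficiency threshold are assumptions made for illustration. A low PDOP indicates that the robot positions surround the anchor well enough for a well-conditioned initialization, while collinear positions give an unbounded value.

```python
import math


def pdop(anchor, robot_positions):
    """Textbook PDOP of a 3D anchor estimate from ranges taken at robot_positions.

    Hypothetical helper for illustration; not taken from the paper's library.
    """
    # Accumulate the 3x3 normal matrix A = H^T H, where each row of H is the
    # unit line-of-sight vector from one robot position to the anchor.
    a = [[0.0] * 3 for _ in range(3)]
    for p in robot_positions:
        d = [anchor[k] - p[k] for k in range(3)]
        norm = math.sqrt(sum(c * c for c in d))
        if norm == 0.0:
            continue  # a measurement taken exactly at the anchor carries no direction
        u = [c / norm for c in d]
        for i in range(3):
            for j in range(3):
                a[i][j] += u[i] * u[j]

    det = (a[0][0] * (a[1][1] * a[2][2] - a[1][2] * a[2][1])
           - a[0][1] * (a[1][0] * a[2][2] - a[1][2] * a[2][0])
           + a[0][2] * (a[1][0] * a[2][1] - a[1][1] * a[2][0]))
    if abs(det) < 1e-12:  # assumed threshold for degenerate geometry
        return math.inf

    # trace(A^-1) equals the sum of the principal 2x2 minors divided by det(A).
    minors = ((a[1][1] * a[2][2] - a[1][2] * a[2][1])
              + (a[0][0] * a[2][2] - a[0][2] * a[2][0])
              + (a[0][0] * a[1][1] - a[0][1] * a[1][0]))
    return math.sqrt(minors / det)


# Three mutually orthogonal viewing directions give A = I, so PDOP = sqrt(3).
good = pdop((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0), (0.0, 0.0, 1.0)])
# Collinear robot positions leave two directions unobserved.
bad = pdop((0.0, 0.0, 0.0), [(1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (3.0, 0.0, 0.0)])
```

A trigger based on such a value would postpone initialization until `pdop` drops below a chosen threshold; the threshold itself would need tuning and is not specified here.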

Abstract:
Small and medium-sized enterprises (SMEs) often struggle with automating high-mix, low-volume (HMLV) manufacturing due to the inflexibility and high cost of traditional automation solutions. This paper presents a novel approach to robotic manipulation for HMLV environments that leverages vibrotactile sensing. We propose integrating vibrotactile sensors, which capture subtle vibrations and acoustic signals, to provide real-time feedback during manipulation tasks. This approach enables the robot to detect subtle misalignments, which can assist in refining vision-based policies and improving the robot’s overall manipulation skills. We demonstrate the effectiveness of this method in several representative insertion tasks, showing how vibrotactile feedback can be used to predict success or failure of an insertion task as well as predict initial contact between an object grasped in-hand and the placement location. Our results suggest that vibrotactile sensing offers a promising pathway towards more robust and adaptable robotic systems that can better empower SMEs to embrace automation.

Abstract:
Model-based planners and controllers are commonly used to solve complex manipulation problems as they can efficiently optimize diverse objectives and generalize to long horizon tasks. However, they often fail during deployment due to noisy actuation, partial observability and imperfect models. To enable a robot to recover from such failures, we propose to use hierarchical reinforcement learning to learn a recovery policy. The recovery policy is triggered when a failure is detected based on sensory observations and seeks to take the robot to a state from which it can complete the task using the nominal model-based controllers. Our approach, called RecoveryChaining, uses a hybrid action space, where the model-based controllers are provided as additional nominal options which allows the recovery policy to decide how to recover, when to switch to a nominal controller and which controller to switch to even with sparse rewards. We evaluate our approach in three multi-step manipulation tasks with sparse rewards, where it learns significantly more robust recovery policies than those learned by baselines. We successfully transfer recovery policies learned in simulation to a physical robot to demonstrate the feasibility of sim-to-real transfer with our method.

Abstract:
Bird’s-Eye View (BEV) perception has gained significant attention in autonomous driving and robotics due to its advantages in simplifying modality alignment and feature fusion. Addressing the challenge of jointly optimizing performance and efficiency in 2D-3D view transformation, we identify that, compared to depth information, which is viewpoint-dependent and requires camera intrinsics for estimation, height information can maintain prediction consistency across different camera perspectives. Based on this insight, we propose the HeightAware-BEV framework, which achieves efficient and accurate view transformation through height-aware feature mapping. (1) Building on an efficient projection-based view transformation approach, 3D voxels directly query the height probability distribution predicted by images according to grid height, weighting corresponding features to enable precise and efficient feature projection; (2) a dynamic feature filtering mechanism is designed to filter out task-irrelevant features during the view transformation process. Additionally, a weakly-supervised training strategy is designed to improve model performance in scenarios with limited samples. The HeightAware-BEV (R50@448×800) achieves an IOU of 47.8% on the nuScenes validation set and 60 FPS on 2080Ti, outperforming advanced methods such as SimpleBEV and PointBEV. The code is available at https://github.com/Zhou-Renjie/HeightAware-BEV.

Abstract:
We propose VecNav, a novel approach that trains a monocular navigation model through self-supervision using uncalibrated, human-captured videos. These videos, characterized by unknown camera intrinsics and extrinsics, are readily available from video-sharing platforms (e.g. YouTube) and are referred to as "in-the-wild" videos due to their unregulated capture conditions. Our approach involves estimating ground truth trajectories from these videos using monocular visual odometry. We then train a transformer-based diffusion policy that takes a goal specified by a vector and RGB images as input and generates action predictions. Our method leverages a significantly larger and more diverse dataset compared to existing monocular visual navigation approaches. This diversity holds the potential to develop a generalist navigation model capable of guiding various types of robots in unfamiliar environments. We evaluated our method on a differential drive robot, demonstrating its capability to effectively navigate using solely "in-the-wild" videos for training. Our experiments demonstrate that VecNav successfully learned to act based on visual affordances, relying solely on uncalibrated "in-the-wild" data.

Abstract:
Natural and lifelike locomotion remains a fundamental challenge for humanoid robots to interact with human society. However, previous methods either neglect motion naturalness or rely on unstable and ambiguous style rewards. In this paper, we propose a novel Generative Motion Prior (GMP) that provides fine-grained motion-level supervision for the task of natural humanoid robot locomotion. To leverage natural human motions, we first employ whole-body motion retargeting to effectively transfer them to the robot. Subsequently, we train a generative model offline to predict future natural reference motions for the robot based on a conditional variational auto-encoder. During policy training, the generative motion prior serves as a frozen online motion generator, delivering precise and comprehensive supervision at the trajectory level, including joint angles and keypoint positions. The generative motion prior significantly enhances training stability and improves interpretability by offering detailed and dense guidance throughout the learning process. Experimental results in both simulation and real-world environments demonstrate that our method achieves superior motion naturalness compared to existing approaches. Project page can be found at https://sites.google.com/view/humanoid-gmp

Abstract:
Using quadrics as the object representation has the benefits of both generality and closed-form projection derivation between image and world spaces. Although numerous constraints have been proposed for dual quadric reconstruction, we found that many of them are imprecise and provide minimal improvements to localization. After scrutinizing the existing constraints, we introduce a concise yet more precise convex hull-based algebraic constraint for object landmarks, which is applied to object reconstruction, frontend pose estimation, and backend bundle adjustment. This constraint is designed to fully leverage precise semantic segmentation, effectively mitigating mismatches between complex-shaped object contours and dual quadrics. Experiments on public datasets demonstrate that our approach is applicable to both monocular and RGB-D SLAM and achieves better object mapping and localization than existing quadric SLAM methods. The implementation of our method is available at https://github.com/tiev-tongji/convexhull-based-algebraic-constraint.

Abstract:
Practitioners designing reinforcement learning policies face a fundamental challenge: translating intended behavioral objectives into representative reward functions. This challenge stems from behavioral intent requiring simultaneous achievement of multiple competing objectives, typically addressed through labor-intensive linear reward composition that yields brittle results. Consider the ubiquitous robotics scenario where performance maximization directly conflicts with energy conservation. Such competitive dynamics are resistant to simple linear reward combinations. In this paper, we present the concept of objective fulfillment upon which we build Fulfillment Priority Logic (FPL). FPL allows practitioners to define logical formulae representing their intentions and priorities within multi-objective reinforcement learning. Our novel Balanced Policy Gradient algorithm leverages FPL specifications to achieve up to 500% better sample efficiency compared to Soft Actor Critic. Notably, this work constitutes the first implementation of a non-linear utility scalarization design, intended explicitly for continuous control problems.

Abstract:
This paper presents a novel method for modeling the shape of a continuum robot as a Neural Configuration Signed Distance Function (N-CSDF). By learning separate distance fields for each link and combining them through the kinematic chain, the learned N-CSDF provides an accurate and computationally efficient representation of the robot’s shape. The key advantage of a distance function representation of a continuum robot is that it enables efficient collision checking for motion planning in dynamic and cluttered environments, even with point-cloud observations. We integrate the N-CSDF into a Model Predictive Path Integral (MPPI) controller to generate safe trajectories for multi-segment continuum robots. The proposed approach is validated for continuum robots with various links in several simulated environments with static and dynamic obstacles.

Abstract:
We present a novel autonomous driving framework, DualAD, designed to imitate human reasoning during driving. DualAD comprises two layers: a rule-based motion planner at the bottom layer that handles routine driving tasks requiring minimal reasoning, and an upper layer featuring a rule-based text encoder that converts driving scenarios from absolute states into text description. This text is then processed by a large language model (LLM) to make driving decisions. The upper layer intervenes in the bottom layer’s decisions when potential danger is detected, mimicking human reasoning in critical situations. Closed-loop experiments demonstrate that DualAD, using a zero-shot pre-trained model, significantly outperforms both rule-based and learning-based motion planners when interacting with reactive agents. Our experiments also highlight the effectiveness of the text encoder, which considerably enhances the model’s scenario understanding. Additionally, the integrated DualAD model improves with stronger LLMs, indicating the framework’s potential for further enhancement. Code and benchmarks are available at github.com/TUM-AVS/DualAD.

Abstract:
Humanoid robots and mixed reality headsets benefit from the use of head-mounted sensors for tracking. While advancements in visual-inertial odometry (VIO) and simultaneous localization and mapping (SLAM) have produced new and high-quality state-of-the-art tracking systems, we show that these are still unable to gracefully handle many of the challenging settings presented in the head-mounted use cases. Common scenarios like high-intensity motions, dynamic occlusions, long tracking sessions, low-textured areas, adverse lighting conditions, saturation of sensors, to name a few, continue to be covered poorly by existing datasets in the literature. In this way, systems may inadvertently overlook these essential real-world issues. To address this, we present the Monado SLAM dataset, a set of real sequences taken from multiple virtual reality headsets. We release the dataset under a permissive CC BY 4.0 license, to drive advancements in VIO/SLAM research and development.

Abstract:
Object manipulation skills are necessary for robots operating in various daily-life scenarios, ranging from warehouses to hospitals. They allow the robots to manipulate the given object to their desired arrangement in the cluttered environment. The existing approaches to solving object manipulation tasks either rely on inefficient sampling-based techniques, require expert demonstrations, or learn by trial and error, making them less ideal for practical scenarios. In this paper, we propose a novel, multimodal physics-informed neural network (PINN) for solving object manipulation tasks. Our approach efficiently learns to solve the Eikonal equation without expert data and finds object manipulation trajectories fast in complex, cluttered environments. Our method is multimodal as it also reactively replans the robot’s grasps during manipulation to achieve the desired object poses. We demonstrate our approach in both simulation and real-world scenarios and compare it against state-of-the-art baseline methods. The results indicate that our approach is effective across various objects, has efficient training compared to previous learning-based methods, and demonstrates high performance in planning time, trajectory length, and success rates. Our demonstration videos can be found at https://youtu.be/FaQLkTV9knI.

Abstract:
The automation of hydraulic excavators is significant for enhancing productivity and safety in uncertain and dynamic environments. Achieving autonomous operation requires advanced control strategies capable of handling system constraints, nonlinear hydraulic dynamics, and complex environmental interactions. This study proposes a reinforcement learning (RL)-based methodology to perform a complete excavation cycle by controlling proportional valves. A comprehensive joint simulation tool is developed, in which a hydraulic system model is detailed based on a real machine, and it is integrated with an excavator mechanism and working environment to create a realistic interaction environment for RL training. The RL agent, trained using Proximal Policy Optimization (PPO), incorporates a customized reward shaping method that ensures operational safety and accuracy, considering constraints such as pump flow saturation and geometric constraints. In addition, an Adaptive Control Frequency (ACF) method is developed to enhance training efficiency by dynamically adjusting the control frequency based on task complexity. Comparative validations demonstrate the RL agent’s ability to successfully complete a full excavation cycle, satisfy operational constraints, and generalize across varying initial conditions and valve responses. Furthermore, the controller operates effectively in a soil environment despite being trained without soil, demonstrating robustness to uncertain, time-varying loads.

Abstract:
This paper explores successor features for knowledge transfer in zero-sum, complete-information, and turn-based games. Prior research in single-agent systems has shown that successor features can provide a "jump start" for agents when facing new tasks with varying reward structures. However, knowledge transfer in games typically relies on value and equilibrium transfers, which heavily depend on the similarity between tasks. This reliance can lead to failures when the tasks differ significantly. To address this issue, this paper applies successor features to games and presents a novel algorithm called Game Generalized Policy Improvement (GGPI), designed to address Markov games in multi-agent reinforcement learning. The proposed algorithm enables the transfer of learning values and policies across games. An upper bound of the errors for transfer is derived as a function of the similarity of the task. Through experiments with a turn-based pursuer-evader game, we demonstrate that the GGPI algorithm can generate high-reward interactions and one-shot policy transfer. When further tested in a wider set of initial conditions, the GGPI algorithm achieves higher success rates with improved path efficiency compared to those of the baseline algorithms.
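As background for the "jump start" mentioned above, the following minimal sketch shows standard single-agent generalized policy improvement with successor features: each stored policy contributes successor features psi(s, a), a new task is described by a reward weight vector w, and the agent acts greedily with respect to the best Q-value psi(s, a) · w across all stored policies. It does not implement GGPI, whose game-specific treatment is not described in the abstract; the function names and the toy feature values are assumptions for illustration.

```python
def q_value(psi, w):
    """Q(s, a) for one stored policy: dot product of successor features and reward weights."""
    return sum(p * wi for p, wi in zip(psi, w))


def gpi_action(psi_by_policy, w):
    """Single-agent GPI at one state.

    psi_by_policy: one dict per stored policy, mapping action -> successor
    feature vector at the current state (hypothetical toy data structure).
    Returns the action with the highest Q-value over all stored policies,
    together with that value.
    """
    best_action, best_q = None, float("-inf")
    for psi_of_action in psi_by_policy:
        for action, psi in psi_of_action.items():
            q = q_value(psi, w)
            if q > best_q:
                best_action, best_q = action, q
    return best_action, best_q


# Two stored policies, two actions, two reward features (toy numbers).
stored = [
    {"left": [1.0, 0.0], "right": [0.2, 0.5]},
    {"left": [0.1, 0.3], "right": [0.0, 1.0]},
]
# A new task that rewards only the second feature reuses the second policy's knowledge.
choice_new_task = gpi_action(stored, [0.0, 1.0])
# A task that rewards only the first feature falls back on the first policy.
choice_old_task = gpi_action(stored, [1.0, 0.0])
```

Because only `w` changes between tasks, no retraining is needed to obtain a first policy for the new reward, which is the source of the jump start.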

Abstract:
Accurately decoding human motion intentions from surface electromyography (sEMG) is essential for myoelectric control and has wide applications in rehabilitation robotics and assistive technologies. However, existing sEMG-based motion estimation methods often rely on subject-specific musculoskeletal (MSK) models that are difficult to calibrate, or purely data-driven models that lack physiological consistency. This paper introduces a novel Physics-Embedded Neural Network (PENN) that combines interpretable MSK forward-dynamics with data-driven residual learning, thereby preserving physiological consistency while achieving accurate motion estimation. The PENN employs a recursive temporal structure to propagate historical estimates and a lightweight convolutional neural network for residual correction, leading to robust and temporally coherent estimations. A two-phase training strategy is designed for PENN. Experimental evaluations on six healthy subjects show that PENN outperforms state-of-the-art baseline methods in both root mean square error (RMSE) and R² metrics.

Abstract:
Robotic grasping guided by natural language instructions faces challenges due to ambiguities in object descriptions and the need to interpret complex spatial context. Existing visual grounding methods often rely on datasets that fail to capture these complexities, particularly when object categories are vague or undefined. To address these challenges, we make three key contributions. First, we present an automated dataset generation engine for visual grounding in tabletop grasping, combining procedural scene synthesis with template-based referring expression generation, requiring no manual labeling. Second, we introduce the RefGrasp dataset, featuring diverse indoor environments and linguistically challenging expressions for robotic grasping tasks. Third, we propose a visually grounded dexterous grasping framework with continuous grasp generation, validated through extensive real-world robotic experiments. Our work offers a novel approach for language-guided robotic manipulation, providing both a challenging dataset and an effective grasping framework for real-world applications. Project website: https://refer-and-grasp.github.io.

Abstract:
Currently, most underwater operations are conducted in deep water, and there is usually insufficient illumination in these areas. Under these conditions, the local texture features of some objects are highly similar in images, and it is difficult to distinguish the inter-class boundaries. This typically results in poor performance of current semantic segmentation models for terrestrial images in underwater scenes. Taking advantage of the general characteristic that high-frequency regions are more likely to correspond to semantic segmentation boundaries, we introduce the High-Frequency Divergence Attention Network (HFDNet), a transformer-based semantic segmentation model. HFDNet extracts its frequency distribution by analyzing the frequency domain of the feature map, and then calculates the relative spectral magnitude of each component by comparing its frequency amplitude against the average amplitude within its local neighborhood in the frequency domain. The local frequency map can be incorporated into the attention matrix as a weighting factor to realize the divergence of attention to the surrounding areas, which improves the attention to the high-frequency areas. This operation can enhance the model’s focus on the object boundary region and local neighborhood categories for each component. Therefore, our model can alleviate the problem of determining the object boundary caused by insufficient light in underwater image segmentation, and enhance the ability to segment objects with similar local features under low light conditions. We conduct comprehensive experiments on three underwater segmentation datasets: Caveseg, SUIM and UWS. The results show that our HFDNet achieves state-of-the-art (SOTA) performance on the testing datasets. The source code is available at https://github.com/cv516Buaa/HongboXie/tree/main/HFDNet.
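One possible reading of the relative spectral magnitude described above can be sketched as follows: given a 2D map of frequency amplitudes, each component is divided by the mean amplitude of its local neighborhood, so components that stand out from their surroundings score above 1. This is only an illustration under stated assumptions: the use of a ratio (rather than a difference), the square window that includes the component itself, the truncation of the window at the borders, and the name `relative_magnitude` are all choices made here, not details confirmed by the abstract.

```python
def relative_magnitude(amplitude, radius=1, eps=1e-8):
    """Ratio of each amplitude to the mean amplitude of its local neighborhood.

    amplitude: 2D list of non-negative frequency amplitudes (assumed input).
    radius: half-width of the square neighborhood (assumed hyperparameter).
    """
    rows, cols = len(amplitude), len(amplitude[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total, count = 0.0, 0
            # Average over the window, truncated at the borders of the map.
            for dr in range(-radius, radius + 1):
                for dc in range(-radius, radius + 1):
                    rr, cc = r + dr, c + dc
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += amplitude[rr][cc]
                        count += 1
            out[r][c] = amplitude[r][c] / (total / count + eps)
    return out


# A single strong component surrounded by weak ones.
spectrum = [
    [1.0, 1.0, 1.0],
    [1.0, 10.0, 1.0],
    [1.0, 1.0, 1.0],
]
weights = relative_magnitude(spectrum)
```

In this toy input the center component has a neighborhood mean of 2, so its weight is about 5, while its neighbors fall below 1. A map like `weights` is the kind of quantity that could then be used as a weighting factor on attention scores.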

Abstract:
Planetary exploration requires efficient methods for subsurface sampling, especially under extreme energy limitations. Traditional drilling methods are often energy intensive and require large platforms, limiting their applicability. Bio-inspired burrowing techniques, inspired by animals like moles, offer lightweight, low-power alternatives suitable for small robotic platforms. This paper presents a novel bio-inspired robotic platform, the Mole-like Incisor-Burrowing Robotic Platform (MIRP), designed to mimic the incisor-burrowing behavior of naked mole rats. The MIRP features an 11-DOF mechanism with a compact design (220 mm × 140 mm × 80 mm) and uses servomotors to achieve low energy consumption. The robot combines a quadrupedal locomotion mechanism with an incisor-burrowing mechanism, allowing it to navigate granular terrains and perform excavation tasks. Kinematic analysis, including inverse kinematics and closed-chain analysis, was conducted to optimize the robot’s motion strategy. A prototype was developed and tested in a simulated lunar regolith environment to test its maneuverability and burrowing performance. The power consumption of the prototype is below 10 W. This work validates the feasibility of bio-inspired incisor-burrowing for planetary exploration, offering a cost-effective and efficient solution for future extraterrestrial missions.

Abstract:
Large Language Models are increasingly used in robotics for task planning, but their reliance on textual inputs limits their adaptability to real-world changes and failures. To address these challenges, we propose LERa — Look, Explain, Replan — a Visual Language Model-based replanning approach that utilizes visual feedback. Unlike existing methods, LERa requires only a raw RGB image, a natural language instruction, an initial task plan, and failure detection — without additional information such as object detection or predefined conditions that may be unavailable in a given scenario. The replanning process consists of three steps: (i) Look — where LERa generates a scene description and identifies errors; (ii) Explain — where it provides corrective guidance; and (iii) Replan — where it modifies the plan accordingly. LERa is adaptable to various agent architectures and can handle errors from both dynamic scene changes and task execution failures. We evaluate LERa on the newly introduced ALFRED-ChaOS and VirtualHome-ChaOS datasets, achieving a 40% improvement over baselines in dynamic environments. In tabletop manipulation tasks with a predefined probability of task failure within the PyBullet simulator, LERa improves success rates by up to 67%. Further experiments, including real-world trials with a tabletop manipulator robot, confirm LERa’s effectiveness in replanning. We demonstrate that LERa is a robust and adaptable solution for error-aware task execution in robotics. The project page is available at https://lera-robo.github.io.

Abstract:
We present an evolved steerable version of the single-tail Fish-&-Ribbon–Inspired Small Swimming Harmonic roBot (FRISSHBot), a 59-mg biologically inspired swimmer, which is driven by a new shape-memory alloy (SMA)-based bimorph actuator. The new FRISSHBot is controllable in the two-dimensional (2D) space, which enabled the first demonstration of feedback-controlled trajectory tracking of a single-tail aquatic robot with onboard actuation at the subgram scale. These new capabilities are the result of a physics-informed design with an enlarged head and shortened tail relative to those of the original platform. Enhanced by its design, this new platform achieves forward swimming speeds of up to 13.6 mm/s (0.38 Bl/s), which is over four times that of the original platform. Furthermore, when following 2D references in closed loop, the tested FRISSHBot prototype attains forward swimming speeds of up to 9.1 mm/s, root-mean-square (RMS) tracking errors as low as 2.6 mm, turning rates of up to 13.1 °/s, and turning radii as small as 10 mm.

Abstract:
Recent progress in robotic manipulation has been fueled by large-scale datasets collected across diverse environments. Training robotic manipulation policies on these datasets is traditionally performed in a centralized manner, raising concerns regarding scalability, adaptability, and data privacy. While federated learning enables decentralized, privacy-preserving training, its application to robotic manipulation remains largely unexplored. We introduce FLAME (Federated Learning Across Manipulation Environments), the first benchmark designed for federated learning in robotic manipulation. FLAME consists of: (i) a set of large-scale datasets of over 160,000 expert demonstrations of multiple manipulation tasks, collected across a wide range of simulated environments; (ii) a training and evaluation framework for robotic policy learning in a federated setting. We evaluate standard federated learning algorithms in FLAME, showing their potential for distributed policy learning and highlighting key challenges. Our benchmark establishes a foundation for scalable, adaptive, and privacy-aware robotic learning. The code is publicly available at https://github.com/KTH-RPL/ELSA-Robotics-Challenge.

Abstract:
Language-conditioned robotic policies allow users to specify tasks using natural language. While much research has focused on improving the action prediction of language-conditioned policies, reasoning about task descriptions has been largely overlooked. Ambiguous task descriptions often lead to downstream policy failures due to misinterpretation by the robotic agent. To address this challenge, we introduce AmbResVLM, a novel method that grounds language goals in the observed scene and explicitly reasons about task ambiguity. We extensively evaluate its effectiveness in both simulated and real-world domains, demonstrating superior task ambiguity detection and resolution compared to recent state-of-the-art methods. Finally, real robot experiments show that our model improves the performance of downstream robot policies, increasing the average success rate from 69.6% to 97.1%. We make the data, code, and trained models publicly available at https://ambres.cs.uni-freiburg.de.

Abstract:
Time-optimal trajectory generation (TOTG) is critical in robotics applications to minimize travel time and increase robot task efficiency. To ensure the trajectory is feasible and executable by the robot, it is important to constrain the trajectory kinodynamics subject to the robot actuator limits. A typical actuator has multiple limits: (1) a peak limit and (2) multi-level continuous limits with different operation time windows. The peak limit bounds the instantaneous kinodynamics (IKD), whereas the continuous limits bound the system continuous kinodynamics (CKD). Existing works only constrain IKD, usually by the actuator peak limit, to achieve time optimality. However, a joint that can operate at its peak limit only momentarily will overheat, shortening the robot's service life, if the motion continues. Alternatively, users can constrain the IKD with a reduced peak limit to avoid violating continuous limits. However, the reduced peak limit would inevitably sacrifice task efficiency. To address the challenge, this paper studies TOTG with both IKD and CKD, and proposes TOTG-C. It formulates the TOTG as a nonlinear programming (NLP) problem. In particular, it proposes a novel formulation to encode the multi-level CKD constraints efficiently. To the best of our knowledge, TOTG-C is the first work that explicitly considers multi-level CKD constraints. We demonstrate the effectiveness and robustness of the proposed TOTG-C both in simulation and real robot experiments.
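
To make the distinction between the two kinds of limits concrete, the sketch below checks a sampled torque profile against a peak limit and against several continuous limits, each tied to a time window. It is not the TOTG-C formulation: measuring the continuous load as the RMS torque over a sliding window is an assumption of this example, and the function name and numbers are hypothetical.

```python
import math
from typing import List, Sequence, Tuple


def satisfies_limits(
    torque: Sequence[float],
    dt: float,
    peak_limit: float,
    continuous_levels: List[Tuple[float, float]],
) -> bool:
    """Check a sampled torque profile against peak and windowed continuous limits.

    continuous_levels holds (limit, window_seconds) pairs. Using the RMS value
    over each sliding window is an assumption of this sketch.
    """
    if not torque:
        return True
    # Instantaneous check: every sample must stay within the peak limit.
    if any(abs(t) > peak_limit for t in torque):
        return False
    squared = [t * t for t in torque]
    for limit, window in continuous_levels:
        n = min(len(squared), max(1, round(window / dt)))  # samples per window
        running = sum(squared[:n])
        if math.sqrt(running / n) > limit:
            return False
        # Slide the window one sample at a time, updating the running sum.
        for i in range(n, len(squared)):
            running += squared[i] - squared[i - n]
            if math.sqrt(max(running, 0.0) / n) > limit:
                return False
    return True


short_burst = [10.0] * 5 + [2.0] * 15   # brief high torque, then low torque
sustained = [10.0] * 20                 # high torque held for the whole profile
levels = [(8.0, 1.0)]                   # hypothetical: RMS of at most 8 over any 1 s window
print(satisfies_limits(short_burst, 0.1, 12.0, levels))
print(satisfies_limits(sustained, 0.1, 12.0, levels))
```

Both profiles respect the peak limit at every sample, but only the short burst also respects the windowed limit, which is the case a peak-only constraint cannot distinguish.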

Abstract:
Multi-agent pathfinding (MAPF) is a common abstraction of multi-robot trajectory planning problems, where multiple homogeneous robots simultaneously move in the shared environment. While solving MAPF optimally has been proven to be NP-hard, scalable and efficient solvers are vital for real-world applications such as logistics and search-and-rescue. To this end, decentralized suboptimal MAPF solvers that leverage machine learning have come on stage. Building on the success of the recently introduced MAPF-GPT, a pure imitation learning solver, we introduce MAPF-GPT-DDG. This novel approach effectively fine-tunes the pre-trained MAPF model using centralized expert data. Leveraging a novel delta-data generation mechanism, MAPF-GPT-DDG accelerates training while significantly improving performance at test time. Our experiments demonstrate that MAPF-GPT-DDG surpasses all existing learning-based MAPF solvers, including the original MAPF-GPT, regarding solution quality across many testing scenarios. Remarkably, it can work with MAPF instances involving up to 1 million agents in a single environment, setting a new milestone for scalability in MAPF domains.

Abstract:
Robot pose estimation is a challenging and crucial task for vision-based surgical robotic automation. Typical robotic calibration approaches, however, are not applicable to surgical robots, such as the da Vinci Research Kit (dVRK) [1], due to joint angle measurement errors from cable-drives and the partially visible kinematic chain. Hence, previous works in surgical robotic automation used tracking algorithms to estimate the pose of the surgical tool in real-time and compensate for the joint angle errors. However, a major limitation of these previous tracking works is the initialization step, which relied on only keypoints and SolvePnP. In this work, we fully explore the potential of geometric primitives beyond keypoints, namely cylinders, with differentiable rendering, and construct a versatile pose matching pipeline in a novel pose hypothesis space. We demonstrate the state-of-the-art performance of our single-shot calibration method with both calibration consistency and real surgical tasks. As a result, this marker-less calibration approach proves to be a robust and generalizable initialization step for surgical tool tracking.

Abstract:
This work presents the implementation and evaluation of a real-time collision-avoiding motion planning algorithm for highly dynamic environments. By combining short-horizon obstacle estimation with the robot constraints, our method implements collision-avoiding situational-aware motion planning by heuristically exploring multiple relevant paths. Directly feeding the current motion setpoint of the path into the low-level controller closes the loop between motion planning and low-level control, ensuring constraint-aware execution. Its practical implementation in physical robots in dynamic RoboCup-like scenarios validated its effectiveness, with low computational costs enabling fast adaptation to changing environments. Furthermore, the capabilities of our motion planner were demonstrated during RoboCup 2024 and practice matches.

Abstract:
Like humans who rely on landmarks for orientation, autonomous robots depend on feature-rich environments for accurate localization. In this paper, we propose the GFM-Planner, a perception-aware trajectory planning framework based on the geometric feature metric, which enhances LiDAR localization accuracy by guiding the robot to avoid degraded areas. First, we derive the Geometric Feature Metric (GFM) from the fundamental LiDAR localization problem. Next, we design a 2D grid-based Metric Encoding Map (MEM) to efficiently store GFM values across the environment. A constant-time decoding algorithm is further proposed to retrieve GFM values for arbitrary poses from the MEM. Finally, we develop a perception-aware trajectory planning algorithm that improves LiDAR localization capabilities by guiding the robot in selecting trajectories through feature-rich areas. Both simulation and real-world experiments demonstrate that our approach enables the robot to actively select trajectories that significantly enhance LiDAR localization accuracy.

Abstract:
Predicting agents impacted by legal policies, physical limitations, and operational preferences is inherently difficult. In recent years, neuro-symbolic methods have emerged, integrating machine learning and symbolic reasoning models into end-to-end learnable systems. Hereby, a promising avenue for expressing high-level constraints over multi-modal input data in robotics has opened up. This work introduces an approach for Bayesian estimation of agents expected to comply with a human-interpretable neuro-symbolic model we call its Constitution. Hence, we present the Constitutional Filter (CoFi), leading to improved tracking of agents by leveraging expert knowledge, incorporating deep learning architectures, and accounting for environmental uncertainties. CoFi extends the general, recursive Bayesian estimation setting, ensuring compatibility with a vast landscape of established techniques such as Particle Filters. To underpin the advantages of CoFi, we evaluate its performance on real-world marine traffic data. Beyond improved performance, we show how CoFi can learn to trust and adapt to the level of compliance of an agent, recovering baseline performance even if the assumed Constitution clashes with reality.

Abstract:
Understanding how humans would behave during hand-object interaction (HOI) is vital for applications in service robot manipulation and extended reality. To achieve this, some recent works simultaneously forecast hand trajectories and object affordances on human egocentric videos. The joint prediction serves as a comprehensive representation of future HOI in 2D space, indicating potential human motion and motivation. However, the existing approaches mostly adopt the autoregressive paradigm, which lacks bidirectional constraints within the holistic future sequence, and accumulates errors along the time axis. Meanwhile, they overlook the effect of camera egomotion on first-person view predictions. To address these limitations, we propose a novel diffusion-based HOI prediction method, namely Diff-IP2D, to forecast future hand trajectories and object affordances with bidirectional constraints in an iterative non-autoregressive manner on egocentric videos. Motion features are further integrated into the conditional denoising process to make Diff-IP2D aware of the camera wearer’s dynamics for more accurate interaction prediction. Extensive experiments demonstrate that Diff-IP2D significantly outperforms the state-of-the-art baselines on both the off-the-shelf and our newly proposed evaluation metrics. This highlights the efficacy of leveraging a generative paradigm for 2D HOI prediction. The code and the video have been released at https://github.com/IRMVLab/Diff-IP2D.

Abstract:
Robot-assisted minimally invasive surgery is widely used because of its superior postoperative recovery outcomes. However, the workload for surgeons remains high. The development of autonomous suturing capabilities in surgical robots is poised to significantly reduce surgeon workload. In this study, we present a novel method for autonomous suturing using a minimally invasive surgical robot. We quantify the surgical suturing requirements and propose corresponding metrics for evaluating the suturing effect. We also use the dynamic adjustment of stitch position to optimize the surgical robot autonomous suturing scheme. Furthermore, we employ particle swarm algorithms to enhance the grasping posture of surgical instruments, enabling the robot to achieve optimal suture needle clamping. On the parametric suturing indices, our method matches the level of an expert operator when suturing two types of wounds: gauze and egg membrane. The autonomous suturing method proposed in this study is currently deployed on our own surgical robot. The experimental results show that the stitching effect of our proposed autonomous robot stitching method is already close to that of surgeons using the same robot, and it maintains good consistency in multiple sets of experiments. The method proposed in this study can be generalized to various other surgical robots, laying the foundation for surgical robots to achieve fully autonomous surgery.

Abstract:
This paper presents a robust monocular visual SLAM system that simultaneously utilizes point, line, and vanishing point features for accurate camera pose estimation and mapping. To address the critical challenge of achieving reliable localization in low-texture environments, where traditional point-based systems often fail due to insufficient visual features, we introduce a novel approach leveraging Global Primitives structural information to improve the system’s robustness and accuracy performance. Our key innovation lies in constructing vanishing points from line features and proposing a weighted fusion strategy to build Global Primitives in the world coordinate system. This strategy associates multiple frames with non-overlapping regions and formulates a multi-frame reprojection error optimization, significantly improving tracking accuracy in texture-scarce scenarios. Evaluations on various datasets show that our system outperforms state-of-the-art methods in trajectory precision, particularly in challenging environments.

Abstract:
In this paper, we propose a novel recommendation-based path planning system that leverages VLM and LLM to interpret user intentions. The system infers user preferences through both conversational and behavioral data, thereby delivering personalized navigation and guidance services within complex consumer environments. The LLM component is designed to deduce user intent even in the absence of direct item references by utilizing higher-level conceptual cues, while the VLM component analyzes images of user behavior to extract contextual information. A virtual museum simulation was implemented using Isaac Sim, and a metadata dataset for exhibits was constructed to validate the system’s performance. Experimental results demonstrate that the proposed system effectively interprets user intent and generates optimized pathways. Future work will focus on extending the system to consumer spaces such as department stores and supermarkets—areas where conventional 2D semantic maps are inadequate—by exploring topology-based mapping solutions. Ultimately, this research aims to revolutionize user experience by enabling personalized robotic services in consumer environments.

Abstract:
Soft manipulators (SMs) have shown great potential for interactive tasks in confined environments. However, obstacle avoidance for SMs may conflict with the manipulator’s configuration, the planned trajectory for tracking control, and the target position for grasping. To coordinate configuration planning, tracking control, and target grasping in obstacle avoidance, this study proposes a hierarchical configuration planning framework with three levels: behavior planning, configuration planning, and shape/position control. At the behavior planning level, a Discrete Event System (DES)-based planner is designed to orchestrate mode transitions among obstacle avoidance, tracking control, and target grasping. The configuration planning level adopts the Bézier curve to model the SM backbone curve and constructs a repulsive potential field to quantify obstacle effects on the entire manipulator configuration. Under the constraints of grasping distance and material physical limit, the control points of the Bézier curve corresponding to the optimal configuration that minimizes the repulsive potential energy are computed. Experiments demonstrate the effectiveness of the proposed framework in achieving collision-free configuration planning for object grasping and placement in confined operational scenarios.
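
The configuration planning idea above can be sketched in a few lines: a Bézier curve stands in for the backbone, and a repulsive potential is summed over points sampled along it. The abstract does not give the potential function, so the classic form 0.5 * gain * (1/d - 1/d0)^2 inside an influence radius d0 is an assumption here, as are the 2D setting, the single point obstacle, and all numeric values.

```python
import math
from typing import Sequence, Tuple

Point = Tuple[float, float]


def bezier_point(control: Sequence[Point], t: float) -> Point:
    """Evaluate a Bezier curve of any degree at t in [0, 1] with de Casteljau's algorithm."""
    pts = list(control)
    while len(pts) > 1:
        # Repeated linear interpolation between neighbouring points.
        pts = [
            ((1 - t) * x0 + t * x1, (1 - t) * y0 + t * y1)
            for (x0, y0), (x1, y1) in zip(pts, pts[1:])
        ]
    return pts[0]


def repulsive_energy(
    control: Sequence[Point],
    obstacle: Point,
    influence: float = 0.5,
    gain: float = 1.0,
    samples: int = 50,
) -> float:
    """Sum an assumed repulsive potential over points sampled along the backbone."""
    total = 0.0
    for k in range(samples + 1):
        x, y = bezier_point(control, k / samples)
        d = math.hypot(x - obstacle[0], y - obstacle[1])
        if d < influence:  # the obstacle only acts within its influence radius
            d = max(d, 1e-9)
            total += 0.5 * gain * (1.0 / d - 1.0 / influence) ** 2
    return total


obstacle = (-0.1, 0.5)
straight = [(0.0, 0.0), (0.0, 0.33), (0.0, 0.66), (0.0, 1.0)]
bent = [(0.0, 0.0), (0.4, 0.33), (0.4, 0.66), (0.0, 1.0)]  # bends away from the obstacle
print(repulsive_energy(straight, obstacle), repulsive_energy(bent, obstacle))
```

A planner in this spirit would search over the inner control points for the lowest energy while keeping the end point within grasping distance; that constrained search is omitted here.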

Abstract:
Depth information, which specifies the distance between objects and the current position of the robot, is essential for many robot tasks such as navigation. Recently, researchers have proposed depth completion frameworks to provide dense depth maps that offer comprehensive information about the surrounding environment. However, existing methods show significant trade-offs between computational efficiency and accuracy during inference. The substantial memory and computational requirements make them unsuitable for real-time applications, highlighting the need to improve the completeness and accuracy of depth information while improving processing speed to enhance robot performance in various tasks. To address these challenges, in this paper, we propose CHADET (cross-hierarchical-attention depth-completion transformer), a lightweight depth-completion network that can generate accurate dense depth maps from RGB images and sparse depth points. For each pair, its feature is extracted from the depthwise blocks and passed to the equally lightweight transformer-based decoder. In the decoder, we utilize the novel cross-hierarchical-attention module that refines the image features from the depth information. Our approach improves the quality of the depth map prediction and reduces its memory usage, as validated on the KITTI, NYUv2, and VOID datasets.

Abstract:
We present Discoverse, the first unified, modular, open-source 3DGS-based simulation framework for Real2Sim2Real robot learning. It features a holistic Real2Sim pipeline that synthesizes hyper-realistic geometry and appearance of complex real-world scenarios, paving the way for analyzing and bridging the Sim2Real gap. Powered by Gaussian Splatting and MuJoCo, Discoverse enables massively parallel simulation of multiple sensor modalities and accurate physics, with inclusive support for existing 3D assets, robot models, and ROS plugins, empowering large-scale robot learning and complex robotic benchmarks. Through extensive experiments on imitation learning, Discoverse demonstrates state-of-the-art zero-shot Sim2Real transfer performance compared to existing simulators. For code and demos: https://air-discoverse.github.io/.

Abstract:
This paper introduces a new framework, Drive-Blip2, built upon the BLIP2-OPT architecture, to generate accurate and contextually relevant explanations for emerging driving scenarios. While existing vision-language models perform well in general tasks, they encounter difficulties in understanding complex, multi-object environments, particularly in real-time applications such as autonomous driving, where the rapid identification of key objects is crucial. To address this limitation, an Attention Map Generator is proposed to highlight significant objects relevant to driving decisions within critical video frames. By directing the model’s focus to these key regions, the generated attention map helps produce clear and relevant explanations, enabling drivers to better understand the vehicle’s decision-making process in critical situations. Evaluations on the DRAMA dataset reveal significant improvements in explanation quality, as indicated by higher BLEU, ROUGE, CIDEr, and SPICE scores compared to baseline models. These findings underscore the potential of targeted attention mechanisms in vision-language models for enhancing explainability in real-time autonomous driving.

Abstract:
Recently, camera localization has been widely adopted in autonomous robotic navigation due to its efficiency and convenience. However, autonomous navigation in unknown environments often suffers from scene ambiguity, environmental disturbances, and dynamic object transformation in camera localization. To address this problem, inspired by the brain cognitive navigation mechanism (such as grid cells, place cells, and head direction cells), we propose a novel neurobiological camera location method, namely NeuroLoc. Firstly, we designed a Hebbian learning module driven by place cells to save and replay historical information, aiming to restore the details of historical representations and solve the issue of scene fuzziness. Secondly, we utilized the head direction cell-inspired internal direction learning as multi-head attention embedding to help restore the true orientation in similar scenes. Finally, we added a 3D grid center prediction in the pose regression module to reduce the final wrong prediction. We evaluate the proposed NeuroLoc on commonly used benchmark indoor and outdoor datasets. The experimental results show that our NeuroLoc can enhance the robustness in complex environments and improve the performance of pose regression by using only a single image.

Abstract:
This work presents an optimization method for generating kinodynamically feasible and collision-free multi-robot trajectories that exploits an incremental denoising scheme in diffusion models. Our key insight is that high-quality trajectories can be discovered merely by denoising noisy trajectories sampled from a distribution. This approach has no learning component, relying instead on only two ingredients: a dynamical model of the robots to obtain feasible trajectories via rollout, and a fitness function to guide denoising with Monte Carlo gradient approximation. The proposed framework iteratively optimizes a deformation for the previous trajectory with the current denoising process, allows anytime refinement as time permits, supports different dynamics, and benefits from GPU acceleration. Our evaluations for differential-drive and holonomic teams with up to 16 robots in 2D and 3D worlds show its ability to discover high-quality solutions faster than other black-box optimization methods such as MPPI. In a 2D holonomic case with 16 robots, it is almost twice as fast. As evidence for feasibility, we demonstrate zero-shot deployment of the planned trajectories on eight multirotors.
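
The sketch below illustrates, under strong simplifications, how a fitness function alone can steer an iterative denoising loop: perturbations are sampled around the current estimate, weighted by a softmax of their fitness, and averaged into an update while the noise scale shrinks. This is a generic Monte Carlo ascent estimate, not the authors' algorithm; it omits the dynamics rollout entirely, and the function names, fitness, and hyperparameters are hypothetical.

```python
import math
import random
from typing import Callable, List, Sequence


def monte_carlo_denoise(
    x0: Sequence[float],
    fitness: Callable[[List[float]], float],
    steps: int = 60,
    samples: int = 64,
    sigma: float = 0.5,
    temperature: float = 0.1,
    decay: float = 0.95,
    seed: int = 0,
) -> List[float]:
    """Refine x0 by repeatedly averaging random perturbations, weighted by fitness."""
    rng = random.Random(seed)
    x = list(x0)
    for _ in range(steps):
        # Sample noisy candidates around the current estimate.
        noises = [[rng.gauss(0.0, sigma) for _ in x] for _ in range(samples)]
        scores = [fitness([xi + ni for xi, ni in zip(x, noise)]) for noise in noises]
        # Softmax weights; subtracting the best score keeps exp() numerically stable.
        best = max(scores)
        weights = [math.exp((s - best) / temperature) for s in scores]
        total = sum(weights)
        # The weighted mean perturbation acts as a Monte Carlo ascent direction.
        x = [
            xi + sum(w * noise[i] for w, noise in zip(weights, noises)) / total
            for i, xi in enumerate(x)
        ]
        sigma *= decay  # shrink the noise scale, as in a denoising schedule
    return x


# Hypothetical fitness: negative squared distance of four waypoints to a target.
target = [1.0, 2.0, 3.0, 4.0]


def fitness(traj: List[float]) -> float:
    return -sum((a - b) ** 2 for a, b in zip(traj, target))


start = [0.0, 0.0, 0.0, 0.0]
result = monte_carlo_denoise(start, fitness)
print(fitness(start), fitness(result))
```

Because the loop only ever calls the fitness function, it can be stopped after any iteration and still return its current best estimate, which mirrors the anytime property mentioned in the abstract.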

Abstract:
The next generation of active safety features in autonomous vehicles should be capable of safely executing evasive hazard-avoidance maneuvers to achieve rapid motion at the limits of vehicle handling. This paper presents a novel framework, ManeuverGPT, for generating and executing highly dynamic stunt maneuvers in autonomous vehicles using large language model (LLM)-based agents as controllers. We target aggressive maneuvers, such as J-turns, within the CARLA simulation environment and demonstrate an iterative, prompt-based approach to refine vehicle control parameters, starting tabula rasa without retraining model weights. We propose an agentic architecture composed of three specialized agents: (1) a Query Enricher Agent for contextualizing user commands, (2) a Driver Agent for generating maneuver parameters, and (3) a Parameter Validator Agent that enforces physics-based and safety constraints. Experimental results demonstrate successful J-turn execution across multiple vehicle models through textual prompts that adapt to differing vehicle dynamics. We evaluate performance via established success criteria and discuss limitations regarding numeric precision and scenario complexity. Our findings underscore the potential of LLM-driven control for high-agility maneuvers, while highlighting the importance of hybrid approaches that combine language-based reasoning with algorithmic validation. We provide an open-source implementation at https://github.com/SHi-ON/ManeuverGPT to foster further research within the broader community.

Abstract:
Teleoperation is essential for autonomous robot learning, especially in manipulation tasks that require human demonstrations or corrections. However, most existing systems only offer unilateral robot control and lack the ability to synchronize the robot’s status with the teleoperation hardware, preventing real-time, flexible intervention. In this work, we introduce HACTS (Human-As-Copilot Teleoperation System), a novel system that establishes bilateral, real-time joint synchronization between a robot arm and teleoperation hardware. This simple yet effective feedback mechanism, akin to a steering wheel in autonomous vehicles, enables the human copilot to intervene seamlessly while collecting action-correction data for future learning. Implemented using 3D-printed components and low-cost, off-the-shelf motors, HACTS is both accessible and scalable. Our experiments show that HACTS significantly enhances performance in imitation learning (IL) and reinforcement learning (RL) tasks, boosting IL recovery capabilities and data efficiency, and facilitating human-in-the-loop RL. HACTS paves the way for more effective and interactive human-robot collaboration and data-collection, advancing the capabilities of robot manipulation.

Abstract:
To succeed in the real world, robots must deal with situations that differ from those seen during training. For legged robots, these out-of-distribution situations mainly include challenging dynamic gaps and perceptual gaps. Here we study the problem of robust locomotion in such novel situations. While previous methods usually rely on designing elaborate training and adaptation techniques, we approach the problem from a network model perspective. Our approach, RObust Locomotion Transformer (ROLT), a variation of the transformer, can achieve robustness in a variety of unseen conditions. ROLT introduces two key designs: body tokenization and consistent dropout. Body tokenization supports knowledge sharing across different limbs, which boosts the generalization ability of the network. Meanwhile, a novel dropout strategy enhances the policy’s robustness to unseen perceptual noise. We conduct extensive experiments on both quadruped and hexapod robots. Results demonstrate that ROLT is more robust than existing methods. Although trained in only a few dynamic settings, the learned policy generalizes well to multiple unseen dynamic conditions. Additionally, despite training with clean observations, the model handles challenging corruption noise during testing.

Abstract:
We present SCORE, a visual relocalization system that achieves unprecedented map compactness through semantically labeled 3D line maps. SCORE requires only 0.01%-0.1% of the storage needed by structure-based or learning-based baselines, while maintaining practical accuracy and comparable runtime. The key innovation is a novel robust mechanism, Saturated Consensus Maximization (Sat-CM), which generalizes classical Consensus Maximization (CM) by assigning diminishing weights to inlier associations with probabilistic justification. Under extreme outlier ratios (up to 99.5%) arising from one-to-many ambiguity in semantic matching, Sat-CM enables accurate estimation when CM fails. To ensure computational efficiency, we propose an accelerating framework for globally solving Sat-CM formulations and specialize it for the Perspective-n-Lines problem at the core of SCORE.
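
The difference between counting inliers and saturating them can be shown with a toy score. In the sketch below, residuals are grouped by query feature; classical consensus counts every inlier association, while the saturated variant gives the k-th inlier of the same query a weight of 1/k. The harmonic weighting is an assumption chosen for illustration and is not taken from the paper, and the feature names and residual values are hypothetical.

```python
from typing import Dict, List


def consensus_score(residuals: Dict[str, List[float]], threshold: float) -> int:
    """Classical consensus: count every association whose residual is within the threshold."""
    return sum(sum(1 for r in rs if abs(r) <= threshold) for rs in residuals.values())


def saturated_consensus_score(residuals: Dict[str, List[float]], threshold: float) -> float:
    """Saturated consensus: the k-th inlier of one query feature adds only 1/k (assumed form)."""
    score = 0.0
    for rs in residuals.values():
        inliers = sum(1 for r in rs if abs(r) <= threshold)
        score += sum(1.0 / k for k in range(1, inliers + 1))
    return score


# Residuals of candidate associations under two hypothetical poses, grouped by 2D query line.
pose_a = {"q1": [0.1, 0.2, 0.1, 0.3], "q2": [5.0], "q3": [4.0]}   # one query, many ambiguous inliers
pose_b = {"q1": [0.1, 3.0, 3.0, 3.0], "q2": [0.2], "q3": [0.1]}   # every query explained once

print(consensus_score(pose_a, 0.5), consensus_score(pose_b, 0.5))
print(saturated_consensus_score(pose_a, 0.5), saturated_consensus_score(pose_b, 0.5))
```

Plain counting prefers the first pose because one ambiguous query collects four inliers, whereas the saturated score prefers the second pose, in which every query feature is explained. This is the kind of one-to-many ambiguity the abstract describes.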

Abstract:
This paper considers a multi-robot trajectory planning problem with inter-robot connectivity maintenance for information gathering. Given an information map in the form of a distribution over the workspace, ergodic search plans trajectories along which the time spent in any region is proportional to the amount of information in that region, and can balance exploration and exploitation. Existing ergodic search rarely considers the limited communication range among robots or connectivity maintenance, and this paper takes a step to fill this gap. Multi-robot connectivity maintenance has also been studied extensively, including continual, periodic, and intermittent connectivity. Naively combining these methods with ergodic search may prevent the planner from finding high-quality ergodic trajectories or lead to poor connectivity among the robots. To handle the challenge, this paper adapts an intermittent connectivity maintenance strategy to the ergodic search framework, and develops a two-phase trajectory planning approach utilizing the augmented Lagrangian method. Our simulation and real drone experiments show that under the same connectivity maintenance requirement, our approach plans trajectories that are about 10 times better than the baselines in terms of the ergodic metric.

Abstract:
Achieving precise and generalizable grasping across diverse objects and environments is essential for intelligent and collaborative robotic systems. However, existing approaches often struggle with ambiguous affordance reasoning and limited adaptability to unseen objects, leading to suboptimal grasp execution. In this work, we propose GAT-Grasp, a gesture-driven grasping framework that directly utilizes human hand gestures to guide the generation of task-specific grasp poses with appropriate positioning and orientation. Specifically, we introduce a retrieval-based affordance transfer paradigm, leveraging the implicit correlation between hand gestures and object affordances to extract grasping knowledge from large-scale human-object interaction videos. By eliminating the reliance on pre-given object priors, GAT-Grasp enables zero-shot generalization to novel objects and cluttered environments. Real-world evaluations confirm its robustness across diverse and unseen scenarios, demonstrating reliable grasp execution in complex task settings.

Abstract:
Recently, there has been growing interest in using low-cost sensor combinations, such as cameras and IMUs, to achieve accurate localization within pre-built pointcloud maps. In this paper, we propose a novel hybrid visual-LiDAR mapping and visual-only re-localization framework, specifically designed for UAVs with limited computational resources operating in challenging environments. Keyframes function as a bridge in our system, associating images with the pointcloud to facilitate efficient and accurate pose estimation. In addition, our system creates omnidirectional keyframes at the mapping stage, enabling effective re-localization from any orientation, which enhances the robustness and practicality of our system. Experiments show that the proposed algorithm achieves high localization accuracy on pre-built maps and is capable of running in real-time on UAVs for autonomous navigation tasks. The source code will be made publicly available soon.

Abstract:
Legged robots require low-inertia limbs capable of carrying a high payload. The design of such limbs poses challenges in integrating optimal kinematic structures with practical design considerations. Optimization algorithms can aid the search for optimal design parameters. This paper introduces an open-source framework for optimizing topology and parameters of closed linkage mechanisms, addressing the need for task-specific robotic limbs. Closed-loop structures are motivated by two main purposes: (1) to decrease robotic limb inertia by relocation of actuators close to the robot’s body and (2) to redistribute efforts among actuators. The framework leverages joint-based spatial graph representations, kinetostatic criteria, and multi-objective genetic algorithms to optimize mechanism topology and parameters. Focusing on kinetostatic criteria such as Jacobian metrics and inertia properties, the framework swiftly explores the design space to balance trade-offs in robot linkages. We demonstrate the framework pipeline for the task of optimizing planar robotic legs with 2 degrees of freedom. Project GitHub page: https://licaibeerlab.github.io/jmoves.github.io/

Abstract:
Humans naturally obtain intuition about the interactions between and the stability of rigid objects by observing and interacting with the world. It is this intuition that governs the way in which we regularly configure objects in our environment, allowing us to build complex structures from simple, everyday objects. Robotic agents, on the other hand, traditionally require an explicit model of the world that includes the detailed geometry of each object and an analytical model of the environment dynamics, which are difficult to scale and preclude generalization. Instead, robots would benefit from an awareness of intuitive physics that enables them to similarly reason over the stable interaction of objects in their environment. Towards that goal, we propose STACKGEN—a diffusion model that generates diverse stable configurations of building blocks matching a target silhouette. To demonstrate the capability of the method, we evaluate it in a simulated environment and deploy it in the real setting using a robotic arm to assemble structures generated by the model. Our code is available at https://ripl.github.io/StackGen.

Abstract:
Imitation learning (IL) has proven effective for enabling robots to acquire visuomotor skills through expert demonstrations. However, traditional IL methods are limited by their reliance on high-quality, often scarce, expert data, and suffer from covariate shift. To address these challenges, recent advances in offline IL have incorporated suboptimal, unlabeled datasets into the training. In this paper, we propose a novel approach to enhance policy learning from mixed-quality offline datasets by leveraging task-relevant trajectory fragments and rich environmental dynamics. Specifically, we introduce a state-based search framework that stitches state-action pairs from imperfect demonstrations, generating more diverse and informative training trajectories. Experimental results on standard IL benchmarks and real-world robotic tasks showcase that our proposed method significantly improves both generalization and performance. The code is available at https://github.com/BIT-KAUIS/SBR.

Abstract:
Insertion tasks are fundamental yet challenging for robots, particularly in autonomous operations, due to their continuous interaction with the environment. AI-based approaches appear to be up to the challenge, but in production they must not only achieve high success rates but also ensure insertion quality and reliability. To address this, we introduce QBIT, a quality-aware benchmarking framework that incorporates additional metrics such as force energy, force smoothness and completion time to provide a comprehensive assessment. To ensure statistical significance and minimize the sim-to-real gap, we randomize contact parameters in the MuJoCo simulator, account for perceptual uncertainty, and conduct large-scale experiments on a Kubernetes-based infrastructure. Our microservice-oriented architecture ensures extensibility, broad applicability, and improved reproducibility. To facilitate seamless transitions to physical robotic testing, we use ROS2 with containerization to reduce integration barriers. We evaluate QBIT using three insertion approaches: geometric-based, force-based, and learning-based, in both simulated and real-world environments. In simulation, we compare the accuracy of contact simulation using different mesh decomposition techniques. Our results demonstrate the effectiveness of QBIT in comparing different insertion approaches and accelerating the transition from laboratory to real-world applications. Code is available on GitHub.

Abstract:
This paper establishes a stiffness model for an origami-inspired pneumatic continuum manipulator (OPM) capable of a large stretch ratio and active stiffness modulation. A kinematic model is first established, using the piecewise constant curvature assumption, in order to describe the end-effector’s posture by configuration states. Subsequently, utilizing virtual work theory, the static model is derived, which integrates both pneumatic actuation and intrinsic elastic energy. Based on this foundation, a Cartesian compliance matrix is formulated to quantitatively predict 3D deformations under external loads. Experimental validation of the stiffness model demonstrates spatial prediction accuracy with maximum errors of 2.00 mm (z-axis) and 2.04° (roll) under 500 g payloads for one module. For the OPM, tested up to 300 g loading, positional and angular errors remain below 5 mm (x-axis) and 3° (pitch). This study aims to bridge pressure-stiffness coupling and enable model-based stiffness-position control for adaptive tasks.
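
The piecewise constant curvature assumption named in this abstract can be illustrated with a minimal planar sketch. The function name `pcc_tip_planar` and the arguments `length` and `curvature` are illustrative assumptions, not the paper's notation, and the sketch covers only a single segment bending in one plane rather than the full OPM model.

```python
import math

def pcc_tip_planar(length, curvature):
    """Tip position (x, z) and bending angle of one constant-curvature arc.

    The arc starts at the origin, tangent to the z-axis, and bends toward +x.
    All names here are hypothetical and chosen for illustration only.
    """
    theta = curvature * length  # total bending angle in radians
    if abs(curvature) < 1e-9:
        # Straight segment: limit of the arc formulas as curvature -> 0.
        return 0.0, length, 0.0
    x = (1.0 - math.cos(theta)) / curvature
    z = math.sin(theta) / curvature
    return x, z, theta
```

For example, `pcc_tip_planar(math.pi / 2, 1.0)` describes a quarter circle of unit radius, whose tip lies one unit along each axis.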

Abstract:
Visual place recognition is a critical component of robust simultaneous localization and mapping systems. Conventional approaches primarily rely on RGB imagery, but their performance degrades significantly in extreme environments, such as those with poor illumination or airborne particulate interference (e.g., smoke or fog). Furthermore, existing techniques often struggle with cross-scenario generalization. To overcome these limitations, we propose an RGB-thermal multimodal fusion framework for place recognition, specifically designed to enhance robustness in extreme environmental conditions. Our framework incorporates a dynamic RGB-thermal fusion module, coupled with dual fine-tuned vision foundation models as the feature extraction backbone. Experimental results on public datasets and our self-collected dataset demonstrate that our method significantly outperforms state-of-the-art RGB-based approaches, achieving generalizable and robust retrieval capabilities across day and night scenarios. The code is available at https://github.com/HITSZ-NRSL/RGB-Thermal-VPR.

Abstract:
Human drivers naturally balance the risks of different concerns while driving, including traffic rule violations, minor accidents, and fatalities. However, achieving the same behavior in autonomous driving systems remains an open problem. This paper extends a risk metric that has been verified in human-like driving studies to encompass more complex driving scenarios specified by linear temporal logic (LTL) that go beyond just collision risks. This extension incorporates the timing and severity of events into LTL specifications, thereby reflecting a human-like risk awareness. Without sacrificing expressivity for traffic rules, we adopt LTL specifications composed of safety and co-safety formulas, allowing the control synthesis problem to be reformulated as a reachability problem. By leveraging occupation measures, we further formulate a linear programming (LP) problem for this LTL-based risk metric. Consequently, the synthesized policy balances different types of driving risks, including both collision risks and traffic rule violations. The effectiveness of the proposed approach is validated in three typical traffic scenarios in the CARLA simulator.

Abstract:
Reinforcement learning (RL) is a fundamental and pivotal algorithm in the advancement of autonomous intelligence, including Embodied Intelligence and Physical Intelligence. The performance of RL directly influences the quality and efficiency of a robot’s decision-making and execution during interactions with its environment. Moreover, the robustness of RL remains a critical challenge that needs to be addressed. A promising approach to enhancing robustness is adversarial reinforcement learning. However, the existing methods primarily focus on perturbations in the state space, while perturbations in the action space have been relatively underexplored. The action space in RL is as crucial as the state space in autonomous intelligence. Furthermore, action-space perturbations provide a more comprehensive evaluation of RL robustness. Therefore, it is necessary and valuable to investigate RL robustness under action-space perturbations for the development of autonomous intelligence. To this end, we propose an adversarial learning framework that employs momentum-based gradient descent to model perturbations in the action space, such as actuator disturbances. Furthermore, we introduce an improved optimization method that integrates historical gradient information into conventional Stochastic Gradient Descent (SGD). This approach enhances training stability and improves perturbation efficiency. The proposed method is evaluated through simulations in the MuJoCo environment and UAV control experiments in GymFC, demonstrating significant improvements in robustness and adaptability under action-space perturbations. Additionally, real-world UAV flight tests are conducted to further validate the effectiveness of the proposed framework. The results confirm that the Sim-to-Real transfer is successful, providing empirical evidence for the applicability of our method in real-world scenarios. 
This study shows that enhancing RL robustness through action-space perturbations is feasible and effective. More importantly, our findings contribute to the future development of autonomous intelligence, particularly in improving its resilience to uncertainties and dynamic environments.
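
The idea of folding historical gradient information into a gradient step on an action-space perturbation can be sketched as follows. This is classical heavy-ball momentum with box clipping, shown only as an illustration; it is not the authors' optimizer. The names `momentum_perturbation`, `grad_fn`, `beta`, and `eps` are assumptions made for this sketch.

```python
def momentum_perturbation(grad_fn, dim, steps=50, lr=0.1, beta=0.9, eps=0.5):
    """Gradient ascent with heavy-ball momentum on a bounded perturbation.

    grad_fn(delta) is a hypothetical callback returning the gradient of the
    adversary's objective with respect to the perturbation delta (a list).
    """
    delta = [0.0] * dim
    velocity = [0.0] * dim
    for _ in range(steps):
        grad = grad_fn(delta)
        # Accumulate past gradients into the velocity term.
        velocity = [beta * v + g for v, g in zip(velocity, grad)]
        # Ascent step, then clip each component to the box [-eps, eps].
        delta = [max(-eps, min(eps, d + lr * v))
                 for d, v in zip(delta, velocity)]
    return delta
```

With a constant gradient the perturbation saturates at the bound `eps`, which mimics a worst-case actuator disturbance of limited magnitude.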

Abstract:
Object-goal navigation requires mobile robots to efficiently locate targets with visual and spatial information, yet existing methods struggle with generalization in unseen environments. Heuristic approaches with naive metrics fail in complex layouts, while graph-based and learning-based methods suffer from environmental biases and limited generalization. Although Large Language Models (LLMs) as planners or agents offer a rich knowledge base, they are cost-inefficient and lack targeted historical experience. To address these challenges, we propose the LLM-enhanced Object Affinities Transfer (LOAT) framework, integrating LLM-derived semantics with learning-based approaches to leverage experiential object affinities for better generalization in unseen settings. LOAT employs a dual-module strategy: one module accesses LLMs’ vast knowledge, and the other applies learned object semantic relationships, dynamically fusing these sources based on context. Evaluations in AI2-THOR and Habitat simulators show significant improvements in navigation success and efficiency, and real-world deployment demonstrates the zero-shot ability of LOAT to enhance object-goal navigation systems.

Abstract:
Recent work has demonstrated the potential of diffusion models in robot bimanual skill learning. However, existing methods ignore the learning of posture-dependent task features, which are crucial for adapting dual-arm configurations to meet specific force and velocity requirements in dexterous bimanual manipulation. To address this limitation, we propose Manipulability-Aware Diffusion Policy (ManiDP), a novel imitation learning method that not only generates plausible bimanual trajectories, but also optimizes dual-arm configurations to better satisfy posture-dependent task requirements. ManiDP achieves this by extracting bimanual manipulability from expert demonstrations and encoding the encapsulated posture features using Riemannian-based probabilistic models. These encoded posture features are then incorporated into a conditional diffusion process to guide the generation of task-compatible bimanual motion sequences. We evaluate ManiDP on six real-world bimanual tasks, where the experimental results demonstrate a 39.33% increase in average manipulation success rate and a 0.45 improvement in task compatibility compared to baseline methods. This work highlights the importance of integrating posture-relevant robotic priors into bimanual skill diffusion to enable human-like adaptability and dexterity.
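
To make the notion of manipulability concrete, the sketch below computes the velocity manipulability ellipsoid matrix J Jᵀ and Yoshikawa's scalar measure for a planar two-link arm. The arm model, link lengths, and function names are assumptions for illustration; the abstract does not state which manipulability formulation ManiDP uses for its bimanual setting.

```python
import math

def planar_2link_jacobian(q1, q2, l1=1.0, l2=1.0):
    """Position Jacobian of a planar two-link arm (hypothetical example arm)."""
    s1, c1 = math.sin(q1), math.cos(q1)
    s12, c12 = math.sin(q1 + q2), math.cos(q1 + q2)
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]

def manipulability_ellipsoid(J):
    """Return the symmetric 2x2 matrix M = J J^T."""
    return [[sum(J[i][k] * J[j][k] for k in range(2)) for j in range(2)]
            for i in range(2)]

def yoshikawa_measure(J):
    """Scalar manipulability sqrt(det(J J^T)); zero at singular postures."""
    M = manipulability_ellipsoid(J)
    det = M[0][0] * M[1][1] - M[0][1] * M[1][0]
    return math.sqrt(max(det, 0.0))
```

Because J Jᵀ is symmetric positive semi-definite, and positive definite away from singularities, such matrices are naturally handled with Riemannian tools, which is consistent with the Riemannian-based encoding the abstract describes.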

Abstract:
Visual loop closure detection traditionally relies on place recognition methods to retrieve candidate loops that are validated using computationally expensive RANSAC-based geometric verification. As false positive loop closures significantly degrade downstream pose graph estimates, verifying a large number of candidates in online simultaneous localization and mapping scenarios is constrained by limited time and compute resources. While most deep loop closure detection approaches only operate on pairs of keyframes, we relax this constraint by considering neighborhoods of multiple keyframes when detecting loops. In this work, we introduce LoopGNN, a graph neural network architecture that estimates loop closure consensus by leveraging cliques of visually similar keyframes retrieved through place recognition. By propagating deep feature encodings among nodes of the clique, our method yields high precision estimates while maintaining high recall. Extensive experimental evaluations on the TartanDrive 2.0 and NCLT datasets demonstrate that LoopGNN outperforms traditional baselines. Additionally, an ablation study across various keypoint extractors demonstrates that our method is robust, regardless of the type of deep feature encodings used, and exhibits higher computational efficiency compared to classical geometric verification baselines. We release our code, supplementary material, and keyframe data at https://loopgnn.cs.uni-freiburg.de.

Abstract:
Autonomous long-horizon mobile manipulation encompasses a multitude of challenges, including scene dynamics, unexplored areas, and error recovery. Recent works have leveraged foundation models for scene-level robotic reasoning and planning. However, the performance of these methods degrades when dealing with a large number of objects and large-scale environments. To address these limitations, we propose MORE, a novel approach for enhancing the capabilities of language models to solve zero-shot mobile manipulation planning for rearrangement tasks. MORE leverages scene graphs to represent environments, incorporates instance differentiation, and introduces an active filtering scheme that extracts task-relevant subgraphs of object and region instances. These steps yield a bounded planning problem, effectively mitigating hallucinations and improving reliability. Additionally, we introduce several enhancements that enable planning across both indoor and outdoor environments. We evaluate MORE on 81 diverse rearrangement tasks from the BEHAVIOR-1K benchmark, where it becomes the first approach to successfully solve a significant share of the benchmark, outperforming recent foundation model-based approaches. Furthermore, we demonstrate the capabilities of our approach in several complex real-world tasks, mimicking everyday activities. We make the code publicly available at https://more-model.cs.uni-freiburg.de.

Abstract:
State uncertainty is a primary obstacle to effective long-horizon robot task planning. State uncertainty can be decomposed into spatial uncertainty—resolved using SLAM—and uncertainty about the objects in the environment, formalized as the object scouting problem and modeled using the Locally Observable Markov Decision Process (LOMDP). We introduce a new planning framework specifically designed for object scouting with LOMDPs called the Scouting Partial-Order Planner (SPOP), which exploits the characteristics of partial order and regression planning to plan around knowledge gaps the robot may have about the existence, location, and state of relevant objects in its environment. Our results highlight the benefits of partial-order planning, demonstrating its suitability for object scouting due to its ability to identify absent but task-relevant objects, and show that it outperforms comparable planners in plan length, computation time, and execution time.

Abstract:
In visual place recognition (VPR), filtering and sequence-based matching approaches can improve performance by integrating temporal information across image sequences, especially in challenging conditions. While these methods are commonly applied, their effects on system behavior can be unpredictable and can actually make performance worse in certain situations. In this work, we present a new supervised learning approach that learns to predict the per-frame sequence matching receptiveness (SMR) of VPR techniques, enabling the system to selectively decide when to trust the output of a sequence matching system. Our approach is agnostic to the underlying VPR technique and effectively predicts SMR, and hence significantly improves VPR performance across a large range of state-of-the-art and classical VPR techniques (namely CosPlace, MixVPR, EigenPlaces, SALAD, AP-GeM, NetVLAD and SAD), and across three benchmark VPR datasets (Nordland, Oxford RobotCar, and SFU-Mountain). We also provide insights into a complementary approach that uses the predictor to replace discarded matches, and present ablation studies including an analysis of the interactions between our SMR predictor and the selected sequence length.

Abstract:
In learning from demonstrations (LfD) for trajectory planning, end-to-end deep learning (DL) methods offer fast inference and adaptability to complex inputs. However, they are prone to cumulative errors due to limited expert time-series data, which poses challenges in safety-critical applications. To address this, we introduce bounded discontinuities in trajectory planning, with the bound adaptively determined via binary search. Two generative networks, trained in opposite directions, produce primitive trajectories. These are connected using the discontinuity-allowed multi-point RRT-connect (DAMP-RRT-connect) algorithm, which expands the trajectory while maintaining discontinuities within the bound. A sequence of navigation points directs the expansion. Experiments on aircraft landing and takeoff tasks at a non-towered airport demonstrate the robustness and efficiency of our approach. [Code]
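
Determining a bound adaptively via binary search can be sketched as below. The abstract does not specify the feasibility criterion, so `is_feasible` is a hypothetical callback (for instance, "a connecting trajectory is found with discontinuities no larger than this bound"), and the sketch assumes feasibility is monotone in the bound.

```python
def smallest_feasible_bound(is_feasible, lo, hi, tol=1e-3):
    """Binary search for the smallest feasible bound in [lo, hi].

    Assumes monotonicity: if a bound is feasible, every larger bound is too.
    is_feasible is a hypothetical predicate supplied by the caller.
    """
    if not is_feasible(hi):
        raise ValueError("no feasible bound in the given interval")
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if is_feasible(mid):
            hi = mid  # mid works, so try a tighter bound
        else:
            lo = mid  # mid fails, so the bound must be looser
    return hi
```

The returned value is always feasible and lies within `tol` of the smallest feasible bound, provided the monotonicity assumption holds.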

Abstract:
In this work, we introduce and formalize the Zero-Knowledge Task Planning (ZKTP) problem, i.e., formulating a sequence of actions to achieve some goal without task-specific knowledge. Additionally, we present a first investigation and approach for ZKTP that leverages a large language model (LLM) to decompose natural language instructions into subtasks and generate behavior trees (BTs) for execution. If errors arise during task execution, the approach also uses an LLM to adjust the BTs on-the-fly in a refinement loop. Experimental validation in the AI2-THOR simulator demonstrates our approach’s effectiveness in improving overall task performance compared to alternative approaches that leverage task-specific knowledge. Our work demonstrates the potential of LLMs to effectively address several aspects of the ZKTP problem, providing a robust framework for automated behavior generation with no task-specific setup.

Abstract:
Achieving dexterous robotic grasping with multi-fingered hands remains a significant challenge. While existing methods rely on complete 3D scans to predict grasp poses, these approaches face limitations due to the difficulty of acquiring high-quality 3D data in real-world scenarios. In this paper, we introduce GRASPLAT, a novel grasping framework that leverages consistent 3D information while being trained solely on RGB images. Our key insight is that by synthesizing physically plausible images of a hand grasping an object, we can regress the corresponding hand joints for a successful grasp. To achieve this, we utilize 3D Gaussian Splatting to generate high-fidelity novel views of real hand-object interactions, enabling end-to-end training with RGB data. Unlike prior methods, our approach incorporates a photometric loss that refines grasp predictions by minimizing discrepancies between rendered and real images. We conduct extensive experiments on both synthetic and real-world grasping datasets, demonstrating that GRASPLAT improves grasp success rates by up to 36.9% over existing image-based methods. Project page: https://mbortolon97.github.io/grasplat/

Abstract:
Robot motion can have many goals. Depending on the task, we might optimize for pose error, speed, collision, or similarity to a human demonstration. Motivated by this, we present PyRoki: a modular, extensible, and device-agnostic toolkit for solving kinematic optimization problems. PyRoki couples an interface for specifying kinematic variables and costs with an efficient nonlinear least squares optimizer. Unlike existing tools, it is also device-agnostic: optimization runs natively on CPU, GPU, and TPU. In this paper, we present (i) the design and implementation of PyRoki, (ii) motion retargeting and planning case studies that highlight the advantages of PyRoki’s modularity, and (iii) optimization benchmarking, where PyRoki can be 1.4-1.7x faster and converges to lower errors than cuRobo, an existing GPU-accelerated inverse kinematics library. The code is open-sourced at https://pyroki-toolkit.github.io.

Abstract:
This contribution presents a robot path-following framework via Reactive Model Predictive Contouring Control (RMPCC) that successfully avoids obstacles, singularities and self-collisions in dynamic environments at 100 Hz. Many path-following methods rely on time parametrization, but struggle to handle collision and singularity avoidance while adhering to kinematic limits or other constraints. Specifically, the error between the desired path and the actual position can become large when executing evasive maneuvers. Thus, this paper derives a method that parametrizes the reference path by a path parameter and performs the optimization via RMPCC. In particular, Control Barrier Functions (CBFs) are introduced to avoid collisions and singularities in dynamic environments. A Jacobian-based linearization and Gauss-Newton Hessian approximation enable solving the nonlinear RMPCC problem at 100 Hz, outperforming state-of-the-art methods by a factor of 10. Experiments confirm that the framework handles dynamic obstacles in real-world settings with low contouring error and low robot acceleration.
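
How a Control Barrier Function constrains a control input can be shown on a toy one-dimensional system. The sketch below is not the RMPCC formulation from the paper; it assumes a single integrator x_dot = u, an obstacle ahead of the robot, and hypothetical parameter names `margin` and `alpha`.

```python
def cbf_filter_1d(x, u_nominal, x_obstacle, margin=0.1, alpha=2.0):
    """Safety filter for x_dot = u approaching an obstacle from below.

    Barrier: h(x) = x_obstacle - margin - x, safe when h >= 0.
    CBF condition: h_dot + alpha * h >= 0, i.e. -u + alpha * h >= 0,
    which reduces to the upper bound u <= alpha * h.
    """
    h = x_obstacle - margin - x
    u_max = alpha * h
    # Keep the nominal command whenever it already satisfies the condition.
    return min(u_nominal, u_max)
```

Far from the obstacle the nominal command passes through unchanged; close to it the admissible velocity shrinks in proportion to the remaining distance, so the state approaches the safety margin without crossing it.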

Abstract:
Autonomous vehicles that navigate in open-world environments may encounter previously unseen object classes. However, most existing LiDAR panoptic segmentation models rely on closed-set assumptions, failing to detect unknown object instances. In this work, we propose ULOPS, an uncertainty-guided open-set panoptic segmentation framework that leverages Dirichlet-based evidential learning to model predictive uncertainty. Our architecture incorporates separate decoders for semantic segmentation with uncertainty estimation, embedding with prototype association, and instance center prediction. During inference, we leverage uncertainty estimates to identify and segment unknown instances. To strengthen the model’s ability to differentiate between known and unknown objects, we introduce three uncertainty-driven loss functions: Uniform Evidence Loss encourages high uncertainty in unknown regions, Adaptive Uncertainty Separation Loss ensures a consistent difference in uncertainty estimates between known and unknown objects at a global scale, and Contrastive Uncertainty Loss refines this separation at the fine-grained level. To evaluate open-set performance, we extend benchmark settings on KITTI-360 and introduce a new open-set evaluation for nuScenes. Extensive experiments demonstrate that ULOPS consistently outperforms existing open-set LiDAR panoptic segmentation methods. We make the code and pre-trained models available at http://ulops.cs.uni-freiburg.de.
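
Dirichlet-based evidential learning is commonly formulated by treating non-negative network outputs as per-class evidence. The sketch below follows that common formulation and is an assumption about the general technique, not a reproduction of the ULOPS decoders or losses; the function name is hypothetical.

```python
def dirichlet_uncertainty(evidence):
    """Expected class probabilities and vacuity from per-class evidence.

    evidence: list of non-negative floats, one per known class.
    Uses alpha_k = evidence_k + 1 and uncertainty = K / sum(alpha),
    a common convention in evidential deep learning.
    """
    num_classes = len(evidence)
    alpha = [e + 1.0 for e in evidence]   # Dirichlet concentration parameters
    strength = sum(alpha)                 # total Dirichlet strength
    probs = [a / strength for a in alpha]
    uncertainty = num_classes / strength  # equals 1.0 with zero evidence
    return probs, uncertainty
```

A point with no evidence for any known class gets the maximum uncertainty of 1.0, which is the kind of signal that can be thresholded to flag unknown objects.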

Abstract:
Enabling the embodied agent to imagine future states step by step and sequentially approach these situation-aware states can enhance its capability to make reliable action decisions from textual instructions. In this work, we introduce a simple but effective mechanism called Chain-of-Imagination (CoI), which repeatedly employs a Multimodal Large Language Model (MLLM) equipped with a diffusion model to facilitate imagining and acting upon the series of intermediate situation-aware visual sub-goals one by one, resulting in more reliable instruction-following capability. Based on the CoI mechanism, we propose an embodied agent DecisionDreamer as the low-level controller that can be adapted to different open-world scenarios. Extensive experiments demonstrate that DecisionDreamer can achieve more reliable and accurate decision-making and significantly outperform the state-of-the-art generalist agents in the Minecraft and CALVIN sandbox simulators, regarding the instruction-following capability. For more demos, please see https://sites.google.com/view/decisiondreamer.

Abstract:
In this paper, we present a new approach for improving 3D point and line mapping regression for camera re-localization. Previous methods typically rely on feature matching (FM) with stored descriptors or use a single network to encode both points and lines. While FM-based methods perform well in large-scale environments, they become computationally expensive with a growing number of mapping points and lines. Conversely, approaches that learn to encode mapping features within a single network reduce memory footprint but are prone to overfitting, as they may capture unnecessary correlations between points and lines. We propose that these features should be learned independently, each with a distinct focus, to achieve optimal accuracy. To this end, we introduce a new architecture that learns to prioritize each feature independently before combining them for localization. Experimental results demonstrate that our approach significantly enhances the 3D map point and line regression performance for camera re-localization. The implementation of our method will be publicly available at: https://github.com/ais-lab/pl2map/.

Abstract:
Considering the potential of using multi-frame information to solve the occlusion problem, we introduce a novel idea of multi-frame information integration, which uses the attention mechanism to fuse the temporal information from the previous frame. The idea can effectively improve the estimation accuracy in occluded regions and optimize the inference speed under multi-frame settings. Meanwhile, we suggest the concept of attention confidence to provide an explicit value criterion for the model to utilize useful attention information more efficiently. Furthermore, we propose an Efficient Temporal Attention network (ETA), which achieves promising results on Sintel and KITTI benchmarks, especially with a 9.4% error reduction compared to the baseline method GMA on Sintel (test) Clean.

Abstract:
The core of Multi-View Stereo (MVS) is to find corresponding pixels in neighboring images. However, due to challenging regions in input images such as untextured areas, repetitive patterns, or reflective surfaces, existing methods struggle to find precise pixel correspondence therein, resulting in inferior reconstruction quality. In this paper, we present an efficient context-perception MVS network, termed ACP-MVS. The ACP-MVS constructs a context-aware cost volume that can enhance pixels containing essential context information while suppressing irrelevant or noisy information via our proposed Context-stimulated Weighting Fusion module. Furthermore, we introduce a new Context-Guided Global Aggregation module, based on the insight that similar-looking pixels tend to have similar depths, which exploits global contextual cues to implicitly guide depth detail propagation from high-confidence regions to low-confidence ones. These two modules work in synergy to substantially improve reconstruction quality of ACP-MVS without incurring significant additional computational and time cost. Extensive experiments demonstrate that our approach not only achieves state-of-the-art performance but also offers the fastest inference speed and minimal GPU memory usage, providing practical value for practitioners working with high-resolution MVS image sets. Notably, our method ranks 2nd on the challenging Tanks and Temples advanced benchmark among all published methods. Code is available at https://github.com/HaoJia-mongh/ACP-MVS.

Abstract:
3D reconstruction methods such as 3D Gaussian Splatting (3DGS) have achieved significant advancements in recent years. However, the study of articulated objects remains limited due to their geometric and dynamic complexity. We propose Splatter Joint, a novel method that models articulated objects, particularly focusing on joints, to capture both the appearance and the geometric information from a few images taken at a single viewpoint. By integrating joint parameters into the 3DGS rendering process in a differentiable manner, we enable the prediction of joint movements while enhancing the accuracy of object appearance reconstruction. We evaluated Splatter Joint on existing and newly created datasets, demonstrating its effectiveness in modeling object appearance and geometry simultaneously.

Abstract:
Long-horizon reasoning and task execution are crucial for complex mobile manipulation tasks in household environments. Existing benchmarks and methods primarily focus on single-room or single-object mobile manipulation scenarios, limiting the scope of long-horizon planning and scene-level understanding. To address this gap, we introduce a novel benchmark for long-horizon mobile manipulation in multi-room household environments. Our task requires agents to follow a sequence of language instructions, each directing the movement of specific objects across receptacles and rooms. In this task, we investigate the role of long-term memory by constructing a hierarchical scene graph that captures the relationships between objects, furniture, and rooms. This scene graph-based memory is dynamically updated as the agent explores the environment, which effectively aligns the scene information with the targets and environmental context specified in the language instructions. Additionally, we benchmark the proposed task in dynamic environments where objects can be relocated during task execution, simulating real-world scenarios. Our results demonstrate that the scene graph-based memory significantly improves the agent’s performance in long-horizon mobile manipulation tasks. Moreover, dynamically updating the state of objects within the scene graph enables the agent to better adapt to dynamic conditions.

Abstract:
Autonomous exploration of unknown environments is a critical task in robotic search and rescue operations. Recently, hierarchical planning frameworks have gained significant attention for their potential to enhance exploration efficiency. However, most existing approaches struggle with efficient exploration due to two key limitations: (1) neglecting subregion environmental information and (2) inconsistency between local and global paths. To overcome these challenges, we propose an Information Entropy-assisted Hierarchical Planning (IEHP) framework for efficient autonomous exploration. Specifically, we introduce an efficient subregion arrangement method that considers total travel distance, path similarity, and information entropy. Additionally, we propose a globally consistent frontier selection method to minimize redundant local paths, improving alignment between local and global planning. We validate the feasibility and efficiency of our approach through a series of complex simulation scenarios, with experimental results demonstrating the superiority of the proposed method.
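
One common way to quantify the information available in a subregion is the Shannon entropy of its occupancy-grid cells. The sketch below assumes that representation; the abstract does not state how IEHP computes information entropy, so the grid model and function names are illustrative assumptions.

```python
import math

def cell_entropy(p):
    """Binary Shannon entropy (in bits) of one cell with occupancy probability p."""
    if p <= 0.0 or p >= 1.0:
        return 0.0  # fully known cells carry no remaining uncertainty
    return -(p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p))

def region_entropy(occupancy_probs):
    """Total entropy of a subregion given its cells' occupancy probabilities."""
    return sum(cell_entropy(p) for p in occupancy_probs)
```

Unknown cells (probability 0.5) contribute one bit each, so a subregion with many unexplored cells scores higher and could be ranked earlier when arranging subregions.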

Abstract:
Panoramic RGB-D cameras enable high-quality 3D scene reconstruction but require manual viewpoint selection and physical camera transportation, making the process time-consuming and tedious—especially for novice users. Key challenges include ensuring sufficient feature overlap between camera views and planning collision-free paths. We propose a fully autonomous scan planner that generates efficient and collision-free tours with adequate viewpoint overlap to address these issues. Experiments in both synthetic and real-world environments show that our method achieves up to 99% scan coverage and is up to three times faster than state-of-the-art view planning approaches.

Abstract:
Language plays a crucial role in robotic manipulation, particularly in facilitating complex tasks. Previous work primarily focused on two-finger manipulation. However, leveraging language to guide reinforcement learning for dexterous hands remains a challenge due to their high degrees of freedom. In this work, we introduce a language-guided dexterous multi-task manipulation framework (LDexMM), which decomposes the problem into two distinct phases. First, we use language instructions to guide a segmentation model in generating a dexterous grasp pose for the functional part of the object. After establishing this initial grasp, reinforcement learning is employed to refine the grasp pose and complete the task. Simultaneously, language constraints are applied to focus the actions on the specified object. Our experiments demonstrate success rates of 31%, 40%, 49.2%, and 72.7% on 10, 7, 5, and 3 tasks, respectively, with a single model.

Abstract:
In our daily life, we often encounter objects that are fragile and can be damaged by excessive grasping force, such as fruits. For these objects, it is paramount to grasp gently—not using the maximum amount of force possible, but rather the minimum amount of force necessary. This paper proposes using visual, tactile, and auditory signals to learn to grasp and regrasp objects stably and gently. Specifically, we use audio signals as an indicator of gentleness during the grasping, and then train an end-to-end action-conditional model from raw visuo-tactile inputs that predicts both the stability and the gentleness of future grasping candidates, thus allowing the selection and execution of the most promising action. Experimental results on a multi-fingered hand over 1,500 grasping trials demonstrated that our model is useful for gentle grasping by validating the predictive performance (3.27% higher accuracy than the vision-only variant) and providing interpretations of their behavior. Finally, real-world experiments confirmed that the grasping performance with the trained multi-modal model outperformed other baselines (17% higher rate for stable and gentle grasps than vision-only). Our approach requires neither tactile sensor calibration nor analytical force modeling, drastically reducing the engineering effort to grasp fragile objects. Dataset and videos are available at https://lasr.org/research/gentle-grasping.

Abstract:
Scene reconstruction is an essential capability for underwater robots navigating in close proximity to structures. Monocular vision-based reconstruction methods are unreliable in turbid waters and lack depth scale information. Sonars are robust to turbid water and non-uniform lighting conditions; however, they have low resolution and elevation ambiguity. This work proposes a real-time opti-acoustic scene reconstruction method that is specially optimized to work in turbid water. Our strategy avoids having to identify point features in visual data and instead identifies regions of interest in the data. We then match relevant regions in the image to corresponding sonar data. A reconstruction is obtained by leveraging range data from the sonar and elevation data from the camera image. Experimental comparisons against other vision-based and sonar-based approaches at varying turbidity levels, and field tests conducted in marina environments, validate the effectiveness of the proposed approach. We have made our code open-source to facilitate reproducibility and encourage community engagement.

Abstract:
Traditional unmanned aerial vehicle (UAV) swarm missions rely heavily on expensive custom-made drones with onboard perception or external positioning systems, limiting their widespread adoption in research and education. To address this issue, we propose AirSwarm. AirSwarm democratizes multi-drone coordination using low-cost commercially available drones such as Tello or Anafi, enabling affordable swarm aerial robotics research and education. Key innovations include a hierarchical control architecture for reliable multi-UAV coordination, an infrastructure-free visual SLAM system for precise localization without external motion capture, and a ROS-based software framework for simplified swarm development. Experiments demonstrate cm-level tracking accuracy, low-latency control, communication failure resistance, formation flight, and trajectory tracking. By reducing financial and technical barriers, AirSwarm makes multi-robot education and research more accessible. The complete instructions and open source code will be available at https://github.com/vvEverett/tello_ros.

Abstract:
Low-cost millimeter-wave automotive radar has received increasing attention due to its ability to handle adverse weather and lighting conditions in autonomous driving. However, the lack of quality datasets hinders research and development. We report a new method that is able to simulate 4D millimeter-wave radar signals including pitch, yaw, range, and Doppler velocity along with radar signal strength (RSS) using camera images, light detection and ranging (lidar) point clouds, and ego-velocity. The method is based on two new neural networks: 1) DIS-Net, which estimates the spatial distribution and number of radar signals, and 2) RSS-Net, which predicts the RSS of the signal based on appearance and geometric information. We have implemented and tested our method using open datasets from 3 different models of commercial automotive radar. The experimental results show that our method can successfully generate high-fidelity radar signals. Moreover, we have trained a popular object detection neural network with data augmented by our synthesized radar. The network outperforms the counterpart trained only on raw radar data, a promising result to facilitate future radar-based research and development.

Abstract:
Hand-eye calibration aims to estimate the transformation between a camera and a robot. Traditional methods rely on fiducial markers, which require considerable manual effort and precise setup. Recent advances in deep learning have introduced markerless techniques but come with more prerequisites, such as retraining networks for each robot and accessing accurate mesh models for data generation. In this paper, we propose Kalib, an automatic and easy-to-setup hand-eye calibration method that leverages the generalizability of visual foundation models to overcome these challenges. It has only two basic prerequisites: the robot’s kinematic chain and a predefined reference point on the robot. During calibration, the reference point is tracked in the camera space. Its corresponding 3D coordinates in the robot coordinate frame can be inferred by forward kinematics. Then, a PnP solver directly estimates the transformation between the camera and the robot without training new networks or accessing mesh models. Evaluations in simulated and real-world benchmarks show that Kalib achieves good accuracy with a lower manual workload compared with recent baseline methods. We also demonstrate its application in multiple real-world settings with various robot arms and grippers. Kalib’s user-friendly design and minimal setup requirements make it a possible solution for continuous operation in unstructured environments. The code, data, and supplementary materials are available at https://sites.google.com/view/hand-eye-kalib.

Abstract:
Retinex theory, which treats an image as a composition of illuminance and reflectance, has made significant progress in low-light image enhancement. Previous methods attempt to refine the impractical Retinex theory by introducing deviations in estimated illumination and reflectance to develop more practical and robust enhancement techniques. However, the fact that state-of-the-art approaches still produce inferior results suggests that some form of corruption may be overlooked. In this paper, we propose a novel Complete Corruption-Aware Retinex Framework (CCRF), which not only considers corruption in low-light imaging (such as high ISO or long exposure settings) but, more importantly, also accounts for corruption induced by the enhancement method itself. Guided by this framework, we propose a Robust Corruption-Aware Loss (RCL) that enables the model to be robust under extreme darkness and complex light-object interactions. Additionally, we propose a Light-Up Map Denoising (LMD) module, which further eliminates model-induced perturbations. With these two plug-and-play modules, downstream tasks (e.g., low-light object detection) can benefit significantly. Extensive experiments demonstrate that our methods can be seamlessly integrated into state-of-the-art approaches, resulting in significant performance improvements over these methods. Code will be available at github.com/eafi/ccrf.

Abstract:
As delivery robots are increasingly integrated into our daily lives, their ability to navigate through crowded spaces demands swift and accurate prediction of pedestrian trajectories, which is crucial for autonomous functionality. However, existing methods face challenges of unstable accuracy and inefficiency in real-world deployment. Trajectory prediction involves both temporal and social dimensions. Recent methods have achieved better results by modeling temporal and social dimensions simultaneously, which avoids the information loss of modeling them separately but significantly increases computational costs, posing challenges for practical deployment. In this paper, we conceptualize the trajectory prediction task as a deterministic sequence-to-sequence model that produces one precise forecast, aligning with real-world needs while reducing complexity. To improve efficiency and reduce latency for real-time applications, we propose a novel dynamic-stream transformer architecture that categorizes layers into multi-stream and single-stream based on the number of dimensions involved in computation. The single-stream modules attend to all dimensions simultaneously, providing comprehensive information fusion but with higher computational complexity. The multi-stream modules focus on only one dimension, enabling parallel and batched computation, which is crucial for improving the model’s real-time performance. By combining them strategically, we achieve a balance between accuracy and speed. Extensive experiments on real datasets show that our dynamic-stream transformer architecture significantly reduces computational complexity, achieving a speed increase of 180% to 3180% compared to similar approaches, while also attaining performance close to the state-of-the-art (SOTA) for deterministic trajectory prediction.

Abstract:
The regulation of muscle function is very important for tissue engineering and sports science. This paper presents a simple microfluidic chip platform and its control method to investigate the regulation of muscle function. C2C12 cells, used as the model system for skeletal muscle research, were inoculated onto the microfluidic chips and induced to differentiate into fully functional muscle tubes. Programmable actuation control enables localized strain gradients within the microfluidic platform, achieving differential mechanical regimes for functional modulation of integrated muscle constructs. The system implements mechanical conditioning to recapitulate exercise-induced myocyte damage and subsequent regenerative processes through controlled deformation protocols. Our radial-strain actuators generate 19.4% maximum principal strain, while axial-strain configurations achieve 8.3% baseline deformation. Dynamic input modulation enables precise strain reduction to 7.4% and 2.2%, respectively, establishing differential mechanical regimes for simulating exercise-associated functional impairment (high-strain phase) and recovery processes (low-strain phase). This strain-programmable platform establishes a robust framework for investigating mechanobiological thresholds in functional muscle regeneration.

Abstract:
When observing objects, humans benefit from their spatial visualization and mental rotation ability to envision potential optimal viewpoints based on the current observation. This capability is crucial for enabling robots to achieve efficient and robust scene perception during operation, as optimal viewpoints provide essential and informative features for accurately representing scenes in 2D images, thereby enhancing downstream tasks. To endow robots with this human-like active viewpoint optimization capability, we propose ViewActive, a modernized machine learning approach drawing inspiration from the aspect graph, which provides viewpoint optimization guidance based solely on the current 2D image input. Specifically, we introduce the 3D Viewpoint Quality Field (VQF), a compact and consistent representation for viewpoint quality distribution similar to an aspect graph, composed of three general-purpose viewpoint quality metrics: self-occlusion ratio, occupancy-aware surface normal entropy, and visual entropy. We utilize pre-trained image encoders to extract robust visual and semantic features, which are then decoded into the 3D VQF, allowing our model to generalize effectively across diverse objects, including unseen categories. The lightweight ViewActive network (72 FPS on a single GPU) significantly enhances the performance of state-of-the-art object recognition pipelines and can be integrated into real-time motion planning for robotic applications. Our code and dataset are available at https://github.com/jiayi-wu-umd/ViewActive.

Abstract:
Recent work has shown that exoskeletons controlled through data-driven methods can dynamically adapt assistance to various tasks for healthy young adults. However, applying these methods to populations with neuromotor gait deficits, such as post-stroke hemiparesis, is challenging. This is due not only to high population heterogeneity and gait variability but also to a lack of post-stroke gait datasets to train accurate models. Despite these challenges, data-driven methods offer a promising avenue for control, potentially allowing exoskeletons to function safely and effectively in unstructured community settings. This work presents a first step towards enabling adaptive plantarflexion and dorsiflexion assistance from data-driven torque estimation during post-stroke walking. We trained a multi-task Temporal Convolutional Network (TCN) using collected data from four post-stroke participants walking on a treadmill (R² of 0.74 ± 0.13). The model uses data from three inertial measurement units (IMUs) and was pretrained on healthy walking data from 6 participants. We implemented a wearable prototype for our ankle torque estimation approach for exoskeleton control and demonstrated the viability of real-time sensing, estimation, and actuation with one post-stroke participant.

Abstract:
Focus of attention is one of the most influential factors facilitating motor training performance. Most robotic training methods have not adequately addressed the negative effect of divided attention on motor execution performance, resulting in limited rehabilitation efficiency for motor-cognitive dysfunction. In this study, we propose a novel visuomotor human-robot interaction framework that integrates a gaze-visual game and a force-movement robot to realize more efficient training for both attentional and motor function. An important novelty of this framework is a dynamical pattern recognition scheme for the hierarchically coupled behavior of attention and motor execution, which facilitates efficient human-robot interaction from both cognitive and motor perspectives. Specifically, an attentional-motor dynamical system modeling method is first developed using the gaze, force, and movement data collected from the human under different attentional-motor behaviors. Then, an online dynamical pattern recognition scheme is designed with these models to recognize the human’s attentional and motor behavior states online. The training robot system can dynamically adjust its parameters according to the recognition results to guide the collaboration of attentional and motor training. Experimental studies are conducted to demonstrate the accuracy and efficiency of the proposed approaches in attentional-motor behavior recognition and training.

Abstract:
By accurately recognizing the wearer’s motion, the underwater exoskeleton enables more efficient human-machine collaboration and provides enhanced assistance in complex and dynamic underwater environments. In this study, we propose a soft underwater exosuit motion mode recognizer based on a long short-term memory network and convolutional neural networks, referred to as LSTM-CNN. This model is designed to perform two tasks: motion mode classification and state transition label recognition. First, the LSTM network extracts features from the time-series data, followed by further feature extraction and classification using the convolutional and fully connected networks. The recognition of motion modes relies on three IMU sensors placed on the left and right legs and the back of the torso of the soft underwater exosuit. On the dataset containing four classes, including non-assist, breaststroke, flutter kick, and underwater walking, LSTM-CNN achieved an overall accuracy of 99.943±0.006% in motion mode classification and 92.101±0.054% in state transition label recognition. The experimental results indicate that the LSTM-CNN achieves better accuracy and performs optimally across various evaluation metrics compared to the other methods.

Abstract:
Safe and effective motion planning is crucial for autonomous robots. Diffusion models excel at capturing complex agent interactions, a fundamental aspect of decision-making in dynamic environments. Recent studies have successfully applied diffusion models to motion planning, demonstrating their competence in handling complex scenarios and accurately predicting multi-modal future trajectories. Despite their effectiveness, diffusion models have limitations in training objectives, as they approximate data distributions rather than explicitly capturing the underlying decision-making dynamics. However, the crux of motion planning lies in non-differentiable downstream objectives, such as safety (collision avoidance) and effectiveness (goal-reaching), which conventional learning algorithms cannot directly optimize. In this paper, we propose a reinforcement learning-based training scheme for diffusion motion planning models, enabling them to effectively learn non-differentiable objectives that explicitly measure safety and effectiveness. Specifically, we introduce a reward-weighted dynamic thresholding algorithm to shape a dense reward signal, facilitating more effective training and outperforming models trained with differentiable objectives. State-of-the-art performance on pedestrian datasets (CrowdNav, ETH-UCY) compared to various baselines demonstrates the versatility of our approach for safe and effective motion planning.

Abstract:
Pre-trained large language models (LLMs) have demonstrated strong common-sense reasoning abilities, making them promising for robotic navigation and planning tasks. However, despite recent progress, bridging the gap between language descriptions and actual robot actions in the open world, beyond merely invoking limited predefined motion primitives, remains an open challenge. In this work, we aim to enable robots to interpret and decompose complex language instructions, ultimately synthesizing a sequence of trajectory points to complete diverse navigation tasks given open-set instructions and open-set objects. We observe that multi-modal large language models (MLLMs) exhibit strong cross-modal understanding when processing free-form language instructions, demonstrating robust scene comprehension. More importantly, leveraging their code-generation capability, MLLMs can interact with vision-language perception models to generate compositional 2D bird’s-eye-view value maps, effectively integrating semantic knowledge from MLLMs with spatial information from maps to reinforce the robot’s spatial understanding. We leverage large-scale autonomous vehicle datasets (AVDs) to validate our proposed zero-shot vision-language navigation framework in outdoor navigation tasks, demonstrating its capability to execute a diverse range of free-form natural language navigation instructions while maintaining robustness against object detection errors and linguistic ambiguities. Furthermore, we validate our system on a Husky robot in both indoor and outdoor scenes, demonstrating its real-world robustness and applicability. Supplementary videos are available at https://trailab.github.io/OpenNav-website/

Affiliations: Department of Mechanical Engineering and the Laboratory of Computational Sensing and Robotics, Johns Hopkins University, Baltimore, MD, USA; Malone Center for Engineering in Healthcare, Whiting School of Engineering, Johns Hopkins University, Baltimore, MD, USA; Ross and Carol Nese College of Nursing, Pennsylvania State University, PA, USA; Department of Diagnostic Radiology, R. Cowley Shock Trauma Center, School of Medicine, University of Maryland, Baltimore, MD, USA

Abstract:
Femoral artery access is essential for numerous clinical procedures, including diagnostic angiography, therapeutic catheterization, and emergency interventions. Despite its critical role, successful vascular access remains challenging due to anatomical variability, overlying adipose tissue, and the need for precise ultrasound (US) guidance. Needle placement errors can result in severe complications, thereby limiting the procedure to highly skilled clinicians operating in controlled hospital environments. While robotic systems have shown promise in addressing these challenges through autonomous scanning and vessel reconstruction, clinical translation remains limited due to reliance on simplified phantom models that fail to capture human anatomical complexity. In this work, we present a method for autonomous robotic US scanning of bifurcated femoral arteries, and validate it on five vascular phantoms created from real patient computed tomography (CT) data. Additionally, we introduce a video-based deep learning US segmentation network tailored for vascular imaging, enabling improved 3D arterial reconstruction. The proposed network achieves a Dice score of 89.21% and an Intersection over Union of 80.54% on a new vascular dataset. The reconstructed artery centerline is evaluated against ground truth CT data, showing an average L2 error of 0.91±0.70 mm, with an average Hausdorff distance of 4.36±1.11 mm. This study is the first to validate an autonomous robotic system for US scanning of the femoral artery on a diverse set of patient-specific phantoms, introducing a more advanced framework for evaluating robotic performance in vascular imaging and intervention.
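
For readers unfamiliar with the two segmentation metrics reported above, the following minimal sketch computes the Dice score and Intersection over Union for a pair of binary masks using their standard definitions. The function name, the flat 0/1 mask representation, and the empty-mask convention are illustrative assumptions, not the paper's evaluation code.

```python
def dice_and_iou(pred, truth):
    """Return (Dice, IoU) for two equal-length flat sequences of 0/1 labels.

    Standard definitions:
      Dice = 2|A ∩ B| / (|A| + |B|),  IoU = |A ∩ B| / |A ∪ B|
    """
    if len(pred) != len(truth):
        raise ValueError("masks must have the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, truth) if p and t)
    pred_area = sum(1 for p in pred if p)
    truth_area = sum(1 for t in truth if t)
    union = pred_area + truth_area - intersection
    if union == 0:
        # Both masks empty: treated here as perfect agreement (a convention).
        return 1.0, 1.0
    dice = 2.0 * intersection / (pred_area + truth_area)
    iou = intersection / union
    return dice, iou


dice, iou = dice_and_iou([1, 1, 1, 0], [0, 1, 1, 1])
```

For a single pair of masks the two metrics are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Scores averaged over a dataset need not satisfy this identity exactly.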

Abstract:
Exploring high hydrostatic pressure environments such as the deep sea presents significant challenges to robotic devices, which often rely on strong yet heavy and costly protective structures to shield components from being crushed by the extreme pressure. To eliminate the need for bulky protective shells for actuation devices, we report an extreme-hydrostatic-pressure-resilient rotary dielectric elastomer actuator (DEA) for propulsion under deep-sea pressure conditions. DEAs are inherently resistant to damage caused by external pressure, due to their uniform and cavity-free structure. In this study, we analyzed the material properties of the DEA’s elastomer, evaluated the rotary actuator’s lifespan in high-pressure liquid at up to 110 MPa, and measured its output performance under both ambient pressure and 30 MPa (equivalent to 3,000 m underwater). Our results show that the rotary actuator maintained functionality at such hydrostatic pressure, with a lifespan exceeding 300,000 cycles and a high rotational output speed of 820 rpm. The rotary actuator was subsequently used to drive a robot with a propeller in a simulated deep-sea pressure fluidic environment, demonstrating our DEA’s performance as well as its design simplicity for deep-sea applications without protective structures. While high hydrostatic pressure negatively impacted the actuator’s lifespan and slightly reduced its dynamic performance, our results confirmed that the DEA is a viable solution for deep-sea exploration, laying a solid foundation for the further development of DEA-powered devices for underwater missions.

Abstract:
We present a novel Lie algebra based Iterative Reweighted Least Squares (IRLS) algorithm for robust 3D point cloud alignment. We reformulate the optimal update computation to a compact form which requires only one pass through the data. Although this reformulation does not alter the asymptotic computational complexity, it is well suited for contemporary hardware architectures, yielding significant practical speedups. In extensive experiments on challenging benchmark datasets with added correspondence corruption, the method is consistently at least four times faster than previous literature whilst being mathematically equivalent, demonstrating it is well suited for time-critical applications.
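
To make the IRLS idea concrete, here is a deliberately simplified sketch: it estimates only a translation between corresponding points, using Huber weights. It is not the Lie-algebra formulation or the single-pass update described in the abstract; the function name, the Huber threshold `delta`, and the translation-only model are all illustrative assumptions.

```python
import math


def irls_translation(src, dst, iters=20, delta=1.0):
    """Generic IRLS with Huber weights for a translation-only alignment.

    Alternates between (1) weighting each correspondence by its current
    residual and (2) solving the weighted least-squares problem, whose
    closed-form solution is the weighted mean of (dst_i - src_i).
    """
    dim = len(src[0])
    diffs = [[q[k] - p[k] for k in range(dim)] for p, q in zip(src, dst)]
    t = [0.0] * dim
    for _ in range(iters):
        weights = []
        for d in diffs:
            residual = math.sqrt(sum((d[k] - t[k]) ** 2 for k in range(dim)))
            # Huber weight: 1 inside the threshold, delta / r outside it.
            weights.append(1.0 if residual <= delta else delta / residual)
        total = sum(weights)
        t = [sum(w * d[k] for w, d in zip(weights, diffs)) / total
             for k in range(dim)]
    return t


# Nine clean correspondences shifted by (1, 2, 3) plus one corrupted match.
src = [(float(i), 0.0, 0.0) for i in range(10)]
dst = [(p[0] + 1.0, p[1] + 2.0, p[2] + 3.0) for p in src]
dst[0] = (src[0][0] + 101.0, 2.0, 3.0)
estimate = irls_translation(src, dst)
```

In this toy data set an unweighted mean would put the x-translation at 11.0 because of the single corrupted match, whereas the reweighted estimate stays close to the true value of 1.0.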

Abstract:
Robot task planning from high-level instructions is an important step towards deploying fully autonomous robot systems in the service sector. Three key aspects of robot task planning present challenges yet to be resolved simultaneously, namely, (i) factorization of complex task specifications into simpler executable subtasks, (ii) understanding of the current task state from raw observations, and (iii) planning and verification of task executions. To address these challenges, we propose LATMOS, an automata-theory-inspired task model that, given observations from correct task executions, is able to factorize the task, while supporting verification and planning operations. LATMOS combines an observation encoder to extract features from potentially high-dimensional observations with a sequence model that encapsulates an automaton with symbols in the latent feature space. We conduct evaluations in three task model learning setups: (i) abstract tasks described by logical formulas, (ii) real-world human tasks described by videos and natural language prompts, and (iii) a robot task described by image and state observations. The results show improved plan generation and verification capabilities of LATMOS across different observation modalities and tasks.

Abstract:
Soft robotic grippers gently and safely manipulate delicate objects due to their inherent adaptability and softness. Limited by insufficient stiffness and imprecise force control, conventional soft grippers are not suitable for applications that require stable grasping force. In this work, we propose a soft gripper that utilizes an origami-inspired structure to achieve tunable constant force output over a wide strain range. The geometry of each taper panel is established to provide necessary parameters such as protrusion distance, taper angle, and crease thickness required for 3D modeling and finite element analysis (FEA). Simulations and experiments show that by optimizing these parameters, our design can achieve a tunable constant force output. Moreover, the origami-inspired soft gripper dynamically adapts to different shapes while preventing excessive forces, with potential applications in logistics, manufacturing, and other industrial settings that require stable and adaptive operations.

Abstract:
This paper proposes a new control algorithm for human-robot co-transportation using a robot manipulator equipped with a mobile base and a robotic arm. We integrate the regular Model Predictive Control (MPC) with a novel pose optimization mechanism to more efficiently mitigate disturbances (such as human behavioral uncertainties or robot actuation noise) during the task. The core of our methodology involves a two-step iterative design: At each planning horizon, we determine the optimal pose of the robotic arm (joint angle configuration) from a candidate set, aiming to achieve the lowest estimated control cost. This selection is based on solving a disturbance-aware Discrete Algebraic Riccati Equation (DARE), which also determines the optimal inputs for the robot’s whole body control (including both the mobile base and the robotic arm). To validate the effectiveness of the proposed approach, we provide theoretical derivation for the disturbance-aware DARE and perform simulated experiments and hardware demos using a Fetch robot under varying conditions, including different trajectories and different levels of disturbances. The results reveal that our proposed approach outperforms baseline algorithms.
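
The select-then-control loop described above can be sketched on a toy problem. The example below solves the standard (not disturbance-aware) scalar Discrete Algebraic Riccati Equation by fixed-point iteration and picks the candidate with the lowest estimated cost x0²·P. The scalar dynamics, candidate names, and cost weights are hypothetical and only illustrate the structure, not the paper's formulation.

```python
def solve_scalar_dare(a, b, q, r, iters=200):
    """Fixed-point iteration for the standard scalar DARE
        P = q + a^2 P - (a b P)^2 / (r + b^2 P)
    for x[k+1] = a x[k] + b u[k] with stage cost q x^2 + r u^2.
    Returns (P, K), where u = -K x is the optimal feedback.
    """
    p = q
    for _ in range(iters):
        p = q + a * a * p - (a * b * p) ** 2 / (r + b * b * p)
    k = (a * b * p) / (r + b * b * p)
    return p, k


def pick_best_candidate(candidates, q, r, x0):
    """Choose the candidate (a, b) pair with the lowest cost x0^2 * P."""
    costs = {name: solve_scalar_dare(a, b, q, r)[0] * x0 * x0
             for name, (a, b) in candidates.items()}
    best = min(costs, key=costs.get)
    return best, costs


# Hypothetical candidates: each arm pose yields a different input gain b.
candidates = {"pose_a": (1.0, 1.0), "pose_b": (1.0, 0.5)}
best, costs = pick_best_candidate(candidates, q=1.0, r=1.0, x0=1.0)
```

With these numbers the candidate with the larger input gain is selected, since the same state error can be corrected with less control effort. A real system would replace the scalar model with matrix dynamics and a matrix Riccati solver.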

Abstract:
In the evolving landscape of high-speed agile quadrotor flight, achieving precise trajectory tracking at the platform’s operational limits is paramount. Controllers must handle actuator constraints, exhibit robustness to disturbances, and remain computationally efficient for safety-critical applications. In this work, we present a novel neural-augmented feedback controller for agile flight control. The controller addresses individual limitations of existing state-of-the-art control paradigms and unifies their strengths. We demonstrate the controller’s capabilities, including the accurate tracking of highly aggressive trajectories that surpass the feasibility of the actuators. Notably, the controller provides universal stability guarantees, enhancing its robustness and tracking performance even in exceedingly disturbance-prone settings. Its nonlinear feedback structure is highly efficient, enabling fast computation at high update rates. Moreover, the learning process in simulation is both fast and stable, and the controller’s inherent robustness allows direct deployment to real-world platforms without the need for training augmentations or fine-tuning.

Abstract:
Multimodal data is indispensable for advancing imitation learning, particularly in the context of dexterous hands. However, existing datasets predominantly rely on single-modality inputs, such as RGB images, which inherently lack the capacity to capture the spatial and temporal dynamics essential for achieving human-like dexterity. To address this limitation, we introduce Multi-Modal Dex, a dataset that integrates multimodal sensory data to enable the effective learning of dexterous skills from human demonstrations. By combining visual, point cloud, and kinematic modalities, our dataset provides a richer representation of hand interactions, thereby facilitating a more nuanced understanding of dexterous imitation. Our framework leverages neural rendering and kinematic optimization to align human and robotic hand poses in a shared canonical space, enabling geometrically consistent skill transfer. Furthermore, we analyze the dataset’s potential to advance dexterous robots in perception, imitation learning, and real-world dexterous skill transfer. The data is available at https://github.com/WangShaoSUN/MutliDex.

Abstract:
The development of vision-based tactile sensors has significantly enhanced robots’ perception and manipulation capabilities, especially for tasks requiring contact-rich interactions with objects. In this work, we present DTactive, a novel vision-based tactile sensor with active surfaces. DTactive inherits and modifies the tactile 3D shape reconstruction method of DTact while integrating a mechanical transmission mechanism that facilitates the mobility of its surface. Thanks to this design, the sensor is capable of simultaneously performing tactile perception and in-hand manipulation with surface movement. Leveraging the high-resolution tactile images from the sensor and the magnetic encoder data from the transmission mechanism, we propose a learning-based method to enable precise angular trajectory control during in-hand manipulation. In our experiments, we successfully achieved accurate rolling manipulation within the range of [−180°, 180°] on various objects, with the root mean square error between the desired and actual angular trajectories being less than 12° on nine trained objects and less than 19° on three novel objects. The results demonstrate the potential of DTactive for in-hand object manipulation in terms of effectiveness, robustness and precision.

Abstract:
Human-robot interaction plays a critical role in scientific experiments by ensuring efficient and reliable execution of experimental tasks. To achieve successful task completion, robots must adapt in real time to unexpected task variations, external disturbances, and safety constraints. In this work, we propose a reactive task and motion planning framework designed to address these challenges. By formulating interaction tasks using Linear Temporal Logic (LTL), our approach introduces the Planning Decision Tree and the Augmented Planning Decision Tree to dynamically adjust task sequences in response to environmental changes. At the execution layer, we employ a Model Predictive Path Integral controller, which ensures both efficient and safe control. Additionally, the planning interface effectively coordinates the planning and execution layers, ensuring strict adherence to experimental task specifications. The effectiveness of the proposed reactive planning framework is demonstrated through physical experiments using a 7-DoF robot. Project website: https://sites.google.com/view/rtlp-iros/

Abstract:
Robotic and autonomous driving platforms necessitate efficient 3D Multi-Object Tracking (MOT) that harmonizes geometric precision, motion robustness, and computational efficiency. Traditional 3D MOT approaches face critical challenges: geometric similarity metrics (e.g., IoU-based) degrade at long ranges with high computational costs, while distance-based methods fail to capture object orientation and shape; the effects of occlusion and the intricate relative ego-object motion degrade tracking performance in dynamic scenes. To this end, we propose PB-MOT, an online framework integrating two key innovations: ego-motion-compensated state estimation that decouples dynamic interactions; and a rotated ellipse association algorithm unifying pose and shape-aware matching with adaptive distance constraints. Evaluations on the KITTI benchmark show that our PB-MOT achieves state-of-the-art performance with a HOTA score of 81.94%, while running at an impressive 2,402.76 FPS on CPU. This enables real-time, high-fidelity perception and tracking for resource-constrained robotic systems.

Abstract:
Traditional robot control relies on analytical methods that require precise system models, which are hard to apply in real-world settings and limit generalization to arbitrary tasks. However, systems like serial manipulators and passively adaptive hands feature inherently stable regions without control discontinuities like loss of contact or singularities. In these regions, approximate controllers focusing on the correct direction of motion enable successful coarse manipulation. When coupled with a rough estimation of the motion magnitude, precision manipulation is achieved. Leveraging this insight, we introduce a novel inverse Jacobian estimation method that independently estimates the primary motion direction and magnitude of the manipulator’s actuators. Our method efficiently estimates the direct mapping from task to actuator space with no need for a priori system knowledge, enabling the same framework to control both hands and arms without compromising task performance. We present a novel control method with no a priori knowledge for precision manipulation. Experiments on the Yale Model O hand, Yale Stewart Hand, and a UR5e arm demonstrate that the inverse Jacobians estimated via our approach enable real-time control with submillimeter precision in manipulation tasks. These results highlight that online self-ID data alone is sufficient for precise real-world manipulation.

Abstract:
Traditional vision-based autonomous driving systems often face difficulties in navigating complex environments when relying solely on single-image inputs. To overcome this limitation, incorporating temporal data, such as past image frames or steering sequences, has proven effective in enhancing robustness and adaptability in challenging scenarios. While previous high-performance methods exist, they often rely on resource-intensive fusion networks, making them impractical for training and unsuitable for federated learning. To address these challenges, we propose lightweight temporal transformer decomposition, a method that processes sequential image frames and temporal steering data by breaking down large attention maps into smaller matrices. This approach reduces model complexity, enabling efficient weight updates for convergence and real-time predictions while leveraging temporal information to enhance autonomous driving performance. Intensive experiments on three datasets demonstrate that our method outperforms recent approaches by a clear margin while achieving real-time performance. Additionally, real robot experiments further confirm the effectiveness of our method. Our source code can be found at: https://github.com/aioz-ai/LTFed.

Abstract:
A reliable communication network is essential for multiple UAVs operating within obstacle-cluttered environments, where limited communication due to obstructions often occurs. A common solution is to deploy intermediate UAVs to relay information via a multi-hop network, which introduces two challenges: (i) how to design the structure of multi-hop networks; and (ii) how to maintain connectivity during collaborative motion. To this end, this work first proposes an efficient constrained search method based on the minimum-edge RRT⋆ algorithm to find a spanning-tree topology that requires fewer UAVs for the deployment task. Then, to achieve this deployment, a distributed model predictive control strategy is proposed for the online motion coordination. It explicitly incorporates not only the inter-UAV and UAV-obstacle distance constraints, but also the line-of-sight (LOS) connectivity constraint. These constraints are well-known to be nonlinear and often tackled by various approximations. In contrast, this work provides a theoretical guarantee that all agent trajectories are collision-free with team-wise LOS connectivity at all times. Numerous simulations are performed in 3D valley-like environments, while hardware experiments validate its dynamic adaptation when the deployment position changes online.
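As a rough illustration of what a line-of-sight (LOS) connectivity constraint checks, the sketch below tests whether the straight segment between two UAV positions is free of obstacles and within communication range. It is not the paper's formulation: the spherical obstacle model, the `max_range` parameter, and the function names `point_segment_distance` and `has_line_of_sight` are all assumptions made for this example.

```python
import math
from typing import Iterable, Sequence, Tuple

Point = Sequence[float]  # (x, y, z)


def point_segment_distance(p: Point, a: Point, b: Point) -> float:
    """Shortest distance from point p to the segment a-b in 3D."""
    ab = [b[i] - a[i] for i in range(3)]
    ap = [p[i] - a[i] for i in range(3)]
    denom = sum(c * c for c in ab)
    # Projection parameter of p onto the segment, clamped to [0, 1].
    t = 0.0 if denom == 0.0 else max(
        0.0, min(1.0, sum(ap[i] * ab[i] for i in range(3)) / denom)
    )
    closest = [a[i] + t * ab[i] for i in range(3)]
    return math.dist(p, closest)


def has_line_of_sight(
    a: Point,
    b: Point,
    obstacles: Iterable[Tuple[Point, float]],
    max_range: float,
) -> bool:
    """True if a and b are within range and no sphere blocks the segment.

    Obstacles are (center, radius) spheres -- a simplifying assumption.
    """
    if math.dist(a, b) > max_range:
        return False
    return all(
        point_segment_distance(center, a, b) > radius
        for center, radius in obstacles
    )


if __name__ == "__main__":
    uav_a, uav_b = (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)
    print(has_line_of_sight(uav_a, uav_b, [((5.0, 3.0, 0.0), 1.0)], 20.0))
    print(has_line_of_sight(uav_a, uav_b, [((5.0, 0.0, 0.0), 1.0)], 20.0))
```

A check of this kind is non-smooth in the UAV positions, which is one reason such constraints are usually approximated inside an optimizer, as the abstract notes.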

Abstract:
In this paper, we propose a differential-flatness-based controller (DFBC) for precise trajectory tracking of tractor-trailers, particularly during reversing maneuvers, which are challenging due to unstable equilibrium points. The proposed controller leverages the differential flatness property of tractor-trailers, equivalently transforming the nonlinear kinematics into a Brunovsky canonical form, allowing the application of linear control theory for control design. Compared to traditional linear quadratic regulator (LQR) controllers, the proposed DFBC method achieves higher precision and robustness in reversing maneuvers. We also showcase the performance of the proposed DFBC method through physical experiments conducted on our self-developed 1/10 scale autonomous tractor-trailer.

Abstract:
Simultaneous Localization and Mapping (SLAM) is critical for real-time robotic applications, enabling precise localization and comprehensive scene reconstruction. Recent advances in 3D Gaussian Splatting (3DGS) enable high-quality view synthesis and rapid rendering, yet robust and consistent semantic scene representation remains under-explored. In this work, we introduce a dense semantic SLAM framework that integrates high-dimensional semantic features with an explicit 3D Gaussian-based scene representation to address these challenges. Our approach employs a lightweight projection layer that maps low-dimensional semantic features to high-dimensional embeddings, a coarse-to-fine and semantically informed camera tracking strategy that robustly estimates camera poses, a mapping module that incrementally refines the Gaussian map by simultaneously leveraging geometric and photometric cues alongside semantic information and a covisibility-based local bundle adjustment module for joint optimization of camera poses and Gaussian parameters. Extensive experiments on synthetic and real-world indoor datasets demonstrate that our framework achieves superior reconstruction quality, enhanced semantic segmentation accuracy, and competitive camera pose estimation.

Abstract:
Multi-agent trajectory planning requires ensuring both safety and efficiency, yet deadlocks remain a significant challenge, especially in obstacle-dense environments. To address this, we propose a novel distributed trajectory planning framework that bridges the gap between global path and local trajectory cooperation. At the global level, a homotopy-aware optimal path planning algorithm is proposed, which fully leverages the topological structure of the environment. A reference path is chosen from distinct homotopy classes by considering both its spatial and temporal properties, leading to improved coordination among agents globally. At the local level, a model predictive control-based trajectory optimization method is used to generate dynamically feasible and collision-free trajectories. Additionally, an online replanning strategy ensures its adaptability to changing environments. Simulations and experiments validate the effectiveness of our approach in mitigating deadlocks. Ablation studies demonstrate that by incorporating time-aware homotopic properties into the underlying global paths, our method can significantly reduce deadlocks and improve the average success rate from 4%-13% to over 90% in randomly generated dense scenarios.

Abstract:
This paper presents a visual servoing strategy for a clevis and tenon joint assembly as typically performed in aircraft manufacturing. The desired velocity in the image acquired by the cameras observing the parts is extracted from a vector field that guides the motion of the moving part so that it follows an accurate 3D straight line trajectory. The vector field is designed so that it is continuous and globally exponentially stable whatever the initial configuration. The strategy has been successfully implemented and validated on a full-scale demonstrator, with coarse camera localization, while ensuring an assembly without any collision thanks to a maximal error less than 1 mm along a 10 cm trajectory.

Abstract:
Precise drone rephotography technology aims to recover camera poses from a reference sequence and obtain well-aligned image sequences, playing a crucial role in autonomous drone inspection tasks. However, existing rephotography methods rely on static image inputs, resulting in low efficiency and limited applicability in real-world scenarios. This paper presents a novel video-based precise drone rephotography system leveraging video sequences. To the best of our knowledge, this is the first work to extend precise drone rephotography from still images to videos while significantly reducing rephotography time. The proposed approach integrates advanced visual SLAM techniques with a dense flow prediction model to continuously refine the drone’s pose, enabling robust and precise rephotography tasks. To further quantify system performance, we introduce a trajectory-based visual similarity evaluation standard—Dynamic Frame Alignment Error (DFAE), which assesses the visual similarity of drone-captured videos of varying durations. We conducted multiple experiments with drones in real-world scenarios. Experimental results demonstrate that the proposed system achieves efficient and precise rephotography across multiple indoor and outdoor trials. Specifically, the average rephotography error is only 7.956 pixels indoors and 9.800 pixels outdoors. More importantly, the rephotography time is only half of the baseline.

Abstract:
Plane detection from depth images is a crucial subtask with broad robotic applications, often accomplished by iterative methods such as Random Sample Consensus (RANSAC). While RANSAC is a robust strategy with strong probabilistic guarantees, the ambiguity of its inlier threshold criterion makes it susceptible to false positive plane detections. This issue is particularly prevalent in complex real-world scenes, where the true number of planes is unknown and multiple planes coexist. In this paper, we aim to address this limitation by proposing a generalised framework for plane detection based on model information optimization. Building on previous works, we treat the observed depth readings as discrete random variables, with their probability distributions constrained by the ground truth planes. Various models containing different candidate plane constraints are then generated through repeated random sub-sampling to explain our observations. By incorporating the physics and noise model of the depth sensor, we can calculate the information for each model, and the model with the least information is accepted as the most likely ground truth. This information optimization process serves as an objective mechanism for determining the true number of planes and preventing false positive detections. Additionally, the quality of each detected plane can be ranked by summing the information reduction of inlier points for each plane. We validate these properties through experiments with synthetic data and find that our algorithm estimates plane parameters more accurately compared to the default Open3D RANSAC plane segmentation. Furthermore, we accelerate our algorithm by partitioning the depth map using neural network segmentation, which enhances its ability to generate more realistic plane parameters in real-world data.

Abstract:
Accurate 6D object pose estimation is a prerequisite for successfully completing robotic prehensile and non-prehensile manipulation tasks. At present, 6D pose estimation for robotic manipulation generally relies on depth sensors based on, e.g., structured light, time-of-flight, and stereo-vision, which can be expensive, produce noisy output (as compared with RGB cameras), and fail to handle transparent objects. On the other hand, state-of-the-art monocular depth estimation models (MDEMs) provide only affine-invariant depths up to an unknown scale and shift. Metric MDEMs achieve some successful zero-shot results on public datasets, but fail to generalize. We propose a novel framework, monocular one-shot metric-depth alignment, MOMA, to recover metric depth from a single RGB image, through a one-shot adaptation building on MDEM techniques. MOMA performs scale-rotation-shift alignments during camera calibration, guided by sparse ground-truth depth points, enabling accurate depth estimation without additional data collection or model retraining on the testing setup. MOMA supports fine-tuning the MDEM on transparent objects, demonstrating strong generalization capabilities. Real-world experiments on tabletop 2-finger grasping and suction-based bin-picking applications show MOMA achieves high success rates in diverse tasks, confirming its effectiveness.
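To make the idea of aligning affine-invariant depth to metric depth concrete, the sketch below fits a single scale and shift by ordinary least squares from a few sparse ground-truth depth points. It is only a simplified illustration: MOMA is described as performing scale-rotation-shift alignment, whereas this example omits the rotation term, assumes the alignment is done directly in depth space, and uses hypothetical function names (`fit_scale_shift`, `to_metric`).

```python
from typing import List, Sequence, Tuple


def fit_scale_shift(
    relative: Sequence[float], metric: Sequence[float]
) -> Tuple[float, float]:
    """Least-squares (scale, shift) such that metric ~= scale * relative + shift."""
    n = len(relative)
    if n != len(metric) or n < 2:
        raise ValueError("need at least two paired depth samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(relative: Sequence[float], scale: float, shift: float) -> List[float]:
    """Apply the fitted alignment to predicted relative depths."""
    return [scale * r + shift for r in relative]


if __name__ == "__main__":
    # Sparse calibration points: model output vs. measured depth in metres.
    predicted = [0.1, 0.4, 0.7, 1.0]
    measured = [0.7, 1.3, 1.9, 2.5]
    s, t = fit_scale_shift(predicted, measured)
    print(s, t)
    print(to_metric([0.25, 0.55], s, t))
```

Because only two parameters are estimated, a handful of reliable depth points is enough for the fit, which is consistent with the one-shot calibration setting the abstract describes.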

Abstract:
This paper introduces EDeformNet, a novel method for real-time 3D reconstruction of fishing nets using sparse positional measurements. Currently, net deployment during large-scale fishing operations is challenging as the submerged lattice deformations that occur in response to the various environmental factors are not visible to the vessel operator. EDeformNet extends Embedded Deformation Graphs (EDGs), a commonly used technique in template-based nonrigid 3D reconstruction that allows control of embedded spaces through sparse control point correspondences. These can be suitably derived from acoustic tracking beacons attached to the net. EDeformNet enhances the standard EDG optimization scheme by including constraints that preserve surface normals at control points and guard distances between vertices in the template mesh. These improvements are proven to enable an accurate representation of the complex deformations and movements typical in purse seine nets, the fishing technique where the algorithm has been tested, which standard EDG is unable to attain. Moreover, EDeformNet also proposes a tailored strategy that dynamically adjusts the net template according to the known length of the deployed portion of the fishing net. This approach reconstructs exclusively the submerged portion of the fishing net, avoiding extraneous data from above-water sections and enhancing accuracy under realistic fishing conditions. The proposed method is validated using realistic 3D physics simulations in Blender, where quantifiable comparisons demonstrate that EDeformNet effectively captures the spatial dynamics of purse-seining. Compared to standard EDG, EDeformNet achieves superior performance, resulting in at least a 25% improvement across the array of challenging temporal scenarios studied.

Abstract:
Due to dynamic variations such as changing payload, aerodynamic disturbances, and varying platforms, a robust solution for quadrotor trajectory tracking remains challenging. To address these challenges, we present a deep reinforcement learning (DRL) framework that achieves physical dynamics invariance by directly optimizing force/torque inputs, eliminating the need for traditional intermediate control layers. Our architecture integrates a temporal trajectory encoder, which processes finite-horizon reference positions/velocities, with a latent dynamics encoder trained on historical state-action pairs to model platform-specific characteristics. Additionally, we introduce scale-aware dynamics randomization parameterized by the quadrotor’s arm length, enabling our approach to maintain stability across drones spanning from 30 g to 2.1 kg and outperform other DRL baselines by 85% in tracking accuracy. Extensive real-world validation of our approach on the Crazyflie 2.1 quadrotor, encompassing over 200 flights, demonstrates robust adaptation to wind, ground effects, and swinging payloads while achieving less than 0.05 m RMSE at speeds up to 2.0 m/s. This work introduces a universal quadrotor control paradigm that compensates for dynamic discrepancies across varied conditions and scales, paving the way for more resilient aerial systems.

Abstract:
With the growing need for diverse and scalable data in indoor scene tasks, such as question answering and dense captioning, we propose 3D-MoRe, a novel paradigm designed to generate large-scale 3D-language datasets by leveraging the strengths of foundational models. The framework integrates key components, including multi-modal embedding, cross-modal interaction, and a language model decoder, to process natural language instructions and 3D scene data. This approach facilitates enhanced reasoning and response generation in complex 3D environments. Using the ScanNet 3D scene dataset, along with text annotations from ScanQA and ScanRefer, 3D-MoRe generates 62,000 question-answer (QA) pairs and 73,000 object descriptions across 1,513 scenes. We also employ various data augmentation techniques and implement semantic filtering to ensure high-quality data. Experiments on ScanQA demonstrate that 3D-MoRe significantly outperforms state-of-the-art baselines, with the CIDEr score improving by 2.15%. Similarly, on ScanRefer, our approach achieves a notable increase in CIDEr@0.5 by 1.84%, highlighting its effectiveness in both tasks. Our code and generated datasets will be publicly released to benefit the community, and both can be accessed at https://3D-MoRe.github.io.

Abstract:
Controlling a robot based on physics-consistent dynamic models, such as Deep Lagrangian Networks (DeLaN), can improve the generalizability and interpretability of the resulting behavior. However, in complex environments, the number of objects to potentially interact with is vast, and their physical properties are often uncertain. This complexity makes it infeasible to employ a single global model. Therefore, we need to resort to online system identification of context-aware models that capture only the currently relevant aspects of the environment. While physical principles such as the conservation of energy may not hold across varying contexts, ensuring physical plausibility for any individual context-aware model can still be highly desirable, particularly when using it for receding horizon control methods such as model predictive control (MPC). Hence, in this work, we extend DeLaN to make it context-aware, combine it with a recurrent network for online system identification, and integrate it with an MPC for adaptive, physics-consistent control. We also combine DeLaN with a residual dynamics model to leverage the fact that a nominal model of the robot is typically available. We evaluate our method on a 7-DOF robot arm for trajectory tracking under varying loads. Our method reduces the end-effector tracking error by 39%, compared to a 21% improvement achieved by a baseline that uses an extended Kalman filter.

Abstract:
Panoptic tracking enables pixel-level scene interpretation of videos by integrating instance tracking in panoptic segmentation. This provides robots with a spatio-temporal understanding of the environment, an essential attribute for their operation in dynamic environments. In this paper, we propose a novel approach for panoptic tracking that simultaneously captures general semantic information and instance-specific appearance and motion features. Unlike existing methods that overlook dynamic scene attributes, our approach leverages both appearance and motion cues through dedicated network heads. These interconnected heads employ multi-scale deformable convolutions that reason about scene motion offsets with semantic context and motion-enhanced appearance features to learn tracking embeddings. Furthermore, we introduce a novel two-step fusion module that integrates the outputs from both heads by first matching instances from the current time step with propagated instances from previous time steps and subsequently refining associations using motion-enhanced appearance embeddings, improving robustness in challenging scenarios. Extensive evaluations of our proposed MAPT model on two benchmark datasets demonstrate that it achieves state-of-the-art performance in panoptic tracking accuracy, surpassing prior methods in maintaining object identities over time. To facilitate future research, we make the code available at http://panoptictracking.cs.uni-freiburg.de.

Abstract:
Autonomous vehicle safety is crucial for the successful deployment of self-driving cars. However, most existing planning methods rely heavily on imitation learning, which limits their ability to leverage collision data effectively. Moreover, collecting collision or near-collision data is inherently challenging, as it involves risks and raises ethical and practical concerns. In this paper, we propose SafeFusion, a training framework to learn from collision data. Instead of over-relying on imitation learning, SafeFusion integrates safety-oriented metrics during training to enable collision avoidance learning. In addition, to address the scarcity of collision data, we propose CollisionGen, a scalable data generation pipeline to generate diverse, high-quality scenarios using natural language prompts, generative models, and rule-based filtering. Experimental results show that our approach improves planning performance in collision-prone scenarios by 56% over previous state-of-the-art planners while maintaining effectiveness in regular driving situations. Our work provides a scalable and effective solution for advancing the safety of autonomous driving systems.

Abstract:
We address the problem of intention estimation in human-robot teleoperation, which involves identifying the task being completed and predicting the next actions. Our approach sequentially quantifies the similarity between the observed action sequence and nominal action sequences representing possible tasks using the edit distance metric. Task estimation and action prediction are then performed using a nearest-neighbor rule. A key advantage of our approach is its robustness to deviations in operator actions and action recognition errors, commonly encountered in real-world teleoperation settings. Through extensive experiments on both real and simulated data, we demonstrate that our method largely outperforms alternative approaches, including probabilistic graphical models and transformer-based methods, particularly in scenarios with significant action deviations or action recognition errors. Additionally, we construct task distance matrices to analyze task similarities and potential confusion points, providing insights into when and where estimation errors are likely to occur. This analysis can guide the design of more distinctive task sequences and further improve the reliability of teleoperated robotic systems.
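The core idea of matching an observed action sequence to nominal task sequences with the edit distance and a nearest-neighbor rule can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: unit edit costs, comparing the observation against an equal-length prefix of each nominal sequence, the tie-breaking order, and all task and action names are hypothetical choices made for the example.

```python
from typing import Dict, Optional, Sequence, Tuple


def edit_distance(a: Sequence[str], b: Sequence[str]) -> int:
    """Levenshtein distance between two sequences, with unit costs."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, start=1):
        curr = [i]
        for j, y in enumerate(b, start=1):
            cost = 0 if x == y else 1
            curr.append(min(prev[j] + 1,          # delete x
                            curr[j - 1] + 1,      # insert y
                            prev[j - 1] + cost))  # match or substitute
        prev = curr
    return prev[-1]


def estimate_task(
    observed: Sequence[str], nominal_tasks: Dict[str, Sequence[str]]
) -> Tuple[str, int]:
    """Nearest-neighbor task: smallest distance to a nominal prefix.

    Ties go to the task listed first (an arbitrary choice for this sketch).
    """
    best: Optional[Tuple[str, int]] = None
    for task, nominal in nominal_tasks.items():
        d = edit_distance(observed, nominal[:len(observed)])
        if best is None or d < best[1]:
            best = (task, d)
    if best is None:
        raise ValueError("no nominal tasks given")
    return best


def predict_next_action(
    observed: Sequence[str], nominal_tasks: Dict[str, Sequence[str]]
) -> Optional[str]:
    """Next action of the estimated task, or None if the task is finished."""
    task, _ = estimate_task(observed, nominal_tasks)
    nominal = nominal_tasks[task]
    return nominal[len(observed)] if len(observed) < len(nominal) else None


if __name__ == "__main__":
    tasks = {
        "make_coffee": ["grasp_cup", "place_cup", "press_button", "grasp_cup"],
        "clear_table": ["grasp_cup", "move_to_sink", "release_cup"],
    }
    # The second action is misrecognized, yet the nearest task is unchanged.
    seen = ["grasp_cup", "place_bowl", "press_button"]
    print(estimate_task(seen, tasks))
    print(predict_next_action(seen, tasks))
```

In the usage example a single recognition error only raises the distance to the correct task by one, which illustrates why an edit-distance rule can tolerate the action deviations and recognition errors mentioned in the abstract.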

Abstract:
Monocular visual odometry is a key technology in various autonomous systems. Traditional feature-based methods suffer from failures due to poor lighting, insufficient texture, and large motions. In contrast, recent learning-based dense SLAM methods exploit iterative dense bundle adjustment to address such failure cases, and achieve robust and accurate localization in a wide variety of real environments, without depending on domain-specific supervision. However, despite their potential, these methods still struggle with scenarios involving large motion and object dynamics. In this study, we diagnose key weaknesses in a popular learning-based dense SLAM model (DROID-SLAM) by analyzing major failure cases on outdoor benchmarks and exposing various shortcomings of its optimization process. We then propose the use of self-supervised priors leveraging a frozen large-scale pre-trained monocular depth estimator to initialize the dense bundle adjustment process, leading to robust visual odometry without the need to fine-tune the SLAM backbone. Despite its simplicity, the proposed method demonstrates significant improvements on KITTI odometry, as well as the challenging DDAD benchmark. The project page: https://toyotafrc.github.io/SGInit-Proj/

Abstract:
Recent advances in generative models have sparked exciting new possibilities in the field of autonomous vehicles. Specifically, video generation models are now being explored as controllable virtual testing environments. Simultaneously, end-to-end (E2E) driving models have emerged as a streamlined alternative to conventional modular autonomous driving systems, gaining popularity for their simplicity and scalability. However, the application of these techniques to simulation and planning raises important questions. First, while video generation models can generate increasingly realistic videos, can these videos faithfully adhere to the specified conditions and be realistic enough for E2E autonomous planner evaluation? Second, given that data is crucial for understanding and controlling E2E planners, how can we gain deeper insights into their biases and improve their ability to generalize to out-of-distribution scenarios? In this work, we bridge the gap between the driving models and generative world models (Drive&Gen) to address these questions. We propose novel statistical measures leveraging E2E drivers to evaluate the realism of generated videos. By exploiting the controllability of the video generation model, we conduct targeted experiments to investigate distribution gaps affecting E2E planner performance. Finally, we show that synthetic data produced by the video generation model offers a cost-effective alternative to real-world data collection. This synthetic data effectively improves E2E model generalization beyond existing Operational Design Domains, facilitating the expansion of autonomous vehicle services into new operational contexts.

Abstract:
In autonomous driving and robotics, ensuring road safety and reliable decision-making critically depends on out-of-distribution (OOD) segmentation. While numerous methods have been proposed to detect anomalous objects on the road, leveraging the vision-language space, which provides rich linguistic knowledge, remains an underexplored field. We hypothesize that incorporating these linguistic cues can be especially beneficial in the complex contexts found in real-world autonomous driving scenarios. To this end, we present a novel approach that trains a Text-Driven OOD Segmentation model to learn a semantically diverse set of objects in the vision-language space. Concretely, our approach combines a vision-language model’s encoder with a transformer decoder, employs Distance-Based OOD prompts located at varying semantic distances from in-distribution (ID) classes, and utilizes OOD Semantic Augmentation for OOD representations. By aligning visual and textual information, our approach effectively generalizes to unseen objects and provides robust OOD segmentation in diverse driving environments. We conduct extensive experiments on publicly available OOD segmentation datasets such as Fishyscapes, Segment-Me-If-You-Can, and Road Anomaly datasets, demonstrating that our approach achieves state-of-the-art performance across both pixel-level and object-level evaluations. This result underscores the potential of vision-language-based OOD segmentation to bolster the safety and reliability of future autonomous driving systems.

Abstract:
While modeling multi-contact manipulation as a quasi-static mechanical process transitioning between different contact equilibria, we propose formulating it as a planning and optimization problem, explicitly evaluating (i) contact stability and (ii) robustness to sensor noise. Specifically, we conduct a comprehensive study on multi-manipulator control strategies, focusing on dual-arm execution in a planar peg-in-hole task and extending it to the Multi-Manipulator Multiple Peg-in-Hole (MMPiH) problem to explore increased task complexity. Our framework employs Dynamic Movement Primitives (DMPs) to parameterize desired trajectories and Black-Box Optimization (BBO) with a comprehensive cost function incorporating friction cone constraints, squeeze forces, and stability considerations. By integrating parallel scenario training, we enhance the robustness of the learned policies. To evaluate the friction cone cost in experiments, we test the optimal trajectories computed for various contact surfaces, i.e., with different coefficients of friction. The stability cost is explained analytically, and its necessity is tested in simulation. The robustness performance is quantified through variations of hole pose and chamfer size in simulation and experiment. Results demonstrate that our approach achieves consistently high success rates in both the single peg-in-hole and multiple peg-in-hole tasks, confirming its effectiveness and generalizability. The video can be found at https://youtu.be/IU0pdnSd4tE.

Abstract:
Drones have become essential in various applications, but conventional quadrotors face limitations in confined spaces and complex tasks. Deformable drones, which can adapt their shape in real-time, offer a promising solution to overcome these challenges, while also enhancing maneuverability and enabling novel tasks like object grasping. This paper presents a novel approach to autonomous motion planning and control for deformable quadrotors. We introduce a shape-adaptive trajectory planner that incorporates deformation dynamics into path generation, using a scalable kinodynamic A* search to handle deformation parameters in complex environments. The backend spatio-temporal optimization is capable of generating optimally smooth trajectories that incorporate shape deformation. Additionally, we propose an enhanced control strategy that compensates for external forces and torque disturbances, achieving a 37.3% reduction in trajectory tracking error compared to our previous work. Our approach is validated through simulations and real-world experiments, demonstrating its effectiveness in narrow-gap traversal and multi-modal deformable tasks.

Abstract:
Trajectory forecasting has become a popular deep learning task due to its relevance for scenario simulation for autonomous driving. Specifically, trajectory forecasting predicts the trajectory of a short-horizon future for specific human drivers in a particular traffic scenario. Robust and accurate future predictions can enable autonomous driving planners to optimize for low-risk and predictable outcomes for human drivers around them. Although some work has been done to model driving style in planning and personalized autonomous policies, a gap exists in explicitly modeling human driving styles for trajectory forecasting of human behavior. Human driving style is most certainly a correlating factor to decision making, especially in edge-case scenarios where risk is nontrivial, as justified by the large amount of traffic psychology literature on risky driving. So far, the current real-world datasets for trajectory forecasting lack insight into the variety of represented driving styles. While the datasets may represent real-world distributions of driving styles, we posit that fringe driving style types may also be correlated with edge-case safety scenarios. In this work, we conduct analyses on existing real-world trajectory datasets for driving and dissect these works through the lens of driving styles, which are often intangible and non-standardized.

Abstract:
Scene view synthesis, which generates novel views from limited perspectives, is increasingly vital for applications like virtual reality, augmented reality, and robotics. Unlike object-based tasks, such as generating 360° views of a car, scene view synthesis handles entire environments where non-uniform observations pose unique challenges for stable rendering quality. To address this issue, we propose a novel approach: renderability field-guided Gaussian splatting (RF-GS). This method quantifies input inhomogeneity through a renderability field, guiding pseudo-view sampling to enhance visual consistency. To ensure the quality of wide-baseline pseudo-views, we train an image restoration model to map point projections to visible-light styles. Additionally, our validated hybrid data optimization strategy effectively fuses information of pseudo-view angles and source view textures. Comparative experiments on simulated and real-world data show that our method outperforms existing approaches in rendering stability.

Abstract:
Visual Terrain Classification (VTC) plays a vital role in enabling unmanned ground vehicles to understand complex environments. Existing research relies on image-label pairs annotated by static label sets, where semantic ambiguity and high annotation costs constrain fine-grained terrain characterization. These limitations hinder the model’s adaptation to real-world terrain diversity and restrict its applicability. To address these issues, we propose TerraX, a vision-language learning framework that integrates multi-modal image-label-text data, unifying structured annotations with fine-grained natural language descriptions. The framework introduces a composite dataset TerraData, an evaluation benchmark suite TerraBench, and a CLIP-based visual terrain classification model TerraCLIP. TerraData aggregates multi-source terrain images from public and self-collected datasets, annotated through a VLM-based vision-language data annotation pipeline. TerraBench defines three evaluation benchmarks to systematically assess model robustness and adaptability in real-world terrain classification scenarios. Built on the CLIP model, TerraCLIP utilizes multi-granularity contrastive loss and LoRA fine-tuning to enhance understanding for terrain categories and attributes, and incorporates confidence-weighted inference for accurate predictions. Extensive experiments across benchmarks and real-world platforms demonstrate that our approach significantly enhances VTC performance, highlighting its potential for deployment in complex environments.

Abstract:
In this study, we present a novel simultaneous localization and mapping (SLAM) system, VIMS, designed for underwater navigation. Conventional visual-inertial state estimators encounter significant practical challenges in perceptually degraded underwater environments, particularly in scale estimation and loop closing. To address these issues, we first propose leveraging a low-cost single-beam sonar to improve scale estimation. Then, VIMS integrates a high-sampling-rate magnetometer for place recognition by utilizing magnetic signatures generated by an economical magnetic field coil. Building on this, a hierarchical scheme is developed for visual-magnetic place recognition, enabling robust loop closure. Furthermore, VIMS achieves a balance between local feature tracking and descriptor-based loop closing, avoiding additional computational burden on the front end. Experimental results highlight the efficacy of the proposed VIMS, demonstrating significant improvements in both the robustness and accuracy of state estimation within underwater environments.

Affiliations: RAM—Robotics and Mechatronics, University of Twente, Enschede, The Netherlands; BIOS Lab-on-Chip Group, Max Planck Center for Complex Fluid Dynamics, MESA+ Institute for Nanotechnology, University of Twente, Enschede, The Netherlands; Faculty of Electrical and Control Engineering, Liaoning Technical University, Huludao, China; Department of Design Production and Management, University of Twente, Enschede, The Netherlands; Autonomous Matter Department, AMOLF, Amsterdam, The Netherlands

Abstract:
In the design of microrobots, a helical geometry is pivotal to overcome the time-reversal constraints of the scallop theorem. The helical geometry enables the microrobots to propel themselves forward in viscous fluids with a corkscrew-like motion when they are allowed to rotate. It is physically advantageous for microrobots to swim with a near-zero angle of attack, much like buoyant microorganisms, allowing high thrust for forward propulsion. This type of propulsion is not possible for non-buoyant microrobots, which drift downward due to gravity. Here, we analyze the stability problem of controlling magnetically driven helical microrobots to achieve bounded straight runs without drift in a low-Reynolds-number regime. We demonstrate periodic active suspension solutions that facilitate helical propulsion with minimal angle of attack and zero drift. We theoretically predict unique control inputs, for a given helical microrobot geometry and magnetic composition (i.e., 62 wt% Ni and 24 wt% Au), which can be generated with rotating field and field-gradient pulling. Using microrobots fabricated with a denser-than-water soft-magnetic body (4870 kg·m⁻³), we find that the microrobot can swim with a near-zero angle of attack of 8.3° ± 5.2° (mean ± s.d.), outperforming conventional gravity compensation methods.

Abstract:
We present the design and implementation of a large-scale tail-sitter aircraft with vertical takeoff and landing (VTOL) capabilities. The aircraft was designed with an H-shaped configuration and incorporated canards to enhance longitudinal stability. The structural design was optimized to balance mechanical strength and lightweight construction. Furthermore, the actuation system comprised four motors and eight servos, providing the aircraft with high maneuverability. Computational fluid dynamics (CFD) simulations and wind tunnel tests were conducted to characterize the full envelope aerodynamics of the aircraft. To address the engineering challenges associated with increased scale, we developed a control framework applicable to the entire flight envelope. The performance and reliability of the proposed system were validated by extensive simulation studies and outdoor flight experiments.

Abstract:
Visual grounding aims at identifying objects or regions in a scene based on natural language descriptions, which is essential for spatially aware perception in autonomous driving. However, existing visual grounding tasks typically depend on bounding boxes that often fail to capture fine-grained details. Not all voxels within a bounding box are occupied, resulting in inaccurate object representations. To address this, we introduce a benchmark for 3D occupancy grounding in challenging outdoor scenes. Built on the nuScenes dataset, it fuses natural language with voxel-level occupancy annotations, offering more precise object perception compared to the traditional grounding task. Moreover, we propose GroundingOcc, an end-to-end model designed for 3D occupancy grounding through multimodal learning. It combines visual, textual, and point cloud features to predict object location and occupancy information from coarse to fine. Specifically, GroundingOcc comprises a multimodal encoder for feature extraction, an occupancy head for voxel-wise predictions, and a grounding head for refining localization. Additionally, a 2D grounding module and a depth estimation module enhance geometric understanding, thereby boosting model performance. Extensive experiments on the benchmark demonstrate that our method outperforms existing baselines on 3D occupancy grounding. The dataset is available at https://github.com/RONINGOD/GroundingOcc.

Abstract:
Recent advances in 3D content generation, particularly 3D Gaussian Splatting (3DGS) and diffusion models, have significantly improved the synthesis of static shapes and textures. However, the modeling of dynamic articulations remains a significant challenge. Existing datasets lack physics-aware joint annotations, segmentation methods overlook kinematic constraints, and procedural generation techniques often prioritize space coverage over physical plausibility and visual realism. Motivated by these challenges, we propose Articulation-Gen, a scalable and robust framework for generating physically compliant, multi-joint 3D objects. Our approach comprises three components: (1) a 3D semantic segmentation module that integrates 2D visual models (SAM2 and DINO) to achieve 91.4% part segmentation accuracy by resolving occlusions via multi-view fusion with semantic consistency; (2) a physics-guided joint optimizer that combines spatial sampling with heuristic search to reach 93.7% axis alignment accuracy, representing a 20.6% improvement; and (3) an LLM-augmented URDF synthesis mechanism that automatically produces physically plausible kinematic descriptions with language annotations, thereby improving generation accuracy by 87.5%. Leveraging existing 3D asset datasets and generation techniques, we further construct a large-scale articulation asset dataset comprising 10.6K articulated objects with 45.2K validated joints. This dataset enables faster articulated asset generation while ensuring URDF compliance. By proposing our pipeline and dataset, this work provides foundational tools for physics-based computer graphics and embodied AI, advancing the frontiers of 3D content creation and robotic simulation.

Abstract:
Tendon-sheath mechanisms (TSMs) are widely used in minimally invasive surgical (MIS) applications, but their inherent hysteresis—caused by friction, backlash, and tendon elongation—leads to significant tracking errors. Conventional modeling and compensation methods struggle with these non-linearities and require extensive parameter tuning. To address this, we propose a vibration-assisted hysteresis compensation approach, where controlled vibrational motion is applied along the tendon’s movement direction to mitigate friction and reduce dead zones. Experimental results demonstrate that the exerted vibration consistently reduces hysteresis across all tested frequencies, decreasing RMSE by up to 23.41% (from 2.2345 mm to 1.7113 mm) and improving correlation, leading to more accurate trajectory tracking. When combined with a Temporal Convolutional Network (TCN)-based compensation model, vibration further enhances performance, achieving an 85.2% reduction in MAE (from 1.334 mm to 0.1969 mm). Without vibration, the TCN-based approach still reduces MAE by 72.3% (from 1.334 mm to 0.370 mm) under the same parameter settings. These findings confirm that vibration effectively mitigates hysteresis, improving trajectory accuracy and enabling more efficient compensation models with fewer trainable parameters. This approach provides a scalable and practical solution for TSM-based robotic applications, particularly in MIS.
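The reported percentage reductions can be reproduced from the before/after error values quoted in the abstract. The snippet below is only an arithmetic check; the helper name `relative_reduction` is an illustrative assumption and is not taken from the paper.

```python
def relative_reduction(before: float, after: float) -> float:
    """Return the percentage reduction when an error drops from `before` to `after`."""
    return (before - after) / before * 100.0


# Error values (in mm) as quoted in the abstract.
rmse_drop = relative_reduction(2.2345, 1.7113)      # vibration only, RMSE
mae_drop_vib = relative_reduction(1.334, 0.1969)    # TCN with vibration, MAE
mae_drop_no_vib = relative_reduction(1.334, 0.370)  # TCN without vibration, MAE

print(f"RMSE reduction: {rmse_drop:.2f}%")
print(f"MAE reduction (TCN + vibration): {mae_drop_vib:.1f}%")
print(f"MAE reduction (TCN only): {mae_drop_no_vib:.1f}%")
```

The three results round to the 23.41%, 85.2%, and 72.3% figures stated above.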

Abstract:
Video-streaming-based teleoperation often faces a trade-off between bandwidth consumption and the need for high-fidelity telepresence. Higher image resolution or a wider field of view (FOV) substantially increases bandwidth requirements. In this paper, we propose a novel telepresence model for teleoperated vehicles operating in bandwidth-constrained environments. Our approach employs a LiDAR-fused 3D Gaussian Splatting (3DGS) model as a compact scene representation to efficiently generate remote views. Initially, a static point cloud map is constructed using LiDAR-based semantic mapping, which serves as the initial Gaussians for optimizing the 3DGS model. During teleoperation, the prebuilt 3DGS is then rendered on the teleoperation platform, while only safety-critical information, such as vehicle pose and dynamic objects, is transmitted from the vehicle to the teleoperator in real-time. The proposed telepresence model significantly reduces data transmission requirements while maintaining photorealistic telepresence, enabling reliable and effective teleoperation even under stringent bandwidth constraints. This capability ensures safe and efficient vehicle teleoperation in challenging environments without relying on traditional high-bandwidth communication, thereby broadening the applicability of teleoperation technology to more demanding and diverse operational scenarios. Real-world experimental results show that the developed system can provide immersive teleoperation experiences at Kbps-level bandwidth consumption.

Abstract:
In 3D object mapping, category-level priors enable efficient object reconstruction and canonical pose estimation, requiring only a single prior per semantic category (e.g., chair, book, laptop, etc.). DeepSDF has been used predominantly as a category-level shape prior, but it struggles to reconstruct sharp geometry and is computationally expensive. In contrast, NeRFs capture fine details but have yet to be effectively integrated with category-level priors in a real-time multi-object mapping framework. To bridge this gap, we introduce PRENOM, a Prior-based Efficient Neural Object Mapper that integrates category-level priors with object-level NeRFs to enhance reconstruction efficiency and enable canonical object pose estimation. PRENOM gets to know objects on a first-name basis by meta-learning on synthetic reconstruction tasks generated from open-source shape datasets. To account for object category variations, it employs a multi-objective genetic algorithm to optimize the NeRF architecture for each category, balancing reconstruction quality and training time. Additionally, prior-based probabilistic ray sampling directs sampling toward expected object regions, accelerating convergence and improving reconstruction quality under constrained resources. Experimental results highlight the ability of PRENOM to achieve high-quality reconstructions while maintaining computational feasibility. Specifically, comparisons with prior-free NeRF-based approaches on a synthetic dataset show a 21% lower Chamfer distance. Furthermore, evaluations against other approaches using shape priors on a noisy real-world dataset indicate a 13% improvement averaged across all reconstruction metrics, and comparable pose and size estimation accuracy, while being trained for 5× less time. Code available at: https://github.com/snt-arg/PRENOM

Abstract:
Representing articulated objects remains a difficult problem within the field of robotics. Objects such as pliers, clamps, or cabinets require representations that capture not only geometry and color information, but also part separation, connectivity, and joint parametrization. Furthermore, learning these representations becomes even more difficult with each additional degree of freedom. Complex articulated objects such as robot arms may have seven or more degrees of freedom, and the depth of their kinematic tree may be notably greater than that of the tools, drawers, and cabinets that are the typical subjects of articulated object research. To address these concerns, we introduce SPLATART, a pipeline for learning Gaussian splat representations of articulated objects from posed images, of which a subset contains image space part segmentations. SPLATART disentangles the part separation task from the articulation estimation task, allowing for post-facto determination of joint estimation and representation of articulated objects with deeper kinematic trees than previously exhibited. In this work, we present data on the SPLATART pipeline as applied to the synthetic Paris dataset objects [1], and qualitative results on a real-world object under sparse segmentation supervision. We additionally present results on articulated serial chain manipulators to demonstrate usage on deeper kinematic tree structures. Further media and information can be found at the project website: https://progress.eecs.umich.edu/projects/splatart/

Abstract:
Mobile robot navigation in crowded spaces is crucial for deployment but remains challenging in extremely dense environments. While recent works have utilized predictive measures to anticipate individual trajectories, bettering overall navigation, these methods often fail to scale in high-density crowds due to computational intensity. To mitigate this issue, we propose PathCluster, a novel approach that generates groups based on individuals with similar trajectories to enable efficient navigation in dense crowds. Our method introduces a group generator algorithm that identifies and treats clusters as cohesive units, significantly decreasing the complexity of trajectory prediction while maintaining its benefits. Simulation results demonstrate that our method, PathCluster, achieves a 45% higher success rate and a 25% lower collision rate, and can tackle more challenging navigational tasks within a 48-hour time limit, compared to previous social navigation models in extremely crowded environments.
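The idea of treating individuals with similar trajectories as one unit can be illustrated with a simple threshold-based grouping. This is a minimal sketch under stated assumptions, not the PathCluster group generator, whose details the abstract does not give: the function name `group_agents`, the distance and heading thresholds, and the union-find merging are all hypothetical choices.

```python
import math


def group_agents(positions, velocities, max_dist=1.5,
                 max_heading_diff=math.radians(30)):
    """Group agents that are close together and heading in similar directions.

    Hypothetical illustration: two agents are linked when their distance and
    heading difference are both under the given thresholds; linked agents are
    merged into groups with a small union-find structure.
    """
    n = len(positions)
    parent = list(range(n))

    def find(i):
        # Follow parent links to the group root, compressing the path.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            close = math.dist(positions[i], positions[j]) <= max_dist
            heading_i = math.atan2(velocities[i][1], velocities[i][0])
            heading_j = math.atan2(velocities[j][1], velocities[j][0])
            # Wrap the heading difference into [0, pi].
            delta = heading_i - heading_j
            diff = abs(math.atan2(math.sin(delta), math.cos(delta)))
            if close and diff <= max_heading_diff:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Agents 0 and 1 walk side by side; agent 2 is far away; agent 3 is nearby
# but walks in the opposite direction.
positions = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0), (0.5, 0.5)]
velocities = [(1.0, 0.0), (1.0, 0.1), (1.0, 0.0), (-1.0, 0.0)]
groups = group_agents(positions, velocities)
print(groups)
```

A downstream predictor could then forecast one trajectory per group instead of one per individual, which is where the reduction in prediction cost would come from.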

Abstract:
This paper tackles the spatio-temporal areas restoration problem for a single robot when faced with state uncertainty: a robot, with limited battery life, deployed in a known environment, persistently plans a schedule to visit areas of interest and charge its battery as needed. The temporal properties of areas decay over time, wherein the decay is only partially observable and evolves over time, potentially with correlation among areas. The goal is to restore the temporal properties so that the time the measured property values are below a certain threshold is minimized. Our previous work formulated the spatio-temporal areas restoration problem assuming that the decays are known. Instead, in this paper, we relax that assumption and account for the uncertainty, proposing a heuristic to measure the discounted opportunity cost of a visit, which induces risk-aversion to revisit overlooked areas, and adding a component that learns the decay parameters in each area as well as potential correlation among areas. The learning component can then be used to predict future trends and be incorporated in the heuristic forecast. Moreover, the algorithm learns and constantly adjusts for noise that can happen during a mission. We show in experiments using a robotics simulator that our devised approach is able to maintain areas above the critical threshold better than existing state-of-the-art methods from related problems. This contribution enables a robot to come up with an effective schedule efficiently for preserving spatio-temporal properties of an environment considering realistic scenarios, which has a marked impact in important environmental applications.

Abstract:
A comprehensive understanding of 3D scenes is essential for autonomous vehicles (AVs), and among various perception tasks, occupancy estimation plays a central role by providing a general representation of drivable and occupied space. However, most existing occupancy estimation methods rely on LiDAR or cameras, which perform poorly in degraded environments such as smoke, rain, snow, and fog. In this paper, we propose 4D-ROLLS, the first weakly supervised occupancy estimation method for 4D radar using the LiDAR point cloud as the supervisory signal. Specifically, we introduce a method for generating pseudo-LiDAR labels, including occupancy queries and LiDAR height maps, as multi-stage supervision to train the 4D radar occupancy estimation model. Then the model is aligned with the occupancy map produced by LiDAR, fine-tuning its accuracy in occupancy estimation. Extensive comparative experiments validate the exceptional performance of 4D-ROLLS. Its robustness in degraded environments and effectiveness in cross-dataset training are qualitatively demonstrated. The model is also seamlessly transferred to the downstream tasks of BEV segmentation and point cloud occupancy prediction, highlighting its potential for broader applications. The lightweight network enables the 4D-ROLLS model to achieve fast inference speeds of about 30 Hz on a 4060 GPU. The code of 4D-ROLLS will be made available at https://github.com/CLASS-Lab/4D-ROLLS.

Abstract:
This paper presents a novel theory of mind (ToM)-based approach for implicit coordination of multi-robot systems (MRS) in environments where direct communication is unavailable. The proposed approach integrates higher-order reasoning, epistemic theory, and active inference to coordinate the actions of each robot to clarify their own intentions and make them understandable to other robots. Further, to reduce the computational overhead of higher-order reasoning, we implement a large language model (LLM)-based attention selection mechanism that focuses on a subset of robots. Simulations and physical experiments demonstrate the applicability of the proposed approach with high success rates while significantly reducing computational complexity.

Abstract:
Quadruped robots have demonstrated remarkable versatility in various applications, from search and rescue to exploration. Recent advancements have shifted focus from individual robots to swarms, recognizing the potential of collaborative behaviors to achieve complex tasks beyond the capabilities of a single robot. Inspired by the cooperative hunting behaviors observed in nature, this paper presents a reinforcement learning framework for a swarm of quadruped robots to learn decentralized end-to-end hunting locomotion. In particular, we integrate stable and dynamic locomotion with hunting objectives and utilize a guidance vector as privileged information for efficient training. The framework considers the control dynamics of quadruped robots, ensuring both low-level stability and high-level hunting coordination in multi-robot environments. The trained policy is deployed onto a real robot system, and the experimental results demonstrate coordinative behavior in various scenarios. The implementation code is released to benefit the community.

Abstract:
Dynamic obstacle avoidance is a challenging problem in robotic control, with many algorithms developed to balance efficiency and real-time performance. Existing resolved-rate motion control (RRMC) methods formulate obstacle avoidance as a quadratic programming (QP) problem. However, the lack of directional guidance for obstacle avoidance and frequent constraint conflicts often lead to execution failures. In this work, we propose the Mode-Isolated Velocity-Guide (MIVG) algorithm that deploys a dual-mode isolation strategy combined with a Velocity-Guide Potential Field (VGPF). This novel approach separates obstacle avoidance from target-driven tasks while providing velocity-based directional guidance. Simulations on a 7-degree-of-freedom Franka Emika Panda robot demonstrate that our approach significantly enhances task success rates while maintaining real-time feasibility, achieving an increase in execution success rates of 35.0% to 52.0% compared to the baseline RRMC strategy (NEO). Additionally, we analyze the impact of key parameters through simulations, further validating the effectiveness of the proposed algorithm in dynamic environments.

Abstract:
Multi-task learning (MTL) can advance assistive driving by exploring inter-task correlations through shared representations. However, existing methods face two critical limitations: single-modality constraints limiting comprehensive scene understanding and inefficient architectures impeding real-time deployment. This paper proposes TEM3-Learning (Time-Efficient Multimodal Multi-task Learning), a novel framework that jointly optimizes driver emotion recognition, driver behavior recognition, traffic context recognition, and vehicle behavior recognition through a two-stage architecture. The first component, the mamba-based multi-view temporal-spatial feature extraction subnetwork (MTS-Mamba), introduces a forward-backward temporal scanning mechanism and global-local spatial attention to efficiently extract low-cost temporal-spatial features from multi-view sequential images. The second component, the MTL-based gated multimodal feature integrator (MGMI), employs task-specific multi-gating modules to adaptively highlight the most relevant modality features for each task, effectively alleviating the negative transfer problem in MTL. Evaluated on the AIDE dataset, our proposed model achieves state-of-the-art accuracy across all four tasks, maintaining a lightweight architecture with fewer than 6 million parameters and delivering an impressive 142.32 FPS inference speed. Rigorous ablation studies further validate the effectiveness of the proposed framework and the independent contributions of each module. The code is available at https://github.com/Wenzhuo-Liu/TEM3-Learning.

Abstract:
Correspondence matching is a fundamental and crucial task in robot vision. In recent years, deep learning-based keypoint matching techniques have shown outstanding performance in downstream tasks. Conventional learning-based correspondence matching methods rely on large datasets and a specific training procedure. Correspondence techniques based on pre-trained features have been preliminarily explored by researchers. Unfortunately, traditional convolutional neural networks only possess translation invariance but lack rotational invariance; hence, their performance suffers significantly under heavy rotations. Therefore, we propose a correspondence matching method based on pre-trained group-equivariant neural networks and compare the performance of various rotation-equivariant to rotation-invariant transformers. We conducted experiments on the Rotated-Hpatches and Rotated-MegaDepth datasets, and the results indicate that our proposed method is concise and effective, achieving state-of-the-art performance without the need for retraining in downstream tasks.

Abstract:
In this paper, we extend the Unified Force-Impedance Control (UFIC) framework for the whole-body control of service mobile robots subject to non-holonomic constraints. This enables the robot to execute complex service tasks that demand both force and impedance control within its whole body workspace. The task space of the robot is defined as the pose of both end-effectors. Following the concept of UFIC, both impedance and force tracking commands are applied in the task space of the whole-body controller, with augmented energy tanks incorporated to ensure passivity. To enable smooth transitions between force tracking and impedance control—particularly in cases of contact loss—a shaping function is used to modulate the force control command. Additionally, the robot’s redundancy is exploited to shape the posture, while satisfying joint limits, avoiding singularities, and preventing self-collisions between the arms. The effectiveness of the proposed whole-body UFIC controller is validated through simulations and real-world experiments with the service robot GARMI performing several daily tasks.

Affiliations: Department of Control Science and Engineering, College of Electronics and Information Engineering, Tongji University, China; Department of Control Science and Engineering, CEIE, State Key Laboratory of Autonomous Intelligent Unmanned System, Frontiers Science Center for Intelligent Autonomous Systems, Ministry of Education, Shanghai Research Institute for Intelligent Autonomous Systems, Shanghai Institute of Intelligent Science and Technology, Tongji University, China

Abstract:
This paper explores nudge schemes for a central regulator aimed at incentivizing players in constrained noncooperative games to reach desired equilibria. Unlike traditional intervention mechanisms where players update their actions by blindly following signals from the regulator, the nudge mechanism comprehensively integrates players’ rational judgment by incorporating trust variables into players’ models. This implies that players update their actions by evaluating the signals from the regulator and their own expectations of the incentive mechanisms. If the regulator’s signals significantly deviate from the players’ expectations, players decrease trust in the regulator and rely more on their own expectations when updating their actions. Conversely, if the signals align closely with their expectations, players tend to increase trust in the regulator and place greater emphasis on the regulator’s signals. It should be noted that each player does not have access to all other players’ actions, which means that it updates its action in a distributed manner by only observing the actions of its neighbors through a directed balanced graph. Furthermore, static and dynamic nudges are designed based on different information available to the regulator, which are also extended to an online case with the desired equilibrium being time-varying. Finally, an application to robot formation control is shown to validate the obtained results.
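The trust mechanism described above can be pictured with a toy update rule. The sketch below is an illustrative assumption, not the paper's model: the names `update_trust` and `blended_incentive`, the fixed step size, the deviation tolerance, and the convex-combination weighting are all hypothetical.

```python
def update_trust(trust: float, signal: float, expectation: float,
                 tolerance: float = 1.0, step: float = 0.1) -> float:
    """Lower trust when the signal deviates strongly from the expectation,
    raise it otherwise, and keep the result within [0, 1]."""
    deviation = abs(signal - expectation)
    if deviation > tolerance:
        trust -= step
    else:
        trust += step
    return min(1.0, max(0.0, trust))


def blended_incentive(trust: float, signal: float, expectation: float) -> float:
    """Weight the regulator's signal by trust and the player's own
    expectation by the remaining weight."""
    return trust * signal + (1.0 - trust) * expectation


# A signal far from the expectation reduces trust; a close one increases it.
low = update_trust(0.5, signal=3.0, expectation=0.0)
high = update_trust(0.5, signal=0.2, expectation=0.0)
print(low, high, blended_incentive(0.25, signal=4.0, expectation=0.0))
```

In this toy version, a player with low trust acts almost entirely on its own expectation, while a player with high trust follows the regulator's signal closely.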

Abstract:
Gastroscopy and colonoscopy have become the fundamental tools for gastrointestinal (GI) tract diagnosis and treatment. Conventional tethered devices usually lead to the use of anesthetic agents and patient discomfort. Capsule endoscopy is becoming an ideal alternative; however, the smooth capsule shape may hinder its active locomotion and retention in the GI tract. Here, we propose a spherical soft magnetic millirobot (S2M2 robot) that integrates a virus-like spherical body with protrusions and an octopus-inspired sucker design for enhanced physical capabilities. The protruding suckers (radius: 2.2-4.8 mm) on its surface enable efficient locomotion performance (maximum angular velocity: 8 r/s, maximum speed: 180 mm/s) with strong adhesive ability (maximum force: 3.5 N). The ex vivo experiments in a swine stomach demonstrate the robot’s motion on the slippery surface of the gastric mucosa and the effectiveness of the pressure-based drug delivery system. The in vitro and ex vivo results highlight the superior mobility and controllability, showcasing its potential as a carrier robot for the next-generation capsule endoscopy.

Abstract:
Local feature description is crucial for robotic tasks, yet existing methods struggle with motion blur, a prevalent challenge in high-dynamic and low-light environments. While effective on sharp images, they suffer significant degradation under blur. To address this issue, we propose Motion-Feat, an end-to-end motion blur-aware feature description method. Our approach introduces a Motion Deformable Block (MDB) that adaptively adjusts the receptive field based on pixel-wise motion information at different stages of the network, enhancing multi-scale feature descriptor robustness in blurred conditions. Additionally, we construct synthetic blurred datasets to systematically benchmark feature matching performance across varying blur intensities. Extensive experiments demonstrate that Motion-Feat outperforms state-of-the-art methods on blurred images while maintaining competitive performance on sharp images for relative camera pose estimation and homography estimation tasks. Both code and datasets are available at https://github.com/AndreGao08/Motion-Feat.

Abstract:
Industrial assembly of deformable linear objects (DLOs) such as cables offers great potential for many industries. However, DLOs pose several challenges for robot-based automation due to the inherent complexity of deformation and, consequently, the difficulties in anticipating the behavior of DLOs in dynamic situations. Although existing studies have addressed isolated subproblems like shape tracking, grasping, and shape control, there has been limited exploration of integrated workflows that combine these individual processes. To address this gap, we propose an object-centric perception and planning framework to achieve a comprehensive DLO assembly process throughout the industrial value chain. The framework utilizes visual and tactile information to track the DLO’s shape as well as contact state across different stages, which facilitates effective planning of robot actions. Our approach encompasses robot-based bin picking of DLOs from cluttered environments, followed by a coordinated handover to two additional robots that mount the DLOs onto designated fixtures. Real-world experiments employing a setup with multiple robots demonstrate the effectiveness of the approach and its relevance to industrial scenarios.

Abstract:
Thunniform fish are renowned for their high-speed and efficient swimming, achieved through minimal body undulation and powerful tail-flapping. Inspired by this natural mechanism, this study develops a thunniform robotic fish and introduces a novel frequency profile control strategy for its tail-flapping propulsion. Unlike previous works that primarily focused on constant or low-speed tail-flapping, we propose and implement a new method that dynamically adjusts the tail-flapping speed using composite frequency profiles with varying frequencies and angle ranges within a single cycle. This approach aims to enhance surge speed and energy efficiency. The effectiveness of the proposed frequency profiles is systematically evaluated through three-dimensional incompressible viscous computational fluid dynamics (CFD) simulations. Results demonstrate that tailored multi-frequency tail-flapping profiles can significantly improve the surge performance of the robot, providing valuable insights for the design of next-generation bio-inspired underwater propulsion systems. The novelty of this work lies in the formulation and validation of a frequency profile function that enables adaptive and efficient surge motion, advancing the field of bio-inspired aquatic robotics.

Abstract:
Continuum robots, inspired by biological structures such as spines and tails, have attracted significant attention due to their flexibility and ability to perform complex tasks in confined and dynamic environments. However, traditional flexible continuum robots often encounter challenges such as non-linearity, hysteresis, and limited load-bearing capacity, which can compromise their precision and effectiveness in practical applications. To address these limitations, this paper presents a novel bionic continuum mechanism, the Rigid Multi-joint Coupled Continuum Structure (RMCC), which employs a rigid mechanical transmission mode to couple all joints, achieving coordinated movement of multiple joints. Its rigid structural composition and transmission method provide it with high precision and load capacity. The coordinated motion of the joints endows it with the dexterity of a continuum mechanism, while also enabling efficient and precise control with a minimal number of motors. The modular joint design improves the system’s scalability and adaptability, enabling a wide range of configurations to suit diverse robotic applications. The feasibility and effectiveness of the proposed system are validated through a series of bio-inspired experiments, including lizard-like crawling, falling-cat movement, and bird-like adaptive grasping. The experimental results confirm that the RMCC exhibits the flexibility and adaptability of animals, demonstrating its potential for diverse bionic robotics applications.

Abstract:
Quadrupedal locomotion involves coordinated interaction between limbs and torso, enabling quadrupeds to achieve remarkable movement performance and adapt effectively to various environments. In previous studies, mathematical dynamic models of quadrupeds have been established to investigate the mechanisms of limb-torso interaction during walking. However, due to the strong nonlinearity within the model, analyzing how the torso’s motion, especially vibrations, affects walking remains a significant challenge. In this study, the linearization and frequency analysis methods are applied to the quadruped walker to analyze its vibration characteristics, including natural frequency and vibration amplitude. Subsequently, numerical simulations are conducted to examine the relationship between torso vibration and walking performance. Furthermore, a comparison between the vibration characteristics and the simulation results reveals a potential resonance phenomenon. This finding not only validates the effectiveness of the linearization approach but also offers new insights into the interaction between the limbs and torso.

Abstract:
This paper introduces a crawler robot equipped with a movable bending point mechanism that allows it to bend anywhere along its trunk. The robot features a movable motor unit that travels along its body, dynamically adjusting the bending initiation point. This enables the robot to move forward and backward, bend its body, and shift the bending position as needed. By integrating these capabilities, the robot enhances its ability to navigate obstacles and confined spaces. We develop three statics-based models to predict the robot’s performance in escaping narrow spaces, climbing steps, and crossing ditches. These models demonstrate that adjusting the bending point improves traversal capabilities, a finding corroborated by experiments with the physical robot.

Abstract:
It is a challenging task for ground robots to autonomously navigate in harsh environments due to the presence of non-trivial obstacles and uneven terrain. This requires trajectory planning that balances safety and efficiency. The primary challenge is to generate a feasible trajectory that prevents the robot from tipping over while ensuring effective navigation. In this paper, we propose a capsizing-aware trajectory planner (CAP) to achieve trajectory planning on uneven terrain. The tip-over stability of the robot on rough terrain is analyzed. Based on the tip-over stability, we define the traversable orientation, which indicates the safe range of robot orientations. This orientation is then incorporated into a capsizing-safety constraint for trajectory optimization. We employ a graph-based solver to compute a robust and feasible trajectory while adhering to the capsizing-safety constraint. Extensive simulation and real-world experiments validate the effectiveness and robustness of the proposed method. The results demonstrate that CAP outperforms existing state-of-the-art approaches, providing enhanced navigation performance on uneven terrains.

Abstract:
Tennis is widely popular across various age groups. However, mastering fluid stroke mechanics remains a significant challenge, requiring substantial time and practice. The lack of scientifically grounded tools for skill acquisition and training further hampers the development of tennis proficiency. In this paper, we investigate the application of an intelligent tennis robot equipped with advanced human-machine interaction capabilities, designed to serve as an effective tool for tennis learners, especially for enhancing motion geometry and coordination. First, we detail the design and development of the intelligent tennis robot, highlighting its core functions and operational principles, including ball serving, human motion sensing, and motion feedback mechanisms. Subsequently, we introduce a novel human motion analysis method that integrates motion geometry and kinematic analyses. Following this, we present a real-time motion feedback system that identifies deficiencies in players' movements, thereby facilitating the enhancement of their motion memory. Finally, we conduct experiments with players of varying skill levels, analyzing their motion patterns and providing practical examples. The proposed human-machine interaction framework offers a pioneering solution for intelligent tennis training, enabling players to understand their movements and correct errors immediately after each stroke.

Abstract:
Buoyancy systems play a vital role in the efficient movement and control of deep-sea robots. Traditional buoyancy systems for these robots rely on high-pressure hydraulic pumps and heavy pressure-resistant shells, resulting in notable increases in size, mass, and cost. Consequently, there is an urgent need for a lightweight, high-pressure-resistant, self-sensing buoyancy adjustment device that can accommodate the multimodal movement requirements of miniaturized deep-sea robots. Drawing inspiration from the buoyancy regulation mechanism of sperm whales, we have developed a self-sensing phase-change buoyancy regulation system tailored for miniaturized robots, designed to operate in extreme deep-sea environments. Each module weighs only 35g while providing 4g of buoyancy adjustment. Experimental results demonstrate that the system maintains stable operation under hydrostatic pressures of 30MPa. When integrated into a miniaturized deep-sea robot, this module successfully enables controlled buoyancy for multimodal movement in aquatic environments.

Abstract:
This work presents a novel co-design strategy that integrates trajectory planning and control to handle STL-based tasks in autonomous robots. The method consists of two phases: (i) learning spatio-temporal motion primitives to encapsulate the inherent robot-specific constraints and (ii) constructing an STL-compliant motion plan from these primitives. Initially, we employ reinforcement learning to construct a library of control policies that perform trajectories described by the motion primitives. Then, we map motion primitives to spatio-temporal characteristics. Subsequently, we present a sampling-based STL-compliant motion planning strategy to meet the STL specification. The proposed model-free approach, which generates feasible STL-compliant motion plans across various environments, is validated on differential-drive and quadruped robots across various STL specifications. Demonstration videos are available at https://youtu.be/xo2cXRYdDPQ and a detailed version at https://doi.org/10.48550/arXiv.2507.13225.

Abstract:
Depth perception is essential for a robot’s spatial and geometric understanding of its environment, with many tasks traditionally relying on hardware-based depth sensors like RGB-D or stereo cameras. However, these sensors face practical limitations, including issues with transparent and reflective objects, high costs, calibration complexity, spatial and energy constraints, and increased failure rates in compound systems. While monocular depth estimation methods offer a cost-effective and simpler alternative, their adoption in robotics is limited due to their output of relative rather than metric depth, which is crucial for robotics applications. In this paper, we propose a method that utilizes a single calibrated camera, enabling the robot to act as a "measuring stick" to convert relative depth estimates into metric depth in real-time as tasks are performed. Our approach employs an LSTM-based metric depth regressor, trained online and refined through probabilistic filtering, to accurately restore the metric depth across the monocular depth map, particularly in areas proximal to the robot’s motion. Experiments with real robots demonstrate that our method significantly outperforms current state-of-the-art monocular metric depth estimation techniques, achieving a 22.1% reduction in depth error and a 52% increase in success rate for a downstream task.
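
The idea of using known robot geometry as a "measuring stick" can be illustrated with a much simpler stand-in than the LSTM-based regressor described above: a closed-form least-squares fit of a global scale and shift. The sketch below assumes that relative depth is affinely related to metric depth, and that metric depth is known at a few pixels (for example, where the end effector is visible and its position follows from kinematics and the calibrated camera). The function names and sample values are hypothetical and are not taken from the paper.

```python
def fit_scale_shift(relative, metric):
    """Least-squares fit of metric ~= scale * relative + shift."""
    n = len(relative)
    if n < 2 or n != len(metric):
        raise ValueError("need at least two paired samples")
    mean_r = sum(relative) / n
    mean_m = sum(metric) / n
    var_r = sum((r - mean_r) ** 2 for r in relative)
    if var_r == 0.0:
        raise ValueError("relative depths must not all be equal")
    cov = sum((r - mean_r) * (m - mean_m) for r, m in zip(relative, metric))
    scale = cov / var_r
    shift = mean_m - scale * mean_r
    return scale, shift


def to_metric(relative_map, scale, shift):
    """Apply the fitted affine correction to a 2D relative depth map."""
    return [[scale * r + shift for r in row] for row in relative_map]


# Hypothetical paired samples: relative depth at robot pixels and the
# metric depth (in metres) of the same points known from kinematics.
rel_samples = [0.10, 0.25, 0.40, 0.80]
met_samples = [0.35, 0.65, 0.95, 1.75]

scale, shift = fit_scale_shift(rel_samples, met_samples)
metric_map = to_metric([[0.10, 0.80], [0.40, 0.25]], scale, shift)
```

A single global fit like this cannot capture spatially varying error, which is one reason a learned, online-refined regressor may be preferable in practice.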

Abstract:
We explore the problem of path planning for mobile robots in unknown environments using deep reinforcement learning (DRL). This paper proposes a multi-layer long short-term memory twin delayed deep deterministic policy gradient (MLTD3) algorithm for unknown environments. First, we introduce multi-layer long short-term memory (LSTM) networks to the actor network of the twin delayed deep deterministic policy gradient (TD3) algorithm, to capture rich historical information and alleviate local optimal solutions in local path planning. Second, Poisson coding is employed in the state space to deal with the uncertainty of environments, so that the algorithm can discern the relationship within long sequence information. Then, novel extrinsic and intrinsic reward functions are designed to avoid the dynamic obstacles in environments. Finally, the performance of the proposed algorithm is verified through simulations in ROS Gazebo and physical experiments in an unknown environment.
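
The abstract does not specify how Poisson coding is applied to the state. One common reading is Poisson rate coding, in which a scalar is turned into a binary sequence whose firing probability per step is proportional to the normalized value. The sketch below shows that reading only, using a per-step Bernoulli approximation. The function name, value range, and sequence length are assumptions for illustration.

```python
import random


def poisson_encode(value, v_min, v_max, steps, rng):
    """Encode a scalar as a 0/1 sequence with rate proportional to its
    normalized value (Bernoulli approximation of a Poisson process)."""
    if v_max <= v_min:
        raise ValueError("v_max must be greater than v_min")
    rate = (value - v_min) / (v_max - v_min)
    rate = min(1.0, max(0.0, rate))  # clamp to a valid probability
    return [1 if rng.random() < rate else 0 for _ in range(steps)]


rng = random.Random(0)  # seeded only to make the example repeatable

# Hypothetical state entry: a range reading between 0 m and 5 m.
far_reading = poisson_encode(5.0, 0.0, 5.0, steps=20, rng=rng)
near_reading = poisson_encode(0.0, 0.0, 5.0, steps=20, rng=rng)
mid_reading = poisson_encode(2.5, 0.0, 5.0, steps=20, rng=rng)
```

The stochastic sequence could then be fed to a recurrent actor step by step. Whether the paper does this is not stated in the abstract.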

Abstract:
This work focuses on multi-floor indoor exploration, which remains an open area of research. Compared to traditional methods, recent learning-based explorers have demonstrated significant potential due to their robust environmental learning and modeling capabilities, but most are restricted to 2D environments. In this paper, we propose a learning-integrated topological explorer, LITE, for multi-floor indoor environments. LITE decomposes the environment into a floor-stair topology, enabling seamless integration of learning or non-learning-based 2D exploration methods for 3D exploration. As we incrementally build the floor-stair topology during exploration using a YOLO11-based instance segmentation model, the agent can transition between floors through a finite state machine. Additionally, we implement an attention-based 2D exploration policy that utilizes an attention mechanism to capture spatial dependencies between different regions, thereby determining the next global goal for more efficient exploration. Extensive comparison and ablation studies conducted on the HM3D and MP3D datasets demonstrate that our proposed 2D exploration policy significantly outperforms all baseline explorers in terms of exploration efficiency. Furthermore, experiments in several 3D multi-floor environments indicate that our framework is compatible with various 2D exploration methods, facilitating effective multi-floor indoor exploration. Finally, we validate our method in the real world with a quadruped robot, highlighting its strong generalization capabilities.

Abstract:
In imitation learning for robotic manipulation, decomposing object manipulation tasks into sub-tasks enables the reuse of learned skills and the combination of learned behaviors to perform novel tasks, rather than simply replicating demonstrated motions. Human gaze is closely linked to hand movements during object manipulation. We hypothesize that an imitating agent’s gaze control (fixating on specific landmarks and transitioning between them) simultaneously segments demonstrated manipulations into sub-tasks. This study proposes a simple yet robust task decomposition method based on gaze transitions. We use teleoperation, a common modality for collecting demonstrations in robotic manipulation, in which a human operator’s gaze is measured and used for task decomposition as a substitute for an imitating agent’s gaze. Our approach ensures consistent task decomposition across all demonstrations for each task, which is desirable in contexts such as machine learning. We evaluated the method across demonstrations of various tasks, assessing the characteristics and consistency of the resulting sub-tasks. Furthermore, extensive testing across different hyperparameter settings confirmed its robustness, making it adaptable to diverse robotic systems. Our code is available at https://github.com/crumbyRobotics/GazeTaskDecomp.

Abstract:
Recent research has demonstrated that blockchain-enabled robot swarms—where robots coordinate using blockchain technology—can secure robot swarms by neutralizing malicious and malfunctioning robots. This security is achieved through blockchain technology’s consistency properties. However, prior work addressed malfunctions at the information level, that is, it studied how to neutralize robots that stored information in the blockchain that did not correspond to the real-world state (i.e., it studied the oracle problem). In contrast, this study focuses on inconsistencies at the blockchain protocol level. We analyze how network partitions, which may arise from robots’ local-only communication capabilities, malfunctioning hardware, or external attacks, can lead to inconsistent information in a robot swarm. In order to mitigate these disruptions, we propose a decentralized approach to detect partitions and a corresponding response. We study our approach in a swarm robotics simulator, where we demonstrate its effectiveness in reducing blockchain inconsistencies.

Abstract:
To exploit the compliant capabilities of soft robot arms, we require controllers which can exploit their physical capabilities. Teleoperation, leveraging a human in the loop, is a key step towards achieving more complex control strategies. Whilst teleoperation is widely used for rigid robots, for soft robots we require teleoperation methods where the configuration of the whole body is considered. We propose a method of using an identical ‘physical twin’, or demonstrator, of the robot. This tendon robot can be back-driven, with the tendon lengths providing configuration perception and enabling a direct mapping of tendon lengths for the executor. We demonstrate how this teleoperation across the entire configuration of the robot enables complex interactions which exploit the environment, such as squeezing into gaps. We also show how this method can generalize to robots which are at a larger scale than the physical twin, and how tuneability of the stiffness properties of the physical twin simplifies its use.

Abstract:
Robot-assisted feeding systems have the potential to significantly enhance the independence and quality of life of individuals with mobility impairments. While prior work has focused on personalizing bite sequences based on user feedback provided only at the start of the feeding process, this approach assumes that users can fully articulate their preferences upfront. In reality, it is cognitively challenging for users to anticipate every detail, and their preferences may evolve during feeding. Thus, there is a need for an adaptive system that supports iterative corrections across all stages of the feeding process while maintaining context and feeding history to interpret inputs relative to earlier instructions. In this paper, we present FRANC, a novel framework for personalized robot-assisted feeding (RAF) that leverages large language models (LLMs) with a decomposed prompting strategy to dynamically adjust bite sequence, acquisition, and transfer parameters during feeding. Our approach allows iterative corrections without sacrificing consistency and accuracy. In our user studies, FRANC improved bite sequencing accuracy from 65% to 93% and enhanced user satisfaction, with participants reliably perceiving when their preferences were being integrated despite occasional execution failures. We also provide a detailed failure analysis and offer insights for developing more adaptive and effective robot-assisted feeding systems.

Abstract:
Accurate estimation of aerodynamic forces is essential for advancing the control, modeling, and design of flapping-wing aerial robots with dynamic morphing capabilities. In this paper, we investigate two distinct methodologies for force estimation on Aerobat, a bio-inspired flapping-wing platform designed to emulate the inertial and aerodynamic behaviors observed in bat flight. Our goal is to quantify aerodynamic force contributions during tethered flight, a crucial step toward closed-loop flight control. The first method is a physics-based observer derived from Hamiltonian mechanics that leverages the concept of conjugate momentum to infer external aerodynamic forces acting on the robot. This observer builds on the system’s reduced-order dynamic model and utilizes real-time sensor data to estimate forces without requiring training data. The second method employs a neural network-based regression model, specifically a multi-layer perceptron (MLP), to learn a mapping from joint kinematics, flapping frequency, and environmental parameters to aerodynamic force outputs. We evaluate both estimators using a 6-axis load cell in a high-frequency data acquisition setup that enables fine-grained force measurements during periodic wingbeats. The conjugate momentum observer and the regression model demonstrate strong agreement across three force components (Fx, Fy, Fz).

Abstract:
This study introduces MarineGym, a high-performance reinforcement learning platform tailored for underwater robotics. It aims to address the limitations of existing underwater simulation environments in terms of reinforcement learning compatibility, training efficiency, and standardized benchmarking. MarineGym integrates a proposed GPU-accelerated hydrodynamic plugin based on Isaac Sim, achieving a rollout speed of 250,000 frames per second on a single NVIDIA RTX 3060 GPU. It also provides five models of unmanned underwater vehicles, multiple propulsion systems, and a set of predefined tasks covering core underwater control challenges. Additionally, the domain randomization toolkit allows flexible adjustments of the simulation and task parameters during training to improve the Sim2Real transfer. Further benchmark experiments demonstrate that MarineGym improves training efficiency over existing platforms and supports robust policy adaptation under various perturbations in the marine environment. We expect this platform to drive further advancements in RL research for underwater robotics. For more details about MarineGym and its applications, please visit our project page: https://marine-gym.com/.

Abstract:
Reward design is a key component of deep reinforcement learning (DRL), yet some tasks and designer’s objectives may be unnatural to define as a scalar cost function. Among the various techniques, formal methods integrated with DRL have garnered considerable attention due to their expressiveness and flexibility in defining the reward and requirements for different states and actions of the agent. Nevertheless, the exploration of leveraging Signal Temporal Logic (STL) for guiding multi-agent reinforcement learning (MARL) reward design is still limited. The presence of complex interactions, heterogeneous goals, and critical safety requirements in multi-agent systems exacerbates this challenge. In this paper, we propose a novel STL-guided multi-agent reinforcement learning framework. The STL requirements are designed to include both task specifications according to the objective of each agent and safety specifications. The robustness values from checking the states against STL specifications are leveraged to generate rewards. We validate our approach by conducting experiments across various testbeds. The experimental results demonstrate significant performance improvements compared to MARL without STL guidance, along with a remarkable increase in the overall safety rate of the multi-agent systems.
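
To make the phrase "robustness values ... leveraged to generate rewards" concrete, the sketch below computes standard quantitative STL robustness for two simple formulas over a discrete-time trace: a safety requirement "always (distance to other agents > d_safe)" and a task requirement "eventually (distance to goal < goal_radius)". It combines them with a minimum, which is the usual semantics of conjunction. The specific formulas, thresholds, and function names are assumptions for illustration and do not reproduce the specifications used in the paper.

```python
def rob_always_greater(signal, threshold):
    """Robustness of G (signal > threshold): worst margin over the trace."""
    return min(x - threshold for x in signal)


def rob_eventually_less(signal, threshold):
    """Robustness of F (signal < threshold): best margin over the trace."""
    return max(threshold - x for x in signal)


def stl_reward(dist_to_others, dist_to_goal, d_safe=0.5, goal_radius=0.2):
    """Reward = robustness of (safety AND task); positive iff both hold."""
    safety = rob_always_greater(dist_to_others, d_safe)
    task = rob_eventually_less(dist_to_goal, goal_radius)
    return min(safety, task)


# Hypothetical per-step distances (metres) for one agent's episode.
safe_episode = stl_reward([1.0, 0.8, 0.9], [2.0, 1.0, 0.1])
unsafe_episode = stl_reward([1.0, 0.3, 0.9], [2.0, 1.0, 0.1])
```

A positive value means the trace satisfies the specification with that margin, and a negative value means it is violated, so the sign and magnitude can both shape learning.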

Abstract:
Developing controllers that generalize across diverse robot morphologies remains a significant challenge in legged locomotion. Traditional approaches either create specialized controllers for each morphology or compromise performance for generality. This paper introduces a two-stage teacher-student framework that bridges this gap through policy distillation. First, we train specialized teacher policies optimized for individual morphologies, capturing the unique optimal control strategies for each robot design. Then, we distill this specialized expertise into a single Transformer-based student policy capable of controlling robots with varying leg configurations. Our experiments across five distinct legged morphologies demonstrate that our approach preserves morphology-specific optimal behaviors, with the Transformer architecture achieving 94.47% of teacher performance on training morphologies and 72.64% on unseen robot designs. Comparative analysis reveals that Transformer-based architectures consistently outperform MLP baselines by leveraging attention mechanisms to effectively model joint relationships across different kinematic structures. We validate our approach through successful deployment on a physical quadruped robot, demonstrating the practical viability of our morphology-agnostic control framework. This work presents a scalable solution for developing universal legged robot controllers that maintain near-optimal performance while generalizing across diverse morphologies.

Abstract:
Soft pneumatic robots are gaining significant attention due to their compliance and adaptability in unstructured environments. While emerging dual-chamber soft pneumatic robots can achieve complex 3D deformations beyond conventional single-axis bending, real-time proprioception remains challenging due to the high degrees of freedom and the complex interaction between chambers. To address this issue, we propose a multimodal learning-based sensing method that combines a camera and an inertial measurement unit (IMU) and then extracts full-body shape information using deep learning algorithms. Our method enhances proprioception by effectively processing high-dimensional sensor data, providing real-time feedback on the gripper shape. The average error of key points was found to be 3.67mm (Var 8.39) for our method, while the error was 4.36mm (Var 10.47) when a camera was used alone, or 9.32mm (Var 21.29) when an IMU was used alone. Our multimodal learning-based shape estimation and reconstruction empower soft pneumatic grippers to be seamlessly integrated into the embodied AI framework, significantly improving their reliability and thus paving the way for applications in service robotics, rehabilitation robotics, and human-robot collaborations.

Abstract:
Dynamic loco-manipulation calls for effective whole-body control and contact-rich interactions with the object and the environment. Existing learning-based control synthesis relies on training low-level skill policies and explicitly switching with a high-level policy or a hand-designed finite state machine, leading to quasi-static behaviors. In contrast, dynamic tasks such as soccer require the robot to run towards the ball, decelerate to an optimal approach to dribble, and eventually kick a goal—a continuum of smooth motion. To this end, we propose Preferenced Oracle Guided Multi-mode Policies (OGMP) to learn a single policy mastering all the required modes and preferred sequence of transitions to solve uni-object loco-manipulation tasks. We design hybrid automatons as oracles to generate references with continuous dynamics and discrete mode jumps to perform a guided policy optimization through bounded exploration. To enforce learning a desired sequence of mode transitions, we present a task-agnostic preference reward that enhances performance. The proposed approach demonstrates successful loco-manipulation for tasks like soccer and moving boxes omnidirectionally through whole-body control. In soccer, a single policy learns to optimally reach the ball, transition to contact-rich dribbling, and execute successful goal kicks and ball stops. Leveraging the oracle’s abstraction, we solve each loco-manipulation task on robots with varying morphologies, including HECTOR V1, Berkeley Humanoid, Unitree G1, and H1, using the same reward definition and weights.

Abstract:
Utilizing 2D images for place recognition within 3D point cloud maps presents significant challenges in autonomous driving applications, primarily due to the inherent cross-modal disparity between visual and LiDAR data. In this study, we propose a novel cross-modal visual-LiDAR place recognition method based on Bird’s Eye View (BEV) feature distillation. Our framework is the first end-to-end solution designed to achieve cross-modal place recognition between surround-view images and LiDAR point clouds. By encoding features into a unified BEV representation, our approach effectively bridges the modality gap between 3D and 2D data. Additionally, we introduce a teacher-student distillation training strategy to further enhance the network’s cross-modal generalization capabilities. Extensive experiments on benchmark datasets, including nuScenes and Argoverse, demonstrate that our method achieves state-of-the-art (SOTA) performance in cross-modal place recognition tasks. Furthermore, validation on the SJTU-Sanya dataset confirms the robustness and adaptability of our approach in real-world scenarios. We publicly release our network model and implementation details at https://github.com/IRMVLab/CrossBEV-PR.

Abstract:
Multi-modal ground-aerial robots have been extensively studied, with a significant challenge lying in the integration of conflicting requirements across different modes of operation. The Husky robot family, developed at Northeastern University, and specifically the Husky v.2 discussed in this study, addresses this challenge by incorporating posture manipulation and thrust vectoring into multi-modal locomotion through structure repurposing. This quadrupedal robot features leg structures that can be repurposed for dynamic legged locomotion and flight. In this paper, we present the hardware design of the robot and report primary results on dynamic quadrupedal legged locomotion and hovering.

Abstract:
Visuotactile sensors provide high-resolution tactile information but are incapable of perceiving the material features of objects. We present UltraTac, an integrated sensor that combines visuotactile imaging with ultrasound sensing through a coaxial optoacoustic architecture. The design shares structural components and achieves consistent sensing regions for both modalities. Additionally, we incorporate acoustic matching into the traditional visuotactile sensor structure, enabling the integration of the ultrasound sensing modality without compromising visuotactile performance. Through tactile feedback, we can dynamically adjust the operating state of the ultrasound module to achieve more flexible functional coordination. Systematic experiments demonstrate three key capabilities: proximity sensing in the 3–8 cm range (R² = 0.99), material classification (average accuracy: 99.20%), and texture-material dual-mode object recognition (92.11% accuracy on a 15-class task). Finally, we integrate the sensor into a robotic manipulation system to concurrently detect container surface patterns and internal content, which verifies its promising potential for advanced human-machine interaction and precise robotic manipulation.

Abstract:
Autonomous exploration is a critical challenge for various unmanned aerial vehicle (UAV) applications. Existing methods often suffer from low exploration rates due to limitations such as inefficient global coverage and inadequate sensor data utilization. In this paper, we introduce FIELD, a Fast Information-driven aerial robot Exploration planner using Larger perception Distance. FIELD leverages a larger perception distance to identify high-information-gain viewpoints while maintaining mapping precision and utilizing more sensor data to guide the exploration process. Then, the method incorporates a history-aware coverage path to determine a consistent and reasonable sequence for visiting frontier viewpoints. Local viewpoints are refined to find the optimal combination of these viewpoints. We compare our method with state-of-the-art frontier-based approaches in benchmark environments. Our method significantly improves exploration efficiency by 13% to 17%.

Abstract:
Efficient exploration of large-scale environments remains a critical challenge in robotics, with applications ranging from environmental monitoring to search and rescue operations. This article proposes Frontier Shepherding (FroShe), a bio-inspired multi-robot framework for large-scale exploration. The framework heuristically models frontier exploration based on the shepherding behavior of herding dogs, where frontiers are treated as a swarm of sheep reacting to robots modeled as shepherding dogs. FroShe is robust across varying environment sizes and obstacle densities, requiring minimal parameter tuning for deployment across multiple agents. Simulation results demonstrate that the proposed method performs consistently, regardless of environment complexity, and outperforms state-of-the-art exploration strategies by an average of 20% with three UAVs. The approach was further validated in real-world experiments using single- and dual-drone deployments in a forest-like environment.

Affiliations: School of Artificial Intelligence/School of Future Technology, Nanjing University of Information Science and Technology, Nanjing, China; Reading Academy, Nanjing University of Information Science & Technology, Nanjing, China; State Key Laboratory of Precision Electronic Manufacturing Technology and Equipment, Guangdong University of Technology, Guangzhou, China; Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China

Abstract:
Accurate indoor location-based services are important for mobile robots, especially in complex indoor environments. In this paper, we propose a heterogeneous graph network-based ultra-wide band (UWB) localization method to provide accurate and robust localization results for mobile robots in complex indoor scenarios. The core of our approach lies in constructing the anchors, ranging measurements, and tags into a heterogeneous graph structure according to the topological structure of the UWB localization system, and then designing a spatial-temporal heterogeneous graph attention neural network to extract high-level features and estimate the tag locations from the graph. In this way, the geometric relationships contained in the UWB localization system are comprehensively established, while the spatial and temporal information contained in the ranging measurements can also be extracted. We validate the proposed method through real-world experiments. The results demonstrate that, compared to existing deep learning-based methods, the constructed heterogeneous graph better represents the geometric structure of the UWB localization system, and the designed heterogeneous graph neural network effectively extracts the spatial-temporal and geometric features. Consequently, the accuracy and robustness of UWB localization are significantly improved.

Abstract:
Electrohydrodynamic (EHD) pumps made from flexible or stretchable materials are a promising pumping element for fluidically driven soft robots. In most soft robotic systems, EHD pumps are used separately or connected in series with their target components, such as actuators, which can limit design flexibility and complicate implementation. To address this issue, this paper presents an EHD soft actuator that integrates a pump, actuator, and reservoir into a single device. In this design, the EHD pump is implemented as a flexible PCB, which also serves as a strain-limiting layer, enhancing bending actuation. Additionally, the interdigitated electrodes on the flexible PCB generate fringe electric fields, introducing electroadhesion as an unprecedented functionality for EHD-driven soft actuators. Experimental results from the fabricated actuators demonstrate voltage-controllable actuation, achieving a maximum bending angle of 56.0° and a force of 31.0 mN. The actuators are then incorporated into a soft gripper, where electroadhesion enhances the holding force, with a 1.3× increase for a dielectric object and a 2.9× increase for a conductive object. These enhancements are observed in comparison to a control experiment in which gripping is performed using non-electric, fluidic actuation alone. The results validate the successful implementation of the highly integrated, multifunctional EHD soft actuator, highlighting its potential for soft robotic applications.

Abstract:
We present KD-RIEKF, a novel state estimation framework that incorporates kinodynamic constraints into the Right-Invariant Extended Kalman Filter (RIEKF). Our framework integrates generalized momentum-based contact estimation, centroidal dynamics, and a noise-adaptive module, improving state estimation accuracy by probabilistically adjusting propagation noise to account for contact uncertainty and sensor noise. A key innovation is the expansion of the ground reaction force (GRF) into a state variable. By using GRF-based acceleration as a measurement, our method significantly reduces estimation errors in position, velocity, and orientation. The integration of contact-force-driven adaptive noise effectively boosts the stability of estimation, especially when the system is turning, accelerating, or decelerating. We validated our algorithm in simulation on highly uneven terrain, showing significant enhancements in z-axis position estimation compared to RIEKF. Further experiments on the Unitree Go2 robot across different speeds demonstrated that even in high-speed scenarios over 200 meters, our method reduced position estimation relative error (RE) by 47% and orientation estimation error by 42%, confirming its robustness and accuracy under dynamic locomotion.

Abstract:
For inspection robots to achieve generalizability, stability, and ease of use, it is crucial that they understand natural language commands and navigate accurately to specified target objects. We propose L-SNI, a semantic navigation system adapted for inspection tasks, offering generalizability, robust stability, and practical ease of use. In the perception phase, L-SNI constructs a precise geometric depth map of the environment using LiDAR, while RGB images are employed to extract object categories, which are then combined with depth data to generate a semantic map. To enable the large language model (LLM) to interpret the environment, L-SNI encodes the 3D semantic map into a plain text representation. During single-task execution, L-SNI decodes human commands into inspection primitives using an LLM constrained by system initial prompts. These inspection primitives guide the robot’s low-level planner for task execution. To address the challenge of traditional 3D LiDAR localization and navigation systems in accurately positioning the robot around target objects during inspection tasks, we propose a target cost gradient to assist in optimizing the robot’s target point selection and attitude control in maps with semantic information. Upon reaching the target, L-SNI uses a visual language model (VLM) to describe the scene, which is simplified by the LLM into a user-friendly response. Through testing on 18 indoor scenes from the Matterport 3D dataset, L-SNI achieves a 46.9% improvement in Success Rate (SR) and a 58.3% increase in Success weighted by Path Length (SPL) over existing state-of-the-art (SOTA) solutions, while also demonstrating superior target image understanding. Moreover, it can be easily deployed on real-world robots without complex initialization.

Abstract:
Mobile manipulator robots operating in complex domestic and industrial environments must effectively coordinate their base and arm motions while avoiding obstacles. While current reactive control methods gracefully achieve this coordination, they rely on simplified and idealised geometric representations of the environment to avoid collisions. This limits their performance in cluttered environments. To address this problem, we introduce RMMI, a reactive control framework that leverages the ability of neural Signed Distance Fields (SDFs) to provide a continuous and differentiable representation of the environment’s geometry. RMMI formulates a quadratic program that optimises jointly for robot base and arm motion, maximises the manipulability, and avoids collisions through a set of inequality constraints. These constraints are constructed by querying the SDF for the distance and direction to the closest obstacle for a large number of sampling points on the robot. We evaluate RMMI both in simulation and in a set of real-world experiments. For reaching in cluttered environments, we observe a 25% increase in success rate. For additional details, code, and experiment videos, please visit https://rmmi.github.io/.

Abstract:
LLM-based agents have demonstrated remarkable capabilities in multi-step reasoning and task execution across domains such as robotics and autonomous systems. However, deploying these agents on resource-constrained platforms presents a fundamental challenge: minimizing latency while optimizing memory usage. Existing caching techniques (KVCache, PrefixCache, PromptCache) improve inference speed by reusing cached context but overlook LLM dependency relationships in agent workflows, leading to excessive memory usage or redundant recomputation across LLM calls. To address this, we propose ContextCache, a task-aware lifecycle management framework that optimizes context fragment caching for multi-step LLM agents. ContextCache predicts the lifespan of each context fragment and dynamically allocates and releases GPU memory accordingly. We evaluate our approach on a newly constructed dataset, covering logistics coordination, assembly tasks, and health management. Experimental results demonstrate a 15% reduction in memory usage compared to state-of-the-art caching strategies, with no loss in inference efficiency, making our approach well-suited for real-world deployment in resource-constrained environments.
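
The lifecycle idea in this abstract can be illustrated with a minimal sketch. It assumes the agent workflow is known in advance, so each fragment's lifespan is computed exactly rather than predicted as in ContextCache. The names `last_use`, `simulate`, and the fragment ids are hypothetical and do not come from the paper.

```python
def last_use(workflow):
    """Map each fragment id to the index of the last LLM call that needs it."""
    last = {}
    for step, fragments in enumerate(workflow):
        for frag in fragments:
            last[frag] = step
    return last

def simulate(workflow):
    """Return the fragments still cached after each LLM call."""
    last = last_use(workflow)
    cached, history = set(), []
    for step, fragments in enumerate(workflow):
        cached |= set(fragments)                          # allocate on use
        cached -= {f for f in cached if last[f] == step}  # release at end of lifespan
        history.append(sorted(cached))
    return history

# Hypothetical four-call workflow; each entry lists the fragments that call reads.
workflow = [["sys", "map"], ["sys", "plan"], ["plan"], ["sys"]]
history = simulate(workflow)
```

Releasing a fragment right after its last consumer keeps memory low without forcing recomputation, which is the trade-off the abstract describes between excessive memory usage and redundant recomputation.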

Abstract:
Previous methods for Learning from Demonstration leverage several approaches for a human to teach motions to a robot, including teleoperation, kinesthetic teaching, and natural demonstrations. However, little previous work has explored more general interfaces that allow for multiple demonstration types. Given the varied preferences of human demonstrators and task characteristics, a flexible tool that enables multiple demonstration types could be crucial for broader robot skill training. In this work, we propose Versatile Demonstration Interface (VDI), an attachment for collaborative robots that simplifies the collection of three common types of demonstrations. Designed for flexible deployment in industrial settings, our tool requires no additional instrumentation of the environment. Our prototype interface captures human demonstrations through a combination of vision, force sensing, and state tracking (e.g., through the robot proprioception or AprilTag tracking). Through a user study where we deployed our prototype VDI at a local manufacturing innovation center with manufacturing experts, we demonstrated VDI in representative industrial tasks. Interactions from our study highlight the practical value of VDI’s varied demonstration types, expose a range of industrial use cases for VDI, and provide insights for future tool design.

Abstract:
Parallel LiDAR emerges as an innovative framework for next-generation intelligent LiDAR systems in autonomous driving. In parallel LiDAR research, V2V (Vehicle-to-Vehicle) cooperative perception is a promising technology which can effectively enhance perception range and accuracy through inter-agent information exchange. Currently, sensor heterogeneity remains a critical challenge in V2V. Although some work has made initial attempts to address this issue, existing studies are primarily conducted under ideal clear-weather conditions, ignoring the impact of variable weather factors in real-world applications. In fact, adverse weather has been shown to significantly degrade the performance of LiDAR systems, with the risk of cumulative degradation in V2V. To address this challenge, we first introduce OPV2V-W and V2V4Real-W as new benchmarks to study sensor heterogeneity in V2V under adverse weather. Then we propose the HPLaw architecture (Heterogeneous Parallel LiDARs for Adverse Weather), a self-knowledge distillation method designed to enhance model robustness across varying weather scenarios. HPLaw employs an efficient PF network to facilitate heterogeneous feature fusion and incorporates an SAKD module to extract weather-invariant features. Extensive experiments demonstrate that the student model in HPLaw achieves outstanding performance under all weather conditions, exhibiting remarkable robustness.

Abstract:
This paper presents a robust cascaded control architecture for over-actuated multirotors. It extends the Incremental Nonlinear Dynamic Inversion (INDI) control, combined with structured ℋ∞ control—originally proposed for under-actuated multirotors in [12]—to a broader range of multirotor configurations. Furthermore, it reduces the control law’s dependency on the multirotor model compared to the methods in the literature by shifting it to the actuator model, thereby improving the closed-loop robustness to uncertainties. To achieve attitude and position tracking, we employ a weighted least-squares geometric guidance control allocation method, formulated as a quadratic optimization problem, enabling full-pose tracking. The proposed approach effectively addresses key challenges, such as preventing infeasible pose references, enhancing robustness to disturbances, and accounting for the multirotor’s actual physical limitations. Numerical simulations with an over-actuated hexacopter validate the method’s effectiveness, demonstrating its adaptability to diverse mission scenarios and its potential for real-world aerial applications.

Abstract:
Tiny robots such as nano quadcopters or micro rovers are highly beneficial for various applications as they are inexpensive, agile, and safe for humans. However, their extreme size, weight, and power (SWAP) constraints lead to extremely limited compute, making autonomous navigation challenging. Existing approaches have enabled navigation within these tight constraints, but struggle in dynamic and cluttered scenes. To this end, we present Focus Bug, a novel and robust mapless navigation algorithm that can run on extremely limited hardware. Focus Bug reduces the amount of processed sensory data using a tiny reinforcement learning policy, only processing the inputs necessary for navigation. We use deep reinforcement learning (DRL) to identify critical parts of the robot’s range data and combine it with classical mapless navigation methods to benefit from their robustness and established performance. We implement and evaluate Focus Bug both on a drone in simulation and a micro-rover in the real world to show it can be applied across embodiments. Our hybrid approach outperforms the state-of-the-art in DRL navigation (57% fewer collisions in dynamic environments) while reducing the amount of range data processed by 87% and achieving a 2.6× improvement in processing time compared to classical methods. Focus Bug is the first method to achieve the high success rate of robust methods (97%) within such a tight compute budget.

Abstract:
Centipedes exhibit great maneuverability in diverse environments due to their many legs and body-driven control. By leveraging similar morphologies and control strategies, their robotic counterparts also demonstrate effective terrestrial locomotion. However, the success of these multi-legged robots is largely limited to forward locomotion; steering is substantially less studied, in part because of the difficulty in coordinating a high degree-of-freedom robot to follow predictable, planar trajectories. To resolve these challenges, we take inspiration from control schemes based on geometric mechanics (GM) in elongate systems’ locomotion through highly damped environments. We model the elongate, multi-legged system as a "terrestrial swimmer" in highly frictional environments and implement steering schemes derived from low-order templates. We identify an effective turning strategy by superimposing two traveling waves of lateral body undulation and further explore variations of the "turning wave" to enable a spectrum of arc-following steering primitives. We test our hypothesized modulation scheme on a robophysical model and validate steering trajectories against theoretically predicted displacements, producing steering radii between 0 and 0.6 body lengths. We then apply our control framework to Ground Control Robotics’ elongate multi-legged robot, Major Tom, using these motion primitives to autonomously navigate around obstacles and corners on indoor and outdoor terrain. Our work creates a systematic framework for controlling these highly mobile devices in the plane using a low-order model based on sequences of body shape changes.

Abstract:
Fine dexterous manipulation requires reactive control based on rich sensing of manipulator-object interactions. Tactile sensing arrays provide rich contact information across the manipulator’s surface. However, their implementation faces two main challenges: accurate force estimation across complex surfaces like robotic hands, and integration of these estimates into reactive control loops. We present a data-efficient calibration method that enables rapid, full-array force estimation across varying geometries, providing online feedback that accounts for non-linearities and deformation effects. Our force estimation model serves as feedback in an online closed-loop control system for interaction force tracking. The accuracy of our estimates is independently validated against measurements from a calibrated force-torque sensor. Using the Allegro Hand equipped with Xela uSkin sensors, we demonstrate precise force application through an admittance control loop running at 100 Hz, achieving up to 0.12±0.08 [N] error margin. These results show promising potential for dexterous manipulation.
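
As a rough illustration of force tracking through an admittance loop, the sketch below converts force error into a commanded velocity at the 100 Hz rate given in the abstract. The linear spring contact, the gain, and the stiffness are hypothetical stand-ins chosen for the example. They are not the paper's calibration model or controller parameters.

```python
DT = 0.01      # 100 Hz loop period, as stated in the abstract
GAIN = 0.05    # admittance gain in m/(N*s); hypothetical value
K_ENV = 500.0  # stiffness of a toy linear contact in N/m; hypothetical value

def contact_force(x):
    """Toy contact model: force grows linearly once the fingertip passes x = 0."""
    return K_ENV * max(x, 0.0)

def track_force(f_des, steps=200):
    """Adjust the commanded position until the measured force reaches f_des."""
    x = 0.0
    for _ in range(steps):
        f_meas = contact_force(x)          # stands in for the tactile force estimate
        x += DT * GAIN * (f_des - f_meas)  # admittance law: force error -> velocity
    return contact_force(x)

final_force = track_force(1.0)
```

With these values the force error shrinks by a factor of 0.75 per step, so the loop settles well within the 200 simulated steps (2 s).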

Abstract:
Accurate human motion prediction (HMP) is critical for seamless human-robot collaboration, particularly in handover tasks that require real-time adaptability. Despite the high accuracy of state-of-the-art models, their computational complexity limits practical deployment in real-world robotic applications. In this work, we enhance human motion forecasting for handover tasks by leveraging siMLPe [1], a lightweight yet powerful architecture, and introducing key improvements. Our approach, named IntentMotion, incorporates intention-aware conditioning, task-specific loss functions, and a novel intention classifier, significantly improving motion prediction accuracy while maintaining efficiency. Experimental results demonstrate that our method reduces body loss error by over 50%, achieves 200× faster inference, and requires only 3% of the parameters compared to existing state-of-the-art HMP models in robotics. These advancements establish our framework as a highly efficient and scalable solution for real-time human-robot interaction.

Abstract:
High maneuverability is essential to the autonomous operation of underwater robots. To achieve real-time maneuvering motion, the control strategy must take into account nonlinear hydrodynamic effects, which are extremely difficult to capture accurately during motion, so a balance must be struck between accuracy and real-time computational efficiency. To this end, this paper proposes a data-driven approach to model the dynamics of the underwater robot using Sparse Identification of Nonlinear Dynamics (SINDy). Compared with existing works, our method does not require any physical prior knowledge and only uses a short period of onboard sensor data. Subsequently, the learned dynamic model is incorporated into a model predictive controller (MPC) to enable precise attitude control. Finally, the proposed method is implemented on our developed fully vectored propulsion underwater robot, and a series of attitude tracking experiments are conducted in an indoor water tank. Experimental results reveal that our approach significantly improves the model accuracy and reduces the attitude tracking errors by over 79% at a control frequency of 20 Hz, which proves the effectiveness and real-time performance of the method.
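
For readers unfamiliar with SINDy, the following is a minimal sketch of its core step, sequentially thresholded least squares over a library of candidate terms. The scalar toy system, the polynomial library, and the threshold are assumptions made for illustration. The paper's actual state variables, library, and sensor data are not given in the abstract.

```python
import numpy as np

def library(x):
    """Candidate terms [1, x, x^2, x^3] for a scalar state (hypothetical choice)."""
    return np.hstack([np.ones_like(x), x, x**2, x**3])

def stlsq(theta, dxdt, threshold=0.1, iters=10):
    """Sequentially thresholded least squares: sparse xi with theta @ xi ~ dxdt."""
    xi = np.linalg.lstsq(theta, dxdt, rcond=None)[0]
    for _ in range(iters):
        small = np.abs(xi) < threshold
        xi[small] = 0.0                    # prune weak candidate terms
        for k in range(dxdt.shape[1]):
            big = ~small[:, k]
            if big.any():                  # refit using the surviving terms only
                xi[big, k] = np.linalg.lstsq(theta[:, big], dxdt[:, k], rcond=None)[0]
    return xi

# Toy data generated from dx/dt = -2 x + 0.5 x^3 (illustrative, not robot data).
x = np.linspace(-2.0, 2.0, 200).reshape(-1, 1)
dxdt = -2.0 * x + 0.5 * x**3
xi = stlsq(library(x), dxdt)
```

The recovered coefficient vector is sparse, which keeps the identified model cheap to evaluate inside a predictive controller.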

Abstract:
The design and control of hexapod robots have become an active research field due to the ability to achieve adaptive and stable multi-terrain locomotion. However, existing hexapod robots focus on the integration of flexible pitch joints to enhance their obstacle-crossing and slope-climbing abilities, and few biological observations have been made to gain insight into the agile steering mechanisms of hexapod insects. Herein, we observed the steering movements of Madagascar cockroaches. Observations showed that cockroaches exhibited specific phase relationships in addition to regular tripod gait pattern during steering. Moreover, we also found that a smaller steering radius resulted in a larger lateral bending angle of the thoracic segments. Inspired by this, a hexapod robot with a flexible torso (F-RHex) was designed and fabricated. Bio-inspired gait patterns were abstracted and simplified into two steering strategies: gait-based and mix-based. Compared to the purely gait-based strategy, the F-RHex testing results demonstrated a ~27.4% reduction in turning radius and ~40% enhancement in steering velocity, implying that the mix-based strategy offers superior steering capability.

Abstract:
Robots with intrinsic joint elasticity can perform highly dynamic manoeuvres by leveraging energy storage and release, enabling explosive motions such as throwing. By augmenting elastic robots with clutch mechanisms, link decoupling can be used to fully exploit inertial coupling effects and gravitational acceleration in motion while effectively circumventing spring deflection limits. However, braking such systems in a decoupled state presents a challenge, as re-engaging the link risks damaging the joint. While optimal control strategies could be applied, they are not inherently safe due to model uncertainties. To address this, we propose a feedback-based two-stage method that coordinates the transition through the hybrid modes of the system. These modes are characterized by underactuated and actuated dynamics. First, a decoupled link is braked via inertial coupling until a safe velocity for clutching is reached, after which the link is re-coupled and actively braked. We demonstrate the effectiveness of this method through simulations comparing it with optimal control and validate it experimentally using a physical prototype.

Abstract:
When operating at their full capacity, quadrupedal robots can produce loud footstep noise, which can be disruptive in human-centered environments like homes, offices, and hospitals. As a result, balancing locomotion performance with noise constraints is crucial for the successful real-world deployment of quadrupedal robots. However, achieving adaptive noise control is challenging due to (a) the trade-off between agility and noise minimization, (b) the need for generalization across diverse deployment conditions, and (c) the difficulty of effectively adjusting policies based on noise requirements. We propose QuietPaw, a framework incorporating our Conditional Noise-Constrained Policy (CNCP), a constrained learning-based algorithm that enables flexible, noise-aware locomotion by conditioning policy behavior on noise-reduction levels. We leverage value representation decomposition in the critics, disentangling state representations from condition-dependent representations. This allows a single versatile policy to generalize across noise levels without retraining while improving the Pareto trade-off between agility and noise reduction. We validate our approach in simulation and the real world, demonstrating that CNCP can effectively balance locomotion performance and noise constraints, achieving continuously adjustable noise reduction.

Abstract:
High-definition (HD) map construction methods are crucial for providing precise and comprehensive static environmental information, which is essential for autonomous driving systems. While Camera-LiDAR fusion techniques have shown promising results by integrating data from both modalities, existing approaches primarily focus on improving model accuracy, often neglecting the robustness of perception models—a critical aspect for real-world applications. In this paper, we explore strategies to enhance the robustness of multi-modal fusion methods for HD map construction while maintaining high accuracy. We propose three key components: data augmentation, a novel multi-modal fusion module, and a modality dropout training strategy. These components are evaluated on a challenging dataset containing 13 types of multi-sensor corruption. Experimental results demonstrate that our proposed modules significantly enhance the robustness of baseline methods. Furthermore, our approach achieves state-of-the-art performance on the clean validation set of the NuScenes dataset. Our findings provide valuable insights for developing more robust and reliable HD map construction models, advancing their applicability in real-world autonomous driving scenarios. Project website: https://robomap-123.github.io/.

Abstract:
The problem of Multi-Agent Reinforcement Learning (MARL) shows a high level of both complexity in the environment and coordination between agents. In order to scale the algorithm to large-scale agent scenarios, neural networks designed for MARL are typically implemented with parameter sharing. These characteristics result in the challenges of partial observability, credit assignment and strategy homogenization. In this paper, a Transformer-Based Multi-Agent Reinforcement Learning Method With Credit-Oriented Strategy Differentiation (TMRC) is presented to address each of these challenges. First, we design a Temporal-Spatial Encoding module and an Attention-Based Value Decomposition module based on the Transformer architecture. The former leverages both temporal and spatial observation information, compensating for the missing environmental perspectives due to partial observability. The latter is designed to identify each agent’s individual contribution in complex interactions, effectively optimizing the credit assignment process. Then, we propose a Credit-Oriented Strategy Differentiation module that differentiates the entity representations of each agent based on their current task differences, allowing agents to have distinct real-time strategies, effectively mitigating the issue of strategy homogenization. We evaluate the proposed method on the SMAC benchmark. It demonstrates better final performance, faster convergence, and greater stability compared to other comparative methods. Additionally, a series of experiments are conducted to validate the effectiveness of the proposed modules. Our code is available at https://github.com/Hkxuan/TMRC.git.

Abstract:
Human guidance has emerged as a powerful tool for enhancing reinforcement learning (RL). However, conventional forms of guidance such as demonstrations or binary scalar feedback can be challenging to collect or have low information content, motivating the exploration of other forms of human input. Among these, relative feedback (i.e., feedback on how to improve an action, such as "more to the left") offers a good balance between usability and information richness. Previous research has shown that relative feedback can be used to enhance policy search methods. However, these efforts have been limited to specific policy classes and use feedback inefficiently. In this work, we introduce a novel method to learn from relative feedback and combine it with off-policy reinforcement learning. Through evaluations on two sparse-reward tasks, we demonstrate our method can be used to improve the sample efficiency of reinforcement learning by guiding its exploration process. Additionally, we show it can adapt a policy to changes in the environment or the user’s preferences. Finally, we demonstrate real-world applicability by employing our approach to learn a navigation policy in a sparse reward setting.

Abstract:
Task-aware navigation continues to be a challenging area of research, especially in scenarios involving open vocabulary. Previous studies primarily focus on finding suitable locations for task completion, often overlooking the importance of the robot’s pose. However, the robot’s orientation is crucial for successfully completing tasks because of how objects are arranged (e.g., to open a refrigerator door). Humans intuitively navigate to objects with the right orientation using semantics and common sense. For instance, when opening a refrigerator, we naturally stand in front of it rather than to the side. Recent advances suggest that Vision-Language Models (VLMs) can provide robots with similar common sense. Therefore, we develop a VLM-driven method called Navigation-to-Gaze (Navi2Gaze) for efficient navigation and object gazing based on task descriptions. This method uses the VLM to score and select the best pose from numerous candidates automatically. In evaluations on multiple photorealistic simulation benchmarks, Navi2Gaze significantly outperforms existing approaches by precisely determining the optimal orientation relative to target objects, resulting in a 68.8% reduction in Distance to Goal (DTG). Real-world video demonstrations can be found on the supplementary website.

Abstract:
The geometry of terrain is crucial for developing terrain-aware locomotion policies. Recent advancements in quadrupedal locomotion based on learning rely on depth information obtained from LiDARs and depth cameras. Despite the capabilities of these locomotion policies on terrains, they pose challenges in processing high-dimensional data in real time with onboard hardware. In this study, we develop a lightweight framework that utilizes only the intrinsic sensors of a quadrupedal robot to facilitate terrain-aware locomotion. We introduce a learning-based leg odometry, integrated with a locomotion policy trained through reinforcement learning. Utilizing blind localization from leg odometry alongside a pre-constructed height map enables the robot to navigate steps and stairs without incident. We assess the efficacy of our framework through simulations, where our results indicate that the robot achieves up to a 17% improvement in successful traversal rates and requires fewer point samples. By compensating for slippage during locomotion, our learning-based leg odometry surpasses traditional inertial leg odometry. Lastly, we validate the practical applicability of our models on a real robot, confirming their effectiveness in real-world settings.

Abstract:
When working around other agents such as humans, it is important to model their perception capabilities to predict and make sense of their behavior. In this work, we consider agents whose perception capabilities are determined by their limited field of view, viewing range, and the potential to miss objects within their viewing range. By considering the perception capabilities and observation model of agents independently from their motion policy, we show that we can better predict the agents’ behavior; i.e., by reasoning about the perception capabilities of other agents, one can better make sense of their actions. We perform a user study where human operators navigate a cluttered scene while scanning the region for obstacles with a limited field of view and range. We show that by reasoning about the limited observation space of humans, a robot can better learn a human’s strategy for navigating an environment and navigate with minimal collisions with dynamic and static obstacles. We also show that this learned model helps it successfully navigate a physical hardware vehicle in real time. Code is available at https://github.com/labicon/HRMotion-RestrictedView.

Abstract:
Most existing NeRF methods rely on RGB images, making them unsuitable for scenarios with darkness, low light, or adverse weather conditions. To address this limitation, we propose TeX-NeRF, a NeRF framework based on heat sensing, designed for a new task: novel HADAR view synthesis. Our approach leverages Pseudo-TeX Vision to effectively transform heat sensing images through a structured mapping process. We introduce a loss function tailored to the transformed representation and incorporate temperature gradient embedding to enhance the capture of thermal information. Additionally, we construct 3D-TeX, a high-quality heat sensing dataset, to validate our method. Extensive experiments demonstrate that TeX-NeRF significantly improves pose estimation success rates for heat sensing images and outperforms existing approaches in novel HADAR view synthesis.

Abstract:
Visual object navigation, requiring agents to locate target objects in novel environments through egocentric visual observation, remains a critical challenge in Embodied AI. We propose FEG-VON, a training-free framework that constructs and maintains a Frontier Embedding Graph for efficient Visual Object Navigation. The graph initializes frontier embeddings using Vision Language Models (VLMs), where visual observations are encoded into spatially anchored semantic embeddings through cross-modal alignment with target text descriptors. We then update the graph by aggregating spatio-temporal semantic relations across frontiers, enabling online adaptation to new targets via similarity scoring without remapping. The evaluation results in public benchmarks demonstrate the superior performance of FEG-VON in both single- and multi-object navigation tasks compared with state-of-the-art methods. Crucially, FEG-VON eliminates dependency on task-specific training for exploration and advances the feasibility of zero-shot navigation in open-world environments.

Abstract:
Driven by the need for rapid and reliable heavy payload transport in logistics and manufacturing, researchers are increasingly exploring early applications of humanoid robotics in these domains. Although bipedal locomotion excels on challenging terrain, wheeled modes of transportation remain significantly more energy-efficient on flat surfaces. In this work, we develop a control system that enables a humanoid robot to achieve fast transportation - by riding a two-wheeled hoverboard - and robust heavy payload handling through whole-body grasping, where the robot uses its chest and arms to stabilize bulky objects. Our approach models payload-induced disturbances using a Linear Inverted Pendulum Mode extended with external forces and leverages tactile feedback from an integrated robotic skin to estimate the payload’s weight and center of mass. Feeding these estimates into the hoverboard controller reduces drift and enhances stability. Experimental evaluations on a full-sized real humanoid robot show that our system can withstand strong disturbances and autonomously navigate to deliver payloads of up to 20 kg.

Abstract:
Current service robots suffer from limited natural language communication abilities, heavy reliance on predefined commands, ongoing human intervention, and, most notably, a lack of proactive collaboration awareness in human-populated environments. This results in narrow applicability and low utility. In this paper, we introduce AssistantX, an LLM-powered proactive assistant designed for autonomous operation in real-world scenarios with high accuracy. AssistantX employs a multi-agent framework consisting of 4 specialized LLM agents, each dedicated to perception, planning, decision-making, and reflective review, facilitating advanced inference capabilities and comprehensive collaboration awareness, much like a human assistant by your side. We built a dataset of 210 real-world tasks to validate AssistantX, which includes instruction content and status information on whether relevant personnel are available. Extensive experiments were conducted in both text-based simulations and a real office environment over the course of a month and a half. Our experiments demonstrate the effectiveness of the proposed framework, showing that AssistantX can reactively respond to user instructions, actively adjust strategies to adapt to contingencies, and proactively seek assistance from humans to ensure successful task completion. More details and videos can be found at https://assistantx-agent.github.io/AssistantX/.

Abstract:
We address the challenge of object relocation by robots in environments where their behavior is expected to resemble that of humans. Existing methods typically learn to regress the position and orientation of objects specified by natural language commands using training data. However, these approaches do not account for physical constraints during training, often resulting in collisions between relocated objects. In this work, we introduce a collision avoidance loss based on functions that incorporate object size into the training process. Specifically, we propose a type of occupancy function in which particles are represented by a 3D Gaussian probability density function. By incorporating these functions into an additional training phase of existing models, we demonstrate a reduction in the number of collisions during rearrangement tasks. Notably, despite the decrease in collisions, the semantic structure of the relocation results is preserved.
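
One way to picture a size-aware collision term is the closed-form overlap of two isotropic 3D Gaussians, sketched below. Treating each object as a single Gaussian whose standard deviation scales with object size is an assumption made here for illustration. The paper's exact occupancy function and loss are not specified in the abstract, and the function names are hypothetical.

```python
import numpy as np

def gaussian_overlap(mu_a, sigma_a, mu_b, sigma_b):
    """Integral of the product of two isotropic 3D Gaussian densities."""
    var = sigma_a**2 + sigma_b**2
    dist2 = float(np.sum((np.asarray(mu_a, dtype=float) - np.asarray(mu_b, dtype=float)) ** 2))
    return (2.0 * np.pi * var) ** -1.5 * np.exp(-dist2 / (2.0 * var))

def collision_loss(centers, sigmas):
    """Sum the pairwise overlap over all distinct object pairs."""
    total = 0.0
    for i in range(len(centers)):
        for j in range(i + 1, len(centers)):
            total += gaussian_overlap(centers[i], sigmas[i], centers[j], sigmas[j])
    return total

# Two objects of equal (hypothetical) size, placed close together and far apart.
near = collision_loss([[0, 0, 0], [0.1, 0, 0]], [0.2, 0.2])
far = collision_loss([[0, 0, 0], [1.0, 0, 0]], [0.2, 0.2])
```

Because the overlap is smooth in the object centers, a term of this kind can be added to a regression loss and minimized with gradient-based training, pushing predicted placements apart when they intersect.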

Abstract:
Sign language serves individuals with hearing impairments as a crucial communication mode operating through visual-manual means. While theory about embodiment is well established and broadly agreed upon in multiple fields, only limited research has engaged deeply with lowering the barrier to accessing a physical body for spatial perception and engagement. Embodied robots are often cost-prohibitive, and existing open-source robot fabrication packages are limited in their ability to fully address communication nuances, typically running only on predefined programs. Reprogramming for broader bodily interactions, such as gestures in various domains (e.g., construction), is nearly impossible without prior expertise. We introduce FABRIC, an end-to-end toolkit for fabricating and programming bodily language for unique human-robot interactions. The toolkit includes a fully 3D-printable robot, designed for consumer-grade FDM machinery, that learns from demonstration (LfD) to capture and translate users’ bodily expressions through its upper torso (arms and hands) movements. A visual programming interface enables appending or sequencing demonstrations from various sources, i.e., videos, cameras, and expandable word/phrase/sentence libraries.

Abstract:
Incrementally recovering real-sized 3D geometry from a pose-free RGB stream is a challenging task in 3D reconstruction, requiring minimal assumptions on input data. Existing methods can be broadly categorized into end-to-end and visual SLAM-based approaches, both of which either struggle with long sequences or depend on slow test-time optimization and depth sensors. To address this, we first integrate a depth estimator into an RGB-D SLAM system, but this approach is hindered by inaccurate geometric details in predicted depth. Through further investigation, we find that 3D Gaussian mapping can effectively solve this problem. Building on this, we propose an online 3D reconstruction method using 3D Gaussian-based SLAM, combined with a feed-forward recurrent prediction module to directly infer camera pose from optical flow. This approach replaces slow test-time optimization with fast network inference, significantly improving tracking speed. Additionally, we introduce a local graph rendering technique to enhance robustness in feed-forward pose prediction. Experimental results on the Replica and TUM-RGBD datasets, along with a real-world deployment demonstration, show that our method achieves performance on par with the state-of-the-art SplaTAM, while reducing tracking time by more than 90%. Code is available at https://github.com/wangyr22/DepthGS

Abstract:
Physical manipulation of garments is often crucial when performing fabric-related tasks, such as hanging garments. However, due to the deformable nature of fabrics, these operations remain a significant challenge for robots in household, healthcare, and industrial environments. In this paper, we propose GraphGarment, a novel approach that models garment dynamics based on robot control inputs and applies the learned dynamics model to facilitate garment manipulation tasks such as hanging. Specifically, we use graphs to represent the interactions between the robot end-effector and the garment. GraphGarment uses a graph neural network (GNN) to learn a dynamics model that can predict the next garment state given the current state and input action in simulation. To address the substantial sim-to-real gap, we propose a residual model that compensates for garment state prediction errors, thereby improving real-world performance. The garment dynamics model is then applied to a model-based action sampling strategy, where it is utilized to manipulate the garment to a reference pre-hanging configuration for garment-hanging tasks. We conducted four experiments using six types of garments to validate our approach in both simulation and real-world settings. In simulation experiments, GraphGarment achieves better garment state prediction performance, with a prediction error 0.46 cm lower than the best baseline. Our approach also demonstrates improved performance in the garment-hanging simulation experiment—with enhancements of 12%, 24%, and 10%, respectively. Moreover, real-world robot experiments confirm the robustness of sim-to-real transfer, with an error increase of 0.17 cm compared to simulation results. Supplementary material is available at: https://sites.google.com/view/graphgarment.

Abstract:
In robotics, Learning from Demonstration (LfD) aims to transfer skills to robots by using multiple demonstrations of the same task. These demonstrations are recorded and processed to extract a consistent skill representation. This process typically requires temporal alignment through techniques such as Dynamic Time Warping (DTW). In this paper, we consider a novel algorithm, named Spatial Sampling (SS), specifically designed for robot trajectories, that enables time-independent alignment of the trajectories by providing an arc-length parametrization of the signals. This approach eliminates the need for temporal alignment, enhancing the accuracy and robustness of skill representation, especially when recorded movements are subject to intermittent motions or extremely variable speeds, a common characteristic of operations based on kinesthetic teaching, where the operator may encounter difficulties in guiding the end-effector smoothly. To prove this, we built a custom publicly available dataset of robot recordings to test real-world movements, where the user tracks the same geometric path multiple times, with motion laws that vary greatly and are subject to starting and stopping. SS demonstrates better performance than state-of-the-art algorithms in terms of (i) trajectory synchronization and (ii) quality of the extracted skill.
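The arc-length idea can be illustrated with a generic polyline resampler. This is a sketch of plain arc-length reparametrization, not the authors' Spatial Sampling algorithm. It assumes each demonstration is a list of Cartesian points recorded at arbitrary times, and the function name `resample_by_arc_length` is hypothetical.

```python
import math

def resample_by_arc_length(points, num_samples):
    """Return `num_samples` points equally spaced along the path length.

    Timing information is discarded: pauses (repeated points) add no arc
    length, so they do not influence the result.
    """
    if len(points) < 2 or num_samples < 2:
        raise ValueError("need at least two points and two samples")
    # Cumulative arc length at every recorded point.
    cumulative = [0.0]
    for p, q in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.dist(p, q))
    total = cumulative[-1]
    if total == 0.0:
        raise ValueError("path has zero length")
    resampled = []
    seg = 0
    for k in range(num_samples):
        s = total * k / (num_samples - 1)
        # Advance to the segment that contains arc length s.
        while seg < len(cumulative) - 2 and cumulative[seg + 1] < s:
            seg += 1
        seg_len = cumulative[seg + 1] - cumulative[seg]
        t = 0.0 if seg_len == 0.0 else (s - cumulative[seg]) / seg_len
        p, q = points[seg], points[seg + 1]
        resampled.append(tuple(a + t * (b - a) for a, b in zip(p, q)))
    return resampled

# Two recordings of the same straight path: one with a pause, one with
# uneven speed. After resampling they coincide sample by sample.
demo_a = [(0.0, 0.0), (1.0, 0.0), (1.0, 0.0), (1.0, 0.0), (4.0, 0.0)]
demo_b = [(0.0, 0.0), (3.0, 0.0), (3.5, 0.0), (4.0, 0.0)]
aligned_a = resample_by_arc_length(demo_a, 5)
aligned_b = resample_by_arc_length(demo_b, 5)
```

Once every demonstration is indexed by path length instead of time, corresponding samples can be compared directly, which is why no temporal warping step is needed in this simplified setting.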

Abstract:
Reinforcement learning (RL) is a promising approach for robotic navigation, allowing robots to learn through trial and error. However, real-world robotic tasks often suffer from sparse rewards, leading to inefficient exploration and suboptimal policies due to sample inefficiency of RL. In this work, we introduce Confidence-Controlled Exploration (CCE), a novel method that improves sample efficiency in RL-based robotic navigation without modifying the reward function. Unlike existing approaches, such as entropy regularization and reward shaping, which can introduce instability by altering rewards, CCE dynamically adjusts trajectory length based on policy entropy. Specifically, it shortens trajectories when uncertainty is high to enhance exploration and extends them when confidence is high to prioritize exploitation. CCE is a principled and practical solution inspired by a theoretical connection between policy entropy and gradient estimation. It integrates seamlessly with on-policy and off-policy RL methods and requires minimal modifications. We validate CCE across REINFORCE, PPO, and SAC in both simulated and real-world navigation tasks. CCE outperforms fixed-trajectory and entropy-regularized baselines, achieving an 18% higher success rate, 20-38% shorter paths, and 9.32% lower elevation costs under a fixed training sample budget. Finally, we deploy CCE on a Clearpath Husky robot, demonstrating its effectiveness in complex outdoor environments.
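As a rough illustration of entropy-controlled rollout length, the sketch below maps the entropy of a discrete action distribution to a trajectory length. The linear mapping, the bounds `min_len` and `max_len`, and the function names are assumptions made for this example; the abstract does not specify the rule CCE uses.

```python
import math

def policy_entropy(probs):
    """Shannon entropy (in nats) of a discrete action distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def trajectory_length(probs, min_len=16, max_len=128):
    """Short rollouts when entropy is high, long rollouts when it is low.

    Assumption: a linear interpolation on entropy normalized by log(n),
    the maximum entropy of a distribution over n actions.
    """
    n = len(probs)
    if n < 2:
        raise ValueError("need at least two actions")
    normalized = policy_entropy(probs) / math.log(n)
    normalized = min(1.0, max(0.0, normalized))
    return round(max_len - normalized * (max_len - min_len))

uncertain = trajectory_length([0.25, 0.25, 0.25, 0.25])  # uniform policy
confident = trajectory_length([1.0, 0.0, 0.0, 0.0])      # deterministic policy
in_between = trajectory_length([0.7, 0.1, 0.1, 0.1])
```

The reward function is untouched in this scheme; only the number of environment steps collected per rollout changes.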

Abstract:
Advanced scene understanding is crucial for robots to navigate robustly in complex 3D environments. Recent works utilize large Vision-Language Models (VLMs) to embed semantic information into reconstructed maps, thereby creating open-vocabulary scene representations for instance-aware robot navigation. However, existing methods primarily generate point-wise feature vectors for maps, which inadequately capture the intricate scene contents necessary for navigation tasks, including holistic and relational object information. To address this limitation, we propose a novel Dual-Level Open-Vocabulary 3D (DLOV-3D) scene representation framework to improve robot navigation performance. Our framework integrates both pixel-level and image-level features into spatial scene representations, facilitating a more comprehensive understanding of the scene. By incorporating an adaptive revalidation mechanism, DLOV-3D achieves precise instance-aware navigation based on free-form queries that describe object properties such as color, shape, and relational references. Notably, when combined with Large Language Models (LLMs), DLOV-3D supports long-sequence multi-instance robot navigation guided by natural language instructions. Extensive experimental results demonstrate that DLOV-3D achieves new state-of-the-art performance in instance-aware robot navigation.

Abstract:
Achieving robust locomotion on complex terrains remains a challenge due to high-dimensional control and environmental uncertainties. This paper introduces a teacher-prior framework based on the teacher-student paradigm, integrating imitation and auxiliary task learning to improve learning efficiency and generalization. Unlike traditional paradigms that strongly rely on encoder-based state embeddings, our framework decouples the network design, simplifying the policy network and deployment. A high-performance teacher policy is first trained using privileged information to acquire generalizable motion skills. The teacher’s motion distribution is transferred to the student policy, which relies only on noisy proprioceptive data, via a generative adversarial mechanism to mitigate performance degradation caused by distributional shifts. Additionally, auxiliary task learning enhances the student policy’s feature representation, speeding up convergence and improving adaptability to varying terrains. The framework is validated on a humanoid robot, showing a great improvement in locomotion stability on dynamic terrains and significant reductions in development costs. This work provides a practical solution for deploying robust locomotion strategies in humanoid robots.

Abstract:
Humanoid robots encounter considerable difficulties in autonomously recovering from falls, especially within dynamic and unstructured environments. Conventional control methodologies are often inadequate in addressing the complexities associated with high-dimensional dynamics and the contact-rich nature of fall recovery. Meanwhile, reinforcement learning techniques are hindered by issues related to sparse rewards, intricate collision scenarios, and discrepancies between simulation and real-world applications. In this study, we introduce a multi-stage curriculum learning framework, termed HiFAR. This framework employs a staged learning approach that progressively incorporates increasingly complex and high-dimensional recovery tasks, thereby facilitating the robot’s acquisition of efficient and stable fall recovery strategies. Furthermore, it enables the robot to adapt its policy to effectively manage real-world fall incidents. We assess the efficacy of the proposed method using a real humanoid robot, showcasing its capability to autonomously recover from a diverse range of falls with high success rates, rapid recovery times, robustness, and generalization.

Abstract:
Reconstructing dynamic roads from roadside traffic surveillance cameras is crucial for smart cities and digital twin applications. While the latest monocular depth estimation methods demonstrate strong performance, they exhibit instability in roadside scenarios. Existing reconstruction approaches for autonomous driving scenes predominantly adopt vehicle-mounted perspectives, accumulating vehicle point clouds from per-frame depth maps using 3D bounding boxes. These point clouds are used to initialize the center positions and colors of 3D Gaussians to improve reconstruction performance. However, the compressed depth discrepancy between vehicles and road surfaces in roadside views leads to model confusion between vehicle and background depth estimations. To address these challenges, we propose a robust reconstruction framework based on a single fixed RGB traffic camera. Differing from conventional frame-wise depth prediction followed by 3D box-based accumulation, our method processes masked vehicle foreground sequences through existing models, directly predicting complete vehicle point clouds via local feature matching and global alignment while iteratively refining 3D boxes to enhance reconstruction quality. Leveraging the explicit nature of 3D Gaussians for scene editing, we introduce simple yet effective road constraints to mitigate penetration artifacts during scene manipulation. Extensive evaluations on the TUMTraf-V2X and RCooper datasets under monocular roadside settings validate the effectiveness of our approach.

Abstract:
Efficient object packing is a fundamental challenge in logistics and industrial automation. This work introduces Snuggle-Pack, a novel 3D packing algorithm that integrates Fast Fourier Transform (FFT)-based spatial analysis with a multi-heuristic optimization framework to achieve real-time, high-density packing. Unlike traditional heuristic-based approaches that rely on 2D simplifications, our method operates in a fully 3D volumetric space, ensuring collision-free, stable, and physically feasible placements. At its core, our approach employs a proximity-aware and support-sensitive placement strategy, which encourages objects to fit snugly within their surroundings (hence the name), optimizing space utilization through fine-grained collision metrics. We evaluate our method on the YCB and IPA-3D1K datasets in both previewed and ad-hoc packing scenarios. Our experiments show that Snuggle-Pack significantly outperforms the state of the art, achieving up to 25% higher packing densities or, alternatively, accelerating computation by up to 10×. Moreover, our framework allows for dynamic adaptation to custom constraints, such as balanced center of mass, weight limitations on fragile items, and safety proximity constraints. These results highlight Snuggle-Pack as an efficient, flexible, and scalable solution for industrial robotic packing tasks.

Abstract:
Terrain analysis is critical for the practical application of ground mobile robots in real-world tasks, especially in outdoor unstructured environments. In this paper, we propose a novel spatial-temporal traversability assessment method, which aims to enable autonomous robots to effectively navigate through complex terrains. Our approach utilizes sparse Gaussian processes (SGP) to extract geometric features (curvature, gradient, elevation, etc.) directly from point cloud scans. These features are then used to construct a high-resolution local traversability map. Next, we design a spatial-temporal Bayesian Gaussian kernel (BGK) inference method to dynamically evaluate traversability scores, integrating historical and real-time data while considering factors such as slope, flatness, gradient, and uncertainty metrics. GPU acceleration is applied in the feature extraction step, and the system achieves real-time performance. Extensive simulation experiments across diverse terrain scenarios demonstrate that our method outperforms SOTA approaches in both accuracy and computational efficiency. Additionally, we develop an autonomous navigation framework integrated with the traversability map and validate it with a differential-drive vehicle in complex outdoor environments. Our code will be open-sourced for further research and development by the community at https://github.com/ZJU-FAST-Lab/FSGP_BGK.

Abstract:
Hyper-redundant tendon-driven manipulators offer greater flexibility and compliance than traditional manipulators. A common way of controlling such manipulators relies on adjusting tendon lengths, which is an accessible control parameter. This approach works well when the kinematic configuration is representative of the real operational conditions. However, when dealing with manipulators of larger size subject to gravity, it becomes necessary to solve a static force problem, using tendon force as the input and employing a mapping from the configuration space to retrieve tendon length. Alternatively, measurements of the manipulator posture can be used to iteratively adjust tendon lengths to achieve a desired posture. Hence, either tension measurement or state estimation of the manipulator is required, neither of which is always accurately available. Here, we propose a solution by reconciling cable tension and length as the input for the solution of the system forward statics. We develop a screw-based formulation for a tendon-driven, multi-segment, hyper-redundant manipulator with elastic joints and introduce a forward statics iterative solution method that equivalently makes use of either tendon length or tension as the input. This strategy is experimentally validated using a traditional tension input first, subsequently showing the efficacy of the method when only tendon lengths are used. The results confirm that open-loop control in static conditions is possible using a kinematic input only, thus bypassing some of the practical problems with tension measurement and state estimation of hyper-redundant systems.

Abstract:
We present a low-cost, soft tactile sensor using common, easily sourced materials that can be integrated with existing robotic gripper systems without requiring complex fabrication techniques or expensive components. Our approach includes two designs: a flexible linear sensor constructed from a rubber tube and a planar sensor made with a rubber membrane stretched over an enclosure. Both sensors contain an embedded speaker and microphones that leverage active acoustic sensing to map the unique acoustic resonant response of the cavity’s structure to deformations that occur when the robotic gripper is grasping an object. Experimental results demonstrate that, using a support vector machine, the linear sensor achieves contact point estimation with an RMSE of 6 mm, while the planar sensor achieves an RMSE of 0.57–0.62 mm. Additionally, the planar sensor classifies six objects with an accuracy of 97.7%. These results demonstrate the potential for active acoustics to be an accessible method for enabling tactile sensing capabilities for robotic systems.

Abstract:
Robots rely heavily on visual perception to understand and interact with complex environments. To support this capability, modern perception models have become increasingly large and powerful, resulting in high computational costs that hinder their real-time performance in robotic applications. Existing acceleration techniques, such as model pruning and token pruning, focus on reducing architectural or parameter redundancy but still process all object categories, regardless of task requirements. However, in real-world robotic scenarios, different tasks typically require only a subset of object categories. For instance, a service robot may focus on kitchenware while cooking, but shift to furniture and obstacles while cleaning. This task-dependent variation creates opportunities to reduce computational cost by selectively processing relevant information. Existing methods are not designed to exploit this potential for task-specific efficiency. To address this limitation, we propose TaskTP, a task-oriented token pruning method that dynamically adjusts token pruning based on the target category set. A dynamic gating network is introduced between successive Transformer blocks, which evaluates the relevance of each token to the given task. TaskTP allows for more aggressive pruning when fewer categories are required, optimizing computation without sacrificing performance. After a task-agnostic training phase, it can be flexibly configured at deployment time to support any category subset without retraining, making it both efficient and versatile. TaskTP improves the performance of Mask R-CNN from 31.4 fps to 38.5 fps on the COCO dataset. Furthermore, on the ScanNet dataset, where an object search task was defined to simulate real-world robotic applications, processing time was reduced from 3197 ms to 2437 ms, demonstrating significant efficiency gains.

Abstract:
Articulated objects are ubiquitous in daily environments, and effective manipulation of these objects is essential for advancing open-world robotics. Existing approaches, which rely heavily on large-scale data collection or simulation, often face limitations in real-world applications, including issues with generalization and the sim-to-real gap. In this paper, we introduce the Automated Articulated Object Parameter Learning (AAOPL) framework, which autonomously learns the articulation parameters of real-world articulated objects through direct interaction. This approach enables robots to generate precise manipulation trajectories without relying on predefined object models or extensive human demonstration data. To accelerate the learning process, we develop the Accelerated Single-Step Gradient (ASSG) algorithm, which efficiently refines the articulation parameters by leveraging real-time execution feedback. Experimental results demonstrate that AAOPL can learn accurate articulation parameters within 30 minutes (80 trials) and generate robust manipulation trajectories, outperforming baseline methods in terms of both task completion and force efficiency. Our approach eliminates the need for large-scale training datasets and can adapt to various articulated objects in real-world environments, offering a scalable solution for autonomous robotic manipulation in unstructured settings.

Abstract:
This paper presents a Continuous Impact-Minimizing (CIM) rolling locomotion method for Variable-Topology Truss (VTT) robots, addressing limitations of conventional stepwise motion. Traditional VTT locomotion depends on discrete reference transitions, resulting in pauses, slow progress, and unintended ground impacts. Inertia-driven rotation at each step also generates impact forces on joints, raising durability concerns. CIM rolling continuously adjusts joint lengths by tracking the center of gravity in real time, enabling smoother motion and minimizing impacts. This approach allows VTTs to move directly to targets without unnecessary resets. Simulations validate the effectiveness of CIM rolling, demonstrating a 50% increase in speed and a 49% reduction in nodal impact force compared to conventional methods.

Abstract:
Motion planning in dynamic and uncertain real-world environments remains a critical challenge in robotics, as it is essential for the effective operation of autonomous systems. One strategy for motion planning has been to introduce a state lattice where pre-computed motion primitives can be combined with graph-based search methods to find a physically feasible motion plan. However, introducing lattice planning into dynamic, uncertain settings remains challenging. It is nontrivial to incorporate uncertain dynamic information into the planning process in real time. Thus, in this paper we propose a lattice planning framework for dynamic environments with extensions to handle safety-critical edge cases that can arise with the uncertain nature of the environment. The proposed method, Safe Lattice Planner (SLP), extends the Receding-Horizon Lattice Planner (RHLP) with enhanced replanning and survival capabilities to handle the dynamic environment. We thoroughly evaluate SLP in a new benchmark suite against provided baselines. SLP is found to outperform the baselines in terms of safety and resilience in the dynamic environment while reaching the goal state in an efficient manner. We release the benchmark and SLP to accelerate the field of safe robotics.

Abstract:
Localized thermal convective updrafts in the atmosphere, commonly referred to as thermals, serve as a significant source of energy-efficient flight for birds, human pilots, and autonomous aircraft. Measuring the vertical airspeed distribution within these updrafts to estimate their strength and characteristics is a significant empirical challenge. In this study, we introduce a proof-of-concept distributed thermal measurement system that uses small, cost-effective multirotor drones equipped with standard sensors, eliminating the need for specialized airspeed instruments. These drones estimate updraft strength by analyzing performance parameters such as power consumption or rotor rotation speed. First, we conducted extensive investigations in a simulated environment that incorporated varying windy conditions to establish the relationship between the updraft speed (the vertical speed of the air) and the performance parameters of the drones. Following this, we conducted outdoor experiments involving up to 49 multirotor drones to demonstrate the effectiveness of the distributed measurement system in action. By advancing our understanding of thermal updrafts, this research contributes valuable information to analyze avian flight behavior and also facilitates the development of realistic simulation environments. These advancements can enhance the design of thermal-utilizing self-driving algorithms for unmanned aerial vehicles (UAVs), paving the way for more efficient and adaptive robotic systems.

Abstract:
The performance of a visual simultaneous localization and mapping (SLAM) system based on hand-crafted features degrades significantly in harsh environments due to unstable feature tracking. With the breakthrough of convolutional neural networks in deep feature extraction tasks, many researchers have tried to incorporate them into SLAM systems. However, it is challenging to guarantee the real-time performance of the entire SLAM system, and the erroneous usage scenarios limit the superior performance of deep feature extraction methods. To overcome these problems, we propose a visual-inertial SLAM system with multi-level detector and descriptor, called VINS-MLD2. In our framework, we first design an efficient deep feature extraction network that has the same performance as R2D2 by concatenating multi-level features, but runs 3 times faster under the image resolution commonly used in SLAM. Then, based on the camera baseline, we introduce Matching Fusion, a matching method that fuses deep descriptor matching and optical flow matching results to improve matching accuracy for both short and wide baselines. In addition, an adaptive matching strategy is proposed to balance the running time and accuracy by adaptively adjusting the matching method. Experimental results in unmanned aerial vehicle (UAV) deployments and real-world environments demonstrate that the proposed method tracks features more stably and accurately. The code is publicly available at https://github.com/dongdong-cai/VINS-MLD2.

Abstract:
Large Language Models (LLMs) have demonstrated remarkable planning abilities across various domains, including robotic manipulation and navigation. While recent work in robotics deploys LLMs for high-level and low-level planning, existing methods often face challenges with failure recovery and suffer from hallucinations in long-horizon tasks. To address these limitations, we propose a novel multi-agent LLM framework, Multi-Agent Large Language Model for Manipulation (MALMM). Notably, MALMM distributes planning across three specialized LLM agents, namely a high-level planning agent, a low-level control agent, and a supervisor agent. Moreover, by incorporating environment observations after each step, our framework effectively handles intermediate failures and enables adaptive re-planning. Unlike existing methods, MALMM does not rely on pre-trained skill policies or in-context learning examples and generalizes to unseen tasks. In our experiments, MALMM demonstrates excellent performance in solving previously unseen long-horizon manipulation tasks, and outperforms existing zero-shot LLM-based methods in RLBench by a large margin. Experiments with the Franka robot arm further validate our approach in real-world settings.

Abstract:
Advancements in deep multi-agent reinforcement learning (MARL) have positioned it as a promising approach for decision-making in cooperative games. However, it remains challenging for MARL agents to learn cooperative strategies for some game environments. Recently, large language models (LLMs) have demonstrated emergent reasoning capabilities, making them promising candidates for enhancing coordination among the agents. However, due to the model size of LLMs, it can be expensive to frequently query LLMs for the actions that agents can take. In this work, we propose You Only LLM Once for MARL (YOLO-MARL), a novel framework that leverages the high-level task planning capabilities of LLMs to improve the policy learning process of multiple agents in cooperative games. Notably, for each game environment, YOLO-MARL requires only a one-time interaction with LLMs in the proposed strategy generation, state interpretation, and planning function generation modules, before the MARL policy training process. This avoids the ongoing costs and computational time associated with frequent LLM API calls during training. Moreover, trained decentralized policies based on normal-sized neural networks operate independently of the LLM. We evaluate our method across two different environments and demonstrate that YOLO-MARL outperforms traditional MARL algorithms. The GitHub repository of our code can be found at https://github.com/paulzyzy/YOLO-MARL.

Abstract:
Domain adaptation is required for automated driving models to generalize well across diverse road conditions. This paper explores a training method for domain adaptation to adapt PilotNet, an end-to-end deep learning-based model, for left-hand driving conditions using real-world Australian highway data. Four training methods were evaluated: (1) a baseline model trained on U.S. right-hand driving data, (2) a model trained on flipped U.S. data, (3) a model pretrained on U.S. data and then fine-tuned on Australian highways, and (4) a model pretrained on flipped U.S. data and then fine-tuned on Australian highways. This setup examines whether incorporating flipped data enhances the model adaptation by providing an initial left-hand driving alignment. The paper compares model performance regarding steering prediction accuracy and attention, using saliency-based analysis to measure attention shifts across significant road regions. Results show that pretraining on flipped data alone worsens prediction stability due to misaligned feature representations, but significantly improves adaptation when followed by fine-tuning, leading to lower prediction error and stronger focus on left-side cues. To validate this approach across different architectures, the same experiments were done on ResNet, which confirmed similar adaptation trends. These findings emphasize the importance of preprocessing techniques, such as flipped-data pretraining, followed by fine-tuning to improve model adaptation with minimal retraining requirements.
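The flipped-data idea can be sketched as a simple sample transform. The abstract does not describe the exact preprocessing, so this example assumes that "flipped" means a horizontal mirror of each camera frame paired with a sign change of the steering label, and that steering is signed (negative for one direction, positive for the other). The function name `flip_sample` is hypothetical.

```python
def flip_sample(image, steering):
    """Mirror an image left to right and negate its steering label.

    `image` is a list of pixel rows; each row is reversed so that the
    left side of the road appears on the right and vice versa.
    Assumption: steering is a signed value symmetric around zero.
    """
    flipped = [row[::-1] for row in image]
    return flipped, -steering

# Tiny 2x3 "image" with a steering label of 0.2.
frame = [[1, 2, 3],
         [4, 5, 6]]
mirrored, mirrored_steering = flip_sample(frame, 0.2)
restored, restored_steering = flip_sample(mirrored, mirrored_steering)
```

Applying the transform twice returns the original sample, which is a quick sanity check that the image and label are being flipped consistently.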

Abstract:
Although acrobatic flight control has been studied extensively, one key limitation of the existing methods is that they are usually restricted to specific maneuver tasks and cannot change flight pattern parameters online. In this work, we propose a target-and-command-oriented reinforcement learning (TACO) framework, which can handle different maneuver tasks in a unified way and allows online parameter changes. We also propose a spectral normalization method with input-output rescaling to enhance the policy’s temporal and spatial smoothness, independence, and symmetry, thereby overcoming the sim-to-real gap. We validate the TACO approach through extensive simulation and real-world experiments, demonstrating its ability to achieve high-speed, high-accuracy circular flights and continuous multi-flips. The code is available at https://github.com/yinzikang/TACO.

Abstract:
Robot-assisted endovascular navigation provides significant advantages, including reduced radiation exposure for surgeons and improved patient safety. However, a major challenge is to control curvilinear instruments like guidewires precisely for smooth and accurate navigation while adapting to anatomical variations and external forces. Traditional segmentation-based approaches struggle with real-time prediction of the guidewire’s evolving shape, limiting their effectiveness in navigation tasks. In this paper, we propose SplineFormer, an explainable transformer network that predicts the continuous, structured representation of the guidewire as a B-spline. This formulation enables a compact, smooth, and explainable state representation that facilitates downstream navigation. By leveraging SplineFormer’s predictions within an imitation learning framework, our system successfully performs autonomous endovascular navigation. Experimental results show that SplineFormer achieves a 50% success rate when fully autonomously cannulating the Brachiocephalic Artery in a real robotic setup, demonstrating its potential for improved autonomous navigation in endovascular interventions.

Abstract:
Vision-Language-Action Models (VLAs) have demonstrated remarkable generalization capabilities in real-world experiments. However, their success rates are often not on par with expert policies, and they require fine-tuning when the setup changes. In this work, we introduce Refined Policy Distillation (RPD), a novel Reinforcement Learning (RL)-based policy refinement method that bridges this performance gap through a combination of on-policy RL with behavioral cloning. The core idea of RPD is to distill and refine VLAs into compact, high-performing expert policies by guiding the student policy during RL exploration using the actions of a teacher VLA, resulting in increased sample efficiency and faster convergence. We complement our method with fine-tuned versions of Octo and OpenVLA for ManiSkill3 to evaluate RPD in simulation. While this is a key requirement for applying RL, it also yields new insights beyond existing studies on VLA performance in real-world settings. Our experimental results across various manipulation tasks show that RPD enables the RL student to learn expert policies that outperform the VLA teacher in both dense and sparse reward settings, while also achieving faster convergence than the RL baseline. Our approach is even robust to changes in camera perspective and can generalize to task variations that the underlying VLA cannot solve. Our code, dataset, VLA checkpoints, and videos are available at https://refined-policy-distillation.github.io

Abstract:
Imitation Learning (IL) has shown significant promise in autonomous driving, but its performance heavily depends on the quality of training data. Noisy or corrupted sensor inputs can degrade learned policies, leading to unsafe behavior. This paper presents an unsupervised anomaly detection approach to automatically filter out abnormal images from driving datasets, thereby enhancing IL performance. Our method leverages a Convolutional Autoencoder with a novel latent reference loss, which forces abnormal images to reconstruct with higher errors than normal images. This enables effective anomaly detection without requiring manually labeled data. We validate our approach on the realistic DonkeyCar autonomous racing platform, demonstrating that filtering videos significantly improves IL policies, as measured by a 25-40% reduction in cross-track error. Compared to baseline and ablation models, our method achieves superior anomaly detection across three real-world video corruptions: collision-based occlusions, transparent obstructions, and raindrop interference. The results highlight the effectiveness of unsupervised video anomaly detection in improving the robustness and performance of IL-based autonomous control. Video: https://youtu.be/RjJ3nZR6RQ

Abstract:
Visual Place Recognition (VPR) is essential for robotics and autonomous driving, enabling localization by matching current observations with a database of known places. While monocular VPR methods rely on visual features, they are sensitive to environmental changes, and multimodal approaches using LiDAR or radar incur high costs and complexity. Multi-view camera configurations offer a cost-effective alternative by expanding perception range and providing richer structural information. In this work, we propose IMVPR, an implicit BEV-enhanced multi-view place recognition network that achieves consistent and parallel multi-view feature fusion and place descriptor aggregation. Unlike methods that explicitly construct BEV features, we introduce descriptor queries to implicitly represent 3D spatial locations, facilitating spatial point projection-based fusion. A cross-attention mechanism further enables end-to-end multi-view feature aggregation. We evaluate IMVPR on four scenes from the nuScenes dataset, including both in-domain and out-of-domain scenarios, demonstrating its superior accuracy and generalization compared to state-of-the-art methods, including multimodal approaches. Our results highlight the potential of multi-view vision-based methods as a scalable and robust solution for VPR.

Abstract:
In this paper, we present how a proximity array attached as an end-effector can perform a positioning task with respect to a cylinder. We develop the complete modeling of the system from which a classical control law is designed. Positioning with respect to the outside of a cylinder can be used for non-contact inspection of the exterior of pipes in chemical plants or in oil/gas industries, while positioning with respect to the inside of a hollow cylinder can be used for guidance inside a pipe for non-contact inspection. We provide simulation and experimental results to validate the presented theory.

Abstract:
Human Activity Recognition (HAR) plays a crucial role in applications such as healthcare, smart environments, and human-robot interaction. This study proposes a novel Hyperbolic optimization strategy to improve model generalization by leveraging the geometric structure of the parameter space. To evaluate its effectiveness while also benefiting from the capabilities of modern sequence models in capturing long-range dependencies and multimodal interactions, the loss function is integrated into Transformer and GPT-2 models, fine-tuned on both unimodal (UCI-HAR, Opportunity) and multimodal (UTD-MHAD, NTU RGB+D) datasets. Unlike prior work that typically leverages only one to three modalities, this study utilizes all available modalities—RGB, depth, skeleton, and inertial—for multimodal evaluation. The Transformer achieves 98.26% and 93.40% accuracy on UCI-HAR and Opportunity, respectively, and 99.08% and 89.93% on UTD-MHAD and NTU RGB+D. GPT-2 also performs competitively, achieving 86.33% and 83.57% on the unimodal datasets, and 83.23% and 86.51% on the multimodal ones. These results highlight the potential of hyperbolic optimization for HAR across diverse sensor modalities and architectures.

Abstract:
Deep learning-based multi-view stereo (MVS) methods enable dense point cloud reconstruction in texture-rich areas. However, existing methods incur significant computational costs to capture pixel dependencies for complete reconstruction in low-texture regions. Additionally, discrete depth layers in occluded environments hinder the cost volume’s ability to model object information effectively. To address these issues, we propose a spatial-aware multi-view stereo network with attention cost volume, termed SA-MVSNet. The network introduces the pixel-driven spatial interaction (PDSI) module, which integrates the hierarchical spatial location enhancement mechanism (HSLE) and the spatial context aggregation mechanism (SCA). Leveraging an efficient parallel architecture, the PDSI module captures pixel-level spatial dependencies with the HSLE and strengthens global contextual information through the SCA. This design improves the network’s ability to represent features in low-texture regions while maintaining high inference efficiency. Furthermore, SA-MVSNet incorporates an attention weight generation branch that refines the cost volume by aggregating multi-scale depth cues, effectively mitigating the impact of occlusion. Experiments on the DTU dataset and the Tanks and Temples dataset show that our method outperforms other learning-based methods, achieving superior performance and strong generalization ability.

Abstract:
Recent feature matching approaches like ETO have focused on developing lightweight matching algorithms for real-time applications. However, their lack of cross-image feature interaction and of sufficient refinement often leads to a decline in matching accuracy. To address these challenges, we propose ETO+, a novel and accurate feature matching algorithm that incorporates a lightweight yet efficient bidirectional interaction module and multi-stage refinement. Specifically, we introduce Trans-CNN, a bidirectional feature interaction module that integrates CNN- and transformer-based techniques to enhance both intra-image feature refinement and inter-image feature fusion, all while maintaining a comparable computational cost. Furthermore, by leveraging the inherent sparsity of local feature matching, we propose an efficient strategy to adaptively reallocate computational resources within the network. Additionally, we design an adaptive loss function that mitigates the impact of large matching errors, thereby improving overall robustness. Extensive experiments on widely used datasets demonstrate that our approach achieves a strong balance between accuracy and computational efficiency. It outperforms ETO by 7.9 in AUC@5 on MegaDepth while being about 40% faster than E-LoFTR.

Abstract:
Human action recognition (HAR) is a critical task in the field of robotics. Traditionally, HAR methods rely on either perceptual features from RGB images or skeletal features. While RGB-based features are typically represented in 2D Euclidean space, few approaches differentiate between methods developed for RGB data and those for skeletal features, often treating both as Euclidean representations. This conventional approach, which typically leverages standard deep learning techniques, limits the descriptive power of skeletal data, which naturally exhibits a tree-like structure. In this paper, we introduce a novel framework that, for the first time, utilizes skeletal data while preserving its inherent structure to fully capture its descriptive potential. Our proposed deep neural network embeds skeletal joints into hyperbolic space, followed by a spatio-temporal processing framework that incorporates established transformations to optimize performance while maintaining the advantages of hyperbolic analysis. Extensive experiments on publicly available datasets, including UAV-Human, UAV-Gesture, and DHG 14/28, demonstrate that our approach achieves state-of-the-art results, underscoring its ability to enhance robotic systems’ performance in dynamic environments.
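
As a rough illustration of why hyperbolic space suits tree-like skeletal data, the sketch below computes geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not state which hyperbolic model the network uses, so the choice of the Poincaré ball, the function name, and the sample points are assumptions made for illustration only.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance between two points inside the unit Poincare ball."""
    sq_u = sum(c * c for c in u)
    sq_v = sum(c * c for c in v)
    if sq_u >= 1.0 or sq_v >= 1.0:
        raise ValueError("points must lie strictly inside the unit ball")
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    # d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    return math.acosh(1.0 + 2.0 * sq_diff / ((1.0 - sq_u) * (1.0 - sq_v)))


# Hypothetical 2D embeddings: a root joint near the origin, two leaf joints near the rim.
root = (0.0, 0.0)
leaf_a = (0.9, 0.0)
leaf_b = (0.0, 0.9)
d_root_leaf = poincare_distance(root, leaf_a)
d_leaf_leaf = poincare_distance(leaf_a, leaf_b)
```

In this example the leaf-to-leaf distance is close to the sum of the two root-to-leaf distances, which mirrors how paths between leaves in a tree pass through their common ancestor.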

Abstract:
Accurate dynamics modeling is crucial for achieving precise Ground Reaction Force (GRF) control and high-performance legged locomotion. However, real-world legged systems exhibit strong frictional effects with hysteresis and inter-joint coupling, which conventional static friction models or purely data-driven approaches often fail to capture. In this paper, we propose Mysteric-Net, a novel MIMO hysteretic friction-aware network that combines a Lagrangian-based formulation with a Temporal Convolutional Network (TCN). By embedding the physical laws of Lagrangian mechanics while modeling history-dependent frictional dissipation via the TCN, our framework accurately identifies the system dynamics, including complex friction and coupling effects. This paper demonstrates that the proposed method significantly improves the accuracy of inverse dynamics estimation on a robotic leg. Furthermore, this paper shows that the learned model enables the design of an effective feedforward controller that mitigates friction and enhances tracking performance over conventional baseline methods.

Abstract:
Ultrasound scanning is a critical imaging technique for real-time, non-invasive diagnostics. However, variations in patient anatomy and complex human-in-the-loop interactions pose significant challenges for autonomous robotic scanning. Existing ultrasound scanning robots are commonly limited by relatively low generalization and inefficient data utilization. To overcome these limitations, we present UltraDP, a Diffusion-Policy-based method that receives multi-sensory inputs (ultrasound images, wrist camera images, contact wrench, and probe pose) and generates actions that fit the multi-modal action distributions found in autonomous ultrasound scanning of the carotid artery. We propose a specialized guidance module to enable the policy to output actions that center the artery in ultrasound images. To ensure stable contact and safe interaction between the robot and the human subject, a hybrid force-impedance controller is utilized to drive the robot to track such trajectories. Also, we have built a large-scale training dataset for carotid scanning comprising 210 scans with 460k sample pairs from 21 volunteers of both genders. By leveraging our guidance module and DP’s strong generalization ability, UltraDP achieves a 95% success rate in transverse scanning on previously unseen subjects, demonstrating its effectiveness.

Abstract:
In the pursuit of deeper immersion in human-machine interaction, achieving higher-dimensional tactile input and output on a single interface has become a key research focus. This study introduces the Visual-Electronic Tactile (VET) System, which builds upon vision-based tactile sensors (VBTS) and integrates electrical stimulation feedback to enable bidirectional tactile communication. We propose and implement a system framework that seamlessly integrates an electrical stimulation film with VBTS using a screen-printing preparation process, eliminating interference from traditional methods. While VBTS captures multi-dimensional input through visuotactile signals, electrical stimulation feedback directly stimulates neural pathways, preventing interference with visuotactile information. The potential of the VET system is demonstrated through experiments on finger electrical stimulation sensitivity zones, as well as applications in interactive gaming and robotic arm teleoperation. This system paves the way for new advancements in bidirectional tactile interaction and its broader applications.

Abstract:
Efficient and safe integration of robots into human workspaces remains a significant challenge. The ISO 10218-2 standard defines permissible force thresholds that a robot is allowed to exert on humans, along with a simple model to estimate the impact force based on the impact velocity, the involved human body part, and the effective mass of the robot. In this work, we experimentally demonstrate that state-of-the-art approaches fail to compute the effective robot mass accurately, leading to unsafe or overly-restrictive robot behavior. We address this shortcoming by presenting a data-driven collision mass map that accurately predicts the effective mass perceived at the end effector for a given collision location for the entire workspace. These maps are trained using a limited set of impact data selected by our proposed measurement procedure and can serve as valuable references for safety-critical applications. We validate our method on two robots, demonstrating accurate force predictions in compliance with ISO 10218-2. In our experiments, we show that our approach greatly reduces the required force measurements compared to state-of-the-art data-driven methods for risk assessment. Furthermore, our approach allows one to easily integrate different payloads, making it highly adaptable to various collaborative tasks. The proposed collision mass map can be standardized and deployed for any collaborative robot, enabling simple integration of robots for safe and more efficient human-robot interaction.
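
To show why the effective robot mass matters for the force estimate, here is a minimal sketch of a simplified two-body contact model in which the kinetic energy of the reduced mass is fully absorbed by a linear spring representing the body part. This is the form commonly associated with ISO/TS 15066; the abstract does not spell out the formula, so the model, the function names, and all numeric values below are illustrative assumptions rather than values from the standard or the paper.

```python
import math


def reduced_mass(m_human, m_robot):
    """Reduced mass [kg] of the two-body system (body part and effective robot mass)."""
    return 1.0 / (1.0 / m_human + 1.0 / m_robot)


def impact_force(v_rel, m_human, m_robot, k_body):
    """Peak contact force [N]: 0.5 * mu * v^2 = F^2 / (2 * k)  =>  F = v * sqrt(mu * k)."""
    return v_rel * math.sqrt(reduced_mass(m_human, m_robot) * k_body)


def max_velocity(f_limit, m_human, m_robot, k_body):
    """Largest relative speed [m/s] that keeps the peak force at or below f_limit."""
    return f_limit / math.sqrt(reduced_mass(m_human, m_robot) * k_body)


# Illustrative numbers only (not taken from any standard or from the paper).
k_body = 75000.0   # assumed body-part stiffness [N/m]
m_human = 0.6      # assumed effective mass of the body part [kg]
f_limit = 140.0    # assumed permissible force [N]

# The same force limit allows very different speeds depending on the effective robot mass.
v_light = max_velocity(f_limit, m_human, 2.0, k_body)
v_heavy = max_velocity(f_limit, m_human, 20.0, k_body)
```

Under this model, overestimating the effective mass makes the robot slower than necessary, while underestimating it permits speeds that exceed the force limit.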

Abstract:
The rapid growth of the aging population in developed countries makes healthcare an important social challenge. In this context, service robots can play a key role. This work presents a software application for a service robot (TIAGo, PAL Robotics) implementing a motor-cognitive game. The activity combines cognitive and physical stimulation, a design that is relatively uncommon in the literature. The embodied interaction design of the game distinguishes it from classical touchscreen-based games. In the game, the robot mimes letters with its arm, and the user has to recognize and then imitate them. The letter sequence increases in length each turn to train memory. User gestures are tracked using an ArUco marker and classified via a neural network. The application was tested on 10 young subjects and 4 community-dwelling older adults (82.3 ± 3.5 years). Recognition accuracy reached 92.2% and 80.5%, respectively, for young and older adults. Post-session questionnaires highlighted high engagement and perceived usefulness, especially among older users who appreciated the memory and physical training aspects. This pilot project demonstrates the potential of integrating service robots into eldercare to support both patients and caregivers.

Abstract:
Recent work has shown that 3D Gaussian-based SLAM enables high-quality reconstruction, accurate pose estimation, and real-time rendering of scenes. However, these approaches are built on a tremendous number of redundant 3D Gaussian ellipsoids, leading to high memory and storage costs and slow training speed. To address this limitation, we propose a compact 3D Gaussian Splatting SLAM system that reduces the number and the parameter size of Gaussian ellipsoids. A sliding window-based masking strategy is first proposed to reduce the redundant ellipsoids. Then, a novel geometry codebook-based quantization method is proposed to further compress 3D Gaussian geometric attributes. Robust and accurate pose estimation is achieved by a local-to-global bundle adjustment method with reprojection loss. Extensive experiments demonstrate that our method achieves faster training and rendering and lower memory usage while maintaining the state-of-the-art (SOTA) quality of the scene representation.
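
The idea behind codebook-based quantization can be sketched as plain vector quantization: each attribute vector is replaced by the index of its nearest codebook entry, so only small integer indices and one shared codebook need to be stored. The code below is a generic nearest-neighbor sketch with a hypothetical two-entry codebook; how the paper builds its geometry codebook and which attributes it covers are not specified in the abstract.

```python
def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def quantize(vectors, codebook):
    """Replace each vector by the index of its nearest codebook entry."""
    return [
        min(range(len(codebook)), key=lambda j: squared_distance(v, codebook[j]))
        for v in vectors
    ]


def dequantize(indices, codebook):
    """Recover approximate vectors by looking the indices up in the codebook."""
    return [codebook[j] for j in indices]


# Hypothetical per-Gaussian scale vectors and a tiny hand-written codebook.
codebook = [(0.01, 0.01, 0.01), (0.10, 0.02, 0.02)]
scales = [(0.012, 0.009, 0.011), (0.09, 0.025, 0.018), (0.011, 0.010, 0.010)]

indices = quantize(scales, codebook)        # one small integer per Gaussian
restored = dequantize(indices, codebook)    # lossy reconstruction of the scales
```

The saving comes from storing one index per Gaussian instead of a full floating-point vector, at the cost of a reconstruction error bounded by the distance to the nearest codebook entry.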

Abstract:
Adversarial patch defense has made significant progress recently, but defending against natural-looking patches remains a challenge due to their content-agnostic nature. We hypothesize that these patches exhibit position-related anomalies and are inspired by anomaly detection techniques. However, directly applying existing anomaly detection methods to patch detection behaves poorly because patches are a subset of anomalies, and using general anomalous features may introduce irrelevant anomalies. Additionally, anomaly detection datasets are primarily derived from industrial scenes, leading to out-of-distribution issues. To address these challenges, we propose a patch-agnostic defense method based on anomaly knowledge learning. It fine-tunes the Segment Anything Model in a self-supervised manner using an anomaly dataset, enabling the model’s image encoder to generate embeddings with enhanced activation for anomalous regions. It also designs a Cross Attention Patch Decoder based on cross-modal attention mechanisms to compute the mutual information between patch prediction probability maps and anomaly activation maps for patch localization. Our method shows strong performance on public datasets, and achieves a 16.9% mIoU improvement over the state-of-the-art in removing natural-looking patches with patch-to-target ratios over 0.6 on our constructed dataset.

Abstract:
Small object detection (SOD) in aerial images suffers from an information imbalance across different feature scales, which makes accurate SOD extremely challenging. Existing methods, e.g., Feature Pyramid Network (FPN)-based algorithms, focus on extracting high-resolution and low-resolution semantic features from different convolution layers. However, in deeper convolution layers, semantic feature misalignment and the loss of key information are inevitable. To tackle such issues, this work proposes a new encoder-decoder-based SOD framework for aerial images built on a diffusion model and Swin Transformer. First, we reformulate an SOD task as a Noise-to-Box process. We then construct an encoder-decoder-based framework by using a diffusion model and Swin Transformer for dynamic bounding box generation. We introduce a decoupled training and inference strategy to recognize and locate small objects accurately. We finally evaluate the proposed framework on several public benchmarks. The experimental results show that it outperforms the state of the art in SOD. Code is available at https://github.com/BrainPotter/EDSOD.

Abstract:
Current wheeled bipedal robots face significant mobility challenges when traversing discontinuous terrain such as gaps and step-like obstacles, and suffer from substantial dynamic inefficiencies. This paper presents a hybrid self-reconfigurable wheel-legged dual-arm robot equipped with an active docking mechanism, enabling transitions between wheeled bipedal and multi-wheel-legged configurations. Based on a self-developed robotic platform, this work addresses key control challenges in articulated multi-wheel-legged mode and proposes a novel distributed operation paradigm for wheeled bipedal robots. Each module utilizes its manipulators for stable grasping of elevated objects and collaborative tasks, while the multi-unit system achieves efficient, high-load, and stable locomotion. To manage the control complexities in multimodal operation, we develop a unified modular control architecture integrating Virtual Model Control (VMC) and Linear Quadratic Regulator (LQR). For the articulated multi-wheel-legged mode, a body-posture controller regulates global body configuration, and a turning controller adjusts the wheelbase and roll angle via distributed actuation to manage the passive degrees of freedom (DoF) at the articulation points. Experimental validation using a physical prototype confirms the effectiveness and practicality of the proposed approach.

Abstract:
Soft grippers conform to the shape and surface properties of the objects to be grasped, effectively avoiding damage to soft and fragile items. Despite the variety of existing soft gripper designs, their structures lack sufficient flexibility for effectively grasping slender objects or operating in narrow spaces. To address these challenges, we propose a soft gripper with single-finger grasping capabilities, inspired by the structure of crab claws. The structural design and the fabrication method of the gripper are introduced, and the analytical bending model is derived. Experiments are conducted under typical operating conditions to validate the model, and the results indicate that the measured data are in good accordance with the predicted responses. Furthermore, a series of grasping experiments are carried out to test the single-finger grasping capabilities of the proposed soft gripper. The results indicate that the proposed soft gripper can efficiently and stably grasp slender or irregular objects with a single finger. In particular, it demonstrates suitability for operations in narrow spaces and shows potential for handling complex tasks. This innovative design effectively reduces the complexity of the system, while exhibiting promising capabilities in grasping slender or irregular objects and operating within restricted spaces.

Abstract:
In the current era of rapidly changing markets and supply chain fluctuations, manufacturing industries face significant challenges in maintaining agility and resilience. Machinability evaluation is one of the critical steps in manufacturing planning to ensure the industry can respond to rapid market changes or customization demands. However, this process is currently conducted manually and relies heavily on engineers’ knowledge and experience. This paper presents a systematic approach that employs fuzzy logic, which can be automated by a software program, for evaluating the machinability of product features and recommending reconfiguration options for computer numerical control (CNC) machines. This approach assesses each product feature against the current machine kinematics. The validated results demonstrate that the proposed fuzzy logic can generate comprehensive results that reflect varying degrees of machinability. For product features that cannot be machined with the existing machine configurations, this approach identifies specific limitations and provides data-driven recommendations for reconfiguration. This will assist machine operators in making informed decisions, thereby reducing reliance on manual evaluation and planning. Furthermore, this automated evaluation process will enable a shorter turnaround time for new production line setups and enhance the overall operational efficiency, especially in high-variety, low-volume manufacturing environments.

Abstract:
We address the Multi-Robot Motion Planning (MRMP) problem of computing collision-free trajectories for multiple robots in shared continuous environments. While existing frameworks effectively decompose MRMP into single-robot subproblems, spatiotemporal motion planning with dynamic obstacles remains challenging, particularly in cluttered or narrow-corridor settings. We propose Space-Time Graphs of Convex Sets (ST-GCS), a novel planner that systematically covers the collision-free space-time domain with convex sets instead of relying on random sampling. By extending Graphs of Convex Sets (GCS) into the time dimension, ST-GCS formulates time-optimal trajectories in a unified convex optimization that naturally accommodates velocity bounds and flexible arrival times. We also propose Exact Convex Decomposition (ECD) to "reserve" trajectories as spatiotemporal obstacles, maintaining a collision-free space-time graph of convex sets for subsequent planning. Integrated into two prioritized-planning frameworks, ST-GCS consistently achieves higher success rates and better solution quality than state-of-the-art sampling-based planners, often at orders-of-magnitude faster runtimes, underscoring its benefits for MRMP in challenging settings. Project page: https://sites.google.com/view/stgcs.

Abstract:
Offline reinforcement learning can enable policy learning from pre-collected, sub-optimal datasets without online interactions. This makes it ideal for real-world robots and safety-critical scenarios, where collecting online data or expert demonstrations is slow, costly, and risky. However, most existing offline RL works assume the dataset is already labeled with the task rewards, a process that often requires significant human effort, especially when ground-truth states are hard to ascertain (e.g., in the real world). In this paper, we build on prior work, specifically RL-VLM-F, and propose a novel system that automatically generates reward labels for offline datasets using preference feedback from a vision-language model and a text description of the task. Our method then learns a policy using offline RL with the reward-labeled dataset. We demonstrate the system’s applicability to a complex real-world robot-assisted dressing task, where we first learn a reward function using a vision-language model on a sub-optimal offline dataset, and then use the learned reward with Implicit Q-Learning to develop an effective dressing policy. Our method also performs well in simulation tasks involving the manipulation of rigid and deformable objects, and significantly outperforms baselines such as behavior cloning and inverse RL. In summary, we propose a new system that enables automatic reward labeling and policy learning from unlabeled, sub-optimal offline datasets. Videos can be found on our project website.

Abstract:
3D visual grounding is pivotal for enabling intelligent agents to find the target object in the 3D scene given the linguistic descriptions. However, contemporary methods are typically hindered by the scarcity of large-scale 3D datasets with fine-grained annotations and the complexity of modeling spatial relationships in the 3D space. Inspired by the exceptional performance of Vision Foundation Models (VFMs) and Vision Language Models (VLMs), we propose a novel label-free 3D visual grounding method, termed LF-3DVG, that minimizes the heavy reliance on fine-grained annotations and leverages the off-the-shelf vision foundation models for zero-shot 3D visual grounding. Our LF-3DVG is comprised of two main components, i.e., VFM-guided 3D Object Detection and VLM-based 3D Visual Grounding. Specifically, we first utilize SAM3D to generate high-quality instance masks for the objects in the 3D scene. Since SAM3D cannot provide categorical information, we further employ Semantic-SAM to assign class labels for the detected masks. As to the VLM-based 3D Visual Grounding, we first feed multi-view images and textual descriptions to the VLM for 2D visual grounding. To lift the 2D predictions to the 3D space, we design the 2D-3D object association module to effectively match the 2D detection results with the 3D boxes produced by the 3D detector, yielding the ultimate 3D visual grounding results. Extensive experiments on the ScanRefer and Sr3D/Nr3D benchmarks demonstrate that our method consistently outperforms previous approaches. Our algorithm can also boost the performance of 3D visual grounding when given labeled training samples and can be seamlessly integrated into contemporary 3D visual grounding models.

Abstract:
Object re-identification and tracking lay the foundation for various computer vision and robotics applications. In this study, we propose a method for personalizing a neural network to enhance and continuously adapt the re-identification of a specific target. Employing an unsupervised continual learning approach in conjunction with an intelligent image pool collection, we can effectively track the target and mitigate the issue of catastrophic forgetting, a challenge prevalent in this research domain. Our primary goal is to provide a robust person re-identification approach to extend the capabilities of recent tracking frameworks employed in robotics, which we have adopted as our baselines for evaluation. Our results demonstrate our approach’s efficacy in successfully re-identifying the target, even when the target drastically changes his clothing appearance and the baseline frameworks struggle. To optimally tune the framework parameters, we conducted an ablation study and substantiated our findings with saliency maps to elucidate the reasons behind the effectiveness of our approach.

Abstract:
Evaluating autonomous driving systems in complex and diverse traffic scenarios through controllable simulation is essential to ensure their safety and reliability. However, existing traffic simulation methods face challenges in their controllability. To address this, we propose a novel diffusion-based and LLM-enhanced traffic simulation framework. Our approach incorporates a high-level understanding module and a low-level refinement module, which systematically examines the hierarchical structure of traffic elements, guides LLMs to thoroughly analyze traffic scenario descriptions step by step, and refines the generation by self-reflection, enhancing their understanding of complex situations. Furthermore, we propose a Frenet-frame-based cost function framework that provides LLMs with geometrically meaningful quantities, improving their grasp of spatial relationships in a scenario and enabling more accurate cost function generation. Experiments on the Waymo Open Motion Dataset (WOMD) demonstrate that our method can handle more intricate descriptions and generate a broader range of scenarios in a controllable manner.
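
To illustrate what "geometrically meaningful quantities" in a Frenet frame look like, the sketch below converts a Cartesian position into an arc length s along a reference line and a signed lateral offset d. The reference line is approximated by a polyline, and the function name and sign convention (left of the travel direction is positive) are assumptions for this example rather than details taken from the paper.

```python
import math


def to_frenet(point, polyline):
    """Return (s, d): arc length of the closest point on the polyline and signed offset."""
    px, py = point
    best = None          # (distance, s, d) of the closest segment so far
    s_start = 0.0        # arc length at the start of the current segment
    for (x0, y0), (x1, y1) in zip(polyline, polyline[1:]):
        dx, dy = x1 - x0, y1 - y0
        seg_len = math.hypot(dx, dy)
        if seg_len == 0.0:
            continue     # skip duplicate vertices
        # Project the point onto the segment and clamp to its end points.
        u = ((px - x0) * dx + (py - y0) * dy) / (seg_len * seg_len)
        u = max(0.0, min(1.0, u))
        cx, cy = x0 + u * dx, y0 + u * dy
        dist = math.hypot(px - cx, py - cy)
        # The cross product sign tells whether the point is left (+) or right (-).
        cross = dx * (py - y0) - dy * (px - x0)
        d = math.copysign(dist, cross)
        if best is None or dist < best[0]:
            best = (dist, s_start + u * seg_len, d)
        s_start += seg_len
    if best is None:
        raise ValueError("polyline needs at least one non-degenerate segment")
    return best[1], best[2]


# Hypothetical lane centerline: 10 m along +x, then 10 m along +y.
centerline = [(0.0, 0.0), (10.0, 0.0), (10.0, 10.0)]
s, d = to_frenet((3.0, 2.0), centerline)   # 3 m along the lane, 2 m to the left
```

Quantities such as "distance along the lane" and "offset from the centerline" are easier to express in a cost function than raw x and y coordinates, which is the motivation for working in this frame.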

Abstract:
Current autonomous driving systems rely on specialized models for perceiving and predicting motion, which demonstrate reliable performance in standard conditions. However, generalizing cost-effectively to diverse real-world scenarios remains a significant challenge. To address this, we propose Plug-and-Forecast (PnF), a plug-and-play approach that augments existing motion forecasting models with multimodal large language models (MLLMs). PnF builds on the insight that natural language provides a more effective way to describe and handle complex scenarios, enabling quick adaptation to targeted behaviors. We design prompts to extract structured scene understanding from MLLMs and distill this information into learnable embeddings to augment existing behavior prediction models. Our method leverages the zero-shot reasoning capabilities of MLLMs to achieve significant improvements in motion prediction performance, while requiring no fine-tuning—making it practical to adopt. We validate our approach on two state-of-the-art motion forecasting models using the Waymo Open Motion Dataset and the nuScenes Dataset, demonstrating consistent performance improvements across both benchmarks.

Abstract:
Tactile sensors are indispensable in robotic systems because they deliver vital contact information during environmental interactions. In our work, we leverage the variable compliance of a porous material, where different interaction forces induce varying degrees of compliance, to achieve self-adapting tactile sensing. This distinctive non-linear characteristic allows its sensitivity to be automatically tuned over a range from 0.008 mT/N to 0.045 mT/N. After coating with a magnetic polymer, the porous material functions as a 3-axis magnetic sensing medium. Its length and width are set at 30 mm and 35 mm respectively to accommodate the printed circuit board. To preserve the overall measuring range, it is designed with a thickness of 15 mm. This thickness enables monitoring of the volumetric changes due to the enhanced compliance, which is suitable for three-dimensional shape recognition. In this work, we present the design, fabrication, experimental characterization, and applications of a lightweight 3-axis magnetic sponge sensor with overall dimensions of 30 mm (width) × 35 mm (length) × 17 mm (height) and a detection range of 60 N. Notably, the sensing material weighs only 2 g, thanks to its porous structure.

Abstract:
Recent progress in stereo-based 3D Gaussian Splatting (3DGS) SLAM has enabled small-scale robots, which are too small to carry depth cameras, to achieve localization and reconstruct photorealistic scenes with high-speed rendering. However, initializing 3D Gaussians from binocular vision still requires further improvement, and the potential of robot proprioception has not been fully leveraged. This work presents a robust stereo 3DGS SLAM with efficient inertial-legged fusion for small-scale quadruped robots (SaQu-SLAM). We develop a lightweight network to densely initialize the 3D Gaussians in the space. In addition, an efficient fusion method for inertial and legged encoder data based on a Kalman filter is introduced. To improve the cross-platform generalization of our algorithm, multiple configuration combinations of these three types of sensors are provided. Moreover, we propose a mode-switching mechanism to handle intermittent visual failures. Finally, we perform evaluation on a benchmark dataset, which includes large- and small-scale scenes, and a small quadruped robot in real-world confined-scale scenes, reducing the absolute trajectory error by an average of 19%, 13% and 25% respectively, when compared with other state-of-the-art methods in a similar context. It is also the only successful method in our self-customized confined mixed textured and textureless scene, whereas all vision-based or visual-inertial methods fail. Our system achieves real-time performance even on an embedded platform (Jetson AGX Orin).

Abstract:
This paper proposes a model-free steering control method to address the path tracking challenges of Hydraulic Center-articulated Scooptrams (HCS) in narrow underground mining environments. Due to the nonlinear and time-delay characteristics of the hydraulic steering system, the HCS exhibits response lag when executing control commands. The lag time demonstrates dynamic uncertainty influenced by operating conditions, hydraulic pressure, and load variations. To address this challenge, an adaptive steering control strategy is designed. This strategy leverages the geometric relationship between the HCS and the reference path to dynamically adjust the look-ahead distance, thereby compensating for the uncertainty caused by the hydraulic system lag. Additionally, the error is mapped to the actual control input in real-time through a feedback error controller, effectively correcting control errors caused by lag without relying on a complex hydraulic system model. The proposed method was experimentally validated in a full-scale simulated mining tunnel, demonstrating considerable robustness and precise path tracking performance under uneven terrain, heavy loads, significant initial error, and bidirectional movement. This method provides a viable solution for the autonomous navigation of the HCS.

Abstract:
We introduce MotionScript, a novel framework for generating highly detailed, natural language descriptions of 3D human motions. Unlike existing motion datasets that rely on broad action labels or generic captions, MotionScript provides fine-grained, structured descriptions that capture the full complexity of human movement—including expressive actions (e.g., emotions, stylistic walking) and interactions beyond standard motion capture datasets. MotionScript serves as both a descriptive tool and a training resource for text-to-motion models, enabling the synthesis of highly realistic and diverse human motions from text. By augmenting motion datasets with MotionScript captions, we demonstrate significant improvements in out-of-distribution motion generation, allowing large language models (LLMs) to generate motions that extend beyond existing data. Additionally, MotionScript opens new applications in animation, virtual human simulation, and robotics, providing an interpretable bridge between intuitive descriptions and motion synthesis. To the best of our knowledge, this is the first attempt to systematically translate 3D motion into structured natural language without requiring training data. Code, dataset, and examples are available at https://pjyazdian.github.io/MotionScript

Abstract:
Harnessing low-light enhancement and domain adaptation, nighttime UAV tracking has made substantial strides. However, over-reliance on image enhancement, limited high-quality nighttime data, and a lack of integration between daytime and nighttime trackers hinder the development of an end-to-end trainable framework. Additionally, current ViT-based trackers demand heavy computational resources due to their reliance on the self-attention mechanism. In this paper, we propose a novel pure Mamba-based tracking framework (MambaNUT) that employs a state space model with linear complexity as its backbone, incorporating a single-stream architecture that integrates feature learning and template-search coupling within Vision Mamba. We introduce an adaptive curriculum learning (ACL) approach that dynamically adjusts sampling strategies and loss weights, thereby improving the model’s generalization ability. Our ACL is composed of two levels of curriculum schedulers: (1) a sampling scheduler that transforms the data distribution from imbalanced to balanced, as well as from easier (daytime) to harder (nighttime) samples; (2) a loss scheduler that dynamically assigns weights based on the size of the training set and the IoU of individual instances. Exhaustive experiments on multiple nighttime UAV tracking benchmarks demonstrate that the proposed MambaNUT achieves state-of-the-art performance while requiring lower computational costs. The code will be available at https://github.com/wuyou3474/MambaNUT.
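
The idea of a sampling scheduler that moves from imbalanced to balanced data and from daytime to nighttime samples can be sketched as a probability ramp. This is an illustrative assumption, not the MambaNUT scheduler: the function names, the linear schedule, and the start and end probabilities are hypothetical.

```python
import random


def night_probability(step, total_steps, start=0.1, end=0.5):
    """Probability of drawing a nighttime sample at a given training step.

    Ramps linearly from `start` (mostly daytime, imbalanced) to `end`
    (0.5 means a balanced day/night mix). Both values are assumed.
    """
    progress = min(max(step / total_steps, 0.0), 1.0)
    return start + (end - start) * progress


def sample_domain(step, total_steps, rng):
    """Pick which domain the next training sample is drawn from."""
    if rng.random() < night_probability(step, total_steps):
        return "night"
    return "day"


rng = random.Random(0)
early = [sample_domain(0, 1000, rng) for _ in range(5)]
late = [sample_domain(1000, 1000, rng) for _ in range(5)]
```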

Abstract:
Multimodal 3D object detectors leverage the strengths of both geometry-aware LiDAR point clouds and semantically rich RGB images to enhance detection performance. However, the inherent heterogeneity between these modalities, including unbalanced convergence and modal misalignment, poses significant challenges. Meanwhile, the large size of the detection-oriented feature also constrains existing fusion strategies to capture long-range dependencies for the 3D detection tasks. In this work, we introduce a fast yet effective multimodal 3D object detector, incorporating our proposed Instance-level Contrastive Distillation (ICD) framework and Cross Linear Attention Fusion Module (CLFM). ICD aligns instance-level image features with LiDAR representations through object-aware contrastive distillation, ensuring fine-grained cross-modal consistency. Meanwhile, CLFM presents an efficient and scalable fusion strategy that enhances cross-modal global interactions within sizable multimodal BEV features. Extensive experiments on the KITTI and nuScenes 3D object detection benchmarks demonstrate the effectiveness of our methods. Notably, our 3D object detector outperforms state-of-the-art (SOTA) methods while achieving superior efficiency. The implementation of our method has been released as open-source at: https://github.com/nubot-nudt/ICD-Fusion.

Abstract:
Embodied Question Answering (EQA) is an essential yet challenging task for robot assistants. Large vision-language models (VLMs) have shown promise for EQA, but existing approaches either treat it as static video question answering without active exploration or restrict answers to a closed set of choices. These limitations hinder real-world applicability, where a robot must explore efficiently and provide accurate answers in open-vocabulary settings. To overcome these challenges, we introduce EfficientEQA, a novel framework that couples efficient exploration with free-form answer generation. EfficientEQA features three key innovations: (1) Semantic-Value-Weighted Frontier Exploration (SFE) with Verbalized Confidence (VC) from a black-box VLM to prioritize semantically important areas to explore, enabling the agent to gather relevant information faster; (2) a BLIP relevancy-based mechanism to stop adaptively by flagging highly relevant observations as outliers to indicate whether the agent has collected enough information; and (3) a Retrieval-Augmented Generation (RAG) method for the VLM to answer accurately based on pertinent images from the agent’s observation history without relying on predefined choices. Our experimental results show that EfficientEQA achieves over 15% higher answer accuracy and requires over 20% fewer exploration steps than state-of-the-art methods. Our code is available at: https://github.com/chengkaiAcademyCity/EfficientEQA
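
The two exploration ideas above, weighting frontiers by semantic value and stopping when a highly relevant observation stands out as an outlier, can be sketched as follows. This is a minimal sketch under assumptions: the function names, the plain normalization of values, and the z-score rule with its threshold are hypothetical and are not taken from the EfficientEQA code.

```python
import statistics


def frontier_probabilities(semantic_values):
    """Normalize non-negative semantic values into sampling probabilities."""
    total = sum(semantic_values)
    if total <= 0:
        # No semantic signal: fall back to uniform exploration.
        return [1.0 / len(semantic_values)] * len(semantic_values)
    return [value / total for value in semantic_values]


def should_stop(relevancy_history, z_threshold=2.0):
    """Return True when the newest relevancy score is a high outlier."""
    if len(relevancy_history) < 4:
        return False  # too few observations to estimate the spread
    *past, latest = relevancy_history
    mean = statistics.mean(past)
    std = statistics.pstdev(past)
    if std == 0:
        return latest > mean
    return (latest - mean) / std > z_threshold


probs = frontier_probabilities([1.0, 3.0])    # second frontier is favored
stop = should_stop([0.10, 0.12, 0.11, 0.90])  # last view is far more relevant
```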

Abstract:
The transition of seaweed farming to an alternative food source on an industrial scale relies on automating its processes through smart farming, equivalent to land agriculture. Key to this process are autonomous underwater vehicles (AUVs), owing to their capacity to automate crop and structural inspections. However, the current bottleneck for their deployment is ensuring safe navigation within farms, which requires an accurate, online estimate of the AUV pose and a map of the infrastructure. To enable this, we propose an efficient side-scan sonar (SSS)-based simultaneous localization and mapping (SLAM) framework that exploits the geometry of kelp farms by modeling structural ropes in the back-end as sequences of individual landmarks from each SSS ping detection, instead of combining detections into elongated representations. Our method outperforms state-of-the-art solutions in hardware-in-the-loop (HIL) experiments on a real AUV survey in a kelp farm. The framework and dataset can be found at https://github.com/julRusVal/sss_farm_slam.

Abstract:
Vision-Language Place Recognition (VLPR) enhances robot localization performance by incorporating natural language descriptions from images. By utilizing language information, VLPR directs robot place matching, overcoming the constraint of solely depending on vision. However, general multimodal information integration methods are not well equipped to capture the dynamics of cross-modal interactions, especially in the presence of complex intra-modal and inter-modal correlations. To this end, this paper proposes a novel coarse-to-fine and end-to-end connected cross-modal place recognition framework, called MambaPlace. In the coarse-localization stage, the text description and 3D point cloud are encoded by the pre-trained T5 and instance encoder, respectively. They are then processed using Text-Attention Mamba (TAM) and Point Cloud Multi-Strategy Scanning Mamba (MSSM), with the latter mimicking the eye’s focusing mechanism, for data enhancement and alignment. In the subsequent fine-localization stage, the features of the text description and 3D point cloud are cross-modally fused and further enhanced through Cascaded Cross-Attention Mamba (CCAM). Finally, we predict the positional offset from the fused text-point cloud features, achieving the most accurate localization. Extensive experiments show that MambaPlace achieves improved localization accuracy on the KITTI360Pose dataset compared to the state-of-the-art methods. Specifically, as shown in Fig. 1, when ϵ<5, MambaPlace achieves 5% higher test accuracy compared to the existing state-of-the-art.

Abstract:
Unrestricted multi-agent racing presents a significant research challenge, requiring decision-making at the limits of a robot's operational capabilities. While previous approaches have either ignored spatiotemporal information in the decision-making process or been restricted to single-opponent scenarios, this work enables arbitrary multi-opponent head-to-head racing while considering the opponents' future intent. The proposed method employs a Kalman Filter (KF)-based multi-opponent tracker to effectively perform opponent Re-Identification (reID) by associating them across observations. Simultaneously, spatial and velocity Gaussian Process Regression (GPR) is performed on all observed opponent trajectories, providing predictive information to compute the overtaking maneuvers. This approach has been experimentally validated on a physical 1:10 scale autonomous racing car achieving an overtaking success rate of up to 91.65% and demonstrating an average 10.13%-point improvement in safety at the same speed as the previous State-of-the-Art (SotA). These results highlight its potential for high-performance autonomous racing.

Abstract:
Circular accessible depth (CAD) provides a lightweight and robust traversability representation for autonomous navigation of unmanned ground vehicles (UGV). Aiming at the limitations of existing LiDAR-based methods in detecting low-thickness targets and executing semantic reasoning, we propose VCADNet, a vision-based neural network for circular accessible depth prediction. VCADNet comprises three core components: a geometry-based query module for multi-view bird’s eye view feature extraction, a polar coordinate transformation for CAD alignment, and a multi-scale U-Net architecture for depth prediction. In addition, we present a cross-modal contrastive learning scheme to enhance the spatial reasoning of VCADNet, which transfers knowledge from LiDAR-based encoders to vision-based counterparts. Extensive experiments demonstrate the superior performance of VCADNet in various UGV perception tasks.
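
To make the polar representation concrete, the sketch below builds a CAD-like vector from 2D obstacle points. It assumes CAD can be read as the range to the nearest obstacle in each bearing bin around the vehicle; that reading, the function name, and the bin count are assumptions for illustration, and VCADNet predicts this quantity from images rather than computing it from points.

```python
import math


def circular_accessible_depth(obstacles_xy, num_bins=8, max_range=10.0):
    """Per-bearing free range around the vehicle at the origin.

    Each obstacle point (x, y) in the vehicle frame is assigned to an
    angular bin, and the bin keeps the smallest range seen so far.
    """
    depth = [max_range] * num_bins
    bin_width = 2.0 * math.pi / num_bins
    for x, y in obstacles_xy:
        r = math.hypot(x, y)
        theta = math.atan2(y, x) % (2.0 * math.pi)  # wrap into [0, 2*pi)
        idx = min(int(theta / bin_width), num_bins - 1)
        depth[idx] = min(depth[idx], r)
    return depth


cad = circular_accessible_depth(
    [(2.0, 1.0), (1.0, 0.5), (-1.0, 3.0)], num_bins=4
)
```

Bins without any obstacle keep `max_range`, which marks that direction as free up to the sensing limit.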

Abstract:
In this work, we decouple calibrated dual-view 3D human pose estimation (HPE) into the well-studied problems of 2D pose estimation and 2D-to-3D pose lifting, focusing on the latter task. The key challenges are that 1) the 2D pose is noisy and unreliable due to occlusion and motion blur, and 2) the trained model cannot generalize well to unseen camera configurations. To overcome these limitations, we propose three interconnected innovations: First, a Normalized Triangulation that transforms the 2D pose from pixel space to 3D normalized rays, which makes our approach robust to changes in camera parameters. Second, a hybrid neural-geometry framework (i.e., including refinement and triangulation) that explicitly incorporates multi-view geometry into our models. Third, an analytical inverse kinematics (AnalyIK) solver that decomposes articulated motion with human topology, which simultaneously considers the symmetry constraint and joint angle limits. Experiments show that the proposed framework achieves state-of-the-art performance on two widely used benchmarks (i.e., Human3.6M and HumanEva-I). Code is available at: https://github.com/Z-Z-J/Normalized-Triangulation.
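
The step from pixel space to normalized rays can be sketched with a pinhole camera model. This is an illustration under assumptions, not the released code: the function name and the distortion-free pinhole intrinsics (`fx`, `fy`, `cx`, `cy`) are hypothetical choices.

```python
import math


def pixel_to_ray(u, v, fx, fy, cx, cy):
    """Back-project a pixel to a unit-length ray in the camera frame.

    Assumes an ideal pinhole camera without lens distortion.
    """
    x = (u - cx) / fx  # normalized image coordinates
    y = (v - cy) / fy
    norm = math.sqrt(x * x + y * y + 1.0)
    return (x / norm, y / norm, 1.0 / norm)


# A joint detected at the principal point maps to the optical axis.
center_ray = pixel_to_ray(320.0, 240.0, 500.0, 500.0, 320.0, 240.0)
```

Because the ray no longer depends on focal length or principal point, the same 2D joint seen by cameras with different intrinsics yields comparable inputs for the lifting stage.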

Abstract:
Multi-task reinforcement learning (MTRL) offers a promising approach to improve sample efficiency and generalization by training agents across multiple tasks, enabling knowledge sharing between them. However, applying MTRL to robotics remains challenging due to the high cost of collecting diverse task data. To address this, we propose MT-Lévy, a novel exploration strategy that enhances sample efficiency in MTRL environments by combining behavior sharing across tasks with temporally extended exploration inspired by Lévy flight [1]. MT-Lévy leverages policies trained on related tasks to guide exploration towards key states, while dynamically adjusting exploration levels based on task success ratios. This approach enables more efficient state-space coverage, even in complex robotics environments. Empirical results demonstrate that MT-Lévy significantly improves exploration and sample efficiency, supported by quantitative and qualitative analyses. Ablation studies further highlight the contribution of each component, showing that combining behavior sharing with adaptive exploration strategies can significantly improve the practicality of MTRL in robotics applications.
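
Temporally extended exploration inspired by Lévy flight is usually understood as repeating an exploratory action for a duration drawn from a heavy-tailed distribution. The sketch below samples such durations by inverse-transform sampling from a Pareto law; the function name, the exponent `alpha`, and the cap `max_steps` are assumptions for illustration and are not taken from MT-Lévy.

```python
import random


def levy_duration(rng, alpha=1.5, max_steps=50):
    """Sample how many steps an exploratory action is repeated.

    Inverse-transform sampling from a Pareto distribution with minimum 1:
    most durations are short, but long ones occur occasionally.
    """
    u = rng.random()                      # uniform in [0, 1)
    length = (1.0 - u) ** (-1.0 / alpha)  # heavy-tailed, always >= 1
    return int(min(length, float(max_steps)))


rng = random.Random(0)
durations = [levy_duration(rng) for _ in range(1000)]
```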

Abstract:
In offline reinforcement learning, value overestimation caused by out-of-distribution (OOD) actions significantly limits policy performance. Recently, diffusion models have been leveraged for their strong distribution-matching capabilities, enforcing conservatism through behavior policy constraints. However, existing methods often apply indiscriminate regularization to redundant actions in low-quality datasets, resulting in excessive conservatism and an imbalance between the expressiveness and efficiency of diffusion modeling. To address these issues, we propose DIffusion policies with Value-conditional Optimization (DIVO), a novel approach that leverages diffusion models to generate high-quality, broadly covered in-distribution state-action samples while facilitating efficient policy improvement. Specifically, DIVO introduces a binary-weighted mechanism that utilizes the advantage values of actions in the offline dataset to guide diffusion model training. This enables a more precise alignment with the dataset’s distribution while selectively expanding the boundaries of high-advantage actions. During policy improvement, DIVO dynamically filters high-return-potential actions from the diffusion model, effectively guiding the learned policy toward better performance. This approach achieves a critical balance between conservatism and explorability in offline RL. We evaluate DIVO on the D4RL benchmark and compare it against state-of-the-art baselines. Empirical results demonstrate that DIVO achieves superior performance, delivering significant improvements in average returns across locomotion tasks and outperforming existing methods in the challenging AntMaze domain, where sparse rewards pose a major difficulty.

Abstract:
This paper presents a novel feature alignment strategy for cross-view geo-localization to bridge the perspective gap between ground and satellite images. Existing methods for cross-view geo-localization often overlook factors such as occlusion and distortion errors caused by viewpoint transformation. These issues lead to reduced accuracy in complex scenes. To address them, we propose a framework comprising two novel components: a perspective-driven attention fusion (PDAF) module that aligns ground and satellite features through cross-view semantic correlation, effectively preserving structural consistency during view transformation; and a projection-stable patch-guided pose optimizer (PSPG) that enhances geometric reliability by selectively focusing on projection-stable patches to refine pose estimation. The PDAF module mitigates information loss through attention fusion between ground and bird’s-eye-view (BEV) feature map representations, while the PSPG refines pose estimation by dynamically suppressing unstable features through geometrically unstable token merging. Comprehensive evaluations on KITTI and Ford Multi-AV datasets demonstrate our method’s superiority in orientation estimation and competitive location accuracy compared to state-of-the-art approaches. Qualitative results further confirm the framework’s robustness in complex localization scenarios. The code is available at https://github.com/RobVisLab-NJUST/CVLGSA

Abstract:
The recent 3D Gaussian Splatting simultaneous localization and mapping (3DGS-SLAM) has achieved high-fidelity reconstruction from RGB-D images. However, the input lacks edge information about the scene, so the Gaussians cannot model edges accurately and artifacts appear at object edges. In addition, 3DGS-SLAM relies on a point-based representation that produces massive numbers of Gaussians to obtain a detailed map, causing low rendering speed and high storage cost. To overcome these shortcomings, we propose a novel SLAM framework with an edge-prior constraint: edge attributes are added to the Gaussians to express edge information, and an edge loss is introduced so that the Gaussians reconstruct edges accurately and suppress artifacts. Furthermore, we propose graph signal processing for local Gaussians to establish relationships among irregular Gaussians and extract geometric features from Gaussian scene representations, which are used to efficiently reduce redundant Gaussians without sacrificing tracking and reconstruction performance. Experiments performed on synthetic and real-world datasets show that our method achieves over 2× compression in memory usage and increases rendering speed by nearly 250% while maintaining tracking and mapping performance. Additional information can be found on our project page: yoona12.github.io/MeGS-SLAM.github.io

Abstract:
Unmanned aerial vehicle (UAV) swarms find extensive applications in diverse fields, including search and rescue, logistics delivery, and environmental surveillance, necessitating meticulous task and temporal scheduling to meet intricate spatiotemporal requirements. A market-based strategy is a suitable option for self-organizing swarm coordination. However, the consensus mechanisms employed by most market-based algorithms require synchronous communication, leading to waiting times. Researchers have turned to asynchronous approaches for enhanced efficiency, yet the communication burden of existing asynchronous methods escalates swiftly as the swarm size grows. Therefore, this paper proposes an Asynchronous Harmony-based Decentralized Auctions (AHDA) method for networked UAV swarms to reduce the communication load and scheduling time required by a market-based approach. First, proximity communication is proposed to reduce the broadcast range and content of UAVs. Second, new conflict resolution protocols are designed to eliminate task conflicts between UAVs faster. Third, propagation rules are designed to limit the scope of task information diffusion. Together, these measures reduce communication load and scheduling time because AHDA targets only the minimum requirement of no task conflict between UAVs, rather than swarm-wide scheduling consistency. Monte Carlo simulations spanning 32 to 128 UAVs demonstrate that compared with the Asynchronous Consensus-Based Bundle Algorithm (ACBBA), the proposed AHDA achieves reductions of up to 70.16% in transmitted messages, 75.78% in communication traffic, and 63.12% in scheduling time.

Abstract:
Imitation Learning offers a promising approach to learn directly from data without requiring explicit models, simulations, or detailed task definitions. During inference, actions are sampled from the learned distribution and executed on the robot. However, sampled actions may fail for various reasons, and simply repeating the sampling step until a successful action is obtained can be inefficient. In this work, we propose an enhanced sampling strategy that refines the sampling distribution to avoid previously unsuccessful actions. We demonstrate that by solely utilizing data from successful demonstrations, our method can infer recovery actions without the need for additional exploratory behavior or a high-level controller. Furthermore, we leverage the concept of diffusion model decomposition to break down the primary problem—which may require long-horizon history to manage failures—into multiple smaller, more manageable sub-problems in learning, data collection, and inference, thereby enabling the system to adapt to variable failure counts. Our approach yields a low-level controller that dynamically adjusts its sampling space to improve efficiency when prior samples fall short. We validate our method across several tasks, including door opening with unknown directions, object manipulation, and button-searching scenarios, demonstrating that our approach outperforms traditional baselines. Supplementary materials for this paper are available on our website: https://hri-eu.github.io/ccdp/.

Abstract:
Multi-agent pathfinding (MAPF) aims to compute conflict-free paths for multiple agents in shared environments. Traditional methods, such as conflict-based search (CBS), guarantee optimality but suffer from high computational costs due to constraint tree expansion. Learning-based approaches improve efficiency but often compromise solution quality. We propose proactive conflict-aware prediction (PCAP), which improves CBS by predicting conflict-prone areas based on constraint data. This approach enables a more informed constraint application, reducing unnecessary expansions while preserving optimality. Experimental results show that PCAP reduces computation time by 40% compared to CBS while maintaining solution quality, making it a scalable and effective approach for complex MAPF scenarios.
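
For readers unfamiliar with CBS, the high-level search repeatedly looks for the first conflict among the agents' current paths and branches on it, which is what drives constraint tree expansion. The sketch below shows only that conflict check for vertex conflicts; it is generic CBS background, not PCAP itself, and the function names and the wait-at-goal convention are assumptions.

```python
from itertools import combinations


def position_at(path, t):
    """Cell occupied at time t; an agent waits at its goal after arriving."""
    return path[min(t, len(path) - 1)]


def first_vertex_conflict(paths):
    """Return (time, agent_i, agent_j, cell) for the earliest vertex conflict.

    Returns None when no two agents ever occupy the same cell at the same
    time. Edge (swap) conflicts are not checked in this sketch.
    """
    horizon = max(len(path) for path in paths)
    for t in range(horizon):
        for i, j in combinations(range(len(paths)), 2):
            cell = position_at(paths[i], t)
            if cell == position_at(paths[j], t):
                return (t, i, j, cell)
    return None


paths = [
    [(0, 0), (0, 1), (0, 2)],
    [(1, 1), (0, 1), (-1, 1)],
]
conflict = first_vertex_conflict(paths)  # both agents reach (0, 1) at t = 1
```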

Abstract:
Optimization-based methods are widely used for computing fast, diverse solutions for complex tasks such as collision-free movement or planning in the presence of contacts. However, most of these methods require enforcing non-penetration constraints between objects, resulting in a nontrivial and computationally expensive problem. This makes the use of optimization-based methods for planning and control challenging. In this paper, we present a method to efficiently enforce non-penetration of sets while performing optimization over their configuration, which is directly applicable to problems like collision-aware trajectory optimization. We introduce novel differentiable conditions with analytic expressions to achieve this. To enforce non-collision between non-smooth bodies using these conditions, we introduce a method to approximate polytopes as smooth semi-algebraic sets. We present several numerical experiments to demonstrate the performance of the proposed method and compare the performance with other baseline methods recently proposed in the literature.

Abstract:
Conventional in-ear electrodes face challenges such as inconsistent skin contact, motion artifacts, and discomfort, limiting their reliability in dynamic conditions. To overcome these limitations, this study presents a body-temperature responsive in-ear balloon actuator (BBA) based on liquid-to-vapor phase change, integrated with silver microneedle electrodes for stable electrophysiological monitoring. The dual-layer balloon structure encapsulates a phase-change fluid core, expanding at 36–37°C to ensure conformal skin contact while minimizing motion artifacts and mechanical pressure. Among 3 tested designs, the dual-material design proved optimal, achieving a peak insertion force of 0.05 N and maintaining impedance fluctuations below 5%. A wearable system was further validated through dynamic tests, demonstrating a 20% reduction in motion artifacts compared to conventional electrodes. These findings highlight the actuator’s potential for stable and comfortable wearable electrophysiological monitoring.

Abstract:
Drivable free-space prediction is a fundamental and crucial problem in autonomous driving. Recent works have addressed the problem by representing the entire non-obstacle road region as free space. In contrast, our aim is to estimate the driving corridors that are a navigable subset of the entire road region. Unfortunately, existing corridor estimation methods directly assume a BEV-centric representation, which is hard to obtain. Instead, we frame drivable free-space corridor prediction as a pure image perception task, using only monocular camera input. However, such a formulation poses several challenges, as no corresponding data are available for such free-space corridor segments in the image. Consequently, we develop a novel self-supervised approach for free-space sample generation by leveraging future ego trajectories and front-view camera images, making the process of visual corridor estimation dependent on the ego trajectory. We then employ a diffusion process to model the distribution of such segments in the image. However, the existing binary mask-based representation for a segment poses many limitations. Therefore, we introduce ContourDiff, a specialized diffusion-based architecture that denoises over contour points rather than relying on binary mask representations, enabling structured and interpretable free-space predictions. We evaluate our approach qualitatively and quantitatively on both nuScenes and CARLA, demonstrating its effectiveness in accurately predicting safe multimodal navigable corridors in the image.

Abstract:
Traffic congestion remains a significant challenge in modern urban networks. Autonomous driving technologies have emerged as a potential solution. Among traffic control methods, reinforcement learning has shown superior performance over traffic signals in various scenarios. However, prior research has largely focused on small-scale networks or isolated intersections, leaving large-scale mixed traffic control largely unexplored. This study presents the first attempt to use decentralized multi-agent reinforcement learning for large-scale mixed traffic control in which some intersections are managed by traffic signals and others by robot vehicles (RVs). Evaluating a real-world network in Colorado Springs, CO, USA with 14 intersections, we measure traffic efficiency via the average waiting time of vehicles at intersections and the number of vehicles reaching their destinations within a time window (i.e., throughput). At an 80% RV penetration rate, our method reduces waiting time from 6.17 s to 5.09 s and increases throughput from 454 vehicles per 500 seconds to 493 vehicles per 500 seconds, outperforming the baseline of fully signalized intersections. These findings suggest that integrating reinforcement learning-based control into large-scale traffic can improve overall efficiency and may inform future urban planning strategies.
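
The reported numbers can be turned into relative changes with a short calculation. The figures are the ones stated in the abstract; the helper name `relative_change` is only an illustrative choice.

```python
def relative_change(before, after):
    """Percentage change from `before` to `after` (negative means a drop)."""
    return (after - before) / before * 100.0


# Average waiting time: 6.17 s -> 5.09 s
wait_change = relative_change(6.17, 5.09)

# Throughput: 454 -> 493 vehicles per 500 seconds
throughput_change = relative_change(454, 493)

print(f"waiting time: {wait_change:.1f}%")      # about -17.5%
print(f"throughput:   {throughput_change:.1f}%")  # about +8.6%
```

In relative terms, waiting time falls by roughly 17.5% and throughput rises by roughly 8.6% against the fully signalized baseline.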

Abstract:
Mapping and planning are fundamental to robotic navigation in unknown environments. This work introduces a probabilistic framework that combines Gaussian processes (GPs) for 2.5D object modeling with an informative motion planner, using LiDAR-based measurements. The mapping approach employs a flexible, nonparametric representation to process 3D point cloud data to create compact volumetric representations through contours and heights, enabling robust shape estimation even from sparse data. Building on this, the GP representation-based informative motion planner incorporates information gain into the dynamic window approach (DWA) to enhance navigation performance. Simulations validate the framework by comparing its mapping accuracy with OctoMap and elevation map, and its planning efficiency with a baseline DWA.
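
Incorporating information gain into the dynamic window approach can be pictured as adding one more term to the usual DWA objective of heading, clearance, and velocity. The sketch below is a hedged illustration, not the paper's planner: the dictionary keys, the weights, and the pre-computed `info_gain` values (which would come from the GP map uncertainty) are all assumptions.

```python
def dwa_score(candidate, weights):
    """Weighted sum of the objective terms for one velocity candidate."""
    return sum(weights[term] * candidate[term] for term in weights)


def best_command(candidates, weights):
    """Pick the (v, w) candidate with the highest score."""
    return max(candidates, key=lambda c: dwa_score(c, weights))


candidates = [
    {"v": 0.5, "w": 0.0, "heading": 0.9, "clearance": 0.8,
     "velocity": 0.5, "info_gain": 0.1},
    {"v": 0.4, "w": 0.3, "heading": 0.7, "clearance": 0.8,
     "velocity": 0.4, "info_gain": 0.9},
]

baseline = {"heading": 1.0, "clearance": 1.0, "velocity": 1.0, "info_gain": 0.0}
informative = {**baseline, "info_gain": 1.0}

plain_choice = best_command(candidates, baseline)        # goes straight
informed_choice = best_command(candidates, informative)  # turns to observe more
```

With the information-gain weight at zero the planner reduces to a plain DWA; raising it trades some goal progress for motions that are expected to reduce map uncertainty.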

Abstract:
This paper investigates a problem called Multi-Agent Combinatorial Path Finding for Tractor-Trailers (MCPF-TT), which seeks collision-free paths for multiple agents from their start to goal locations, visiting a set of intermediate target locations in the middle of the paths, while minimizing the sum of arrival times. Additionally, each agent behaves like a tractor, and a trailer is attached to the agent at each intermediate target location, which increases the "body length" of that agent by one unit. Planning for those tractor-trailers in a cluttered environment introduces additional challenges, since the planner has to plan each agent in a larger state space that includes the position of the attached trailers to avoid self-collision. Furthermore, agents are more likely to collide with each other due to the increasing body lengths, and the conventional collision resolution techniques turn out to be computationally inefficient. This paper develops a new planner called CBSS-TT that includes both novel inter-agent conflict resolution techniques and a new single-agent planner, TTCA, that finds an optimal single-agent path while avoiding self-collision. Our test results show that CBSS-TT sometimes requires 60% fewer iterations while finding solutions with lower costs than the baselines.

Abstract:
Vision-based Bird’s-Eye-View (BEV) 3D object detection has recently become popular in autonomous driving. However, objects with a high similarity to the background from a camera perspective cannot be detected well by existing methods. In this paper, we propose a BEV-based 3D Object Detection Network with 2D Region-Oriented Attention (ROA-BEV), which enables the backbone to focus more on feature learning of the regions where objects exist. Moreover, our method further enhances the information feature learning ability of ROA through multi-scale structures. Each block of ROA utilizes a large kernel to ensure that the receptive field is large enough to catch information about large objects. Experiments on nuScenes show that ROA-BEV improves the performance based on BEVDepth. The source codes of this work will be available at https://github.com/DFLyan/ROA-BEV.

Abstract:
Endoscopic reconstruction plays a crucial role in surgical robotics. The dynamic lighting conditions and integrated camera-light source in endoscopic scenes create a distinct reconstruction challenge: shape ambiguity. To mitigate this, we propose a Gaussian Splatting (GS) based framework for endoscopic scene reconstruction, enhanced with reflectance regularization. We embed every 3D Gaussian point with physical reflective attributes and combine this representation with a physically based inverse rendering framework. By jointly training 3DGS for view synthesis with this reflectance regularization, we are able to attain high-quality geometry without changing the volume rendering pipeline. Our experiments demonstrate the superiority in both geometry representation and rendering performance compared to existing GS approaches, making it a practical solution for endoscopic applications. Project is available at: https://med-air.github.io/GSR2.

Abstract:
Multiview point cloud registration is a fundamental task for constructing globally consistent 3D models. Existing approaches typically rely on feature extraction and data association across multiple point clouds. However, these processes make it difficult to obtain a globally optimal solution in complex environments. In this paper, we introduce a novel correspondence-free multiview point cloud registration method. Specifically, we represent the global map as a depth map and leverage raw depth information to formulate a non-linear least squares optimisation that jointly estimates the poses of the point clouds and the global map. Unlike traditional feature-based bundle adjustment methods, which rely on explicit feature extraction and data association, our method bypasses these by associating multi-frame point clouds with a global depth map through their corresponding poses. This data association is implicitly incorporated and dynamically refined during the optimisation process. Extensive evaluations on real-world datasets demonstrate that our method outperforms state-of-the-art approaches in accuracy, particularly in challenging environments where feature extraction and data association are difficult.

Abstract:
World Model-based Reinforcement Learning (WMRL) enables sample efficient policy learning by reducing the need for online interactions which can potentially be costly and unsafe, especially for autonomous driving. However, existing world models often suffer from low prediction fidelity and compounding one-step errors, leading to policy degradation over long horizons. Additionally, traditional RL policies, often deterministic or single Gaussian-based, fail to capture the multi-modal nature of decision-making in complex driving scenarios. To address these challenges, we propose Imagine-2-Drive, a novel WMRL framework that integrates a high-fidelity world model with a multi-modal diffusion-based policy actor. It consists of two key components: DiffDreamer, a diffusion-based world model that generates future observations simultaneously, mitigating error accumulation, and DPA (Diffusion Policy Actor), a diffusion-based policy that models diverse and multi-modal trajectory distributions. By training DPA within DiffDreamer, our method enables robust policy learning with minimal online interactions. We evaluate our method in CARLA using standard driving benchmarks and demonstrate that it outperforms prior world model baselines, improving Route Completion and Success Rate by 15% and 20% respectively. Project page: https://imagine-2-drive.github.io/

Abstract:
Recent work on visual world models shows significant promise in latent state dynamics obtained from pre-trained image backbones. However, most of the current approaches are sensitive to training quality, requiring near-complete coverage of the action and state space during training to prevent divergence during inference. To make a model-based planning algorithm more robust to the quality of the learned world model, we propose in this work to use a variational autoencoder as a novelty detector to ensure that proposed action trajectories during planning do not cause the learned model to deviate from the training data distribution. To evaluate the effectiveness of this approach, a series of experiments in challenging simulated robot environments was carried out, with the proposed method incorporated into a model-predictive control policy loop extending the DINO-WM architecture. The results clearly show that the proposed method improves over state-of-the-art solutions in terms of data efficiency.

Abstract:
The autonomous formation flight of fixed-wing drones is hard when the coordination requires the actuation over their speeds since they are critically bounded and aircraft are mostly designed to fly at a nominal airspeed. This paper proposes an algorithm to achieve formation flights of fixed-wing drones without requiring any actuation over their speed. In particular, we guide all the drones to travel over specific paths, e.g., parallel straight lines, and we superpose an oscillatory behavior onto the guiding vector field that drives the drones to the paths. This oscillation enables control over the average velocity along the path, thereby facilitating inter-drone coordination. Each drone adjusts its oscillation amplitude distributively in a closed-loop manner by communicating with neighboring agents in an undirected and connected graph. A novel consensus algorithm is introduced, leveraging a non-negative, asymmetric saturation function. This unconventional saturation is justified since negative amplitudes do not make drones travel backward or have a negative velocity along the path. Rigorous theoretical analysis of the algorithm is complemented by validation through numerical simulations and a real-world formation flight.
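
The idea of coordinating through oscillation amplitude can be illustrated with a toy simulation. This is only a sketch under stated assumptions, not the paper's algorithm: each drone flies at a constant airspeed and its heading offset from the path follows `A*sin(theta)`, so a larger amplitude lowers the average along-path speed. The function names, the gain, the amplitude limit, and the clamp-to-`[0, a_max]` saturation are all hypothetical choices made for illustration.

```python
import numpy as np

def mean_path_speed(v, amplitude, samples=720):
    """Average along-path speed over one oscillation period.

    Assumed model: constant airspeed v, heading offset amplitude*sin(theta)
    about the path direction, so the along-path component is
    v*cos(amplitude*sin(theta)).
    """
    theta = np.linspace(0.0, 2.0 * np.pi, samples, endpoint=False)
    return v * np.mean(np.cos(amplitude * np.sin(theta)))

def saturate(x, a_max):
    """Non-negative, asymmetric saturation: clamp x to [0, a_max]."""
    return float(np.clip(x, 0.0, a_max))

def simulate(positions, edges, v=20.0, gain=0.5, a_max=1.2, dt=0.05, steps=1200):
    """Euler simulation of along-path positions on an undirected graph."""
    p = np.array(positions, dtype=float)
    for _ in range(steps):
        # Sum of along-path position differences to neighbours (consensus error).
        error = np.zeros_like(p)
        for i, j in edges:
            error[i] += p[i] - p[j]
            error[j] += p[j] - p[i]
        # Drones ahead of their neighbours oscillate more and so slow down;
        # drones behind get zero amplitude and fly at full speed.
        amps = [saturate(gain * e, a_max) for e in error]
        speeds = np.array([mean_path_speed(v, a) for a in amps])
        p += speeds * dt
    return p

if __name__ == "__main__":
    final = simulate([10.0, 4.0, 0.0], edges=[(0, 1), (1, 2)])
    print("final along-path spread:", final.max() - final.min())
```

In this toy model the sign of the amplitude does not change the average speed, which is consistent with the abstract's remark that negative amplitudes cannot make a drone travel backward.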

Abstract:
This study proposes a novel strategy for cross-scale positioning in robotic micromanipulation. The strategy uses multiple-scale augmented reality (AR) markers to locate the robotic manipulator at different scales. The macro-marker (3.0 cm per side, 5 mm×5 mm per square) is used to position the robot at the microscopic manipulation area. The micro-marker (2.4 mm per side, 400 μm×400 μm per square) is used to position the end-effector under the microscopic view. After fabricating the markers, we first calibrated the camera's intrinsic parameter matrix. Subsequently, we evaluated the detection performance of the macro- and micro-markers. Because the micro-marker appears differently under the microscope, its detection distance was corrected and compensated, and a fixed reference marker was introduced for correction at different focus heights. Finally, based on the detected markers, a robotic manipulator integrated with a microfluidic chip as an end-effector was employed to demonstrate micromanipulation by loading oocytes. The proposed strategy has potential applications in biology laboratory automation.

Abstract:
Deep object pose estimators are notoriously overconfident. A grasping agent that both estimates the 6-DoF pose of a target object and predicts the uncertainty of its own estimate could avoid task failure by choosing not to act under high uncertainty. Even though object pose estimation improves and uncertainty quantification research continues to make strides, few studies have connected them to the downstream task of robotic grasping. We propose a method for training lightweight, deep networks to predict whether a grasp guided by an image-based pose estimate will succeed before that grasp is attempted. We generate training data for our networks via object pose estimation on real images and simulated grasping. We also find that, despite high object variability in grasping trials, networks benefit from training on all objects jointly, suggesting that a diverse variety of objects can nevertheless contribute to the same goal. Data, code, and guides are hosted at: https://github.com/EricCJoyce/Consensus-Driven-Uncertainty/

Abstract:
With the development of embodied intelligence, many studies have made progress by incorporating scene graphs and graph neural networks (GNNs) into task planning. However, most methods still struggle to fully capture the sequential relationships between agent actions and the environment, making it difficult to handle the dynamic changes and complexity inherent in embodied tasks. This paper proposes a Dynamic Graph Attention Network for Embodied Task Planning (DGETP) to process scene graph sequences and robot graphs for dynamic environment perception. In DGETP, we design a Hierarchical Dynamic Graph Attention network (H-DGAT) that employs both structural and temporal attention mechanisms to model the dynamic evolution of the scene. A Dual-branch Action-object Predictor (DAP) is proposed in DGETP, which introduces sequences of previous actions and objects to efficiently aggregate historical information. DAP captures temporal dependencies between past and future actions through explicit sequence modeling, and reduces prediction complexity via a dual-branch architecture that separates action and object prediction while preserving their correlations through targeted feature fusion. Experiments show that DGETP improves task accuracy by over 30% in seen scenes and over 15% in unseen scenes compared to other baselines. In complex scenes, DGETP demonstrates strong generalization ability. Finally, experiments in the simulation environment indicate that DGETP achieves more goals than most advanced task planning methods.

Abstract:
Sparse point-based trackers struggle with texture-less and incomplete point clouds. Conversely, dense voxel-based trackers have richer spatial and semantic information, but filtering out interference from complex backgrounds remains a challenge. Additionally, there is still a gap between point and voxel-based trackers in exploiting their complementary strengths. To address these issues, we propose UTracker, which uses unidirectional point-voxel fusion to construct a bridge between point and voxel tracking features, enabling them to complement and enhance each other. Specifically, we design template-enhanced unidirectional attention (TEUA) and historical template fusion (HTF), which enable unidirectional interaction from historical templates to the search area in the point branch, retaining the pure template features. Then, a point-guided adaptive feature transformer (PGAFT) is developed to unidirectionally enhance the interaction between point and voxel features. Extensive experiments demonstrate that UTracker achieves superior performance, reaching an average accuracy of 89.5%, 72.58%, and 63.4% on the KITTI, NuScenes, and Waymo Open Dataset, respectively.

Abstract:
Dexterous manipulation has received considerable attention in recent research. Existing studies have predominantly concentrated on reinforcement learning methods to address the substantial degrees of freedom in hand movements. Nonetheless, these methods typically suffer from low efficiency and accuracy. In this work, we introduce a novel reinforcement learning approach that leverages prior dexterous grasp pose knowledge to enhance both efficiency and accuracy. Unlike previous work, which always makes the robotic hand use a fixed dexterous grasp pose, we decouple the manipulation process into two distinct phases: first, we generate a dexterous grasp pose targeting the functional part of the object; then, we employ reinforcement learning to comprehensively explore the environment. Our findings suggest that the majority of learning time is spent identifying the appropriate initial position and selecting the optimal manipulation viewpoint. Experimental results demonstrate significant improvements in learning efficiency and success rates across four distinct tasks.

Abstract:
Accurate and robust simultaneous localization and mapping (SLAM) is crucial for autonomous mobile systems, typically achieved by leveraging the geometric features of the environment. Incorporating semantics provides a richer scene representation that not only enhances localization accuracy in SLAM but also enables advanced cognitive functionalities for downstream navigation and planning tasks. Existing pointwise semantic LiDAR SLAM methods often suffer from poor efficiency and generalization, making them less robust in diverse real-world scenarios. In this paper, we propose a semantic graph-enhanced SLAM framework, named SG-SLAM, which effectively leverages the geometric, semantic, and topological characteristics inherent in environmental structures. The semantic graph serves as a fundamental component that facilitates critical functionalities of SLAM, including robust relocalization during odometry failures, accurate loop closing, and semantic graph map construction. Our method employs a dual-threaded architecture, with one thread dedicated to online odometry and relocalization, while the other handles loop closure, pose graph optimization, and map update. This design enables our method to operate in real time and generate globally consistent semantic graph maps and point cloud maps. We extensively evaluate our method across the KITTI, MulRAN, and Apollo datasets, and the results demonstrate its superiority compared to state-of-the-art methods. Our method has been released at https://github.com/nubot-nudt/SG-SLAM.

Abstract:
This paper addresses the path-tracking problem of time-optimal trajectories under model uncertainties, by proposing a real-time predictive scaling algorithm. The algorithm is formulated as a convex optimization problem, designed to balance the trade-off between improving feasibility and time optimality of a trajectory. The predicted trajectory is scaled based on the presence of path segments that are particularly sensitive to model uncertainties within the prediction horizon. Numerical simulations and experiments demonstrate that the proposed scaling algorithm reduces the path traversal time, while preserving similar path-tracking accuracy compared to an existing non-predictive method.

Abstract:
Dense colored point clouds enhance visual perception and are of significant value in various robotic applications. However, existing learning-based point cloud upsampling methods are constrained by computational resources and batch processing strategies, which often require subdividing point clouds into smaller patches, leading to distortions that degrade perceptual quality. To address this challenge, we propose a novel 2D-3D hybrid colored point cloud upsampling framework (GaussianPU) based on 3D Gaussian Splatting (3DGS) for robotic perception. This approach leverages 3DGS to bridge 3D point clouds with their 2D rendered images in robot vision systems. A dual scale rendered image restoration network transforms sparse point cloud renderings into dense representations, which are then input into 3DGS along with precise robot camera poses and interpolated sparse point clouds to reconstruct dense 3D point clouds. We have made a series of enhancements to the vanilla 3DGS, enabling precise control over the number of points and significantly boosting the quality of the upsampled point cloud for robotic scene understanding. Our framework supports processing entire point clouds on a single consumer-grade GPU, eliminating the need for segmentation and thus producing high-quality, dense colored point clouds with millions of points for robot navigation and manipulation tasks. Extensive experimental results on generating million-level point cloud data validate the effectiveness of our method, substantially improving the quality of colored point clouds and demonstrating significant potential for applications involving large-scale point clouds in autonomous robotics and human-robot interaction scenarios.

Abstract:
Achieving efficient remote teleoperation is particularly challenging in unknown environments, as the teleoperator must rapidly build an understanding of the site’s layout. Online 3D mapping is a proven strategy to tackle this challenge, as it enables the teleoperator to progressively explore the site from multiple perspectives. However, traditional online map-based teleoperation systems struggle to generate visually accurate 3D maps in real time due to the high computational cost involved, leading to poor teleoperation performance. In this work, we propose a solution to improve teleoperation efficiency in unknown environments: a novel, modular, and efficient GPU-based integration between recent advancements in Gaussian splatting SLAM and existing online map-based teleoperation systems. We compare the proposed solution against state-of-the-art teleoperation systems and validate its performance through real-world experiments using an aerial vehicle. The results show significant improvements in decision-making speed and more accurate interaction with the environment, leading to greater teleoperation efficiency. In doing so, our system enhances remote teleoperation by seamlessly integrating photorealistic map generation with real-time performance, enabling effective teleoperation in unfamiliar environments. Video: https://www.youtube.com/watch?v=-Md49rKkV8I Code: https://github.com/ian-pge/GS_SLAM_teleoperation.git

Abstract:
While camera–radar fusion has led to notable progress in autonomous driving, many existing approaches overlook the risk of sensor failures, which can critically compromise system safety. To address this limitation, we propose RoCaRS, a robust camera–radar fusion model designed for bird’s-eye view (BEV) segmentation under sensor failure scenarios. RoCaRS incorporates two key components—Radar-aware Backbone (RB) and Feature Spreading (FS)—to enhance BEV feature representation, along with a Dynamic Input Dropout Strategy (DIDS) and Bidirectional Feature Refinement (BFR) to address missing sensor inputs. Experiments on the nuScenes benchmark show that RoCaRS not only outperforms state-of-the-art fusion models under normal conditions but also maintains high performance under various sensor failure settings. Notably, in the complete absence of camera input, RoCaRS exceeds the baseline by +23.2 mIoU for map and +30.0 IoU for vehicle. Furthermore, it retains 99% of the radar-only model’s performance and achieves 103% of the camera-only model’s performance when either all cameras or all radars are disabled—without any retraining. These results highlight the potential of intermediate fusion to match the robustness of late fusion, while more effectively leveraging complementary modalities.
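
The abstract does not spell out how the Dynamic Input Dropout Strategy works, so the following is only a generic modality-dropout sketch of the underlying idea: during training, one sensor stream is occasionally blanked so the fusion model learns to cope with a missing input. The function name `drop_inputs`, the zero-filling, the drop probabilities, and the rule that both streams are never dropped together are assumptions made for illustration.

```python
import numpy as np

def drop_inputs(camera_feat, radar_feat, p_camera, p_radar, rng):
    """Randomly blank at most one sensor stream for a training sample.

    Returns the (possibly zeroed) features and a pair of flags
    (camera_dropped, radar_dropped).
    """
    drop_camera = rng.random() < p_camera
    drop_radar = rng.random() < p_radar
    if drop_camera and drop_radar:
        # Assumption: always keep one modality, chosen at random.
        if rng.random() < 0.5:
            drop_camera = False
        else:
            drop_radar = False
    cam = np.zeros_like(camera_feat) if drop_camera else camera_feat
    rad = np.zeros_like(radar_feat) if drop_radar else radar_feat
    return cam, rad, (drop_camera, drop_radar)

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    camera = np.ones((6, 4, 4))   # hypothetical camera feature tensor
    radar = np.ones((2, 4, 4))    # hypothetical radar feature tensor
    cam, rad, flags = drop_inputs(camera, radar, 0.3, 0.3, rng)
    print("dropped (camera, radar):", flags)
```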

Abstract:
Robot-assisted feeding has the potential to enhance the independence of individuals requiring assistance, yet the bite transfer process remains particularly challenging, especially for those with complex conditions. In this paper, we present ORBiT, a novel Real2Sim2Real framework designed to optimize bite transfer in robot-assisted feeding. By integrating motion capture-driven, high-fidelity soft-body simulation with systematic parameter tuning, ORBiT effectively replicates realistic head, neck and jaw dynamics during feeding interactions to provide a safe simulation-driven approach to optimize bite transfer strategies. In our approach, motion capture data drives a personalized dynamic head model that, together with a comprehensive parameter search over variables such as entry angle, exit angle, exit depth, height offset, and distance to mouth, identifies the bite transfer parameters that minimize contact forces on the user. The optimal parameters are then transferred to a real-world robotic system and validated through a pilot user study involving five subjects. Results from real user evaluations mirror the trends in simulation, indicating that bite transfer parameters, especially those related to entry and exit angles, substantially affect user comfort and overall satisfaction. Our findings validate that simulation-derived optimizations can effectively guide improvements in bite transfer strategies, laying the groundwork for a safe, personalized approach to robot-assisted feeding. Supplementary videos can be found at: https://youtu.be/a2pklEIAkOA.

Abstract:
Simulating object dynamics from real-world perception shows great promise for digital twins and robotic manipulation but often demands labor-intensive measurements and expertise. We present a fully automated Real2Sim pipeline that generates simulation-ready assets for real-world objects through robotic interaction. Using only a robot’s joint torque sensors and an external camera, the pipeline identifies visual geometry, collision geometry, and physical properties such as inertial parameters. Our approach introduces a general method for extracting high-quality, object-centric meshes from photometric reconstruction techniques (e.g., NeRF, Gaussian Splatting) by employing alpha-transparent training while explicitly distinguishing foreground occlusions from background subtraction. We validate the full pipeline through extensive experiments, demonstrating its effectiveness across diverse objects. By eliminating the need for manual intervention or environment modifications, our pipeline can be integrated directly into existing pick-and-place setups, enabling scalable and efficient dataset creation. Project page (with code and data): https://scalable-real2sim.github.io/.

Abstract:
Global localization is a critical problem in autonomous navigation, enabling precise positioning without reliance on GPS. Modern techniques often depend on dense LiDAR maps, which, while precise, require extensive storage and computational resources. Alternative approaches have explored sparse maps and learned features, but suffer from poor robustness and generalization. We propose SparseLoc, a global localization framework that leverages vision-language foundation models to generate sparse, semantic-topometric maps in a zero-shot manner. Our approach combines this representation with Monte Carlo localization enhanced by a novel late optimization strategy for improved pose estimation. By constructing compact yet discriminative maps and refining poses through retrospective optimization, SparseLoc overcomes limitations of existing sparse methods, offering a more efficient and robust solution. Our system achieves over 5× improvement in localization accuracy compared to existing sparse mapping techniques. Despite utilizing only 1/500th of the points used by dense methods, it achieves comparable performance, maintaining average global localization error below 5 m and 2° on KITTI. We further demonstrate the practical applicability of our method through cross-sequence localization experiments and downstream navigation tasks.

Abstract:
Learning from Demonstration (LfD) is a widely used approach for teaching robot motion, but more sophisticated strategies are required to address complex tasks such as surface processing. Sanding is an example where comprehensive strategies are necessary to ensure complete and efficient coverage of the surface of a workpiece. In this paper, we present a system that captures human motions and contact forces during surface processing using a powered sanding tool. We provide a publicly available dataset that consists of demonstrations for various geometric shapes with the goal to extract robot execution strategies through LfD from a variety of users. This is in contrast to conventional LfD, which generates a policy directly from one or multiple trajectories provided by a single user. Further, we provide a data analysis that reveals key insights into how humans adapt their strategies to different surface geometries and extract robot execution strategies from it. Finally, we conduct two basic robotic experiments justifying the approach of strategy extraction. Our findings contribute to the understanding of human surface-processing behavior and lay the foundation for developing more effective robotic surface processing strategies.

Abstract:
Terrain classification is crucial for robotic navigation, especially in unknown environments. Existing terrain classification methods usually place high requirements on environmental conditions and robot motions, making them challenging to apply to real-world scenarios. In this paper, we develop a novel terrain classification system with a planar electrical capacitance tomography (ECT) sensor, which provides a non-contact, real-time, and cost-effective way to classify terrain. Specifically, we design a planar ECT sensor and integrate it at the bottom of a mobile robot. The proposed system leverages the collected capacitance measurements to reflect the inherent differences in dielectric permittivity across various terrain types. A multilayer perceptron network is used to fuse the collected capacitance and IMU measurements for classification. Additionally, a large-scale ECT dataset covering 10 different types of terrain is collected with the proposed system. Extensive experiments demonstrate the effectiveness and robustness of the proposed system.

Abstract:
Soft robots rely on soft actuators, whose nonlinear responses are challenging to model, simulate, and integrate into designs. This complexity hinders the development of advanced soft robots and rigid mechanisms that require soft actuators for compliant actuation. To address this, in this study we present a framework for model development and actuator integration that offers both real and virtual environments for testing and validation. This framework includes high-fidelity digital twins and an actuator integration bench. The digital twin allows for the testing of virtual actuator models in a validated digital environment, replicating various load profiles. The actuator integration bench provides a safe, reproducible platform for validating both models and controllers under different loads, load profiles, and inertial conditions. Together, these tools enable rapid and reliable validation of actuator models and controllers, accelerating the development cycle of complex soft robots. We conclude by demonstrating the proposed workflow using liquid-gas phase transition actuators as a demonstrative test subject. Our source code is available at: https://github.com/softrobotic/antagonistic.

Abstract:
Accurate hand motion capture (MoCap) is vital for applications in robotics, virtual reality, and biomechanics, yet existing systems face limitations in capturing high-degree-of-freedom (DoF) joint kinematics and personalized hand shape. Commercial gloves offer up to 21 DoFs, which are insufficient for complex manipulations while neglecting shape variations that are critical for contact-rich tasks. We present FSGlove, an inertial-based system that simultaneously tracks up to 48 DoFs and reconstructs personalized hand shapes via DiffHCal, a novel calibration method. Each finger joint and the dorsum are equipped with IMUs, enabling high-resolution motion sensing. DiffHCal integrates with the parametric MANO model through differentiable optimization, resolving joint kinematics, shape parameters, and sensor misalignment during a single streamlined calibration. The system achieves state-of-the-art accuracy, with joint angle errors of less than 2.7°, and outperforms commercial alternatives in shape reconstruction and contact fidelity. FSGlove’s open-source hardware and software design ensures compatibility with current VR and robotics ecosystems, while its ability to capture subtle motions (e.g., fingertip rubbing) bridges the gap between human dexterity and robotic imitation. Evaluated against Nokov optical MoCap, FSGlove advances hand tracking by unifying the kinematic and contact fidelity. Hardware design, software, and more results are available at: https://sites.google.com/view/fsglove.

Abstract:
Ultrasound-guided therapeutic procedures rely heavily on operator skill, leading to variability and high training costs. The shortage of trained ultrasonographers further exacerbates the issue, increasing workloads and associated health risks. Robotic technology has the potential to effectively tackle these issues, yet there has been limited research on fully autonomous robotic ultrasound-guided biopsy systems that cover the entire workflow. To address this challenge, this paper presents an autonomous robotic operative framework for superficial organ biopsy. The system integrates real-time slice-to-volume registration and navigation, along with a needle insertion mechanism, following operational protocols to autonomously perform the entire biopsy procedure. The feasibility, robustness, and generalizability of the system are validated through experimental studies.

Abstract:
Quadrupedal locomotion via Reinforcement Learning (RL) is commonly addressed using the teacher-student paradigm, where a privileged teacher guides a proprioceptive student policy. However, key challenges such as representation misalignment between the privileged teacher and the proprioceptive-only student, covariate shift due to behavioral cloning, and a lack of deployable adaptation lead to poor generalization in real-world scenarios. We propose Teacher-Aligned Representations via Contrastive Learning (TAR), a framework that leverages privileged information with self-supervised contrastive learning to bridge this gap. By aligning representations to a privileged teacher in simulation via contrastive objectives, our student policy learns structured latent spaces and exhibits robust generalization to Out-of-Distribution (OOD) scenarios, surpassing the fully privileged “Teacher”. Results show that TAR reaches peak performance with 2× faster training than state-of-the-art baselines. In OOD scenarios, it generalizes 40% better on average than existing methods. Moreover, TAR transitions seamlessly into learning during deployment without requiring privileged states, setting a new benchmark in sample-efficient, adaptive locomotion and enabling continual fine-tuning in real-world scenarios. Open-source code and videos are available at https://amrmousa.com/TARLoco/.
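
The abstract does not state which contrastive objective TAR uses. As a hedged illustration of aligning student embeddings to teacher embeddings, the sketch below uses the widely known InfoNCE loss as a stand-in: row `i` of the student batch is treated as the positive for row `i` of the teacher batch, and all other rows act as negatives. The function name, temperature, and embedding sizes are assumptions.

```python
import numpy as np

def info_nce(student, teacher, temperature=0.1):
    """InfoNCE loss between matching rows of two embedding batches."""
    # Cosine similarity via L2-normalised embeddings.
    s = student / np.linalg.norm(student, axis=1, keepdims=True)
    t = teacher / np.linalg.norm(teacher, axis=1, keepdims=True)
    logits = s @ t.T / temperature
    # Numerically stable row-wise log-softmax.
    logits = logits - logits.max(axis=1, keepdims=True)
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    # Positives sit on the diagonal (student i paired with teacher i).
    return float(-np.mean(np.diag(log_prob)))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    teacher = rng.normal(size=(8, 16))
    aligned = teacher + 0.01 * rng.normal(size=teacher.shape)
    print("aligned loss:   ", info_nce(aligned, teacher))
    print("mismatched loss:", info_nce(teacher[::-1], teacher))
```

A student whose embeddings match the teacher's row by row yields a lower loss than one whose rows are mismatched, which is the signal a contrastive alignment objective trains on.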

Abstract:
This paper presents a multi-particle parallel manipulation optoelectronic tweezers system integrated with computer vision technology, enabling the parallel and precise manipulation of dozens of particles. This system significantly enhances manipulation efficiency while maintaining high precision. By real-time monitoring of particle motion and light patterns, the system can rapidly adjust and optimize its manipulation strategy, thereby improving the stability and reliability of multi-particle synchronization in complex environments. Extensive experimental results demonstrate the system’s outstanding performance. For instance, it can quickly arrange complex patterns and letter sequences, facilitate the coordinated assembly of organoids from particle groups, and efficiently perform the precise separation and arrangement of mixed particles. The core advantage of this system lies in its high parallelism and flexibility, enabling it to handle large-scale synchronous manipulation tasks with exceptional operating accuracy. With continuous technological advancements and the broadening of application scenarios, this system is expected to have a profound impact in fields such as cell sorting, micro-device assembly, and organoid construction, providing robust support for research and technological development in these areas.

Abstract:
The whisker-inspired tactile sensor is advantageous for enhancing robotic perception in proximate range and darkness via non-intrusive contacts. However, localizing contact along the whisker shaft is challenging due to the non-injective mapping between tangential contacts and the resulting bending moments at the whisker base. Previous studies suggest that incorporating axial force measurements can resolve this ambiguity. In this work, we develop a magnetically transduced whisker sensor that integrates axial force sensing as an additional mechanical signal. The sensor features a tapered whisker with a custom slope and a 3-DoF suspension mechanism, enabling axial displacement at the base, which is proportional to the applied axial force. We construct a Penalized Gaussian Process model trained on synthetic data to estimate the whisker’s motion and refine it with real-data constraints. The design is compact, low-cost, and validated through simulations and real-world experiments to differentiate tangential contacts. Furthermore, we propose an optimization-based approach for estimating instantaneous contact locations. Experimental results demonstrate that the proposed method can effectively track contacts with millimeter-level accuracy, with a mean error of 7.17 mm overall and a lower mean error of 4.02 mm in large-deflection and close-to-base regions.

Abstract:
We introduce a robust framework, RGBTrack, for real-time 6D pose estimation and tracking that operates solely on RGB data, thereby eliminating the need for depth input in dynamic and precise object pose tracking tasks. Building on the FoundationPose architecture, we devise a novel binary search strategy combined with a render-and-compare mechanism to efficiently infer depth and generate robust pose hypotheses from true-scale CAD models. To maintain stable tracking in dynamic scenarios, including rapid movements and occlusions, RGBTrack integrates state-of-the-art 2D object tracking (XMem) with a Kalman filter and a state machine for proactive object pose recovery. In addition, RGBTrack’s scale recovery module dynamically adapts CAD models of unknown scale using an initial depth estimate, enabling seamless integration with modern generative reconstruction techniques. Extensive evaluations on benchmark datasets demonstrate that RGBTrack’s novel depth-free approach achieves competitive accuracy and real-time performance, making it a promising practical solution for application areas including robotics, augmented reality, and computer vision. The source code for our implementation will be made publicly available at https://github.com/GreatenAnoymous/RGBTrack.git.
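
The binary search over depth can be illustrated with a deliberately simplified stand-in for render-and-compare. Instead of rendering a CAD model and comparing it with an image mask, the sketch below assumes a pinhole camera and an object of known true width, so the projected width shrinks monotonically with depth and bisection applies. All names, the search range, and the tolerance are hypothetical.

```python
def projected_width_px(depth_m, object_width_m, focal_px):
    """Pinhole projection: apparent width in pixels at a given depth."""
    return focal_px * object_width_m / depth_m

def search_depth(observed_width_px, object_width_m, focal_px,
                 near=0.1, far=10.0, tol=1e-4):
    """Bisect on depth until the projected width matches the observation."""
    lo, hi = near, far
    while hi - lo > tol:
        mid = 0.5 * (lo + hi)
        if projected_width_px(mid, object_width_m, focal_px) > observed_width_px:
            lo = mid  # projection too large, so the object must be farther away
        else:
            hi = mid  # projection too small (or equal), so the object is nearer
    return 0.5 * (lo + hi)

if __name__ == "__main__":
    true_depth = 1.37
    observed = projected_width_px(true_depth, object_width_m=0.2, focal_px=600.0)
    estimate = search_depth(observed, object_width_m=0.2, focal_px=600.0)
    print("estimated depth (m):", estimate)
```

The search relies on the true scale of the object being known, which mirrors the abstract's reliance on true-scale CAD models.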

Abstract:
Camera-to-robot (also known as eye-to-hand) calibration is a critical component of vision-based robot manipulation. Traditional marker-based methods often require human intervention for system setup. Furthermore, existing autonomous markerless calibration methods typically rely on pre-trained robot tracking models that impede their application on edge devices and require fine-tuning for novel robot embodiments. To address these limitations, this paper proposes a model-based markerless camera-to-robot calibration framework, ARC-Calib, that is fully autonomous and generalizable across diverse robots and scenarios without requiring extensive data collection or learning. First, exploratory robot motions are introduced to generate easily trackable trajectory-based visual patterns in the camera’s image frames. Then, a geometric optimization framework is proposed to exploit the coplanarity and collinearity constraints from the observed motions to iteratively refine the estimated calibration result. Our approach eliminates the need for extra effort in either environmental marker setup or data collection and model training, rendering it highly adaptable across a wide range of real-world autonomous systems. Extensive experiments are conducted in both simulation and the real world to validate its robustness and generalizability.

Abstract:
Text spotting for industrial panels is a key task for intelligent monitoring. However, achieving efficient and accurate text spotting for complex industrial panels remains challenging due to issues such as cross-scale localization and ambiguous boundaries in dense text regions. Moreover, most existing methods primarily focus on representing a single text shape, neglecting a comprehensive exploration of multi-scale feature information across different texts. To address these issues, this work proposes a novel multi-scale dense text spotter for edge AI-based vision system (EdgeSpotter) to achieve accurate and robust industrial panel monitoring. Specifically, a novel Transformer with efficient mixer is developed to learn the interdependencies among multi-level features, integrating multi-layer spatial and semantic cues. In addition, a new feature sampling with Catmull-Rom splines is designed, which explicitly encodes the shape, position, and semantic information of text, thereby alleviating missed detections and reducing recognition errors caused by multi-scale or dense text regions. Furthermore, a new benchmark dataset for industrial panel monitoring (IPM) is constructed. Extensive qualitative and quantitative evaluations on this challenging benchmark dataset validate the superior performance of the proposed method in different challenging panel monitoring tasks. Finally, practical tests based on the self-designed edge AI-based vision system demonstrate the practicality of the method. The code and demo are available at https://github.com/vision4robotics/EdgeSpotter.
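
As background for the spline-based feature sampling, the following sketch evaluates a uniform Catmull-Rom segment and samples points along it. It shows the standard spline formula only; how EdgeSpotter chooses control points or uses the samples is not described in the abstract, so the function names and the example control points here are assumptions.

```python
def catmull_rom(p0, p1, p2, p3, t):
    """Evaluate a uniform Catmull-Rom segment between p1 and p2 at t in [0, 1]."""
    def blend(a, b, c, d):
        return 0.5 * (
            2 * b
            + (-a + c) * t
            + (2 * a - 5 * b + 4 * c - d) * t * t
            + (-a + 3 * b - 3 * c + d) * t ** 3
        )
    return tuple(blend(a, b, c, d) for a, b, c, d in zip(p0, p1, p2, p3))


def sample_segment(p0, p1, p2, p3, n):
    """Return n evenly spaced (in t) sample points on the segment from p1 to p2."""
    return [catmull_rom(p0, p1, p2, p3, i / (n - 1)) for i in range(n)]


if __name__ == "__main__":
    # Four made-up control points along a curved text boundary.
    for point in sample_segment((0, 0), (1, 1), (2, 1), (3, 0), 5):
        print(point)
```

A useful property for boundary modeling is that the curve passes through its control points: the segment starts at `p1` and ends at `p2`.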

Abstract:
Robotic grasping, crucial for robot interaction with objects, still struggles with counter-intuitive or long-tailed scenarios like uncommon materials and shapes. Humans, however, intuitively adjust grasps with their physics-informed interpretations of the object, using visual and linguistic cues. This work introduces PhyGrasp, a large multimodal model and dataset that enhance robotic manipulation by combining natural language and 3D point clouds using a bridge module to integrate these inputs. The language modality exhibits robust reasoning capabilities concerning the impacts of diverse physical properties on grasping, while the 3D modality comprehends object shapes and parts. With these two capabilities, PhyGrasp is able to accurately assess the physical properties of object parts and determine optimal grasping poses. Additionally, the model’s language comprehension enables human instruction interpretation, generating grasping poses that align with human preferences. To train PhyGrasp, we construct a dataset PhyPartNet with 195K object instances with varying physical properties and human preferences, alongside their corresponding language descriptions. Extensive experiments conducted in the simulation and on the real robots demonstrate that PhyGrasp achieves state-of-the-art performance, particularly in long-tailed cases, e.g., about 10% improvement in success rate over GraspNet. More demos and information are available on https://sites.google.com/view/phygrasp.

Abstract:
In this paper, we propose a novel network framework for indoor 3D object detection to handle variable input frame numbers in practical scenarios. Existing methods only consider fixed frames of input data for a single detector, such as monocular RGB-D images or point clouds reconstructed from dense multi-view RGB-D images. In practical application scenes such as robot navigation and manipulation, however, the raw input to the 3D detectors is the RGB-D images with variable frame numbers instead of the reconstructed scene point cloud. Previous approaches can only handle fixed frame input data and perform poorly with variable frame input. In order to make 3D object detection methods suitable for practical tasks, we present a novel 3D detection framework named AnyView for practical applications, which generalizes well across different numbers of input frames with a single model. To be specific, we propose a geometric learner to mine the local geometric features of each input RGB-D image frame and implement local-global feature interaction through a designed spatial mixture module. Meanwhile, we further utilize a dynamic token strategy to adaptively adjust the number of extracted features for each frame, which ensures consistent global feature density and further enhances the generalization after fusion. Extensive experiments on the ScanNet dataset show our method achieves both great generalizability and high detection accuracy with a simple and clean architecture containing a similar number of parameters to the baselines.

Abstract:
Predicting hand motion is critical for understanding human intentions and bridging the action space between human movements and robot manipulations. Existing hand trajectory prediction (HTP) methods forecast the future hand waypoints in 3D space conditioned on past egocentric observations. However, such models are only designed to accommodate 2D egocentric video inputs. There is a lack of awareness of multimodal environmental information from both 2D and 3D observations, hindering the further improvement of 3D HTP performance. In addition, these models overlook the synergy between hand movements and headset camera egomotion, either predicting hand trajectories in isolation or encoding egomotion only from past frames. To address these limitations, we propose novel diffusion models (MMTwin) for multimodal 3D hand trajectory prediction. MMTwin is designed to absorb multi-modal information as input encompassing 2D RGB images, 3D point clouds, past hand waypoints, and text prompt. Besides, two latent diffusion models, the egomotion diffusion and the HTP diffusion as twins, are integrated into MMTwin to predict camera egomotion and future hand trajectories concurrently. We propose a novel hybrid Mamba-Transformer module as the denoising model of the HTP diffusion to better fuse multimodal features. The experimental results on three publicly available datasets and our self-recorded data demonstrate that our proposed MMTwin can predict plausible future 3D hand trajectories compared to the state-of-the-art baselines, and generalizes well to unseen environments. The code and pretrained models will be released at https://github.com/IRMVLab/MMTwin.

Abstract:
Underwater localization is a crucial capability for ensuring robust and accurate vehicle navigation. Although various well-developed localization systems exist, their primary focus is on ground and aerial applications. The challenges posed by underwater environments, such as sparse textures and dynamic disturbances, make multi-modal fusion a promising solution for localization. This paper presents AVIP, a localization method that fuses Acoustic, Visual, Inertial, and Pressure modalities for underwater applications. To integrate the information from all modalities during initialization, visual and inertial modalities are alternately assigned as centric sensors to pairwise predict and update estimations of other modalities. The multi-centric calibration problem is addressed through factor graph optimization, which is fully integrated into the graph-based AVIP system as the calibration factor. To assess its performance and compare it with state-of-the-art approaches, the proposed method is evaluated using semi-physical datasets recorded by a BlueROV2 robot and public real-world datasets. Extensive experiments demonstrate that AVIP achieves superior localization accuracy and exhibits adaptability across a range of sensor configurations.

Abstract:
A drone trajectory planner should be able to dynamically adjust the safety–efficiency trade-off according to varying mission requirements in unknown environments. Although traditional polynomial-based planners offer computational efficiency and smooth trajectory generation, they require expert knowledge to tune multiple parameters to adjust this trade-off. Moreover, even with careful tuning, the resulting adjustment may fail to achieve the desired trade-off. Similarly, although reinforcement learning-based planners are adaptable in unknown environments, they do not explicitly address the safety-efficiency trade-off. To overcome this limitation, we introduce a Decision Transformer–based trajectory planner that leverages a single parameter, Return-to-Go (RTG), as a temperature parameter to dynamically adjust the safety–efficiency trade-off. In our framework, since RTG intuitively measures the safety and efficiency of a trajectory, RTG tuning does not require expert knowledge. We validate our approach using Gazebo simulations in both structured grid and unstructured random environments. The experimental results demonstrate that our planner can dynamically adjust the safety–efficiency trade-off by simply tuning the RTG parameter. Furthermore, our planner outperforms existing baseline methods across various RTG settings, generating safer trajectories when tuned for safety and more efficient trajectories when tuned for efficiency. Real-world experiments further confirm the reliability and practicality of our proposed planner.
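
For readers unfamiliar with Return-to-Go, the sketch below shows the standard Decision Transformer convention: during training, RTG at each step is the sum of the remaining rewards, and at inference a user-chosen target return is decremented by each received reward. The reward values and target are made up, and the planner's actual reward design is not given in the abstract.

```python
def returns_to_go(rewards):
    """Training-time RTG: the sum of rewards from each step to the end of the trajectory."""
    rtg, running = [], 0.0
    for r in reversed(rewards):
        running += r
        rtg.append(running)
    return rtg[::-1]


def rollout_rtg(target_return, rewards):
    """Inference-time RTG: start from a chosen target and subtract each received reward."""
    remaining = target_return
    trace = [remaining]
    for r in rewards:
        remaining -= r
        trace.append(remaining)
    return trace


if __name__ == "__main__":
    print(returns_to_go([1.0, 2.0, 3.0]))     # [6.0, 5.0, 3.0]
    print(rollout_rtg(10.0, [1.0, 2.0]))      # [10.0, 9.0, 7.0]
```

Because the model is conditioned on this single scalar, changing the initial target is the only knob needed to request a different behavior from the same trained policy.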

Abstract:
Current methods for human-object interaction segmentation excel in closed-world settings but struggle to generalize to open-world scenarios where novel actions emerge. Since collecting exhaustive training data for all possible dynamic human activities is impractical, a model capable of detecting and segmenting novel, out-of-distribution (OOD) actions without manual annotation is needed. To address this, we formally define the open-world action segmentation problem and propose a novel framework featuring three key components: 1) an Enhanced Pyramid Graph Convolutional Network with a new decoder for robust spatiotemporal upsampling, 2) hybrid-based training synthesizing OOD data to eliminate reliance on manual labels, and 3) a temporal clustering loss that groups in-distribution actions while distancing OOD samples. We evaluate our framework on two challenging human-object interaction recognition datasets: Bimanual Actions and Two Hands and Object datasets. Experimental results demonstrate significant improvements over state-of-the-art action segmentation models across multiple open-set evaluation metrics, achieving 16.9% and 34.6% relative gains in open-set segmentation (F1@50) and out-of-distribution detection performances (AUROC), respectively. Additionally, we conduct an in-depth ablation study to assess the impact of each proposed component, identifying the optimal framework configuration for open-world action segmentation.

Affiliations: School of Information and Electronics, Advanced Research Institute of Multidisciplinary Science (ARIMS), Beijing Institute of Technology, Beijing, China; School of Computer Science and Engineering, State Key Laboratory of Complex & Critical Software Environment, Jiangxi Research Institute, Beihang University, Beijing, China; School of Computing, Macquarie University, Sydney, Australia; Institute for Integrated and Intelligent Systems, Griffith University, Brisbane, Australia

Abstract:
Ground-to-aerial person search leverages cooperative efforts between unmanned aerial vehicles (UAVs) and ground surveillance cameras to locate individuals. Despite the progress made by recent works, the impact of the discrepancy between the two views is underestimated. This limits the overall person search performance when training the model in a view-agnostic way. To address this, we propose a view-aware decomposition and unification (VADU) framework for ground-to-aerial person search. Specifically, we decompose the person search model to learn view-oriented modules for image feature encoding and person proposal generation. The data sampling and retrieval feature learning are also composed to cope with the decomposed model. This decomposition improves both person detection and discriminative feature learning within each view. On top of the decomposition, we propose view-aware unification to produce unified cross-view person features. Cross-view prototypical contrastive learning is introduced to enhance the unification between different views, improving model robustness when retrieving a target person in cameras of a different view. As the decomposed parts of the model are deployed on different devices for inference, this overall framework adds no extra computation cost in real-world applications. Extensive experiments demonstrate that the proposed method achieves superior person search performance and guarantees the efficiency of inference. The source code is available at https://github.com/QFWang-11/vadu.

Abstract:
This work considers a large class of systems composed of multiple quadrotors manipulating deformable and extensible cables. The cable is described via a discretized representation, which decomposes it into linear springs interconnected through lumped-mass passive spherical joints. Sets of flat outputs are found for the systems. Numerical simulations support the findings by showing cable manipulation relying on flatness-based trajectories. Eventually, we present an experimental validation of the effectiveness of the proposed discretized cable model for a two-robot example. Moreover, a closed-loop controller based on the identified model and using cable-output feedback is experimentally tested.

Abstract:
Mobile manipulation in dynamic environments is challenging due to movable obstacles blocking the robot’s path. Traditional methods, which treat navigation and manipulation as separate tasks, often fail in such "manipulate-to-navigate" scenarios, as obstacles must be removed before navigation. In these cases, active interaction with the environment is required to clear obstacles while ensuring sufficient space for movement. To address the manipulate-to-navigate problem, we propose a reinforcement learning-based approach for learning manipulation actions that facilitate subsequent navigation. Our method combines manipulability priors to focus the robot on high manipulability body positions with affordance maps for selecting high-quality manipulation actions. By focusing on feasible and meaningful actions, our approach reduces unnecessary exploration and allows the robot to learn manipulation strategies more effectively. We present two new manipulate-to-navigate simulation tasks called Reach and Door with the Boston Dynamics Spot robot. The first task tests whether the robot can select a good hand position in the target area such that the robot base can move effectively forward while keeping the end effector position fixed. The second task requires the robot to move a door aside in order to clear the navigation path. Both tasks require manipulation first, followed by navigating the base forward. Results show that our method allows a robot to effectively interact with and traverse dynamic environments. Finally, we transfer the learned policy to a real Boston Dynamics Spot robot, which successfully performs the Reach task.

Abstract:
Physics-based simulations and learning-based models are vital for complex robotics tasks like deformable object manipulation and liquid handling. However, these models often struggle with accuracy due to epistemic uncertainty or the sim-to-real gap. For instance, accurately pouring liquid from one container to another poses challenges, particularly when models are trained on limited demonstrations and may perform poorly in novel situations. This paper proposes an uncertainty-aware Monte Carlo Tree Search (MCTS) algorithm designed to mitigate these inaccuracies. By incorporating estimates of model uncertainty, the proposed MCTS strategy biases the search towards actions with lower predicted uncertainty. This approach enhances the reliability of planning under uncertain conditions. Applied to a liquid pouring task, our method demonstrates improved success rates even with models trained on minimal data, outperforming traditional methods and showcasing its potential for robust decision-making in robotics.
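
One simple way to bias tree search towards low-uncertainty actions is to subtract an uncertainty penalty from the usual UCB1 selection score. The sketch below illustrates that idea only; the penalty form, its weight, and the dictionary layout of a child node are assumptions, since the abstract does not specify how uncertainty enters the search.

```python
import math


def selection_score(mean_value, visits, parent_visits, uncertainty,
                    c_explore=1.4, penalty=1.0):
    """UCB1 score minus a penalty proportional to the model's predicted uncertainty."""
    if visits == 0:
        return math.inf  # always try unvisited children first
    ucb = mean_value + c_explore * math.sqrt(math.log(parent_visits) / visits)
    return ucb - penalty * uncertainty


def select_child(children, parent_visits):
    """Pick the child with the highest uncertainty-penalized score."""
    return max(
        children,
        key=lambda ch: selection_score(
            ch["value"], ch["visits"], parent_visits, ch["uncertainty"]
        ),
    )


if __name__ == "__main__":
    children = [
        {"name": "tilt_fast", "value": 0.5, "visits": 10, "uncertainty": 0.4},
        {"name": "tilt_slow", "value": 0.5, "visits": 10, "uncertainty": 0.1},
    ]
    print(select_child(children, parent_visits=20)["name"])
```

With equal value estimates and visit counts, the action whose outcome the model predicts more confidently is preferred.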

Abstract:
Object-Centric Motion Generation (OCMG) is instrumental in advancing automated manufacturing processes, particularly in domains requiring high-precision expert robotic motions, such as spray painting and welding. To realize effective automation, robust algorithms are essential for generating extended, object-aware trajectories across intricate 3D geometries. However, contemporary OCMG techniques are either based on ad-hoc heuristics or employ learning-based pipelines that are still reliant on sensitive post-processing steps to generate executable paths. We introduce FoldPath, a novel, end-to-end, neural field based method for OCMG. Unlike prior deep learning approaches that predict discrete sequences of end-effector waypoints, FoldPath learns the robot motion as a continuous function, thus implicitly encoding smooth output paths. This paradigm shift eliminates the need for brittle post-processing steps that concatenate and order the predicted discrete waypoints. Particularly, our approach demonstrates superior predictive performance compared to recently proposed learning-based methods, and attains generalization capabilities even in real industrial settings, where only a limited amount of expert samples are provided. We validate FoldPath through comprehensive experiments in a realistic simulation environment and introduce new, rigorous metrics designed to comprehensively evaluate long-horizon robotic paths, thus advancing the OCMG task towards practical maturity.

Abstract:
Training robots to operate effectively in environments with uncertain states—such as ambiguous object properties or unpredictable interactions—remains a longstanding challenge in robotics. Imitation learning methods typically rely on successful examples and often neglect failure scenarios where uncertainty is most pronounced. To address this limitation, we propose the Uncertainty-driven Foresight Recurrent Neural Network (UF-RNN), a model that combines standard time-series prediction with an active "Foresight" module. This module performs internal simulations of multiple future trajectories and refines the hidden state to minimize predicted variance, enabling the model to selectively explore actions under high uncertainty. We evaluate UF-RNN on a door-opening task in both simulation and a real-robot setting, demonstrating that, despite the absence of explicit failure demonstrations, the model exhibits robust adaptation by leveraging self-induced chaotic dynamics in its latent space. When guided by the Foresight module, these chaotic properties stimulate exploratory behaviors precisely when the environment is ambiguous, yielding improved success rates compared to conventional stochastic RNN baselines. These findings suggest that integrating uncertainty-driven foresight into imitation learning pipelines can significantly enhance a robot’s ability to handle unpredictable real-world conditions.

Abstract:
Understanding 3D scenes semantically and spatially is crucial for the safe navigation of robots and autonomous vehicles, aiding obstacle avoidance and accurate trajectory planning. Camera-based 3D semantic occupancy prediction, which infers complete voxel grids from 2D images, is gaining importance in robot vision for its resource efficiency compared to 3D sensors. However, this task inherently suffers from a 2D–3D discrepancy, where objects of the same size in 3D space appear at different scales in a 2D image depending on their distance from the camera due to perspective projection. To tackle this issue, we propose a novel framework called VPOcc that leverages a vanishing point (VP) to mitigate the 2D-3D discrepancy at both the pixel and feature levels. As a pixel-level solution, we introduce a VPZoomer module, which warps images by counteracting the perspective effect using a VP-based homography transformation. In addition, as a feature-level solution, we propose a VP-guided cross-attention (VPCA) module that performs perspective-aware feature aggregation, utilizing 2D image features that are more suitable for 3D space. Lastly, we integrate two feature volumes extracted from the original and warped images to compensate for each other through a spatial volume fusion (SVF) module. By effectively incorporating VP into the network, our framework achieves improvements in both IoU and mIoU metrics on SemanticKITTI and SSCBench-KITTI360 datasets. Additional details are available at https://vision3d-lab.github.io/vpocc/.

Abstract:
Acoustic tweezers have been a valuable tool across various fields, from nano-microfabrication to biology. Their unique characteristics enable three-dimensional particle manipulation, where acoustic trapping serves as a fundamental requirement. However, traditional methods struggle to maintain steady particle positioning due to nonlinear forces and complex dynamic coupling effects. As a result, particle oscillations are inevitable and cannot be effectively compensated by predesigned acoustic trapping. To address these challenges, this study introduces a novel visual feedback control approach that dynamically adjusts the acoustic field distribution to mitigate oscillations along the z-axis of the acoustic trapping. A binocular microscopic vision system is employed for precise particle localization, while a disturbance observer estimates the effects of strong nonlinearity and uncertainties of the acoustic trapping. The proposed methodology is validated through simulations and experiments, demonstrating a significant reduction in z-axis oscillations from 1.33× wavelength to within 0.03× wavelength. This advancement marks a step forward in achieving precise and complex acoustic manipulation using traveling-wave acoustic tweezers.

Abstract:
Micro-robots are emerging as powerful tools for search-and-rescue, precision agriculture, and cooperative manipulation, where their small size and low cost offer advantages over larger robots. However, enabling autonomous navigation on these robots remains challenging due to severe hardware constraints, such as limited memory, energy, and computational power. We explore a brain-inspired learning paradigm called Hyperdimensional Computing (HDC) to equip micro-robots with a cheap, lightweight navigation model that runs onboard. We present NavHD, which features an adaptive HD encoder that learns spatial representations and incorporates loss-based training for both imitation learning and off-policy reinforcement learning. Our hardware implementation of NavHD uses eight ultrasound sensors and is optimized to run on an ARM Cortex-M4 core, using only 10.2 kB of memory, 900 clock cycles and 1.1 mJ of energy per inference. Through experiments in both simulation and the real world, we demonstrate that NavHD outperforms DNN-based and prior HDC-based RL methods in obstacle avoidance by more than 2x, while achieving 2-26x better resource efficiency.
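
The basic HDC operations can be sketched in a few lines: random bipolar hypervectors, binding by elementwise multiplication, bundling by majority sign, and a normalized dot product as similarity. This is a generic, fixed-encoder illustration and not NavHD's adaptive encoder; the dimension, the four quantization levels, and all function names are assumptions.

```python
import random

DIM = 2000  # hypervector dimension (made-up value for illustration)


def random_hv(rng):
    """Random bipolar hypervector with entries in {-1, +1}."""
    return [rng.choice((-1, 1)) for _ in range(DIM)]


def bind(a, b):
    """Binding: elementwise multiplication associates two hypervectors."""
    return [x * y for x, y in zip(a, b)]


def bundle(vectors):
    """Bundling: elementwise majority sign superimposes several hypervectors."""
    return [1 if sum(col) >= 0 else -1 for col in zip(*vectors)]


def similarity(a, b):
    """Normalized dot product in [-1, 1]."""
    return sum(x * y for x, y in zip(a, b)) / DIM


def encode(levels, sensor_ids, level_hvs):
    """Encode quantized sensor readings: bind each sensor ID to its level, then bundle."""
    return bundle([bind(sensor_ids[i], level_hvs[lv]) for i, lv in enumerate(levels)])


if __name__ == "__main__":
    rng = random.Random(0)
    sensor_ids = [random_hv(rng) for _ in range(8)]  # eight range sensors
    level_hvs = [random_hv(rng) for _ in range(4)]   # four distance levels
    a = encode([0, 1, 2, 3, 0, 1, 2, 3], sensor_ids, level_hvs)
    b = encode([0, 1, 2, 3, 0, 1, 2, 0], sensor_ids, level_hvs)
    print(similarity(a, a), similarity(a, b), similarity(a, random_hv(rng)))
```

Similar sensor readings map to similar hypervectors, while unrelated vectors are nearly orthogonal, and every operation is integer arithmetic, which suits small microcontrollers.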

Abstract:
Safe real-time control of robotic manipulators in unstructured environments requires handling numerous safety constraints without compromising task performance. Traditional approaches, such as artificial potential fields (APFs), suffer from local minima, oscillations, and limited scalability, while model predictive control (MPC) can be computationally expensive. Control barrier functions (CBFs) offer a promising alternative due to their high level of robustness and low computational cost, but these safety filters must be carefully designed to avoid significant reductions in the overall performance of the manipulator. In this work, we introduce an Operational Space Control Barrier Function (OSCBF) framework that integrates safety constraints while preserving task-consistent behavior. Our approach scales to hundreds of simultaneous constraints while retaining real-time control rates, ensuring collision avoidance, singularity prevention, and workspace containment even in highly cluttered settings or during dynamic motions. By explicitly accounting for the task hierarchy in the CBF objective, we prevent degraded performance across both joint-space and operational-space tasks, when at the limit of safety. We validate performance in both simulation and hardware, and release our open-source high-performance code and media on our project webpage, https://stanfordasl.github.io/oscbf/
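
The idea of a CBF safety filter can be shown with the smallest possible case: one affine constraint on the control input, for which the minimally invasive correction has a closed form. This is a generic textbook sketch, not the OSCBF formulation; the single-integrator system, the barrier h(x) = x_max - x, and the gain `alpha` are assumptions chosen for illustration.

```python
def cbf_filter(u_nom, a, b):
    """Return the input closest to u_nom (least squares) satisfying a . u + b >= 0."""
    margin = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    norm_sq = sum(ai * ai for ai in a)
    if margin >= 0.0 or norm_sq == 0.0:
        return list(u_nom)  # nominal input is already safe (or constraint is input-free)
    scale = margin / norm_sq
    return [ui - scale * ai for ui, ai in zip(u_nom, a)]


def position_limit_constraint(x, x_max, alpha):
    """For x_dot = u and h(x) = x_max - x, h_dot + alpha * h >= 0 reads -u + alpha * (x_max - x) >= 0."""
    return [-1.0], alpha * (x_max - x)


if __name__ == "__main__":
    a, b = position_limit_constraint(x=0.9, x_max=1.0, alpha=2.0)
    print(cbf_filter([1.0], a, b))  # nominal speed is reduced near the limit
    print(cbf_filter([0.1], a, b))  # a slow nominal speed passes through unchanged
```

The filter only intervenes when the nominal command would violate the constraint, which is why careful design is needed to keep such filters from degrading task performance when many constraints are active.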

Abstract:
Vision-based object detectors are a crucial basis for robotics applications as they provide valuable information about object localization in the environment. These need to ensure high reliability in different lighting conditions, occlusions, and visual artifacts, all while running in real-time. Collecting and annotating real-world data for these networks is prohibitively time-consuming and costly, especially for custom assets, such as industrial objects, making it untenable for generalization to in-the-wild scenarios. To this end, we present Synthetica, a method for large-scale synthetic data generation for training robust state estimators. This paper focuses on the task of object detection, an important problem which can serve as the front-end for most state estimation problems, such as pose estimation. Leveraging data from a photorealistic ray-tracing renderer, we scale up data generation, generating 2.7 million images, to train highly accurate real-time detection transformers. We present a collection of rendering randomization and training-time data augmentation techniques conducive to robust sim-to-real performance for vision tasks. We demonstrate state-of-the-art performance on the task of object detection while having detectors that run at 50–100 Hz, which is 9 times faster than the prior state-of-the-art (SOTA). We further demonstrate the usefulness of our training methodology for robotics applications by showcasing a pipeline for use in the real world with custom objects for which no prior datasets exist. Our work highlights the importance of scaling synthetic data generation for robust sim-to-real transfer while achieving the fastest real-time inference speeds. Videos and supplementary information can be found at https://sites.google.com/view/synthetica-vision

Abstract:
The acoustic response of an object can reveal a lot about its global state, for example its material properties or the extrinsic contacts it is making with the world. In this work, we build an active acoustic sensing gripper equipped with two piezoelectric fingers: one for generating signals, the other for receiving them. By sending an acoustic vibration from one finger to the other through an object, we gain insight into an object’s acoustic properties and contact state. We use this system to classify objects, estimate grasping position, estimate poses of internal structures, and classify the types of extrinsic contacts an object is making with the environment. Using our contact type classification model, we tackle a standard long-horizon manipulation problem: peg insertion. We use a simple simulated transition model based on the performance of our sensor to train an imitation learning policy that is robust to imperfect predictions from the classifier. We finally demonstrate the policy on a UR5 robot with active acoustic sensing as the only feedback. Videos can be found at https://roamlab.github.io/vibecheck.

Abstract:
Monocular depth estimation is essential for applications such as autonomous navigation and 3D reconstruction. However, achieving accurate and temporally consistent depth estimation in dynamic environments remains challenging due to scale ambiguity, sensitivity to dynamic objects, and inconsistent depth predictions. Traditional SLAM-based methods ensure global consistency but perform poorly in dynamic scenes, while deep learning-based approaches suffer from the absence of absolute scale and temporal stability. To address these issues, we propose a Real-Time Consistent Monocular Depth Recovery System that combines ORB-SLAM3 for sparse depth initialization, a ViT-based depth completion network, and a motion segmentation module to improve robustness in dynamic environments. Additionally, we introduce a dual-weight fusion module that adaptively balances RGB semantic features and geometric depth priors, ensuring high accuracy and consistency. Our system jointly optimizes both static and dynamic regions to produce globally scale-consistent dense depth maps with improved temporal stability. Extensive experiments on benchmark datasets demonstrate that our approach outperforms existing methods in terms of depth accuracy, temporal consistency, and robustness in dynamic scenes, while maintaining real-time performance.

Abstract:
Super-resolution (SR) can greatly promote the development of edge electro-optical (EO) devices. However, most existing SR models struggle to simultaneously achieve effective thermal reconstruction and real-time inference on edge EO devices with limited computing resources. To address these issues, this work proposes a novel fast thermal SR model (EdgeSR) for edge EO devices. Specifically, reparameterized scale-integrated convolutions (RepSConv) are proposed to deeply explore high-frequency features, incorporating multi-scale information and enhancing the scale-awareness of the backbone during the training phase. Furthermore, an interactive reparameterization module (IRM), combining historical high-frequency with low-frequency information, is introduced to guide the extraction of high-frequency features, ultimately boosting the high-quality reconstruction of thermal images. Edge EO deployment-oriented reparameterization (EEDR) is designed to reparameterize all modules into standard convolutions that are hardware-friendly for edge EO devices and onboard real-time inference. Additionally, a new benchmark for thermal SR on cityscapes (CS-TSR) is built. The experimental results on this benchmark show that, compared to state-of-the-art lightweight SR networks, EdgeSR delivers superior reconstruction quality and faster inference speed on edge EO devices. In real-world applications, EdgeSR exhibits robust performance on edge EO devices, making it suitable for real-world deployment. The code and demo are available at https://github.com/vision4robotics/EdgeSR.

Abstract:
Obstacles on railroads significantly increase the risk of rail travel, and many train accidents are caused by undetected obstacles. These obstacles disrupt both the shipment of goods and the transportation of people, leading to delays and damage that result in substantial financial losses. Following natural disasters, manually locating and removing obstacles is not only time-consuming but also hazardous for the personnel involved. To address these challenges, this paper proposes an object detection system that can be implemented on an aerial drone to detect obstacles on the railway. This approach aims to enhance railway safety, reduce costs, and ensure the timely delivery of essential goods such as food and medical supplies during emergencies.

Abstract:
Tracking the position and orientation of objects in space (i.e., in 6-DoF) in real time is a fundamental problem in robotics for environment interaction. It becomes more challenging when objects move at high speed due to frame rate limitations in conventional cameras and motion blur. Event cameras are characterized by high temporal resolution, low latency and high dynamic range, which can potentially overcome the impacts of motion blur. Traditional RGB cameras provide rich visual information that is more suitable for the challenging task of single-shot object pose estimation. In this work, we propose using event-based optical flow combined with an RGB-based global object pose estimator for 6-DoF pose tracking of objects at high speed, exploiting the core advantages of both types of vision sensors. Specifically, we propose an event-based optical flow algorithm for object motion measurement to implement an object 6-DoF velocity tracker. By integrating the tracked object 6-DoF velocity with the low-frequency pose estimates from the global pose estimator, the method can track pose when objects move at high speed. The proposed algorithm is tested and validated on both synthetic and real-world data, demonstrating its effectiveness, especially in high-speed motion scenarios.

Abstract:
Deep generative models, particularly diffusion and flow matching models, have recently shown remarkable potential in learning complex policies through imitation learning. However, the safety of generated motions remains overlooked, particularly in complex environments with inherent obstacles. In this work, we address this critical gap by proposing Potential Field-Guided Flow Matching Policy (PF2MP), a novel approach that simultaneously learns task policies and extracts obstacle-related information, represented as a potential field, from the same set of successful demonstrations. During inference, PF2MP modulates the flow matching vector field via the learned potential field, enabling safe motion generation. By leveraging these complementary fields, our approach achieves improved safety without compromising task success across diverse environments, such as navigation tasks and robotic manipulation scenarios. We evaluate PF2MP in both simulation and real-world settings, demonstrating its effectiveness in task space and joint space control. Experimental results demonstrate that PF2MP enhances safety, achieving a significant reduction of collisions compared to baseline policies. This work paves the way for safer motion generation in unstructured and obstacle-rich environments.
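
The modulation step described in this abstract can be illustrated with a minimal sketch. The abstract does not state how the flow matching vector field is combined with the potential field, so the additive form (velocity minus a gain times the potential gradient), the hand-written repulsive potential, and the names `base_velocity`, `repulsive_gradient`, `modulated_velocity`, and `gain` are all assumptions made for illustration. In PF2MP both fields are learned from demonstrations.

```python
import math


def base_velocity(x, goal):
    """Hypothetical stand-in for a learned flow matching field: point at the goal."""
    return [g - xi for xi, g in zip(x, goal)]


def repulsive_gradient(x, obstacle, radius):
    """Gradient of U(x) = 0.5 * (1/d - 1/radius)^2 inside the radius, zero outside.

    This classic repulsive potential is an assumption; the paper learns its field.
    """
    diff = [xi - oi for xi, oi in zip(x, obstacle)]
    d = math.sqrt(sum(c * c for c in diff))
    if d >= radius or d == 0.0:
        return [0.0 for _ in x]
    # dU/dd = (1/d - 1/radius) * (-1/d^2); chain rule with dd/dx = diff / d.
    scale = (1.0 / d - 1.0 / radius) * (-1.0 / (d * d)) / d
    return [scale * c for c in diff]


def modulated_velocity(x, goal, obstacle, radius, gain=1.0):
    """Subtract the scaled potential gradient so motion is pushed away from the obstacle."""
    v = base_velocity(x, goal)
    g = repulsive_gradient(x, obstacle, radius)
    return [vi - gain * gi for vi, gi in zip(v, g)]
```

Outside the obstacle's influence radius the modulated velocity equals the base velocity, so under this assumed form the task behavior is only altered near obstacles.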

Abstract:
Magnetic microrobots are showing great potential in micromanipulation due to the capability of motion control under external fields. However, achieving selective control of magnetic microrobots in three-dimensional (3D) space using global magnetic fields still presents a challenge. In this work, we propose a selective control strategy based on a movable electromagnetic coil system, incorporating a mass-spring-damping model to achieve precise control of cell microrobots in 3D space. By combining theoretical analysis with vision-based feedback, experiments are demonstrated in different scenarios, including step climbing and ring traversal, validating the control capability in different environments. Furthermore, by utilizing the differences in magnetic responses among cell microrobots, this strategy enables selective manipulation of multiple cell microrobots, demonstrating real-time sorting manipulation in a 3D space. Our work presents a strategy that can be applied to selectively manipulate magnetic microrobots in complex environments.

Abstract:
Surgical planning and training based on machine learning requires a large amount of 3D anatomical models reconstructed from medical imaging, which is currently one of the major bottlenecks. Obtaining these data from real patients and during surgery is very demanding, if even possible, due to legal, ethical, and technical challenges. It is especially difficult for soft tissue organs with poor imaging contrast, such as the prostate. To overcome these challenges, we present a novel workflow for automated 3D anatomical data generation using data obtained from physical organ models. We additionally use a 3D Generative Adversarial Network (GAN) to obtain a manifold of 3D models useful for other downstream machine learning tasks that rely on 3D data. We demonstrate our workflow using an artificial prostate model made of biomimetic hydrogels with imaging contrast in multiple zones. This is used to physically simulate endoscopic surgery. For evaluation and 3D data generation, we place it into a customized ultrasound scanner that records the prostate before and after the procedure. A neural network is trained to segment the recorded ultrasound images, which outperforms conventional, non-learning-based computer vision techniques in terms of intersection over union (IoU). Based on the segmentations, a 3D mesh model is reconstructed, and performance feedback is provided.

Abstract:
Integrating functional wrist articulation in prosthetic robot arms is crucial for enhancing natural movement and reducing compensatory upper limb motions. However, two significant challenges remain in surface electromyography (sEMG)-based prosthetic control: (1) real-time processing via efficient model design and (2) cross-subject generalization to address the individual variability in muscle signals. This study employs the MAMBA2 architecture to address the first challenge, leveraging Structured State Space Models (SSM) for efficient long-sequence inference. This enables real-time control with minimal computational overhead, making it well-suited for prosthetic robot arm applications. To tackle the second challenge, we implement a Representation Subspace Distance (RSD)-based Unsupervised Domain Adaptation (UDA), which preserves feature scale while aligning inter-subject variations, mitigating domain shift effects, and improving subject-independent wrist movement estimation. The model is trained on the Ninapro DB2 dataset, utilizing multi-channel sEMG signals and corresponding wrist kinematics. Evaluation results demonstrate that the MAMBA architecture outperforms conventional recurrent neural networks, achieving lower Mean Squared Error (MSE) and higher R² values, with the Attention variant exhibiting the best prediction performance. Furthermore, this study highlights that the proposed UDA approach, combined with RSD-based alignment, significantly enhances cross-subject performance, reducing the need for extensive calibration. By enabling real-time processing through a computationally efficient model structure and effectively addressing cross-subject variability, this study contributes to developing a more reliable and generalizable sEMG-based robotic prosthesis controller, ultimately improving its applicability across diverse individuals.

Abstract:
High-level robotic manipulation tasks demand flexible 6-DoF grasp estimation to serve as a basic function. Previous approaches either directly generate grasps from point-cloud data, suffering from challenges with small objects and sensor noise, or infer 3D information from RGB images, which introduces expensive annotation requirements and discretization issues. Recent methods mitigate some challenges by retaining a 2D representation to estimate grasp keypoints and applying Perspective-n-Point (PnP) algorithms to compute 6-DoF poses. However, these methods are limited by their non-differentiable nature and reliance solely on 2D supervision, which hinders the full exploitation of rich 3D information. In this work, we present KGN-Pro, a novel grasping network that preserves the efficiency and fine-grained object grasping of previous KGNs while integrating direct 3D optimization through probabilistic PnP layers. KGN-Pro encodes paired RGB-D images to generate a keypoint map, and further outputs a 2D confidence map to weight keypoint contributions during re-projection error minimization. By modeling the weighted sum of squared re-projection errors probabilistically, the network effectively transmits 3D supervision to its 2D keypoint predictions, enabling end-to-end learning. Experiments on both simulated and real-world platforms demonstrate that KGN-Pro outperforms existing methods in terms of grasp cover rate and success rate. Project website: https://waitderek.github.io/kgnpro.

Abstract:
In this paper, we present a hierarchical question-answering (QA) approach for scene understanding in autonomous vehicles, balancing cost-efficiency with detailed visual interpretation. The method fine-tunes a compact vision-language model (VLM) on a custom dataset specific to the geographical area in which the vehicle operates to capture key driving-related visual elements. At the inference stage, the hierarchical QA strategy decomposes the scene understanding task into high-level and detailed sub-questions. Instead of generating lengthy descriptions, the VLM navigates a structured question tree, where answering high-level questions (e.g., "Is it possible for the ego vehicle to turn left at the intersection?") triggers more detailed sub-questions (e.g., "Is there a vehicle approaching the intersection from the opposite direction?"). To optimize inference time, questions are dynamically skipped based on previous answers, minimizing computational overhead. The extracted answers are then synthesized using handcrafted templates to ensure coherent, contextually accurate scene descriptions. We evaluate the proposed approach on the custom dataset using GPT reference-free scoring, demonstrating its competitiveness with state-of-the-art methods like GPT-4o in capturing key scene details while achieving significantly lower inference time. Moreover, qualitative results from real-time deployment highlight the proposed approach’s capacity to capture key driving elements with minimal latency. The code is available at https://github.com/knu-citac/HierarchicalQA.
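
The question-tree traversal with skipping described in this abstract can be sketched as follows. The rule that a sub-question is asked only when its parent is answered affirmatively is an assumption, as are the names `Question` and `traverse` and the callable `answer_fn`, which stands in for a query to the fine-tuned VLM. The abstract does not specify the exact skipping criteria.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Question:
    """One node of a hypothetical question tree."""
    text: str
    children: List["Question"] = field(default_factory=list)


def traverse(root: Question, answer_fn: Callable[[str], bool]) -> Dict[str, bool]:
    """Ask questions depth-first; skip a whole subtree when its parent is answered 'no'."""
    answers: Dict[str, bool] = {}
    stack = [root]
    while stack:
        question = stack.pop()
        answer = answer_fn(question.text)  # stand-in for one VLM call
        answers[question.text] = answer
        if answer:
            # Push children in reverse so they are asked in their listed order.
            stack.extend(reversed(question.children))
    return answers
```

Every skipped subtree saves one model call per question in it, which is where the inference-time reduction would come from under this assumed rule. The collected answers could then be passed to templates to compose the scene description.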

Abstract:
Soft robots exhibit complex behaviors despite simple control, due to their inherently compliant hardware which passively deforms upon contact, a concept commonly referred to as morphological computation. To fully determine the behavior of soft robots, not only their control software but also their passive behavior needs to be programmed. We show that deliberate programming of passive deformation in soft fingertips can significantly influence the grasping and manipulation performance of various robotic grippers. For this, the fingertips display strategically modulated compliance levels across their palmar surface, realized through adjustments to the local thickness of a lattice structure within their soft material, resulting in desired passive deformation. The grippers are operated by human participants, solving diverse tasks involving a variety of objects. We analyze 2025 human trials and show that the distinct passive behaviors programmed into the fingertips significantly affect grasping and manipulation performance. Furthermore, we discovered that specific compliance profiles consistently demonstrate superior performance, indicating that not merely inherent softness by itself, but a purposeful combination of varying compliance levels plays a pivotal role in successful soft interaction.

Abstract:
The strength of the human hand lies in its ability to manipulate objects precisely and robustly. In contrast, simple robotic grippers have low dexterity and fail to handle small objects effectively. This is why many automation tasks remain unsolved by robots. This paper presents an optimization-based framework for in-hand manipulation with a robotic hand equipped with compact Magnetic Tactile Sensors (MTSs). We formulate a trajectory optimization problem using Nonlinear Programming (NLP) for finger movements while allowing contact points to change along the geometry of the fingers. Using the optimized trajectory from the solver, we implement and test an open-loop controller for rolling motion. To further enhance robustness and accuracy, we introduce a force controller for the fingers and a state estimator for the object utilizing MTSs. The proposed framework is validated through comparative experiments, showing that incorporating the force control with compliance consideration improves the accuracy and robustness of the rolling motion. Rolling an object with the force controller is 30% more likely to succeed than running an open-loop controller. The demonstration video is available at https://youtu.be/6J_muL_AyE8.

Abstract:
This research aims to enhance the interaction between humans and robots, especially in environments with multiple similar objects or semantic ambiguities. Traditional command-based interactions typically require users to provide precise descriptions, which often poses a significant challenge. To address this issue, we propose a framework named TagGuideBot, which leverages Visual Language Models (VLMs) and utilizes object markers to help locate and identify objects in the environment. By integrating positional point prompts of the target objects with robot motion planning models, we aim to achieve a more accurate understanding and execution of complex commands, thus improving the efficiency and naturalness of interactions. Experimental results demonstrate that TagGuideBot effectively addresses the challenges posed by complex commands and environmental complexities, achieving an accuracy of 66.3% on user instructions extended beyond the training set, providing solid support for further optimization of human-robot interaction.

Abstract:
Optimizing trajectory costs for nonlinear control systems remains a significant challenge. Model Predictive Control (MPC), particularly sampling-based approaches such as the Model Predictive Path Integral (MPPI) method, has recently demonstrated considerable success by leveraging parallel computing to efficiently evaluate numerous trajectories. However, MPPI often struggles to balance safe navigation in constrained environments with effective exploration in open spaces, leading to infeasibility in cluttered conditions. To address these limitations, we propose DBaS-Log-MPPI, a novel algorithm that integrates Discrete Barrier States (DBaS) to ensure safety while enabling adaptive exploration with enhanced feasibility. Our method is efficiently validated through three simulation missions and one real-world experiment, involving a 2D quadrotor and a ground vehicle navigating through cluttered obstacles. We demonstrate that our algorithm surpasses both Vanilla MPPI and Log-MPPI, achieving higher success rates, lower tracking errors, and a conservative average speed.
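
For readers unfamiliar with the sampling-based update that MPPI builds on, the importance-weighting step of vanilla MPPI can be sketched as below. This is not DBaS-Log-MPPI: the barrier states and the log-normal sampling are omitted, and the function names `mppi_weights` and `mppi_update` and the parameter `temperature` are chosen here for illustration.

```python
import math


def mppi_weights(costs, temperature):
    """Softmin weights over sampled trajectory costs (shifted by the minimum for stability)."""
    best = min(costs)
    unnormalized = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(unnormalized)
    return [u / total for u in unnormalized]


def mppi_update(nominal, noises, costs, temperature):
    """Shift each nominal control by the cost-weighted average of the sampled perturbations.

    nominal: list of controls over the horizon (scalar controls for brevity).
    noises:  one perturbation sequence per sampled trajectory, same length as nominal.
    costs:   total cost of each sampled trajectory.
    """
    weights = mppi_weights(costs, temperature)
    return [
        u + sum(w * eps[t] for w, eps in zip(weights, noises))
        for t, u in enumerate(nominal)
    ]
```

A lower temperature concentrates the weights on the cheapest samples, while a higher temperature averages more broadly across samples.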

Abstract:
Perception-based navigation systems are useful for unmanned ground vehicle (UGV) navigation in complex terrains, where traditional depth-based navigation schemes are insufficient. However, these data-driven methods are highly dependent on their training data and can fail in surprising and dramatic ways with little warning. To ensure the safety of the vehicle and the surrounding environment, it is imperative that the navigation system is able to recognize the predictive uncertainty of the perception model and respond safely and effectively in the face of uncertainty. In an effort to enable safe navigation under perception uncertainty, we develop a probabilistic and reconstruction-based competency estimation (PaRCE) method to estimate the model’s level of familiarity with an input image as a whole and with specific regions in the image. We find that the overall competency score can accurately predict correctly classified, misclassified, and out-of-distribution (OOD) samples. We also confirm that the regional competency maps can accurately distinguish between familiar and unfamiliar regions across images. We then use this competency information to develop a planning and control scheme that enables effective navigation while maintaining a low probability of error. We find that the competency-aware scheme greatly reduces the number of collisions with unfamiliar obstacles, compared to a baseline controller with no competency awareness. Furthermore, the regional competency information is particularly valuable in enabling efficient navigation.

Abstract:
Precise monitoring of pig behavior has become pivotal for enhancing animal welfare and breeding efficiency. However, existing studies predominantly focus on behavior recognition while neglecting environmental influences, and lack specialized image captioning models and datasets tailored for farm scenarios, hindering textual analysis of behavior-environment interactions. In this study, a multimodal image captioning model was proposed to generate semantic textual descriptions of pig behavior, thereby supporting smart decision-making in farm management. The model employs a ResNet-18 encoder to extract pig visual biometric features from RGB and depth images, coupled with an innovative decoder integrating an enhanced Long Short-Term Memory (LSTM) network and Graph Convolutional Network (GCN) for pig behavior textual description, effectively resolving the input inconsistency between training and inference phases in traditional Encoder-Decoder architectures. Additionally, a dedicated pig behavior dataset comprising 9,052 annotated images was constructed, covering four behavioral categories: standing, sitting, lying, and eating. The experimental results show that the proposed approach achieves a METEOR score of 88.25%, outperforming baseline models by up to 21.58%. By recognizing pig behavior and interpreting environmental context, the proposed approach introduces a practical methodology for analyzing behavior-environment interactions and facilitates the integration of LLM-embedded robotic systems into smart livestock farming.

Abstract:
This work introduces BEV-LIO(LC), a novel LiDAR-Inertial Odometry (LIO) framework that combines Bird’s Eye View (BEV) image representations of LiDAR data with geometry-based point cloud registration and incorporates loop closure (LC) through BEV image features. By normalizing point density, we project LiDAR point clouds into BEV images, thereby enabling efficient feature extraction and matching. A lightweight convolutional neural network (CNN) based feature extractor is employed to extract distinctive local and global descriptors from the BEV images. Local descriptors are used to match BEV images with FAST keypoints for reprojection error construction, while global descriptors facilitate loop closure detection. Reprojection error minimization is then integrated with point-to-plane registration within an iterated Extended Kalman Filter (iEKF). In the back-end, global descriptors are used to create a KD-tree-indexed keyframe database for accurate loop closure detection. When a loop closure is detected, Random Sample Consensus (RANSAC) computes a coarse transform from BEV image matching, which serves as the initial estimate for Iterative Closest Point (ICP). The refined transform is subsequently incorporated into a factor graph along with odometry factors, improving the global consistency of localization. Extensive experiments conducted in various scenarios with different LiDAR types demonstrate that BEV-LIO(LC) outperforms state-of-the-art methods, achieving competitive localization accuracy. Our code and video can be found at https://github.com/HxCa1/BEV-LIO-LC.

Abstract:
In modern healthcare, the demand for autonomous robotic assistants has grown significantly, particularly in the operating room, where surgical tasks require precision and reliability. Robotic scrub nurses have emerged as a promising solution to improve efficiency and reduce human error during surgery. However, challenges remain in terms of accurately grasping and handing over surgical instruments, especially when dealing with complex objects in dynamic environments. In this work, we introduce RoboNurse-VLA, a novel robotic scrub nurse system based on a Vision-Language-Action (VLA) model. RoboNurse-VLA integrates Segment Anything Model 2 (SAM 2) and Llama 2, leveraging an LLM head to enhance reasoning capabilities. By combining SAM 2’s mask generation with Llama 2’s advanced reasoning, RoboNurse-VLA can accurately interpret task requirements, identify optimal grasping points, and determine appropriate handover poses. Designed for real-time operation, RoboNurse-VLA enables precise grasping and seamless handover of surgical instruments based on voice commands from the surgeon. Utilizing state-of-the-art vision and language models, it effectively addresses challenges related to object detection, pose optimization, and handling difficult-to-grasp instruments. Extensive evaluations demonstrate that RoboNurse-VLA outperforms existing models, achieving high success rates in surgical instrument handovers, even for previously unseen tools and complex objects. This work represents a significant advancement in autonomous surgical assistance, highlighting the potential of VLA models for real-world medical applications. More details can be found at https://robonurse-vla.github.io.

Abstract:
The increasingly complex and diverse planetary exploration environment requires more adaptable and flexible rover navigation strategies. In this study, we propose a VLM-empowered multi-mode system to achieve efficient yet safe autonomous navigation for planetary rovers. A Vision-Language Model (VLM) is used to parse scene information from image inputs to achieve a human-level understanding of terrain complexity. Based on the complexity classification, the system switches to the most suitable navigation mode, composed of perception, mapping and planning modules designed for different terrain types, to traverse the terrain ahead before reaching the next waypoint. By integrating the local navigation system with a map server and a global waypoint generation module, the rover is equipped to handle long-distance navigation tasks in complex scenarios. The navigation system is evaluated in various simulation environments. Compared to the single-mode conservative navigation method, our multi-mode system is able to boost time and energy efficiency in a long-distance traversal with varied types of obstacles, enhancing efficiency by 79.5%, while maintaining its avoidance capabilities against terrain hazards to guarantee rover safety. More system information is shown at https://chengsn1234.github.io/multi-mode-planetary-navigation/.

Abstract:
Visual Language Navigation (VLN) is a fundamental task within the field of Embodied AI, focusing on the ability of agents to navigate complex environments based on natural language instructions. Despite the progress made by existing methods, these methods often present some common challenges. First, they rely on pre-trained backbone models for visual perception, which struggle with the dynamic viewpoints in VLN scenarios. Second, the performance is limited when using pre-trained LLMs or VLMs without fine-tuning, due to the absence of VLN domain knowledge. Third, while fine-tuning LLMs and VLMs can improve results, their computational costs are higher than those without fine-tuning. To address these limitations, we propose Weakly-supervised Partial Contrastive Learning (WPCL), a method that enhances an agent’s ability to identify objects from dynamic viewpoints in VLN scenarios by effectively integrating pre-trained VLM knowledge into the perception process, without requiring VLM fine-tuning. Our method enhances the agent’s ability to interpret and respond to environmental cues while ensuring computational efficiency. Experimental results have shown that our method outperforms the baseline methods on multiple benchmarks, which validates the effectiveness, robustness, and generalizability of our method.

Abstract:
In autonomous driving, accurate motion prediction is crucial for safe and efficient motion planning. To ensure safety, planners require reliable uncertainty estimates of the predicted behavior of surrounding agents, yet this aspect has received limited attention. In particular, decomposing uncertainty into its aleatoric and epistemic components is essential for distinguishing between inherent environmental randomness and model uncertainty, thereby enabling more robust and informed decision-making. This paper addresses the challenge of uncertainty modeling in trajectory prediction with a holistic approach that emphasizes uncertainty quantification, decomposition, and the impact of model composition. Our method, grounded in information theory, provides a theoretically principled way to measure uncertainty and decompose it into aleatoric and epistemic components. Unlike prior work, our approach is compatible with state-of-the-art motion predictors, allowing for broader applicability. We demonstrate its utility by conducting extensive experiments on the nuScenes dataset, which shows how different architectures and configurations influence uncertainty quantification and model robustness.
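
A common information-theoretic decomposition, which may differ from the estimator used in this paper, treats the entropy of the averaged prediction as total uncertainty, the average entropy of individual predictions as the aleatoric part, and their difference (a mutual information) as the epistemic part. The sketch below assumes an ensemble of categorical distributions over discrete trajectory modes. The names `entropy` and `decompose` are illustrative.

```python
import math


def entropy(p):
    """Shannon entropy in nats of a categorical distribution."""
    return -sum(pi * math.log(pi) for pi in p if pi > 0.0)


def decompose(ensemble):
    """Split predictive uncertainty for an ensemble of categorical predictions.

    ensemble: list of distributions, one per model, over the same set of modes.
    Returns (total, aleatoric, epistemic), where epistemic = total - aleatoric.
    """
    n_models = len(ensemble)
    n_modes = len(ensemble[0])
    mean = [sum(member[k] for member in ensemble) / n_models for k in range(n_modes)]
    total = entropy(mean)
    aleatoric = sum(entropy(member) for member in ensemble) / n_models
    return total, aleatoric, total - aleatoric
```

When all ensemble members agree, the epistemic term vanishes and any remaining uncertainty is attributed to the data. When confident members disagree, the aleatoric term vanishes and the uncertainty is attributed to the model.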

Abstract:
LiDAR semantic segmentation models are typically trained from random initialization as universal pre-training is hindered by the lack of large, diverse datasets. Moreover, most point cloud segmentation architectures incorporate custom network layers, limiting the transferability of advances from vision-based architectures. Inspired by recent advances in universal foundation models, we propose BALViT, a novel approach that leverages frozen vision models as amodal feature encoders for learning strong LiDAR encoders. Specifically, BALViT incorporates both range-view and bird’s-eye-view LiDAR encoding mechanisms, which we combine through a novel 2D-3D adapter. While the range-view features are processed through a frozen image backbone, our bird’s-eye-view branch enhances them through multiple cross-attention interactions. Thereby, we continuously improve the vision network with domain-dependent knowledge, resulting in a strong label-efficient LiDAR encoding mechanism. Extensive evaluations of BALViT on the SemanticKITTI and nuScenes benchmarks demonstrate that it outperforms state-of-the-art methods on small data regimes. We make the code and models publicly available at http://balvit.cs.uni-freiburg.de.

Abstract:
Microscopic traffic simulation has become an important tool for autonomous driving training and testing. Although recent data-driven approaches advance realistic behavior generation, their learning still relies primarily on a single real-world dataset, which limits their diversity and thereby hinders downstream algorithm optimization. In this paper, we propose DriveGen, a novel traffic simulation framework with large models for more diverse traffic generation that supports further customized designs. DriveGen consists of two internal stages: the initialization stage uses a large language model and retrieval technique to generate map and vehicle assets; the rollout stage outputs trajectories with selected waypoint goals from a visual language model and a specifically designed diffusion planner. Through this two-staged process, DriveGen fully utilizes large models’ high-level cognition and reasoning of driving behavior, obtaining greater diversity beyond datasets while maintaining high realism. To support effective downstream optimization, we additionally develop DriveGen-CS, an automatic corner case generation pipeline that uses failures of the driving algorithm as additional prompt knowledge for large models without the need for retraining or fine-tuning. Experiments show that our generated scenarios and corner cases have superior performance compared to state-of-the-art baselines. Downstream experiments further verify that the synthesized traffic of DriveGen provides better optimization of the performance of typical driving algorithms, demonstrating the effectiveness of our framework.

Abstract:
This paper presents a sound source localization strategy that relies on a microphone array embedded in an unmanned ground vehicle and an asynchronous close-talking microphone near the operator. A signal coarse alignment strategy is combined with a time-domain acoustic echo cancellation algorithm to estimate a time-frequency ideal ratio mask to isolate the target speech from interferences and environmental noise. This allows selective sound source localization, and provides the robot with the direction of arrival of sound from the active operator, which enables rich interaction in noisy scenarios. Results demonstrate an average angle error of 4 degrees and an accuracy within 5 degrees of 95% at a signal-to-noise ratio of 1 dB, which is significantly superior to the state-of-the-art localization methods.

Abstract:
Generalizable algorithms for tactile sensing remain underexplored, primarily due to the diversity of sensor modalities. Recently, many methods for cross-sensor transfer between optical (vision-based) tactile sensors have been investigated, yet little work focuses on non-optical tactile sensors. To address this gap, we propose an encoder-decoder architecture to unify tactile data across non-vision-based sensors. By leveraging sensor-specific encoders, the framework creates a latent space that is sensor-agnostic, enabling cross-sensor data transfer with low errors and direct use in downstream applications. We leverage this network to unify tactile data from two commercial tactile sensors: the Xela uSkin uSPa 46 and the Contactile PapillArray. Both were mounted on a UR5e robotic arm, performing force-controlled pressing sequences against distinct object shapes (circular, square, and hexagonal prisms) and two materials (rigid PLA and flexible TPU). Another more complex unseen object was also included to investigate the model’s generalization capabilities. We show that alignment in latent space can be implicitly learned from joint autoencoder training with matching contacts collected via different sensors. We further demonstrate the practical utility of our approach through contact geometry estimation, where downstream models trained on one sensor’s latent representation can be directly applied to another without retraining.

Abstract:
This paper describes methodologies to estimate the collision probability, Euclidean distance and gradient between a robot and a surface, without explicitly constructing a free space representation. The robot is assumed to be an ellipsoid, which provides a tighter approximation for navigation in cluttered and narrow spaces compared to the commonly used spherical model. Instead of computing distances over point clouds or high-resolution occupancy grids, which is expensive, the environment is modeled using compact Gaussian mixture models and approximated via a set of ellipsoids. A parallelizable strategy to accelerate an existing ellipsoid-ellipsoid distance computation method is presented. Evaluation in 3D environments demonstrates improved performance over state-of-the-art methods. Execution times for the approach are within a few microseconds per ellipsoid pair using a single-thread on low-power embedded computers.

Abstract:
To avoid dangerous situations, such as collisions in dynamic environments, autonomous vehicles must predict the risks of the current scene to take safe actions. Traditional rule-based risk prediction methods and existing reinforcement learning (RL) approaches, which typically rely on manually designed driving decision rules or heuristic reward functions, often fail to capture the complexity of real-world dangerous scenarios, leading to suboptimal and unsafe driving decisions. To address this limitation, we develop a novel RL method, called Group Opinion Risk-Aware Reinforcement Learning (GORA-RL), for safer driving decisions that align with real-world conditions. Specifically, we first introduce surveys of human drivers to assess risk in real-world driving situations. Using these real group opinions as training data, we train a risk prediction model, referred to as the risk prediction model with a Transformer (RPT), that captures the crucial characteristics of these scenarios, resulting in more realistic and reliable risk predictions. This model is then integrated as a reward function to train an RL algorithm for making driving decisions in various scenarios. The experiments validate that our approach outperforms two state-of-the-art (SOTA) methods in challenging congested scenarios, such as merging and intersections, in terms of reward and several other metrics. Project site: https://github.com/naiyisiji/RPT.

Abstract:
Accurate aerodynamic interaction modeling in multi-drone tasks is crucial for enhancing system stability and efficiency, especially when facing major disturbances from downwash wake effects. Conventional data-driven and empirical models mainly address simplified cases where one drone hovers or all vehicles have low absolute and relative velocities (≤ 0.5 m/s), and rely merely on relative states. In this study, we use high-fidelity Computational Fluid Dynamics (CFD) simulations to explore quadrotor interactions at higher speeds (0.5-4.0 m/s). We find that as the absolute velocities of the UAVs rise, downwash effects change significantly. To account for these discrepancies, we present a data-driven model considering both the absolute and relative properties of the downwash problem. We propose a geometric deep neural network predictor and compare its performance with existing data-driven and empirical models. Validations on two quadrotor settings show that our model gives more reliable predictions in tough scenarios and performs better in training without rigorous fine-tuning. Finally, we combine our predictor with a nonlinear feedback controller to enhance flight control under downwash disturbances. However, we encounter limitations for our speed ranges during trajectory tracking, such as delays and velocity loss. Despite these challenges, our encoding and prediction method proves to be a promising step toward addressing the downwash effects at higher speeds. We release our dataset, method, and re-implementations at: https://github.com/pavelkharitenko/flare-dw.

Abstract:
By framing reinforcement learning as a sequence modeling problem, recent work has enabled the use of generative models, such as diffusion models, for planning. While these models are effective in predicting long-horizon state trajectories in deterministic environments, they face challenges in dynamic settings with moving obstacles. Effective collision avoidance demands continuous monitoring and adaptive decision-making. While re-planning at every time step could ensure safety, it introduces substantial computational overhead due to the repetitive prediction of overlapping state sequences—a process that is particularly costly with diffusion models, known for their intensive iterative sampling procedure. We propose an adaptive generative planning approach that dynamically adjusts re-planning frequency based on the uncertainty of action predictions. Our method minimizes the need for frequent, computationally expensive, and redundant re-planning while maintaining robust collision avoidance performance. In experiments, we obtain a 13.5% increase in the mean trajectory length and 12.7% increase in mean reward over long-horizon planning, indicating a reduction in collision rates, and improved ability to navigate the environment safely.
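The re-planning rule described in this abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the uncertainty measure (mean per-dimension variance over sampled actions), the threshold, and the names `action_uncertainty`, `run_episode`, `plan_fn` and `sample_actions_fn` are all hypothetical, with the two callables standing in for a generative planner.

```python
import statistics


def action_uncertainty(samples):
    """Mean per-dimension population variance across sampled actions (assumed measure)."""
    dims = list(zip(*samples))
    return sum(statistics.pvariance(d) for d in dims) / len(dims)


def run_episode(plan_fn, sample_actions_fn, horizon, threshold):
    """Follow the current plan and re-plan only when uncertainty is high.

    plan_fn(t) returns a list of actions starting at step t (hypothetical planner).
    sample_actions_fn(t) returns several candidate actions for step t.
    Returns the executed actions and the steps at which planning happened.
    """
    plan, start = plan_fn(0), 0
    replan_steps = [0]
    executed = []
    for t in range(horizon):
        exhausted = t - start >= len(plan)
        uncertain = action_uncertainty(sample_actions_fn(t)) > threshold
        if t > 0 and (exhausted or uncertain):
            plan, start = plan_fn(t), t
            replan_steps.append(t)
        executed.append(plan[t - start])
    return executed, replan_steps


if __name__ == "__main__":
    # Stub planner and sampler: predictions disagree strongly only at step 3.
    def demo_plan(t):
        return [(float(t + k), 0.0) for k in range(10)]

    def demo_samples(t):
        return [(0.0, 0.0), (2.0, 0.0)] if t == 3 else [(0.0, 0.0), (0.1, 0.0)]

    print(run_episode(demo_plan, demo_samples, horizon=6, threshold=0.1))
```

With a fixed re-planning interval the planner would be called at every step; here it is called only at the start and where the sampled actions disagree.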

Abstract:
One indicator for evaluating autonomous vehicles’ capability is the accuracy of perceiving the surrounding environment. As an essential part of perception, MOT (multi-object tracking) algorithms provide vital guarantees for safe driving. However, many MOT algorithms based on a motion model only consider information from the previous frame when predicting the motion state of objects, without taking into account the long-term motion state. Moreover, their motion models generally assume constant speed or acceleration, which may cause tracking loss when an object suddenly changes its motion or is occluded by other objects. In this paper, we propose DA-MOT (Multiple Object Tracking with Dynamic Adaptive Object Motion Estimation), which utilizes information from lidar and camera sensors to calculate objects’ dynamic and static states under different sensor information. We modify the Kalman filter (KF) motion model parameters based on the object’s motion for better tracking performance. Furthermore, we design a re-association mechanism to re-assign IDs for inaccurate associations. We conducted experiments on the KITTI dataset, and the results show a significant improvement in accuracy. The DA-MOT algorithm improves MOTA by about 1.5% compared to other MOT algorithms in scenes with large changes in object state and runs at about 1000 fps on an Intel® Xeon® Gold 5217 CPU.

Abstract:
Given a demonstration of a complex manipulation task, such as pouring liquid from one container to another, we seek to generate a motion plan for a new task instance involving objects with different geometries. This is nontrivial since we need to simultaneously ensure that the implicit motion constraints are satisfied (e.g., a glass held upright while moving), that the motion is collision-free, and that the task is successful (e.g., liquid is poured into the target container). We solve this problem by identifying the positions of critical locations and associating reference frames (called motion transfer frames) with the manipulated object and the target, selected based on their geometries and the task at hand. By tracking and transferring the path of the motion transfer frames, we generate motion plans for arbitrary task instances with objects of different geometries and poses. We show results from simulation as well as robot experiments on physical objects to evaluate the effectiveness of our solution. A video supplement is available on YouTube: https://youtu.be/RuG9zMXnfR8

Abstract:
Place recognition is a fundamental technology of high relevance for autonomous robot navigation. Existing methods encounter significant challenges arising from scene variations (e.g., illumination changes, dynamic objects), viewpoint shifts, and difficulties in data fusion and alignment. These factors often lead to a substantial drop in recognition recall, which is typically addressed in the literature by training deep neural networks to learn invariant feature representations. In this paper, we propose MRMT-PR, a novel multi-scale reverse-view Mamba-Transformer architecture for LiDAR-based place recognition that uses a single-frame point cloud as its input. Our MRMT-PR framework consists of a multi-scale reverse-view preprocessing module for LiDAR point clouds, a Mamba-Transformer feature encoder, and a global feature fusion module. This architecture effectively mitigates the impact of perspective and illumination variations, enhances the global representational capacity of LiDAR features, and significantly improves recognition robustness under challenging conditions such as viewpoint changes and long-term localization. Experiments conducted on the NCLT dataset with challenging scenarios demonstrate that MRMT-PR outperforms existing LiDAR-based place recognition baselines in terms of overall performance.

Abstract:
Renal calculi, while not inherently life-threatening, can induce excruciating pain during acute episodes. The predominant clinical treatment, ureteroscopic lithotripsy (URS), currently faces challenges including restricted maneuverability, frequent manual adjustments during dynamic calculi movement, and tissue damage risk, highlighting the need for robotic assistance. This study proposes an autonomous lithotripsy system through three integrated technological advancements: 1) a robotic ureteroscope with sub-millimeter-scale positioning accuracy; 2) a concatenated Quenching-net semantic segmentation visual processing framework based on convolutional neural networks, achieving a segmentation accuracy of 98.5% (validated on 12,768 endoscopic images); 3) a control strategy developed through collaborative deep reinforcement learning (DRL), enabling a 93% success rate in tracking randomly moving calculi. This system's autonomous calculi localization capability reduces operator fatigue and may mitigate cognitive bias in calculi targeting. It demonstrates how embodied AI enhances medical procedural precision while preserving human oversight in critical decisions.

Abstract:
Three-dimensional (3D) plant phenotyping enables comprehensive trait analysis for evaluating plant growth in precision agriculture. Current phenotyping frameworks are low-throughput due to frequent manual intervention and inefficiencies in handling large volumes of repetitive data. Existing view planners for phenotyping are limited to static sampling methods or a narrow focus on specific plant organs, restricting their utility in capturing the full complexity of plant structures. To address these limitations, we propose SemP-NBV, a novel semantic-aware predictive next-best-view approach for sample-efficient 3D plant phenotyping. Evaluated in a photorealistic simulator with 12 plant categories, SemP-NBV achieves 15.3% more observed points than an even-space sampler at 8 images on average, while matching the reconstruction quality at 20 even-space images. We demonstrate that existing state-of-the-art predictive planners designed for artificial structures struggle on complex plant structures, showing no performance gain over an even-space sampler. Furthermore, our approach generates semantic information during reconstruction, reducing the need for post hoc semantic labeling and streamlining the 3D phenotyping workflow. The project is available at https://github.com/ARLabXiang/SemPNBV.

Abstract:
Multi-agent collaborative perception is currently experiencing a surge in attention as a novel approach to addressing autonomous driving challenges. Despite advances in previous efforts, challenges remain due to various dilemmas in the perception process, such as imperfect localization and collaboration heterogeneity. To tackle these issues, we propose CoDifFu, a novel diffusion-based collaborative perception framework that enhances robustness against localization uncertainty and improves efficiency in heterogeneous feature fusion. A diffusion-based detection head progressively denoises object centers through a learnable reverse process. During training, the center coordinates of objects diffuse from the ground truth to a Gaussian distribution, and the network learns to reverse the diffusion process. During inference, the model progressively refines a set of random box centers to align with the ground truth centers. Moreover, we devise a confidence-guided multi-agent communication module (CMC) that uses the confidence map as guidance to achieve complementary fusion of multi-agent features and alleviate collaboration heterogeneity. To thoroughly evaluate CoDifFu, we consider 3D object detection in both real-world and simulation scenarios. Extensive experiments demonstrate the superiority of CoDifFu and the effectiveness of all its vital components. The code will be released.
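The training-time corruption of object centers mentioned in this abstract follows the usual forward diffusion form, x_t = sqrt(ᾱ_t)·x_0 + sqrt(1 − ᾱ_t)·ε with ε drawn from a standard Gaussian. The sketch below illustrates only that forward step; the cosine-style schedule, the function names `alpha_bar` and `diffuse_centers`, and the 2D centers are assumptions for illustration, and the learned reverse (denoising) network is not shown.

```python
import math
import random


def alpha_bar(t, num_steps):
    """Cumulative signal level at step t (assumed cosine-style schedule)."""
    return math.cos(0.5 * math.pi * t / num_steps) ** 2


def diffuse_centers(centers, t, num_steps, rng):
    """Forward diffusion: blend ground-truth centers with Gaussian noise.

    t = 0 returns the ground-truth centers unchanged;
    t = num_steps returns (almost) pure Gaussian noise.
    """
    a = alpha_bar(t, num_steps)
    signal, noise = math.sqrt(a), math.sqrt(1.0 - a)
    return [
        tuple(signal * c + noise * rng.gauss(0.0, 1.0) for c in center)
        for center in centers
    ]


if __name__ == "__main__":
    rng = random.Random(0)
    gt_centers = [(0.2, -0.4), (0.7, 0.1)]  # normalized box centers (made-up values)
    for step in (0, 5, 10):
        print(step, diffuse_centers(gt_centers, step, 10, rng))
```

At inference time the process runs in the opposite direction: random centers are drawn first and a trained network removes noise step by step.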

Abstract:
In recent years, significant breakthroughs have been made in audio-guided 3D facial animation. However, existing methods mainly focus on lip shape and audio consistency and still face key challenges to achieve alignment between facial emotions and speech emotions. To overcome this limitation, we introduce EmoRLTalk, a novel framework that integrates offline reinforcement learning to implicitly capture the intricate relationship between 3D facial landmarks and blendshape parameters, thereby enhancing the granularity of emotional expression. Furthermore, we harness the strength of conditional diffusion models to synthesize facial motions that are emotionally coherent with the input speech. Additionally, based on the multi-task learning paradigm, we construct a collaborative training framework of a regression main task and a classification sub-task. Specifically, we use emotion classification of blendshape as a sub-task to further improve the model’s ability to express facial emotions. To further enhance system controllability, we integrate the ControlNet module, allowing users to achieve precise facial expression control. Extensive experiments demonstrate that EmoRLTalk achieves superior emotional expressiveness and lip-sync performance compared to previous approaches.

Abstract:
Continuous and reliable underwater monitoring is essential for assessing marine biodiversity, detecting ecological changes and supporting autonomous exploration in aquatic environments. Underwater monitoring platforms rely mainly on visual data for marine biodiversity analysis, ecological assessment and autonomous exploration. However, underwater environments present significant challenges due to light scattering, absorption and turbidity, which degrade image clarity and distort colour information, making accurate observation difficult. To address these challenges, we propose DEEP-SEA, a novel deep learning-based underwater image restoration model that enhances both low- and high-frequency information while preserving spatial structures. The proposed Dual-Frequency Enhanced Self-Attention Spatial and Frequency Modulator adaptively refines feature representations in the frequency domain and, simultaneously, spatial information for better structural preservation. Our comprehensive experiments on the EUVP and LSUI datasets demonstrate its superiority over the state of the art in restoring fine-grained image detail and structural consistency. By effectively mitigating underwater visual degradation, DEEP-SEA has the potential to improve the reliability of underwater monitoring platforms for more accurate ecological observation, species identification and autonomous navigation.

Abstract:
This paper presents an interaction-aware, modular framework for local trajectory planning in autonomous driving, particularly suited for multi-agent racing scenarios. Our framework first identifies viable drivable areas (tunnels), taking into account predictions of other agents’ behaviors, and subsequently utilizes a high-level decision-making module to select the optimal corridor considering both static and moving vehicles. This decision-making module also strategically determines when to follow an opponent or initiate an overtaking maneuver, while ensuring compliance with racing regulations. A Model Predictive Control (MPC) module is then employed to compute an optimal, collision-free trajectory within the chosen corridor. The proposed modular architecture simplifies the computational complexity typically associated with MPC optimization and facilitates independent component testing. Simulations and real-world tests on various racing tracks demonstrate the efficacy of our approach, even in highly dynamic interactive scenarios with multiple simultaneous opponents; videos of these and additional experiments are available at https://atoschi.github.io/tunnels-framework/.

Abstract:
Shape completion networks have recently been used in real-world robotic experiments to complete missing or hidden information in environments where objects are observed in only one or a few instances and self-occlusions are bound to occur. Nowadays, most approaches rely on deep neural networks that handle rich 3D point cloud data, leading to more precise and realistic object geometries. However, these models still suffer from inaccuracies due to their nondeterministic/stochastic inferences, which can lead to poor performance in grasping scenarios where these errors compound into unsuccessful grasps. We present an approach to calculate the uncertainty of a 3D shape completion model during inference on single-view point clouds of an object on a tabletop. In addition, we propose an update to the quality score of grasp pose algorithms by introducing the uncertainty of the completed point cloud present in the grasp candidates. To test our full pipeline, we perform real-world grasping with a 7-DoF robotic arm with a two-finger gripper on a large set of household objects and compare against previous approaches that do not measure uncertainty. Our approach ranks the grasp quality better, leading to a higher grasp success rate for the rank 5 grasp candidates compared to the state of the art. Project Page: https://nunoduarte.github.io/pages.3dsgrasp++

Abstract:
A magnetically controlled catheter system is proposed to enhance the precision and safety of vascular interventions by reducing procedure time and radiation exposure. The system can also function as a support channel for guidewire deployment. A novel navigation approach is introduced, employing an external permanent magnet capable of controlled rotation to actuate a catheter with an embedded magnetic tip. Leveraging magnetic coupling and a redundant rotational DoF, the system achieves fine angular tip control with minimal spatial displacement, significantly enhancing maneuverability in constrained vascular environments. The magnetic field distribution and its influence on catheter response are characterized, and a kinematic model of the actuation mechanism is established. Experimental validation is conducted under varying magnetic field strengths and orientations, demonstrating reliable steering performance. Application-based experiments in simulated clinical environments further confirm precise navigation capability. The results highlight the advantages of rotational magnetic control in enhancing flexibility and accuracy. The proposed system presents a promising solution for automating catheter-based interventions, offering improved efficiency and power in minimally invasive procedures.

Abstract:
Chemical reactions constitute a cornerstone of fundamental scientific inquiry, yet traditional methodologies and platforms are encumbered by excessive reagent and consumable demands. Emerging alternatives, such as microfluidic systems, while innovative, suffer from intricate fabrication processes and elevated costs associated with operator training. Other contemporary approaches face limitations including reagent compatibility constraints and prohibitively expensive instrumentation. To address these challenges, this study introduces a contactless chemical reaction platform leveraging an ultrasonic vortex field to achieve stable capture, microscale droplet transport, and sequential multi-droplet mixing without direct contact. This platform substantially reduces contamination risks, minimizes reagent and consumable usage, accommodates a broad spectrum of reagent types, and imposes minimal demands on operator expertise. Demonstrating robust performance in microdose reaction control, the system offers significant potential for advancing chemical research and its applications.

Abstract:
In this paper, we propose a dynamic model and control framework for tendon-driven continuum robots (TDCRs) with multiple segments and an arbitrary number of tendons per segment. Our approach leverages the Clarke transform, the Euler-Lagrange formalism, and the piecewise constant curvature assumption to formulate a dynamic model on a two-dimensional manifold embedded in the joint space that inherently satisfies tendon constraints. We present linear and constraint-informed controllers that operate directly on this manifold, along with practical methods for preventing negative tendon forces without compromising control fidelity. This opens up new design possibilities for overactuated TDCRs with improved force distribution and stiffness without increasing controller complexity. We validate these approaches in simulation and on a physical prototype with one segment and five tendons, demonstrating accurate dynamic behavior and robust trajectory tracking under real-time conditions.

Abstract:
Search and rescue (SAR) operations in large-scale disaster sites, such as areas affected by earthquakes, require rapid victim detection. While drones equipped with cameras are commonly used for SAR, their effectiveness is limited in visually obstructed environments, because of debris, smoke, or fog. Under such situations, auditory information can play a crucial role in locating victims who are not visible. Existing drone audition research has demonstrated the feasibility of detecting sound sources using onboard microphone arrays. However, most studies focus on single-drone systems, which face limitations in coverage and accessibility, particularly in complex environments such as collapsed buildings or urban canyons. Additionally, real-world validation of multi-drone audition systems remains limited, with prior studies relying primarily on simulations or controlled environments. To address these challenges, we propose and evaluate a Multi-Drone and Robot-Based Active Audition System (SAAS-RD: Swarm Active Audition System with Robots and Drones) that integrates multiple drones and ground robots to enhance acoustic search capabilities. Our work focuses on real-world performance validation, conducting field experiments in outdoor environments and analyzing system feasibility through case studies. The results demonstrate the potential of SAAS-RD as a practical solution for large-scale SAR operations.

Abstract:
Reward engineering for policy learning has been a long-standing challenge in robotics. Recently, to avoid manual reward designs, vision-language models (VLMs) have shown promise in defining rewards for teaching robots manipulation skills. However, existing work often provides reward guidance that is too coarse, leading to insufficient learning processes. In this paper, we address this issue by implementing more fine-grained reward guidance. Specifically, we decompose tasks into simpler sub-tasks, using this decomposition to offer more informative reward guidance with VLMs. We also propose a VLM-based self-imitation learning process to speed up learning. Empirical evidence demonstrates that our algorithm consistently outperforms baselines such as CLIP, LIV, and RoboCLIP. Specifically, our algorithm achieves a 5:4 higher average success rate compared to the best baseline, RoboCLIP, across a series of manipulation tasks.

Abstract:
Social robots have shown significant potential in enhancing learning experiences, and humor has been proven to be beneficial for learning. This study investigates the impact of both the presence and timing of humor on students’ learning outcomes and overall learning experience. A total of 24 participants were randomly assigned to one of the three conditions: (C1) interact with a robot with no humor, (C2) interact with a robot with humor at pre-defined moments during the lesson, and (C3) interact with a robot that triggers humor based on engagement levels. The results revealed that the humor at pre-defined moments condition (C2) led to significantly better learning outcomes and longer interaction times compared to the other two conditions. While the adaptive humor in Condition C3 did not significantly outperform Condition C1, it showed positive effects on participants’ perceived learning effectiveness and engagement. These findings contribute to the understanding of how humor, when strategically timed, can enhance the effectiveness of social robots in educational settings.

Abstract:
Odometry estimation remains a critical challenge for wheeled robots, as reducing its drift directly mitigates dependency on external localization systems. This paper proposes a distributed odometry framework for steerable wheels, named ICF-DO, which is applicable to both Steerable Wheeled Mobile Robots (SWMRs) and cooperative multi-single-wheel robot systems. The proposed method features low computational complexity and reduced drift, while demonstrating strong robustness in communication-restricted scenarios. Additionally, singularity can be processed in a distributed manner in the proposed framework. Experimental validation on a real physical SWMR platform demonstrates the effectiveness and practicality of the proposed method.

Abstract:
Deception is a crucial strategy in adversarial scenarios, yet its application in multi-agent confrontations remains understudied. This paper investigates deception in a multi-robot Target-Attacker-Defender (MR-TAD) game, where Attackers aim to capture Targets while evading Defenders. To model deception effectively, we propose a hierarchical decision-making framework that integrates multi-agent reinforcement learning (MARL) for high-level deceptive strategies and optimal control for low-level motion control. Furthermore, we introduce a novel composite deception-oriented reward function, which combines hitting rewards, belief switch rewards, and position advantage rewards to facilitate the training of deceptive behaviors. Simulation results across varying numbers of robots demonstrate that incorporating deception significantly increases the success rate of Attackers, with an average improvement of over 70% compared to non-deceptive strategies. Additionally, real-world experiments with omnidirectional mobile robots further confirm the effectiveness of the proposed method. This study establishes a generalizable framework for modeling deception in multi-agent systems, with potential applications in various multi-agent scenarios.

Abstract:
Automating labor-intensive tasks such as crop monitoring with robots is essential for enhancing production and conserving resources. However, autonomously monitoring horticulture crops remains challenging due to their complex structures, which often result in fruit occlusions. Existing view planning methods attempt to reduce occlusions but either struggle to achieve adequate coverage or incur high robot motion costs. We introduce a global optimization approach for view motion planning that aims to minimize robot motion costs while maximizing fruit coverage. To this end, we leverage coverage constraints derived from the set covering problem (SCP) within a shortest Hamiltonian path problem (SHPP) formulation. While both SCP and SHPP are well-established, their tailored integration enables a unified framework that computes a global view path with minimized motion while ensuring full coverage of selected targets. Given the NP-hard nature of the problem, we employ a region-prior-based selection of coverage targets and a sparse graph structure to achieve effective optimization outcomes within a limited time. Experiments in simulation demonstrate that our method detects more fruits, enhances surface coverage, and achieves higher volume accuracy than the motion-efficient baseline with a moderate increase in motion cost, while significantly reducing motion costs compared to the coverage-focused baseline. Real-world experiments further confirm the practical applicability of our approach.
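To make the combination of set-cover constraints and a shortest Hamiltonian path concrete, the toy sketch below enumerates every subset of candidate views that covers all targets and every visiting order, then keeps the cheapest path. This exhaustive search is only feasible for a handful of views and is not the optimization used in the paper, which relies on region priors and a sparse graph; the function name `plan_views`, the view layout, and the Euclidean motion cost are assumptions for illustration.

```python
import itertools
import math


def plan_views(start, views, targets):
    """Cheapest open path from `start` through views that jointly cover `targets`.

    views maps a view name to (position, set of targets visible from it).
    Returns (visiting order, path length), or (None, inf) if coverage is impossible.
    """
    best_order, best_cost = None, math.inf
    names = sorted(views)
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            covered = set().union(*(views[n][1] for n in subset))
            if not targets <= covered:  # set-cover constraint violated
                continue
            for order in itertools.permutations(subset):
                points = [start] + [views[n][0] for n in order]
                cost = sum(math.dist(a, b) for a, b in zip(points, points[1:]))
                if cost < best_cost:
                    best_order, best_cost = order, cost
    return best_order, best_cost


if __name__ == "__main__":
    candidate_views = {
        "A": ((0.0, 0.0), {1, 2}),
        "B": ((1.0, 0.0), {2, 3}),
        "C": ((5.0, 0.0), {3}),
        "D": ((1.0, 1.0), {1}),
    }
    print(plan_views((0.0, 0.0), candidate_views, {1, 2, 3}))  # (('A', 'B'), 1.0)
```

In this example views A and B together see all three targets and lie closest to the start, so the path start, A, B with length 1.0 is selected.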

Abstract:
We propose a Visual Place Recognition (VPR) framework by sharing lightweight keypoint extraction modules for local features. Current research on the joint learning of local keypoint matching and VPR is relatively scarce, and the application deployment of real-time spatial computing on edge devices has a high learning cost. There is also a significant spatial structural difference between existing VPR methods and the scenarios in practical applications. To address these issues, we design a joint learning framework for local keypoint extraction and VPR, which shares local features and fuses irregularly distributed key features in space through self-attention and cross-attention mechanisms. Our framework achieves excellent results on several VPR datasets. In particular, we introduce a new VPR dataset, called TJPark, which has a significant spatial information difference from common street view data. Our method demonstrates that local features with strong generalization capabilities effectively help enhance the generalization of VPR. Our open source code and dataset are available at: https://github.com/ShuaiAlger/LGPR.

Abstract:
Despite the nuclear industry’s reliance on advanced robots being operated by humans, much of the existing research overlooks the operator’s perspective in the context of nuclear decommissioning. This study aims to address this gap by identifying the specific needs and requirements of robot operators in nuclear environments. Three focus groups of experienced robot operators from the UK Atomic Energy Authority and Sellafield Ltd. were conducted to explore key themes, including the operator’s role, tasks where robots are employed, and the risks associated with robot use. Findings reveal that: (1) robots in critical tasks are typically controlled by a team of operators; (2) for human-robot interfaces safety and reliability are the most important features, before effectiveness, intuitiveness and task focus; (3) due to high task variety operators see a need for various types of robots; and (4) operator error is regarded as the most significant and unpredictable risk. Based on these insights, a comprehensive set of 10 robot-specific requirements and 10 overall user requirements has been formulated. The paper provides recommendations for robot operators and designers, detailing how these identified requirements can inform the development of future teleoperated robots for nuclear decommissioning tasks.

Abstract:
Aerial robots have demonstrated significant potential in suspended cargo transportation, especially in industries such as logistics and food delivery. Due to the underactuated and nonlinear dynamics of the cable-suspended system, directly tracking a given trajectory with a multicopter without modifying its controller often leads to significant payload swing. This compromises the safety and stability of the cargo. To address this issue, this paper proposes an online trajectory refinement method for a variable-length cable-suspended aerial transportation robot, independent of the control layer. By incorporating payload swing angle information, the reference trajectory is refined in real time, effectively suppressing payload oscillations during transportation. Specifically, Lyapunov techniques and LaSalle’s invariance theorem are employed to rigorously guarantee the feasibility of the designed trajectory refinement scheme. Finally, hardware experiments are conducted to validate the effectiveness and superiority of the proposed method. The results demonstrate that the refined trajectory not only enables precise positioning of the multicopter, but also effectively suppresses payload oscillations during transportation, significantly enhancing the safety and reliability of the aerial cargo delivery.

Abstract:
Robotic-assisted dressing has the potential to significantly aid both patients and healthcare personnel, reducing the workload and improving efficiency in clinical settings. While substantial progress has been made in robotic dressing assistance, prior works typically assume that garments are already unfolded and ready for use. However, in medical applications gowns and aprons are often stored in a folded configuration, requiring an additional unfolding step. In this paper, we introduce the pre-dressing step, the process of unfolding garments prior to assisted dressing. We leverage imitation learning for learning three manipulation primitives, including both high and low acceleration motions. In addition, we employ a visual classifier to categorise the garment state as closed, partly opened, or fully opened. We conduct an empirical evaluation of the learned manipulation primitives as well as their combinations. Our results show that highly dynamic motions alone are not effective for unfolding freshly unpacked garments, whereas a combination of motions can efficiently improve the opening configuration.

Abstract:
With the rapid development of 5G technology and the increasing demand for autonomous mobile robots, there is a trend to leverage the ultra-low latency, high data rates, and reliable wireless connectivity offered by 5G to improve the perception and navigation of robots in unknown environments. This paper presents a novel approach for creating and exploiting radio-aware semantic maps to empower 5G-enabled mobile robots operating within an unknown environment. The proposed solution allows for smart offloading of robotic applications and task processing onto the edge systems while facilitating real-time data exchange, and enables robots to gather environment data from both onboard sensors and the mobile network for more efficient robot operation and resource orchestration decisions. A radio-aware semantic mapping framework is introduced, which combines radio signal quality information with semantic mapping techniques to create a comprehensive understanding of the environment, which may evolve over time. The semantic map, enriched with radio quality measurement data, enables mobile robots to make timely informed decisions by considering real-time radio quality variations. Our experimental evaluation demonstrates the effectiveness of adopting radio semantic maps to enhance real-time robot operations on navigation and task offloading in unstructured environments.

Abstract:
Accurately estimating sound source positions is crucial for robot audition. However, existing sound source localization methods typically rely on a microphone array with at least two spatially preconfigured microphones. This requirement hinders the applicability of microphone-based robot audition systems and technologies. To alleviate these challenges, we propose an online sound source localization method that uses a single microphone mounted on a mobile robot in reverberant environments. Specifically, we develop a lightweight neural network model with only 43k parameters to perform real-time distance estimation by extracting temporal information from reverberant signals. The estimated distances are then processed using an extended Kalman filter to achieve online sound source localization. To the best of our knowledge, this is the first work to achieve online sound source localization using a single microphone on a moving robot. Extensive experiments demonstrate the effectiveness and merits of our approach. To benefit the broader research community, we have open-sourced our code at https://github.com/JiangWAV/single-mic-SSL.
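
The step from estimated distances to a source position can be illustrated with a range-only extended Kalman filter update. The following is a minimal sketch and not the authors' implementation: it assumes a static source in 2D, a known robot position at each step, and a scalar range noise variance. The function name `ekf_range_update` and all parameter names are hypothetical.

```python
import math


def ekf_range_update(x, P, robot_xy, z, r_var):
    """One EKF measurement update for a static 2D source observed by range only.

    x        : (sx, sy) current estimate of the source position (assumed static)
    P        : 2x2 symmetric covariance as nested lists
    robot_xy : (rx, ry) known robot position when the range was estimated
    z        : measured distance to the source (e.g. from a distance estimator)
    r_var    : variance of the range measurement noise (assumed value)
    """
    dx, dy = x[0] - robot_xy[0], x[1] - robot_xy[1]
    pred = math.hypot(dx, dy)              # predicted range h(x)
    if pred < 1e-9:
        return x, P                        # Jacobian undefined at zero range; skip update
    h = (dx / pred, dy / pred)             # Jacobian of the range w.r.t. (sx, sy)
    # P h^T (a 2-vector); because P is symmetric this also equals (h P)^T
    ph = (P[0][0] * h[0] + P[0][1] * h[1],
          P[1][0] * h[0] + P[1][1] * h[1])
    s = h[0] * ph[0] + h[1] * ph[1] + r_var  # innovation variance (scalar)
    k = (ph[0] / s, ph[1] / s)               # Kalman gain
    innov = z - pred
    x_new = (x[0] + k[0] * innov, x[1] + k[1] * innov)
    # P_new = (I - K H) P = P - k (h P)
    P_new = [[P[i][j] - k[i] * ph[j] for j in range(2)] for i in range(2)]
    return x_new, P_new
```

Calling this once per distance estimate while the robot moves lets ranges taken from different positions constrain the source location; a single range alone only constrains the estimate along the line of sight.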

Abstract:
In this paper, we propose a design of an underwater soft snake-like robot prototype that uses two actuators made of 3D-printed soft materials to build the robot body. Control signals with appropriate displacement phases and different voltages are used to control the water pump to drive the soft actuator to bend to generate a sine wave with increasing amplitude along the body axis. We test customized tail materials, phase shifts, and voltage growth rate signals to observe the effects of different parameters on the movement of the snake robot in water. Experiments show that the movement speed is positively correlated with the swing amplitude of the snake robot's motion module. In addition, measured data show that swimming efficiency and movement speed are also affected by tail flexibility and movement gait. When the phase offset is 2/3π, the tail is made of harder PLA material, and the voltage growth rate is 1.2, the maximum underwater movement speed achieved by the snake robot is 4.464 cm/s (0.076 BL/s). We also found that when the phase offset increases, the snake motion speed and motion efficiency first increase and then decrease. The results obtained in this study will aid in the advancement of soft, slender swimming robots and improve the understanding of the swimming capabilities of both robots and sea snakes.

Abstract:
The growing use of service robots in dynamic environments requires flexible management of on-board compute resources to optimize the performance of diverse tasks such as navigation, localization, and perception. Current robot deployments often rely on static OS configurations and system over-provisioning. However, they are suboptimal because they ignore variations in resource usage, leading to system-wide issues like robot instability or inefficient resource utilization. This paper presents ConfigBot, a novel system designed to adaptively reconfigure robot applications to meet a predefined performance specification by leveraging runtime profiling and automated configuration tuning. Through experiments on multiple real robots, each running a different stack with diverse performance requirements, which could be context-dependent, we illustrate ConfigBot's efficacy in maintaining system stability and optimizing resource allocation. Our findings highlight the promise of automatic system configuration tuning for robot deployments, including adaptation to dynamic changes. Code available at: https://github.com/ldos-project/configbot

Abstract:
Magnetic microswarms have attracted significant attention in medical robotics, owing to their potential for performing complex tasks in challenging environments. However, developing microswarms that can operate effectively in both liquid and air environments remains a substantial challenge. This study presents the design and characterization of hydrogel-based microswarms composed of magnetic hydrogel particles prepared from agarose hydrogel and NdFeB magnetic microparticles. These microswarms form stable monolayer structures actuated by rotating magnetic fields at high frequencies (10 Hz) in liquid environments, enabling synchronization with the external magnet and achieving translational motion. Actuated by an oscillating magnetic field, the swarms transition from a monolayer configuration to a three-dimensional (3D) structure in the air environment. Experimental results demonstrate that the 3D swarms are capable of navigating complex terrains and interacting with tissue surfaces in air environments. Finally, we demonstrate the potential of these 3D swarms for targeted delivery and adaptive filling of gastric perforations using an ex vivo gastric tissue model, showcasing their potential for biomedical applications.

Abstract:
Locomotion in robots remains an unsolved challenge, particularly for those with complex structures and dynamic environments. Consequently, the control systems for such robots must place greater emphasis on risk mitigation and safety considerations to ensure reliable and stable operation. Existing studies have explicitly incorporated risk factors into policy training, but lacked the ability to adaptively adjust the risk sensitivity for hazardous environments. This deficiency impacts the agent’s exploration during training and thus fails to select the optimal action. We innovatively introduce Adaptive Risk-aware Control (ARC) policies based on Distributional Reinforcement Learning (Dist.RL), a novel framework that dynamically adjusts risk sensitivity levels in response to changing environmental conditions. Our approach uniquely integrates two key components: (1) the Inter Quartile Range (IQR) for quantifying intrinsic environmental uncertainty, and (2) Random Network Distillation (RND) for evaluating parameter uncertainty. This dual-mechanism architecture represents a significant advancement in risk assessment methodologies. Simulations conducted on a variety of robots have demonstrated that our method achieves significantly more robust performance compared to other approaches. Furthermore, sim2real validation on a humanoid robot confirms the practical viability of our approach.

Abstract:
Learning from Demonstration (LfD) provides an efficient approach to acquiring diverse underwater skills, with task-parameterized learning enhancing the generalization of policies. However, collecting comprehensive underwater demonstrations across various conditions remains a significant challenge. In this work, we propose an adversarial trajectory augmentation method for Task Parameterized Hidden Semi-Markov Models (TP-HSMM) based on digital twins, inspired by adversarial example generation. Our method aims to improve the performance of motion policies by utilizing adversarial trajectory generation and retraining, leveraging low-cost demonstrations from digital twins. We evaluate the proposed adversarial trajectory augmentation method on two datasets. Comparative experiments demonstrate its effectiveness in reducing trajectory generation errors in new scenarios. Finally, we validate the method through an underwater humanoid plugging experiment, showing that it achieves similar performance to the baseline with fewer demonstrations.

Abstract:
This paper introduces Dynamic Learning from Learned Hallucination (Dyna-LfLH), a self-supervised method for training motion planners to navigate environments with dense and dynamic obstacles. Classical planners struggle with dense, unpredictable obstacles due to limited computation, while learning-based planners face challenges in acquiring high-quality demonstrations for imitation learning or dealing with exploration inefficiencies in reinforcement learning. Building on Learning from Hallucination (LfH), which synthesizes training data from past successful navigation experiences in simpler environments, Dyna-LfLH incorporates dynamic obstacles by generating them through a learned latent distribution. This enables efficient and safe motion planner training. We evaluate Dyna-LfLH on a ground robot in both simulated and real environments, achieving up to a 25% improvement in success rate compared to baselines.

Abstract:
One of the primary challenges in building a General Purpose Service Robot (GPSR), i.e. a robot capable of executing generic human commands, lies in acting upon natural language instructions. These instructions often contain speech recognition errors and incomplete information, complicating the extraction of clear goals and the formulation of an efficient and effective action plan. This work presents a pipeline that leverages a Large Language Model to directly translate instruction transcripts into coherent action plans. The pipeline also integrates environmental context into the model’s input, allowing for the generation of more efficient and context-aware plans. The system’s performance was evaluated using a simulator based on generalized stochastic Petri Nets, achieving a success rate of around 55% on the ALFRED dataset, even in unseen environments. The entire pipeline was also successfully deployed at RoboCup 2024 in Eindhoven, where it secured second place in the GPSR task. The code, dataset, and models are available at https://github.com/socrob/llm_gpsr.

Abstract:
Legged robots, particularly quadrupeds, excel at navigating rough terrains, yet their performance under vertical ground perturbations, such as those from oscillating surfaces, remains underexplored. This study introduces a novel approach to enhance quadruped locomotion robustness by training the Unitree Go2 robot on an oscillating bridge—a 13.24-meter steel-and-concrete structure with a 2.0 Hz eigenfrequency designed to perturb locomotion. Using Reinforcement Learning (RL) with the Proximal Policy Optimization (PPO) algorithm in a MuJoCo simulation, we trained 15 distinct locomotion policies, combining five gaits (trot, pace, bound, free, default) with three training conditions: rigid bridge and two oscillating bridge setups with differing height regulation strategies (relative to bridge surface or ground). Domain randomization ensured zero-shot transfer to the real-world bridge. Our results demonstrate that policies trained on the oscillating bridge exhibit superior stability and adaptability compared to those trained on rigid surfaces. Our framework enables robust gait patterns even without prior bridge exposure. These findings highlight the potential of simulation-based RL to improve quadruped locomotion during dynamic ground perturbations, offering insights for designing robots capable of traversing vibrating environments.

Abstract:
In this paper, we integrate multiple IMUs into a triple-section continuum manipulator to precisely capture the attitude data of each section’s end disk. Leveraging the sensory and mechanical hardware system, we construct a sophisticated coordinate transformation scheme to accurately identify the detailed configuration states of the manipulator. Additionally, we introduce a numerical optimization strategy to develop a unified forward and inverse kinematic modeling framework, ensuring both iterative efficiency and accuracy. Through the IMUs’ real-time attitude feedback, we implement a closed-loop controller, enhancing the manipulator’s operational robustness and agility. In our experimental evaluations, we assess the convergence performance of both forward and inverse kinematics within a simulated environment and validate the precision of these kinematic models through real-time experiments on an actual continuum manipulator. Moreover, we evaluate the performance of the proposed controller by examining its accuracy during the manipulator’s continuous motions and analyzing its response characteristics. In contrast to previous research on continuum robots, we pioneer a fully integrated kinematic control architecture that is successfully implemented on a physical continuum robotic system.

Abstract:
How to endow aerial robots with the ability to operate in close proximity remains an open problem. The core challenges lie in the propulsion system’s dual-task requirement: generating manipulation forces while simultaneously counter-acting gravity. These competing demands create dynamic coupling effects during physical interactions. Furthermore, rotor-induced airflow disturbances critically undermine operational reliability. Although fully-actuated unmanned aerial vehicles (UAVs) alleviate dynamic coupling effects via six-degree-of-freedom (6-DoF) force-torque decoupling, existing implementations fail to address the aerodynamic interference between drones and environments. They also suffer from oversized designs, which compromise maneuverability and limit their applications in various operational scenarios. To address these limitations, we present FLOAT Drone (FuLly-actuated cOaxial Aerial roboT), a novel fully-actuated UAV featuring two key structural innovations. By integrating control surfaces into fully-actuated systems for the first time, we significantly suppress lateral airflow disturbances during operations. Furthermore, a coaxial dual-rotor configuration enables a compact size while maintaining high hovering efficiency. Through dynamic modeling, we have developed hierarchical position and attitude controllers that support both fully-actuated and underactuated modes. Experimental validation through comprehensive real-world experiments confirms the system’s functional capabilities in close-proximity operations.

Abstract:
Multi-robot localization is a crucial task for implementing multi-robot systems. Numerous researchers have proposed optimization-based multi-robot localization methods that use camera, IMU, and UWB sensors. Nevertheless, the characteristics of individual robot odometry estimates and of the distance measurements between robots used in the optimization are not sufficiently considered. In addition, previous research has been heavily influenced by the accuracy of the odometry estimated by individual robots. Consequently, long-term drift error caused by error accumulation may be inevitable. In this paper, we propose a novel visual-inertial-range-based multi-robot localization method, named SaWa-ML, which enables geometric structure-aware pose correction and weight adaptation-based robust multi-robot localization. Our contributions are twofold: (i) we leverage UWB sensor data, whose range error does not accumulate over time, to first estimate the relative positions between robots and then correct the positions of each robot, thus reducing long-term drift errors, and (ii) we design adaptive weights for robot pose correction by considering the characteristics of the sensor data and visual-inertial odometry estimates. The proposed method has been validated in real-world experiments, showing a substantial performance increase compared with state-of-the-art algorithms.

Abstract:
Accurate modeling of dynamic systems is essential for robotics, enhancing system perception and control performance. This work tackles causal modeling challenges for mobile robots under complex uncertainties, including internal model inaccuracies and external environmental disturbances. Unlike first-principle or purely data-driven methods, we propose Pet-NODE, an advanced Neural Ordinary Differential Equation (NODE) framework that integrates physical priors with temporal features for high-fidelity system modeling. To further embed domain knowledge, we introduce a novel loss function with self-prediction objectives, ensuring adherence to physical principles. Extensive experimental evaluations, including ablation studies and comparisons against a nominal model, K-NODE, and PI-TCN methods, demonstrate Pet-NODE’s robustness, interpretability, and superior localization accuracy on a self-collected wheeled robot dataset.

Abstract:
We present MATRICS, a traffic-aware multi-agent reinforcement learning (MARL)-based intelligent lane-change system designed for autonomous vehicles (AVs). While existing research primarily focuses on enhancing the local impact of the ego vehicle’s lane-change decisions, MATRICS stands out by optimizing both local and global performance, i.e., aiming not only to improve the traffic efficiency, driving safety, and driver comfort of the ego vehicle, but also to enhance overall traffic flow within a designated road segment. Through an extensive review of the transportation literature, we construct a novel state space integrating local traffic information collected from surrounding vehicles and global traffic data obtained from roadside units (RSUs). We develop a reward function to guide judicious lane-change decisions, considering both ego vehicle performance and traffic flow enhancement. Our local density-aware multi-agent double deep Q-network (DDQN) algorithm facilitates effective cooperation among agents in executing lane-change maneuvers. Simulation results demonstrate MATRICS’ superior performance across metrics of traffic efficiency, driving safety, and driver comfort in comparison with a state-of-the-art MARL model.

Abstract:
Cell rotation plays a crucial role in micromanipulation. Among manual cell rotation techniques, single-micropipette cell rotation is widely adopted due to its high efficiency and flexibility. However, there is currently no method capable of achieving automated single-micropipette cell rotation. In this study, we developed the first three-dimensional (3D) simulation system for single-micropipette cell rotation. Based on this simulation system, we successfully achieved single-micropipette cell rotation imitation learning (IL) for the first time. Specifically, we first analyze the forces acting on cells in the fluid, establishing a dynamic model that describes the cell's behavior in response to the flow velocity at the holding micropipette's orifice, the relative position of the micropipette, and time. We then developed the cell rotation simulation environment by discretizing the model and designing the simulation's cell and holding micropipette models based on real-world conditions. Finally, we designed a network architecture for IL using this model, achieving single-micropipette cell rotation in simulation. The results demonstrate that the simulation system exhibits a relative error range of 5.34% to 12.21% compared to real-world experiments, indicating a high degree of accuracy. Additionally, the single-micropipette cell rotation task achieved a success rate of 69% with an average completion time of 17.13 seconds, closely matching the expert data's average time of 17.69 seconds, confirming the feasibility of the simulation system.

Abstract:
Multi-robot systems are increasingly deployed in applications such as intralogistics or autonomous delivery, where multiple robots collaborate to complete tasks efficiently. One of the key factors enabling their efficient cooperation is Multi-Robot Task Allocation (MRTA). Algorithms solving this problem optimize task distribution among robots to minimize the overall execution time. In shared environments, apart from the relative distance between the robots and the tasks, the execution time is also significantly impacted by the delay caused by navigating around moving people. However, most existing MRTA approaches are dynamics-agnostic, relying on static maps and neglecting human motion patterns, leading to inefficiencies and delays. In this paper, we introduce Human-Aware Task Allocation (HATA). This method leverages Maps of Dynamics (MoDs), spatio-temporal queryable models designed to capture historical human movement patterns, to estimate the impact of humans on the task execution time during deployment. HATA utilizes a stochastic cost function that includes MoDs. Experimental results show that integrating MoDs enhances task allocation performance, reducing mission completion times by up to 26% compared to the dynamics-agnostic method and up to 19% compared to the baseline. This work underscores the importance of considering human dynamics in MRTA within shared environments and presents an efficient framework for deploying multi-robot systems in environments populated by humans.

Abstract:
Robots with active variable stiffness (VS) capabilities can potentially achieve safer interactions with humans and better adaptability to uncertainties in complex environments. Currently, the conventional jamming or phase-change-based VS mechanisms simultaneously act on the stiffnesses of the robotic joint in all axes, making it difficult to achieve decoupled stiffness programming in different directions/axes. To overcome this challenge, a bio-inspired stiffness-programmable robotic flexible joint (SPRFJ) based on electro-adhesive (EA) clutches is proposed. By programming the ON/OFF states of the EA clutches on different surfaces around the SPRFJ, customization of stiffness profiles in different directions/axes can be realized, and therefore, the load-bearing capacity and flexibility of the robotic arm can be adjusted. An SPRFJ prototype consisting of four EA clutch units is developed, and through extensive experiments, we demonstrate that it can achieve a stiffness change of 21 times and can withstand resisting forces up to 13.41 N at 1 kV. The reliable multi-directional stiffness programmability of the SPRFJ is shown via extensive tests. Demonstrations of stable position locking at different angles and free movements while carrying payloads are conducted to showcase its application in soft robotics. The SPRFJ developed in this work possesses potential for use in industrial robots, search-and-rescue missions, and space exploration.

Abstract:
Learning-based controllers are often purposefully kept out of real-world applications due to concerns about their safety and reliability. We explore how state-of-the-art world models in Model-Based Reinforcement Learning can be utilized beyond the training phase to ensure a deployed policy only operates within regions of the state-space it is sufficiently familiar with. This is achieved by continuously monitoring discrepancies between a world model’s predictions and observed system behavior during inference. It allows for triggering appropriate measures, such as an emergency stop, once an error threshold is surpassed. This does not require any task-specific knowledge and is thus universally applicable. Simulated experiments on established robot control tasks show the effectiveness of this method, recognizing changes in local robot geometry and global gravitational magnitude. Real-world experiments using an agile quadcopter further demonstrate the benefits of this approach by detecting unexpected forces acting on the vehicle. These results indicate how even in new and adverse conditions, safe and reliable operation of otherwise unpredictable learning-based controllers can be achieved.
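
The monitoring idea described above, comparing a world model's predictions with observed behavior and triggering a safety measure once an error threshold is surpassed, can be sketched as follows. This is an illustrative assumption rather than the authors' method: the class name `DiscrepancyMonitor`, the Euclidean error, the moving-average window, and the threshold value are all hypothetical choices.

```python
import math
from collections import deque


class DiscrepancyMonitor:
    """Flags when recent world-model prediction errors exceed a threshold."""

    def __init__(self, threshold, window=5):
        self.threshold = threshold          # error level that triggers a stop (assumed)
        self.errors = deque(maxlen=window)  # most recent per-step prediction errors

    def step(self, predicted, observed):
        """Record one prediction/observation pair; return True if a stop should trigger."""
        # Euclidean distance between predicted and observed state vectors
        err = math.sqrt(sum((p - o) ** 2 for p, o in zip(predicted, observed)))
        self.errors.append(err)
        mean_err = sum(self.errors) / len(self.errors)
        return mean_err > self.threshold


# Usage: feed each step's predicted and observed state, stop when flagged.
monitor = DiscrepancyMonitor(threshold=0.5, window=2)
stop_now = monitor.step([0.0, 0.0], [0.0, 0.0])  # matching prediction, no trigger
```

Averaging over a short window is one way to avoid reacting to a single noisy sample; the check itself needs only predictions and observations, not task-specific knowledge.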

Abstract:
In this study, a self-standing takeoff method was developed for a servo-driven flapping robot. This method does not use any additional mechanisms or external platforms to maintain the robot’s position before takeoff. In addition to its efficiency in terms of weight, this method is also relatively easy to implement. This takeoff method extends the function of the tail not only for longitudinal direction control during flight, but also as a base to support the standing posture of the flapping robot. The objective of this investigation was to determine the optimal parameters to be implemented in the self-standing takeoff algorithm. To enhance the probability of successful takeoff, an investigation was conducted into the various parameters that influence the self-takeoff process. The investigative process was initiated with a static experiment to determine the thrust generated by the flapping robot. Subsequently, the variables of the center flapping angle, timing adjustment, and initial flapping direction were examined. A series of indoor flight experiments were conducted to evaluate the self-standing takeoff performance of the flapping robot. The experiment tested two weighted robots, the first 43 g and the second 45 g (thrust/weight (T/W) ratios of 1.02 and 0.97, respectively). The results showed that the proposed takeoff method requires only a T/W ratio of 1.02 for takeoff, less than the 1.2 previously required.

Abstract:
Visual bird’s eye view (BEV) perception, due to its excellent perceptual capabilities, is progressively replacing costly LiDAR-based perception systems, especially in the realm of urban intelligent driving. However, this type of perception still relies on LiDAR data to construct ground truth databases, a process that is both cumbersome and time-consuming. Additionally, most mass-produced autonomous driving systems are equipped solely with surround camera sensors and lack the LiDAR data necessary for precise annotation. To tackle this challenge, we propose a fine-tuning method for BEV perception networks based on visual 2D semantic perception, aimed at enhancing the model’s generalization capabilities in new scene data. Leveraging the maturity of 2D perception technologies, our method utilizes only 2D semantic segmentation labels and monocular depth estimations, thereby significantly reducing the dependence on expensive BEV ground truths and offering strong potential for industrial deployment. Extensive experiments and comparative analyses on the nuScenes and Waymo datasets demonstrate the effectiveness of our method. Specifically, it improves mAP and NDS by 2.51% and 1.93% on nuScenes, and by 1.21% and 0.78% on Waymo, respectively, validating its practical utility and robustness across diverse domains.

Abstract:
This study investigates the problem of shape self-assembly for large-scale robot swarms under conditions of information asymmetry. Existing methods assume complete sharing of global information; however, this assumption has significant limitations in terms of resource consumption and swarm emergence. On the other hand, strategies that rely entirely on local information struggle to achieve the self-assembly of complex shapes, especially disconnected shapes. To address these challenges, this study proposes a novel bio-inspired distributed self-assembly strategy specifically designed for information-asymmetric swarms. The strategy draws inspiration from the task specialization mechanism between scout ants and worker ants in social insects, guiding individuals to efficiently complete shape assembly under information asymmetry through local perception, neighborhood interactions, and dynamic rule adjustments. Experimental results show that the proposed strategy successfully achieves the self-assembly of various shapes, including simple shapes (e.g., circle, rectangle), complex shapes (e.g., human, flower, and letter "A"), and disconnected shapes (e.g., letter "IO"). This demonstrates the strategy’s adaptability to shape complexity. Furthermore, experiments with varying swarm sizes validate the strategy’s robustness and scalability across different scales. During the experiments, we unexpectedly observed emergent behaviors within the swarm, further confirming that the proposed strategy not only significantly enhances task flexibility but also strengthens swarm emergence. These results indicate that the proposed method provides an efficient, scalable, and innovative solution for swarm robot self-assembly under information asymmetry.

Abstract:
Multi-robot exploration in unknown environments is a fundamental task for multi-robot systems, which requires the coordination of the robots to avoid collisions and conflicts while performing task allocation. Existing exploration strategies improve the efficiency of multi-robot exploration by modeling the multi-robot task allocation problem as a variant of the multiple traveling salesman problem. However, this is computationally intensive and difficult to deploy on physical platforms. Hence, this paper develops a hybrid strategy for range-sensing multi-robot exploration with effective team coordination, enabling a larger team dispersion degree and higher exploration efficiency. In addition, we present a novel multi-robot exploration point detection method suitable for narrow and dynamic environments, effectively reducing exploration failure and incompleteness. The Gazebo simulations demonstrate better exploration efficiency and the least time cost of our exploration framework compared with state-of-the-art methods, and real-world experiments also validate the effectiveness. The code is released at https://github.com/NeSC-IV/DHC_ME.

Abstract:
This paper proposes a new turning theory for spherical robots, which better describes the turning mechanism of spherical robots under turning constraints, using a pendulum-driven spherical robot as an example. Compared to the previous turning theory, the new theory shows greater alignment with real-world data, especially at high speeds. Based on this new turning theory, we have constructed and optimized a new kinematic model and used this model to design a trajectory tracking algorithm that remains reliable even at high speeds. Physical experiments demonstrate that the new algorithm significantly improves trajectory tracking accuracy at high speeds. Through enhancements to the trajectory tracking algorithm, this study improves the autonomous cruising speed of spherical robots.

Abstract:
We present TartanGround, a large-scale, multi-modal dataset to advance the perception and autonomy of ground robots operating in diverse environments. This dataset, collected in various photorealistic simulation environments, includes multiple RGB stereo cameras for 360-degree coverage, along with depth, optical flow, stereo disparity, LiDAR point clouds, ground truth poses, semantic segmented images, and occupancy maps with semantic labels. Data is collected using an integrated automatic pipeline, which generates trajectories mimicking the motion patterns of various ground robot platforms, including wheeled and legged robots. We collect 878 trajectories across 63 environments, resulting in 1.44 million samples. Evaluations on occupancy prediction and SLAM tasks reveal that state-of-the-art methods trained on existing datasets struggle to generalize across diverse scenes. TartanGround can serve as a testbed for training and evaluation of a broad range of learning-based tasks, including occupancy prediction, SLAM, neural scene representation, perception-based navigation, and more, enabling advancements in robotic perception and autonomy towards achieving robust models generalizable to more diverse scenarios. The dataset and codebase are available on the webpage: https://tartanair.org/tartanground

Abstract:
Recent advancements in category-agnostic pose estimation have focused on developing a unified model capable of localizing keypoint coordinates across arbitrary categories, which enables robots to accurately interact with diverse objects by understanding their poses. While existing methods predominantly concentrate on local features surrounding the keypoints of the support image, they often overlook the importance of global features, leading to potential misalignment between the support and query image. To address the inherent conflicts between the two images, we propose AlignCAPE, a novel approach designed to mitigate such misalignment and enhance the model performance. Our method formulates a two-stage pipeline, generating initial proposals in the first stage, followed by a second stage that refines them iteratively. Specifically, we introduce two modules, the Feature Alignment Module (FAM) and the Keypoint Perception Module (KPM). FAM utilizes a bidirectional cross-attention operation to align the support image feature and query image feature, thereby compensating for the limitations of previous methods. KPM employs a self-attention mechanism to capture the interactions among keypoints, facilitating keypoint localization in the query image. Experiments on the MP-100 benchmark demonstrate that our method outperforms the widely-used baseline model in CAPE by 0.68% in the PCK@0.2 metric under the 1-shot setting.

Abstract:
Personalization is critical for the advancement of service robots. Robots need to develop tailored understandings of the environments they are put in. Moreover, they need to be aware of changes in the environment to facilitate long-term deployment. Long-term understanding as well as personalization is necessary to execute complex tasks like "prepare dinner table" or "tidy my room". A precursor to such tasks is that of Object Search. Consequently, this paper focuses on locating and searching for multiple objects in indoor environments. We propose two crucial novelties. Firstly, we propose a novel framework that can enable robots to deduce Personalized Ontologies of indoor environments. Our framework consists of a personalization schema that enables the robot to tune its understanding of ontologies. Secondly, we propose an Adaptive Inferencing strategy. We integrate Dynamic Belief Updates into our approach, which improves performance in multi-object search tasks. The cumulative effect of personalization and adaptive inferencing is an improved capability in long-term object search. This framework is implemented on top of a multi-layered semantic map. We conduct experiments in real environments and compare our results against various state-of-the-art (SOTA) methods to demonstrate the effectiveness of our approach. Additionally, we show that personalization can act as a catalyst to enhance the performance of SOTA methods. Video Link: https://bit.ly/3WHk9i9

Abstract:
Multi-robot cooperative tracking, as a vital sub-field of multi-robot collaboration, exhibits significant potential in areas such as military reconnaissance and emergency rescue. Conventional dynamic object tracking methods often face issues of incomplete target detection and even loss in complex scenes, owing to variations in viewpoint or occlusion. To address these problems, this paper proposes STC-Tracker, a multi-robot collaborative tracking system aimed at extending the lifecycle of dynamic objects. On the one hand, the system restores the original appearance of objects by retracing historical point clouds from keyframes while monitoring their motion trajectories in real time. On the other hand, by estimating the motion model of each target, our system is capable of maintaining the lifecycle of specific objects, even in cases of brief disappearance. Experiments are conducted on public and self-collected datasets. The results demonstrate that our algorithm outperforms SOTAs in both single-robot and multi-robot configurations while exhibiting low computational resource consumption. In addition, our algorithm supports LiDARs of different scanning patterns, including spinning LiDARs and solid-state LiDARs, and is capable of real-time dynamic object tracking and global map construction.
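
The idea of keeping a target alive through a brief disappearance by relying on its estimated motion model can be sketched as follows. This is a toy constant-velocity model in 2D, chosen purely for illustration. The abstract does not state which motion model STC-Tracker uses, and the function names and observations are hypothetical.

```python
def estimate_velocity(track):
    """Estimate a constant velocity from the first and last observations.

    track: list of (t, x, y) observations ordered by time.
    """
    (t0, x0, y0), (t1, x1, y1) = track[0], track[-1]
    dt = t1 - t0
    if dt <= 0:
        raise ValueError("track must span a positive time interval")
    return (x1 - x0) / dt, (y1 - y0) / dt


def predict(track, t_query):
    """Extrapolate the last observed position to time t_query."""
    vx, vy = estimate_velocity(track)
    t_last, x_last, y_last = track[-1]
    elapsed = t_query - t_last
    return x_last + vx * elapsed, y_last + vy * elapsed


# The target is seen at t = 0, 1, 2 s and then occluded; we predict t = 3 s.
observed = [(0.0, 0.0, 0.0), (1.0, 1.0, 0.5), (2.0, 2.0, 1.0)]
predicted = predict(observed, 3.0)
```

A tracker could use such a prediction to re-associate a detection that reappears near the extrapolated position instead of starting a new track.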

Abstract:
Flexible continuum robots operating in constrained and dynamic environments face significant challenges, especially when interacting with uncertain and potentially unknown conditions. Traditional model-based methods face significant difficulties due to the inherent nonlinearities and uncertainties in robot dynamics, as well as the complexities introduced by environmental interactions. This work presents a new data-driven, model-free control strategy for flexible continuum robots operating in constrained environments, leveraging Lie bracket approximations to achieve effective regulation. The method enables effective visual servoing without requiring explicit kinematic or dynamic models, making it highly adaptable to diverse scenarios where environmental constraints and robot deformation impact system performance. Additionally, it does not rely on initial state estimation, further enhancing its suitability for dynamic, uncertain environments. The effectiveness of the proposed method is validated through simulations and experiments, showing enhanced robustness and adaptability in real-time control scenarios.

Abstract:
Recent advancements in language modeling have enabled robots to more easily generate complex behaviors. However, ensuring that the generated behaviors align with the intended emotional states of the robot is necessary in many domains where robots are used. In this paper, we present an adversarial-like training regime in which a generative model of emotional behavior is enhanced through feedback from both an emotion discriminator and a novelty loss, to ensure that the generated behaviors are non-redundant. Our generative model, fine-tuned on a dataset of robot behaviors labeled with emotions, generates behavior sequences perceived as reflecting the emotional qualities of the input emotion labels. Through our training regime, the generative model is refined by minimizing the discrepancies in both emotion classification and behavioral novelty. We evaluated our approach through multiple experiments and human evaluations, where participants were asked to appraise the emotions conveyed by robot behaviors and rate the novelty of the behaviors. Experimental results demonstrate that our two models, one for classifying and one for generating emotional behaviors, are effective, with the generative model producing emotionally rich behaviors that differ from previously generated outputs.

Abstract:
This paper introduces the lacrosse mobile manipulator, a robotic system designed to play lacrosse. We focus on the task of ball passing between two robots, where the challenges include managing a small ball in the soft lacrosse head and interacting with a fast-moving ball. In this study, we develop innovative neural-network-based learning approaches to enhance performance in dynamic environments. The robots are autonomously controlled in a decentralized manner. A combination of analytical physics-based and machine learning methods is employed to refine the motion planner and the ball landing predictions in both the throwing and catching processes. The system achieves satisfactory performance in real-world ball-passing experiments.

Abstract:
Modern autopilot systems are prone to sensor attacks that can jeopardize flight safety. To mitigate this risk, we propose a modular solution: the secure safety filter, which extends the well-established control barrier function (CBF)-based safety filter to account for, and mitigate, sensor attacks. This module consists of a secure state reconstructor (which generates plausible states) and a safety filter (which computes the safe control input that is closest to the nominal one). Differing from existing work focusing on linear, noise-free systems, the proposed secure safety filter handles bounded measurement noise and, by leveraging reduced-order model techniques, is applicable to the nonlinear dynamics of drones. Software-in-the-loop simulations and drone hardware experiments demonstrate the effectiveness of the secure safety filter in rendering the system safe in the presence of sensor attacks.
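
The safety-filter half of this module (computing the safe input closest to the nominal one) can be illustrated with a minimal sketch. It assumes a toy 1D single integrator x' = u with the safe set x <= x_max and the barrier h(x) = x_max - x. The CBF condition h' + alpha * h >= 0 then reduces to u <= alpha * (x_max - x), so the closest safe input has a closed form. The secure state reconstructor, the measurement noise, and the drone dynamics from the paper are not modeled, and all names and parameter values are assumptions.

```python
def cbf_safety_filter(x, u_nominal, x_max=1.0, alpha=2.0):
    """Return the input closest to u_nominal that satisfies the CBF condition.

    For x' = u and h(x) = x_max - x, the condition h' + alpha * h >= 0
    becomes u <= alpha * (x_max - x), so the filter is a simple clamp.
    """
    u_bound = alpha * (x_max - x)
    return min(u_nominal, u_bound)


def simulate(x0=0.0, u_nominal=1.0, dt=0.01, steps=500):
    """Integrate the filtered system with forward Euler and return the final state."""
    x = x0
    for _ in range(steps):
        x += dt * cbf_safety_filter(x, u_nominal)
    return x


final_state = simulate()  # the nominal input alone would drive x past x_max
```

Far from the boundary the filter passes the nominal input through unchanged, and near the boundary it slows the state down so that it approaches `x_max` without crossing it.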

Abstract:
Reinforcement Learning (RL) has shown remarkable and generalizable capability in legged locomotion through sim-to-real transfer. However, while adaptive methods like domain randomization are expected to enhance policy robustness across diverse environments, they potentially compromise the policy’s performance in any specific environment, leading to suboptimal real-world deployment due to the No Free Lunch Theorem. To address this, we propose LoopSR, a lifelong policy adaptation framework that continuously refines RL policies in the post-deployment stage. LoopSR employs a transformer-based encoder to map real-world trajectories into a latent space and reconstruct a digital twin of the real world for further improvement. An autoencoder architecture and contrastive learning methods are adopted to enhance feature extraction of real-world dynamics. Simulation parameters for continual training are derived by combining predicted values from the decoder with retrieved parameters from a pre-collected simulation trajectory dataset. By leveraging simulated continual training, LoopSR achieves superior data efficiency compared with strong baselines, delivering strong performance with limited data in both sim-to-sim and sim-to-real experiments.

Abstract:
Decision-making in urban autonomous driving scenarios presents significant challenges due to the highly stochastic and interactive nature of traffic participants. While reinforcement learning-based approaches show promise for developing high-level driving policies, they often struggle with low sample efficiency and inadequate generalization. In this paper, we introduce a regularization-based value network to enhance the decision-making capabilities of distributional reinforcement learning, resulting in improved sample efficiency and stability. Specifically, we apply regularization techniques along with random ensemble methods to strengthen the learning process of the value network and address Q-value overestimation. Additionally, we utilize monotonic rational-quadratic splines to learn the quantile functions, naturally resolving the quantile crossing issue. Through extensive experiments, our approach demonstrates superior performance compared to baseline methods on the NoCrash benchmark, Town05, Town06 and our closed-loop OpenDD scenario. The experiment results indicate that our proposed method outperforms baseline performance in terms of success rate, safety and efficiency.

Abstract:
Enabling robots to understand human gaze targets is a crucial step toward downstream capabilities such as attention estimation and movement anticipation in real-world human-robot interactions. Prior works have addressed the in-frame target localization problem with data-driven approaches by carefully removing out-of-frame samples. Vision-based gaze estimation methods, such as OpenFace, do not effectively absorb background information in images and cannot predict gaze targets in situations where subjects look away from the camera. In this work, we propose a system to address the problem of 360-degree gaze target estimation from an image in generalized visual scenes. The system, named GazeTarget360, integrates conditional inference engines of an eye-contact detector, a pre-trained vision encoder, and a multi-scale-fusion decoder. Cross-validation results show that GazeTarget360 can produce accurate and reliable gaze target predictions in unseen scenarios. This makes GazeTarget360 a first-of-its-kind system for predicting gaze targets from realistic camera footage that is highly efficient and deployable. Our source code is made publicly available at: https://github.com/zdai257/DisengageNet.

Abstract:
Hand-coded deliberation components are prone to flaws that may not be discovered before deployment and that can be harmful to the robot and its execution environment, including the people within it. To reduce development effort and at the same time increase confidence in the robot’s safety, we propose to model deliberation components at a conceptual level, to automatically generate code from such models, and to monitor their execution during robot operation. We present two tools: one that compiles models of deliberation components into executable code, and one that generates runtime monitors from the models. We have tested them in simulation to demonstrate the usefulness of combining model-based development, code generation, and monitoring.

Abstract:
On-robot Reinforcement Learning is a promising approach to train embodiment-aware policies for legged robots. However, the computational constraints of real-time learning on robots pose a significant challenge. We present a framework for efficiently learning quadruped locomotion in just 8 minutes of raw real-time training utilizing the sample efficiency and minimal computational overhead of the new off-policy algorithm CrossQ. We investigate two control architectures: Predicting joint target positions for agile, high-speed locomotion and Central Pattern Generators for stable, natural gaits. While prior work focused on learning simple forward gaits, our framework extends on-robot learning to omnidirectional locomotion. We demonstrate the robustness of our approach in different indoor and outdoor environments and provide the videos and code for our experiments at: https://nico-bohlinger.github.io/gait_in_eight_website

Abstract:
Concentric tube robots (CTRs) hold great potential for minimally invasive surgery, offering flexibility, small diameters, and the ability to navigate within complex anatomical structures. While machine learning models have been increasingly used to predict the kinematics of CTRs, there is a lack of an established framework for evaluating generative inverse kinematic models, which are able to solve the inverse kinematic problem by providing various joint solutions for a desired end position. In this study, we introduce a workspace-based measure to assess the diversity of solutions produced by three generative models: an invertible neural network (INN), a conditional invertible neural network (cINN), and a conditional variational autoencoder (cVAE). We find that all three models record similar end position errors (3-6 mm) on dexterous subsets of the workspace, but that a cINN outperforms the others in generating diverse solutions using a workspace-based 1-Wasserstein distance by at least 2.38 standard deviations. To further test the applicability of these models, we integrate the best-performing cINN into a CTR controller and demonstrate the first use of a generative CTR model with real-time teleoperation under task-based constraints.
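
The role of a 1-Wasserstein distance as a diversity measure can be illustrated with a minimal 1D sketch. For two empirical samples of equal size, the 1-Wasserstein distance equals the mean absolute difference between the sorted values. The paper's workspace-based measure is not specified in the abstract, so the function name, the 1D simplification, and the sample values below are illustrative assumptions only.

```python
def wasserstein_1d(samples_a, samples_b):
    """1-Wasserstein distance between two equally sized 1D empirical samples."""
    if len(samples_a) != len(samples_b) or not samples_a:
        raise ValueError("samples must be non-empty and equally long")
    a = sorted(samples_a)
    b = sorted(samples_b)
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


# Hypothetical reference spread of valid joint solutions along one coordinate.
reference = [0.0, 1.0, 2.0, 3.0]
collapsed = [1.5, 1.5, 1.5, 1.5]   # a model that always returns one solution
diverse = [0.1, 0.9, 2.1, 2.9]     # a model that covers the reference spread

d_collapsed = wasserstein_1d(reference, collapsed)
d_diverse = wasserstein_1d(reference, diverse)
```

Here the collapsed model scores a distance of 1.0 from the reference while the diverse model scores about 0.1, so a smaller distance indicates that the generated solutions better cover the reference distribution.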

Abstract:
In this letter, we propose a new visual locomotion controller for quadruped robots, aimed at enhancing their capability to traverse challenging terrain. Our approach integrates computer vision techniques with robust locomotion control to improve terrain traversal. To facilitate terrain perception, we use onboard cameras and body sensors to collect real-world visual and proprioceptive data, and use forward kinematics to convert joint angles into precise foot positions. This enables accurate estimation of terrain height, which serves as supervised training data for our visual motion controller. This integrated approach improves the robot's ability to anticipate and adapt to diverse terrain conditions, potentially advancing quadruped locomotion in unstructured environments. Our model is deployed on the A1 robot from Unitree. Experimental results show that our proposed method enables the robot to stably climb stairs and pass through sand, grass, snow, and uneven roads.
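
The step of converting joint angles into foot positions, and from there into a terrain-height label, can be sketched for a simplified planar leg. This is an illustration under stated assumptions: a two-link leg in the sagittal plane, no hip abduction, a level body, and placeholder link lengths of 0.2 m. The function names and parameters are hypothetical and are not taken from the paper.

```python
import math


def foot_position(hip_pitch, knee, thigh=0.2, calf=0.2):
    """Planar forward kinematics of a two-link leg.

    Returns (x, z) of the foot relative to the hip, with z pointing up.
    Angles are in radians; zero angles mean a fully extended vertical leg.
    """
    x = thigh * math.sin(hip_pitch) + calf * math.sin(hip_pitch + knee)
    z = -(thigh * math.cos(hip_pitch) + calf * math.cos(hip_pitch + knee))
    return x, z


def terrain_height(hip_height, hip_pitch, knee):
    """Height of the ground under a foot in contact, given the hip height."""
    _, z = foot_position(hip_pitch, knee)
    return hip_height + z


# A bent leg (hip 60 degrees, knee -120 degrees) with the hip 0.3 m above the
# reference plane places the foot, and hence the terrain, about 0.1 m above it.
step_height = terrain_height(0.3, math.pi / 3, -2 * math.pi / 3)
```

Heights computed this way at each foot contact could serve as sparse labels for the terrain seen by the camera, which is the kind of supervision the abstract describes.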

Abstract:
Insect-scale flapping-wing micro aerial vehicles (FWMAVs) employing piezoelectric direct-drive configurations eliminate traditional kinematic chains through direct coupling of the wing and actuator. While this design approach significantly reduces structural complexity and manufacturing costs compared to transmission-dependent systems, it inherently limits wing stroke amplitude and consequent lift generation. This paper presents a novel lift-enhancement strategy for piezoelectric direct-drive FWMAVs, effectively improving payload capacity through optimized aerodynamic performance. The redesigned X-configuration prototype demonstrates outstanding metrics: 68 mm wingspan with 212 mg total mass achieves 7.47 mN maximum lift (exceeding 3.5:1 lift-to-weight ratio) and 1.25 m/s takeoff speed. Experimental validation confirms 39% payload capacity improvement and 34% lift-to-weight ratio enhancement compared to baseline designs. This enhancement establishes our robot as the current state-of-the-art in piezoelectric direct-drive FWMAVs regarding lift-to-weight ratio.
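
The quoted lift-to-weight ratio can be checked with a short calculation from the reported mass and maximum lift. The sketch assumes standard gravity of 9.81 m/s², which the abstract does not state. The function name is hypothetical.

```python
G = 9.81  # m/s^2, assumed standard gravity


def lift_to_weight(lift_mN, mass_mg, g=G):
    """Ratio of lift (in millinewtons) to weight for a mass given in milligrams."""
    weight_mN = mass_mg * 1e-6 * g * 1e3  # mg -> kg, then N -> mN
    return lift_mN / weight_mN


# Reported values: 7.47 mN maximum lift and 212 mg total mass.
ratio = lift_to_weight(7.47, 212.0)
```

The weight comes out at roughly 2.08 mN, giving a ratio of about 3.59, which is consistent with the stated "exceeding 3.5:1".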

Abstract:
Recent advances in Large Language Models (LLMs) have permitted the development of language-guided multi-robot systems, which allow robots to execute tasks based on natural language instructions. However, achieving effective coordination in distributed multi-agent environments remains challenging due to (1) misalignment between instructions and task requirements and (2) inconsistency in robot behaviors when they independently interpret ambiguous instructions. To address these challenges, we propose Instruction-Conditioned Coordinator (ICCO), a Multi-Agent Reinforcement Learning (MARL) framework designed to enhance coordination in language-guided multi-robot systems. ICCO consists of a Coordinator agent and multiple Local Agents, where the Coordinator generates Task-Aligned and Consistent Instructions (TACI) by integrating language instructions with environmental states, ensuring task alignment and behavioral consistency. The Coordinator and Local Agents are jointly trained to optimize a reward function that balances task efficiency and instruction following. A Consistency Enhancement Term is added to the learning objective to maximize mutual information between instructions and robot behaviors, further improving coordination. Simulation and real-world experiments validate the effectiveness of ICCO in achieving language-guided task-aligned multi-robot control. The demonstration can be found at https://yanoyoshiki.github.io/ICCO/.

Abstract:
Recent advancements in 3D Gaussian Splatting (3DGS) have significantly improved semantic scene understanding, enabling natural language queries to localize objects within a scene. However, existing methods primarily focus on embedding compressed CLIP features into 3D Gaussians, and they suffer from low object segmentation accuracy and lack spatial reasoning capabilities. To address these limitations, we propose GaussianGraph, a novel framework that enhances 3DGS-based scene understanding by integrating adaptive semantic clustering and scene graph generation. We introduce a ‘Control-Follow’ clustering strategy, which dynamically adapts to scene scale and feature distribution, avoiding feature compression and significantly improving segmentation accuracy. Additionally, we enrich scene representation by integrating object attributes and spatial relations extracted from 2D foundation models. To address inaccuracies in spatial relationships, we propose 3D correction modules that filter implausible relations through spatial consistency verification, ensuring reliable scene graph construction. Extensive experiments on three datasets demonstrate that GaussianGraph outperforms state-of-the-art methods in both semantic segmentation and object grounding tasks, providing a robust solution for complex scene understanding and interaction. We provide supplementary video and code at https://wangxihan-bit.github.io/GaussianGraph.

Abstract:
Perception systems play a crucial role in autonomous driving, incorporating multiple sensors and corresponding computer vision algorithms. 3D LiDAR sensors are widely used to capture sparse point clouds of the vehicle’s surroundings. However, such systems struggle to perceive occluded areas and gaps in the scene due to the sparsity of these point clouds and their lack of semantics. To address these challenges, Semantic Scene Completion (SSC) jointly predicts unobserved geometry and semantics in the scene given raw LiDAR measurements, aiming for a more complete scene representation. Building on promising results of diffusion models in image generation and super-resolution tasks, we propose their extension to SSC by implementing the noising and denoising diffusion processes in the point and semantic spaces individually. To control the generation, we employ semantic LiDAR point clouds as conditional input and design local and global regularization losses to stabilize the denoising process. We evaluate our approach on autonomous driving datasets, and it achieves state-of-the-art performance for SSC, surpassing most existing methods.

Abstract:
Robotics Reinforcement Learning (RL) often relies on carefully engineered auxiliary rewards to supplement sparse primary learning objectives to compensate for the lack of large-scale, real-world, trial-and-error data. While these auxiliary rewards accelerate learning, they require significant engineering effort, may introduce human biases, and cannot adapt to the robot’s evolving capabilities during training. In this paper, we introduce Reward Training Wheels (RTW), a teacher-student framework that automates auxiliary reward adaptation for robotics RL. To be specific, the RTW teacher dynamically adjusts auxiliary reward weights based on the student’s evolving capabilities to determine which auxiliary reward aspects require more or less emphasis to improve the primary objective. We demonstrate RTW on two challenging robot tasks: navigation in highly constrained spaces and off-road vehicle mobility on vertically challenging terrain. In simulation, RTW outperforms expert-designed rewards by 2.35% in navigation success rate and improves off-road mobility performance by 122.62%, while achieving 35% and 3X faster training efficiency, respectively. Physical robot experiments further validate RTW’s effectiveness, achieving a perfect success rate (5/5 trials vs. 2/5 for expert-designed rewards) and improving vehicle stability with up to 47.4% reduction in orientation angles.

Abstract:
Navigating an arbitrary-shaped ground robot safely in cluttered environments remains a challenging problem. Existing trajectory planners that account for the robot’s physical geometry suffer from intractable runtimes. To achieve both computational efficiency and Continuous Collision Avoidance (CCA) in planning for arbitrary-shaped ground robots, we propose a novel coarse-to-fine navigation framework that significantly accelerates planning. In the first stage, a sampling-based method selectively generates distinct topological paths that guarantee a minimum inflated margin. In the second stage, a geometry-aware front-end strategy is designed to discretize these topologies into full-state robot motion sequences while concurrently partitioning the paths into SE(2) sub-problems and simpler ℝ² sub-problems for back-end optimization. In the final stage, an SVSDF-based optimizer generates trajectories tailored to these sub-problems and seamlessly splices them into a continuous final motion plan. Extensive benchmark comparisons show that the proposed method is one to several orders of magnitude faster than cutting-edge methods in runtime while maintaining a high planning success rate and ensuring CCA.

Abstract:
Recent research has begun exploring novel view synthesis (NVS) for LiDAR point clouds, aiming to generate realistic LiDAR scans from unseen viewpoints. However, most existing approaches do not reconstruct semantic labels, which are crucial for many downstream applications such as autonomous driving and robotic perception. Unlike images, which benefit from powerful segmentation models, LiDAR point clouds lack such large-scale pre-trained models, making semantic annotation time-consuming and labor-intensive. To address this challenge, we propose SN-LiDAR, a method that jointly performs accurate semantic segmentation, high-quality geometric reconstruction, and realistic LiDAR synthesis. Specifically, we employ a coarse-to-fine planar-grid feature representation to extract global features from multi-frame point clouds and leverage a CNN-based encoder to extract local semantic features from the current frame point cloud. Extensive experiments on SemanticKITTI and KITTI-360 demonstrate the superiority of SN-LiDAR in both semantic and geometric reconstruction, effectively handling dynamic objects and large-scale scenes. Codes will be available on https://github.com/dtc111111/SN-Lidar.

Abstract:
Sampling-based path planning algorithms, such as Rapidly-exploring Random Tree (RRT), are widely used for motion planning in high degree-of-freedom robotic systems due to their efficiency in exploring high-dimensional spaces. However, traditional methods rely on binary collision detection, which only determines whether a sampled configuration is in a collision without quantifying its safety, often resulting in trajectories that are overly close to obstacles and reducing planning success rates, especially in complex environments with narrow passages. To address this issue, we propose Safety-Guided RRT (SG-RRT), which integrates a quantitative safety metric based on signed distance functions (SDFs) with a hyperoctant sampling strategy, enabling the planner to prioritize safer configurations and steer tree expansion toward collision-free regions. This approach significantly improves path planning success rates while generating safer trajectories with greater clearance from obstacles. Extensive simulations and real-world experiments demonstrate that SG-RRT outperforms state-of-the-art methods, including RRT, Informed-RRT, TRRT, and Bi-TRRT, by achieving higher success rates and reducing collision risks, with only a slight increase in trajectory length.
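
The difference between a binary collision check and a quantitative, SDF-based safety score can be illustrated with a minimal 2D sketch. It assumes circular obstacles and a point robot, and it simply prefers the candidate with the largest signed distance. The actual safety metric and the hyperoctant sampling strategy of SG-RRT are not reproduced here, and all names and values are hypothetical.

```python
import math


def signed_distance(point, obstacles):
    """Signed distance from a 2D point to the nearest circular obstacle.

    obstacles: list of (cx, cy, radius). Negative values mean penetration.
    """
    px, py = point
    return min(math.hypot(px - cx, py - cy) - r for cx, cy, r in obstacles)


def in_collision(point, obstacles):
    """Binary check: it only says whether the point is inside an obstacle."""
    return signed_distance(point, obstacles) <= 0.0


def safest_candidate(candidates, obstacles):
    """Pick the candidate configuration with the largest clearance."""
    return max(candidates, key=lambda p: signed_distance(p, obstacles))


obstacles = [(0.0, 0.0, 1.0), (4.0, 0.0, 1.0)]
candidates = [(1.1, 0.0), (2.0, 0.0), (0.5, 0.0)]
best = safest_candidate(candidates, obstacles)
```

The first two candidates both pass the binary check, but the signed distance separates them: `(1.1, 0.0)` has only about 0.1 of clearance, while `(2.0, 0.0)` sits midway between the obstacles with a clearance of 1.0 and is therefore selected.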

Abstract:
Social robot navigation in crowded public spaces such as university campuses, restaurants, grocery stores, and hospitals is an increasingly important area of research. One of the core strategies for achieving this goal is to understand humans’ intent (the underlying psychological factors that govern their motion) by learning how humans assign rewards to their actions, typically via inverse reinforcement learning (IRL). Despite significant progress in IRL, learning reward functions of multiple agents simultaneously in dense unstructured pedestrian crowds has remained intractable due to the nature of the tightly coupled social interactions that occur in these scenarios, e.g., passing, intersections, swerving, and weaving. In this paper, we present a new multi-agent maximum entropy inverse reinforcement learning algorithm for real-world unstructured pedestrian crowds. Key to our approach is a simple but effective mathematical trick, which we name the "tractability-rationality trade-off" trick, that achieves tractability at the cost of a slight reduction in accuracy. We compare our approach to the classical single-agent MaxEnt IRL as well as state-of-the-art trajectory prediction methods on several datasets including ETH, UCY, SCAND, JRDB, and a new dataset, called Speedway, collected at a busy intersection on a university campus and focusing on dense, complex agent interactions. Our key findings show that, on the dense Speedway dataset, our approach ranks 1st among top 7 baselines with > 2× improvement over single-agent IRL, and is competitive with state-of-the-art large transformer-based encoder-decoder models on sparser datasets such as ETH/UCY (ranks 3rd among top 7 baselines).

Abstract:
Human action recognition is a crucial task for intelligent robotics, particularly within the context of human-robot collaboration research. In self-supervised skeleton-based action recognition, the mask-based reconstruction paradigm learns the spatial structure and motion patterns of the skeleton by masking joints and reconstructing the target from unlabeled data. However, existing methods focus on a limited set of joints and low-order motion patterns, limiting the model’s ability to understand complex motion patterns. To address this issue, we introduce MaskSem, a novel semantic-guided masking method for learning 3D hybrid high-order motion representations. This novel framework leverages Grad-CAM based on relative motion to guide the masking of joints, which can be represented as the most semantically rich temporal regions. The semantic-guided masking process can encourage the model to explore more discriminative features. Furthermore, we propose using hybrid high-order motion as the reconstruction target, enabling the model to learn multi-order motion patterns. Specifically, low-order motion velocity and high-order motion acceleration are used together as the reconstruction target. This approach offers a more comprehensive description of the dynamic motion process, enhancing the model’s understanding of motion patterns. Experiments on the NTU60, NTU120, and PKU-MMD datasets show that MaskSem, combined with a vanilla transformer, improves skeleton-based action recognition, making it more suitable for applications in human-robot interaction. The source code of our MaskSem is available at https://github.com/JayEason66/MaskSem.
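
The velocity and acceleration targets mentioned above can be illustrated as first-order and second-order temporal differences of the joint coordinates. This is a minimal sketch under that assumption. How MaskSem normalizes or combines the two targets is not stated in the abstract, and the function name and the toy sequence are hypothetical.

```python
def temporal_difference(frames):
    """Frame-to-frame difference of a skeleton sequence.

    frames: list over time of lists over joints of (x, y, z) tuples.
    Returns a sequence that is one frame shorter.
    """
    return [
        [tuple(b - a for a, b in zip(j0, j1)) for j0, j1 in zip(f0, f1)]
        for f0, f1 in zip(frames, frames[1:])
    ]


# Toy sequence: a single joint moving along x with position t squared.
positions = [
    [(0.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0)],
    [(4.0, 0.0, 0.0)],
    [(9.0, 0.0, 0.0)],
]
velocity = temporal_difference(positions)      # low-order motion
acceleration = temporal_difference(velocity)   # high-order motion
```

For this quadratic trajectory the velocity grows from frame to frame (1, 3, 5 along x) while the acceleration is constant (2 along x), which shows how the second-order target captures information that the first-order target alone does not.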

Abstract:
Protecting and restoring forest ecosystems has become an important conservation issue. Although various robots have been used for field data collection to protect forest ecosystems, the complex terrain and dense canopy make the data collection less efficient. To address this challenge, an aerial platform with bio-inspired behaviour facilitated by a bio-inspired mechanism is proposed. The platform spends minimum energy during data collection by perching on tree branches. A raptor inspired vision algorithm is used to locate a tree trunk, and then a horizontal branch on which the platform can perch is identified. A tendon-driven mechanism inspired by bat claws which requires energy only for actuation, secures the platform onto the branch using the mechanism’s passive compliance. Experimental results show that the mechanism can perform perching on branches ranging from 30 mm to 80 mm in diameter. The real-world tests validated the system’s ability to select and adapt to target points, and it is expected to be useful in complex forest ecosystems. Project website: https://aerialroboticsgroup.github.io/branch-perching-project/

Abstract:
Deep-learning-based object detection has gained widespread application in surface defect inspection, with anchor-based detectors achieving remarkable success by utilizing dense anchors to align with defects. Determining the optimal anchor configuration, i.e., sizes and aspect ratios of anchor boxes, remains a critical challenge, particularly when addressing defects with significant shape variations. While previous studies have predominantly focused on developing more efficient network architectures and learning strategies, the problem of anchor configuration determination has not been thoroughly explored. To address this gap, this paper proposes the IoU-Aware Clustering (IAC) algorithm, which autonomously learns suitable anchor configurations by extracting shape priors from diverse defects. IAC takes the training bounding boxes as potential clustering centers and selects a subset that aligns with the shape distribution of the training samples. The algorithm involves only a single hyper-parameter, the anchor number k, making it highly adaptable to various scenarios. Experimental results demonstrate that IAC can effectively generate anchor configurations tailored to defect shapes, significantly improving the mean Average Precision (mAP) by 6.9% and 14.4% on two industrial defect datasets with substantial shape variations.
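
The idea of choosing anchor shapes from among the training boxes according to IoU can be sketched as follows. This is an illustrative greedy selection, not the published IAC algorithm: it compares boxes by shape only (width and height, aligned at a common corner) and repeatedly adds the training box that most increases the mean best IoU over the training set. All function names and the toy defect sizes are assumptions.

```python
def shape_iou(a, b):
    """IoU of two (width, height) boxes aligned at the same corner."""
    inter = min(a[0], b[0]) * min(a[1], b[1])
    return inter / (a[0] * a[1] + b[0] * b[1] - inter)


def mean_best_iou(boxes, anchors):
    """Average, over all boxes, of the IoU with the best-matching anchor."""
    return sum(max(shape_iou(b, a) for a in anchors) for b in boxes) / len(boxes)


def select_anchors(boxes, k):
    """Greedily pick k training boxes to serve as anchor shapes."""
    anchors = []
    for _ in range(k):
        remaining = [b for b in boxes if b not in anchors]
        best = max(remaining, key=lambda c: mean_best_iou(boxes, anchors + [c]))
        anchors.append(best)
    return anchors


# Toy defect shapes: compact blobs and long, thin scratches.
boxes = [(10, 10), (12, 11), (11, 9), (100, 8), (90, 10), (110, 9)]
anchors = select_anchors(boxes, k=2)
coverage = mean_best_iou(boxes, anchors)
```

With `k=2`, the selection ends up with one compact anchor and one elongated anchor, so both shape groups are matched well. As in the abstract, the only hyper-parameter in this sketch is the anchor number `k`.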

Abstract:
In-hand manipulation tasks using hand-like robotic grippers offer a promising approach to accomplish various tasks in a human-centered environment. Due to the inherent safety of soft robots, in-hand manipulations performed by soft robots provide great opportunities for future human-robot collaboration, which is the scope of this paper. By modeling a new soft-robotic gripper known as the Anthropomorphic Soft Gripper and synthesizing an open-loop controller with deep reinforcement learning, we demonstrate how objects can be moved by in-hand manipulation. Moreover, this work explores the application of deep reinforcement learning methods without employing domain randomization. As noted by Bhatt et al., the inherent soft properties of soft robotic grippers enable remarkably robust in-hand manipulation in open-loop control, which motivates the approach followed in this work. Motion sequences generated in simulation are successfully transferred to the real anthropomorphic soft gripper and validated in experiments.

Abstract:
Robotization is considered a key solution to labor shortages in the agri-food industry. However, deploying robots in natural environments is challenging due to unpredictable factors such as plant variances and occlusions. This paper focuses on the localization of tomato trusses for autonomous harvesting by servoing a robot-mounted camera to different viewpoints. We build on previous work where the robot is provided with prior knowledge of the tomato plant. Specifically, the geometric relations between the trusses are modeled as ranges, which reflect uncertainty. Our main contribution is an approach that represents this uncertainty as polytope volumes. Polytopes enable scalable reasoning that facilitates likelihood estimation for viewpoint selection. Our method first constructs polytope hypotheses regarding the truss locations based on prior plant knowledge. It then refines the polytope shapes using Bayesian updates based on camera observations. Finally, the polytopes are used to select the next viewpoint that maximizes the chance of observing a new tomato truss. Experiments show that polytope-based viewpoint selection speeds up truss localization compared to earlier methods, advancing robotic harvesting.

Abstract:
In the deep learning era, 2D LiDAR perception is often overlooked as research prioritizes 3D point clouds. Yet, 2D LiDAR remains essential for low-cost robotic systems due to its affordability. Despite its simplicity, it faces major challenges, not only in perception but also in learning itself, as data sparsity and limited features hinder effective framework development. Additionally, fundamental structural differences prevent direct adaptation of 3D perception networks. To address these concerns, we propose LSW-Net (Laser Scan Weight-Net), a self-supervised framework for 2D LiDAR perception, enabling end-to-end learning from raw point clouds to adaptive weight extraction. This framework provides a scalable and lightweight solution for 2D LiDAR environmental perception, and its self-supervised nature reduces annotation costs. It includes: (i) a general 2D Laser Encoder (LS-Encoder) that integrates local convolutional perception with global attention perception to extract multi-scale spatio-temporal features; and (ii) an interpretable weight extraction module (Weight Extractor) that dynamically quantifies the contribution of each point in environmental perception tasks through contrastive learning and physical consistency constraints. Evaluated on diverse scenes for point cloud registration and SLAM tasks, LSW-Net outperforms traditional methods in feature discriminability and adaptability. Additionally, we performed ablation experiments to substantiate the rationality of our approach. Our code is available at https://github.com/cuiyujie0113/LSW-NET.

Abstract:
Large language models (LLMs) showcase many desirable traits for intelligent and helpful robots. However, they are also known to hallucinate predictions. This issue is exacerbated in robotics, where LLM hallucinations may result in robots confidently executing plans that are contrary to user goals or relying more frequently on human assistance. In this work, we present LBAP, a novel approach for utilizing off-the-shelf LLMs alongside Bayesian inference for uncertainty Alignment in robotic Planners that minimizes hallucinations and human intervention. Our key finding is that we can use Bayesian inference to more accurately calibrate a robot’s confidence measure by accounting for both scene grounding and world knowledge. This process allows us to mitigate hallucinations and better align the LLM’s confidence measure with the probability of success. Through experiments in both simulation and the real world on tasks with a variety of ambiguities, we show that LBAP significantly increases success rate and decreases the amount of human intervention required relative to prior art. For example, in our real-world testing paradigm, LBAP decreases the human help rate of previous methods by over 33% at a success rate of 70%.
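
The calibration idea in this abstract can be illustrated with a minimal Bayes-rule sketch. This is not the LBAP implementation: the function names, the assumption that scene grounding and world knowledge contribute independent likelihood terms, and the help threshold of 0.8 are all hypothetical choices made for illustration.

```python
def posterior_confidence(llm_prior, grounding_likelihood, knowledge_likelihood):
    """Combine an LLM's prior over candidate actions with two likelihood terms.

    Assumption (not from the abstract): the scene-grounding and
    world-knowledge terms are conditionally independent given the action.
    """
    unnormalized = {
        option: llm_prior[option]
        * grounding_likelihood[option]
        * knowledge_likelihood[option]
        for option in llm_prior
    }
    evidence = sum(unnormalized.values())
    if evidence == 0:
        raise ValueError("all candidate actions have zero posterior mass")
    return {option: value / evidence for option, value in unnormalized.items()}


def choose_or_ask(posterior, threshold=0.8):
    """Return the most probable action, or None to signal a request for human help."""
    best = max(posterior, key=posterior.get)
    return best if posterior[best] >= threshold else None


# Hypothetical example: the LLM alone cannot tell the two cups apart,
# but the scene-grounding term strongly favors one of them.
prior = {"red cup": 0.5, "blue cup": 0.5}
grounding = {"red cup": 0.9, "blue cup": 0.1}
knowledge = {"red cup": 1.0, "blue cup": 1.0}
posterior = posterior_confidence(prior, grounding, knowledge)
action = choose_or_ask(posterior)
```

In this sketch the robot acts only when the posterior of the best option clears the threshold, and otherwise asks for help, which is one simple way to trade success rate against human intervention.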

Abstract:
Quadrupedal locomotion has the advantages of a low center of gravity, a broad support base, and four-legged coordination, enabling outstanding stability in complex terrains. Drawing inspiration from this, researchers have developed robots that emulate such locomotion through the control of multiple actuators and structural reconfiguration. However, complex control sequences and slow tethered actuation limit their locomotion in confined endoluminal spaces. Here, we present a quadrupedal magnetic millirobot (beamrobot) fabricated via multi-material direct ink writing (DIW) for medical applications in complex endoluminal spaces. The millirobot combines a soft body with magnetic feet, enabling controlled shape-morphing and locomotion under external magnetic fields. The printing parameters are optimized. Numerical simulations and experiments validate the static deformation and dynamic locomotion modes. Experimental results demonstrate the versatility of the beamrobots, including following "U"-shaped trajectories, moving through a branch vessel model, and clearing an obstruction in the vessel model. The proposed quadrupedal magnetic millirobot and hard-magnetic actuation approach open up new possibilities for medical applications.

Abstract:
Manipulating articulated tools, such as tweezers or scissors, has rarely been explored in previous research. Unlike rigid tools, articulated tools change their shape dynamically, creating unique challenges for dexterous robotic hands. In this work, we present a hierarchical, goal-conditioned reinforcement learning (GCRL) framework to improve the manipulation capabilities of anthropomorphic robotic hands using articulated tools. Our framework comprises two policy layers: (1) a low-level policy that enables the dexterous hand to manipulate the tool into various configurations for objects of different sizes, and (2) a high-level policy that defines the tool’s goal state and controls the robotic arm for object-picking tasks. We employ an encoder, trained on synthetic point clouds, to estimate the tool’s affordance states from input point clouds, thereby enabling precise tool manipulation; these states capture how different tool configurations (e.g., tweezer opening angles) enable grasping of objects of varying sizes. We also utilize a privilege-informed heuristic policy to populate the replay buffer, improving the training efficiency of the high-level policy. We validate our approach through real-world experiments, showing that the robot can effectively manipulate a tweezer-like tool to grasp objects of diverse shapes and sizes with a 70.8% success rate. This study highlights the potential of RL to advance dexterous robotic manipulation of articulated tools.

Abstract:
The generation of temporally consistent, high-fidelity driving videos over extended horizons presents a fundamental challenge in autonomous driving world modeling. Existing approaches often suffer from error accumulation and feature misalignment due to inadequate decoupling of spatio-temporal dynamics and limited cross-frame feature propagation mechanisms. To address these limitations, we present STAGE (Streaming Temporal Attention Generative Engine), a novel auto-regressive framework that pioneers hierarchical feature coordination and multi-phase optimization for sustainable video synthesis. To achieve high-quality long-horizon driving video generation, we introduce Hierarchical Temporal Feature Transfer (HTFT) and a novel multi-stage training strategy. HTFT enhances temporal consistency between video frames throughout the video generation process by modeling the temporal and denoising processes separately and transferring denoising features between frames. The multi-stage training strategy divides training into three stages, using model decoupling and simulation of the auto-regressive inference process to accelerate model convergence and reduce error accumulation. Experiments on the nuScenes dataset show that STAGE significantly surpasses existing methods in the long-horizon driving video generation task. In addition, we explore STAGE’s ability to generate unlimited-length driving videos. We generate 600 frames of high-quality driving video on the nuScenes dataset, which far exceeds the maximum length achievable by existing methods. Our project homepage: https://4dvlab.github.io/STAGE/

Abstract:
Trajectory optimization in multi-vehicle scenarios faces challenges due to its non-linear, non-convex properties and sensitivity to initial values, making interactions between vehicles difficult to control. In this paper, inspired by topological planning, we propose a differentiable local homotopy invariant metric to model the interactions. By incorporating this topological metric as a constraint into multi-vehicle trajectory optimization, our framework is capable of generating multiple interactive trajectories from the same initial values, achieving controllable interactions as well as supporting user-designed interaction patterns. Extensive experiments demonstrate its superior optimality and efficiency over existing methods. We will release open-source code to advance related research.

Abstract:
State-of-the-art model-based control designs have been shown to be successful in realizing dynamic locomotion behaviors for robotic systems. The precision of the realized behaviors in terms of locomotion performance via flying, hopping, or walking has not yet been well investigated, despite the fact that the difference between the robot model and physical hardware is bound to produce inaccurate trajectory tracking. To address this inaccuracy, we propose a reference-steering method to bridge the model-to-real gap by establishing a data-driven input-output (DD-IO) model on top of the existing model-based design. The DD-IO model takes the reference tracking trajectories as the input and the realized tracking trajectory as the output. By utilizing data-driven predictive control, we steer the reference input trajectories online so that the realized output trajectories match the actual desired ones. We demonstrate our method on the robot PogoX to realize hyper-accurate hopping and flying behaviors in both simulation and hardware. This data-driven reference-steering approach is straightforward to apply to general robotic systems for performance improvement via hyper-accurate trajectory tracking.

Abstract:
Robotic graspers are essential for enhancing the efficiency and versatility of robots in grasping tasks. In this paper, we propose a novel inflatable deployable origami grasper with a rigid-flexible coupling structure. The proposed grasper can achieve multiple deployment configurations under a single pneumatic actuation, enabling both deployment and grasping operations while also allowing for passive self-folding during deflation. The design and fabrication of the grasper are presented. Then, the stiffness model for the inflatable deployable origami unit is developed based on the equivalent truss method. Experimental results show that the grasper successfully grasps objects of various shapes and sizes in both enveloping and fingertip grasping modes, using either two or four fingers. With its simple mechanical system and high deploy/fold ratio, the proposed grasper holds significant potential for applications in industrial automation and space exploration.

Abstract:
Gas emissions play a crucial role in many environmental and industrial processes, driving a growing effort to understand their dispersion in air. Nonetheless, gas distribution mapping is inherently challenging due to the complex interplay between gas diffusion and wind flows. Mobile robots provide a compelling alternative to static sensor networks for gas sensing, having greater mobility and minimizing the need to permanently deploy assets in the environment. However, robotic platforms typically collect only sparse measurements due to constraints such as limited battery life, and state-of-the-art methods often fail to accurately interpolate between scattered data. To address this limitation, we introduce ADApprox, a novel gas mapping algorithm. By leveraging the underlying physics that governs gas dispersion, ADApprox offers superior interpolation capabilities. Our method locally approximates the advection-diffusion equation for an entire grid of points and learns the model parameters from gas measurements. The learned parameters are subsequently used to predict gas concentrations across the entire environment. Extensive simulations and physical experiments are conducted using a nano aerial vehicle. The mapping results demonstrate that ADApprox consistently outperforms the state-of-the-art algorithm Kernel DM+V/W while having a comparable computational cost. In addition, we evaluate the effectiveness in localizing a gas source based on the predicted gas maps. Our findings indicate that ADApprox effectively localizes the gas source, achieving a median error of 18 cm on an area of 12 m² in physical experiments.
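
For readers unfamiliar with the advection-diffusion equation mentioned in this abstract, the sketch below advances dc/dt = D * d2c/dx2 - v * dc/dx by one explicit time step on a periodic 1D grid. It only illustrates the forward physics; it is not the ADApprox approximation or its parameter learning, and the grid layout, the upwind scheme, and all parameter values are assumptions chosen for illustration.

```python
def advect_diffuse_step(c, velocity, diffusivity, dx, dt):
    """One explicit step of dc/dt = D * d2c/dx2 - v * dc/dx on a periodic 1D grid."""
    n = len(c)
    new_c = []
    for i in range(n):
        left, right = c[(i - 1) % n], c[(i + 1) % n]
        # Central second difference for the diffusion term.
        diffusion = diffusivity * (left - 2.0 * c[i] + right) / dx**2
        # First-order upwind difference: use the side the wind blows from.
        if velocity >= 0:
            gradient = (c[i] - left) / dx
        else:
            gradient = (right - c[i]) / dx
        new_c.append(c[i] + dt * (diffusion - velocity * gradient))
    return new_c


# Hypothetical example: a single puff of gas carried to the right by the wind.
concentration = [0.0, 0.0, 1.0, 0.0, 0.0]
updated = advect_diffuse_step(concentration, velocity=0.5, diffusivity=0.1, dx=1.0, dt=0.5)
```

With periodic boundaries this scheme conserves the total amount of gas, and the chosen step sizes keep the explicit update stable (v*dt/dx = 0.25, D*dt/dx^2 = 0.05).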

Abstract:
Mobile edge computing (MEC) is an emerging computing paradigm that enhances the computational capabilities of mobile devices by offloading intensive tasks to edge servers. In robotic systems, MEC can significantly reduce response times and improve user experience. However, as robots move through their environments, factors such as the distance to edge servers and physical obstructions fluctuate, leading to variations in communication bandwidth and, consequently, communication delays. To address these challenges, this paper proposes an adaptive computing node offloading framework (ARDF) designed to optimize the dynamic deployment of Robot Operating System 2 (ROS2) nodes in robot-edge environments. The framework enables developers to flexibly deploy robotic computing tasks based on varying computational and network conditions. We validate its effectiveness through experiments involving robotic arm control, 3D detection applications, and numerically simulated ROS2 tasks under different bandwidth conditions. The results demonstrate that the framework significantly improves response times for ROS2 applications, even under fluctuating computational loads and network constraints. The code for the framework ARDF can be found on GitHub.

Abstract:
Maneuverability in soft bio-inspired underwater robots, particularly for following complex trajectories, remains an unsolved challenge. In this work, we present a control approach based on a PD controller integrated with real-time camera feedback, enabling continuous and reliable free-swimming control. The system was able to maintain precise waypoint tracking for 60 minutes or more, with a minimum turning radius of 27 cm, demonstrating the robot’s high maneuverability relative to its body size. Our contributions include the development of a robust camera-based tracking system, the tuning of a PD controller to enhance trajectory following, and the exploration of the limits of maneuverability in soft swimming robots. This work paves the way for future integration with onboard sensing systems to improve state estimation in soft swimmers and reduce reliance on external camera systems.
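
A PD controller driven by camera-tracked pose, as described in this abstract, can be sketched as follows. The abstract does not specify the controlled variable, so this example assumes heading control toward a waypoint; the class name, gains, and pose format are hypothetical.

```python
import math


def wrap_angle(angle):
    """Wrap an angle in radians to the interval [-pi, pi)."""
    return (angle + math.pi) % (2.0 * math.pi) - math.pi


class PDHeadingController:
    """Proportional-derivative controller on the heading error to a waypoint."""

    def __init__(self, kp, kd):
        self.kp = kp
        self.kd = kd
        self.prev_error = None

    def update(self, pose, waypoint, dt):
        """pose = (x, y, heading) from the tracker; returns a steering command."""
        x, y, heading = pose
        desired = math.atan2(waypoint[1] - y, waypoint[0] - x)
        error = wrap_angle(desired - heading)
        if self.prev_error is None:
            d_error = 0.0  # no derivative information on the first sample
        else:
            d_error = wrap_angle(error - self.prev_error) / dt
        self.prev_error = error
        return self.kp * error + self.kd * d_error


# Hypothetical example: two consecutive camera frames, 0.1 s apart.
controller = PDHeadingController(kp=2.0, kd=0.5)
first_command = controller.update((0.0, 0.0, 0.0), (0.0, 1.0), dt=0.1)
second_command = controller.update((0.0, 0.0, math.pi / 4), (0.0, 1.0), dt=0.1)
```

The derivative term damps the turn as the heading error shrinks, which is why the second command is smaller than the first even though the robot has not yet reached the desired heading.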

Abstract:
Manipulation and insertion of small and tight-toleranced objects in robotic assembly remain a critical challenge for vision-based robotics systems due to the required precision and cluttered environment. Conventional global or wrist-mounted cameras often suffer from occlusions when either assembling or disassembling from an existing structure. To address the challenge, this paper introduces "Eye-In-Finger", a novel tool design approach that enhances robotic manipulation by embedding low-cost, high-resolution perception directly at the tool tip. We validate our approach using LEGO assembly and disassembly tasks, which require the robot to manipulate in a cluttered environment and achieve sub-millimeter accuracy and robust error correction due to the tight tolerances. Experimental results demonstrate that our proposed system enables real-time, fine corrections to alignment error, increasing the tolerance of calibration error from 0.4 mm to up to 2.0 mm for the LEGO manipulation robot.

Affiliations: School of Advanced Interdisciplinary Sciences, University of Chinese Academy of Sciences, Beijing, China; Centre for Artificial Intelligence and Robotics, Hong Kong Institute of Science & Innovation, Chinese Academy of Sciences, Hong Kong, China; State Key Laboratory of Multimodal Artificial Intelligence Systems, Institute of Automation, Chinese Academy of Sciences, Beijing, China; School of Biomedical Engineering and Imaging Sciences, King’s College London, London, U.K.

Abstract:
Transcatheter tricuspid valve replacement (TTVR) has emerged as a promising minimally invasive procedure for treating severe tricuspid regurgitation (TR). However, accurate catheter delivery remains a significant challenge, primarily due to the reliance on 2D vision feedback, complex catheter kinematics, and camera-to-robot pose calibration, which are difficult to generalize across patients. To address these issues, this paper presents a model-free robotic catheter delivery strategy for TTVR using Data-Enabled Predictive Control (DeePC). This approach leverages data-driven control to optimize catheter positioning without the need for prior knowledge of the system’s dynamics, eliminating the need for complex kinematic models or camera calibration. The proposed method incorporates environmental constraints to ensure the safety of the procedure, delivering the catheter to the desired location with high accuracy across varying catheters and camera poses. Experimental results demonstrate the effectiveness and versatility of the approach, suggesting its potential for broader applications in robotic-assisted surgeries. This work presents a new perspective for vision-based robotic TTVR, as well as other clinical interventions involving robotic catheter control.

Abstract:
We propose a hybrid approach for decentralized multi-robot navigation that ensures both safety and deadlock prevention. Building on a standard control formulation, we add a lightweight deadlock prevention mechanism by forming temporary "roundabouts" (circular reference paths). Each robot relies only on local, peer-to-peer communication and a controller for base collision avoidance; a roundabout is generated or joined on demand to avert deadlocks. Robots in the roundabout travel in one direction until an escape condition is met, allowing them to return to goal-oriented motion. Unlike classical decentralized methods that lack explicit deadlock resolution, our roundabout maneuver ensures system-wide forward progress while preserving safety constraints. Extensive simulations and physical robot experiments show that our method consistently outperforms or matches the success and arrival rates of other decentralized control approaches, particularly in cluttered or high-density scenarios, all with minimal centralized coordination.

Abstract:
Typical robot controllers assume firm ground, limiting their effectiveness in controlling robots on dynamic platforms such as trucks or ships. To address this limitation, we propose a reinforcement learning framework for robot locomotion on dynamic rigid platforms and a simulation in which 6-DoF dynamic platforms emulate ship oscillation. The framework enables a reinforcement learning model to estimate platform motion during robot locomotion control. In the simulation, our framework significantly reduces the quadruped robot’s fall rate and trajectory deviation compared to baseline controllers. Experiments on a real robot show that our framework enabled a quadruped robot to adapt to platform motions, including those that threw the robot into the air, while baseline models struggled in this case. Thus, our framework can advance the deployment of robots in real-world marine and vehicular applications.

Abstract:
The imitation learning paradigm is a systematic approach for encoding intelligent behaviors into robotic systems. While a model representation of the ideal task behavior can be learned by processing a set of human demonstrations, learning a model representation that can generalize the desired behavior to perform well in dynamic environments with human collaborators is an open challenge. To address this problem, we encode intelligent robot behavior as a combination of a popular learned baseline control policy (Gaussian Mixture Model, GMM) with reactive control policies that activate based on triggers from online sensory information during task execution. Two contributions encapsulate the approach: an iterative algorithm to combine the learned and reactive policies, and examples for mapping sensory information into desired robot reactive behaviors. The proposed approach was implemented on a bi-manual surgical robot and evaluated on how well the combined control policy balanced the behavioral constraints imposed during collision avoidance and compliance tasks. Successful dynamic collision avoidance results and compliance responses that reduce environmental forces on the manipulator support the use of this paradigm for designing intelligent robot behaviors that complement learned models and balance task performance in scenarios with human collaborators.

Abstract:
Learning dexterous manipulation from few-shot demonstrations is a significant yet challenging problem for advanced, human-like robotic systems. Dense distilled feature fields have addressed this challenge by distilling rich semantic features from 2D visual foundation models into the 3D domain. However, their reliance on neural rendering models such as Neural Radiance Fields (NeRF) or Gaussian Splatting results in high computational costs. In contrast, previous approaches based on sparse feature fields either suffer from inefficiencies due to multi-view dependencies and extensive training or lack sufficient grasp dexterity. To overcome these limitations, we propose Language-ENhanced Sparse Distilled Feature Field (LensDFF), which efficiently distills view-consistent 2D features onto 3D points using our novel language-enhanced feature fusion strategy, thereby enabling single-view few-shot generalization. Based on LensDFF, we further introduce a few-shot dexterous manipulation framework that integrates grasp primitives into the demonstrations to generate stable and highly dexterous grasps. Moreover, we present a real2sim grasp evaluation pipeline for efficient grasp assessment and hyperparameter tuning. Through extensive simulation experiments based on the real2sim pipeline and real-world experiments, our approach achieves competitive grasping performance, outperforming state-of-the-art approaches. See our website for the code and videos: david-s-martinez.github.io/LensDFF.

Abstract:
Robotic grasping, serving as the cornerstone of robot manipulation, is fundamental for embodied intelligence. Manipulation in challenging scenarios demands grasp detection algorithms with higher efficiency and generalizability. However, for general 6-DoF grasp detection, most data-driven methods directly extract scene-level features to generate grasp predictions, relying on a relatively heavy scene-level feature encoder and a significant amount of data with dense grasp labels for model training. In this letter, we propose a novel data-efficient 6-DoF grasp detection framework in cluttered scenes, named Region-Centric Grasp Detection (RCGD), consisting of an Iterative Search Module (ISM) and a Region Grasp Model (RGM). Concretely, ISM aims to retrieve potential region centers and aggregate multiple regions in a coarse-to-fine way. Then, RGM extracts aligned grasp-related embeddings and predicts grasps within these local regions. Benefiting from the region-centric paradigm and the training-free location strategy, RCGD significantly outperforms previous methods and shows minimal performance loss with even a very small portion of training data or labels. Furthermore, real-world robotic experiments in two distinct settings highlight the effectiveness of our method with a 95% success rate.

Abstract:
This work focuses on enhancing the generalization performance of deep reinforcement learning-based robot navigation in unseen environments. We present a novel data augmentation approach called scenario augmentation, which enables robots to navigate effectively across diverse settings without altering the training scenario. The method operates by mapping the robot’s observation into an imagined space, generating an imagined action based on this transformed observation, and then remapping this action back to the real action executed in simulation. Through scenario augmentation, we conduct extensive comparative experiments to investigate the underlying causes of suboptimal navigation behaviors in unseen environments. Our analysis indicates that limited training scenarios represent the primary factor behind these undesired behaviors. Experimental results confirm that scenario augmentation substantially enhances the generalization capabilities of deep reinforcement learning-based navigation systems. The improved navigation framework demonstrates exceptional performance by producing near-optimal trajectories with significantly reduced navigation time in real-world applications.

Abstract:
This paper presents a real-time trajectory planning scheme for a heterogeneous multi-robot system (consisting of a quadrotor and a ground mobile robot) for a cooperative landing task, where the landing position, landing time, and coordination between the robots are determined autonomously under the consideration of feasibility and user specifications. The proposed framework leverages the potential of the complementarity constraint as a decision-maker and an indicator for diverse cooperative tasks and extends it to the collaborative landing scenario. In a potential application of the proposed methodology, a ground mobile robot may serve as a mobile charging station and coordinate in real time with a quadrotor to be charged, facilitating a safe and efficient rendezvous and landing. We verified the generated trajectories in simulation and real-world applications, demonstrating the real-time capabilities of the proposed landing planning framework.

Abstract:
Multi-robot collaboration for target tracking in adversarial environments poses significant challenges, including system failures, dynamic priority shifts, and other unpredictable factors. These challenges become even more pronounced when the environment is unknown. In this paper, we propose a resilient coordination framework for multi-robot, multi-target tracking in environments with unknown sensing and communication danger zones. We consider scenarios where failures caused by these danger zones are probabilistic and temporary, allowing robots to escape from danger zones to minimize the risk of future failures. We formulate this problem as a nonlinear optimization with soft chance constraints, enabling real-time adjustments to robot behaviors based on varying types of dangers and failures. This approach dynamically balances target tracking performance and resilience, adapting to evolving sensing and communication conditions in real-time. To validate the effectiveness of the proposed method, we assess its performance across various tracking scenarios, benchmark it against methods without resilient adaptation and collaboration, and conduct several real-world experiments.

Abstract:
Zero-shot generalization across various robots, tasks and environments remains a significant challenge in robotic manipulation. Policy code generation methods use executable code to connect high-level task descriptions and low-level action sequences, leveraging the generalization capabilities of large language models and atomic skill libraries. In this work, we propose Robotic Programmer (RoboPro), a robotic foundation model that perceives visual information and follows free-form instructions to perform robotic manipulation with policy code in a zero-shot manner. To address low efficiency and high cost in collecting runtime code data for robotic tasks, we devise Video2Code to synthesize executable code from extensive in-the-wild videos with an off-the-shelf vision-language model and a code-domain large language model. Extensive experiments show that RoboPro achieves state-of-the-art zero-shot performance on robotic manipulation in both simulators and real-world environments. Specifically, the zero-shot success rate of RoboPro on RLBench surpasses Code-as-Policies equipped with the state-of-the-art model GPT-4o by 11.6%. Furthermore, RoboPro is robust to variations in API formats and skill sets. Our website can be found at https://video2code.github.io/RoboPro-website/.

Abstract:
Ice, a naturally abundant resource in polar regions and extraterrestrial environments, has gained significant attention for its potential applications in robotics. However, there is a noticeable gap in existing research concerning the manufacturing processes of ice-based components, particularly those utilizing formative technologies for field deployment. Furthermore, outdoor field validations of robots incorporating ice components remain scarce. To bridge these gaps, this paper introduces "Project Yukionna," which includes developing and validating the Ice Formative Method (IFM). Additionally, the feasibility of three distinct manufacturing approaches, namely formative manufacturing (FM), subtractive manufacturing (SM), and additive manufacturing (AM), is evaluated using a two-tier Analytic Hierarchy Process (AHP). The paper further pioneers the execution of most production processes and experimental validations in fully outdoor field conditions. Numerous tests were conducted to assess the effectiveness and limitations of the IFM and a 4WD (Four-Wheel Drive) rover equipped with ice-based components. The findings demonstrate the feasibility of ice-based robotic components while highlighting the manufacturing challenges and the inherent constraints of ice as a material.
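
The two-tier AHP evaluation mentioned in this abstract can be illustrated with a small sketch. The abstract gives neither the criteria nor the pairwise judgments, so the matrices below are hypothetical, and the row geometric-mean method is used here as a common approximation of AHP priority weights rather than as the authors' procedure. The sketch assumes tier 1 ranks the criteria and tier 2 ranks the alternatives (FM, SM, AM) under each criterion.

```python
import math


def ahp_weights(pairwise):
    """Approximate AHP priority weights with the row geometric-mean method."""
    n = len(pairwise)
    geo_means = [math.prod(row) ** (1.0 / n) for row in pairwise]
    total = sum(geo_means)
    return [g / total for g in geo_means]


def two_tier_scores(criteria_matrix, alternative_matrices):
    """Weight per-criterion alternative priorities (tier 2) by criterion priorities (tier 1)."""
    criteria_w = ahp_weights(criteria_matrix)
    per_criterion = [ahp_weights(m) for m in alternative_matrices]
    n_alternatives = len(alternative_matrices[0])
    return [
        sum(criteria_w[c] * per_criterion[c][a] for c in range(len(criteria_w)))
        for a in range(n_alternatives)
    ]


# Hypothetical judgments: two criteria, three alternatives ordered (FM, SM, AM).
criteria = [[1.0, 3.0], [1.0 / 3.0, 1.0]]  # criterion 1 is 3x as important as criterion 2
under_criterion_1 = [[1.0, 2.0, 4.0], [0.5, 1.0, 2.0], [0.25, 0.5, 1.0]]
under_criterion_2 = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
scores = two_tier_scores(criteria, [under_criterion_1, under_criterion_2])
```

The resulting scores sum to one, so they can be read directly as relative preferences among the three manufacturing approaches under the assumed judgments.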

Abstract:
Spatial relationships between objects are key to achieving well-arranged scenes. In this paper, we address the robotic rearrangement task by leveraging these relationships to reach configurations that are both well-arranged and satisfy the given language goal. We propose a hierarchical planning framework that bridges the gap between abstract language inputs and concrete robotic actions. A scene graph is central to this approach, serving as both an intermediate representation and the state for high-level planning, capturing the relationships among objects effectively and reducing planning complexity. This also enables the proposed method to handle more general language goals. To achieve this, we leverage a large language model (LLM) to convert language goals into a scene graph, which becomes the goal for high-level planning. In high-level planning, we plan transitions from the current scene graph to the goal scene graph. To integrate high-level and low-level planning, we introduce a network that generates a physical configuration of objects from a scene graph. Low-level planning then verifies the high-level plan’s feasibility, ensuring it can be executed through robotic manipulation. Through experiments, we show that the proposed method handles general language goals effectively and produces human-preferred rearrangements compared to other approaches, demonstrating its applicability on real robots without requiring sim-to-real adjustments.

Abstract:
When users give natural language instructions to service robots, positional information is often referenced relative to objects in the environment rather than in absolute coordinates, because humans naturally use relative references. For example, in “Go to the chair and pick up empty bottles”, the positional reference is the chair. Ambiguity arises when multiple similar objects co-exist in the environment or when the robot’s view is limited, resulting in multiple possible interpretations of the same command and affecting navigation decisions. To address this issue, we propose a two-level framework that integrates a large language model (LLM) and a vision-language model (VLM), allowing the robot to engage in multi-turn dialogues for spatial disambiguation. Our method first utilizes a VLM to map the semantic meanings of dialogues to a unique object ID in images and then further maps this object ID to a 3D depth map, enabling the robot to accurately determine its navigation target. To the best of our knowledge, this is the first work leveraging foundation models to address spatial ambiguity.

Affiliations: School of Biomedical Engineering, Division of Life Sciences and Medicine, University of Science and Technology of China, Hefei, P.R. China; Faculty of Hepato-Biliary-Pancreatic Surgery, The First Medical Center, Chinese PLA General Hospital, Beijing, China; Center for Medical Imaging, Robotics, Analytic Computing & Learning(MIRACLE), Suzhou Institute for Advanced Research, University of Science and Technology of China, Suzhou, P.R. China

Abstract:
In robotic surgery, instrument presence labels are typically recorded alongside video streams, offering a cost-effective alternative to manual annotations for segmentation tasks. Label-supervised surgical instrument segmentation (SIS), a weakly supervised segmentation setting where only instrument presence labels are available, remains underexplored due to its inherently ill-posed nature. Temporal information plays a vital role in capturing sequential dependencies, thereby enhancing representation learning even under incomplete supervision. This paper extends a two-stage label-supervised segmentation framework by leveraging the temporal characteristics of surgical videos from three perspectives. First, a temporal equivariance constraint is introduced to enforce pixel-level consistency across adjacent frames. Second, a class-aware semantic continuity constraint is applied to preserve coherence between global and local regions over time. Third, temporally-enhanced pseudo masks are generated from consecutive frames to suppress irrelevant regions and improve segmentation accuracy. We evaluate our method on two surgical video datasets: the Cholec80 cholecystectomy benchmark and a real-world robotic left lateral segmentectomy (RLLS) dataset. Instance-level instrument annotations, sampled at regular intervals and validated by an experienced clinician, provide a reliable basis for evaluation. Experimental results demonstrate that our method consistently achieves favorable performances over state-of-the-art methods. These findings highlight the effectiveness of incorporating temporal constraints into label-supervised frameworks, offering a promising strategy to reduce annotation costs and advance surgical video analysis.

Abstract:
Multi-task robotic bimanual manipulation is becoming increasingly popular as it enables sophisticated tasks that require diverse dual-arm collaboration patterns. Compared to unimanual manipulation, bimanual tasks pose challenges to understanding the multi-body spatiotemporal dynamics. An existing method, ManiGaussian [30], pioneers encoding the spatiotemporal dynamics into the visual representation via a Gaussian world model for single-arm settings, but it ignores the interaction of multiple embodiments in dual-arm systems and suffers a significant performance drop. In this paper, we propose ManiGaussian++, an extension of the ManiGaussian framework that improves multi-task bimanual manipulation by digesting multi-body scene dynamics through a hierarchical Gaussian world model. To be specific, we first generate task-oriented Gaussian Splatting from intermediate visual features, which aims to differentiate acting and stabilizing arms for multi-body spatiotemporal dynamics modeling. We then build a hierarchical Gaussian world model with the leader-follower architecture, where the multi-body spatiotemporal dynamics are mined for intermediate visual representation via future scene prediction. The leader predicts Gaussian Splatting deformation caused by motions of the stabilizing arm, through which the follower generates the physical consequences resulting from the movement of the acting arm. As a result, our method significantly outperforms the current state-of-the-art bimanual manipulation techniques by an improvement of 20.2% in 10 simulated tasks, and achieves a 60% success rate on average in 9 challenging real-world tasks. Our code is available at https://github.com/April-Yz/ManiGaussian_Bimanual.

Abstract:
We propose a closed-form solution to the landmark simultaneous localization and mapping (landmark-SLAM) problem. The core idea is to extend the recent advancement in the generalized Procrustes analysis (GPA) research by incorporating an affine-relaxed odometry term. We show that the resulting affine-relaxed landmark-SLAM formulation, termed affine-SLAM, can be solved globally in closed form. Through numerical experiments, we demonstrate that the affine-SLAM solution is rather close to the optimal solution of the standard nonlinear least squares (NLS) optimization, and thus can be used either as a stand-alone approximate solution or as a high-quality initialization for NLS solvers.
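As background for the Procrustes component only, the sketch below shows the classical closed-form 2D rigid alignment between two sets of corresponding landmarks. It is not the affine-SLAM formulation, whose details the abstract does not give; the function name `align_2d` and the sample landmarks are hypothetical.

```python
import math


def align_2d(src, dst):
    """Closed-form least-squares rigid alignment in 2D (illustrative only).

    Finds the rotation angle theta and translation (tx, ty) that minimize
    sum ||R(theta) * src_i + t - dst_i||^2 over corresponding points.
    """
    n = len(src)
    # Centroids of both landmark sets.
    ax = sum(p[0] for p in src) / n
    ay = sum(p[1] for p in src) / n
    bx = sum(p[0] for p in dst) / n
    by = sum(p[1] for p in dst) / n

    # Accumulate dot and cross products of the centered correspondences.
    dot = 0.0
    cross = 0.0
    for (sx, sy), (dx, dy) in zip(src, dst):
        sx, sy = sx - ax, sy - ay
        dx, dy = dx - bx, dy - by
        dot += sx * dx + sy * dy
        cross += sx * dy - sy * dx

    # The optimal rotation has a closed form; no iteration is needed.
    theta = math.atan2(cross, dot)
    c, s = math.cos(theta), math.sin(theta)
    # The translation maps the rotated source centroid onto the target centroid.
    tx = bx - (c * ax - s * ay)
    ty = by - (s * ax + c * ay)
    return theta, (tx, ty)


if __name__ == "__main__":
    landmarks = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
    c, s = math.cos(0.5), math.sin(0.5)
    observed = [(c * x - s * y + 3.0, s * x + c * y - 1.0) for x, y in landmarks]
    print(align_2d(landmarks, observed))
```

A solution of this kind needs no initial guess, which is the property that makes closed-form estimates attractive as initializations for iterative NLS solvers.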

Abstract:
Reversible morphing wings, capable of exchanging the direction of the leading and trailing edge, improve the feasibility of stopped-rotor aerial vehicles and thereby expand aerial robotics capabilities. Despite the potential benefits of reversible morphing wings, few designs exist and most do not consider the coupled aerodynamic and structural effects of the introduced morphing mechanisms. We report a six-bar linkage-inspired reversible morphing wing design that is compliant, yet structurally rigid against aerodynamic loading. Compared to alternative methods for reversing airfoil direction, the developed wing increases the maximum aerodynamic performance by 50%, while also yielding a non-zero efficiency at 0° angle of attack. The proposed linkage system derives rigidity by strategically eliminating degrees of freedom upon actuation. To validate the linkage theory of rigidity, the full wing is studied under one-way fluid-structural interaction simulations. Furthermore, we demonstrate how constrained optimization can be applied to improve the aerodynamic efficiency of these wings subject to the constraints of a six-bar linkage system.

Abstract:
Visual Simultaneous Localization and Mapping (VSLAM) research faces significant challenges due to fragmented toolchains, complex system configurations, and inconsistent evaluation methodologies. To address these issues, we present VSLAM-LAB, a unified framework designed to streamline the development, evaluation, and deployment of VSLAM systems. VSLAM-LAB simplifies the entire workflow by enabling seamless compilation and configuration of VSLAM algorithms, automated dataset downloading and preprocessing, and standardized experiment design, execution, and evaluation. All of these features are accessible through a single command-line interface. The framework supports a wide range of VSLAM systems and datasets, offering broad compatibility and extendability while promoting reproducibility through consistent evaluation metrics and analysis tools. By reducing implementation complexity and minimizing configuration overhead, VSLAM-LAB empowers researchers to focus on advancing VSLAM methodologies and accelerates progress toward scalable, real-world solutions. We demonstrate the ease with which user-relevant benchmarks can be created: here, we introduce difficulty-level-based categories, but one could envision environment-specific or condition-specific categories.

Abstract:
In robotic reinforcement learning, the Sim2Real gap remains a critical challenge. However, the impact of static friction on Sim2Real has been underexplored. Conventional domain randomization methods typically exclude static friction from their parameter space. In our robotic reinforcement learning task, such conventional domain randomization approaches resulted in significantly underperforming real-world models. To address this Sim2Real challenge, we employed Actuator Net as an alternative to conventional domain randomization. While this method enabled successful transfer to flat-ground locomotion, it failed on complex terrains like stairs. To further investigate physical parameters affecting Sim2Real in robotic joints, we developed a control-theoretic joint model and performed systematic parameter identification. Our analysis revealed unexpectedly high friction-torque ratios in our robotic joints. To mitigate their impact, we implemented static friction-aware domain randomization for Sim2Real. Recognizing the increased training difficulty introduced by friction modeling, we proposed a simple and novel solution to reduce learning complexity. To validate this approach, we conducted comprehensive Sim2Sim and Sim2Real experiments comparing three methods: conventional domain randomization (without static friction), Actuator Net, and our static friction-aware domain randomization. All experiments utilized the Rapid Motor Adaptation (RMA) algorithm. Results demonstrated that our method achieved superior adaptive capabilities and overall performance.
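To make the idea of including static friction in the randomized parameter space concrete, here is a minimal sketch under stated assumptions. The abstract does not describe the joint model, so the stiction deadband below, the function names, and the torque range are all hypothetical rather than the authors' implementation.

```python
import random


def sample_static_friction(rng, low, high):
    """Draw one static friction torque (N*m) per episode from an assumed uniform range."""
    return rng.uniform(low, high)


def apply_static_friction(commanded_torque, joint_velocity, static_friction, vel_eps=1e-3):
    """Assumed stiction model: a joint at rest only moves once the commanded
    torque exceeds the static friction torque; otherwise no torque is transmitted.
    Friction while the joint is already moving is deliberately left out."""
    at_rest = abs(joint_velocity) < vel_eps
    if at_rest and abs(commanded_torque) <= static_friction:
        return 0.0
    return commanded_torque


if __name__ == "__main__":
    rng = random.Random(0)
    # Hypothetical range; real values would come from parameter identification.
    tau_s = sample_static_friction(rng, 0.2, 1.0)
    for tau_cmd in (0.1, 0.5, 1.5):
        print(tau_cmd, apply_static_friction(tau_cmd, 0.0, tau_s))
```

Resampling the friction torque at the start of every training episode would expose the policy to a range of stiction levels instead of a frictionless joint.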

Abstract:
Limited data has become a major bottleneck in scaling up offline imitation learning (IL). In this paper, we propose enhancing IL performance under limited expert data by introducing a pre-training stage that learns dynamics representations, derived from factorizations of the transition dynamics. We first theoretically justify that the optimal decision variable of offline IL lies in the representation space, significantly reducing the parameters to learn in the downstream IL. Moreover, the dynamics representations can be learned from arbitrary data collected with the same dynamics, allowing the reuse of massive non-expert data and mitigating the limited data issues. We present a tractable loss function inspired by noise contrastive estimation to learn the dynamics representations at the pre-training stage. Experiments on MuJoCo demonstrate that our proposed algorithm can mimic expert policies with as few as a single trajectory. Experiments on real quadrupeds show that we can leverage pre-trained dynamics representations from simulator data to learn to walk from a few real-world demonstrations.

Abstract:
Natural environments pose significant challenges for autonomous robot navigation, particularly due to their unstructured and ever-changing nature. Hiking trails, with their dynamic conditions influenced by weather, vegetation, and human traffic, represent one of these challenges. This work introduces a novel approach to autonomous hiking trail navigation that balances trail adherence with the flexibility to adapt to off-trail routes when necessary. The solution is a Traversability Analysis module that integrates semantic data from camera images with geometric information from LiDAR to create a comprehensive understanding of the surrounding terrain. A planner uses this traversability map to navigate safely, adhering to trails while allowing off-trail movement when necessary to avoid on-trail hazards or for safe off-trail shortcuts. The method is evaluated through simulation to determine the balance between semantic and geometric information in traversability estimation. These simulations tested various weights to assess their impact on navigation performance across different trail scenarios. Weights were then validated through autonomous field tests at the West Virginia University Core Arboretum, demonstrating the method’s effectiveness in a real-world environment.
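The weighting between semantic and geometric information can be pictured with a small sketch. It assumes both sources have already been converted to per-cell costs in [0, 1] on the same grid and are blended with a single convex weight; the abstract does not state the actual fusion rule, so `fuse_traversability` and its parameters are hypothetical.

```python
def fuse_traversability(semantic, geometric, w_semantic):
    """Blend two cost grids cell by cell (illustrative convex combination).

    semantic, geometric: equally sized 2D lists of costs in [0, 1],
                         where 0 is freely traversable and 1 is blocked.
    w_semantic: weight of the semantic cost; the geometric cost gets 1 - w_semantic.
    """
    if not 0.0 <= w_semantic <= 1.0:
        raise ValueError("w_semantic must lie in [0, 1]")
    w_geometric = 1.0 - w_semantic
    return [
        [w_semantic * s + w_geometric * g for s, g in zip(s_row, g_row)]
        for s_row, g_row in zip(semantic, geometric)
    ]


if __name__ == "__main__":
    # Hypothetical 1x2 grid: the first cell is on-trail but geometrically rough.
    semantic_cost = [[0.0, 1.0]]
    geometric_cost = [[1.0, 1.0]]
    for w in (0.25, 0.5, 0.75):
        print(w, fuse_traversability(semantic_cost, geometric_cost, w))
```

Sweeping the weight, as in the loop above, mirrors the kind of study the abstract describes: a larger semantic weight favors staying on the trail, while a larger geometric weight favors terrain that is physically easier to cross.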

Abstract:
Physics-informed Neural Motion Planners (PiNMPs) provide a data-efficient framework for solving the Eikonal Partial Differential Equation (PDE) and representing the cost-to-go function for motion planning. However, their scalability remains limited by spectral bias and the complex loss landscape of PDE-driven training. Domain decomposition mitigates these issues by dividing the environment into smaller subdomains, but existing methods enforce continuity only at individual spatial points. While effective for function approximation, these methods fail to capture the spatial connectivity required for motion planning, where the cost-to-go function depends on both the start and goal coordinates rather than a single query point. We propose Finite Basis Neural Time Fields (FB-NTFields), a novel neural field representation for scalable cost-to-go estimation. Instead of enforcing continuity in output space, FB-NTFields construct a latent space representation, computing the cost-to-go as a distance between the latent embeddings of start and goal coordinates. This enables global spatial coherence while integrating domain decomposition, ensuring efficient large-scale motion planning. We validate FB-NTFields in complex synthetic and real-world scenarios, demonstrating substantial improvements over existing PiNMPs. Finally, we deploy our method on a Unitree B1 quadruped robot, successfully navigating indoor environments. The supplementary videos can be found at https://youtu.be/OpRuCbLNOwM.
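For readers unfamiliar with the Eikonal formulation, the two relations mentioned above can be written as follows. The notation is assumed rather than taken from the paper: $T$ is the cost-to-go (arrival time) between start $q_s$ and goal $q_g$, $S$ is a speed function that is small near obstacles, $\phi$ is the latent encoder, and $d$ is some distance in latent space whose exact form the abstract does not specify.

```latex
\[
  \bigl\lVert \nabla_{q_g} T(q_s, q_g) \bigr\rVert = \frac{1}{S(q_g)},
  \qquad
  T(q_s, q_g) \approx d\bigl(\phi(q_s), \phi(q_g)\bigr)
\]
```

The first relation is the standard Eikonal equation in two-point form; the second only restates the abstract's description of the cost-to-go as a distance between latent embeddings.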

Abstract:
The present study aims at investigating how humans engage with common communication modalities—speech, tablet, and gesture—when interacting with a humanoid robot. To explore this, we designed a live interaction experiment using a congruence paradigm, where participants engaged with a robot presenting two out of three modalities simultaneously: one as the primary cue and the other as a distracting cue. We measured participants’ task performance (response time, error rate) and fixation distribution (fixation count and duration proportions) across different roles (primary, distracting, neither) and areas of interest (face, tablet, gesture). Additionally, we compared fixation patterns between the performance and baseline phases. Our findings reveal that while the tablet is the most effective modality for task engagement, it also serves as a strong attentional distractor, dominating gaze allocation regardless of its informational value. This underscores the importance of carefully balancing tablet integration in HRI design. Notably, our results demonstrate that gaze patterns alone do not fully reveal attentional focus, emphasizing the need to consider both overt and covert cognitive processes in multimodal HRI. These insights provide valuable guidelines for designing more effective and engaging human-robot interactions.

Abstract:
Despite significant advances in self-supervised monocular depth estimation methods, achieving temporally consistent and accurate depth maps from frame sequences remains a formidable challenge. Existing approaches often estimate depth maps for individual frames in isolation, neglecting the rich geometric and temporal coherence present across frames. Consequently, this oversight leads to temporally inconsistent outputs, resulting in noticeable temporal flickering artifacts. In response, this paper presents TCNet, a Temporal Consistent Network for self-supervised monocular depth estimation. Specifically, we propose an Inter-frame Temporal Fusion (ITF) module to emphasize the influence of preceding images on the depth estimation of the current frame. The Temporal Consistency Loss (TCL) is proposed to leverage the temporal constraints between the depth maps of adjacent frames. Besides, TCNet can also be applied to both single-frame and multi-frame scenarios during inference. Experimental evaluations on the KITTI dataset demonstrate that our method surpasses state-of-the-art depth estimation methods in accuracy and temporal consistency. Our code will be made public.

Abstract:
A decentralized strategy for object transportation is presented, assuming that the object is grasped by a team of N cooperative manipulators. The proposed strategy consists of two steps. First, each robot estimates the wrenches applied to the object by all the other robots, even without all-to-all communication. Second, an admittance control scheme is used to limit internal wrenches, preventing excessive stresses that could affect manipulation stability and object integrity. Stability is proven under the assumption of a spring connection between each robot end-effector and its grasping point on the object. A work cell with two 7-degree-of-freedom (DOF) and one 6-DOF robotic manipulators was used to validate the strategy. Experimental results show that the controller effectively reduces internal wrenches, confirming the feasibility and robustness of the decentralized approach in cooperative manipulation.

Abstract:
Humans typically avert their gaze when faced with situations involving another person’s privacy, and humanoid robots should exhibit similar behaviors. Various approaches exist for privacy recognition, including an image privacy recognition model and a Large Vision-Language Model (LVLM). The former relies on datasets of labeled images, which raise ethical concerns, while the latter requires more time to recognize images accurately, making real-time responses difficult. To this end, we propose a method of automatically constructing the LLM Privacy Text Dataset (LPT Dataset), a privacy-related text dataset with privacy indicators, and a method of recognizing whether observing a scene violates privacy without ethically sensitive training images. In constructing the LPT Dataset, which consists of both private and public scenes, we use an LLM to define privacy indicators and generate texts scored for each indicator. Our model recognizes whether a given image is private or public by retrieving texts with privacy scores similar to the image in a multi-modal feature space. In our experiments, we evaluated the performance of our model on three image privacy datasets and in a realistic experiment with a humanoid robot in terms of accuracy and responsiveness. The experiments show that our approach identifies private images as accurately as the highly tuned LVLM without delay.

Abstract:
This paper builds up the skill of impact-aware non-prehensile manipulation through a hitting motion of a redundant robot arm by allowing it to come in contact with the environment with the appropriate link according to the requirements of the hitting task. In tasks where the directional effective inertia of a robot is important at the contact point, it is useful to understand the inertia at different links, so as to select the appropriate link. Hitting with those links allows us to manipulate a wider range of object masses since the robot’s effective inertia is different at different links. We propose a learning-based methodology for selecting a hitting link based on the hitting task specifications, generating the impact posture for the robot, and automatically generating desired directional inertia values throughout the hitting motion.

Abstract:
We present OpenRoboCare, a multimodal dataset for robot caregiving, capturing expert occupational therapist demonstrations of Activities of Daily Living (ADLs). Caregiving tasks involve complex physical human-robot interactions, requiring precise perception under occlusions, safe physical contact, and long-horizon planning. While recent advances in robot learning from demonstrations have shown promise, there is a lack of a large-scale, diverse, and expert-driven dataset that captures real-world caregiving routines. To address this gap, we collect data from 21 occupational therapists performing 15 ADL tasks on two manikins. The dataset spans five modalities: RGB-D video, pose tracking, eye-gaze tracking, task and action annotations, and tactile sensing. Together they provide rich multimodal insights into caregiver movement, attention, force application, and task execution strategies. We further analyze expert caregiving principles and strategies, offering insights to improve robot efficiency and task feasibility. Additionally, our evaluations demonstrate that OpenRoboCare presents challenges for state-of-the-art robot perception and human activity recognition methods, both critical for developing safe and adaptive assistive robots, highlighting the value of our contribution. See our website for additional visualizations: https://emprise.cs.cornell.edu/robo-care/.

Abstract:
Teams of cooperating autonomous underwater vehicles (AUVs) rely on acoustic communication for coordination, yet this communication medium is constrained by limited range, multi-path effects, and low bandwidth. One way to address the uncertainty associated with acoustic communication is to learn the communication environment in real-time. We address the challenge of a team of robots building a map of the probability of communication success from one location to another in real-time. This is a decentralized classification problem – communication events are either successful or unsuccessful – where AUVs share a subset of their communication measurements to build the map. The main contribution of this work is a rigorously derived data sharing policy that selects measurements to be shared among AUVs. We experimentally validate our proposed sharing policy using real acoustic communication data collected from teams of Virginia Tech 690 AUVs, demonstrating its effectiveness in underwater environments.

Abstract:
Safety has been of paramount importance in motion planning and control techniques and has been an active area of research in the past few years. Most safety research for mobile robots targets maintaining safety in the sense of collision avoidance. However, safety goes beyond just avoiding collisions, especially when robots have to navigate unstructured, vertically challenging, off-road terrain, where vehicle rollover and immobilization are as critical as collisions. In this work, we introduce a novel Traversability-based Control Barrier Function (T-CBF), in which we use neural Control Barrier Functions (CBFs) to achieve safety beyond collision avoidance on unstructured vertically challenging terrain by reasoning about new safety aspects in terms of traversability. The neural T-CBF, trained on safe and unsafe observations specific to traversability safety, is then used to generate safe trajectories. Furthermore, we present experimental results in simulation and on a physical Verti-4 Wheeler (V4W) platform, demonstrating that T-CBF can provide traversability safety while reaching the goal position. The T-CBF planner outperforms previously developed planners by 30% in terms of keeping the robot safe and mobile when navigating on real-world vertically challenging terrain.

Abstract:
Compact and efficient 6DoF object pose estimation is crucial in applications such as robotics, augmented reality, and space autonomous navigation systems, where lightweight models are critical for real-time accurate performance. This paper introduces a novel uncertainty-aware end-to-end Knowledge Distillation (KD) framework focused on keypoint-based 6DoF pose estimation. Keypoints predicted by a large teacher model exhibit varying levels of uncertainty that can be exploited within the distillation process to enhance the accuracy of the student model while ensuring its compactness. To this end, we propose a distillation strategy that aligns the student and teacher predictions by adjusting the knowledge transfer based on the uncertainty associated with each teacher keypoint prediction. Additionally, the proposed KD leverages this uncertainty-aware alignment of keypoints to transfer the knowledge at key locations of their respective feature maps. Experiments on the widely-used LINEMOD benchmark demonstrate the effectiveness of our method, achieving superior 6DoF object pose estimation with lightweight models compared to state-of-the-art approaches. Further validation on the SPEED+ dataset for spacecraft pose estimation highlights the robustness of our approach under diverse 6DoF pose estimation scenarios.

Abstract:
Accurate LiDAR-Camera (LC) calibration is challenging but crucial for autonomous systems and robotics. In this paper, we propose two single-shot and target-less algorithms to estimate the calibration parameters between LiDAR and camera using line features. The first algorithm constructs line-to-line constraints by defining points-to-line projection errors and minimizes the projection error. The second algorithm (PLK-Calib) utilizes the co-perpendicular and co-parallel geometric properties of lines in Plücker (PLK) coordinates, and decouples the rotation and translation into two constraints, enabling more accurate estimates. Our degeneracy analysis and Monte Carlo simulation indicate that three nonparallel line pairs are the minimal requirement to estimate the extrinsic parameters. Furthermore, we collect an LC calibration dataset with varying extrinsics under three different scenarios and use it to evaluate the performance of our proposed algorithms.

Abstract:
While human vision inherently achieves robust cross-domain depth estimation through binocular coordination, robotic systems employing stereo matching still confront significant challenges in maintaining robustness across domains when performing real-time environmental depth perception. Furthermore, most stereo matching methods struggle with challenging regions such as object boundaries and non-overlapping areas on the left side of the left image, resulting in disparity maps that are relatively indistinct and lacking fine details. In this paper, we propose Learning More in Challenging Areas (LMC) to alleviate this problem, which enhances the domain generalization of the model through targeted training on challenging regions. LMC is a simple yet effective data-driven training framework primarily based on self-distillation. Specifically, 1) We pre-train models on a high-frequency dataset to improve perception ability on object boundaries; 2) We develop a self-distillation training strategy to benefit learning in non-overlapping areas on the left side of the left image; 3) We design an adaptive difficult area mask to balance the loss weight on other undefined challenging regions. Under our proposed training framework, GwcNet achieves 33% and 23% performance improvements in autonomous driving benchmarks KITTI 2012 and KITTI 2015 respectively, while preserving real-time inference efficiency without computational overhead.

Abstract:
Current approaches to integrating CLIP into language-driven robotics face a fundamental dilemma: While robotic implementations overlook cutting-edge 3D classification adaptations of CLIP, existing 3D-oriented CLIP methods prove inadequate for interpreting color-critical instructions prevalent in manipulation tasks. We resolve this through DualCLIP, a contrastive multimodal fusion framework that hierarchically integrates depth-aligned CLIP encoders. Our approach first aligns depth and CLIP RGB encoders using synthetic RGB-D pairs, then performs multimodal fusion via contrastive learning with language-triplet optimization. This joint training preserves 3D geometric coherence and color semantics. Evaluations demonstrate DualCLIP’s combined strength — surpassing CLIP2Point in 3D classification while showing promising improvements for CLIPORT in color-sensitive robotic manipulation. This work establishes a paradigm for translating vision-language models into 3D-aware robotic systems without compromising task-specific modality sensitivity.

Abstract:
Autonomous systems rely on accurate 3D object detection from LiDAR data, yet most detectors are limited to a predefined set of known classes, making them vulnerable to unexpected out-of-distribution (OOD) objects. In this work, we present HD-OOD3D, a novel two-stage method for detecting unknown objects. We demonstrate the superiority of two-stage approaches over single-stage methods, achieving more robust detection of unknown objects. Furthermore, we conduct an in-depth analysis of the standard evaluation protocol for OOD detection, revealing the critical impact of hyperparameter choices. To address the challenge of scaling the learning of unknown objects, we explore unsupervised training strategies to generate pseudo-labels for unknowns. Among the different approaches evaluated, our experiments show that top-K autolabelling offers more promising performance compared to simple resizing techniques.

Abstract:
This paper proposes a tightly coupled fusion method for inertial and forward-looking sonar (FLS) data, integrating underwater image observations from the FLS into the inertial odometry for underwater robot localization. Since the FLS images provide only horizontal plane information, this work focuses on the positioning of the underwater robot in a 2D plane. In underwater navigation systems, relying solely on inertial measurements often leads to error accumulation and suboptimal localization results, as demonstrated in previous studies. To address this issue, we integrate the FLS data into the inertial odometry. Specifically, we convert the sonar images into 2D underwater point clouds and use an Error State Kalman Filter (ESKF) to fuse Inertial Measurement Unit (IMU) data with the sonar point cloud data for joint estimation of the initial pose. Next, edge feature point clouds are extracted from the sonar images using a horizontal scanning method. Finally, by constructing edge feature error terms, we constrain the relative position changes between two adjacent sonar frames. Through experiments in an underwater simulation environment (Dave) and a real pool, the results show that the proposed fusion method can significantly improve the localization accuracy of underwater robots compared to using inertial odometers and sonar odometers alone.

Abstract:
We present Asymmetric Critic Guided Distillation, ACGD, a framework for learning multi-task dexterous manipulation policies that can manipulate articulated objects using images as input. ACGD is a scalable student-teacher distillation approach that utilizes behavior cloning to distill multiple expert policies into a single vision-based, multi-task student policy for dexterous manipulation. The expert policies are trained with traditional RL techniques with access to privileged state information of both the robot and the manipulated object, while the distilled student policy operates under realistic sensory constraints, specifically using only camera images and robot proprioception. During distillation, we use an expert-critic that provides action labels and value estimates to refine the student’s action sampling through a dual IL/RL objective. In the multi-task setting, we achieve this through an aggregate critic for different single-task experts. Our approach exhibits strong performance compared to a number of state-of-the-art imitation learning (IL) and reinforcement learning (RL) baselines. We evaluate across a variety of multi-task dexterous manipulation benchmarks, including bimanual manipulation, single-hand object articulation tasks, and a tendon-actuated hand, and achieve state-of-the-art performance with 10-15% improvement over the baseline algorithms. Visit our website for more details.

Abstract:
In recent years, Vision-Language Models (VLMs) have exhibited a powerful capacity for reasoning, decomposing long-horizon tasks, and motion planning in robotic manipulation tasks. However, the current operating speed of VLMs limits the interaction interval between users and the model to several seconds, which prevents real-time perception of environmental changes while executing tasks issued by the VLM. We propose Real-time Object Detection - VLM (ROD-VLM), a novel framework that combines the classical object detection algorithm YOLO-v5x with a VLM to achieve real-time robotic environmental perception, reasoning, and manipulation. Specifically, we introduce the concept of key frames to the VLM, capturing crucial information through the object detection algorithm to assist the VLM in perceiving a changing environment. Our comprehensive real-world experiments show that ROD-VLM possesses excellent capability in real-time environmental understanding, decision-making, and action execution.

Abstract:
Uncertainties in contact dynamics and object geometry remain significant barriers to robust robotic manipulation. Caging helps mitigate these uncertainties by constraining an object’s mobility without requiring precise contact modeling. Existing caging research often treats morphology and policy optimization as separate problems, overlooking their synergy. In this paper, we introduce CageCoOpt, a hierarchical framework that jointly optimizes manipulator morphology and control policy for robust caging-based manipulation. The framework employs reinforcement learning for policy optimization at the lower level and multitask Bayesian optimization for morphology optimization at the upper level. We incorporate a caging metric into both optimization levels to encourage caging configurations and thereby improve manipulation robustness. The evaluation consists of four manipulation tasks and demonstrates that co-optimizing morphology and policy improves task performance under uncertainties, establishing caging-guided co-optimization as a viable approach for robust manipulation.

Abstract:
A novel camera autocalibration method is presented. Any camera model can be calibrated, and no calibration targets like checkerboards are used. The method requires the camera to be mounted on a lidar-equipped moving platform travelling through a structured environment along a known path. The primary reason for cross-modal camera calibration is not to solve the sensor fusion problem, but to tap the huge amount of accurate metric data points available from the lidar. The number of measurements is easily four orders of magnitude higher than in checkerboard-based methods. This leads to improved estimation accuracy, especially of higher-order distortion coefficients. In a multi-camera setup, the lidar additionally defines a common reference coordinate system for all cameras. Compared to the majority of published methods on camera-lidar autocalibration, (i) our calibration procedure relies on motion features, (ii) the hard-to-obtain-accurately lidar-lidar and lidar-image feature correspondences are not required, and (iii) both camera extrinsics and intrinsics, including complex distortion models, are autocalibrated. Experiments show that the calibration accuracy reaches or exceeds the accuracy of methods relying on calibration targets.

Abstract:
Autonomous learning of dexterous, long-horizon robotic skills has been a longstanding pursuit of embodied AI. Recent advances in robotic reinforcement learning (RL) have demonstrated remarkable performance and robustness in real-world visuomotor control tasks. However, applying RL in the real world faces challenges such as low sample efficiency, slow exploration, and significant reliance on human intervention. In contrast, simulators offer a safe and efficient environment for extensive exploration and data collection, while the visual sim-to-real gap, often a limiting factor, can be mitigated using real-to-sim techniques. Building on these observations, we propose SimLauncher, a novel framework that combines the strengths of real-world RL and real-to-sim-to-real approaches to overcome these challenges. Specifically, we first pre-train a visuomotor policy in the digital twin simulation environment, which then benefits real-world RL in two ways: (1) bootstrapping target values using extensive simulated demonstrations and real-world demonstrations derived from pre-trained policy rollouts, and (2) incorporating action proposals from the pre-trained policy for better exploration. We conduct comprehensive experiments across multi-stage, contact-rich, and dexterous hand manipulation tasks. Compared to prior real-world RL approaches, SimLauncher significantly improves sample efficiency and achieves near-perfect success rates. We hope this work serves as a proof of concept and inspires further research on leveraging large-scale simulation pre-training to benefit real-world robotic RL.

Abstract:
In this work, we present PlaceNet: a deep learning framework for mobile manipulator base placement which provides solutions to the shortcomings common in the state-of-the-art. Our method addresses the lack of obstacle awareness of reachability methods and the limited generalization of learning methods. Using only the raw pointcloud and task pose data as input, PlaceNet learns the concepts of reachability and obstacle occlusions in an environment-independent manner, enabling its use in situations outside its training experiences. Tests comparing PlaceNet to inverse reachability and heuristic methods demonstrated state-of-the-art performance in both the In-Distribution and Out-Of-Distribution test sets, achieving as high as 98% success rate for problems with many solutions, and an 82% success rate overall. PlaceNet can be trained on grounded pointcloud data from any source without the need for dynamic simulation, marking it as an accessible alternative to similar frameworks which require expensive, high-performance GPUs for running simultaneous simulation and training or which depend on labor intensive data collection. PlaceNet is lightweight during deployment and can easily run with low latency on affordable hardware, including laptop GPUs and the NVIDIA Jetson line for embedded deployment.

Abstract:
This paper presents a novel collision avoidance method for general ellipsoids based on control barrier functions (CBFs) and separating hyperplanes. First, collision-free conditions for general ellipsoids are analytically derived using the concept of dual cones. These conditions are incorporated into the CBF framework by extending the system dynamics of controlled objects with separating hyperplanes, enabling efficient and reliable collision avoidance. The validity of the proposed collision-free CBFs is rigorously proven, ensuring their effectiveness in enforcing safety constraints. The proposed method requires only single-level optimization, significantly reducing computational time compared to state-of-the-art methods. Numerical simulations and real-world experiments demonstrate the effectiveness and practicality of the proposed algorithm.
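
The separating-hyperplane idea can be illustrated with a small check based on support functions. This is only a sketch of the standard test, not the dual-cone derivation or the CBF construction from the paper. It assumes each ellipsoid is written as {c + A u : ||u|| <= 1}; the names `support` and `separation_margin` are hypothetical, and the normal is assumed to be a unit vector so that the margin is a distance.

```python
import math
from typing import Sequence


def support(center: Sequence[float], shape: Sequence[Sequence[float]],
            normal: Sequence[float]) -> float:
    """Support function max over x in E of n.x, for E = {c + A u : ||u|| <= 1}.

    For this parametrisation the maximum equals n.c + ||A^T n||.
    """
    dim = len(normal)
    # Compute A^T n column by column.
    at_n = [sum(shape[r][c] * normal[r] for r in range(dim)) for c in range(dim)]
    return (sum(ci * ni for ci, ni in zip(center, normal))
            + math.sqrt(sum(v * v for v in at_n)))


def separation_margin(c1, a1, c2, a2, normal) -> float:
    """Gap between E1 and E2 measured along `normal`.

    A positive value means some hyperplane with this normal separates the two
    sets. A non-positive value only means this particular normal fails; it does
    not by itself prove a collision.
    """
    opposite = [-v for v in normal]
    lowest_point_of_e2 = -support(c2, a2, opposite)  # min over E2 of n.x
    highest_point_of_e1 = support(c1, a1, normal)    # max over E1 of n.x
    return lowest_point_of_e2 - highest_point_of_e1
```

For two unit circles centred at (0, 0) and (3, 0) with normal (1, 0), the margin is 1, the free space between them along the x-axis.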

Abstract:
We introduce Go-Slam, a novel framework that combines 3D Gaussian Splatting SLAM with grounded object segmentation and open-vocabulary querying to enable object-aware 3D scene reconstruction. Go-Slam incrementally builds high-fidelity 3D maps from RGB-D inputs while embedding semantic information by assigning unique object identifiers to Gaussian primitives. This integration allows the system to support flexible, natural language queries and accurately localize objects in complex, static environments. To achieve robust semantic mapping, Go-Slam leverages object detection and segmentation models, enabling consistent object identification across frames without relying on predefined categories. We evaluate Go-Slam across diverse indoor scenes, demonstrating improvements over existing baselines in both reconstruction quality and object localization accuracy. Our results show that Go-Slam effectively bridges the gap between geometric mapping and semantic understanding, supporting real-time scene interaction and object retrieval in open-world environments.

Abstract:
Constructing a unified canonical pose representation for 3D object categories is crucial for pose estimation and robotic scene understanding. Previous unified pose representations often relied on manual alignment, such as in ShapeNet and ModelNet. Recently, self-supervised canonicalization methods have been proposed. However, they are sensitive to intra-class shape variations, and their canonical pose representations cannot be aligned to a coordinate system centered on the object. In this paper, we propose a category-level canonicalization method that alleviates the impact of shape variation and extends the canonical pose representation to an upright and forward-facing state. First, we design a Siamese Vector Neurons Module (SVNM) that achieves SE(3) equivariance modeling and self-supervised disentangling of 3D shape and pose attributes. Next, we introduce a Siamese equivariant constraint that addresses the pose alignment bias caused by shape deformation. Finally, we propose a method to generate upright surface labels from pose-unknown in-the-wild data and use upright and symmetry losses to correct the canonical pose. Experimental results show that our method not only achieves SOTA consistency performance but also aligns with the object-centered coordinate system. Project page: https://anon-mity.github.io/upright-facing/

Abstract:
Distributed Multi-Agent Path Finding (MAPF) integrated with Multi-Agent Reinforcement Learning (MARL) has emerged as a prominent research focus, enabling real-time cooperative decision-making in partially observable environments through inter-agent communication. However, due to insufficient collaborative and perceptual capabilities, existing methods are inadequate for scaling across diverse environmental conditions. To address these challenges, we propose PC2P, a novel distributed MAPF method derived from a Q-learning-based MARL framework. First, we introduce a personalized-enhanced communication mechanism based on dynamic graph topology, which determines the core aspects of "who" and "what" in the interactive process through three-stage operations: selection, generation, and aggregation. Concurrently, we incorporate local crowd perception to enrich agents’ heuristic observations, thereby strengthening the model’s guidance for effective actions via the integration of static spatial constraints and dynamic occupancy changes. To resolve extreme deadlock issues, we propose a region-based deadlock-breaking strategy that leverages expert guidance to implement efficient coordination within confined areas. Experimental results demonstrate that PC2P achieves superior performance compared to state-of-the-art distributed MAPF methods in varied environments. Ablation studies further confirm the effectiveness of each module for overall performance.

Abstract:
Recent advances in Model Predictive Control (MPC) leveraging a combination of first-order methods, such as the Alternating Direction Method of Multipliers (ADMM), and offline precomputation and caching of select operations, have excitingly enabled real-time MPC on microcontrollers. Unfortunately, these approaches require the use of fixed hyperparameters, limiting their adaptability and overall performance. In this work, we introduce First-Order Adaptive Caching, which precomputes not only select matrix operations but also their sensitivities to hyperparameter variations, enabling online hyperparameter updates without full recomputation of the cache. We demonstrate the effectiveness of our approach on a number of dynamic quadrotor tasks, achieving up to a 63.4% reduction in ADMM iterations over the use of optimized fixed hyperparameters and approaching 70% of the performance of a full cache recomputation, while reducing the computational cost from O(n^3) to O(n^2) complexity. This performance enables us to perform figure-eight trajectories on a 27g tiny quadrotor under wind disturbances. We release our implementation open-source for the benefit of the wider robotics community.
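
The idea of caching a sensitivity alongside a precomputed matrix can be sketched on a toy problem. The abstract does not say which matrices are cached, so the regularised inverse K(rho) = (Q + rho I)^-1 below is a stand-in chosen for illustration, and all function names are hypothetical. The derivative dK/drho = -K K is computed once offline; the online step is a first-order update that touches each entry once, which is where the O(n^2) cost comes from, in contrast to a fresh O(n^3) inversion.

```python
from typing import List, Tuple

Matrix = List[List[float]]


def inv2(m: Matrix) -> Matrix:
    """Exact inverse of a 2x2 matrix."""
    (a, b), (c, d) = m
    det = a * d - b * c
    if det == 0.0:
        raise ValueError("matrix is singular")
    return [[d / det, -b / det], [-c / det, a / det]]


def matmul2(x: Matrix, y: Matrix) -> Matrix:
    """Product of two 2x2 matrices."""
    return [[sum(x[i][k] * y[k][j] for k in range(2)) for j in range(2)]
            for i in range(2)]


def build_cache(q: Matrix, rho: float) -> Tuple[Matrix, Matrix]:
    """Offline step: K = (Q + rho I)^-1 and its sensitivity dK/drho = -K K."""
    m = [[q[i][j] + (rho if i == j else 0.0) for j in range(2)] for i in range(2)]
    k = inv2(m)
    dk = [[-v for v in row] for row in matmul2(k, k)]
    return k, dk


def update_cache(k: Matrix, dk: Matrix, delta_rho: float) -> Matrix:
    """Online step: first-order update K + delta_rho * dK/drho, no inversion."""
    return [[k[i][j] + delta_rho * dk[i][j] for j in range(2)] for i in range(2)]
```

The approximation error grows with the square of `delta_rho`, so a scheme of this kind is only useful for moderate hyperparameter changes between full recomputations.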

Abstract:
Bipedal locomotion control is essential for humanoid robots to navigate complex, human-centric environments. While optimization-based control designs are popular for integrating sophisticated models of humanoid robots, they often require labor-intensive manual tuning. In this work, we address the challenges of parameter selection in bipedal locomotion control using DiffTune, a model-based autotuning method that leverages differential programming for efficient parameter learning. A major difficulty lies in balancing model fidelity with differentiability. We address this difficulty using a low-fidelity model for differentiability, enhanced by a Ground Reaction Force-and-Moment Network (GRFM-Net) to capture discrepancies between MPC commands and actual control effects. We validate the parameters learned by DiffTune with GRFM-Net in hardware experiments, which demonstrates the parameters’ optimality in a multi-objective setting compared with baseline parameters, reducing the total loss by up to 40.5% compared with the expert-tuned parameters. The results confirm the GRFM-Net’s effectiveness in mitigating the sim-to-real gap, improving the transferability of simulation-learned parameters to real hardware.

Abstract:
State estimation is crucial for legged robots as it directly affects control performance and locomotion stability. In this paper, we propose an Adaptive Invariant Extended Kalman Filter to improve proprioceptive state estimation for legged robots. The proposed method adaptively adjusts the noise level of the contact foot model based on online covariance estimation, leading to improved state estimation under varying contact conditions. It effectively handles small slips that traditional slip rejection fails to address, as overly sensitive slip rejection settings risk causing filter divergence. Our approach employs a contact detection algorithm instead of contact sensors, reducing the reliance on additional hardware. The proposed method is validated through real-world experiments on the quadruped robot LeoQuad, demonstrating enhanced state estimation performance in dynamic locomotion scenarios.

Abstract:
The integration of large language models (LLMs) into robotic task planning has unlocked better reasoning capabilities for complex, long-horizon workflows. However, ensuring safety in LLM-driven plans remains a critical challenge, as these models often prioritize task completion over risk mitigation. This paper introduces SAFER (Safety-Aware Framework for Execution in Robotics), a multi-LLM framework designed to embed safety awareness into robotic task planning. SAFER employs a Safety Agent that operates alongside the primary task planner, providing safety feedback. Additionally, we introduce LLM-as-a-Judge, a novel metric leveraging LLMs as evaluators to quantify safety violations within generated task plans. Our framework integrates safety feedback at multiple stages of execution, enabling real-time risk assessment, proactive error correction, and transparent safety evaluation. We also integrate a control framework using Control Barrier Functions (CBFs) to ensure safety guarantees within SAFER’s task planning. We evaluated SAFER against state-of-the-art LLM planners on complex long-horizon tasks involving heterogeneous robotic agents, demonstrating its effectiveness in reducing safety violations while maintaining task efficiency. We also verify the task planner and safety planner through actual hardware experiments involving multiple robots and a human.

Affiliations: Tianjin Key Laboratory of Intelligent Unmanned Swarm Technology and System, School of Electrical and Information Engineering, Tianjin University, Tianjin, China; Institute of Artificial Intelligence, Shanghai University, Shanghai, China; Department of Aeronautical and Aviation Engineering, Hong Kong Polytechnic University, Hung Hom, Hong Kong, China; EFY Intelligent Control (Tianjin) Technology Company Ltd., Tianjin, China

Abstract:
This paper proposes a multi-quadrotor method that covers the whole process from detecting and locating targets to encircling and tracking them. The reconnaissance quadrotor performs accurate target detection with a one-stage detector based on a convolutional neural network. Then, using a pinhole camera projection model, the target is located by mapping its 2D pixel coordinates to 3D North-East-Down (NED) world coordinates. Finally, the hunter quadrotors encircle the target and perform time-varying formation tracking based on consensus theory. We also prove the stability of the time-varying formation tracking control. We built a multi-quadrotor platform composed of one reconnaissance quadrotor and four hunter quadrotors, and deployed the method on it to conduct a series of validation experiments with a minibus as the target. The results indicate that the reconnaissance quadrotor can accurately detect the target with small localization errors in the north and east directions. The hunter quadrotors can encircle and track the target in time-varying formation based on the target information provided by the reconnaissance quadrotor. The experiments demonstrate that the method achieves high-speed and accurate target encirclement.
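
The pixel-to-NED localization step can be sketched under strong simplifying assumptions that the abstract does not state: a downward-looking (nadir) camera, flat ground, and the top of the image pointing north. The function name `pixel_to_ned` and its parameters are hypothetical. A real system would also apply the camera attitude and lens distortion.

```python
from typing import Tuple


def pixel_to_ned(u: float, v: float,
                 fx: float, fy: float, cx: float, cy: float,
                 cam_ned: Tuple[float, float, float],
                 ground_down: float = 0.0) -> Tuple[float, float, float]:
    """Back-project pixel (u, v) onto a flat ground plane in NED coordinates.

    Assumes a nadir camera whose image x-axis points east and whose image
    y-axis points south (so the top of the image faces north).
    """
    cam_n, cam_e, cam_d = cam_ned
    # In NED, Down is positive, so the height above ground is ground_down - cam_d.
    height = ground_down - cam_d
    if height <= 0.0:
        raise ValueError("camera must be above the ground plane")
    # Pinhole model: normalised image coordinates scaled by the depth along
    # the optical axis, which equals the height for a nadir camera.
    x_cam = (u - cx) / fx * height  # metres to the right in the image (east)
    y_cam = (v - cy) / fy * height  # metres downward in the image (south)
    return (cam_n - y_cam, cam_e + x_cam, ground_down)
```

With the camera 10 m above the ground and a focal length of 500 pixels, a target 500 pixels to the right of the principal point lies 10 m east of the camera.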

Abstract:
In this paper, a LiDAR-inertial odometry (LIO) method that eliminates the influence of moving objects in dynamic driving scenarios is proposed. This method constructs binarized labels for the 3D points of the current sweep, and uses the label difference between each point and its surrounding points in the global map to identify moving objects. The surrounding points in the global map are located by voxel-location-based nearest neighbor search, without involving any heavy computation. In addition, the proposed method is embedded into a LIO system (i.e., Dynamic-LIO), and achieves state-of-the-art performance on public datasets with extremely low computational overhead (i.e., 1~9 ms/sweep). We have released the source code of this work for the development of the community.
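
A voxel-keyed neighbour lookup with a label comparison might look like the sketch below. The abstract does not define what the binary labels encode or how the label difference is thresholded, so the class `LabelMap`, the function `looks_dynamic`, and both thresholds are assumptions made purely for illustration. The point of the sketch is that neighbours are found by hashing the voxel index, so no tree search is needed.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

Point = Tuple[float, float, float]
Key = Tuple[int, int, int]


def voxel_key(point: Point, voxel_size: float) -> Key:
    """Integer voxel index containing the point."""
    return tuple(math.floor(c / voxel_size) for c in point)


class LabelMap:
    """Global map storing binary labels per voxel (hypothetical structure)."""

    def __init__(self, voxel_size: float) -> None:
        self.voxel_size = voxel_size
        self.cells: Dict[Key, List[int]] = defaultdict(list)

    def insert(self, point: Point, label: int) -> None:
        self.cells[voxel_key(point, self.voxel_size)].append(label)

    def neighbour_labels(self, point: Point) -> List[int]:
        """Labels stored in the point's voxel and its 26 adjacent voxels."""
        kx, ky, kz = voxel_key(point, self.voxel_size)
        labels: List[int] = []
        for dx in (-1, 0, 1):
            for dy in (-1, 0, 1):
                for dz in (-1, 0, 1):
                    labels.extend(self.cells.get((kx + dx, ky + dy, kz + dz), []))
        return labels


def looks_dynamic(label_map: LabelMap, point: Point, label: int,
                  min_neighbours: int = 3, max_disagreement: float = 0.5) -> bool:
    """Flag a point whose label disagrees with most of its map neighbours."""
    neighbours = label_map.neighbour_labels(point)
    if len(neighbours) < min_neighbours:
        return False  # too little evidence in the map to decide
    disagree = sum(1 for other in neighbours if other != label)
    return disagree / len(neighbours) > max_disagreement
```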

Abstract:
Imitation learning has emerged as an effective paradigm for training visuo-motor policies in robotic manipulation. In real-world scenarios, visuo-motor policies are required to be effective, sample-efficient, and capable of adapting to dynamic environments. A key factor influencing these capabilities is the quality of visual representations. Conventional approaches that learn a vision encoder and policy network from scratch often result in suboptimal representations, as the training process tends to prioritize policy optimization over rich semantic feature extraction. Alternatively, while pre-trained large vision models offer strong general-purpose features, they often fail to capture the fine-grained, task-specific information required for effective manipulation. To capture rich and informative visual features, we propose TOC-DP, a novel framework that integrates SlotAttention to facilitate object-centric representation learning. Task-specific segmentation priors are incorporated as an inductive bias to enhance the task-awareness and object-awareness of the learned visual features. The extracted representations are subsequently refined to encode action-aware information during downstream policy learning. Extensive experiments on the Meta-World benchmark and real-world tasks demonstrate that TOC-DP achieves a 30% improvement in success rate over baseline methods during deployment for a variety of scenarios.

Abstract:
Autonomous mobile robots are increasingly used in pedestrian-rich environments where safe navigation and appropriate human interaction are crucial. While Deep Reinforcement Learning (DRL) enables socially integrated robot behavior, it remains challenging to indicate when and why the policy is uncertain in novel or perturbed scenarios. Unknown uncertainty in decision-making can lead to collisions or human discomfort and is one reason why safe and risk-aware navigation is still an open problem. This work introduces a novel approach that integrates aleatoric, epistemic, and predictive uncertainty estimation into a DRL navigation framework for policy distribution uncertainty estimates. To this end, we incorporate Observation-Dependent Variance (ODV) and dropout into the Proximal Policy Optimization (PPO) algorithm. For different types of perturbations, we compare the ability of deep ensembles and Monte-Carlo dropout (MC-dropout) to estimate the uncertainties of the policy. In uncertain decision-making situations, we propose to change the robot’s social behavior to conservative collision avoidance. The results show improved training performance with ODV and dropout in PPO and reveal that the training scenario has an impact on the generalization. In addition, MC-dropout is more sensitive to perturbations and correlates the uncertainty type to the perturbation better. With the safe action selection, the robot can navigate in perturbed environments with fewer collisions.

Abstract:
The concept of 3D scene graphs is increasingly recognized as a powerful semantic and hierarchical representation of the environment. Current approaches often address this at a coarse, object-level resolution. In contrast, our goal is to develop a representation that enables robots to directly interact with their environment by identifying both the location of functional interactive elements and how these can be used. To achieve this, we focus on detecting and storing objects at a finer resolution, focusing on affordance-relevant parts. The primary challenge lies in the scarcity of data that extends beyond instance-level detection and the inherent difficulty of capturing detailed object features using robotic sensors. We leverage currently available 3D resources to generate 2D data and train a detector, which is then used to augment the standard 3D scene graph generation pipeline. Through our experiments, we demonstrate that our approach achieves functional element segmentation comparable to state-of-the-art 3D models and that our augmentation enables task-driven affordance grounding with higher accuracy than the current solutions. See our project page at https://fungraph.github.io.

Abstract:
Task-oriented grasping (TOG) is essential for robots to perform manipulation tasks, requiring grasps that are both stable and compliant with task-specific constraints. Humans naturally grasp objects in a task-oriented manner to facilitate subsequent manipulation tasks. By leveraging human grasp demonstrations, current methods can generate high-quality robotic parallel-jaw task-oriented grasps for diverse objects and tasks. However, they still encounter challenges in maintaining grasp stability and sampling efficiency. These methods typically rely on a two-stage process: first performing exhaustive task-agnostic grasp sampling in the 6-DoF space, then applying demonstration-induced constraints (e.g., contact regions and wrist orientations) to filter candidates. This leads to inefficiency and potential failure due to the vast sampling space. To address this, we propose the Human-guided Grasp Diffuser (HGDiffuser), a diffusion-based framework that integrates these constraints into a guided sampling process. Through this approach, HGDiffuser directly generates 6-DoF task-oriented grasps in a single stage, eliminating exhaustive task-agnostic sampling. Furthermore, by incorporating Diffusion Transformer (DiT) blocks as the feature backbone, HGDiffuser improves grasp generation quality compared to MLP-based methods. Experimental results demonstrate that our approach significantly improves the efficiency of task-oriented grasp generation, enabling more effective transfer of human grasping strategies to robotic systems. To access the source code and supplementary videos, visit https://sites.google.com/view/hgdiffuser.

Abstract:
Achieving reliable navigation for autonomous drones in complex environments remains a significant challenge, particularly in low-light conditions. To address this, we propose an integrated multimodal obstacle detection and adaptive neural control system with online learning to enable drones to navigate autonomously both during the day and at night. The proposed multimodal obstacle detection system integrates two ranging LiDAR sensors and a depth camera with sensory processing techniques, including the iKD-Tree interested area search algorithm, sensor fusion, and neuro-obstacle directional feature extraction. This ensures robust obstacle detection across various conditions without requiring sensor reconfiguration. The adaptive neural control system applies Hebbian correlation-based learning and synaptic scaling plasticity principles to continuously update the control weights, allowing the drone to dynamically adapt its speed and maneuver around obstacles in real time. We evaluate the system’s performance in both simulation and real-world environments, demonstrating its effectiveness under diverse lighting conditions and obstacle types.

Abstract:
LiDAR-based localization serves as a critical component in autonomous systems, yet existing approaches face persistent challenges in balancing repeatability, accuracy, and environmental adaptability. Traditional point cloud registration methods relying solely on offline maps often exhibit limited robustness against long-term environmental changes, leading to localization drift and reliability degradation in dynamic real-world scenarios. To address these challenges, this paper proposes DuLoc, a robust and accurate localization method that tightly couples LiDAR-inertial odometry with offline map-based localization, incorporating a constant-velocity motion model to mitigate outlier noise in real-world scenarios. Specifically, we develop a LiDAR-based localization framework that seamlessly integrates a prior global map with dynamic real-time local maps, enabling robust localization in unbounded and changing environments. Extensive real-world experiments in an ultra-unbounded port, involving 2,856 hours of operational data across 32 Intelligent Guided Vehicles (IGVs), are conducted and reported in this study. The results demonstrate that our system outperforms other state-of-the-art LiDAR localization systems in large-scale changing outdoor environments.

Abstract:
This paper proposes a novel Tensegrity-Based Fracture Reduction Robot (TFRR) designed to enhance the safety and efficacy of orthopedic procedures through integrated force-sensing and control capabilities. Inspired by the biomechanics of skeletal muscles, the robot adopts a tensegrity architecture that enables real-time monitoring of internal force distribution and dynamic adjustment of posture and inter-bone contact forces via controlled tensioning of its string network. To establish a theoretical foundation for system control, a comprehensive static analysis of the tensegrity structure is conducted, allowing accurate modulation of topological configurations through systematic tension control. Extensive experimental validation demonstrates the robustness and reliability of the proposed method across a range of operating conditions. In particular, targeted experiments on contact-force regulation confirm the robot’s ability to precisely monitor and adjust inter-bone forces during fracture reduction. These features collectively enable safer, more controlled surgical interventions, with the potential to reduce tissue trauma and improve clinical outcomes.

Abstract:
Point-to-point and periodic motions are ubiquitous in the world of robotics. To master these motions, Autonomous Dynamic System (ADS) based algorithms are fundamental in the domain of Learning from Demonstration (LfD). However, these algorithms face the significant challenge of balancing precision in learning with the maintenance of system stability. This paper addresses this challenge by presenting a novel ADS algorithm that leverages neural network technology. The proposed algorithm is designed to distill essential knowledge from demonstration data, ensuring stability during the learning of both point-to-point and periodic motions. For point-to-point motions, a neural Lyapunov function is proposed to align with the provided demonstrations. In the case of periodic motions, the neural Lyapunov function is used with transverse contraction to ensure that all generated motions converge to a stable limit cycle. The model uses a streamlined neural network architecture that achieves two objectives: optimizing learning accuracy while maintaining global stability. To assess the efficacy of the proposed algorithm, rigorous evaluations are conducted using the LASA dataset and a massage robot task. The assessments are complemented by empirical validation, providing evidence of the algorithm’s performance.

Abstract:
This paper introduces a novel hierarchical control approach for feature matching, real-time tracking and inter-UAV collision avoidance in multiple unmanned aerial vehicle-unmanned ground vehicle (multi-UAV-UGV) collaborative tracking. Our approach is divided into three layers: optimal feature matching, tracking control by reinforcement learning (RL), and collision avoidance using control barrier functions (CBFs). First, a distance cost matrix is constructed based on the feature matching capabilities of UAVs and UGVs to determine the optimal matching configuration. It allows UAVs to perform the tracking task while minimizing travel distance. Second, an RL-based tracker is developed to achieve precise real-time tracking without depending on UAV dynamic models. The tracker is trained in a single UAV-UGV environment, which reduces policy convergence difficulty by simplifying the state space and interactions compared with training in complex multi-UAV-UGV scenarios. Third, a collision avoidance mechanism based on CBFs is introduced to transform RL commands into collision-free actions by solving a quadratic programming (QP) problem. Extensive simulations and real-world experiments demonstrate the effectiveness of the proposed approach.
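
The third layer, turning an RL command into a collision-free one through a QP, can be sketched for the simplest possible case. The abstract does not give the vehicle dynamics or the barrier function, so this example assumes a velocity-controlled UAV, a single neighbour treated as static, and the barrier h = ||p_self - p_other||^2 - d_safe^2. With one linear constraint the QP (minimise the squared distance to the nominal command subject to a.u + alpha*h >= 0) has a closed-form solution, so no solver is needed. The name `cbf_filter` and its parameters are hypothetical.

```python
from typing import List, Sequence


def cbf_filter(u_nom: Sequence[float],
               p_self: Sequence[float],
               p_other: Sequence[float],
               d_safe: float,
               alpha: float = 1.0) -> List[float]:
    """Return the command closest to u_nom that satisfies the CBF condition.

    Barrier: h = ||p_self - p_other||^2 - d_safe^2, kept non-negative.
    Condition for velocity control: grad_h . u + alpha * h >= 0.
    """
    diff = [a - b for a, b in zip(p_self, p_other)]
    h = sum(d * d for d in diff) - d_safe ** 2
    grad = [2.0 * d for d in diff]  # gradient of h with respect to p_self
    slack = sum(g * u for g, u in zip(grad, u_nom)) + alpha * h
    if slack >= 0.0:
        return list(u_nom)  # nominal command is already safe
    grad_sq = sum(g * g for g in grad)
    if grad_sq == 0.0:
        raise ValueError("vehicles coincide; the constraint is degenerate")
    # Project u_nom onto the half-space boundary grad . u + alpha * h = 0.
    scale = slack / grad_sq
    return [u - scale * g for u, g in zip(u_nom, grad)]
```

With several neighbours there is one constraint per pair and the projection no longer has this closed form, which is where a general QP solver comes in.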

Abstract:
Pneumatic soft actuators are known for their versatility and reliability; however, their control presents a major challenge as systems scale beyond tens of actuators. Traditional rigid pneumatic valves add bulk, weight, and complexity, while most soft valves fail to generate programmable independent output states. We propose PneuChip, a compact pneumatic controller designed for large-scale soft actuators. The PneuChip functions as a two-dimensional array with rows and columns, controlled by m + n input signals to generate 2^(m+n) - 2^m - 2^n + 2 distinct output states, enabling programmable control over m × n soft actuators. To validate its effectiveness, we implemented PneuChip in a muscular-skeletal robotic arm comprising 24 Miura-Ori inspired, negative pressure-actuated artificial muscles and a rigid two-link skeleton connected by a ball joint. A 4×6 PneuChip was fabricated and integrated to control the arm’s motion in 3 degrees of freedom (DOFs). Within a 120° rotation range, the robot arm achieved 946 distinct positions with smooth state transitions, paving the way for future applications in trajectory tracking and dexterous manipulation. The compact design and high controllability of PneuChip promise to notably simplify complex pneumatic systems, significantly enhancing the practicality of large-scale soft robots for various applications.
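
As a quick arithmetic check, the state count can be evaluated directly, reading the expression as 2^(m+n) - 2^m - 2^n + 2. For the 4×6 chip this gives 1024 - 16 - 64 + 2 = 946, which matches the number of distinct arm positions reported in the abstract. The function name `pneuchip_states` is hypothetical.

```python
def pneuchip_states(m: int, n: int) -> int:
    """Distinct output states of an m x n array driven by m + n input signals.

    Uses the closed form 2^(m+n) - 2^m - 2^n + 2.
    """
    if m < 1 or n < 1:
        raise ValueError("m and n must be positive integers")
    return 2 ** (m + n) - 2 ** m - 2 ** n + 2


# The 4 x 6 chip: 10 input signals addressing 24 actuators.
print(pneuchip_states(4, 6))  # 946
```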

Abstract:
The rock-climbing fish is a benthic organism that can move rapidly and flexibly on rock surfaces in complex underwater environments. Studies have shown that the unique adhesion-sliding movement mechanism of the rock-climbing fish relies on the anisotropic friction exhibited by its sucker structure, which helps to reduce friction in the forward direction and resist the impact of the flow field. In this work, inspired by the anisotropic friction of the rock-climbing fish's sucker, we design the adsorption module and the pectoral and pelvic fin flapping module of a robotic fish to realize contact switching from low friction to high friction. Meanwhile, the propulsion module adopts a novel wire-driven caudal fin design that can oscillate at high frequency (exceeding ~5 Hz). The resulting robotic fish can realize different motion modes such as adhesion-sliding movement (~0.5 BL/s) and wall-stabilized adsorption. This work provides a new solution for the design of underwater wall robots.

Abstract:
Inferring the affordance of an object and grasping it in a task-oriented manner is crucial for robots to successfully complete manipulation tasks. Affordance indicates where and how to grasp an object by taking its functionality into account, serving as the foundation for effective task-oriented grasping. However, current task-oriented methods often depend on extensive training data that is confined to specific tasks and objects, making it difficult to generalize to novel objects and complex scenes. In this paper, we introduce AffordGrasp, a novel open-vocabulary grasping framework that leverages the reasoning capabilities of vision-language models (VLMs) for in-context affordance reasoning. Unlike existing methods that rely on explicit task and object specifications, our approach infers tasks directly from implicit user instructions, enabling more intuitive and seamless human-robot interaction in everyday scenarios. Building on the reasoning outcomes, our framework identifies task-relevant objects and grounds their part-level affordances using a visual grounding module. This allows us to generate task-oriented grasp poses precisely within the affordance regions of the object, ensuring both functional and context-aware robotic manipulation. Extensive experiments demonstrate that AffordGrasp achieves state-of-the-art performance in both simulation and real-world scenarios, highlighting the effectiveness of our method. We believe our approach advances robotic manipulation techniques and contributes to the broader field of embodied AI. Project website: https://eqcy.github.io/affordgrasp/.

Abstract:
Effectively utilizing multi-sensory data is important for robots to generalize across diverse tasks. However, the heterogeneous nature of these modalities makes fusion challenging. Existing methods propose strategies to obtain comprehensively fused features but often ignore the fact that each modality requires different levels of attention at different manipulation stages. To address this, we propose a force-guided attention fusion module that adaptively adjusts the weights of visual and tactile features without human labeling. We also introduce a self-supervised future force prediction auxiliary task to reinforce the tactile modality, improve data imbalance, and encourage proper adjustment. Our method achieves an average success rate of 93% across three fine-grained, contact-rich tasks in real-world experiments. Further analysis shows that our policy appropriately adjusts attention to each modality at different manipulation stages. The videos can be viewed at https://adaptac-dex.github.io/.

Abstract:
Personalizing bite sizes is crucial for robot-assisted feeding, as users have diverse dietary needs and preferences. However, precisely controlling the amount of food scooped remains a challenge due to variations in food properties, such as texture, granularity, and cohesion. This work introduces SAVR (Scooping Adaptation for Variable food properties via Reinforcement learning), a learning-based framework that enables robots to scoop a targeted amount of food while adapting to different food characteristics. SAVR integrates Dynamic Motion Primitives (DMPs) with Reinforcement Learning (RL), where DMPs provide a structured motion representation and RL refines execution by modifying the force term within the DMP formulation. This formulation enables efficient learning by allowing the RL agent to fine-tune the scooping trajectory rather than learning entire trajectories from scratch. Through ablation studies, we show that segmented spoon and food masks, combined with force-torque data, are essential for accurate scooping, significantly improving sim-to-real transfer. We validate SAVR on a real robotic system, demonstrating substantial improvements in accuracy and adaptability over baselines, without any additional fine-tuning.
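
The idea of letting a learned policy adjust only the DMP force term can be illustrated with a minimal one-dimensional sketch. This is not SAVR's implementation: the gains, the Euler integration, and the `forcing_residual` callable (a stand-in for the RL policy's correction) are all assumptions chosen for illustration.

```python
def rollout_dmp(y0, goal, forcing_residual=None, tau=1.0, dt=0.01,
                alpha_z=25.0, beta_z=6.25, alpha_x=4.0):
    """Integrate a 1-D discrete DMP with Euler steps and return the positions.

    forcing_residual(x) is a hypothetical callable standing in for the
    correction an RL policy would add to the forcing term.
    """
    y, z, x = y0, 0.0, 1.0  # position, scaled velocity, phase variable
    trajectory = [y]
    steps = int(round(tau / dt))
    for _ in range(steps):
        f = 0.0  # nominal forcing term (zero here; normally fitted to a demo)
        if forcing_residual is not None:
            f += forcing_residual(x)
        # Spring-damper pull towards the goal plus the phase-gated forcing term.
        dz = (alpha_z * (beta_z * (goal - y) - z) + f * x * (goal - y0)) / tau
        dy = z / tau
        dx = -alpha_x * x / tau
        z += dz * dt
        y += dy * dt
        x += dx * dt
        trajectory.append(y)
    return trajectory


nominal = rollout_dmp(0.0, 1.0)
adapted = rollout_dmp(0.0, 1.0, forcing_residual=lambda x: 50.0)
```

Because the residual only reshapes the forcing term, the spring-damper part of the system still pulls the motion towards the goal, which is one reason this kind of formulation can be easier to learn than a full trajectory.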

Abstract:
Robots operating in everyday environments encounter a wide variety of previously unseen objects. Deep Learning methods simplify unknown object and scene segmentation by structuring inherent real-world complexities, improving visual scene understanding. However, they need vast amounts of labeled high-variance data for training. Acquiring these labels for rich real-world data requires significant manual effort, especially for segmentation masks. Although interactive segmentation accelerates this process, these methods still require substantial manual interaction, and the creation of large datasets remains labor-intensive. Consequently, there is a lack of diverse, high-quality datasets for unknown object instance segmentation in everyday environments. This research proposes a semi-automatic, RGB-only algorithmic pipeline for annotating novel objects, reducing manual effort to iteratively placing objects in the scene. We investigate several change detection-based approaches, including remote sensing change detection methods (TTP model), the DeepBackgroundMattingV2 image matting model, and the Segment Anything Model (SAM1 + SAM2) prompted with automatically extracted change regions. We propose the novel ILIS dataset to evaluate these methods in challenging everyday scenes, displaying reliable automatic mask proposal performance of up to 0.9549 mIoU and 0.9565 boundary F1 score. This highlights the potential of this method to accelerate large-scale dataset creation, saving at least 27.27 hours per 1,000 images by eliminating manual annotations.
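
The step of turning a detected change into a prompt for a segmentation model can be sketched as follows. This is a deliberately naive illustration, assuming grayscale images stored as nested lists and a fixed intensity threshold; the abstract's pipeline uses learned change detection and matting models, and the function name `change_region_bbox` is hypothetical.

```python
def change_region_bbox(before, after, threshold=30):
    """Return (x_min, y_min, x_max, y_max) of the pixels that changed, or None.

    before/after are equally sized grayscale images given as nested lists.
    The box could then be passed as a prompt to a segmentation model.
    """
    xs, ys = [], []
    for y, (row_before, row_after) in enumerate(zip(before, after)):
        for x, (b, a) in enumerate(zip(row_before, row_after)):
            if abs(a - b) > threshold:
                xs.append(x)
                ys.append(y)
    if not xs:
        return None  # nothing was placed in the scene
    return (min(xs), min(ys), max(xs), max(ys))


empty_scene = [[0] * 4 for _ in range(4)]
with_object = [row[:] for row in empty_scene]
with_object[1][1] = 200
with_object[2][2] = 200
box = change_region_bbox(empty_scene, with_object)
```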

Abstract:
Achieving controlled sliding of objects on finger surfaces is a significant challenge for robots, substantially constraining their ability to perform complex in-hand manipulation tasks. In this work, we investigate the role of surface vibration in modulating the effective friction at the object-finger contact locations to facilitate controlled sliding. We demonstrate that friction at contact points can be reduced by applying targeted vibrations at specific locations on a robotic finger, creating regions that are suitable for sliding. In this way, we create sticking/sliding regions on finger surfaces on demand and can easily switch between sliding and rolling contacts. To investigate this phenomenon, we embedded an array of vibration modules into robotic fingers. We first analyzed the velocity fields created by surface vibrations on a single finger. Then, we developed a method to select the appropriate activation states of the modules that achieve the desired velocity field at a given object location. Utilizing these fingers and the vibration selection method, we formed a two-finger robotic hand and demonstrated controlled sliding and rotation of a held object within the hand. To the best of our knowledge, this is the first work that utilizes vibration-induced friction modulation for in-hand manipulation that can achieve combinations of object sliding and rolling actions.

Abstract:
High-resolution underwater 3D scene reconstruction is crucial for various applications, including construction, infrastructure maintenance, monitoring, exploration, and scientific investigation. Prior work has leveraged the complementary sensing modalities of imaging sonars and optical cameras for opti-acoustic 3D scene reconstruction, demonstrating improved results over methods which rely solely on either sensor. However, while most existing approaches focus on offline reconstruction, real-time spatial awareness is essential for both autonomous and piloted underwater vehicle operations. This paper presents OASIS, an opti-acoustic fusion method that integrates data from optical images with voxel carving techniques to achieve real-time 3D reconstruction of unstructured underwater workspaces. Our approach utilizes an "eye-in-hand" configuration, which leverages the dexterity of robotic manipulator arms to capture multiple workspace views across a short baseline. We validate OASIS through tank-based experiments and present qualitative and quantitative results that highlight its utility for underwater manipulation tasks.

Abstract:
The construction of high-definition (HD) maps at intersections is crucial for autonomous driving and vehicle-to-infrastructure (V2I) collaboration. However, the semantic complexity of intersections poses significant challenges for HD mapping. Previous research has predominantly relied on traditional algorithms to process LiDAR or camera data, which often struggle with occlusion and inherent sensor limitations. To address these challenges, we propose a novel method, called ESFusion for Effective BEV Feature Selection and Fusion. To the best of our knowledge, this is the first work to leverage multi-modal data from intelligent roadside infrastructure, particularly LiDAR and cameras, for generating HD maps at intersections. To enhance multi-modal feature representation in Bird's Eye View (BEV), we design a Cross-modal Channel Exchange (CCE) module that creates multi-scale spatial features and facilitates LiDAR-camera information exchange across channels. Additionally, we introduce a Dynamic Feature Selection (DFS) module to adaptively select the most valuable information between modalities. Comprehensive evaluations on the DAIR-V2X dataset demonstrate that our method outperforms single-modal approaches and existing state-of-the-art fusion methods for vehicle-side applications. Moreover, experiments on the nuScenes dataset further highlight the high flexibility of our proposed module, showcasing its ability to be seamlessly integrated into existing multi-modal fusion workflows.

Abstract:
Navigating dense and dynamic environments poses a significant challenge for autonomous driving systems, owing to the intricate nature of multimodal interaction, wherein the actions of various traffic participants and the autonomous vehicle are complex and implicitly coupled. In this paper, we propose a novel framework, Multimodal Integrated predictioN and Decision-making (MIND), which addresses the challenges by efficiently generating joint predictions and decisions covering multiple distinctive interaction modalities. Specifically, MIND leverages learning-based scenario predictions to obtain integrated predictions and decisions with socially-consistent interaction modality and utilizes a modality-aware dynamic branching mechanism to generate scenario trees that efficiently capture the evolutions of distinctive interaction modalities with low growth of interaction uncertainty along the planning horizon. The scenario trees are seamlessly utilized by the contingency planning under interaction uncertainty to obtain clear and considerate maneuvers accounting for multimodal evolutions. Comprehensive experimental results in the closed-loop simulation based on the real-world driving dataset showcase superior performance to other strong baselines under various driving contexts. Code is available at: https://github.com/HKUST-Aerial-Robotics/MIND.

Abstract:
In robotics, structural design and behavior optimization have long been considered separate processes, resulting in the development of systems with limited capabilities. Recently, co-design methods have gained popularity, where bi-level formulations are used to simultaneously optimize the robot design and behavior for specific tasks. However, most implementations assume a serial or tree-type model of the robot, overlooking the fact that many robot platforms incorporate parallel mechanisms. In this paper, we present a first co-design formulation that explicitly incorporates parallel coupling constraints into the dynamic model of the robot. In this framework, an outer optimization loop focuses on the design parameters, in our case the transmission ratios of a parallel belt-driven manipulator, which map the desired torques from the joint space to the actuation space. An inner loop performs trajectory optimization in the actuation space, thus exploiting the entire dynamic range of the manipulator. We compare the proposed method with a conventional co-design approach based on a simplified tree-type model. By taking advantage of the actuation space representation, our approach leads to a significant increase in dynamic payload capacity compared to the conventional co-design implementation.

Abstract:
In complex scenarios where typical pick-and-place techniques are insufficient, non-prehensile manipulation can often ensure that a robot is able to fulfill its task. However, non-prehensile manipulation is challenging due to its underactuated nature with hybrid dynamics, where a robot needs to reason about an object’s long-term behavior and contact switching while being robust to contact uncertainty. The presence of clutter in the workspace further complicates this task, introducing the need for more advanced spatial analysis to avoid unwanted collisions. Building upon prior work on reinforcement learning with multimodal categorical exploration for planar pushing, we propose to incorporate location-based attention to enable robust manipulation in cluttered scenes. Unlike previous approaches addressing this obstacle-avoiding pushing task, our framework requires no predefined global paths and considers the desired target orientation of the manipulated object. Experimental results in simulation as well as with a real KUKA iiwa robot arm demonstrate that our learned policy manipulates objects successfully while avoiding collisions through complex obstacle configurations, including dynamic obstacles, to reach the desired target pose.

Abstract:
To match the dexterity of human hands, a robot’s end-effector needs tactile sensing. However, current tactile sensing solutions often have complex electronics, making them impractical for covering the entire area of the manipulator without interfering with robot manipulation, especially for miniature objects. Here, we present a tactile sensing design that enables a single accelerometer positioned at the base of a robot’s end-effector, to locate and estimate contact forces applied across a large region of the end-effector. Inspired by human tactile sensing, where a single afferent responds to skin vibrations over a large area, we integrated a string to transmit vibrations from remote contacts to the accelerometer. We utilized lightweight machine learning models to decode tactile information from the vibration signals captured by the accelerometer. Our experimental results demonstrate that we can accurately predict remote contact locations and force amplitudes, with a precision of 1.9 mm and 0.09 N, respectively. Additionally, the vibrations can be used to identify the surface materials of contact objects, achieving 99% accuracy in discriminating between 15 different materials. Our approach could help simplify and minimize the design of robot manipulators, enabling more delicate manipulation and reducing the hardware costs and data volume required for tactile sensing.

Abstract:
We present LiHRA, a novel dataset designed to facilitate the development of automated, learning-based, or classical risk monitoring (RM) methods for Human-Robot Interaction (HRI) scenarios. The growing prevalence of collaborative robots in industrial environments has increased the need for reliable safety systems. However, the lack of high-quality datasets that capture realistic human-robot interactions, including potentially dangerous events, slows development. LiHRA addresses this challenge by providing a comprehensive, multi-modal dataset combining 3D LiDAR point clouds, human body keypoints, and robot joint states, capturing the complete spatial and dynamic context of human-robot collaboration. This combination of modalities allows for precise tracking of human movement, robot actions, and environmental conditions, enabling accurate RM during collaborative tasks. The LiHRA dataset covers six representative HRI scenarios involving collaborative and coexistent tasks, object handovers, and surface polishing, with safe and hazardous versions of each scenario. In total, the dataset includes 4,431 labeled point clouds recorded at 10 Hz, providing a rich resource for training and benchmarking classical and AI-driven RM algorithms. Finally, to demonstrate LiHRA’s utility, we introduce an RM method that quantifies the risk level in each scenario over time. This method leverages contextual information, including robot states and the dynamic model of the robot. With its combination of high-resolution LiDAR data, precise human tracking, robot state data, and realistic collision events, LiHRA offers an essential foundation for future research into real-time RM and adaptive safety strategies in human-robot workspaces.

Abstract:
This paper proposes a novel Q-learning-based dual-loop force tracking control framework for robot grinding tasks in uncertain environments. A complete system state-space model is established, incorporating interaction dynamics and the desired force. By augmenting the system state, a discount cost function is defined to quantify the tracking errors of the force and reference trajectory. The modified Q-learning method is systematically designed to iteratively compute the optimal control gain in a model-free manner. To mitigate force overshoot during the transition from free space to contact space, a force reference model and a transition mechanism for the control gain are designed. Simulations and experiments validate the method’s effectiveness in precise force tracking with minimal overshoot and robustness to environmental variations.

Abstract:
Imitation Learning (IL) is a widely adopted approach that enables agents to learn from human expert demonstrations by framing the task as a supervised learning problem. However, IL often suffers from causal confusion, where agents misinterpret spurious correlations as causal relationships, leading to poor performance in testing environments with distribution shift. To address this issue, we introduce GAze-Based Regularization in Imitation Learning (GABRIL), a novel method that leverages the human gaze data gathered during the data collection phase to guide the representation learning in IL. GABRIL utilizes a regularization loss which encourages the model to focus on causally relevant features identified through expert gaze and consequently mitigates the effects of confounding variables. We validate our approach in Atari environments and the Bench2Drive benchmark in CARLA by collecting human gaze datasets and applying our method in both domains. Experimental results show that the improvement of GABRIL over behavior cloning is around 179% more than the same number for other baselines in the Atari setup and 76% more in the CARLA setup. Finally, we show that our method provides extra explainability when compared to regular IL agents. The datasets, the code, and some experiment videos are publicly available at https://liralab.usc.edu/gabril.

Abstract:
Trust is essential in human-robot collaboration, particularly in multi-human, multi-robot (MH-MR) teams, where it plays a crucial role in maintaining team cohesion in complex operational environments. Despite its importance, trust is rarely incorporated into task allocation and reallocation algorithms for MH-MR collaboration. While prior research in single-human, single-robot interactions has shown that integrating trust significantly enhances both performance outcomes and user experience, its role in MH-MR task allocation remains underexplored. In this paper, we introduce the Expectation Confirmation Trust (ECT) Model, a novel framework for modeling trust dynamics in MH-MR teams. We evaluate the ECT model against five existing trust models and a no-trust baseline to assess its impact on task allocation outcomes across different team configurations (2H-2R, 5H-5R, and 10H-10R). Our results show that the ECT model improves task success rate, reduces mean completion time, and lowers task error rates. These findings highlight the complexities of trust-based task allocation in MH-MR teams. We discuss the implications of incorporating trust into task allocation algorithms and propose future research directions for adaptive trust mechanisms that balance efficiency and performance in dynamic, multi-agent environments.

Abstract:
An increasing number of datasets sharing similar domains for semantic segmentation have been published over the past few years. Despite the growing amount of overall data, it is still difficult to train bigger and better models due to inconsistency in the taxonomy and/or labeling policies of different datasets. To this end, we propose a knowledge distillation approach that also serves as a label space unification method for semantic segmentation. In short, a teacher model is trained on a source dataset with a given taxonomy, then used to pseudo-label additional data for which ground truth labels of a related label space exist. By mapping the related taxonomies to the source taxonomy, we create constraints within which the model can predict pseudo-labels. Using the improved pseudo-labels, we train student models that consistently outperform their teachers in two challenging domains, namely urban and off-road driving. Our ground truth-corrected pseudo-labels span 12 and 7 public datasets with 388,230 and 18,558 images for the urban and off-road domains, respectively, creating the largest compound datasets for autonomous driving to date.
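
The constraint described above, where a ground truth label from a related taxonomy limits which source classes the teacher may predict, can be sketched per pixel as follows. The class names, the mapping table, and the fallback for unmapped labels are illustrative assumptions rather than the paper's actual taxonomy handling.

```python
def constrained_pseudo_label(teacher_scores, related_label, allowed_source_classes):
    """Pick the best-scoring source class compatible with the related GT label.

    teacher_scores: per-class scores of the teacher for one pixel.
    related_label: ground truth label in the related (foreign) taxonomy.
    allowed_source_classes: hypothetical mapping from a related label to the
        set of source-class indices it may correspond to.
    """
    allowed = allowed_source_classes.get(related_label)
    if not allowed:
        # No mapping known: fall back to the unconstrained teacher prediction.
        allowed = range(len(teacher_scores))
    return max(allowed, key=lambda c: teacher_scores[c])


# Source taxonomy (assumed): 0 = road, 1 = sidewalk, 2 = car.
mapping = {"drivable": {0}, "ground": {0, 1}, "vehicle": {2}}
scores = [0.1, 0.7, 0.2]  # the teacher alone would predict "sidewalk"
```

Here the teacher's preferred class survives when the related label is the coarse "ground", but is overridden when the related label says "drivable", which is how the ground truth corrects the pseudo-label.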

Abstract:
Artificial Potential Field (APF) methods are widely used for reactive flocking control, but they often suffer from challenges such as deadlocks and local minima, especially in the presence of obstacles. Existing solutions to address these issues are typically passive, leading to slow and inefficient collective navigation. As a result, many APF approaches have only been validated in obstacle-free environments or simplified, pseudo-3D simulations. This paper presents GO-Flock, a hybrid flocking framework that integrates planning with reactive APF-based control. GO-Flock consists of an upstream Perception Module, which processes depth maps to extract waypoints and virtual agents for obstacle avoidance, and a downstream Collective Navigation Module, which applies a novel APF strategy to achieve effective flocking behavior in cluttered environments. We evaluate GO-Flock against passive APF-based approaches to demonstrate their respective merits, such as their flocking behavior and the ability to overcome local minima. Finally, we validate GO-Flock in obstacle-filled environments and in hardware-in-the-loop experiments, where we successfully flocked a team of nine drones (six physical and three virtual) in a forest environment.
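
For readers unfamiliar with APF control, the following minimal sketch shows the classic attractive/repulsive formulation for a single agent in 2-D and why obstacles between the agent and its goal reduce the commanded velocity. It is the textbook scheme, not GO-Flock's APF strategy; the gains and the influence radius are assumed values.

```python
import math


def apf_velocity(position, goal, obstacles, k_att=1.0, k_rep=1.0, influence=2.0):
    """Return the (vx, vy) command from a classic artificial potential field."""
    # Attractive term: proportional to the vector towards the goal.
    vx = k_att * (goal[0] - position[0])
    vy = k_att * (goal[1] - position[1])
    for ox, oy in obstacles:
        dx, dy = position[0] - ox, position[1] - oy
        dist = math.hypot(dx, dy)
        if 0.0 < dist < influence:
            # Negative gradient of 0.5 * k_rep * (1/dist - 1/influence)^2,
            # which points away from the obstacle.
            magnitude = k_rep * (1.0 / dist - 1.0 / influence) / dist ** 2
            vx += magnitude * dx / dist
            vy += magnitude * dy / dist
    return vx, vy


free = apf_velocity((0.0, 0.0), (4.0, 0.0), [])
blocked = apf_velocity((0.0, 0.0), (4.0, 0.0), [(1.0, 0.0)])
```

When attraction and repulsion cancel along the line to the goal, the commanded velocity vanishes, which is the kind of local minimum the abstract refers to.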

Abstract:
Vehicle object detection benefits from both LiDAR and camera data, with LiDAR offering superior performance in many scenarios. Fusion of these modalities further enhances accuracy, but existing methods often introduce complexity or dataset-specific dependencies. In our study, we propose a model-adaptive late-fusion method, VaLID, which validates whether each predicted bounding box is acceptable or not. Our method verifies the higher-performing, yet overly optimistic LiDAR model detections using camera detections that are obtained from either specially trained, general, or open-vocabulary models. VaLID uses a lightweight neural verification network trained with a high recall bias to reduce the false predictions made by the LiDAR detector, while still preserving the true ones. Evaluating with multiple combinations of LiDAR and camera detectors on the KITTI dataset, we reduce false positives by an average of 63.9%, thus outperforming the individual detectors on 3D average precision (3DAP). Our approach is model-adaptive and demonstrates state-of-the-art competitive performance even when using generic camera detectors that were not trained specifically for this dataset.

Abstract:
Large Multimodal Models (LMMs) have become a pivotal research focus in deep learning, demonstrating remarkable capabilities in 3D scene understanding. However, current 3D LMMs employing thousands of spatial tokens for multimodal reasoning suffer from critical inefficiencies: excessive computational overhead and redundant information flows. Unlike 2D VLMs processing single images, 3D LMMs exhibit inherent architectural redundancy due to the heterogeneous mechanisms between spatial tokens and visual tokens. To address this challenge, we propose AdaToken-3D, an adaptive spatial token optimization framework that dynamically prunes redundant tokens through spatial contribution analysis. Our method automatically tailors pruning strategies to different 3D LMM architectures by quantifying token-level information flows via attention pattern mining. Extensive experiments on LLaVA-3D (a 7B parameter 3D-LMM) demonstrate that AdaToken-3D achieves 21% faster inference speed and 63% FLOPs reduction while maintaining original task accuracy. Beyond efficiency gains, this work systematically investigates redundancy patterns in multimodal spatial information flows through quantitative token interaction analysis. Our findings reveal that over 60% of spatial tokens contribute minimally (<5%) to the final predictions, establishing theoretical foundations for efficient 3D multimodal learning.
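
A simple way to picture attention-based token pruning is to rank spatial tokens by the attention mass they receive and keep only the top fraction. The sketch below does exactly that and nothing more; the scoring signal, the fixed `keep_ratio`, and the function name are assumptions, and AdaToken-3D's actual contribution analysis may differ.

```python
def prune_tokens(attention_received, keep_ratio=0.4):
    """Return the sorted indices of the tokens that receive the most attention.

    attention_received: one non-negative score per spatial token, e.g. the
        attention mass it receives averaged over heads and queries.
    """
    n_tokens = len(attention_received)
    n_keep = max(1, int(round(n_tokens * keep_ratio)))
    ranked = sorted(range(n_tokens), key=lambda i: attention_received[i],
                    reverse=True)
    # Preserve the original token order among the survivors.
    return sorted(ranked[:n_keep])


kept = prune_tokens([0.05, 0.40, 0.01, 0.30, 0.24], keep_ratio=0.4)
```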

Abstract:
In-hand manipulation is a crucial ability for reorienting and repositioning objects within grasps. The main challenges are not only the complexity of the computational models, but also the risks of grasp instability caused by active finger motions, such as rolling, sliding, breaking, and remaking contacts. This paper presents the development of the Roller Ring (RR), a modular robotic attachment with active surfaces that is wearable by both robot and human hands to manipulate without lifting a finger. By installing the angled RRs on hands, such that their spatial motions are not colinear, we derive a general differential motion model for manipulating objects. Our motion model shows that complete in-hand manipulation skill sets can be provided by as few as two RRs through non-holonomic object motions, while more RRs can enable enhanced manipulation dexterity with fewer motion constraints. Through extensive experiments, we test the RRs on both a robot hand and a human hand to evaluate their manipulation capabilities. We show that the RRs can be employed to manipulate arbitrary object shapes to provide dexterous in-hand manipulation.

Abstract:
Point clouds collected from real-world environments are often incomplete due to factors such as limited sensor resolution, single viewpoints, occlusions, and noise. These challenges make point cloud completion essential for various applications. A key difficulty in this task is predicting the overall shape and reconstructing missing regions from highly incomplete point clouds. To address this, we introduce CasPoinTr, a novel point cloud completion framework using cascaded networks and knowledge distillation. CasPoinTr decomposes the completion task into two synergistic stages: Shape Reconstruction, which generates auxiliary information, and Fused Completion, which leverages this information alongside knowledge distillation to generate the final output. Through knowledge distillation, a teacher model trained on denser point clouds transfers incomplete-complete associative knowledge to the student model, enhancing its ability to estimate the overall shape and predict missing regions. Together, the cascaded networks and knowledge distillation enhance the model’s ability to capture global shape context while refining local details, effectively bridging the gap between incomplete inputs and complete targets. Experiments on ShapeNet-55 under different difficulty settings demonstrate that CasPoinTr outperforms existing methods in shape recovery and detail preservation, highlighting the effectiveness of our cascaded structure and distillation strategy.

Abstract:
Because visual-inertial odometry (VIO) suffers from localization drift in the long run, we utilize drift-free Ultra-Wideband (UWB) measurements to eliminate accumulated errors in VIO. Existing UWB-VIO fusion methods are mostly constrained by the accuracy of prior UWB anchor positions. However, in large-scale localization scenarios, the precise locations of UWB anchors are difficult to obtain, and the offline calibration process is complex, significantly limiting flexibility. In this paper, we first design a lightweight initialization method based on a dual sliding window structure, which can rapidly obtain initial guesses for the UWB anchor coordinates. After that, we further propose a joint estimation system to refine the anchor coordinates while estimating the correction for VIO. The system combines filter-based and optimization-based methods and mainly consists of an initialization module and a nonlinear estimator module. The filter in the initialization module provides initial values and covariances for the optimization, and, in turn, the optimization results from the nonlinear estimator provide priors for the filter. Finally, the performance of our proposed approach is verified on both public datasets and real-world experiments. Our project, along with our dataset, has been open-sourced in the form of a ROS package and a ROS bag.

Abstract:
Effective robot navigation in unseen environments is a challenging task that requires precise control actions at high frequencies. Recent advances have framed it as an image-goal-conditioned control problem, where the robot generates navigation actions using frontal RGB images. Current state-of-the-art methods in this area use diffusion policies to generate these control actions. Despite their promising results, these models are computationally expensive and suffer from weak perception. To address these limitations, we present FlowNav, a novel approach that uses a combination of Conditional Flow Matching (CFM) and depth priors from off-the-shelf foundation models to learn action policies for robot navigation. FlowNav is significantly more accurate and faster at navigation and exploration than state-of-the-art methods. We validate our contributions using real robot experiments in multiple environments, demonstrating improved navigation reliability and accuracy. Code and trained models are publicly available.
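
To make the role of Conditional Flow Matching concrete, the sketch below builds one training pair using a straight-line path between a noise sample and a demonstrated action, and then integrates a velocity field with a few Euler steps at inference time. The linear path is one common CFM variant and is an assumption here, as are all function names; the conditioning on images and depth priors is omitted.

```python
def cfm_training_pair(noise, action, t):
    """Return the point x_t on the straight path and its target velocity."""
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    target_velocity = [a - n for n, a in zip(noise, action)]
    return x_t, target_velocity


def cfm_loss(predicted_velocity, target_velocity):
    """Mean squared error between predicted and target velocities."""
    errors = [(p - v) ** 2 for p, v in zip(predicted_velocity, target_velocity)]
    return sum(errors) / len(errors)


def integrate(velocity_fn, x0, steps=10):
    """Euler-integrate a velocity field from t = 0 to t = 1."""
    x = list(x0)
    dt = 1.0 / steps
    for i in range(steps):
        v = velocity_fn(x, i * dt)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


x_t, v_target = cfm_training_pair([0.0, 0.0], [1.0, 2.0], t=0.25)
# An oracle field that always returns the target velocity recovers the action.
sampled = integrate(lambda x, t: [1.0, 2.0], [0.0, 0.0], steps=10)
```

Sampling needs only a handful of integration steps along a nearly straight path, which is the usual argument for flow matching being cheaper at inference than a diffusion policy with many denoising steps.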

Abstract:
We present an iterative inverse reinforcement learning algorithm to infer optimal cost functions in continuous spaces. Based on a popular maximum entropy criterion, our approach iteratively finds a weight improvement step and proposes a method to find an appropriate step size that ensures learned cost function features remain similar to the demonstrated trajectory features. In contrast to similar approaches, our algorithm can individually tune the effectiveness of each observation for the partition function based on the current estimate of the cost function parameters, guiding the algorithm towards better estimates in the following iterations. In addition, it does not need a large sample set, enabling faster learning. We generate sample trajectories by solving an optimal control problem instead of random sampling, leading to more informative trajectories. The performance of our method is compared to two state-of-the-art algorithms to demonstrate its benefits in several simulated environments.

Abstract:
Offline reinforcement learning (RL) provides a powerful framework for training robotic agents using pre-collected, suboptimal datasets, eliminating the need for costly, time-consuming, and potentially hazardous online interactions. This is particularly useful in safety-critical real-world applications, where online data collection is expensive and impractical. However, existing offline RL algorithms typically require reward-labeled data, which introduces an additional bottleneck: reward function design is itself costly, labor-intensive, and requires significant domain expertise. In this paper, we introduce PLARE, a novel approach that leverages large vision-language models (VLMs) to provide guidance signals for agent training. Instead of relying on manually designed reward functions, PLARE queries a VLM for preference labels on pairs of visual trajectory segments based on a language task description. The policy is then trained directly from these preference labels using a supervised contrastive preference learning objective, bypassing the need to learn explicit reward models. Through extensive experiments on robotic manipulation tasks from the MetaWorld benchmark, PLARE achieves performance on par with or surpassing existing state-of-the-art VLM-based reward generation methods. Furthermore, we demonstrate the effectiveness of PLARE in real-world manipulation tasks with a physical robot, further validating its practical applicability.
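
Training a policy directly from preference labels, without an explicit reward model, can be sketched as a logistic comparison of how likely the policy finds the preferred and the rejected segment. The form below is a generic contrastive preference objective written under assumptions: the temperature `alpha`, the discount `gamma`, and the function names are illustrative, and the exact loss used by PLARE may differ.

```python
import math


def segment_score(log_probs, alpha=0.1, gamma=1.0):
    """Discounted, scaled sum of the policy's action log-probabilities."""
    return sum(alpha * (gamma ** t) * lp for t, lp in enumerate(log_probs))


def preference_loss(preferred_log_probs, rejected_log_probs, alpha=0.1):
    """Negative log-probability that the preferred segment wins the comparison."""
    margin = (segment_score(preferred_log_probs, alpha)
              - segment_score(rejected_log_probs, alpha))
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin > 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


# Log-probabilities the policy assigns to the actions in each segment.
agrees = preference_loss([-0.1, -0.1], [-2.0, -2.0])
disagrees = preference_loss([-2.0, -2.0], [-0.1, -0.1])
```

The loss is small when the policy already assigns higher likelihood to the actions in the segment the VLM preferred, so minimizing it pushes the policy towards the preferred behavior.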

Abstract:
Most autonomous cars rely on the availability of high-definition (HD) maps. Current research aims to address this constraint by directly predicting HD map elements from onboard sensors and reasoning about the relationships between the predicted map and traffic elements. Despite recent advancements, the coherent online construction of HD maps remains a challenging endeavor, as it necessitates modeling the high complexity of road topologies in a unified and consistent manner. To address this challenge, we propose a coherent approach to predict lane segments and their corresponding topology, as well as road boundaries, all by leveraging prior map information represented by commonly available standard-definition (SD) maps. We propose a network architecture that leverages hybrid lane segment encodings comprising prior information and denoising techniques to enhance training stability and performance. Furthermore, we leverage past frames for temporal consistency. Our experimental evaluation demonstrates that our approach outperforms previous methods by a large margin, highlighting the benefits of our modeling scheme.

Abstract:
Vision and Language Navigation (VLN) requires an agent to navigate through environments following natural language instructions. However, existing methods often struggle with effectively integrating visual observations and instruction details during navigation, leading to suboptimal path planning and limited success rates. In this paper, we propose OIKG (Observation-graph Interaction and Key-detail Guidance), a novel framework that addresses these limitations through two key components: (1) an observation-graph interaction module that decouples angular and visual information while strengthening edge representations in the navigation space, and (2) a key-detail guidance module that dynamically extracts and utilizes fine-grained location and object information from instructions. By enabling more precise cross-modal alignment and dynamic instruction interpretation, our approach significantly improves the agent’s ability to follow complex navigation instructions. Extensive experiments on the R2R and RxR datasets demonstrate that OIKG achieves state-of-the-art performance across multiple evaluation metrics, validating the effectiveness of our method in enhancing navigation precision through better observation-instruction alignment.

Abstract:
In assembly tasks, multiple operations, such as positioning, snap-fitting, and screw fastening, are often required for a single workpiece. The multiple operations add complexity to the planning process. To address this challenge, we propose an assembly sequence planning method that considers the combination of multiple operations associated with each workpiece. We define the sequence of these operations as a "workflow" and search for an optimal assembly sequence while respecting the workflow constraints of the workpieces. Beyond handling multi-operation constraints, our method optimizes the robot’s motion costs by assigning weights to the search tree and minimizing these costs accordingly. To evaluate the effectiveness of the proposed approach, we compare assembly planning results for multi-operation tasks with and without workflow decomposition. Additionally, we analyze the influence of motion cost minimization on planning performance and computational efficiency. Experimental results verified the effectiveness of the proposed method in improving assembly planning efficiency.

Abstract:
Diffusion models exhibit impressive scalability in robotic task learning, yet they struggle to adapt to novel, highly dynamic environments. This limitation primarily stems from their constrained replanning ability: they either operate at a low frequency due to a time-consuming iterative sampling process, or are unable to adapt to unforeseen feedback in case of rapid replanning. To address these challenges, we propose RA-DP, a novel diffusion policy framework with training-free high-frequency replanning ability that solves the above limitations by adapting to unforeseen dynamic environments. Specifically, our method integrates guidance signals, which are often easily obtained in the new environment during the diffusion sampling process, and utilizes a novel action queue mechanism to generate replanned actions at every denoising step without retraining, thus forming a complete training-free framework for robot motion adaptation in unseen environments. We conduct extensive evaluations in both common simulation benchmarks and real-world environments. Our results indicate that RA-DP outperforms the state-of-the-art diffusion-based methods in terms of replanning frequency and success rate. Finally, we show that our framework is theoretically compatible with any training-free guidance signal, hence increasing its applicability to a wide range of robotics tasks.

Abstract:
Embodied artificial intelligence requires a wide variety of large-scale simulated environments for development. Previous scene reconstruction approaches based on multiview images can produce high-fidelity 3D scenes but lack diversity. In contrast, existing prompt-based scene generation approaches can produce diverse scenes but lack fine-grained controls. To bridge these two fields, the scene graph provides the key relationships within a scene, while offering flexible controls. However, 3D scene generation from scene graphs is challenging and under-explored. In this paper, we propose a scene graph-based 3D indoor scene generation method for efficient simulated environment creation that maintains both high diversity and fine-grained control. Specifically, we first introduce an interaction-aware scene graph to merge object nodes with hierarchical interaction relationships, which alleviates levitation and interference issues during scene generation. Then, we employ a large language model (LLM) for instruction-driven 3D layout generation with carefully designed prompts. Finally, a 3D large generation model is utilized to generate the content for each node of the interaction-aware scene graph, which is then transformed based on the corresponding bounding box in the 3D layout. The experiments demonstrate that the proposed method achieves state-of-the-art performance on 3D indoor scene generation. Additionally, the proposed method exhibits fine-grained controls at the object level, while providing a high diversity of layouts, geometry, and textures.

Abstract:
Accurate and efficient 3D scene understanding from multi-view images remains a fundamental challenge in autonomous driving. Existing methods often struggle with high-dimensional features, leading to excessive computational costs and memory usage. In this paper, we present SAFormer, a novel transformer-based framework for efficient spatially adaptive occupancy prediction. SAFormer incorporates two key techniques to reduce resource consumption: Octree-based Multi-resolution Feature (OMRF) Learning and Spatial-Adaptive Progressive Query (SAPQ). First, OMRF introduces an Octree-based hierarchical structure to compress multi-resolution 3D feature volumes. Second, SAPQ facilitates efficient information flow across different scales while effectively addressing scene sparsity. It employs a region-aware query mechanism that intelligently allocates computational resources, processing safety-critical regions at high resolution while handling background elements at lower resolutions. Experiments on the nuScenes dataset demonstrate that our method achieves state-of-the-art performance while significantly reducing inference latency (up to 3×) and memory cost (up to 2.9×). Additional experiments on SSCBench-KITTI-360 further validate our approach’s generalizability. Our approach excels in managing scene sparsity and recognizing small, safety-critical objects, highlighting its potential for practical applications in autonomous driving.

Abstract:
Image feature matching is a fundamental task in computer vision. Existing local feature matching methods can establish robust correspondences between image pairs. However, these methods heavily rely on dense local image features, making them susceptible to significant perspective differences, characterized by rotation and scale changes. To alleviate this limitation, we introduce a novel oriented Overlapping Region Alignment method, named ORA-NET, which presents a concise and efficient approach to enhance the performance of image feature matching methods. We introduce the Multidirectional Cross-scale Feature Aggregation module to aggregate rotation-equivariant features across multiple scales and model long-range dependencies. Additionally, the Oriented Overlap Alignment module estimates scale and rotation differences within overlapping regions using a coarse-to-fine rotation correction approach. Importantly, our method serves as a plug-and-play module that can be seamlessly integrated into other correspondence matching pipelines. Experimental results demonstrate that ORA-NET significantly enhances the matching performance of existing local feature matching methods, particularly in scenarios involving substantial perspective differences.

Abstract:
Recently, 3D Gaussian Splatting (3DGS) has garnered significant attention for its remarkable capacity to efficiently synthesize novel views with high fidelity. Nevertheless, 3DGS encounters challenges in accurately representing the geometry of real-world scenes. To address this issue, previous methods commonly utilize a depth-normal consistency term on 2D images to regulate the geometry of 3D Gaussians. However, these methods degrade in performance when dealing with low-texture surfaces or limited training views. In contrast, we present GPGS, a novel approach that directly regulates Gaussians in 3D space using Geometric Priors (GP). Given posed LiDAR scans and images, we organize the point clouds into a hierarchical voxel map. Each voxel contains occupancy information and explicitly reveals the internal planar or non-planar structure. We propose a novel divide-and-conquer strategy to separately regulate Gaussians in planar and non-planar voxels. For planar voxels, we design positional and rotational constraints to align Gaussians with the estimated plane. Considering the noisy ranging measurements of complex structures, we use depth-normal consistency to regularize Gaussians in non-planar voxels. Additionally, an occupancy-aware density control strategy is introduced to confine the densification process within occupied voxels, thus reducing artifacts. Extensive experiments on real-world datasets show that our proposed approach outperforms existing state-of-the-art methods in both geometric accuracy and visual quality.

Abstract:
While most current RGB-D-based category-level object pose estimation methods achieve strong performance, they face significant challenges in scenes lacking depth information. In this paper, we propose a novel category-level object pose estimation approach that relies solely on RGB images. This method enables accurate pose estimation in real-world scenarios without the need for depth data. Specifically, we design a transformer-based neural network for category-level object pose estimation, where the transformer is employed to predict and fuse the geometric features of the target object. To ensure that these predicted geometric features faithfully capture the object’s geometry, we introduce a geometric feature-guided algorithm, which enhances the network’s ability to effectively represent the object’s geometric information. Finally, we utilize the RANSAC-PnP algorithm to compute the object’s pose, addressing the challenges associated with variable object scales in pose estimation. Experimental results on benchmark datasets demonstrate that our approach is not only highly efficient but also achieves superior accuracy compared to previous RGB-based methods. These promising results offer a new perspective for advancing category-level object pose estimation using RGB images.

Abstract:
Localizing predefined 3D keypoints in a 2D image is an effective way to establish 3D-2D correspondences for instance-level 6DoF object pose estimation. However, unreliable localization results of invisible keypoints degrade the quality of correspondences. In this paper, we address this issue by localizing the important keypoints in terms of visibility. Since keypoint visibility information is currently missing in the dataset collection process, we propose an efficient way to generate binary visibility labels from available object-level annotations, for keypoints of both asymmetric objects and symmetric objects. We further derive real-valued visibility-aware importance from binary labels based on the PageRank algorithm. Taking advantage of the flexibility of our visibility-aware importance, we construct VAPO (Visibility-Aware POse estimator) by integrating the visibility-aware importance with a state-of-the-art pose estimation algorithm, along with additional positional encoding. VAPO can work in both CAD-based and CAD-free settings. Extensive experiments are conducted on popular pose estimation benchmarks including Linemod, Linemod-Occlusion, and YCB-V, demonstrating that VAPO clearly achieves state-of-the-art performance. Project page: https://github.com/RuyiLian/VAPO.

Abstract:
Image-goal navigation is a crucial yet challenging task that requires an agent to navigate to a goal location specified by an image. Modular methods decompose the problem into distinct subtasks and often involve explicit map construction, which can struggle in complex, unstructured environments. In contrast, end-to-end deep reinforcement learning (DRL)-based methods directly output actions from visual input, with recent improvements focusing on enhancing embedding fusion between the current observation and the goal image. However, both approaches fail to fully leverage the rich temporal relationships present in the agent’s visual-action history. In this paper, we address this limitation by employing a self-supervised transformer to predict masked portions of the agent’s visual-action embeddings. To promote spatio-temporal reasoning, a dual-attention shared transformer is utilized for both masked representation learning and policy generation. Our method demonstrates superior performance and generalization ability compared to 12 existing baselines across the Gibson, MP3D, and HM3D datasets. Code and trained models are available at https://github.com/hujch23/DaMVA.

Abstract:
Endoscopic submucosal dissection (ESD) is a technically difficult, minimally invasive, organ-preserving resection technique that yields improved clinical outcomes when compared to current conventional procedures but requires experienced surgeons and specialized skills. Difficulty in applying tension during ESD is recognized as the single greatest barrier to wide adoption of the procedure, and a solution to this problem is sure to see wide-reaching and immediate adoption. This work presents a compact wireless retraction device that is magnetically actuated and has a high force output with adaptable traction control. The retraction device is 25 mm long and 4 mm in diameter. The device has two modes of operation: it first spools to collect string slack, then transitions, via an external permanent magnet, to internal string twisting to generate a large retraction force. In slack collection, the device can contract 11 cm in length at a speed of 6.88 mm/s, then clutch to force mode to reach a peak retraction force of 1.33 N, leveraging the micro-transmission twisted string actuation. The wireless device is designed for endoscopic deployment to any surgical environment or lesion within the gastrointestinal tract.

Abstract:
Self-supervised single-view depth estimation, trained on video sequences, faces significant challenges when dynamic objects are present in the training data, as they violate the basic multi-view geometry assumptions used to compute photometric losses. We propose a novel approach that leverages the relationship between the depth of moving objects and their ground contact points. By iteratively propagating ground features to moving targets in perceptual layers, we recalibrate the depth of dynamic entities while preserving details. Our method maintains the end-to-end training paradigm without additional networks or complex training procedures. Our experiments demonstrate that our method achieves state-of-the-art performance when estimating depth for dynamic objects and attains superior generalization compared to existing approaches. The relevant experimental code can be accessed at: https://github.com/LiHuanLi/GroundMono

Abstract:
Offline reinforcement learning (RL) presents distinct challenges as it relies solely on observational data. A central concern in this context is ensuring the safety of the learned policy by quantifying uncertainties associated with various actions and environmental stochasticity. Traditional approaches primarily emphasize mitigating epistemic uncertainty by learning risk-averse policies, often overlooking environmental stochasticity. In this study, we propose an uncertainty-aware distributional offline RL method to simultaneously address both epistemic uncertainty and environmental stochasticity. We propose a model-free offline RL algorithm capable of learning risk-averse policies and characterizing the entire distribution of discounted cumulative rewards, as opposed to merely maximizing the expected value of accumulated discounted returns. Our method is rigorously evaluated through comprehensive experiments in both risk-sensitive and risk-neutral benchmarks, demonstrating its superior performance.

Abstract:
Passivity-based methods are widely used in teleoperation to guarantee stability, especially for the widely used Position-Force measured (PFm) architecture. Among them, the Time Domain Passivity Approach (TDPA) achieves stability through passivation, generating a dissipation action that degrades the reference signals and thus the performance of the system. Whereas passivation is necessary to ensure stability during contact, in free motion it acts merely as a disturbance, without having a real impact on stability. In fact, during free motion the force feedback is zero, i.e., the teleoperation loop is not closed and thus a stabilization action is not needed. Therefore, this paper proposes a formal demonstration that passivation is not needed during free motion. Accordingly, the paper introduces a new formulation of the TDPA to take into account the free motion condition. A one-degree-of-freedom case study is then proposed to provide a simple example to instantiate the formalism and to show the advantages of the proposed method, which achieves an almost total cancellation of the drift during free motion. Finally, the paper discusses the limitations of the method in real-case scenarios, in particular how the inertia of tools mounted after the force sensors can affect the measurements and the perception of the system.

Abstract:
While autonomous racing performance in Time-Trial scenarios has seen significant progress and development, autonomous wheel-to-wheel racing and overtaking are still severely limited. These limitations are particularly apparent in real-life driving scenarios where state-of-the-art algorithms struggle to safely or reliably complete overtaking manoeuvres. This is important, as reliable navigation around other vehicles is vital for safe autonomous wheel-to-wheel racing. The F1Tenth Competition provides a useful opportunity for developing wheel-to-wheel racing algorithms on a standardised physical platform. The competition format makes it possible to evaluate overtaking and wheel-to-wheel racing algorithms against the state-of-the-art. This research presents a novel racing and overtaking agent capable of learning to reliably navigate a track and overtake opponents in both simulation and reality. The agent was deployed on an F1Tenth vehicle and competed against opponents running varying competitive algorithms in the real world. The results demonstrate that the agent’s training against opponents enables deliberate overtaking behaviours with an overtaking rate of 87% compared to 56% for an agent trained just to race.

Abstract:
In this paper, we introduce a novel image-goal navigation approach, named RFSG. Our focus lies in leveraging the fine-grained connections between goals, observations, and the environment within limited image data, all the while keeping the navigation architecture simple and lightweight. To this end, we propose the spatial-channel attention mechanism, enabling the network to learn the importance of multi-dimensional features to fuse the goal and observation features. In addition, a self-distillation mechanism is incorporated to further enhance the feature representation capabilities. Given that the navigation task needs surrounding environmental information for more efficient navigation, we propose an image scene graph to establish feature associations at both the image and object levels, effectively encoding the surrounding scene information. Cross-scene performance validation was conducted on the Gibson and HM3D datasets, and the proposed method achieved state-of-the-art results among mainstream methods, with a speed of up to 53.5 frames per second on an RTX3080. This contributes to the realization of end-to-end image-goal navigation in real-world scenarios. The implementation and model of our method have been released at: https://github.com/nubot-nudt/RFSG.

Abstract:
The Rust programming language is an attractive choice for robotics and related fields, offering highly efficient and memory-safe code. However, a key limitation preventing its broader adoption in these domains is the lack of high-quality, well-supported Automatic Differentiation (AD)—a fundamental technique that enables convenient derivative computation by systematically accumulating data during function evaluation. In this work, we introduce ad-trait, a new Rust-based AD library. Our implementation overloads Rust’s standard floating-point type with a flexible trait that can efficiently accumulate necessary information for derivative computation. The library supports both forward-mode and reverse-mode automatic differentiation, making it the first operator-overloading AD implementation in Rust to offer both options. Additionally, ad-trait leverages Rust’s performance-oriented features, such as Single Instruction, Multiple Data acceleration in forward-mode AD, to enhance efficiency. Through benchmarking experiments, we show that our library is among the fastest AD implementations across several programming languages for computing derivatives. Moreover, it is already integrated into a Rust-based robotics library, where we showcase its ability to facilitate fast optimization procedures. We conclude with a discussion of the limitations and broader implications of our work.
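
To make the operator-overloading idea concrete, the following is a minimal sketch of forward-mode AD using dual numbers. It is written in Python purely for illustration and does not reflect ad-trait's Rust API; the `Dual` class and the `derivative` helper are hypothetical names. Each value carries a tangent alongside it, and every overloaded operation updates both according to the chain rule.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class Dual:
    """A value paired with its derivative (tangent) with respect to the input."""
    value: float
    tangent: float = 0.0

    def __add__(self, other):
        other = _lift(other)
        return Dual(self.value + other.value, self.tangent + other.tangent)

    __radd__ = __add__

    def __mul__(self, other):
        other = _lift(other)
        # Product rule: (uv)' = u'v + uv'
        return Dual(self.value * other.value,
                    self.tangent * other.value + self.value * other.tangent)

    __rmul__ = __mul__


def _lift(x):
    """Treat plain numbers as constants, i.e. duals with zero tangent."""
    return x if isinstance(x, Dual) else Dual(float(x))


def sin(x):
    """Sine with chain rule: d/dx sin(u) = cos(u) * u'."""
    x = _lift(x)
    return Dual(math.sin(x.value), math.cos(x.value) * x.tangent)


def derivative(f, x):
    """Evaluate f at x with a unit seed tangent and read off df/dx."""
    return f(Dual(x, 1.0)).tangent
```

For example, `derivative(lambda x: x * x * 3 + sin(x), 2.0)` evaluates the function once and returns its derivative at 2.0 as a by-product. Reverse mode, by contrast, records the operations and propagates derivatives backward from the output, which is typically preferable for functions with many inputs and few outputs.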

Abstract:
Dual-arm robotic grasping is crucial for handling large objects that require stable and coordinated manipulation. While single-arm grasping has been extensively studied, datasets tailored for dual-arm settings remain scarce. We introduce a large-scale dataset of 16 million dual-arm grasps, evaluated under improved force-closure constraints. Additionally, we develop a benchmark dataset containing 300 objects with approximately 30,000 grasps, evaluated in a physics simulation environment, providing a better grasp quality assessment for dual-arm grasp synthesis methods. Finally, we demonstrate the effectiveness of our dataset by training a Dual-Arm Grasp Classifier network that outperforms the state-of-the-art methods by 15%, achieving higher grasp success rates and improved generalization across objects. Project page: https://dg16m.github.io/DG-16M/

Abstract:
This paper introduces Context Informed Trees (CIT), a sampling-based motion planning algorithm that enhances exploration efficiency by biasing sampling based on uncertainty estimation from local samples and connectivity information obtained during the search process. CIT is based on Flexible Informed Trees (FIT) and incorporates three key components: region-based sampling, uncertainty-driven weighting, and connection-greedy prioritization (CGP). It generates regions from sampled states based on local obstacle proximity, assigning weights to these regions using probability uncertainty estimation via kernel density estimation (KDE) classification. To further refine the sampling focus, CGP prioritizes regions that exhibit strong connectivity in previous searches, ensuring that exploration is directed toward unknown and critical areas that have a higher likelihood of contributing to feasible and efficient paths. The sampling process is then guided by a mixture of Gaussian distributions centered on weighted regions, where the weighting biases sampling toward more critical regions, thereby improving search efficiency and accelerating convergence. Benchmark evaluations demonstrate that CIT improves efficiency by reducing reliance on random sampling, which often leads to slower solution discovery and higher path costs. With biased sampling, CIT maintains strong performance in solving complex motion planning problems in $\mathbb{R}^4$ to $\mathbb{R}^{16}$ and has been demonstrated on a real-world manipulation task. A video showcasing our method and experimental results is available at: https://youtu.be/SG2cy9WmjD0.
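
The biased sampling step described above can be sketched as drawing from a mixture of Gaussians whose components are centered on regions and selected in proportion to their weights. The snippet below is only an illustration under simplifying assumptions: the region centers and weights are given directly rather than derived from KDE classification or CGP, the Gaussians are isotropic with a shared standard deviation, and the function name is hypothetical.

```python
import random


def sample_from_weighted_regions(centers, weights, sigma, rng):
    """Draw one state from a mixture of isotropic Gaussians.

    centers: list of region centers, each a list of coordinates
    weights: non-negative region weights (need not sum to one)
    sigma:   shared standard deviation of every component (an assumption)
    rng:     a random.Random instance, passed in for reproducibility
    """
    # Pick a region with probability proportional to its weight.
    center = rng.choices(centers, weights=weights, k=1)[0]
    # Perturb each coordinate of the chosen center independently.
    return [rng.gauss(c, sigma) for c in center]
```

With weights of 0.9 and 0.1 on two regions, roughly nine out of ten samples land near the first region, so planner effort concentrates where the weighting indicates it is most useful while the second region is still visited occasionally.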

Abstract:
In this paper, we present a novel probabilistic safe control framework for human-robot interaction that combines control barrier functions (CBFs) with conformal risk control to provide formal safety guarantees while considering complex human behavior. The approach uses conformal risk control to quantify and control the prediction errors in CBF safety values and establishes formal guarantees on the probability of constraint satisfaction during interaction. We introduce an algorithm that dynamically adjusts the safety margins produced by conformal risk control based on the current interaction context. Through experiments on human-robot navigation scenarios, we demonstrate that our approach significantly reduces collision rates and safety violations as compared to baseline methods while maintaining high success rates in goal-reaching tasks and efficient control. The code, simulations, and other supplementary material can be found on the project website: https://jakeagonzales.github.io/crc-cbf-website/.

Abstract:
This paper studies the problem of using a robot arm to manipulate a uniformly rotating chain with its bottom end fixed. Existing studies have investigated ideal rotational shapes for practical applications, yet they do not discuss how these shapes can be consistently achieved through manipulation planning. Our work presents a manipulation strategy for stable and consistent shape transitions. We find that the configuration space of such a chain is homeomorphic to a three-dimensional cube. Using this property, we suggest a strategy to manipulate the chain into different configurations, specifically from one rotation mode to another, while taking stability and feasibility into consideration. We demonstrate the effectiveness of our strategy in physical experiments by successfully transitioning from rest to the first two rotation modes. The concepts explored in our work have critical applications in ensuring safety and efficiency of drill string and yarn spinning operations.

Abstract:
Autonomous motion planning is critical for efficient and safe underwater manipulation in dynamic marine environments. Current motion planning methods often fail to effectively utilize prior motion experiences and adapt to real-time uncertainties inherent in underwater settings. In this paper, we introduce an Adaptive Heuristic Motion Planner framework that integrates a Heuristic Motion Space (HMS) with Bayesian Networks to enhance motion planning for autonomous underwater manipulation. Our approach employs the Probabilistic Roadmap (PRM) algorithm within HMS to optimize paths by minimizing a composite cost function that accounts for distance, uncertainty, energy consumption, and execution time. By leveraging HMS, our framework significantly reduces the search space, thereby boosting computational performance and enabling real-time planning capabilities. Bayesian Networks are utilized to dynamically update uncertainty estimates based on real-time sensor data and environmental conditions, thereby refining the joint probability of path success. Through extensive simulations and real-world test scenarios, we showcase the advantages of our method in terms of enhanced performance and robustness. This probabilistic approach significantly advances the capability of autonomous underwater robots, ensuring optimized motion planning in the face of dynamic marine challenges.
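
A composite cost of this kind can be illustrated with a shortest-path query over a small roadmap. The sketch below assumes the cost is a weighted sum of the four terms named above and uses Dijkstra's algorithm on a hand-built graph; the abstract does not specify the actual functional form, so the weighting scheme, the edge attributes, and the function names are all hypothetical.

```python
import heapq


def composite_cost(edge, weights):
    """Weighted sum of the four per-edge terms (assumed linear combination)."""
    return (weights["distance"] * edge["distance"]
            + weights["uncertainty"] * edge["uncertainty"]
            + weights["energy"] * edge["energy"]
            + weights["time"] * edge["time"])


def cheapest_path(graph, start, goal, weights):
    """Dijkstra search over a roadmap given as {node: {neighbor: edge_attributes}}."""
    frontier = [(0.0, start, [start])]
    best = {start: 0.0}
    while frontier:
        cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return cost, path
        if cost > best.get(node, float("inf")):
            continue  # stale queue entry; a cheaper route was already found
        for neighbor, edge in graph.get(node, {}).items():
            new_cost = cost + composite_cost(edge, weights)
            if new_cost < best.get(neighbor, float("inf")):
                best[neighbor] = new_cost
                heapq.heappush(frontier, (new_cost, neighbor, path + [neighbor]))
    return float("inf"), []
```

Raising the uncertainty weight makes the search prefer longer but more reliable edges, which is the kind of trade-off a composite cost is meant to expose. In a setting like the one described, the per-edge uncertainty values would be refreshed from sensor data before each query.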

Abstract:
Generating motions for robots interacting with objects of various shapes is a complex challenge, further complicated by the robot’s geometry and multiple desired behaviors. While current robot programming tools (such as inverse kinematics, collision avoidance, and manipulation planning) often treat these problems as constrained optimization, many existing solvers focus on specific problem domains or do not exploit geometric constraints effectively. We propose an efficient first-order method, Augmented Lagrangian Spectral Projected Gradient Descent (ALSPG), which leverages geometric projections via Euclidean projections, Minkowski sums, and basis functions. We show that by using geometric constraints rather than full constraints and gradients, ALSPG significantly improves real-time performance. Compared to second-order methods like iLQR, ALSPG remains competitive in the unconstrained case. We validate our method through toy examples and extensive simulations, and demonstrate its effectiveness on a 7-axis Franka robot, a 6-axis P-Rob robot and a 1:10 scale car in real-world experiments. Source codes, experimental data and videos are available on the project webpage: https://sites.google.com/view/alspg-oc

Abstract:
Underwater Object Detection (UOD) techniques are critical for Autonomous Underwater Vehicles (AUVs), which must operate in harsh underwater environments characterized by low visibility while satisfying the lightweight and real-time constraints required for vehicle-mounted systems. Current methods typically rely on underwater image enhancement combined with object detection to adapt to underwater conditions. However, these approaches mainly focus on the spatial domain, often overlooking the frequency-domain characteristics of the underwater environment. This oversight limits the removal of noise factors, such as scattering, blurring, distortion, and uneven illumination, and diminishes the focus on object edges and textures. Additionally, the increased parameter size and higher computational cost render them less suitable for real-time detection. To address these issues, we propose the Wavelet-Based Frequency Decomposition and Aggregation Network (WFDA), which leverages the Wavelet Transform (WT) to decompose features into high- and low-frequency components for effective feature modeling and fusion-based downsampling. Specifically, the Wavelet-Based Feature Decomposition Modeling (WDM) module utilizes multi-level wavelet decomposition to hierarchically model features across different frequency bands, while the Wavelet-Based Feature Aggregation Downsampling (WAD) module refines and extracts core features through single-level wavelet decomposition combined with channel aggregation. Evaluations on four public datasets demonstrate that WFDA achieves state-of-the-art (SOTA) performance and efficiency, making it well-suited for real-time, high-accuracy detection on robotic platforms. Code is available at https://github.com/Mariiiiooooo/WFDA.
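
As a small illustration of how a wavelet transform splits a signal into low- and high-frequency components while halving its length, the sketch below implements a single-level 1D Haar decomposition and its inverse. This is not the WFDA implementation: the choice of the Haar wavelet, the restriction to one dimension, and the function names are assumptions made to keep the example short.

```python
import math


def haar_decompose(signal):
    """Single-level orthonormal Haar transform of an even-length sequence.

    Returns (low, high): pairwise scaled sums carry the coarse content,
    pairwise scaled differences carry edges and fine detail.
    """
    if len(signal) % 2 != 0:
        raise ValueError("signal length must be even")
    s = math.sqrt(2.0)
    low = [(signal[i] + signal[i + 1]) / s for i in range(0, len(signal), 2)]
    high = [(signal[i] - signal[i + 1]) / s for i in range(0, len(signal), 2)]
    return low, high


def haar_reconstruct(low, high):
    """Invert haar_decompose, recovering the original samples."""
    s = math.sqrt(2.0)
    out = []
    for a, d in zip(low, high):
        out.extend([(a + d) / s, (a - d) / s])
    return out
```

Each output band has half the input length, so the decomposition acts as a downsampling step that keeps the detail in a separate band rather than discarding it. Applying `haar_decompose` again to the low band gives a multi-level decomposition.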

Abstract:
Swarming microrobots offer great promise for targeted delivery in biofluidic environments. However, current approaches insufficiently utilize the operator’s perceptual awareness and interactive decision-making capabilities. This work proposes a real-time navigation and control strategy with haptic feedback for delivering a magnetic microswarm, in which the haptic feedback system conveys microswarm-environment interaction to the operator. The real-time tracking system continuously monitors the position and shape of the microswarm in the remote environment, transmitting data to the control system for decision-making. This integration can achieve real-time perception and feedback of the microswarm’s state and motion process. Moreover, the strategy successfully demonstrates navigation and shape-adaptive regulation of the microswarm under static, downstream and three-dimensional (3D) upstream flow conditions. The experimental results show that the haptic feedback enables real-time trajectory and velocity adjustments during navigation, improving control robustness and delivery accuracy. Our work extends haptic feedback-enabled microswarm control to dynamic conditions, providing an adaptive swarm control strategy in complex biomedical environments.

Abstract:
The effective fusion of infrared and visible images can enhance environment perception during robot rescue missions by combining complementary information from both sensors. However, most existing fusion methods are developed for images captured under normal conditions, which limits their performance in real-world rescue scenarios where images often suffer from diverse degradations such as haze, low light, noise, and low contrast. To address this challenge, we propose MT-Fusion, a novel deep learning workflow based on multi-task learning (MTL) that targets multiple specific degradation scenarios. MT-Fusion incorporates specific encoder modules for processing various degraded images, a degradation attention mechanism for fusion, and a shared decoder for image reconstruction. Extensive experiments demonstrate that our scenario-guided image fusion strategy offers clear advantages for robot perception in both image fusion performance and degradation handling. For the inference stage, we propose a selector that automatically categorizes the degradation of the input and activates the corresponding encoder. MT-Fusion also provides a more practical solution for enhancing robot rescue operations in challenging environments.

Abstract:
Latency and computational cost often limit the use of Nonlinear Model Predictive Control (NMPC) in real-time robotics. To address this limitation, our work investigates FPGA-implemented Neural Controllers (NC) trained through supervised learning, mimicking NMPC. We show that inexpensive embedded FPGA hardware is sufficient to implement these neural controllers for high-frequency control of robotic systems. We demonstrate kilohertz control rates for a cartpole and offload control to the FPGA hardware on the F1TENTH race car. The FPGA NC outperforms NMPC on the cartpole, due to the faster control rate afforded by faster NC inference. The code and hardware implementation for this paper are available at https://github.com/SensorsINI/Neural-Control-Tools.

Abstract:
In the domain of Simultaneous Localization and Mapping (SLAM), loop closure is a linchpin for achieving accurate and consistent 3D environment mapping. However, the process is hampered by abrupt lighting changes and motion blur. These elements introduce uncertainties and inaccuracies in the data captured by sensors, severely undermining the system’s robustness. To address this critical challenge, we present a novel SDF-guided keyframe selection algorithm tailored for loop closure. Our approach capitalizes on the geometric insights provided by the Signed Distance Function (SDF) to meticulously choose keyframes, effectively mitigating the impact of noisy data. By doing so, we enhance the reliability of loop closure, refine the accuracy of 3D map reconstructions, and fortify the overall stability of the system. Our algorithm’s efficacy is substantiated through comprehensive experiments on datasets such as Replica, ScanNet, and TUM RGB-D. Notably, it can be easily integrated as a plug-and-play module into diverse existing methods, enhancing their performance across different scenarios. Real-world trials using a hand-held LeTMC-520 camera for indoor scene reconstruction further validate its practicality and effectiveness.

Abstract:
Variable impedance control is a control strategy widely used in physical human-robot collaboration (pHRC) and physical human-robot interaction (pHRI). Variable stiffness and damping parameters improve adaptability to changing environments and enhance safety in human-robot interaction. However, these adaptive parameters can compromise the stability of the system without proper management, particularly in dynamic environments. To address this, we propose a real-time parameter prediction method for variable impedance control using model predictive control (MPC) with a Control Lyapunov Function (CLF). Unlike the method that sets the terminal constraint as the equilibrium position, the proposed method guarantees system stability even when parameters change or external disturbances occur, ensuring safe and adaptive robot behavior. Moreover, the infeasibility issue is resolved by applying the CLF instead of relying on the equilibrium position. Furthermore, because stability is considered throughout the prediction horizon, the stability of the system is strictly guaranteed. The proposed method was validated through comparative experiments with the method that sets the terminal constraint as the equilibrium position, in both simulations and real-world environments using the Franka Emika Panda robot. Through these experiments, the proposed controller demonstrated a significant reduction in parameter computation time, achieving approximately 97.13% and 96.20% faster computation in simulation tests compared to the conventional method, while consistently ensuring stability under various disturbances including human interaction, tool vibration, and contact loss scenarios.

Abstract:
Humanoid robots are increasingly being developed for seamless interaction with humans in diverse domains, yet generating expressive and physically-feasible motions remains a core challenge. We propose a robust and automated pipeline for motion retargeting that enables the generation of natural, stable, and highly expressive motions for a wide variety of humanoid robots using different motion data sources, including noisy pose estimations. To ensure robustness, our approach unifies motions from different kinematic structures into a common canonical rig, systematically refines the motion trajectory to address infeasible poses, enforces foot-contact constraints, and enhances stability. The retargeted motion is then refined to closely follow the source motion while respecting each robot’s physical limits. Through extensive experiments on 12 simulated robots and validation on three real robots, we show that our methodology reliably produces expressive upper-body movements with consistent foot contact. This work represents an important step towards automating robust and expressive motion generation for humanoid robots, enabling deployment in various real-world scenarios.

Abstract:
Accurate relative state observation of Unmanned Underwater Vehicles (UUVs) for tracking uncooperative targets remains a significant challenge due to the absence of GPS, complex underwater dynamics, and sensor limitations. Existing localization approaches rely on either global positioning infrastructure or multi-UUV collaboration, both of which are impractical for a single UUV operating in large or unknown environments. To address this, we propose a novel persistent relative 6D state estimation framework that enables a single UUV to estimate its relative motion to a non-cooperative target using only successive noisy range measurements from two monostatic sonar sensors. Our key contribution is an observability-enhanced attitude control strategy, which optimally adjusts the UUV’s orientation to improve the observability of relative state estimation using a Kalman filter, effectively mitigating the impact of sensor noise and drift accumulation. Additionally, we introduce a rigorously proven Lyapunov-based tracking control strategy that guarantees long-term stability by ensuring that the UUV maintains an optimal measurement range, preventing localization errors from diverging over time. Through theoretical analysis and simulations, we demonstrate that our method significantly improves 6D relative state estimation accuracy and robustness compared to conventional approaches. This work provides a scalable, infrastructure-free solution for UUVs tracking uncooperative targets underwater.

Abstract:
This paper addresses the challenges of calibrating Panoramic Annular Lens (PAL) systems, which exhibit unique projection characteristics due to their imaging relationship designed to compress blind zones. Traditional camera calibration methods often fail to accurately capture these properties. To resolve this limitation, we propose a novel projection model that incorporates angular modulation, enabling a more accurate representation of the PAL system’s imaging process. This formulation significantly improves the model’s ability to describe the relationship between object space and image space. We evaluate our approach on both synthetic and real-world datasets tailored for PAL cameras. Experimental results demonstrate that the model achieves sub-pixel accuracy, with reprojection errors typically ranging from 0.1 to 0.3 pixels on 2048×2048 images when using five distortion terms. This level of precision surpasses existing calibration models for panoramic cameras, making our method particularly suitable for high-accuracy applications. The datasets used in this study are publicly available at https://github.com/wwendy233/PALcalib.

Abstract:
This paper considers the inverse kinematics problem of a robotic arm applying minimum mean square error with variance-based control. The proposed algorithm achieves optimal results by minimizing the average error, even when considering variance calculations. Its performance is comparable to that of the algorithm that utilizes optimally tuned singular value decomposition (SVD). The calculated variance values are added to the diagonal terms of the matrix as in the damped least squares method in the inverse matrix operation. This indicates that optimal performance can be achieved even when a Moore-Penrose pseudo-inverse matrix is employed instead of SVD. The effectiveness of the proposed method is validated with seven-degree-of-freedom (7-DoF) (1 rail + 6-DoF arm) and 6-DoF robots. By introducing practical error control methods, this paper contributes to enhancing the overall comprehension of error-related algorithms.
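
The damped least squares update mentioned above, in which a term is added to the diagonal before inversion, can be sketched for a planar two-link arm as follows. This is an illustrative sketch only: the arm geometry, the function names, and the constant `damping` are assumptions. In the paper the added diagonal term comes from calculated variance values, and the abstract does not describe that calculation, so a fixed placeholder is used here.

```python
import math


def forward(q, l1=1.0, l2=1.0):
    """End-effector position of a planar 2-link arm (hypothetical example arm)."""
    x = l1 * math.cos(q[0]) + l2 * math.cos(q[0] + q[1])
    y = l1 * math.sin(q[0]) + l2 * math.sin(q[0] + q[1])
    return x, y


def jacobian(q, l1=1.0, l2=1.0):
    """2x2 position Jacobian of the same arm."""
    s1, c1 = math.sin(q[0]), math.cos(q[0])
    s12, c12 = math.sin(q[0] + q[1]), math.cos(q[0] + q[1])
    return [[-l1 * s1 - l2 * s12, -l2 * s12],
            [l1 * c1 + l2 * c12, l2 * c12]]


def dls_step(q, target, damping):
    """One damped least squares step: dq = J^T (J J^T + damping * I)^-1 e.

    `damping` is a placeholder constant standing in for the variance-based
    diagonal term described in the abstract.
    """
    x, y = forward(q)
    ex, ey = target[0] - x, target[1] - y
    J = jacobian(q)
    # A = J J^T + damping * I; positive definite whenever damping > 0,
    # so the 2x2 solve below stays well defined even at singular poses.
    a11 = J[0][0] ** 2 + J[0][1] ** 2 + damping
    a12 = J[0][0] * J[1][0] + J[0][1] * J[1][1]
    a22 = J[1][0] ** 2 + J[1][1] ** 2 + damping
    det = a11 * a22 - a12 * a12
    wx = (a22 * ex - a12 * ey) / det
    wy = (-a12 * ex + a11 * ey) / det
    # Map back to joint space with J^T.
    dq0 = J[0][0] * wx + J[1][0] * wy
    dq1 = J[0][1] * wx + J[1][1] * wy
    return [q[0] + dq0, q[1] + dq1]


if __name__ == "__main__":
    q = [0.3, 0.5]
    for _ in range(200):
        q = dls_step(q, (1.2, 0.8), damping=0.01)
    print(q, forward(q))
```

The added diagonal term keeps the matrix invertible at a singular pose such as the fully stretched arm (`q = [0, 0]`), where a plain pseudo-inverse would produce unbounded joint updates.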

Abstract:
In everyday life, frequently used objects like cups often have unfixed positions and multiple instances within the same category, and their carriers frequently change as well. As a result, it becomes challenging for a robot to efficiently navigate to a specific instance. To tackle this challenge, the robot must capture and update scene changes and plans continuously. However, current object navigation approaches primarily focus on the semantic level and lack the ability to dynamically update scene representation. To address these limitations, this paper captures the relationships between frequently used objects and their static carriers. Specifically, it constructs an open-vocabulary Carrier-Relationship Scene Graph (CRSG) and updates the carrying status during robot navigation to reflect the dynamic changes of the scene. Based on the CRSG, we further propose an instance navigation strategy that models the navigation process as a Markov Decision Process. At each step, decisions are informed by Large Language Model’s commonsense knowledge and visual-language feature similarity. We designed a series of long-sequence navigation tasks for frequently used everyday items in the Habitat simulator. The results demonstrate that by updating the CRSG, the robot can efficiently navigate to moved targets. Additionally, we deployed our algorithm on a real robot and validated its practical effectiveness. The project page can be found here: https://OpenObject-Nav.github.io.

Abstract:
Supernumerary robotic limbs (SRLs) can assist humans in achieving efficient and comfortable work in daily life or industrial assembly scenarios. This requires SRLs to switch between rigidity and flexibility, performing compliant movements while also providing stable support for humans to reduce fatigue from prolonged standing. Existing SRLs struggle to achieve this transition. In this study, a variable stiffness supernumerary robotic limb (VSSRL) is implemented, capable of adjusting its position and stiffness through pneumatic-tendon coupled actuation. The position of the VSSRL is accurately modulated by tendons, while its stiffness is controlled by pneumatic-tendon coupled actuation: the tendons significantly increase the overall stiffness of the VSSRL, and the fiber-reinforced actuators (FRAs) dynamically adjust its stiffness in response to changes in dynamic loads. Furthermore, a kinematic model of the VSSRL and a stiffness model under the coupling of FRAs and tendons are developed. Then, the trajectory and stiffness of the VSSRL in task execution are assigned based on human motion, and a multi-objective control system for both position and stiffness of the VSSRL is designed based on a reinforcement learning (RL) algorithm, achieving collaborative control of position and stiffness for the VSSRL. The accuracy of the control system is validated through experiments, which demonstrate that the load capacity of the VSSRL is significantly enhanced under the action of tendons and FRAs, and that the VSSRL is able to provide various modes of assistance for daily life activities.

Abstract:
Inertial sensing estimation methods allow human motion tracking in the absence of optical tracking and joint encoders, but these methods were mostly developed for quasistatic motion due to the limited motion capability of humanoids. This paper presents a new method that tracks highly dynamic motion using Inertial Measurement Unit (IMU) measurements. Unlike conventional methods that depend on quasistatic motion for inclination correction with the measured gravity vector, the proposed method uses accelerometers to correct the rotational rate. This is achieved by placing sensors on the ends of links and converting the acceleration measured at the ends to angular rate based on centrifugal forces. Human motions of low and high intensity are measured to identify the strengths and weaknesses of the proposed method in different applications. The proposed technique maintains an acceptable error for both quasistatic and highly dynamic motions and can be used to accurately visualize measured motions.
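
The conversion from acceleration at a link end to angular rate rests on the relation a = ω²r between radial (centripetal) acceleration a, angular rate ω, and the distance r from the joint axis to the sensor. The sketch below only illustrates that relation. It assumes the radial component has already been isolated (gravity and tangential acceleration removed), which the abstract does not detail, and the function name is hypothetical.

```python
import math


def angular_rate_from_radial_accel(radial_accel, radius):
    """Magnitude of angular rate (rad/s) from radial acceleration (m/s^2).

    Uses a = omega^2 * r, so omega = sqrt(|a| / r). Only the magnitude is
    recovered; the sign of rotation has to come from another source such as
    the gyroscope. `radius` is the sensor's distance from the joint axis (m).
    """
    if radius <= 0:
        raise ValueError("radius must be positive")
    return math.sqrt(abs(radial_accel) / radius)


if __name__ == "__main__":
    # A sensor 0.3 m from the joint on a link rotating at 5 rad/s
    # experiences 5**2 * 0.3 = 7.5 m/s^2 of radial acceleration.
    print(angular_rate_from_radial_accel(7.5, 0.3))
```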

Abstract:
This paper proposes FABG (Facial Affective Behavior Generation), an end-to-end imitation learning system for human-robot interaction, designed to generate natural and fluid facial affective behaviors. In interaction, effectively obtaining high-quality demonstrations remains a challenge. In this work, we develop an immersive virtual reality (VR) demonstration system that allows operators to perceive stereoscopic environments. This system ensures that "the operator’s visual perception matches the robot’s sensory input" and "the operator’s actions directly determine the robot’s behaviors" - as if the operator replaces the robot in human interaction engagements. We propose a prediction-driven latency compensation strategy to reduce robotic reaction delays and enhance interaction fluency. FABG naturally acquires human interactive behaviors and subconscious motions driven by intuition, eliminating manual behavior scripting. We deploy FABG on a real-world 25 degree-of-freedom (DoF) humanoid robot, validating its effectiveness through four fundamental interaction tasks: affective interaction, dynamic tracking, foveated attention, and gesture recognition, supported by data collection and policy training.

Abstract:
Decentralized multi-agent navigation under uncertainty is a complex task that arises in numerous robotic applications. It requires collision avoidance strategies that account for kinematic constraints as well as sensing and action execution noise. In this paper, we propose a novel approach that integrates the Model Predictive Path Integral (MPPI) with a probabilistic adaptation of Optimal Reciprocal Collision Avoidance. Our method ensures safe and efficient multi-agent navigation by incorporating probabilistic safety constraints directly into the MPPI sampling process via a Second-Order Cone Programming formulation. This approach enables agents to operate independently using local noisy observations while maintaining safety guarantees. We validate our algorithm through extensive simulations with differential-drive robots and benchmark it against state-of-the-art methods, including ORCA-DD and B-UAVC. Results demonstrate that our approach outperforms them while achieving high success rates, even in densely populated environments. Additionally, validation in the Gazebo simulator confirms its practical applicability to robotic platforms. Source code is available at: http://github.com/PathPlanning/MPPI-Collision-Avoidance.

Abstract:
Micro-nano robots must break the symmetry of the flow field to generate net displacement in the low Reynolds number environment. Spherical micro-robots utilize the frictional forces generated through interaction with the surface. We designed a magnetic microroller robot powered by a rotating AC magnetic field. Here, we employed dual measurements of laser ranging and computer vision to demonstrate that a single 100 μm microroller maintains a lubrication film of 1 to 15 μm with the surface during normal motion. We found that the translational velocity of the microroller is correlated with the lubrication film thickness. Based on the robot's gravity, we controlled an additional downward gradient magnetic field to effectively increase the load on the robot and reduce the lubrication film thickness, thereby controllably increasing the translational velocity of the robot. For example, the gradient magnetic field generated by superimposing a 30 mA direct current input can reduce the lubrication film thickness from 8 μm to 4 μm in a 10 Hz rotating magnetic field, and increase the translational velocity from 230 μm/s to 460 μm/s. The enhancement of the robot's motion performance enables it to better control its movement in fluids. Finally, we validated the strategy for controllable acceleration of micro-scale particles rolling on surfaces, applied to control fluid motion in multiple arteries within blood vessels. These results offer deeper insights into the physical motion mechanism of surface robots and hold significant implications for future applications in biomedical engineering.

Affiliations: School of Mechanical Engineering, Tiangong University, Tianjin, China; School of Computer Science and Technology, Tiangong University, Tianjin, China; College of Information Science and Technology, Beijing University of Chemical Technology, Beijing, China; College of Mechanical and Electrical Engineering, Beijing University of Chemical Technology, Beijing, China; School of Information Science, Japan Advanced Institute of Science and Technology, Nomi, Japan; Graduate School of Science and Engineering, Ritsumeikan University, Shiga, Japan

Abstract:
This study introduces a novel burrowing robot that achieves effective locomotion inside granular media through the synergistic integration of high-frequency vibration-induced environmental-phase-transition (EPT) and adaptive morphing. The robotic system employs three key innovations: 1) an asymmetric arm trajectory mechanism generating directional propulsion, 2) a vibration-mediated granular fluidization system reducing environmental resistance, and 3) passively adaptive claws demonstrating phase-dependent configuration changes. Experimental results demonstrate that the synchronization of morphologically adaptive claws and high-frequency vibration significantly improves locomotion performance. Additionally, numerical simulations based on Adams-EDEM coupling provide deeper insights into the interaction mechanisms between the robot and granular media. This work advances fundamental understanding of terradynamic locomotion by demonstrating environmental modification as a viable strategy for resistance reduction, while providing a bio-inspired framework for developing versatile robotic systems capable of navigating complex particulate environments.

Abstract:
This study investigates a novel Xθ actuation system driven by a reluctance actuator (RA) and two accompanying moving magnet actuators (MMAs). The system enables precise control of both translational (x) and rotational (θ) motion, offering a two-degree-of-freedom (2DOF) solution for high-precision applications. The two MMAs introduce additional force and torque dynamics through the solenoid and permanent magnet (PM) pairs. Flexure hinges assist with the retraction force of the mover element, providing the necessary stiffness without introducing frictional effects. The system was modeled analytically, optimized, and validated experimentally with a developed feedback control, achieving steady-state errors of approximately ±7 µm in x translation and ±0.3 mrad in θ rotation, which can be attributed to systematic errors in the sensor itself. The most relevant application is the fastscan mirror in extreme ultraviolet (EUV) lithography, where specific targeted rotational and translational trajectories can benefit light beam positioning, such as wavefront corrections. This system allows translation and rotation specifications to be realized in one actuation unit, opening up more design possibilities for controlling precision motion systems.

Abstract:
Autonomous driving has the potential to set the stage for more efficient future mobility, requiring the research domain to establish trust through safe, reliable and transparent driving. Large Language Models (LLMs) possess reasoning capabilities and natural language understanding, presenting the potential to serve as generalized decision-makers for ego-motion planning that can interact with humans and navigate environments designed for human drivers. While this research avenue is promising, current autonomous driving approaches are challenged by combining 3D spatial grounding with the reasoning and language capabilities of LLMs. We introduce BEVDriver, an LLM-based model for end-to-end closed-loop driving in CARLA that utilizes latent BEV features as perception input. BEVDriver includes a BEV encoder to efficiently process multi-view images and 3D LiDAR point clouds. Within a common latent space, the BEV features are propagated through a Q-Former to align with natural language instructions and passed to the LLM that predicts and plans precise future trajectories while considering navigation instructions and critical scenarios. On the LangAuto benchmark, our model reaches up to 18.9% higher performance on the Driving Score compared to SoTA methods.

Abstract:
Robotic manipulation can involve multiple manipulators to complete a task. In those cases, the complexity of performing the task in a coordinated manner increases, requiring coordinated planning while avoiding collisions between robots and environmental elements. For these challenges, we propose a robotic arm control algorithm based on Learning from Demonstration to independently learn the tasks of each arm, followed by a graph-based communication method using Gaussian Belief Propagation. Our method enables the resolution of decoupled dual-arm tasks learned independently without requiring coordinated planning. The algorithm generates smooth, collision-free solutions between arms and environmental obstacles while ensuring efficient movements without the need for constant replanning. Its efficiency has been validated through experiments and comparisons against another multi-robot control method in simulation using PyBullet with two opposing IIWA robots, as well as a mobile robot with two UR3 arms, which has also been used for real-world testing. Code is available at https://adrianprados.github.io/GaussianBeliefPropagationDualArm/.

Abstract:
This paper presents a novel approach to image-goal navigation by integrating 3D Gaussian Splatting (3DGS) with Visual Navigation Models (VNMs), a method we refer to as GSplatVNM. VNMs offer a promising paradigm for image-goal navigation by guiding a robot through a sequence of point-of-view images without requiring metrical localization or environment-specific training. However, constructing a dense and traversable sequence of target viewpoints from start to goal remains a central challenge, particularly when the available image database is sparse. To address these challenges, we propose a 3DGS-based viewpoint synthesis framework for VNMs that synthesizes intermediate viewpoints to seamlessly bridge gaps in sparse data while significantly reducing storage overhead. Experimental results in a photorealistic simulator demonstrate that our approach not only enhances navigation efficiency but also exhibits robustness under varying levels of image database sparsity.

Abstract:
This paper presents a novel collaborative online dense mapping system for multiple Unmanned Aerial Vehicles (UAVs). The system confers two primary benefits: it facilitates simultaneous UAV co-localization and real-time dense map reconstruction, and it recovers the metric scale even in GNSS-denied conditions. To achieve these advantages, Ultrawideband (UWB) measurements, monocular Visual Odometry (VO), and co-visibility observations are jointly employed to recover both relative positions and global UAV poses, thereby ensuring optimality at both local and global scales. A two-stage optimization strategy is adopted to reduce the optimization burden. Initially, relative Sim3 transformations among UAVs are swiftly estimated, with UWB measurements facilitating metric scale recovery in the absence of GNSS. Subsequently, a global pose optimization is performed to effectively mitigate cumulative drift. By integrating UWB, VO, and co-visibility data within this framework, both local geometric consistency and global pose accuracy are robustly maintained. Through comprehensive simulation and empirical real-world testing, we demonstrate that our system not only improves UAV positioning accuracy in challenging scenarios but also facilitates the high-quality, online integration of dense point clouds in large-scale areas. This research offers valuable contributions and practical techniques for precise, real-time map reconstruction using an autonomous UAV fleet, particularly in GNSS-denied environments.

Abstract:
Category-level 6D pose estimation in cluttered and occluded environments is a challenging task. Most existing methods rely on deterministic point-based correspondences to estimate target poses, which cannot account for the uncertainty of occluded objects and thus result in inferior performance. In this paper, we propose a diffusion model guided by occlusion-aware observations to adaptively refine the object poses in occluded and cluttered scenes. Specifically, we first extract various 2D and 3D features from an RGB-D image to construct the conditions of the diffusion model. In the reverse diffusion process, the model is guided by implicit correspondences, perception distance, and occlusion relationships to refine the noisy pose sampled from a standard Gaussian distribution. With several denoising steps, our method can produce accurate results that are consistent with image observations in occluded scenarios. The experimental results show that the proposed method outperforms baseline methods on major metrics in occlusion scenarios. Furthermore, grasping experiments in a cluttered environment on a physical UR5 robot show that our approach can also be applied to robotic grasping and manipulation tasks.

Abstract:
Novel view image synthesis for large-scale outdoor traffic scenes presents significant challenges, including inaccurate depth measurements, moving objects, wide-angle rendering requirements, and the increased demand for memory and computational resources. In this paper, we propose an adaptive pipeline that constructs high-fidelity 3D surfel models and synthesizes realistic novel views in real time. Our contributions are threefold: 1) developing depth-refinement and moving-object-removal techniques to robustly reconstruct surfel-based scene geometry, while minimizing computational overhead; 2) developing a self-adaptive rendering mechanism which adjusts surfel geometry for large-scale scenes within constrained memory; 3) developing a hyper-parameter tuning approach for optimal surfel construction and rendering performance. An optional GAN-based inpainting module fills missing backgrounds (e.g., sky). Experiments on the KITTI dataset and CARLA simulator show that our method achieves image quality comparable to SOTA NeRF and 3D Gaussian Splatting techniques with significantly improved computational efficiency. This makes our approach particularly well-suited for large-scale traffic scenarios. Our simulation datasets with ground-truth data and source code are available at https://github.com/Billy1203/SurfelMapping.

Abstract:
In vivo image-guided multi-pipette patch-clamp is essential for studying cellular interactions and network dynamics in neuroscience. However, current procedures mainly rely on manual expertise, which limits accessibility and scalability. Robotic automation presents a promising solution, but achieving precise real-time detection of multiple pipettes remains a challenge. Existing methods focus on ex vivo experiments or single pipette use, making them inadequate for in vivo multi-pipette scenarios. To address these challenges, we propose a heatmap-augmented coarse-to-fine learning technique to facilitate multi-pipette real-time localisation for robot-assisted in vivo patch-clamp. More specifically, we introduce a Generative Adversarial Network (GAN)-based module to remove background noise and enhance pipette visibility. We then introduce a two-stage Transformer model that starts with predicting the coarse heatmap of the pipette tips, followed by the fine-grained coordinate regression module for precise tip localisation. To ensure robust training, we use the Hungarian algorithm for optimal matching between the predicted and actual locations of tips. Experimental results demonstrate that our method achieved >98% accuracy within 10 μm, and >89% accuracy within 5 μm for the localisation of multi-pipette tips. The average MSE is 2.52 μm.
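
The matching step between predicted and actual tip locations is a minimum-cost assignment problem. The sketch below illustrates that problem using exhaustive search over permutations, which stands in for the Hungarian algorithm: both return a minimum-cost matching, but exhaustive search is only practical for a handful of tips. The function name, the Euclidean cost, and the sample coordinates are assumptions, since the abstract does not specify the cost used during training.

```python
import math
from itertools import permutations


def match_tips(predicted, actual):
    """Return (pairs, total_cost) for the minimum-cost one-to-one matching.

    `predicted` and `actual` are equal-length lists of (x, y) points. Each
    pair (i, j) assigns predicted[i] to actual[j]. Cost is the summed
    Euclidean distance (an assumed cost, chosen for illustration).
    Exhaustive search is O(n!); the Hungarian algorithm solves the same
    problem in polynomial time.
    """
    if len(predicted) != len(actual):
        raise ValueError("point sets must have the same size")
    best_cost, best_perm = math.inf, ()
    for perm in permutations(range(len(actual))):
        cost = sum(math.dist(predicted[i], actual[j]) for i, j in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(enumerate(best_perm)), best_cost


if __name__ == "__main__":
    # Made-up coordinates for three pipette tips, listed in different orders.
    predicted = [(10, 10), (50, 52), (90, 5)]
    actual = [(91, 4), (11, 9), (49, 50)]
    pairs, cost = match_tips(predicted, actual)
    print(pairs, cost)
```

Matching before computing the loss makes training insensitive to the order in which tips are predicted, which is the usual reason for an assignment step of this kind.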

Abstract:
This paper proposes a learning-based passive fault-tolerant control (PFTC) method for quadrotors capable of handling arbitrary single-rotor failures, including conditions ranging from fault-free to complete rotor failure, without requiring any rotor fault information or controller switching. Unlike existing methods that treat rotor faults as disturbances and rely on a single controller for multiple fault scenarios, our approach introduces a novel Selector-Controller network structure. This architecture integrates a fault detection module and the controller into a unified policy network, effectively combining the adaptability to multiple fault scenarios of PFTC with the superior control performance of active fault-tolerant control (AFTC). To optimize performance, the policy network is trained using a hybrid framework that synergizes reinforcement learning (RL), behavior cloning (BC), and supervised learning with fault information. Extensive simulations and real-world experiments validate the proposed method, demonstrating significant improvements in fault response speed and position tracking performance compared to state-of-the-art PFTC and AFTC approaches. Video and code will be available at https://github.com/HITSZcjh/uav_ftc.

Abstract:
Performing target tracking and surveillance in dynamic obstacle environments requires maintaining continuous visual focus on the target while ensuring collision avoidance. This paper presents a safety-critical tracking control method that ensures dynamic obstacles remain outside the camera’s line of sight while simultaneously avoiding collisions between the chaser vehicle and obstacles. A novel real-time occlusion detection function is developed, and motion constraints are systematically integrated using a hybrid framework combining the artificial potential field (APF) method with an observer-based control strategy. To address time-sensitive tasks, a prescribed time controller (PTC) based on a time-scale transformation technique is proposed. Furthermore, a prescribed time linear extended state observer (PTESO) is proposed, featuring a simplified structure to enable rapid and accurate estimation of unknown environmental disturbances and non-linear terms. Finally, the effectiveness of the proposed method is verified via simulation in a simplified physical scenario.

Abstract:
Affordance understanding, the task of identifying actionable regions on 3D objects, plays a vital role in allowing robotic systems to engage with and operate within the physical world. Although Visual Language Models (VLMs) have excelled in high-level reasoning and long-horizon planning for robotic manipulation, they still fall short in grasping the nuanced physical properties required for effective human-robot interaction. In this paper, we introduce PAVLM (Point cloud Affordance Vision-Language Model), an innovative framework that utilizes the extensive multimodal knowledge embedded in pre-trained language models to enhance 3D affordance understanding of point clouds. PAVLM integrates a geometric-guided propagation module with hidden embeddings from large language models (LLMs) to enrich visual semantics. On the language side, we prompt Llama-3.1 models to generate refined context-aware text, augmenting the instructional input with deeper semantic cues. Experimental results on the 3D-AffordanceNet benchmark demonstrate that PAVLM outperforms baseline methods for both full and partial point clouds, particularly excelling in its generalization to novel open-world affordance tasks of 3D objects. For more information, visit our project site: pavlm-source.github.io.

Abstract:
In intraocular microsurgery with minute operational scales, instruments pass through non-visible regions of the anterior segment, where robot-assisted surgery, which heavily relies on visual perception, fails to determine the instrument’s attitude relative to the eyeball. This compromises surgical flexibility, increases risks, and hinders autonomous surgery development. Therefore, a framework for predicting instrument trajectories in non-visible regions during robot-assisted microsurgery has been proposed to mitigate the risks of retinal and lens injuries caused by blind operations and enhance surgical procedures’ intelligence and autonomy. First, a lightweight reconstruction of the anterior segment environment is performed under controlled knowledge guidance to construct a global map. Second, the tip position of the surgical instrument is detected through multi-sensor fusion, enabling the perception of instrument-environment interactions under visual constraints. Based on this, a long short-term spatiotemporal aggregation algorithm for instrument trajectory prediction is proposed, which enhances surgical safety by providing high-precision predictions of the instrument tip’s motion trajectory. Experiments show that the framework achieved a 0.0435 mm average prediction error in non-visible regions, corresponding to 0.03% of the region in a single dimension and 7.25% of the surgical instrument’s diameter. This significantly enhances the precision of robot-assisted surgery under visual constraints and provides robust technical support for safe, intelligent, and autonomous intraocular robotic surgery.

Abstract:
Human-robot interaction requires robots to process language incrementally, adapting their actions in real-time based on evolving speech input. Existing approaches to language-guided robot motion planning typically assume fully specified instructions, resulting in inefficient stop-and-replan behavior when corrections or clarifications occur. In this paper, we introduce a novel reasoning-based incremental parser which integrates an online motion planning algorithm within the cognitive architecture. Our approach enables continuous adaptation to dynamic linguistic input, allowing robots to update motion plans without restarting execution. The incremental parser maintains multiple candidate parses, leveraging reasoning mechanisms to resolve ambiguities and revise interpretations when needed. By combining symbolic reasoning with online motion planning, our system achieves greater flexibility in handling speech corrections and dynamically changing constraints. We evaluate our framework in real-world human-robot interaction scenarios, demonstrating online adaptations of goal poses, constraints, or task objectives. Our results highlight the advantages of integrating incremental language understanding with real-time motion planning for natural and fluid human-robot collaboration. The experiments are demonstrated in the accompanying video at www.acin.tuwien.ac.at/42d5.

Abstract:
Accurate prediction of human behavior is crucial for AI systems to effectively support real-world applications, such as autonomous robots anticipating and assisting with human tasks. Real-world scenarios frequently present challenges such as occlusions and incomplete scene observations, which can compromise predictive accuracy. Thus, traditional video-based methods often struggle due to limited temporal and spatial perspectives. Large Language Models (LLMs) offer a promising alternative. Having been trained on a large text corpus describing human behaviors, LLMs likely encode plausible sequences of human actions in a home environment. However, LLMs, trained primarily on text data, lack inherent spatial awareness and real-time environmental perception. They struggle with understanding physical constraints and spatial geometry. Therefore, to be effective in a real-world spatial scenario, we propose a multimodal prediction framework that enhances LLM-based action prediction by integrating physical constraints derived from human trajectories. Our experiments demonstrate that combining LLM predictions with trajectory data significantly improves overall prediction performance. This enhancement is particularly notable in situations where the LLM receives limited scene information, highlighting the complementary nature of linguistic knowledge and physical constraints in understanding and anticipating human behavior. Project page: https://sites.google.com/view/trllm?usp=sharing. GitHub repo: https://github.com/kojirotakeyama/TR-LLM/blob/main/readme.md

Abstract:
Electroactive polymer actuators based on PVDF-TrFE-CTFE terpolymers are gaining prominence in applications requiring compliant, high-precision actuation, notably in soft robotics for minimally invasive surgical tools. However, the relaxor ferroelectric properties of these materials introduce significant nonlinear hysteresis, which impairs control accuracy. This paper presents a model-based control architecture that overcomes these challenges by integrating an analytically inverted generalized Prandtl-Ishlinskii hysteresis model with model predictive control (MPC). The proposed framework effectively linearizes the hysteresis behavior while optimizing control inputs under actuator constraints, ensuring rapid response and zero steady-state error in quasi-static regimes. The developed model has been compared against experimental measurements of actuator deflection and electric displacement, demonstrating strong correlation between predicted and observed behavior. Numerical simulations demonstrate that the MPC strategy achieves short settling times and improved trajectory tracking compared to conventional proportional-integral-derivative control. These results underscore the potential of the integrated approach for precision-critical urological applications.
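For readers unfamiliar with Prandtl-Ishlinskii hysteresis models, the following is a minimal sketch of the classical form: a weighted sum of play (backlash) operators. It is not the generalized model or the analytical inverse used in the paper; the function names, thresholds, and weights are illustrative assumptions only.

```python
def play_operator(u, r, p0=0.0):
    """Discrete play (backlash) operator with threshold r.

    u is a sequence of input samples; p0 is the assumed initial state.
    """
    out = []
    p = p0
    for uk in u:
        # The state follows the input only once it leaves the dead band [uk - r, uk + r].
        p = max(uk - r, min(uk + r, p))
        out.append(p)
    return out


def pi_model(u, thresholds, weights):
    """Classical Prandtl-Ishlinskii output: weighted sum of play operators."""
    branches = [play_operator(u, r) for r in thresholds]
    return [
        sum(w * b[k] for w, b in zip(weights, branches))
        for k in range(len(u))
    ]


# Hypothetical input: ramp up, then back down.
u = [0.0, 1.0, 2.0, 1.0, 0.0]
y = pi_model(u, thresholds=[0.0, 0.5], weights=[1.0, 1.0])
```

The output at `u = 1.0` differs between the rising and falling branches, which is the hysteresis loop that a feedforward inverse would aim to cancel before a controller such as MPC acts on the residual.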

Abstract:
Pouch motors continue to attract research attention owing to their simple fabrication process, low cost, and excellent energy density. Existing pouch motors based on the liquid-gas phase transition (LGPT) principle exhibit significant stroke and force outputs but suffer from slow responses. Pouch motors that rely on electrohydraulic actuation (EHA) demonstrate rapid responses and broad bandwidths, yet their stroke/force outputs remain limited. This paper presents a novel thermal-electrostatic dual-modal soft pouch motor (TES-SPM) that synergistically combines the advantages of LGPT and EHA. The output performance of the TES-SPM in both the LGPT and EHA modes is characterized by extensive experiments. The effects of key parameters, including the liquid volumes and actuation voltage/current amplitudes, are also investigated in experiments. In the EHA mode, the TES-SPM can exert a stroke of 2.5 mm within ~0.06 s, while in the LGPT mode, it exhibits a maximum stroke of 22.8 mm and a blocking force of ~80 N. A novel folding fan-inspired actuator and an accordion-inspired soft gripper based on serially attached TES-SPM units are developed to demonstrate its potential for soft robotic applications. The TES-SPM designed in this paper is envisioned to have promising applications in industrial soft grippers and wearable assistive devices.

Abstract:
Detecting driver fatigue is critical for road safety, as drowsy driving remains a leading cause of traffic accidents. Many existing solutions rely on computationally demanding deep learning models, which result in high latency and are unsuitable for embedded robotic devices with limited resources (such as intelligent vehicles/cars) where rapid detection is necessary to prevent accidents. This paper introduces LiteFat, a lightweight spatio-temporal graph learning model designed to detect driver fatigue efficiently while maintaining high accuracy and low computational demands. LiteFat involves converting streaming video data into spatio-temporal graphs (STG) using facial landmark detection, which focuses on key motion patterns and reduces unnecessary data processing. LiteFat uses MobileNet to extract facial features and create a feature matrix for the STG. A lightweight spatio-temporal graph neural network is then employed to identify signs of fatigue with minimal processing and low latency. Experimental results on benchmark datasets show that LiteFat performs competitively while significantly reducing computational complexity and latency compared to current state-of-the-art methods. This work advances the development of real-time, resource-efficient human fatigue detection systems that can be implemented on embedded robotic devices.

Abstract:
Accurate and physically feasible human motion prediction is crucial for safe and seamless human-robot collaboration. While recent advancements in human motion capture enable real-time pose estimation, the practical value of many existing approaches is limited by the lack of future predictions and consideration of physical constraints. Conventional motion prediction schemes rely heavily on past poses, which are not always available in real-world scenarios. To address these limitations, we present a physics-informed learning framework that integrates domain knowledge into both training and inference to predict human motion using inertial measurements from only 5 IMUs. We propose a network that accounts for the spatial characteristics of human movements. During training, we incorporate forward and differential kinematics functions as additional loss components to regularize the learned joint predictions. At the inference stage, we refine the prediction from the previous iteration to update a joint state buffer, which is used as extra inputs to the network. Experimental results demonstrate that our approach achieves high accuracy, smooth transitions between motions, and generalizes well to unseen subjects. The source code and data are available at https://github.com/ami-iit/paper_guo_2025_iros_human_kinematics_prediction.

Abstract:
Autonomous systems have advanced significantly, but challenges persist in accident-prone environments where robust decision-making is crucial. A single vehicle’s limited sensor range and obstructed views increase the likelihood of accidents. Multi-vehicle connected systems and multi-modal approaches, leveraging RGB images and LiDAR point clouds, have emerged as promising solutions. However, existing methods often assume the availability of all data modalities and connected vehicles during both training and testing, which is impractical due to potential sensor failures or missing connected vehicles. To address these challenges, we introduce a novel framework MMCD (Multi-Modal Collaborative Decision-making) for connected autonomy. Our framework fuses multi-modal observations from ego and collaborative vehicles to enhance decision-making under challenging conditions. To ensure robust performance when certain data modalities are unavailable during testing, we propose an approach based on cross-modal knowledge distillation with a teacher-student model structure. The teacher model is trained with multiple data modalities, while the student model is designed to operate effectively with reduced modalities. In experiments on connected autonomous driving with ground vehicles and aerial-ground vehicles collaboration, our method improves driving safety by up to 20.7%, surpassing the best-existing baseline in detecting potential accidents and making safe driving decisions. More information can be found on our website https://ruiiu.github.io/mmcd.
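As a rough illustration of teacher-student knowledge distillation in general, the sketch below computes a temperature-softened KL divergence between teacher and student output distributions. This is the standard distillation loss, not necessarily the cross-modal loss used in MMCD; the logits, temperature, and function names are assumptions made for the example.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax with a temperature parameter."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL(teacher || student) on temperature-softened distributions.

    The T^2 factor is the usual scaling that keeps gradient magnitudes
    comparable across temperatures.
    """
    p = softmax(teacher_logits, temperature)
    q = softmax(student_logits, temperature)
    kl = sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))
    return temperature ** 2 * kl


# Hypothetical logits: teacher sees all modalities, student sees fewer.
teacher = [2.0, 0.5, -1.0]
student = [1.0, 0.8, -0.5]
loss = distillation_loss(teacher, student)
```

The loss is zero when the student reproduces the teacher's distribution and grows as the two diverge, which is what lets a reduced-modality student be trained toward a full-modality teacher.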

Abstract:
Robot-assisted dressing is a popular but challenging topic in the field of robotic manipulation, offering significant potential to improve the quality of life for individuals with mobility limitations. Currently, the majority of research on robot-assisted dressing focuses on how to put on loose-fitting clothing, with little attention paid to tight garments. For the former, since the armscye is larger, a single robotic arm can usually complete the dressing task successfully. However, for the latter, dressing with a single robotic arm often fails due to the narrower armscye and the property of diminishing rigidity in the armscye, which eventually causes the armscye to get stuck. This paper proposes a bimanual dressing strategy suitable for dressing tight-fitting clothing. To facilitate the encoding of dressing trajectories that adapt to different human arm postures, a spherical coordinate system for dressing is established. We use the azimuthal angle of the spherical coordinate system as a task-relevant feature for bimanual manipulation. Based on this new coordinate, we employ Gaussian Mixture Model (GMM) and Gaussian Mixture Regression (GMR) for imitation learning of bimanual dressing trajectories, generating dressing strategies that adapt to different human arm postures. The effectiveness of the proposed method is validated through various experiments.
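To make the azimuthal-angle feature concrete, here is a minimal sketch of a Cartesian-to-spherical conversion using the physics convention (polar angle from the z-axis, azimuth in the x-y plane). How the paper places the origin and axes relative to the arm is not stated in the abstract, so the frame and the function name are assumptions.

```python
import math


def cartesian_to_spherical(x, y, z):
    """Return (radius, polar angle, azimuthal angle) in radians.

    Physics convention: polar angle measured from +z, azimuth measured
    from +x toward +y. The frame itself is a hypothetical choice.
    """
    r = math.sqrt(x * x + y * y + z * z)
    polar = math.acos(z / r) if r > 0.0 else 0.0
    azimuth = math.atan2(y, x)
    return r, polar, azimuth


# Hypothetical end-effector position in the assumed dressing frame.
r, polar, azimuth = cartesian_to_spherical(1.0, 1.0, 0.0)
```

A sequence of such azimuth values along a demonstrated trajectory is the kind of scalar feature that a GMM/GMR model could then be fitted to.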

Abstract:
We address the problem of generating long-horizon videos for robotic manipulation tasks. Text-to-video diffusion models have made significant progress in photorealism, language understanding, and motion generation but struggle with long-horizon robotic tasks. Recent works use video diffusion models for high-quality simulation data and predictive rollouts in robot planning. However, these works predict short sequences of the robot achieving one task and employ an autoregressive paradigm to extend to the long horizon, leading to error accumulations in the generated video and in the execution. To overcome these limitations, we propose a novel pipeline that bypasses the need for autoregressive generation. We achieve this through a threefold contribution: 1) we first decompose the high-level goals into smaller atomic tasks and generate keyframes aligned with these instructions. A second diffusion model then interpolates between each of the two generated frames, achieving the long-horizon video. 2) We propose a semantics preserving attention module to maintain consistency between the keyframes. 3) We design a lightweight policy model to regress the robot joint states from generated videos. Our approach achieves state-of-the-art results on two benchmarks in video quality and consistency while outperforming previous policy models on long-horizon tasks.

Abstract:
In cluttered, reconfigurable environments with varied objects, materials, and sizes, where obstacles can be added, removed, or moved, planning robot arm actions between grasps is a key challenge for traditional bin picking. We propose a trajectory planner that quickly determines how to grasp and precisely assemble customized products in reconfigurable environments. Inspired by voting and global registration methods, our strategy reduces false positives and enhances object identification accuracy, even with vastly different point cloud scales. We present an image processing technique combining best-fit and iterative closest point methods to improve system robustness, adaptability, and stability for objects of varying shapes, sizes, and materials, achieving millimeter-level 3D localization with millions of scene points. An enhanced tree method manages uneven 3D space distribution, enabling fast collision-free planning. A space configuration algorithm minimizes computational load, supporting complex tasks like liquid handling. After digital twin verification of collision-free paths, the robotic arm executes assembly tasks. This approach is validated through Lego assembly and other industrial use cases.

Abstract:
When performing total hip replacement (THR) surgery, high-quality preparation of the acetabulum is critical as it contributes to the patient’s recovery speed and the consistency of bone ingrowth. Conventionally, surgeons prepare the acetabulum manually by reaming it with a handheld electric drill and a reamer. This not only increases the surgeon’s workload but, more importantly, makes it difficult to control the reaming depth and direction accurately. Utilizing an admittance-controlled (AC) collaborative robot (cobot) to enable physical human-robot collaboration (pHRC) offers a promising solution. For primitive AC, a compromise must be made between compliance and task accuracy. In this paper, we present a novel variable admittance control (VAC) design that considers the reactive force of bone while ensuring the passivity and stability of the system during pHRC-assisted acetabular preparation. The qualitative results show that users preferred VAC over the conventional manual reaming method. Compared to other pHRC controls, quantitative results on user energy consumption, reaming error, and smoothness showed that the proposed VAC can achieve a balance between physical workload and acetabular quality. Compared to manual reaming, VAC reduced the reaming error by 67.47% and improved the final acetabulum surface smoothness by 18.30%.
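The compliance-versus-accuracy trade-off in admittance control can be seen in a one-dimensional sketch of the basic admittance law M·dv/dt + D·v = F_ext, integrated with explicit Euler. This is a generic fixed-parameter admittance step, not the paper's VAC design; the mass, damping, force, and time step are illustrative assumptions.

```python
def admittance_step(v, f_ext, mass, damping, dt):
    """One explicit-Euler step of M * dv/dt + D * v = f_ext.

    Returns the updated commanded velocity for the robot.
    """
    accel = (f_ext - damping * v) / mass
    return v + accel * dt


def simulate(f_ext, mass, damping, dt=0.001, steps=5000):
    """Apply a constant external force and return the final velocity."""
    v = 0.0
    for _ in range(steps):
        v = admittance_step(v, f_ext, mass, damping, dt)
    return v


# Hypothetical values: the same 5 N push with low and high virtual damping.
v_compliant = simulate(f_ext=5.0, mass=2.0, damping=10.0)
v_stiff = simulate(f_ext=5.0, mass=2.0, damping=50.0)
```

With constant force the velocity settles at F_ext / D, so low damping yields a compliant, easy-to-move tool while high damping resists motion; a variable admittance scheme adjusts such parameters online rather than fixing them.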

Abstract:
Mapping and scene representation are fundamental to reliable planning and navigation in mobile robots. While purely geometric maps using voxel grids allow for general navigation, obtaining up-to-date spatial and semantically rich representations that scale to dynamic large-scale environments remains challenging. In this work, we present CURB-OSG, an open-vocabulary dynamic 3D scene graph engine that generates hierarchical decompositions of urban driving scenes via multi-agent collaboration. By fusing the camera and LiDAR observations from multiple perceiving agents with unknown initial poses, our approach generates more accurate maps compared to a single agent while constructing a unified open-vocabulary semantic hierarchy of the scene. Unlike previous methods that rely on ground truth agent poses or are evaluated purely in simulation, CURB-OSG alleviates these constraints. We evaluate the capabilities of CURB-OSG on real-world multi-agent sensor data obtained from multiple sessions of the Oxford Radar RobotCar dataset. We demonstrate improved mapping and object prediction accuracy through multi-agent collaboration as well as evaluate the environment partitioning capabilities of the proposed approach. To foster further research, we release our code and supplementary material at https://ovcurb.cs.uni-freiburg.de.

Abstract:
We present MeCO, the Medium Cost Open-source autonomous underwater vehicle (AUV), a versatile autonomous vehicle designed to support research and development in underwater human-robot interaction (UHRI) and marine robotics in general. An inexpensive platform to build compared to similarly-capable AUVs, the MeCO design and software are released under open-source licenses, making it a cost effective, extensible, and open platform. It is equipped with UHRI-focused systems, such as front and side facing displays, light-based communication devices, a transducer for acoustic interaction, and stereo vision, in addition to typical AUV sensing and actuation components. Additionally, MeCO is capable of real-time deep learning inference using the latest edge computing devices, while maintaining low-latency, closed-loop control through high-performance microcontrollers. MeCO is designed from the ground up for modularity in internal electronics, external payloads, and software architecture, exploiting open-source robotics and containerization tools. We demonstrate the diverse capabilities of MeCO through simulated, closed-water, and open-water experiments. All resources necessary to build and run MeCO, including software and hardware design, have been made publicly available.

Abstract:
Commercial UAVs are an emerging security threat as they are capable of carrying hazardous payloads or disrupting air traffic. To counter UAVs, we introduce an autonomous 3D target encirclement and interception strategy. Unlike traditional ground-guided systems, this strategy employs autonomous drones to track and engage non-cooperative hostile UAVs, which is effective in non-line-of-sight conditions, GPS denial, and radar jamming, where conventional detection and neutralization from ground guidance fail. Using two noisy real-time distances measured by drones, guardian drones estimate the relative position from their own to the target using observation and velocity compensation methods, based on anti-synchronization (AS) and an X−Y circular motion combined with vertical jitter. An encirclement control mechanism is proposed to enable UAVs to adaptively transition from encircling and protecting a target to encircling and monitoring a hostile target. Upon breaching a warning threshold, the UAVs may even employ a suicide attack to neutralize the hostile target. We validate this strategy through real-world UAV experiments and simulated analysis in MATLAB, demonstrating its effectiveness in detecting, encircling, and intercepting hostile drones. More details: https://youtu.be/5eHW56lPVto.

Abstract:
Ultrasound (US) imaging is increasingly used in spinal procedures due to its real-time, radiation-free capabilities; however, its effectiveness is hindered by shadowing artifacts that obscure deeper tissue structures. Traditional approaches, such as CT-to-US registration, incorporate anatomical information from preoperative CT scans to guide interventions, but they are limited by complex registration requirements, differences in spine curvature, and the need for recent CT imaging. Recent shape completion methods can offer an alternative by reconstructing spinal structures in US data, while being pretrained on a large set of publicly available CT scans. However, these approaches are typically offline and have limited reproducibility. In this work, we introduce a novel integrated system that combines robotic ultrasound with real-time shape completion to enhance spinal visualization. Our robotic platform autonomously acquires US sweeps of the lumbar spine, extracts vertebral surfaces from ultrasound, and reconstructs the complete anatomy using a deep learning-based shape completion network. This framework provides interactive, real-time visualization with the capability to autonomously repeat scans and can enable navigation to target locations. This can contribute to better consistency, reproducibility, and understanding of the underlying anatomy. We validate our approach through quantitative experiments assessing shape completion accuracy and evaluations of multiple spine acquisition protocols on a phantom setup. Additionally, we present qualitative results of the visualization on a volunteer scan.

Abstract:
In multi-agent safety-critical scenarios, traditional autonomous driving frameworks face significant challenges in balancing safety constraints and task performance. These frameworks struggle to quantify dynamic interaction risks in real-time and depend heavily on manual rules, resulting in low computational efficiency and conservative strategies. To address these limitations, we propose a Dynamic Residual Safe Reinforcement Learning (DRS-RL) framework grounded in a safety-enhanced networked Markov decision process. This is the first time that the weak-to-strong theory is introduced into multi-agent decision-making, enabling lightweight dynamic calibration of safety boundaries via a weak-to-strong safety correction paradigm. Based on the multi-agent dynamic conflict zone model, our framework accurately captures spatiotemporal coupling risks among heterogeneous traffic participants and surpasses the static constraints of conventional geometric rules. Moreover, a risk-aware prioritized experience replay mechanism mitigates data distribution bias by mapping risk to sampling probability. Experimental results reveal that the proposed method significantly outperforms traditional RL algorithms in safety, efficiency, and comfort. Specifically, it reduces the collision rate by up to 92.17%, while the safety model accounts for merely 27% of the main model’s parameters.
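One plausible way to map risk to sampling probability is the power-law form used in standard prioritized experience replay, with a per-transition risk score taking the place of the TD error. The sketch below assumes that form; the exponent `alpha`, the offset `eps`, and the risk values are hypothetical, and the paper's actual mapping may differ.

```python
import random


def risk_to_probabilities(risks, alpha=0.6, eps=1e-3):
    """Map non-negative risk scores to sampling probabilities.

    Power-law prioritization: p_i is proportional to (risk_i + eps) ** alpha.
    eps keeps zero-risk transitions sampleable; alpha = 0 gives uniform sampling.
    """
    priorities = [(r + eps) ** alpha for r in risks]
    total = sum(priorities)
    return [p / total for p in priorities]


def sample_indices(risks, batch_size, rng, alpha=0.6):
    """Draw a replay batch (with replacement) biased toward risky transitions."""
    probs = risk_to_probabilities(risks, alpha)
    return rng.choices(range(len(risks)), weights=probs, k=batch_size)


# Hypothetical risk scores for four stored transitions.
risks = [0.0, 0.1, 0.5, 0.9]
probs = risk_to_probabilities(risks)
batch = sample_indices(risks, batch_size=8, rng=random.Random(0))
```

Rare high-risk transitions are replayed more often than their share of the buffer, which is one way to counter the bias toward abundant low-risk data.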

Abstract:
Unpredictable and complex aerodynamic effects pose significant challenges to achieving precise flight control, emphasizing the necessity of adaptive control via data-driven models. Moreover, real hardware usually requires high-frequency control and has limited on-chip computation, making it challenging to balance model complexity and computational cost. To address these challenges, we incorporate a linearized Gaussian process (GP) to model the external aerodynamics and combine it with linear model predictive control, enabling real-time computability. More importantly, to compensate for the control performance sacrificed by GP linearization and reduce on-chip GP computations, we design active data collection strategies using Bayesian optimization with additive GP, reducing the performance sacrifice as much as possible. Specifically, we decompose the performance into force and trajectory partitions, where the force model is for the downstream controller, and the trajectory model is used to guide collection. Experimental results show that we can achieve tracking errors comparable to those of the full GP (not real-time computable) while remaining real-time computable on real Crazyflies.

Abstract:
This work proposes a scene flow estimation framework for mmWave radar supervised by data from a widespread visual-inertial (VI) sensor suite, allowing crowdsourced training data from smart vehicles. Current scene flow estimation methods for mmWave radar are typically supervised by dense point clouds from 3D LiDARs, which are expensive and not widely available in smart vehicles. While VI data are more accessible, visual images alone cannot capture the 3D motions of moving objects, making it difficult to supervise their scene flow. Moreover, the temporal drift of the VI rigid transformation also degrades the scene flow estimation of static points. To address these challenges, we propose a drift-free rigid transformation estimator that fuses kinematic model-based ego-motions with neural network-learned results. It provides strong supervision signals to radar-based rigid transformation and infers the scene flow of static points. Then, we develop an optical-mmWave supervision extraction module that extracts the supervision signals of radar rigid transformation and scene flow. It strengthens the supervision by learning the scene flow of dynamic points with the joint constraints of optical and mmWave radar measurements. Extensive experiments demonstrate that, in smoke-filled environments, our method even outperforms state-of-the-art (SOTA) approaches using costly LiDARs.

Abstract:
Advancements in biophotonics have driven the development of miniaturized imaging probes for high-resolution in vivo imaging. Probe-based confocal laser endomicroscopy (pCLE) enables cellular-level visualization of tissues but remains challenging for retinal imaging due to the need for non-contact operation, tremor compensation, and precise focal control. This study introduces a novel handheld confocal endomicroscope system that integrates a custom-built imaging probe, an optical coherence tomography (OCT) distance sensor, and motor-assisted tremor suppression to improve imaging stability and resolution. The system employs a fiber-based common-path swept-source OCT (CPSS-OCT) sensor to maintain a stable focal distance while compensating for involuntary hand tremors using motorized stabilization. A gated recurrent unit (GRU)-based tremor prediction algorithm further enhances image stability. The imaging probe features a PZT tube-driven fiber cantilever resonance for Lissajous scanning, providing a wide field of view with minimal image distortion. In experiments using bovine eye samples, the CR score improved from 0.318 to 0.472, with a 48.43% increase in the in-focus condition when tremor compensation was activated, confirming enhanced image clarity and stability. Experimental results demonstrate that the system effectively stabilizes imaging, reduces motion artifacts, and ensures high-resolution, non-contact retinal imaging. By addressing the limitations of conventional pCLE devices, this system represents a significant advancement in ophthalmic imaging and can potentially improve retinal diagnostics and precision-guided interventions.

Abstract:
In contact-rich tasks such as polishing and drilling, inevitable physical interactions often lead to task deviations due to interference, typically resulting in excessive contact forces and eventual task failure. To tackle these challenges, we propose an innovative human-guided visual-impedance control framework. Specifically, we first introduce an interaction model within image feature space, which models the dynamics of human-robot-environment interactions. Subsequently, human operation skills are characterized through human-guided wrenches, which act on visual features through a projection matrix, thus integrating human-guided wrenches with visual-impedance interaction dynamics. Finally, leveraging this framework, we develop a novel variable-stiffness visual-impedance control strategy. The impedance parameters are optimized online via quadratic programming, ensuring that the end-tool contact force converges to the desired value while adhering to safety constraints. The validity of the proposed framework was established through polishing experiments.

Abstract:
Event-based semantic segmentation has great potential in autonomous driving and robotics due to the advantages of event cameras, such as high dynamic range, low latency, and low power cost. Unfortunately, current artificial neural network (ANN)-based segmentation methods suffer from high computational demands, the requirements for image frames, and massive energy consumption, limiting their efficiency and application on resource-constrained edge/mobile platforms. To address these problems, we introduce SLTNet, a Spike-driven Lightweight Transformer-based Network designed for event-based semantic segmentation. Specifically, SLTNet is built on efficient spike-driven convolution blocks (SCBs) to extract rich semantic features while reducing the model’s parameters. Then, to enhance the long-range contextual feature interaction, we propose novel spike-driven transformer blocks (STBs) with binary mask operations. Based on these basic blocks, SLTNet employs a high-efficiency single-branch architecture while maintaining the low energy consumption of the Spiking Neural Network (SNN). Finally, extensive experiments on DDD17 and DSEC-Semantic datasets demonstrate that SLTNet outperforms state-of-the-art (SOTA) SNN-based methods by up to 9.06% and 9.39% mIoU, respectively, with 4.58× lower energy consumption and 114 FPS inference speed. Our code is open-sourced and available at https://github.com/longxianlei/SLTNet-v1.0.

Abstract:
Accurate 3D reconstruction in surgical scenarios is essential for visualizing dynamic tissues with complex anatomical geometries. While 3D Gaussian Splatting (3D-GS) has been explored as an efficient approach to scene modeling, occlusion-induced voids and suboptimal detail optimization have limited its application in surgery. This work introduces a Complexity-Guided 3D Gaussian Splatting (CG-3DGS) framework, in which occlusion regions are globally filled by a state-of-the-art optical flow-based video inpainting method. A frequency–spatial aware refinement (FSAR) mechanism is proposed, allowing spectral signatures and spatial gradients to be jointly analyzed to enhance critical anatomical features (e.g., blood vessels). This mechanism adaptively guides Gaussian densification based on scene-specific anatomical complexity. Experimental results demonstrate that the proposed framework achieves higher reconstruction fidelity while maintaining efficient rendering speeds.

Abstract:
We present a framework for learning dexterous in-hand manipulation with multifingered hands using visuo-motor diffusion policies. Our system enables complex in-hand manipulation tasks, such as unscrewing a bottle lid with one hand, by leveraging a fast and responsive teleoperation setup for the four-fingered Allegro Hand. We collect high-quality expert demonstrations using an augmented reality (AR) interface that tracks hand movements and applies inverse kinematics and motion retargeting for precise control. The AR headset provides real-time visualization, while gesture controls streamline teleoperation. To enhance policy learning, we introduce a novel demonstration outlier removal approach based on HDBSCAN clustering and the Global-Local Outlier Score from Hierarchies (GLOSH) algorithm, effectively filtering out low-quality demonstrations that could degrade performance. We evaluate our approach extensively in real-world settings and provide all experimental videos on the project website.

Abstract:
The multi-drone waypoint traversal has significant potential for aerial robot swarms in various applications. However, it still faces challenges including low time efficiency, susceptibility to local minima, poor resilience to external disturbances, high computational complexity, and high communication burden. To address these issues, we propose a two-stage swarm planning framework by integrating an offline global trajectory generator and an online distributed local trajectory planner. This approach not only ensures time-optimality but also enhances resistance to external disturbances. Specifically, a complementary progress constraint (CPC)-based global trajectory planning method is first presented to generate globally optimal reference trajectories. Then, by taking these trajectories as global guidance, a local planner is designed to guarantee collision-free traversal. In the local planner, we present a distributed local re-planning algorithm by embedding the positional constraints constructed by Voronoi diagrams into the model predictive contouring control (MPCC). The drones only exchange their position information, significantly reducing the communication load. Additionally, the Voronoi-based spatial constraints allow the swarms to eliminate the collision risk caused by asynchronous communication. To reduce onboard computational resource requirements, the local planner adopts the real-time iteration (RTI) technique, executing the optimization only once per control cycle. Both simulation and real-world experiments demonstrate that our approach outperforms state-of-the-art methods in terms of waypoint tracking accuracy, safety, and global time optimality.
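
The Voronoi-based positional constraints mentioned above can be illustrated with a minimal sketch. It assumes each drone knows only its neighbors' positions and turns every perpendicular bisector into a linear half-space a . x <= b, which is the form a local optimizer such as MPCC could take as a constraint. The function names and the `margin` parameter (a safety buffer that shrinks the cell) are hypothetical and are not taken from the paper.

```python
from typing import List, Sequence, Tuple

Vec = Tuple[float, float, float]
HalfSpace = Tuple[Vec, float]  # (a, b) meaning a . x <= b


def dot(u: Vec, v: Vec) -> float:
    """Euclidean inner product of two 3D vectors."""
    return sum(ui * vi for ui, vi in zip(u, v))


def voronoi_halfspaces(own: Vec, neighbors: Sequence[Vec],
                       margin: float = 0.0) -> List[HalfSpace]:
    """Build one half-space per neighbor from the perpendicular bisector.

    `margin` (assumed, in metres) pulls each separating plane toward
    `own`, leaving a buffer between adjacent cells.
    """
    constraints: List[HalfSpace] = []
    for other in neighbors:
        diff = tuple(o - s for o, s in zip(other, own))
        dist = dot(diff, diff) ** 0.5
        if dist == 0.0:
            raise ValueError("two drones report the same position")
        normal = tuple(d / dist for d in diff)  # unit vector toward neighbor
        midpoint = tuple((o + s) / 2.0 for o, s in zip(other, own))
        constraints.append((normal, dot(normal, midpoint) - margin))
    return constraints


def inside_cell(point: Vec, constraints: Sequence[HalfSpace]) -> bool:
    """True if `point` satisfies every half-space constraint."""
    return all(dot(a, point) <= b for a, b in constraints)
```

Because each cell depends only on the current neighbor positions, two drones that stay inside their own buffered cells cannot collide, which is consistent with exchanging position information alone.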

Abstract:
Recognition of human manipulation actions in real-time is essential for safe and effective human-robot interaction and collaboration. The challenge lies in developing a model that is both lightweight enough for real-time execution and capable of generalization. While some existing methods in the literature can run in real-time, they struggle with temporal scalability, i.e., they fail to adapt to long-duration manipulations effectively. To address this, leveraging the generalizable scene graph representations, we propose a new Factorized Graph Sequence Encoder network that not only runs in real-time but also scales effectively in the temporal dimension, thanks to its factorized encoder architecture. Additionally, we introduce the Hand Pooling operation, a simple pooling operation for more focused extraction of the graph-level embeddings. Our model outperforms the previous state-of-the-art real-time approach, achieving a 14.3% and 5.6% improvement in F1-macro score on the KIT Bimanual Action (Bimacs) Dataset and Collaborative Action (CoAx) Dataset, respectively. Moreover, we conduct an extensive ablation study to validate our network design choices. Finally, we compare our model with its architecturally similar RGB-based model on the Bimacs dataset and show the limitations of this model in contrast to ours on such an object-centric manipulation dataset. Our code and trained models are available at https://github.com/eneserdo/FGSE.

Abstract:
Recent advancements in robotic eye surgery and intraoperative 4D optical coherence tomography (iOCT) imaging could enable fully or partially autonomous robotic procedures and enhanced surgical visualization. A fundamental requirement for such applications is rapid semantic segmentation of intraoperative 4D OCT data, which is capable of acquiring volumes at video rate, to provide real-time three-dimensional scene perception. Significant advancements have been made in learning-based 2D and 3D OCT segmentation techniques, pushing the boundaries of accuracy and performance. However, despite these achievements, the computational demands of 2D and 3D convolutions make real-time intraoperative processing of 4D OCT infeasible, even with substantial computational resources. This work introduces a novel real-time iOCT volume segmentation methodology. The novelty consists of a dynamic motion-aware A-scan sampling strategy, followed by an efficient segmentation approach, guaranteeing both speed and accuracy of segmentation. Our A-scan-based processing network leverages a 1D convolution approach to resolve the complexities of multi-dimensional kernels and allow for maximum parallelization, resulting in significantly faster performance. We further show that OCT volume segmentation can be reconstructed from a sparse A-scan sampling strategy that prioritizes areas in which inter-volume motion was detected, and that even missing anatomical surface information below the surgical tools can be reconstructed. Our results show high segmentation performance in dynamic surgical environments and video-rate segmentation performance meeting the demanding processing requirements of 4D OCT and leading to substantial speed improvements over previous methods.

Abstract:
With the continuous advancement of minimally invasive surgery, the incisions have become increasingly smaller, leading to a proportional convergence between the physical dimensions of surgical instruments and the operational space. This phenomenon has exacerbated the issue of visual obstruction caused by the overlapping of the instruments, which now stands as a significant technical impediment to the enhancement of precision in minimally invasive procedures. Vanilla approaches rely on deep learning-based image inpainting techniques to address this issue, but their results are unreliable for surgeons making decisions. This work rethinks hardware design, sacrificing resolution to increase the view of cameras, effectively filling the instrument occluded areas through multi-view technology. Subsequently, a super-resolution method is employed to restore the details of the inpainted images. This innovative approach transforms the uncertainty of deep learning from a large-range pixel inference problem into a more controllable pixel interpolation task, thereby significantly enhancing the reliability and accuracy of the repaired results. Taking spinal endoscopic surgery as an application scenario, we designed a miniature multi-view endoscope system tailored to the specific needs of this surgery. Phantom experiments are conducted to verify the feasibility of the proposed instrument transparency method. The results demonstrate the potential of the proposed method for improving minimally invasive surgery.

Abstract:
Impedance control can be achieved within a model predictive control (MPC) framework for optimization and constraint compliance. However, user-defined or optimization-derived impedance models can be too conservative to achieve a timely convergence, or too aggressive to ensure safety. To address this, an MPC-based impedance control framework with learning-based tuning for predefined-time (PdT) convergence is proposed. On the low level, the framework dynamically selects between a task-oriented and a safety-oriented impedance model based on real-time interaction force modeling and safety assessments, ensuring optimal performance and maintaining safety while interacting with unknown and complex environments. On the high level, the framework achieves PdT convergence via reinforcement learning for meta-parameter tuning, allowing users to specify the desired convergence time upper bound. Lastly, the superiority of the proposed framework is validated on interaction safety and PdT convergence via experiments.

Abstract:
Electrical capacitance tomography (ECT) is a contactless and non-invasive imaging technique, which visualizes the internal permittivity distribution around a region utilizing boundary capacitance measurements. It has been widely used in the fields of object classification, tactile sensing and multiphase flow monitoring. However, due to the inherent nonlinearity and ill-conditioned nature of the ECT inverse problem, its practical implementation remains limited by challenges in image reconstruction accuracy. To tackle the above problems, we propose a patch-based transformer method (PT) for an accurate reconstruction of ECT images. Specifically, the complex capacitance-to-image mapping is systematically decoupled into the capacitance-to-patch feature extraction and patch-to-image reconstruction, enabling more efficient and accurate permittivity distribution recovery through localized feature learning and global context integration. Additionally, a simulation ECT dataset for objects with varying sizes and positions is established.

Abstract:
This paper presents a novel robust online calibration framework for Ultra-Wideband (UWB) anchors in UWB-aided Visual-Inertial Navigation Systems (VINS). Accurate anchor positioning, a process known as calibration, is crucial for integrating UWB ranging measurements into state estimation. While several prior works have demonstrated satisfactory results by using robot-aided systems to autonomously calibrate UWB systems, there are still some limitations: 1) these approaches assume accurate robot localization during the initialization step, ignoring localization errors that can compromise calibration robustness, and 2) the calibration results are highly sensitive to the initial guess of the UWB anchors’ positions, reducing the practical applicability of these methods in real-world scenarios. Our approach addresses these challenges by explicitly incorporating the impact of robot localization uncertainties into the calibration process, ensuring robust initialization. To further enhance the robustness of the calibration results against initialization errors, we propose a tightly-coupled Schmidt Kalman Filter (SKF)-based online refinement method, making the system suitable for practical applications. Simulations and real-world experiments validate the improved accuracy and robustness of our approach.

Abstract:
Enabling robots to grasp and reposition human limbs can significantly enhance their ability to provide assistive care to individuals with severe mobility impairments, particularly in tasks such as robot-assisted bed bathing and dressing. However, existing assistive robotics solutions often assume that the human remains static or quasi-static, limiting their effectiveness. To address this issue, we present Manip4Care, a modular simulation pipeline that enables robotic manipulators to grasp and reposition human limbs effectively. Our approach features a physics simulator equipped with built-in techniques for grasping and repositioning while considering biomechanical and collision avoidance constraints. Our grasping method employs antipodal sampling with force closure to grasp limbs, and our repositioning system utilizes the Model Predictive Path Integral (MPPI) and vector-field-based control method to generate motion trajectories under collision avoidance and biomechanical constraints. We evaluate this approach across various limb manipulation tasks in both supine and sitting positions and compare outcomes for different age groups with differing shoulder joint limits. Additionally, we demonstrate our approach for limb manipulation using a real-world mannequin and further showcase its effectiveness in bed bathing tasks. Our implementation is available at https://github.com/yubink2/Manip4Care.
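
The antipodal sampling step can be illustrated with a small check. This is a sketch under stated assumptions, not the Manip4Care implementation: it applies the classic two-contact antipodal test, in which the line joining the contacts must lie inside both Coulomb friction cones, a condition commonly used as a force-closure proxy for two-finger grasps. The function name, the inward-pointing unit normals, and the friction coefficient `mu` are all assumptions made for illustration.

```python
import math
from typing import Tuple

Vec = Tuple[float, float, float]


def _dot(u: Vec, v: Vec) -> float:
    return sum(ui * vi for ui, vi in zip(u, v))


def is_antipodal(p1: Vec, n1: Vec, p2: Vec, n2: Vec, mu: float) -> bool:
    """Two-contact antipodal test.

    p1, p2: contact points on the limb surface.
    n1, n2: unit surface normals pointing INTO the object (assumed).
    mu:     Coulomb friction coefficient (assumed known).
    """
    axis = tuple(b - a for a, b in zip(p1, p2))  # direction from p1 to p2
    length = math.sqrt(_dot(axis, axis))
    if length == 0.0:
        return False  # coincident contacts cannot form a grasp
    axis = tuple(c / length for c in axis)
    # The friction cone half-angle is atan(mu); compare cosines instead of angles.
    cos_limit = math.cos(math.atan(mu))
    in_cone_1 = _dot(axis, n1) >= cos_limit
    in_cone_2 = -_dot(axis, n2) >= cos_limit
    return in_cone_1 and in_cone_2
```

A sampler would typically draw candidate contact pairs on the limb surface and keep only those that pass this test before checking biomechanical and collision constraints.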

Abstract:
In recent years, the development of robots capable of operating in both aerial and aquatic environments has gained significant attention. This study presents the design and fabrication of a novel aerial-aquatic locomotion robot (AALR). Inspired by the diving beetle, the AALR incorporates a biomimetic propulsion mechanism with power and recovery strokes. The variable stiffness propulsion module (VSPM) uses low melting point alloy (LMPA) and variable stiffness joints (VSJ) to achieve efficient aquatic locomotion while reducing harm to marine life. The AALR’s innovative design integrates the VSPM into the arms of a traditional quadrotor, allowing for effective aerial-aquatic locomotion. The VSPM adjusts joint stiffness through temperature control, meeting locomotion requirements in both aerial and aquatic modes. A dynamic model for the VSPM was developed, with iteratively improved dimensional parameters to increase propulsion force. Experiments focused on aquatic mode analysis and demonstrated the AALR’s swimming capability, achieving a maximum swimming speed of 77 mm/s underwater. The results confirm the AALR’s effective performance in water environments, highlighting its potential for versatile, eco-friendly operations.

Abstract:
Amidst the surge in the use of Artificial Intelligence (AI) for control purposes, classical and model-based control methods maintain their popularity due to their transparency and deterministic nature. However, advanced controllers like Nonlinear Model Predictive Control (NMPC), despite proven capabilities, face adoption challenges due to their computational complexity and unpredictable closed-loop performance in complex validation systems. This paper introduces ExAMPC, a methodology bridging classical control and explainable AI by augmenting the NMPC with data-driven insights to improve the trustworthiness and reveal the optimization solution and closed-loop performance’s sensitivities to physical variables and system parameters. By employing a low-order spline embedding, we reduce the open-loop trajectory dimensionality by over 95%, and integrate it with SHAP and Symbolic Regression from eXplainable AI (XAI) for an approximate NMPC, enabling intuitive physical insights into the NMPC’s optimization routine. The prediction accuracy of the approximate NMPC is enhanced through physics-inspired continuous-time constraint penalties, reducing the predicted continuous trajectory violations by 93%. ExAMPC also enables accurate forecasting of the NMPC’s computational requirements with explainable insights on worst-case scenarios. Experimental validation on automated valet parking and autonomous racing with lap-time optimization demonstrates the methodology’s practical effectiveness for potential real-world applications.

Abstract:
Ego-motion estimation is vital for drones when flying in GPS-denied environments. Vision-based methods struggle when flight speed increases and close-by objects lead to difficult visual conditions with considerable motion blur and large occlusions. To tackle this, vision is typically complemented by state estimation filters that combine a drone model with inertial measurements. However, these drone models are currently learned in a supervised manner with ground-truth data from external motion capture systems, limiting scalability to different environments and drones. In this work, we propose a self-supervised learning scheme to train a neural-network-based drone model using only onboard monocular video and flight controller data (IMU and motor feedback). We achieve this by first training a self-supervised relative pose estimation model, which then serves as a teacher for the drone model. To allow this to work at high speed close to obstacles, we propose an improved occlusion handling method for training self-supervised pose estimation models. Due to this method, the root mean squared error of resulting odometry estimates is reduced by an average of 15%. Moreover, the student neural drone model can be successfully obtained from the onboard data. It even becomes more accurate at higher speeds compared to its teacher, the self-supervised vision-based model. We demonstrate the value of the neural drone model by integrating it into a traditional filter-based VIO system (ROVIO), resulting in superior odometry accuracy on aggressive 3D racing trajectories near obstacles. Self-supervised learning of ego-motion estimation represents a significant step toward bridging the gap between flying in controlled, expensive lab environments and real-world drone applications. The fusion of vision and drone models will enable higher-speed flight and improve state estimation on any drone in any environment.

Abstract:
Ultra-Wideband (UWB) is widely used to mitigate drift in visual-inertial odometry (VIO) systems. Consistency is crucial for ensuring the estimation accuracy of a UWB-aided VIO system. An inconsistent estimator can degrade localization performance, where the inconsistency primarily arises from two main factors: (1) the estimator fails to preserve the correct system observability, and (2) UWB anchor positions are assumed to be known, leading to improper neglect of calibration uncertainty. In this paper, we propose a consistent and tightly-coupled visual-inertial-ranging odometry (CVIRO) system based on the Lie group. Our method incorporates the UWB anchor state into the system state, explicitly accounting for UWB calibration uncertainty and enabling the joint and consistent estimation of both robot and anchor states. Furthermore, observability consistency is ensured by leveraging the invariant error properties of the Lie group. We analytically prove that the CVIRO algorithm naturally maintains the system’s correct unobservable subspace, thereby preserving estimation consistency. Extensive simulations and experiments demonstrate that CVIRO achieves superior localization accuracy and consistency compared to existing methods.

Abstract:
Medical ultrasound (US) imaging is widely used in clinical examinations due to its portability, real-time capability, and radiation-free nature. To address inter- and intra-operator variability, robotic ultrasound systems have gained increasing attention. However, their application in challenging intercostal imaging remains limited due to the lack of an effective scan path generation method within the constrained acoustic window. To overcome this challenge, we explore the potential of tactile cues for characterizing subcutaneous rib structures as an alternative signal for ultrasound segmentation-free bone surface point cloud extraction. Compared to 2D US images, 1D tactile-related signals offer higher processing efficiency and are less susceptible to acoustic noise and artifacts. By leveraging robotic tracking data, a sparse tactile point cloud is generated through a few scans along the rib, mimicking human palpation. To robustly map the scanning trajectory into the intercostal space, the sparse tactile bone location point cloud is first interpolated to form a denser representation. This refined point cloud is then registered to an image-based dense bone surface point cloud, enabling accurate scan path mapping for individual patients. Additionally, to ensure full coverage of the object of interest, we introduce an automated tilt angle adjustment method to visualize structures beneath the bone. To validate the proposed method, we conducted comprehensive experiments on four distinct phantoms. The final scanning waypoint mapping achieved Mean Nearest Neighbor Distance (MNND) and Hausdorff distance (HD) errors of 3.41 mm and 3.65 mm, respectively, while the reconstructed object beneath the bone had errors of 0.69 mm and 2.2 mm compared to the CT ground truth.
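
The two reported error metrics can be made concrete with a short sketch. The abstract does not state whether MNND is computed in one direction or symmetrically, so the version below assumes a directed mean from the evaluated points to the reference points, paired with the standard symmetric Hausdorff distance. Function names are hypothetical.

```python
import math
from typing import List, Sequence, Tuple

Point = Tuple[float, float, float]


def nearest_distances(source: Sequence[Point],
                      target: Sequence[Point]) -> List[float]:
    """Distance from every source point to its closest target point."""
    return [min(math.dist(p, q) for q in target) for p in source]


def mean_nearest_neighbor_distance(source: Sequence[Point],
                                   target: Sequence[Point]) -> float:
    """Directed MNND: average closest-point distance from source to target."""
    distances = nearest_distances(source, target)
    return sum(distances) / len(distances)


def hausdorff_distance(a: Sequence[Point], b: Sequence[Point]) -> float:
    """Symmetric Hausdorff distance: the worst closest-point error in either direction."""
    return max(max(nearest_distances(a, b)), max(nearest_distances(b, a)))
```

This brute-force form is quadratic in the number of points; for dense bone-surface clouds a spatial index such as a k-d tree would be the usual choice.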

Abstract:
Collaborative Perception enables multiple agents, such as autonomous vehicles and infrastructure, to share sensor data via vehicular networks so that each agent gains an extended sensing range and better perception quality. Despite its promising benefits, realizing the full potential of such systems faces significant challenges due to inherent imperfections in underlying system layers, consisting of network layer imperfections and hardware-level noises. Such imperfections and noises include packet loss in vehicular networks, localization errors from GPS measurements, and synchronization errors caused by clock deviation and network latency. To address these challenges, we propose a novel end-to-end collaborative perception framework, SCORPION, that harnesses the AI co-design of the application layer and system layer to tackle the aforementioned imperfections. SCORPION consists of three main components: lost bird’s eye view feature reconstruction (L-BEV-R) recovers lost spatial features during lossy V2X communication, while deformable spatial cross attention (DSCA) and temporal alignment (TA) compensate for localization and synchronization errors in feature fusion. Experimental results on both synthetic and real-world collaborative 3D object detection datasets demonstrate that SCORPION advances the state-of-the-art collaborative perception methods by 5.9 - 13.2 absolute AP on both standard and noisy scenarios.

Abstract:
Weakly supervised monocular 3D detection, while less annotation-intensive, often struggles to capture the global context required for reliable 3D reasoning. Conventional label-efficient methods focus on object-centric features, neglecting contextual semantic relationships that are critical in complex scenes. In this work, we propose a Context-Aware Weak Supervision for Monocular 3D object detection, namely CA-W3D, to address this limitation in a two-stage training paradigm. Specifically, we first introduce a pre-training stage employing Region-wise Object Contrastive Matching (ROCM), which aligns regional object embeddings derived from a trainable monocular 3D encoder and a frozen open-vocabulary 2D visual grounding model. This alignment encourages the monocular encoder to discriminate scene-specific attributes and acquire richer contextual knowledge. In the second stage, we incorporate a pseudo-label training process with a Dual-to-One Distillation (D2OD) mechanism, which effectively transfers contextual priors into the monocular encoder while preserving spatial fidelity and maintaining computational efficiency during inference. Extensive experiments conducted on the public KITTI benchmark demonstrate the effectiveness of our approach, surpassing the SoTA method over all metrics, highlighting the importance of contextual-aware knowledge in weakly-supervised monocular 3D detection. For implementation details: CAW3D

Abstract:
Autonomous aerial vehicles play a critical role in search and rescue operations, where navigation through cluttered and confined environments is essential. To this end, this paper presents a novel trajectory planning framework for omnidirectional drones that dynamically adjusts tracking velocity based on the platform’s proximity to obstacles, ensuring a balance between safety and efficiency in cluttered and challenging environments. The proposed approach generates a geometric path to the target location. At each waypoint, the minimum distance between the drone’s convex hull and surrounding obstacles is determined, allowing the computation of the velocity constraints. By slowing down near obstacles and accelerating in open spaces, the method enhances both safety and maneuverability. The framework is validated through real-world experiments using the OmniOcta UAV, demonstrating its ability to navigate through constrained spaces. Furthermore, we present an experimental study to investigate key sources of tracking deviations, including propeller dynamics and aerodynamic interactions near obstacles.
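
The idea of slowing down near obstacles and accelerating in open space can be sketched as a clearance-to-speed mapping. The abstract does not give the actual constraint law, so the linear ramp below and every parameter value (`v_min`, `v_max`, `d_near`, `d_far`) are assumptions chosen only for illustration.

```python
from typing import List, Sequence


def velocity_limit(clearance: float, v_min: float = 0.2, v_max: float = 2.0,
                   d_near: float = 0.3, d_far: float = 1.5) -> float:
    """Map the minimum hull-to-obstacle distance (m) to a speed limit (m/s).

    Below d_near the limit is v_min, above d_far it is v_max, and in
    between it is interpolated linearly. All defaults are hypothetical.
    """
    if d_far <= d_near:
        raise ValueError("d_far must be larger than d_near")
    ratio = (clearance - d_near) / (d_far - d_near)
    ratio = min(1.0, max(0.0, ratio))  # clamp to [0, 1]
    return v_min + ratio * (v_max - v_min)


def velocity_profile(clearances: Sequence[float]) -> List[float]:
    """Speed limit for each waypoint, given its precomputed clearance."""
    return [velocity_limit(c) for c in clearances]
```

In a full planner the per-waypoint limits would be passed to the trajectory time-parameterization step rather than applied directly as velocity commands.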

Abstract:
We introduce R2LDM, an innovative approach for generating dense and accurate 4D radar point clouds, guided by corresponding LiDAR point clouds. Instead of utilizing range images or bird’s eye view (BEV) images, we represent both LiDAR and 4D radar point clouds using voxel features, which more effectively capture 3D shape information. Subsequently, we propose the Latent Voxel Diffusion Model (LVDM), which performs the diffusion process in the latent space. Additionally, a novel Latent Point Cloud Reconstruction (LPCR) module is utilized to reconstruct point clouds from high-dimensional latent voxel features. As a result, R2LDM effectively generates LiDAR-like point clouds from paired raw radar data. We evaluate our approach on two different datasets, and the experimental results demonstrate that our model achieves 6- to 10-fold densification of radar point clouds, outperforming state-of-the-art baselines in 4D radar point cloud super-resolution. Furthermore, the enhanced radar point clouds generated by our method significantly improve downstream tasks, achieving up to 31.7% improvement in point cloud registration recall rate and 24.9% improvement in object detection accuracy.
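
As a rough illustration of a voxel representation, the sketch below bins a point cloud into a regular grid and keeps a centroid and a point count per occupied voxel. The abstract does not specify which voxel features R2LDM uses, so this feature choice, the function name, and the `voxel_size` parameter are assumptions.

```python
import math
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Point = Tuple[float, float, float]
VoxelKey = Tuple[int, int, int]


def voxelize(points: Sequence[Point],
             voxel_size: float) -> Dict[VoxelKey, Tuple[Point, int]]:
    """Group points by voxel index and return (centroid, count) per voxel."""
    if voxel_size <= 0.0:
        raise ValueError("voxel_size must be positive")
    buckets: Dict[VoxelKey, List[Point]] = defaultdict(list)
    for point in points:
        # floor() keeps negative coordinates in the correct cell.
        key = tuple(math.floor(c / voxel_size) for c in point)
        buckets[key].append(point)
    features: Dict[VoxelKey, Tuple[Point, int]] = {}
    for key, members in buckets.items():
        count = len(members)
        centroid = tuple(sum(m[i] for m in members) / count for i in range(3))
        features[key] = (centroid, count)
    return features
```

Unlike a range image or BEV projection, this grid keeps all three spatial axes, which is the property the abstract highlights for capturing 3D shape.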

Abstract:
Recent developments in 3D Gaussian Splatting have made significant advances in surface reconstruction. However, scaling these methods to large-scale scenes remains challenging due to high computational demands and the complex dynamic appearances typical of outdoor environments. These challenges hinder the application in aerial surveying and autonomous driving. This paper proposes a novel solution to reconstruct large-scale surfaces with fine details, supervised by full-sized images. Firstly, we introduce a coarse-to-fine strategy to reconstruct a coarse model efficiently, followed by adaptive scene partitioning and sub-scene refining from image segments. Additionally, we integrate a decoupling appearance model to capture global appearance variations and a transient mask model to mitigate interference from moving objects. Finally, we expand the multi-view constraint and introduce a single-view regularization for texture-less areas. Our experiments were conducted on the publicly available dataset GauU-Scene V2, which was captured using unmanned aerial vehicles. To the best of our knowledge, our method outperforms existing NeRF-based and Gaussian-based methods, achieving high-fidelity visual results and accurate surface from full-size image optimization. Open-source code will be available on GitHub.

Abstract:
High-definition (HD) maps are essential for autonomous driving, as they provide precise road information for downstream tasks. Recent advances highlight the potential of temporal modeling in addressing challenges like occlusions and extended perception range. However, existing methods either fail to fully exploit temporal information or incur substantial computational overhead in handling extended sequences. To tackle these challenges, we propose MambaMap, a novel framework that efficiently fuses long-range temporal features in the state space to construct online vectorized HD maps. Specifically, MambaMap incorporates a memory bank to store and utilize information from historical frames, dynamically updating BEV features and instance queries to improve robustness against noise and occlusions. Moreover, we introduce a gating mechanism in the state space, selectively integrating dependencies of map elements with high computational efficiency. In addition, we design innovative multi-directional and spatial-temporal scanning strategies to enhance feature extraction at both BEV and instance levels. These strategies significantly boost the prediction accuracy of our approach while ensuring robust temporal consistency. Extensive experiments on the nuScenes and Argoverse2 datasets demonstrate that our proposed MambaMap approach outperforms state-of-the-art methods across various splits and perception ranges. Source code will be available at https://github.com/ZiziAmy/MambaMap.

Abstract:
With the growing number of patients experiencing knee-related conditions, total knee arthroplasty (TKA) has become a common procedure, where a 3D visualisation of the patient’s tibia and fibula is essential for preoperative planning. Traditional imaging techniques, such as computed tomography (CT), often expose patients to high levels of radiation or impose significant financial costs. As an alternative, this paper proposes a novel approach that reconstructs a 3D model of the tibia and fibula using only two X-ray images (taken from the coronal and sagittal planes) and a general template, significantly reducing radiation exposure and financial burden. Our algorithm of 3D reconstruction for patient-specific anatomies combines point-based deformation with deep learning techniques. Initially, the general model undergoes a preliminary deformation to match the patient tibia and fibula dimensions. This pre-deformed model then serves as a template, followed by a fine deformation process via a self-supervised graph convolutional network (GCN), whose parameters are trained iteratively by comparing the template projection and the X-ray measurements. Following tests in simulations, cadaver experiments, and in-vivo experiments, our proposed algorithm demonstrates state-of-the-art accuracy and exceptional robustness across different evaluation metrics. Our code is available at https://github.com/DrKaiPan/tfDeform_GCN.git

Abstract:
Accurate system identification is crucial for reducing trajectory drift in bipedal locomotion, particularly in reinforcement learning and model-based control. In this paper, we present a novel control framework that integrates system identification into the reinforcement learning training loop using differentiable simulation. Unlike traditional approaches that rely on direct torque measurements, our method estimates system parameters using only trajectory data (positions, velocities) and control inputs. We leverage the differentiable simulator MuJoCo-XLA to optimize system parameters, ensuring that simulated robot behavior closely aligns with real-world motion. This framework enables scalable and flexible parameter optimization. It supports fundamental physical properties such as mass and inertia. Additionally, it handles complex system nonlinear behaviors, including advanced friction models, through neural network approximations. Experimental results show that our framework significantly improves trajectory following. It reduces rotational deviation by 75% and increases travel distance in the commanded direction by 46% compared to a baseline reinforcement learning method.
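
The principle of identifying parameters from trajectories and control inputs alone can be shown on a toy problem. The sketch below is not the paper's MuJoCo-XLA pipeline: it assumes a 1D point mass, propagates the sensitivity of velocity to the inverse mass by hand in place of automatic differentiation, and runs plain gradient descent on a velocity-matching loss. The function names, learning rate, and iteration count are hypothetical and tuned only for this example.

```python
from typing import List, Sequence, Tuple


def rollout(inv_mass: float, forces: Sequence[float], dt: float,
            v0: float = 0.0) -> Tuple[List[float], List[float]]:
    """Simulate velocities and their derivatives with respect to inv_mass."""
    velocity, sensitivity = v0, 0.0
    velocities: List[float] = []
    sensitivities: List[float] = []
    for force in forces:
        velocity += inv_mass * force * dt   # explicit Euler step
        sensitivity += force * dt           # d(velocity) / d(inv_mass)
        velocities.append(velocity)
        sensitivities.append(sensitivity)
    return velocities, sensitivities


def identify_mass(forces: Sequence[float], observed: Sequence[float],
                  dt: float, inv_mass: float = 1.0,
                  lr: float = 0.01, steps: int = 200) -> float:
    """Fit the mass by gradient descent on the squared velocity error."""
    for _ in range(steps):
        simulated, sens = rollout(inv_mass, forces, dt)
        gradient = 2.0 * sum((v - o) * s
                             for v, o, s in zip(simulated, observed, sens))
        inv_mass -= lr * gradient
    return 1.0 / inv_mass
```

With a differentiable simulator the sensitivities come from automatic differentiation, so the same loop extends to inertia or learned friction terms without hand-derived gradients.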

Affiliations: Embodied Intelligence Lab, the National Elite Institute of Engineering, Chongqing University, Chongqing, China; TECNALIA, Basque Research and Technology Alliance (BRTA), Donostia-San Sebastian, Spain; Department of Design, Manufacturing and Engineering Management, Centre for Precision Manufacturing, University of Strathclyde, Glasgow, U.K.; Institute for Imaging, Data and Communications, School of Engineering, University of Edinburgh, Edinburgh, U.K.; Department of Biomedical Engineering, University of Strathclyde, Glasgow, U.K.

Abstract:
Surgical robot task automation has recently attracted great attention due to its potential to benefit both surgeons and patients. Reinforcement learning (RL) based approaches have demonstrated promising ability to perform automated surgical manipulations on various tasks. To address the exploration challenge, expert demonstrations can be utilized to enhance the learning efficiency via imitation learning (IL) approaches. However, the successes of such methods normally rely on both states and action labels. Unfortunately, action labels can be hard to capture or their manual annotation is prohibitively expensive owing to the requirement for expert knowledge. Emulating expert behaviour using noisy or inaccurate labels poses significant risks, including unintended surgical errors that may result in patient discomfort or, in more severe cases, tissue damage. It therefore remains an appealing and open problem to leverage expert data composed of pure states into RL. In this work, we present an actor-critic RL framework, termed AC-SSIL, to overcome this challenge of improving learning process with state-only demonstrations collected by an unknown expert policy. It adopts a self-supervised IL method, dubbed SSIL, to effectively incorporate expert states into RL paradigms by retrieving from demonstrations the nearest neighbours of the query state and utilizing the bootstrapping of actor networks. It applies similarity-based regularization and improves its prediction capacity jointly with the actor network. We showcase through experiments on an open-source surgical simulation platform that our method delivers remarkable improvements over the RL baseline and exhibits comparable performance against action based IL methods, which implies the efficacy and potential of our method for expert demonstration-guided learning scenarios. Code will be made publicly available at https://github.com/Jingshuai-cqu/AC-SSIL.

Affiliations: School of Mechanical Engineering, Anhui University of Technology, Ma’anshan, China; School of Intelligence Science and Technology, University of Science and Technology Beijing, Beijing, China; College of Quality and Technical Supervision, Hebei University, Baoding, China; School of Engineering and Technology, China University of Geosciences (Beijing), Beijing, China; ByteDance Seed, Beijing, China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing

Abstract:
Force estimation is the core indicator for evaluating the performance of tactile sensors, and it is also the key technical path to achieving precise force feedback mechanisms. This study proposes a design method for a vision-based tactile sensor (VBTS) that integrates a magnetic perception mechanism, and develops a new tactile sensor called MagicGel. The sensor uses strong magnetic particles as markers and captures magnetic field changes in real time through Hall sensors. On this basis, MagicGel achieves the coordinated optimization of multimodal perception capabilities: it not only has fast response characteristics, but can also perceive non-contact status information of home electronic products. Specifically, MagicGel simultaneously analyzes the visual characteristics of the magnetic particles and the changes in magnetic field intensity, ultimately improving force estimation capabilities.

Abstract:
Map matching and registration are essential tasks in robotics for localisation and integration of multi-session or multi-robot data. Traditional methods rely on cameras or LiDARs to capture visual or geometric information but struggle in challenging conditions like smoke or dust. Magnetometers, on the other hand, detect magnetic fields, revealing features invisible to other sensors and remaining robust in such environments. In this paper, we introduce Mag-Match, a novel method for extracting and describing features in 3D magnetic vector field maps to register different maps of the same area. Our feature descriptor, based on higher-order derivatives of magnetic field maps, is invariant to global orientation, eliminating the need for gravity-aligned mapping. To obtain these higher-order derivatives map-wide given point-wise magnetometer data, we leverage a physics-informed Gaussian process to perform efficient and recursive probabilistic inference of both the magnetic field and its derivatives. We evaluate Mag-Match in simulated and real-world experiments against a SIFT-based approach, demonstrating accurate map-to-map, robot-to-map, and robot-to-robot transformations, even without initial gravitational alignment.

Abstract:
Various studies on perception-aware planning have been proposed to enhance the state estimation accuracy of quadrotors in visually degraded environments. However, many existing methods heavily rely on prior environmental knowledge and face significant limitations in previously unknown environments with sparse localization features, which greatly limits their practical application. In this paper, we present a perception-aware planning method for quadrotor flight in unknown and feature-limited environments that properly allocates perception resources among environmental information during navigation. We introduce a viewpoint transition graph that allows for the adaptive selection of local target viewpoints, which guide the quadrotor to efficiently navigate to the goal while maintaining sufficient localizability and without being trapped in feature-limited regions. During the local planning, a novel yaw trajectory generation method that simultaneously considers exploration capability and localizability is presented. It constructs a localizable corridor via feature co-visibility evaluation to ensure localization robustness in a computationally efficient way. Through validations conducted in both simulation and real-world experiments, we demonstrate the feasibility and real-time performance of the proposed method. The source code is released for the reference of the community.

Abstract:
Testing and validating Autonomous Vehicle (AV) performance in safety-critical and diverse scenarios is crucial before real-world deployment. However, manually creating such scenarios in simulation remains a significant and time-consuming challenge. This work introduces a novel method that generates dynamic temporal scene graphs corresponding to diverse traffic scenarios, on-demand, tailored to user-defined preferences, such as AV actions, sets of dynamic agents, and criticality levels. A temporal Graph Neural Network (GNN) model learns to predict relationships between ego-vehicle, agents, and static structures, guided by real-world spatiotemporal interaction patterns and constrained by an ontology that restricts predictions to semantically valid links. Our model consistently outperforms the baselines in accurately generating links corresponding to the requested scenarios. We render the predicted scenarios in simulation to further demonstrate their effectiveness as testing environments for AV agents.

Abstract:
This paper proposes an innovative virtual chain-based kinematic calibration for the 4PPa-2PaR parallel manipulators with subchain architectures. Conventional calibration methods for such architectures suffer from inherent limitations due to coupled parameter constraints and restricted solution spaces caused by joint displacement and structural parameter dependencies. The presented methodology introduces three fundamental advancements: (1) a novel parameter assignment strategy enabling independent joint/link parameter definition across different kinematic chains, (2) systematic transformation of the constrained optimization into an unconstrained one, and (3) significant expansion of the error parameter solution space through virtual chain modeling. Comparative experiments on the physical prototype demonstrate improvements in both orientation and position accuracy compared to existing methods.

Abstract:
4D millimeter-wave radar plays a pivotal role in autonomous driving due to its cost-effectiveness and robustness in adverse weather. However, the application of 4D radar point clouds in 3D perception tasks is hindered by their inherent sparsity and noise. To address these challenges, we propose LGDD, a novel local-global synergistic dual-branch 3D object detection framework using 4D radar. Specifically, we first introduce a point-based branch, which utilizes a voxel-attended point feature extractor (VPE) to integrate semantic segmentation with cluster voting, thereby mitigating radar noise and extracting locally clustered instance features. Then, for the conventional pillar-based branch, we design a query-based feature pre-fusion (QFP) to address the sparsity and enhance global context representation. Additionally, we devise a proposal mask to filter out noisy points, enabling more focused clustering on regions of interest. Finally, we align the local instances with global context through a semantics-geometry aware fusion (SGF) module to achieve comprehensive scene understanding. Extensive experiments demonstrate that LGDD achieves state-of-the-art performance on the public View-of-Delft and TJ4DRadSet datasets. Source code is available at https://github.com/shawnnnkb/LGDD.

Abstract:
Early detection of gastrointestinal (GI) cancer is critical for improving treatment outcomes and survival rates. Yet conventional endoscopic techniques remain invasive and labor-intensive, thus presenting significant challenges for cancer screening on large populations. Current commercially available sponge-based sampling devices are passive and limited in their reach to the esophagus, hindering comprehensive sampling in the stomach. Here, for the first time, we report SpongeBot, a non-invasive soft mini-robot designed for active cell sampling in the upper GI tract, with a particular focus on the stomach. SpongeBot integrates an open-cell sponge and a magnetic actuator, enabling precise and controlled sampling under a wireless external magnetic field. To accommodate the intricate anatomy of the stomach, the robot is capable of transitioning between two modes of motion, navigation and sampling, allowing trajectory control and targeted sampling at desired locations. A kinematic model is established to accurately represent the locomotion of the robot on wet mucosa surfaces. Pilot testing on ex vivo porcine stomachs is successfully performed, with sufficient cells sampled for subsequent clinical laboratory testing. Histological analysis shows that the sampling causes no detectable damage to the mucosa layer. SpongeBot has potential as a cell sampling device for the upper GI tract to be deployed in primary care settings for cancer prevention.

Abstract:
Accurate point cloud information is important for robot perception and autonomous driving. Although advanced 4D radar can provide point clouds with higher resolution than 3D radar, its data still contains a significant amount of noise due to its measurement principle. To solve this issue, we propose RDN (Radar Denoising Network), a denoising network specifically designed for 4D radar. RDN includes three innovative modules. First, to overcome the noisy nature of radar points, we design a feature similarity-based farthest point sampling module (FS-FPS), which can extract representative sampling points from the noisy point cloud. Second, to address feature propagation issues caused by the sparse and long-range characteristics of 4D radar points, we introduce a virtual feature point prediction (VFP) module and an iterative upsampling (IUS) module. The VFP module generates virtual feature points through the network to serve as bridges for information transmission, while the IUS module uses an iterative approach to gradually refine feature propagation. Experiments on the MSC-RAD4D and NTU4DRadLM datasets demonstrate the effectiveness and generalization of our method. In addition, odometry experiments demonstrate the practical value of point cloud denoising in improving robot perception.
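
For readers unfamiliar with farthest point sampling, the sketch below shows the standard Euclidean variant that FS-FPS builds on. It is not the paper's module: the feature-similarity criterion is not specified in the abstract, so plain 3D distance is used here as an assumption, and the function name is illustrative.

```python
import math


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points chosen so that each new point is the one
    farthest from all points selected so far (plain Euclidean distance)."""
    if not 0 < k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    selected = [start]
    # Distance from every point to its nearest already-selected point.
    min_dist = [math.dist(p, points[start]) for p in points]
    while len(selected) < k:
        nxt = max(range(len(points)), key=min_dist.__getitem__)
        selected.append(nxt)
        for i, p in enumerate(points):
            d = math.dist(p, points[nxt])
            if d < min_dist[i]:
                min_dist[i] = d
    return selected


# Usage: two near-duplicates of the first point are skipped in favour of
# points that cover the rest of the cloud.
cloud = [(0, 0, 0), (0.1, 0, 0), (10, 0, 0), (5, 0, 0), (0, 0.1, 0)]
sample = farthest_point_sampling(cloud, 3)
```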

Abstract:
Infrared imaging helps improve the perception capabilities of autonomous driving in complex weather conditions such as fog, rain, and low light. However, infrared images often suffer from low contrast, especially for non-heat-emitting targets like bicycles, which significantly affects the performance of downstream high-level vision tasks. Furthermore, achieving contrast enhancement without amplifying noise or losing important information remains a challenge. To address these challenges, we propose a task-oriented infrared image enhancement method. Our approach consists of two key components: layer decomposition and saliency information extraction. First, we design an l0-l1 layer decomposition method for infrared images, which enhances scene details while preserving dark region features, providing more features for subsequent saliency information extraction. Then, we propose a morphological reconstruction-based saliency extraction method that effectively extracts and enhances target information without amplifying noise. Our method improves the image quality for object detection and semantic segmentation tasks. Extensive experiments demonstrate that our approach outperforms state-of-the-art methods.

Abstract:
Planar surfaces are commonly found in man-made underwater environments and can be employed to support underwater SLAM. This work focuses on 3D plane extraction, building on two-dimensional acoustic scans collected from an imaging sonar. The novel contribution of our algorithm exploits the sonar’s wider beamwidth and ability to collect secondary echoes from these structures to extract a three-dimensional surface from the acquired acoustic image. Building on a Hough Transform-based algorithm adapted to polar-based acoustic imagery, line feature detection supports plane representation segmentation. An inverse sensor model is subsequently employed to estimate additional plane parameters: inclination, length, and height. Experimental assessment in a confined controlled environment is introduced, validating the accuracy of the algorithm. Additional results from a dam shaft scenario are also presented to assess the potential of the developed tool.
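
The line-detection step can be pictured with a basic Hough transform. The sketch below votes in a (theta, rho) accumulator for 2D Cartesian points; it does not reproduce the paper's adaptation to polar acoustic imagery, and the bin sizes and function name are assumptions chosen for illustration.

```python
import math
from collections import Counter


def hough_peak(points, theta_step_deg=1, rho_res=1.0):
    """Vote for lines rho = x*cos(theta) + y*sin(theta) and return the
    strongest one as (theta in degrees, rho, number of votes)."""
    votes = Counter()
    for x, y in points:
        for theta_deg in range(0, 180, theta_step_deg):
            theta = math.radians(theta_deg)
            rho = x * math.cos(theta) + y * math.sin(theta)
            votes[(theta_deg, round(rho / rho_res))] += 1
    (theta_deg, rho_bin), count = votes.most_common(1)[0]
    return theta_deg, rho_bin * rho_res, count


# Usage: 100 returns along a wall at x = 5 plus two outliers.
wall = [(5.0, float(y)) for y in range(100)]
outliers = [(20.0, 3.0), (40.0, 77.0)]
theta_deg, rho, count = hough_peak(wall + outliers)
```

Each detected (theta, rho) pair describes one line; in the setting of the abstract, such line features would then feed the plane segmentation and the inverse sensor model.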

Abstract:
This study proposes an optoelectronic navigation strategy leveraging Ag-SiO2 microspheres as a “microtruck” to overcome the limitations of traditional optoelectronic tweezers (OET) in manipulating negative dielectrophoresis (nDEP) particles. By dynamically adjusting electric field frequency and optical parameters, we regulate particle-induced dielectrophoretic forces (PiDEP) to achieve efficient adsorption, high-speed transport, and site-specific unloading of nDEP-responsive cargo. Experimental results demonstrate a sevenfold enhancement in manipulation velocity compared to conventional direct optical methods, along with the capability for simultaneous multi-particle transport. In addition, we utilized finite element simulations to analyze the optimal electric field frequency and optical parameters for the microtruck’s loading and unloading processes. Furthermore, a systematic analysis of critical velocities and failure modes under varying cargo loads validates the robustness of this approach. Demonstrated within a labyrinthine microenvironment, this strategy enables programmable navigation, sequential cargo handling, and micrometer positional accuracy. This study provides an efficient solution for biomedical applications, including precise single-cell manipulation and targeted drug delivery.

Abstract:
This paper addresses the challenges of automating vibratory sieve shaker operations in a materials laboratory, focusing on three critical tasks: 1) dual-arm lid manipulation in 3 cm clearance spaces, 2) bimanual handover in overlapping workspaces, and 3) obstructed powder sample container delivery with orientation constraints. These tasks present significant challenges, including inefficient sampling in narrow passages, the need for smooth trajectories to prevent spillage, and suboptimal paths generated by conventional methods. To overcome these challenges, we propose a hierarchical planning framework combining Prior-Guided Path Planning and Multi-Step Trajectory Optimization. The former uses a finite Gaussian mixture model to improve sampling efficiency in narrow passages, while the latter refines paths by shortening, simplifying, imposing joint constraints, and B-spline smoothing. Experimental results demonstrate the framework’s effectiveness: planning time is reduced by up to 80.4%, and waypoints are decreased by 89.4%. Furthermore, the system completes the full vibratory sieve shaker operation workflow in a physical experiment, validating its practical applicability for complex laboratory automation.

Abstract:
Transparent objects are common in industrial automation and daily life. However, accurate visual perception of these objects remains challenging due to their reflective and refractive properties. Most previous studies fail to capture contextual information or rely on regression-based methods at the decoder stage, suffering from overfitting and unsatisfactory object details. To overcome these limitations, we present a novel depth completion framework for transparent objects with a diffusion denoising approach (DCT-Diffusion). First, we adopt a transformer-based encoder to globally learn the depth relationships from different parts of the input by modeling long-distance dependencies. Then, we introduce a diffusion model to generate refined depth maps from a random depth distribution. Through iterative refinement, our model can progressively enhance depth map details and achieve fine-grained performance. Lastly, a conditioned fusion module is developed, which utilizes encoder features as visual conditions and fuses them with the denoising block at each step using augmented attention. Extensive comparative studies and cross-domain experiments show that DCT-Diffusion outperforms previous methods and significantly improves robustness and generalization ability. Moreover, visualization results further illustrate that our method can generate depth maps with more complete geometry and clearer boundaries.

Abstract:
Improving the adaptability of small-sized quadruped robots has been a longstanding challenge in robotics. However, the weak whole-body coordination in existing small-sized quadruped robots limits their locomotion in many environments. In this work, we propose a teacher-student online learning framework for agile whole-body control of small-sized quadruped robots with a flexible spine. We first select a simple and effective gait pattern, the diagonal symmetrical sequence, using a dynamics model. Based on the reference motions provided by the gait pattern and combined with privileged information, we train a teacher policy to generate high-quality motion data. After setting the state space to match the actual robot’s state space, we initialize the robot’s initial state using the teacher data and train a student policy. Finally, we deploy the student policy on the SQuRo-Lite, a small-sized quadruped robot with a flexible spine, demonstrating that our approach can achieve stable yet dynamic locomotion for walking and turning. In the variable-spacing slalom experiment, the robot is able to flexibly adjust the motion patterns of its spine and legs based on commands, enabling dynamic changes in its turning radius. This further validates that our approach can achieve agile whole-body control for small-sized quadruped robots. This work helps broaden the application scenarios of small quadruped robots.

Abstract:
This paper presents a model-driven distributed controller development approach, supported by Petri nets modeling that relies on low-code strategies. The proposed approach addresses distributed controller systems having automation and/or cyber-physical systems as targeted application areas. The distributed controller system can be viewed as a globally asynchronous locally synchronous (GALS) system and its development is fully supported by a web-based tool framework offering comprehensive support for integrating hardware-software co-design techniques when mapping components to specific execution platforms. All development phases are supported, starting with the specification by editing the Petri net model and ending up with the deployment to specific implementation platforms, including FPGAs and microcontroller-based ones, adopting a low-code strategy. The framework also supports simulation and behavioral property verification. An example of an automation system application is presented to illustrate the adequacy of the approach.

Abstract:
Bioinspired magnetic cilia have attracted tremendous attention from researchers due to their flexible nature and their remotely controlled manipulation of droplets and fluids. Nonetheless, controlling catalytic processes with magnetic cilia remains underexplored. Here, we present magnetic soft cilia carpets with different magnetizations to control enzyme-like chemical reactions. We demonstrate a methodology to optimize material, geometric, and magnetic field parameters to achieve high-frequency oscillations of the cilia under an alternating magnetic field at 15 Hz. We also show stable oscillation of water-based droplets on the cilia surface at a magnetic field frequency of 10 Hz without droplet detachment. Furthermore, we demonstrate enhanced droplet catalysis resulting from droplet oscillation on the cilia surface. As a concept for automated lab analysis, we demonstrate droplet transport by magnetic cilia onto a flexible pH sensor. Finally, as a proof-of-concept, we show control of nanozyme reaction rates by varying cilia magnetization angles, achieving a fourfold enhancement of reaction rate under a rotating magnetic field compared to unmagnetized cilia. These findings offer a promising approach to remotely control enzyme-like reactions and droplet catalysis using magnetic cilia.

Abstract:
Multi-modal fusion perception enhances robotic performance in complex tasks by providing more comprehensive information than a single modality. While tactile and proprioceptive sensing are effective for direct contact tasks like grasping, current research mainly focuses on vision-language fusion, neglecting other embodied modalities. The primary causes of this limitation are the difficulty of generating natural language labels for embodied information such as tactile and proprioceptive signals and of aligning them with vision and language. To address this, we introduce VLaPT, a novel multi-modal grasping dataset that aligns vision and language (VL) with posture and tactile (PT), enabling robots to sense differently from environment to self. VLaPT includes 75 objects, 1,533 grasps, and over 78K synchronized vision-language-posture-tactile pairs. The dataset incorporates structured, rich-text descriptions generated using modality-level language annotation templates, ensuring effective cross-modality alignment. Leveraging this dataset, we trained a lightweight multi-modal alignment framework, CLIP-ME, which enhances the performance of several downstream tasks with only a 5% increase in parameters. VLaPT is publicly available at https://huggingface.co/datasets/xsdfasfgsa/VLaPT.

Abstract:
The persistent multi-solution challenge in parallel robots’ forward kinematics (FK) has impeded high-precision real-time control. Current data-driven approaches face limitations in predicting accurate and unique solutions, ensuring cross-architectural generalizability, and validating results through continuous trajectory experiments. To address these issues, this work proposes the Partition-Learning-Selection-Augmentation (PLSA) framework, which systematically resolves FK multi-solution challenges. PLSA clusters potential solutions through data partitioning, predicts all feasible solutions in parallel using deep neural networks (DNNs), integrates a selection mechanism to identify optimal solutions, and refines accuracy via the Newton-Raphson method. Cross-configuration tests on Stewart and 3-RRS parallel robots validate PLSA’s adaptability to different architectures, achieving at least 98.99% accuracy and a computation speed of approximately 30Hz. Additionally, three neural networks (CNN, KAN, and Transformer) are implemented and compared in the Learning-based Selection module, demonstrating PLSA’s generalizability across diverse networks. Comparative studies against analytical, numerical iterative, and prior data-driven methods confirm PLSA’s unique multi-solution resolution capability, delivering submillimeter accuracy with millisecond-level computation, thus establishing a real-time FK calculation methodology.
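
Both the multi-solution problem and the Newton-Raphson refinement step can be seen on a toy mechanism. The sketch below is not the PLSA implementation: it assumes a hypothetical planar mechanism whose moving point is connected by two legs to base anchors at (0, 0) and (base_width, 0), so that a given pair of leg lengths admits two mirror-image poses. The initial guess, which in PLSA would come from the learned prediction and selection stages, decides which solution Newton-Raphson converges to.

```python
import math


def refine_pose(leg_lengths, guess, base_width=4.0, tol=1e-12, max_iter=50):
    """Newton-Raphson solve of the two leg-length constraints
        x^2 + y^2 = l1^2   and   (x - base_width)^2 + y^2 = l2^2
    starting from an initial guess (x, y)."""
    l1, l2 = leg_lengths
    x, y = guess
    for _ in range(max_iter):
        f1 = x * x + y * y - l1 * l1
        f2 = (x - base_width) ** 2 + y * y - l2 * l2
        if max(abs(f1), abs(f2)) < tol:
            break
        # Jacobian of (f1, f2) with respect to (x, y).
        j11, j12 = 2 * x, 2 * y
        j21, j22 = 2 * (x - base_width), 2 * y
        det = j11 * j22 - j12 * j21
        if abs(det) < 1e-12:
            raise ValueError("singular configuration (moving point on the base line)")
        # Solve J * delta = -f with Cramer's rule.
        dx = (-f1 * j22 + f2 * j12) / det
        dy = (-j11 * f2 + j21 * f1) / det
        x += dx
        y += dy
    return x, y


# Usage: identical leg lengths, two valid poses selected by the initial guess.
legs = (math.sqrt(13.0), math.sqrt(13.0))
upper = refine_pose(legs, guess=(1.5, 2.5))    # converges near (2, 3)
lower = refine_pose(legs, guess=(1.5, -2.5))   # converges near (2, -3)
```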

Abstract:
Mobile robots can perform increasingly impressive feats in controlled environments. Many real applications, though, especially for walking robots, introduce a high degree of unforeseen difficulties, yet require very robust robot operation. In these cases, it is still often not possible to guarantee the needed reliability. We present an approach to utilize unsupervised anomaly detection to implement a fear-based adaptation of robot behavior. This allows robots to automatically and quickly react to any type of unexpected problems. Neither the environment nor the type of disturbance has to be known beforehand, as the system requires only a small amount of baseline data for training, which can be collected in a laboratory environment. Additionally, it can work on arbitrary robot hardware and be integrated in all types of robot control structures. We evaluated our approach in simulation and on state-of-the-art walking robots, ANYmal, Spot and our own six-legged walking robot prototype, in a realistic field test environment in the Tabernas desert in Spain. Our results showcase that we can quickly detect arbitrary problems based on significantly different types of sensor data and decrease robot fall rates in the most extreme scenarios from 56% to 4%. This promises significant increases in robustness for all types of walking robots in highly challenging and previously unknown environments.
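
One simple way to picture anomaly detection trained only on baseline data is a per-channel z-score. The sketch below is an illustrative stand-in, not the detector used in the paper: the statistics model, the threshold values, and the mapping from anomaly score to a speed scale (a crude proxy for "fear") are all assumptions.

```python
import statistics


def fit_baseline(samples):
    """Per-channel mean and standard deviation from nominal sensor readings."""
    columns = list(zip(*samples))
    return [(statistics.fmean(c), statistics.stdev(c)) for c in columns]


def anomaly_score(model, reading):
    """Largest absolute z-score over all sensor channels."""
    return max(abs(v - mu) / max(sigma, 1e-9)
               for v, (mu, sigma) in zip(reading, model))


def speed_scale(score, threshold=3.0, max_score=10.0, min_scale=0.2):
    """Hypothetical adaptation rule: full speed below the threshold, then a
    linear slowdown to min_scale as the anomaly score grows."""
    if score <= threshold:
        return 1.0
    if score >= max_score:
        return min_scale
    fraction = (score - threshold) / (max_score - threshold)
    return 1.0 - fraction * (1.0 - min_scale)


# Usage: baseline data from a lab run, then one nominal and one disturbed reading.
baseline = [[1.0, 10.0], [1.2, 10.5], [0.8, 9.5], [1.0, 10.0]]
model = fit_baseline(baseline)
calm = speed_scale(anomaly_score(model, [1.05, 10.1]))
alarmed = speed_scale(anomaly_score(model, [3.0, 10.0]))
```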

Abstract:
In this study, we introduce an innovative algorithm for enhanced navigational scene understanding in complex maritime environments by utilizing large language models (LLMs) and visual language models (VLMs) to achieve autonomous maritime situational awareness. The proposed algorithm interprets the meanings of various features and marks on detected objects in maritime contexts. By combining this information with radar and camera data, the algorithm generates cost maps for safe navigation. This approach offers two key benefits: (1) the ability to identify navigable areas considering obstacles, maritime marks, rules, and ship intentions, and (2) decision-making support based on reasoning, bridging the information gap between human operators and perception results. The performance of the proposed approach is demonstrated using a real-world dataset. Detailed information can be found at: https://yeongha-shin.github.io/vlmllm-maritime/

Affiliations: School of Electronics and Information Engineering, Tongji University, and State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Shanghai, China; Shanghai Research Institute for Intelligent Autonomous Systems, Tongji University, and State Key Laboratory of Intelligent Autonomous Systems, and Frontiers Science Center for Intelligent Autonomous Systems, Shanghai, China; School of Logistics Engineering, Shanghai Maritime University, Shanghai, China

Abstract:
Multi-modal perception plays a crucial role in preventing deformation and damage during the robotic manipulation of deformable objects. However, integrating new heterogeneous modalities into existing robotic perception frameworks remains a significant challenge, primarily due to the need for massive amounts of paired data. In this paper, we propose Uni-Zipper, a scalable multi-modal fusion framework designed to expand new modalities with the help of semantic enhancement without relying on paired data. Uni-Zipper consists of a tokenizer that projects various modalities into a shared embedding space, a summary word embedding layer with a feature dictionary, a modality alignment space, and dynamic reconfigurable task heads. To facilitate efficient integration and extension of new modalities, the Zipper alignment mechanism is employed, effectively bridging the modality gap between different input types. Our experimental results demonstrate that Uni-Zipper successfully fuses four modalities and enhances performance in downstream tasks. Despite a 12% decrease in parameter count, Uni-Zipper maintains comparable performance.

Abstract:
Vision-Language-Action (VLA) models demonstrate remarkable potential for generalizable robotic manipulation. The performance of VLA models can be improved by integrating with action chunking, a critical technique for effective control. However, action chunking linearly scales up action dimensions in VLA models with increased chunking sizes. This reduces the inference efficiency. Therefore, accelerating VLA integrated with action chunking is an urgent need. To tackle this problem, we propose PD-VLA, the first parallel decoding framework for VLA models integrated with action chunking. Our framework reformulates autoregressive decoding as a nonlinear system solved by parallel fixed-point iterations. This approach preserves model performance with mathematical guarantees while significantly improving decoding speed. In addition, it enables training-free acceleration without architectural changes, as well as seamless synergy with existing acceleration techniques. Extensive simulations validate that our PD-VLA maintains competitive success rates while achieving 2.52× execution frequency on manipulators (with 7 degrees of freedom) compared with the fundamental VLA model. Furthermore, we experimentally identify the most effective settings for acceleration. Finally, real-world experiments validate its high applicability across different tasks.
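
The reformulation of autoregressive decoding as a fixed-point problem can be sketched with a Jacobi-style iteration. The code below is not PD-VLA itself: `next_token` is a hypothetical deterministic stand-in for greedy decoding with a real model, and the loop only illustrates why updating all positions in parallel reaches the same output as sequential decoding.

```python
def next_token(prefix):
    """Toy deterministic 'model': maps a token prefix to the next token id."""
    return (sum(prefix) * 31 + len(prefix) * 7 + 3) % 50


def autoregressive_decode(prompt, n):
    """Reference decoding: one token per step, each conditioned on all previous ones."""
    seq = list(prompt)
    for _ in range(n):
        seq.append(next_token(seq))
    return seq[len(prompt):]


def jacobi_decode(prompt, n, init_token=0):
    """Fixed-point iteration: every position is recomputed from the previous
    iterate, which a real model could do in a single batched forward pass.
    Position i is guaranteed correct after at most i + 1 iterations."""
    tokens = [init_token] * n
    iterations = 0
    while True:
        iterations += 1
        updated = [next_token(list(prompt) + tokens[:i]) for i in range(n)]
        if updated == tokens:  # fixed point reached
            return tokens, iterations
        tokens = updated


# Usage: both decoders produce the same 8-token chunk.
prompt = [4, 8, 15]
sequential = autoregressive_decode(prompt, 8)
parallel, num_iterations = jacobi_decode(prompt, 8)
```

With this toy model the iteration count is close to the worst case of n + 1; any speed-up in practice depends on several positions converging within the same iteration.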

Abstract:
Learning from Demonstrations (LfD) methods are applied to transfer human skills to robots from expert demonstrations, enabling them to perform complex tasks. However, existing methods often struggle to handle such long-horizon human skills as cleaning or wiping stains on the surface, which involve multiple periodic and transitional movement primitives. To address this limitation, this paper proposes a novel framework for segmenting, learning, and generalizing multi-periodic human skills, enabling robots to effectively learn different movement primitives and execute these skills in new environments. Specifically, the framework introduces an unsupervised learning method to segment long-horizon human demonstrations into periodic and discrete movement primitives. Further, a novel type of discrete dynamical movement primitives, namely transitional movement primitives, is employed to enhance the fluidity of combining different periodic movement primitives in skills. These primitives collectively form a lightweight state machine during task execution, where state transitions are governed by visual perception, thereby enabling generalization to long-horizon tasks composed of arbitrary numbers of periodic subtasks. To validate the effectiveness of the proposed approach, we conduct extensive experimental evaluations, including step-by-step validation of each method in simulation and the implementation of the entire presented framework in the real world. The results confirm that the proposed framework accurately learns and generalizes multi-periodic human skills, providing a feasible solution for transferring complex multi-periodic demonstrations to robots in practical applications. The project website can be found at: https://nkrobotlab.github.io/LAMPS/

Abstract:
Autonomous assembly is an essential capability for industrial and service robots, with Peg-in-Hole (PiH) insertion being one of the core tasks. However, PiH assembly in unknown environments is still challenging due to uncertainty in task parameters, such as the hole position and orientation, resulting from sensor noise. Although context-based meta reinforcement learning (RL) methods have been previously presented to adapt to unknown task parameters in PiH assembly tasks, their performance depends on a sample-inefficient procedure or human demonstrations. Thus, to enhance the applicability of meta RL in real-world PiH assembly tasks, we propose to train the agent to use information from the robot’s forward kinematics and an uncalibrated camera. Furthermore, we improve the applicability by efficiently adapting the meta-trained agent to use data from a force/torque sensor. Finally, we propose an adaptation procedure for out-of-distribution tasks whose parameters are different from the training tasks. Experiments on simulated and real robots show that our modifications enhance the sample efficiency during meta training, real-world adaptation performance, and generalization of the context-based meta RL agent in PiH assembly tasks compared to previous approaches.

Abstract:
Husky Carbon, a robot developed by Northeastern University, serves as a research platform to explore unification of posture manipulation and thrust vectoring. Unlike conventional quadrupeds, its joint actuators and thrusters enable enhanced control authority, facilitating thruster-assisted narrow-path walking. While a unified Model Predictive Control (MPC) framework optimizing both ground reaction forces and thruster forces could theoretically address this control problem, its feasibility is limited by the low torque-control bandwidth of the system’s lightweight actuators. To overcome this challenge, we propose a decoupled control architecture: a Raibert-type controller governs legged locomotion using position-based control, while an MPC regulates the thrusters augmented by learned Contact Residual Dynamics (CRD) to account for leg-ground impacts. This separation bypasses the torque-control rate bottleneck while retaining the thruster MPC to explicitly account for leg-ground impact dynamics through learned residuals. We validate this approach through both simulation and hardware experiments, showing that the decoupled control architecture with CRD exhibits more stable behavior in push recovery and cat-like walking gait than the decoupled controller without CRD.

Abstract:
Bipedal robots, due to their anthropomorphic design, offer substantial potential across various applications, yet their control is hindered by the complexity of their structure. Currently, most research focuses on proprioception-based methods, which lack the capability to overcome complex terrain. While visual perception is vital for operation in human-centric environments, its integration complicates control further. Recent reinforcement learning (RL) approaches have shown promise in enhancing legged robot locomotion, particularly with proprioception-based methods. However, terrain adaptability, especially for bipedal robots, remains a significant challenge, with most research focusing on flat-terrain scenarios. In this paper, we introduce a novel mixture of experts teacher-student network RL strategy, which enhances the performance of teacher-student policies based on visual inputs through a simple yet effective approach. Our method combines terrain selection strategies with the teacher policy, resulting in superior performance compared to traditional models. Additionally, we introduce an alignment loss between the teacher and student networks, rather than enforcing strict similarity, to improve the student’s ability to navigate diverse terrains. We validate our approach experimentally on the Limx Dynamic P1 bipedal robot, demonstrating its feasibility and robustness across multiple terrain types.

Abstract:
6D object pose estimation suffers from reduced accuracy when applied to metallic objects. We set out to improve the state-of-the-art by addressing challenges such as reflections and specular highlights in industrial applications. Our novel BOP-compatible dataset [1], [2], featuring a diverse set of metallic objects (cans, household, and industrial items) under various lighting and background conditions, provides additional geometric and visual cues. We demonstrate that these cues can be effectively leveraged to enhance overall performance. To illustrate the usefulness of the additional features, we improve upon the GDRNPP [3] algorithm by introducing an additional keypoint prediction and material estimator head in order to improve spatial scene understanding. Evaluations on the new dataset show improved accuracy for metallic objects, supporting the hypothesis that additional geometric and visual cues can improve learning.

Abstract:
We previously proposed cable-driven wearable devices for exercise and gait assistance. These were lightweight, suit-type devices with programmable resistance or assistance adjustment capabilities. In this paper, we introduce 1) a wearable gym device designed to focus on lower limb muscles and 2) a stationary gym device for comprehensive strength (weight) training. The actuation module can be used interchangeably for both devices. This actuation module includes a control board and a cable-driven actuator with a smaller size, improved strength, and greater speed compared to the previous version. Its compact size makes it easy to integrate into our proposed devices. To evaluate the effectiveness of these devices, we conducted surface electromyography (sEMG) experiments during exercises, comparing the effects of the developed devices with those of traditional dumbbells.

Abstract:
In this paper, two technologies are proposed to deal with the problem of flight safety of multiple unmanned aerial vehicles (UAVs) in unknown environments. One technology is to optimize the front-end path generated by traditional path planning methods in order to better match the dynamics of UAVs to obtain the back-end movement trajectories of UAVs. The other technology is to introduce the collision detection adjustment region such that collision avoidance can be realized for multiple UAVs by dynamic replanning of each UAV’s trajectory under local neighborhood communication. Finally, according to simulation and real-world experimental results, the effectiveness of the proposed technologies is verified for the flight safety of multiple UAVs in unknown environments.

Abstract:
High-precision tiny object alignment remains a common and critical challenge for humanoid robots in the real world. To address this problem, this paper proposes a vision-based framework for precisely estimating and controlling the relative position between a handheld tool and a target object for humanoid robots, e.g., a screwdriver tip and a screw head slot. By fusing images from the head and torso cameras on a robot with its head joint angles, the proposed Transformer-based visual servoing method can correct the handheld tool’s positional errors effectively, especially at a close distance. Experiments on M4-M8 screws demonstrate an average convergence error of 0.8-1.3 mm and a success rate of 93%-100%. Through comparative analysis, the results validate that this capability of high-precision tiny object alignment is enabled by the Distance Estimation Transformer architecture and the Multi-Perception-Head mechanism proposed in this paper.

Abstract:
Physical human-robot interaction (pHRI) has been demonstrated to be essential in the implementation of social assistive robots (SARs), which require advanced sensing capabilities for accurate and responsive engagement. This study presents the development and validation of a fiber Bragg grating (FBG) sensor network integrated into the hand of the CASTOR robot to classify complex pHRIs. Nine pHRIs were collected and evaluated within the high five, pets, handshakes, hits, and pinches categories. Four machine learning (ML) algorithms were tested, and the Bagged Decision Tree Classifier (BDTC) achieved the best performance. During testing, the model achieved an accuracy of 98%. The results demonstrate that the proposed FBG sensor network can classify complex pHRIs. Future work will explore additional instrumented areas of the robot and expand the physical interaction analysis to enhance social robot adaptability and user experience.

Abstract:
In multi-robot systems, capturing the complex and dynamic interaction relationships is essential for enhancing autonomous collaboration. However, existing learning-based approaches usually overlook the understanding of these relationships, leading to reliability issues and hindering their application to real-world scenarios. This paper proposes a novel approach called Interpretable Heuristic Graph Structure Learning (IHGSL) to better comprehend the complex collaborative relationships in multi-robot systems. We first construct a predicate space to define diverse predicates that express fundamental relationships. Then we employ the variational information bottleneck technique to acquire a latent representation of the current observation by aligning it with the historical trajectory. On this basis, the predicates that the robot should currently focus on the most are learned, and some interaction relationships are established accordingly. Thereby an interpretable relationship graph is generated heuristically to guide the achievement of multi-robot autonomous collaborative decision-making. Through experimental evaluation, we demonstrate the process of relationship inference, thus validating the interpretability of IHGSL. Compared with existing methods, IHGSL also achieves superior collaboration performance, which highlights the effectiveness of the learned heuristic graph structure.

Abstract:
Soft robots are promising to offer flexibility in environmental interaction tasks through compliant deformations. However, the infinite degrees of freedom and high nonlinearity of dynamics pose significant challenges in dynamic modeling and control in soft robots. While online reinforcement learning (RL) is promising for designing policies directly from data, the black-box policy learning process suffers from data inefficiency and sim-to-real gap, limiting its applications in soft robots. To address these challenges, we propose a novel offline RL with Koopman operators (KORL) framework to generate control policies for soft robots without using physical simulators or real-world interactions. In particular, we first utilize a deep neural network to map dynamics of soft robots to a lifted Koopman observable space, which is inherently linear. Then, an offline RL algorithm with a control-informed actor is designed to learn the robotic policy in the linear observable space. This is significantly different from the black-box policy design in existing offline RL paradigms. The designed Koopman observable enables efficient model-free policy learning with linear control theory, improving control performance while preserving interpretability in policy learning. The effectiveness of our KORL framework is validated in a real-world soft robotic system. Comparative experimental results demonstrate that our method outperforms state-of-the-art methods in target-reaching and trajectory-tracking tasks.
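
As a minimal illustration of why a lifted Koopman observable space is linear, the sketch below uses a textbook-style toy system rather than a soft robot. The system, its coefficients `A`, `B`, `C`, and the hand-picked lifting `[x1, x2, x1^2]` are all assumptions made for this example; KORL instead learns the lifting with a deep neural network.

```python
from typing import List, Tuple

# Hypothetical coefficients of a toy discrete-time nonlinear system:
#   x1' = A * x1
#   x2' = B * x2 + C * x1**2
A, B, C = 0.9, 0.5, 0.3


def nonlinear_step(x1: float, x2: float) -> Tuple[float, float]:
    """One step of the original nonlinear dynamics."""
    return A * x1, B * x2 + C * x1 * x1


def lift(x1: float, x2: float) -> List[float]:
    """Hand-picked observables z = [x1, x2, x1^2]."""
    return [x1, x2, x1 * x1]


# In the lifted coordinates the same dynamics are exactly linear: z' = K z.
K = [
    [A, 0.0, 0.0],
    [0.0, B, C],
    [0.0, 0.0, A * A],
]


def linear_step(z: List[float]) -> List[float]:
    """One step of the lifted linear system."""
    return [sum(K[i][j] * z[j] for j in range(3)) for i in range(3)]


if __name__ == "__main__":
    x = (1.2, -0.7)
    z = lift(*x)
    for _ in range(10):
        x = nonlinear_step(*x)
        z = linear_step(z)
    print(lift(*x))  # lifted nonlinear rollout
    print(z)         # linear rollout in observable space
```

Both rollouts agree up to floating-point rounding, which is the property that lets linear control tools operate on the lifted state.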

Abstract:
Age-related sarcopenia weakens balance in older adults, highlighting the need for effective assistive robots. However, existing assistive technologies often suffer from mechanical, kinematic, and control incompatibilities with the human body. Here, we present a wearable assistive robotic system that integrates muscle-like actuation and control algorithms to directly compensate for impaired muscle force generation at the physiological level. To address mechanical incompatibility, the system employs artificial muscle actuators that replicate the contraction dynamics of human muscles. By mimicking the natural direction of muscle force generation and the anatomical anchor points of muscle attachment, it also resolves kinematic incompatibility. Furthermore, the system uses real-time electromyography (EMG) signals for neuromuscular sensing and implements a muscle-like recruitment and rate coding algorithm to control twisted fiber soft actuators, thereby addressing control incompatibility through human-in-the-loop signal integration. A comprehensive evaluation using the Functional Reach Test (FRT) with ten older adults demonstrated a 16.6% average increase in total ankle joint torque and a statistically significant improvement in forward reach distance (from 66.4 ± 17.4 cm in the OFF condition to 77.2 ± 16.6 cm in the ON condition). These findings highlight the system’s potential to mitigate age-related muscle decline and establish a novel muscle-like actuation–motion–control paradigm for wearable assistive robotics.

Abstract:
Subjective self-disclosure is an important feature of human social interaction. While much has been done in the social and behavioural literature to characterise the features and consequences of subjective self-disclosure, little work has been done thus far to develop computational systems that are able to accurately model it. Even less work has been done that attempts to model specifically how human interactants self-disclose with robotic partners. This gap is becoming more pressing as we require social robots to work in conjunction with and establish relationships with humans in various social settings. In this paper, we develop a custom multimodal attention network based on models from the emotion recognition literature, train this model on a large self-collected self-disclosure video corpus, and construct a new loss function, the scale preserving cross entropy loss, that improves upon both classification and regression versions of this problem. Our results show that the best performing model, trained with our novel loss function, achieves an F1 score of 0.83, an improvement of 0.48 from the best baseline model. This result makes significant headway in the aim of allowing social robots to pick up on an interaction partner’s self-disclosures, an ability that will be essential in social robots with social cognition.

Abstract:
Learning-based methods have achieved strong performance for quadrupedal locomotion. However, several challenges prevent quadrupeds from learning helpful indoor skills that require interaction with environments and humans: lack of end-effectors for manipulation, limited semantic understanding using only simulation data, and low traversability and reachability in indoor environments. We present a system for quadrupedal mobile manipulation in indoor environments. It uses a front-mounted gripper for object manipulation, a low-level controller trained in simulation using egocentric depth for agile skills like climbing and whole-body tilting, and pre-trained vision-language models (VLMs) with a third-person fisheye and an egocentric RGB camera for semantic understanding and command generation. We evaluate our system in two unseen environments without any real-world data collection or training. Our system can zero-shot generalize to these environments and complete tasks, like following a user’s commands to fetch a randomly placed stuffed toy after climbing over a queen-sized bed, with a 60% success rate.

Abstract:
This paper presents a deformable spherical robot with a six-strut topological structure capable of achieving multimodal locomotion in complex amphibious environments. The robot realizes isotropic rolling and asymmetric jumping through its innovative geometric-based configuration while integrating an airbag-driven module for underwater buoyancy control. Based on collision dynamics analysis, we develop a prototype of the deformable spherical robot. Experiments conducted on land, in transition zones, and underwater validate the robot’s multimodal locomotion feasibility in multi-medium environments.

Abstract:
Undesired lateral and longitudinal wheel slippage can disrupt a mobile robot’s heading angle, traction, and, eventually, desired motion. This issue makes the robotization and accurate modeling of heavy-duty machinery very challenging because the application primarily involves off-road terrains, which are susceptible to uneven motion and severe slippage. As a step toward the robotization of a skid-steering heavy-duty robot (SSHDR), this paper designs a robust model-free control system based on neural networks to stabilize the robot dynamics in the presence of a broad range of potential wheel slippages. Before the control design, the dynamics of the SSHDR are first investigated by mathematically incorporating slippage effects, assuming that all functional modeling terms of the system are unknown to the control system. Then, a novel tracking control framework to guarantee global exponential stability of the SSHDR is designed as follows: 1) the unknown modeling of wheel dynamics is approximated using radial basis function neural networks (RBFNNs); and 2) a new adaptive law is proposed to compensate for slippage effects and tune the weights of the RBFNNs online during execution. Simulation and experimental results verify the proposed tracking control performance of a 4,836 kg SSHDR operating on slippery terrain.
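
The two ingredients named above, an RBFNN approximator and online weight tuning, can be sketched in a few lines. This is not the paper's Lyapunov-based adaptive law: the sketch swaps in a plain gradient (LMS) update, and the centres, width, gain, and the `unknown_dynamics` target are all hypothetical choices for illustration.

```python
import math
from typing import List

CENTERS = [-3.0 + 0.5 * i for i in range(13)]  # hypothetical RBF centres
WIDTH = 0.5                                     # hypothetical Gaussian width
SAMPLES = [-3.0 + 0.1 * i for i in range(61)]   # states visited "online"


def features(x: float) -> List[float]:
    """Gaussian radial basis functions evaluated at x."""
    return [math.exp(-((x - c) ** 2) / (2.0 * WIDTH ** 2)) for c in CENTERS]


def predict(weights: List[float], x: float) -> float:
    """RBFNN output: weighted sum of the basis functions."""
    return sum(w * p for w, p in zip(weights, features(x)))


def adapt(weights: List[float], x: float, target: float,
          gain: float = 0.2) -> List[float]:
    """Gradient (LMS) weight update, a stand-in for an adaptive law."""
    phi = features(x)
    error = target - sum(w * p for w, p in zip(weights, phi))
    return [w + gain * error * p for w, p in zip(weights, phi)]


def unknown_dynamics(x: float) -> float:
    """Placeholder for a model term the controller does not know."""
    return math.sin(x)


def mean_abs_error(weights: List[float]) -> float:
    """Average approximation error over the sample states."""
    errors = [abs(unknown_dynamics(x) - predict(weights, x)) for x in SAMPLES]
    return sum(errors) / len(errors)


def train(sweeps: int = 200) -> List[float]:
    """Tune the weights sample by sample, starting from zero."""
    weights = [0.0] * len(CENTERS)
    for _ in range(sweeps):
        for x in SAMPLES:
            weights = adapt(weights, x, unknown_dynamics(x))
    return weights


if __name__ == "__main__":
    print(mean_abs_error([0.0] * len(CENTERS)), mean_abs_error(train()))
```

A stability-oriented adaptive law would derive the update from a Lyapunov function of the tracking error. The gradient update here only demonstrates that the weights can be tuned from streaming samples.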

Abstract:
Robot-assisted surgery has significantly advanced surgical precision, yet the development of autonomous surgical robots remains hindered by their limited understanding of complex surgical actions. Current systems lack the ability to effectively perceive and interpret intricate surgical relationships, which restricts their capability to assist surgeons in dynamic surgical environments. To overcome these challenges, a novel self-supervised learning method for surgical action recognition is proposed, aimed at enhancing the understanding of surgical actions. The method introduces a dynamic masking with attention-based action localization module to focus the model on critical spatial regions where actions occur, enabling surgical view guidance for intelligent surgical robots while extracting key features. Moreover, a graph-enhanced adaptive feature selection module is employed to assign relevance to features and capture the temporal relationships between adjacent frames. Long Short-Term Memory is utilized to model long-term dependencies across video sequences, while multi-view contrastive learning facilitates the extraction of discriminative features from both masked and unmasked sequences. Experimental results demonstrate a 3.4% improvement in Average Precision and an Area Under Receiver Operating Characteristic Curve of 92.9% on the Neuro67 dataset for surgical action recognition. The method enables dynamic adjustments to the surgical view, achieving surgical visual navigation. These advancements contribute to the development of intelligent and autonomous surgical robots capable of assisting surgeons in complex and dynamic surgical settings.

Abstract:
Goal-conditioned reinforcement learning (GCRL) is an effective method for multi-goal robotic manipulation tasks. Many studies based on hindsight experience replay (HER) and hindsight goal generation (HGG) have achieved the autonomous acquisition of robotic manipulation in reward-sparse environments and have greatly improved the learning efficiency of GCRL. However, these methods perform poorly in environments with obstacles and distant goals. In this paper, we propose hindsight goal diffusion and graph-based experience replay (HGD-GER) for complex robotic manipulation. First, obstacle-avoiding graphs in environments with obstacles are constructed, and the graph-based distance metric between different goals is established. Second, the proposed HGD approach utilizes the inherent denoising mechanism of diffusion models and obstacle-avoiding graph-based distance to generate exploration goals, thereby promoting the exploration of obstacle-bypassing areas. Then, the GER module modifies the reward value of experience replay by graph-based distance, thereby avoiding the bias introduced by HER and improving the learning performance of the RL algorithm under sparse reward conditions. Finally, we conducted experiments on three robotic manipulation tasks with obstacles and distant goals, and the results show that the proposed HGD-GER achieves excellent learning performance. Additionally, the proposed method is deployed on a physical robot.
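
To see why a graph-based distance matters when obstacles are present, consider the sketch below. The occupancy grid, the 4-connected neighbourhood, and the breadth-first search are assumptions chosen for this example and are not taken from HGD-GER. The sketch only contrasts a straight-line metric with a shortest-path metric that respects obstacles.

```python
import math
from collections import deque
from typing import List, Tuple

Cell = Tuple[int, int]

# Hypothetical occupancy grid: 0 = free, 1 = obstacle (a wall with a gap on top).
GRID = [
    [0, 0, 0],
    [0, 1, 0],
    [0, 1, 0],
    [0, 1, 0],
]


def euclidean_distance(a: Cell, b: Cell) -> float:
    """Straight-line distance that ignores obstacles."""
    return math.hypot(a[0] - b[0], a[1] - b[1])


def graph_distance(grid: List[List[int]], start: Cell, goal: Cell) -> float:
    """Shortest 4-connected path length in cells; inf if unreachable."""
    rows, cols = len(grid), len(grid[0])
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < rows and 0 <= nc < cols
            if inside and grid[nr][nc] == 0 and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return math.inf


if __name__ == "__main__":
    start, goal = (3, 0), (3, 2)
    print(euclidean_distance(start, goal))    # 2.0: looks close
    print(graph_distance(GRID, start, goal))  # 8: must go around the wall
```

Two cells on opposite sides of the wall are 2 apart in Euclidean terms but 8 steps apart along the graph, so a goal-selection or reward rule based on the straight-line metric would treat them as near neighbours.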

Abstract:
Traditional swarm robots rely on specific communication and planning strategies to coordinate particular tasks. Human swarms exhibit distinctive characteristics due to their capacity for language-based communication and active reasoning. This paper presents an exploratory approach to robotic swarm intelligence that leverages Large Language Models (LLMs) to emulate human-like active problem-solving behaviors. We introduce a decentralized multi-robot system where each robot initially only has its local information and does not know of the existence of the other robots. The robots utilize LLMs for reasoning and natural language for inter-robot communication, enabling them to discover peers, share information, and coordinate actions dynamically. In a series of experiments in zero-shot settings, we observed human-like social behaviors, including mutual discovery, identification, information exchange, collaboration, negotiation, and error correction. While the technical approach is straightforward, the main contribution lies in exploring the interactive societies that LLM-driven robots form – a form of robot social dynamics (or robotic social behavior analysis), examining how human-like communication protocols and collaborative structures emerge among robots through language-based interaction. In this context, we use the term "robot social dynamics" to describe the interaction patterns that arise within robot collectives, inspired by, but distinct from traditional human anthropology.

Abstract:
Articulated tracked robots face significant challenges in maintaining stable locomotion over uneven terrain due to unknown contact points between tracks and ground, which are critical for dynamic control. Unlike legged robots, where contact locations can be predicted, tracked systems require real-time adaptation to varying terrains. This paper presents C-TRAC, a terrain-adaptive control framework that integrates reinforcement learning with a contact-modeling variational autoencoder (C-VAE) to enable robust obstacle traversal. We first train a C-VAE in simulation to reconstruct high-fidelity contact information (position and binary probability) from noisy sensor measurements. This model learns a latent representation of terrain contacts, capturing complex interactions between the robot’s kinematics and environment. Subsequently, we employ an asymmetric Soft Actor-Critic (SAC) algorithm to optimize a control policy that leverages the predicted contact data for adaptive track control during locomotion. Extensive experiments validate C-TRAC in both simulated and real-world scenarios. In benchmark tests against state-of-the-art (SOTA) methods using RoboCup Rescue Robot League environments, our approach achieves superior obstacle traversal speed (up to 66.67% faster on a 45° staircase) and stability (up to 47.53% more stable on the oblique terrace) compared to contact-agnostic RL baselines and model-based methods. Notably, zero-shot sim-to-real transfer demonstrates consistent performance in unstructured outdoor ruins, further confirming the framework’s practicality.

Affiliations: PCA Lab, Key Lab of Intelligent Perception and Systems for High-Dimensional Information of Ministry of Education, School of Computer Science and Engineering, Nanjing University of Science and Technology, Nanjing, China; College of Computer Science, Nankai University, Tianjin, China; School of Civil Engineering and Architecture, Changzhou Institute of Technology, Changzhou, China; State Key Laboratory of Internet of Things for Smart City (SKLIOTSC), Faculty of Science and Technology, University of Macau, Macao, China

Abstract:
Freespace detection plays an important role in autonomous driving. In recent years, deep learning based freespace detection methods have performed well in urban scenes. However, for off-road scenes, freespace detection poses significant challenges due to the complexity of the scenes and the lack of clear edges. The existing methods have not effectively fused LiDAR data and camera images. In this paper, we propose a Pyramid Cross-Modal Feature Fusion Network (PCMF2-Net) for off-road freespace detection. The dense depth maps are concatenated with RGB images and used as input along with surface normal maps. The dual branch CNN-Transformer encoder combines convolutional neural networks and transformers to extract local and global features from RGBD images and surface normal maps, respectively. Then, in the pyramid cross-modal feature fusion module, the multi-scale and multimodal encoder features are fused in a top-down manner. In addition, we also use an edge segmentation task and a two-step training strategy to further improve performance. Experiments on the off-road freespace detection dataset (ORFD) demonstrate that the proposed PCMF2-Net achieves a competitive result of 93.9% IoU at a speed of 23 Hz.

Abstract:
Sampling-based Model Predictive Control (MPC) algorithms such as Model Predictive Path Integral (MPPI) excel in managing nonlinear constraints and complex systems. However, their conventional sampling strategies often result in suboptimal local solutions. To address this problem, we propose RESM-MPPI, a novel dynamic obstacle avoidance algorithm that integrates the Risk Euclidean Safety Metric (RESM), which is an enhanced version of the Conventional Euclidean Safety Metric (CESM), to more effectively quantify collision risks between autonomous mobile robots (AMRs) and dynamic obstacles. Our approach extends the classical Control Barrier Function (CBF) framework by introducing the Risk Control Barrier Function (RCBF) and integrating a Control Obstacle Avoidance Annealing (COAA) sampling strategy to enhance obstacle avoidance performance. This combination enables the generation of safe and smooth trajectories for AMRs in dynamic environments. Extensive simulations and real-world experiments demonstrate the effectiveness of the proposed method. Experimental videos are available at: https://youtu.be/WUchIzz_0wU.
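
For readers unfamiliar with MPPI, the core update is a cost-weighted average of sampled control perturbations. The sketch below shows only that generic step, with hypothetical argument names. It does not implement RESM, RCBF, or the COAA sampling strategy, and the rollout costs are supplied by the caller.

```python
import math
from typing import List, Sequence, Tuple


def mppi_update(
    nominal: Sequence[float],
    perturbations: Sequence[Sequence[float]],
    costs: Sequence[float],
    temperature: float = 1.0,
) -> Tuple[List[float], List[float]]:
    """One generic MPPI step for a scalar control sequence.

    nominal:       current control sequence (one value per time step)
    perturbations: sampled noise sequences, same length as nominal
    costs:         rollout cost of each perturbed sequence
    temperature:   lower values concentrate weight on the cheapest samples
    """
    best = min(costs)  # subtracting the minimum keeps exp() well scaled
    raw = [math.exp(-(c - best) / temperature) for c in costs]
    total = sum(raw)
    weights = [w / total for w in raw]
    updated = [
        nominal[t] + sum(w * eps[t] for w, eps in zip(weights, perturbations))
        for t in range(len(nominal))
    ]
    return updated, weights


if __name__ == "__main__":
    controls, weights = mppi_update(
        nominal=[0.0, 0.0],
        perturbations=[[1.0, -1.0], [-2.0, 2.0]],
        costs=[0.0, 50.0],
    )
    print(controls, weights)
```

With one sample far cheaper than the other, nearly all weight goes to the cheap sample and the update follows its perturbation. With equal costs, the update is the plain average of the perturbations. Because samples are drawn around the nominal sequence, the search stays local, which is one intuition for the local-optimum issue noted above.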

Abstract:
Autonomous motion planning under unknown nonlinear dynamics presents significant challenges. An agent needs to continuously explore the system dynamics to acquire its properties, such as reachability, in order to guide system navigation adaptively. In this paper, we propose a hybrid planning-control framework designed to compute a feasible trajectory toward a target. Our approach involves partitioning the state space and approximating the system by a piecewise affine (PWA) system with constrained control inputs. By abstracting the PWA system into a directed weighted graph, we incrementally update the existence of its edges via affine system identification and reach control theory, introducing a predictive reachability condition by exploiting prior information of the unknown dynamics. Heuristic weights are assigned to edges based on whether their existence is certain or remains indeterminate. Consequently, we propose a framework that adaptively collects and analyzes data during mission execution, continually updates the predictive graph, and synthesizes a controller online based on the graph search outcomes. We demonstrate the efficacy of our approach through simulation scenarios involving a mobile robot operating in unknown terrains, with its unknown dynamics abstracted as a single integrator model.

Abstract:
As space exploration advances, the demand for assembling large-scale structures in orbit, such as telescopes and space stations, continues to grow due to transportation size constraints. Robotic systems play a critical role in these tasks, requiring precise control and efficient energy management in the challenging conditions of space. This paper proposes a hybrid control strategy that combines Model Predictive Control (MPC) with Reinforcement Learning (RL) to optimize the performance of a 7-degree-of-freedom walking-assembly integrated space robot. MPC optimizes control inputs by predicting the robot's response over a defined horizon while handling constraints in real time, and RL dynamically tunes MPC parameters to adapt the control system to task-specific priorities. The proposed controller operates in two distinct modes: energy-efficient mode and accuracy-focused mode, balancing energy consumption and task precision. Simulation results validate the approach, demonstrating significant energy savings while maintaining high accuracy during space assembly tasks.

Abstract:
Deep Reinforcement Learning (DRL) based navigation methods have demonstrated promising results for mobile robots, but suffer from limited action flexibility in confined spaces. Conventional DRL approaches predominantly learn forward-motion policies, causing robots to become trapped in complex environments where backward maneuvers are necessary for recovery. This paper presents MAER-Nav (Mirror-Augmented Experience Replay for Robot Navigation), a novel framework that enables bidirectional motion learning without requiring explicit failure-driven hindsight experience replay or reward function modifications. Our approach integrates a mirror-augmented experience replay mechanism with curriculum learning to generate synthetic backward navigation experiences from successful trajectories. Experimental results in both simulation and real-world environments demonstrate that MAER-Nav significantly outperforms state-of-the-art methods while maintaining strong forward navigation capabilities. The framework effectively bridges the gap between the comprehensive action space utilization of traditional planning methods and the environmental adaptability of learning-based approaches, enabling robust navigation in scenarios where conventional DRL methods consistently fail.

Abstract:
Fixed-lag smoothing is widely employed as a backend in localization tasks. Generally, increasing the window length leads to better accuracy, but demands more computational resources. Therefore, determining an appropriate window length and whether a fixed length should be maintained throughout the localization process are worth studying. Assuming independent and identically distributed noise based on the distance-independent characteristic of LiDAR ranging errors, we propose an uncertainty-based adaptive sliding window (ASW) strategy. Through mathematical derivation, we show that the reference uncertainty is affected by the LiDAR feature distribution of each frame. Consequently, we develop a multimodal LiDAR inertial odometry and mapping framework based on ASW, which integrates mechanical and solid-state LiDAR to enhance odometry accuracy and mapping density. By designing a joint matching module, our approach leverages the strengths of distinct scanning patterns. Additionally, we incorporate loop closure detection in the mapping process to minimize cumulative drift. Extensive experiments conducted on both public and self-collected datasets demonstrate the effectiveness of our method. Compared to the state-of-the-art method, our approach improves the average accuracy by 10.3%. We also provide an open-source implementation for further studies: https://github.com/wowhhhhgd/ASW-LIOM.

Abstract:
The integration of origami structures into soft robotics has enriched the adaptability and functionality of soft robots. Our research group has developed a cable-driven origami robot attached to an arc frame, which enables its deployment in an MR bore and manipulation of medical tools. However, the control of such origami robots still faces challenges such as nonlinear dynamics, unstructured environments, and external payloads. This paper introduces an active modeling compensation control method using the Koopman operator (K-AMCC) for the Yoshimura origami manipulator, enabling accurate trajectory tracking with payloads under different orientations. This active modeling method exploits Koopman operator theory and a Kalman filter to estimate the model error and works in synergy with a linear quadratic regulator to compensate for the modeling errors. Rectangular and circular trajectory tracking experiments under varying payloads and orientations were carried out. The results demonstrate the K-AMCC method’s ability to improve trajectory accuracy significantly, which lays a solid foundation for further medical applications such as needle manipulation and laser ablation in an MR environment.

Abstract:
Guiding vector fields (GVFs) have been widely applied in robotic path-following control. However, most, if not all, of the existing studies derive control algorithms that only render the path-following error asymptotically converging to zero, while more stringent time constraints on the path-following error convergence have not been fully studied. In this paper, by introducing a signum-based function, we propose a finite-time GVF that enables a nonholonomic robot to follow an arbitrary smooth nD desired path within a finite time. Note that the finite time is dependent on the initial condition and can be computed in advance. In practical applications, we design a controller based on the proposed GVF for the unicycle model. This controller drives a nonholonomic robot’s velocity to align with that of the GVF within a finite time. In addition, we introduce the extension of the proposed GVF to the distributed motion coordination among an arbitrary number of robots. Finally, we conduct two experiments using unmanned ground vehicles to validate the effectiveness of the proposed algorithms.

Abstract:
Whole-body planning is critical for enabling robots to navigate effectively in complex and dense environments. Traditional obstacle-based planning methods often restrict the representation of both robots and obstacles to simple convex polyhedra. This limitation makes it difficult to construct compact convex polyhedral envelopes around the intricate obstacle shapes found in such environments. In this paper, we propose an approximate convex decomposition (ACD) based method to generate convex polyhedral maps that effectively represent the non-convex shapes of robots as assemblies of multiple convex objects. Furthermore, we propose a differentiable convex polyhedron collision evaluation method to facilitate collision detection. Extensive experiments demonstrate that our method not only enhances the accuracy of collision detection in cluttered environments but also expands the potential applications of robotics in complex scenarios.

Abstract:
Recent advances in robotic manipulation leverage foundation models pre-trained on internet-scale data, where keypoint-based representations have shown promising results in spatial reasoning. However, existing approaches primarily focus on zero-shot generalization or human-collected demonstrations, with limited exploration of large-scale robotic datasets. In this work, we propose Keypoint-Aware Retrieval Augmented Generation (KARAG), a simple yet novel framework that synergistically integrates visual-language models (VLMs) with robotic datasets through retrieval-augmented generation (RAG). Our framework bridges the retrieval and generation phases via in-context learning with keypoint-aware constraints, enabling simultaneous utilization of internet-scale knowledge and structured robotic datasets. Extensive experiments in both simulated and real-world environments demonstrate that KARAG significantly enhances the stability and accuracy of VLM-generated outputs without requiring human demonstrations or additional training, achieving 10%–20% success rate improvements in real-world scenarios and 12%–46% improvements in simulation over the baseline. Furthermore, we present an algorithm for converting large robotic datasets into Keyframe-Keypoint-Trajectory representations to facilitate retrieval. Our dataset and implementation are publicly available at https://github.com/RobertAckleyLin/KARAG/.

Abstract:
Recent advances in large language models (LLMs) have led to significant progress in robotics, enabling embodied agents to understand and execute open-ended tasks. However, existing LLM-based approaches face limitations in grounding their outputs within the physical environment and aligning with the capabilities of the robot. While fine-tuning is an attractive approach to addressing these issues, the required data can be expensive to collect, especially when using very large language models. Smaller language models, while more computationally efficient, are less robust in task planning and execution, leading to a difficult trade-off between performance and tractability. In this paper, we present a novel, modular architecture designed to enhance the robustness of locally-executable LLMs in the context of robotics by addressing these grounding and alignment issues. We formalize the task planning problem within a goal-conditioned POMDP framework, identify key failure modes in LLM-driven planning, and propose targeted design principles to mitigate these issues. Our architecture introduces an "expected outcomes" module to prevent mischaracterization of subgoals and a feedback mechanism to enable real-time error recovery. Experimental results, both in simulation and on physical robots, demonstrate that our approach leads to significant improvements in success rates for pick-and-place and manipulation tasks, surpassing baselines using larger models. Through hardware experiments, we also demonstrate how our architecture can be run efficiently and locally. This work highlights the potential of smaller, locally-executable LLMs in robotics and provides a scalable, efficient solution for robust task execution and data collection.

Abstract:
Land-air bimodal robots (LABR) are gaining attention for autonomous navigation, combining high mobility from aerial vehicles with long endurance from ground vehicles. However, existing LABR navigation methods are limited by suboptimal trajectories from mapping-based approaches and the excessive computational demands of learning-based methods. To address this, we propose a two-stage lightweight framework that integrates global key points prediction with local trajectory refinement to generate efficient and reachable trajectories. In the first stage, the Global Key points Prediction Network (GKPN) is used to generate a hybrid land-air keypoint path. The GKPN includes a Sobel Perception Network (SPN) for improved obstacle detection and a Lightweight Attention Planning Network (LAPN) to improve predictive ability by capturing contextual information. In the second stage, the global path is segmented based on predicted key points and refined using a mapping-based planner to create smooth, collision-free trajectories. Experiments conducted on our LABR platform show that our framework reduces network parameters by 14% and energy consumption during land-air transitions by 35% compared to existing approaches. The framework achieves real-time navigation without GPU acceleration and enables zero-shot transfer from simulation to reality during deployment.

Abstract:
Vision-based teleoperation systems are widely used due to their cost-effectiveness and intuitive operation. However, these systems often suffer from challenges such as hand occlusions, environmental variability, and the lack of tactile feedback, limiting their precision and applicability in complex tasks. To address these limitations, we present AirTouch, a novel, low-cost visuotactile teleoperation system that integrates air pressure-based tactile feedback with lightweight hand pose estimation. AirTouch features an inflatable tactile bubble that provides adjustable feedback through closed-loop pneumatic control, enhancing the operator’s sense of interaction with remote environments. The system’s robust hand-tracking algorithm ensures accurate control even under dynamic and occlusion-prone conditions, while its hardware design eliminates the need for wearable devices, enabling intuitive operation. AirTouch supports a wide range of robotic end-effectors, including dexterous hands, parallel grippers, and suction cups, demonstrating versatility across multiple platforms. Extensive experiments validate AirTouch’s performance, achieving high precision in hand pose estimation and a 91% success rate in complex teleoperation tasks, all with a hardware cost as low as 39. These results highlight AirTouch as a scalable and practical solution for enhancing robotic teleoperation across industrial, medical, and hazardous scenarios.

Abstract:
Robotic manipulation is essential for the widespread adoption of robots in industrial and home settings and has long been a focus within the robotics community. Advances in artificial intelligence have introduced promising learning-based methods to address this challenge, with imitation learning emerging as particularly effective. However, efficiently acquiring high-quality demonstrations remains a challenge. In this work, we introduce an immersive VR-based teleoperation setup designed to collect demonstrations from a remote human user. We also propose an imitation learning framework called Haptic Action Chunking with Transformers (Haptic-ACT). To evaluate the platform, we conducted a pick-and-place task and collected 50 demonstration episodes. Results indicate that the immersive VR platform significantly reduces demonstrator fingertip forces compared to systems without haptic feedback, enabling more delicate manipulation. Additionally, evaluations of the Haptic-ACT framework in both the MuJoCo simulator and on a real robot demonstrate its effectiveness in teaching robots more compliant manipulation compared to the original ACT. Additional materials are available at https://sites.google.com/view/hapticact.

Abstract:
In recent years, contrastive learning has shown great potential in traffic flow prediction tasks. However, existing contrastive learning methods have difficulty dealing with missing data and noise, and a single contrast method cannot fully capture local and global correlations. In this paper, a Decreasing Mask Spatio-Temporal Graph Comparison Learning Model (DMSTGCL) is proposed. The model dynamically adjusts the mask ratio through an adaptive mask reduction technique to deal effectively with missing data and noise. Meanwhile, the projection head is combined with the TripleAttention mechanism in the spatio-temporal contrast learning process, which overcomes the limitations of a single contrast method and captures the complex relationships in local and global space more effectively. Experiments on three real-world datasets demonstrate that DMSTGCL achieves significantly higher prediction accuracy than existing methods.

Abstract:
Approximating model predictive control (MPC) using imitation learning (IL) allows for fast control without solving expensive optimization problems online. However, methods that use neural networks in a simple L2-regression setup fail to approximate multi-modal (set-valued) solution distributions caused by local optima found by the numerical solver or nonconvex constraints, such as obstacles, significantly limiting the applicability of approximate MPC (AMPC) in practice. We solve this issue by using diffusion models to accurately represent the complete solution distribution (i.e., all modes) up to kilohertz sampling rates. This work shows that diffusion-based AMPC significantly outperforms L2-regression-based AMPC for multi-modal action distributions. In contrast to most earlier work on IL, we also focus on running the diffusion-based controller at a higher rate and in joint space instead of end-effector space. Additionally, we propose the use of gradient guidance during the denoising process to consistently pick the same mode in closed loop to prevent switching between solutions. We propose using the cost and constraint satisfaction of the original MPC problem during parallel sampling of solutions from the diffusion model to pick a better mode online. We evaluate our method on the fast and accurate control of a 7-DoF robot manipulator both in simulation and on hardware deployed at 250 Hz, achieving a speedup of more than 70 times compared to solving the MPC problem online and also outperforming the numerical optimization (used for training) in success ratio.
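The online mode selection described above (scoring parallel samples by the cost and constraint satisfaction of the original MPC problem) can be illustrated with a minimal sketch. The function name `pick_best_sample`, the callbacks `cost_fn` and `violation_fn`, the tolerance `tol`, and the fallback to the least-violating sample are all assumptions made for illustration; the abstract does not specify how the authors combine the two criteria.

```python
from typing import Callable, Optional, Sequence

Trajectory = Sequence[float]


def pick_best_sample(
    samples: Sequence[Trajectory],
    cost_fn: Callable[[Trajectory], float],
    violation_fn: Callable[[Trajectory], float],
    tol: float = 1e-6,
) -> Optional[int]:
    """Return the index of the preferred sample, or None if there are no samples.

    Hypothetical rule: among samples whose constraint violation is at most
    `tol`, take the one with the lowest cost. If no sample is feasible,
    fall back to the sample with the smallest violation.
    """
    if not samples:
        return None
    feasible = [i for i, s in enumerate(samples) if violation_fn(s) <= tol]
    if feasible:
        return min(feasible, key=lambda i: cost_fn(samples[i]))
    return min(range(len(samples)), key=lambda i: violation_fn(samples[i]))


def quadratic_cost(u: Trajectory) -> float:
    # Toy stand-in for the MPC objective: sum of squared inputs.
    return sum(x * x for x in u)


def box_violation(u: Trajectory, limit: float = 1.5) -> float:
    # Toy stand-in for the MPC constraints: inputs must satisfy |u| <= limit.
    return max(0.0, max(abs(x) for x in u) - limit)
```

In this sketch the samples would come from one batched denoising pass of the diffusion model; here they are just lists of floats.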

Abstract:
Locomotion on unstructured terrain poses a significant challenge for wheeled mobile robots lacking reconfigurable mechanisms. Achieving both stability and agile motion in such environments requires a hybrid approach that leverages their adaptable nature to varying surface conditions while ensuring efficient mobility. In this research, we present Improbability Roller 2, a refined iteration of our hybrid mobile robot with variable-diameter wheels, simplifying the design while improving its maneuverability and adaptability on unstructured terrain. The new compliant outer wheel structure and its folding mechanism allow for a higher wheel size change ratio. With the combination of multimode steering, which integrates both differential drive control and steering based on wheel size disparity, the robot can now optimize locomotion on diverse terrain while maintaining traction. The robot was tested across various obstacles and multiple surface conditions to validate the effectiveness of the new wheel design and the dual steering strategy. Experiments, including slope and step climbing, confined space traversal, and locomotion on loose gravel and snow, demonstrated the robot’s improved terrain adaptability, consistent traction, and control across varying surfaces.

Abstract:
Multi-Robot Navigation Among Movable Obstacles (MR-NAMO) is a variant of the Multi-Agent Path Finding (MAPF) problem where the environment consists of both immovable and movable obstacles. In this paper, we introduce a special case of MR-NAMO called the multi-agent path finding with removable obstacles (MAPF-RO) problem, in which the robots cooperate to remove some obstacles along the busy paths in the environment. We model the removable obstacles as pits and remove them by filling them with sandbags. Sandbags are modeled as movable obstacles but become removable once they are placed into a pit. The obstacles are removed or moved away from the paths while the total energy required for all robots to reach their goals is minimized. A nearby sandbag to fill a pit is identified using a kd-tree-based heuristic search. A nearby robot to push a sandbag is identified using a directional wavefront propagation algorithm. We simulate the scenario in randomized grid environments consisting of static, movable, and removable obstacles. We find that our approach conserves energy by removing only the necessary obstacles and cooperatively shortens the path for other agents. This method can be applied to multi-robot cooperative environment modification, enabling robots to alter their surroundings to optimize task execution for their peers while reducing overall energy expenditure.

Abstract:
Depth estimation is one of the key technologies for realizing 3D perception in unmanned systems. Monocular depth estimation has been widely researched because of its low-cost advantage, but the existing methods face the challenges of poor depth estimation performance and blurred object boundaries on embedded systems. In this paper, we propose a novel monocular depth estimation model, BoRe-Depth, which contains only 8.7M parameters. It can accurately estimate depth maps on embedded systems and significantly improves boundary quality. Firstly, we design an Enhanced Feature Adaptive Fusion Module (EFAF) which adaptively fuses depth features to enhance boundary detail representation. Secondly, we integrate semantic knowledge into the encoder to improve the object recognition and boundary perception capabilities. Finally, BoRe-Depth is deployed on NVIDIA Jetson Orin, and runs efficiently at 50.7 FPS. We demonstrate that the proposed model significantly outperforms previous lightweight models on multiple challenging datasets, and we provide detailed ablation studies for the proposed methods. The code is available at https://github.com/liangxiansheng093/BoRe-Depth.

Abstract:
Magnetically controlled micro-nano robots hold revolutionary significance in the clinical targeted treatment of brain tumors. Imaging and tracking miniature robots can provide feedback for precise magnetic field control. The cooperation among micro-nano robots, the magnetic field control system, and the imaging system is a significant challenge for transitioning micro-nano robots from laboratory research to clinical applications. This study explores the control and spatial localization of magnetic nanorobot swarms in a highly realistic, human-sized vascular phantom manufactured from raw CT scan images. The cerebral arterial vessels are the key focus area, with four main inlets and twenty-six branch outlets. The simulation results show that, under the influence of a magnetic field, the nanorobots can accumulate at the target tumor site. The Kernelized Correlation Filter (KCF) algorithm was employed to achieve single-plane tracking of nanorobots. Furthermore, based on a biplanar imaging system, three-dimensional spatial trajectory tracking of nanorobots was realized. This study provides a reference for in vivo spatial localization and imaging of magnetic nanorobot swarms (MNRS) transported through the vascular system.

Abstract:
Shared autonomy is the future of teleoperation as it reduces the teleoperator’s burden, enhances capabilities, and improves embodiment by offering seamless control of the robot. However, it remains rarely used, particularly with humanoid robots, as it faces numerous challenges. In this work, we introduce an innovative shared autonomy framework suitable for a wide range of robots, which we tested on a humanoid robot. This framework leverages Bayesian filtering over a Hidden Markov Model (HMM) to perform goal recognition, employing a landmark-based heuristic that minimizes computational demands while computing observation likelihoods without prior knowledge or a cost function. Once the teleoperator’s goal is identified, the robot assists according to its confidence level in the goal prediction. Assistance is provided by guiding the robot’s end-effector to reach a specified target position and orientation. In experiments with a diverse group of 10 teleoperators, conducted with video transmission delay, we achieved high accuracy in goal prediction and demonstrated significantly faster teleoperation time with shared autonomy.
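As a rough illustration of Bayesian filtering over an HMM for goal recognition, the sketch below performs one predict-and-update step over a discrete set of candidate goals. The function name `bayes_filter_step`, the transition matrix, and the observation likelihood values are hypothetical; in particular, the landmark-based heuristic that the authors use to compute observation likelihoods is not reproduced here.

```python
from typing import List, Sequence


def bayes_filter_step(
    belief: Sequence[float],
    transition: Sequence[Sequence[float]],
    likelihoods: Sequence[float],
) -> List[float]:
    """One HMM forward-filter step over candidate goals.

    belief[i]        : prior probability that goal i is the operator's goal
    transition[i][j] : probability that the goal switches from i to j
    likelihoods[j]   : likelihood of the current observation under goal j
                       (assumed to be supplied by some external heuristic)
    """
    n = len(belief)
    # Prediction: propagate the belief through the goal-transition model.
    predicted = [
        sum(belief[i] * transition[i][j] for i in range(n)) for j in range(n)
    ]
    # Update: weight each goal by how well it explains the observation.
    unnormalized = [predicted[j] * likelihoods[j] for j in range(n)]
    total = sum(unnormalized)
    if total <= 0.0:
        # The observation is impossible under every goal; keep the prediction.
        return predicted
    return [u / total for u in unnormalized]
```

The maximum of the returned belief could then serve as the confidence level that scales the assistance, although the abstract does not state the exact mapping from confidence to assistance.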

Abstract:
Turn-taking is a crucial aspect of human-robot interaction, directly influencing conversational fluidity and user engagement. While previous research has explored turn-taking models in controlled environments, their robustness in real-world settings remains underexplored. In this study, we propose a noise-robust voice activity projection (VAP) model, based on a Transformer architecture, to enhance real-time turn-taking in dialogue robots. To evaluate the effectiveness of the proposed system, we conducted a field experiment in a shopping mall, comparing the VAP system with a conventional cloud-based speech recognition system. Our analysis covered both subjective user evaluations and objective behavioral analysis. The results showed that the proposed system significantly reduced response latency, leading to a more natural conversation where both the robot and users responded faster. The subjective evaluations suggested that faster responses contribute to a better interaction experience.

Abstract:
In this paper, we propose a probabilistic motion model for tendon-actuated continuum robots that experience actuation transmission non-linearities due to cable slack and cable-sheath friction. The model is based on a Lie group formulation of the robot’s end-effector pose that incorporates a new simple backlash model. Bayesian parameter estimation is then employed to learn a probability distribution over the model’s parameters. This allows the uncertainty over the parameters to be propagated in the prediction of the end-effector’s trajectory. The model’s predictive capabilities are compared against the static Cosserat-rod-based model and the Kirchhoff model in simulation and are validated with experiments on a robotized medical endoscope.

Abstract:
Spherical tensegrity structures have good dynamic stability, support strength, and flexibility, and are widely used in mobile robot research. Most tensegrity spherical robots deform themselves so that gravity drives their motion, but the deformation of both rods and ropes affects the robot's motion efficiency and stability. In this paper, a new type of tensegrity spherical robot is proposed, which is powered by six fixed ducted thrusters whose thrust induces the robot to roll from different attitudes. This paper first introduces the structural design and working principle of the robot. It then analyzes the magnitude of the propulsive force required for the robot's motion and establishes a kinematic model. Finally, a robot prototype was built, and motion experiments were conducted both in simulation and in a real environment. The experimental results show that the robot has a simple structure but high motion efficiency, and has a strong ability to adapt to the environment.

Abstract:
Developing general robotic systems capable of manipulating in unstructured environments is a significant challenge, particularly as the tasks involved are typically long-horizon and contact-rich, requiring efficient skill transfer across different task scenarios. To address these challenges, we propose a knowledge graph-based skill library construction method. This method hierarchically organizes manipulation knowledge using a "task graph" and a "scene graph" to represent task-specific and scene-specific information, respectively. Additionally, we introduce a "state graph" to facilitate the interaction between high-level task planning and low-level scene information. Building upon this foundation, we further propose a novel hierarchical skill transfer framework based on the skill library and tactile representation, which integrates high-level reasoning for skill transfer and low-level precision for execution. At the task level, we utilize large language models (LLMs) and combine contextual learning with a four-stage chain-of-thought prompting paradigm to achieve subtask sequence transfer. At the motion level, we develop an adaptive trajectory transfer method based on the skill library and a heuristic path planning algorithm. At the physical level, we propose an adaptive contour extraction and posture perception method based on tactile representation. This method dynamically acquires high-precision contour and posture information from visual-tactile images, adjusting parameters such as contact position and posture to ensure the effectiveness of transferred skills in new environments. Experiments demonstrate the skill transfer and adaptability capabilities of the proposed methods across different task scenarios. Project website: https://github.com/MingchaoQi/skill_transfer

Abstract:
Recently, multi-node inertial measurement unit (IMU)-based odometry for legged robots has gained attention due to its cost-effectiveness, power efficiency, and high accuracy. However, the spatial and temporal misalignment between foot-end motion derived from forward kinematics and foot IMU measurements can introduce inconsistent constraints, resulting in odometry drift. Therefore, accurate spatial-temporal calibration is crucial for multi-IMU systems. Although existing multi-IMU calibration methods have addressed passive single-rigid-body sensor calibration, they are inadequate for legged systems. This is due to the insufficient excitation that traditional gaits provide for calibration and the increased sensitivity to IMU noise during kinematic chain transformations. To address these challenges, we propose A2I-Calib, an anti-noise active multi-IMU calibration framework enabling autonomous spatial-temporal calibration for arbitrary foot-mounted IMUs. Our A2I-Calib includes: 1) an anti-noise trajectory generator leveraging a proposed basis function selection theorem to minimize the condition number in correlation analysis, thus reducing noise sensitivity, and 2) a reinforcement learning (RL)-based controller that ensures robust execution of calibration motions. Furthermore, A2I-Calib is validated on simulation and real-world quadruped robot platforms with various multi-IMU settings, demonstrating a significant reduction in noise sensitivity and calibration errors and thereby improving the overall multi-IMU odometry performance.

Abstract:
Rejecting outliers before applying classical robust methods is a common approach to increase the success rate of estimation, particularly when the outlier ratio is extremely high (e.g., 90%). However, this method often relies on sensor- or task-specific characteristics, which may not be easily transferable across different scenarios. In this paper, we focus on the problem of rejecting 2D-3D point correspondence outliers from 2D forward-looking sonar (2D FLS) observations. The 2D FLS is one of the most popular perception devices in the underwater field but has a significantly different imaging mechanism compared to widely used perspective cameras and LiDAR. We fully leverage the narrow field of view in the elevation of 2D FLS and develop two compatibility tests for different 3D point configurations: (1) In general cases, we design a pairwise length in-range test to filter out overly long or short edges formed from point sets; (2) In coplanar cases, we design a coplanarity test to check if any four correspondences are compatible under a coplanar setting. Both tests are integrated into outlier rejection pipelines, where they are followed by maximum clique searching to identify the largest consistent measurement set as inliers. Extensive simulations demonstrate that the proposed methods for general and coplanar cases perform effectively under outlier ratios of 80% and 90%, respectively.

Abstract:
A sixteen-joint snake robot with full-body surface pressure sensing capabilities has been developed. A total of 64 thin film pressure sensors are evenly distributed on the surface of the robot. Four intelligent obstacle avoidance movements integrating surface pressure perception were investigated. They are as follows: the roll-over obstacle avoidance motion capable of autonomously switching between the regular rolling gait and the hump rolling gait, the autonomous crawling obstacle avoidance motion under unknown obstacle parameters, the intelligent winding and climbing motion on horizontal pipes with either unknown diameters or those with variable diameters, and the gap-crossing motion that can autonomously detect the gap position and cross over horizontal pipes with gaps. Finally, experiments were conducted in different scenarios to verify the feasibility of these four intelligent motions.

Abstract:
Prior flow matching methods in robotics have primarily learned velocity fields to morph one distribution of trajectories into another. In this work, we extend flow matching to capture second-order trajectory dynamics, incorporating acceleration effects either explicitly in the model or implicitly through the learning objective. Unlike diffusion models, which rely on a noisy forward process and iterative denoising steps, flow matching trains a continuous transformation (flow) that directly maps a simple prior distribution to the target trajectory distribution without any denoising procedure. By modeling trajectories with second-order dynamics, our approach ensures that the generated robot motions are smooth and physically executable, avoiding the jerky or dynamically infeasible trajectories that first-order models might produce. We empirically demonstrate that this second-order conditional flow matching yields superior performance on motion planning benchmarks, achieving smoother trajectories and higher success rates than baseline planners. These findings highlight the advantage of learning acceleration-aware motion fields, as our method outperforms existing motion planning methods in terms of trajectory quality and planning success. Our source code is available at: https://github.com/mkhangg/flow_mp.

Abstract:
The process of object search and relocation in an indoor environment, while intuitive for humans, remains a complex challenge for robots. Enabling robots to perform this task autonomously could have a substantial impact on automation in both domestic and industrial settings. In this article, assuming a familiar environment, a set of target objects with their desired locations, and a robot with limited carrying capacity, we propose a novel methodology for object search and relocation. Given the human-like intuition exhibited by modern large language models (LLMs), they can be leveraged to guide object localization based on environmental context. Our approach integrates LLM-based prediction with graph-based path planning to create a human-like iterative search and relocation framework. The framework consists of an LLM predictor that suggests likely object locations (along with a likelihood score) and an adaptive path planner that dynamically updates the robot’s future path as new information becomes available during the search process. Prior relevant literature that employs LLM inference in indoor environments primarily focuses on assigning new or misplaced objects to appropriate locations. Enabling a search for a set of missing objects and planning their relocation to desired locations sets this article apart from prior literature. We compare our method to a patrol-based baseline with respect to the distance traversed by the robot in completing the search and relocation mission. In a medium-sized indoor environment, we demonstrate that it outperforms the baseline by 31.2% on average.

Abstract:
Teleoperated robotic manipulators enable the collection of demonstration data, which can be used to train control policies through imitation learning. However, such methods can require significant amounts of training data to develop robust policies or adapt them to new and unseen tasks. While expert feedback can significantly enhance policy performance, providing continuous feedback can be cognitively demanding and time-consuming for experts. To address this challenge, we propose using a cable-driven teleoperation system that can provide spatial corrections with 6 degrees of freedom to the trajectories generated by a policy model. Specifically, we propose a correction method termed Decaying Relative Correction (DRC), which is based upon the spatial offset vector provided by the expert and exists temporarily, reducing the number of intervention steps required by an expert. Our results demonstrate that DRC reduces the required expert intervention rate by 30% compared to a standard absolute corrective method. Furthermore, we show that integrating DRC within an online imitation learning framework rapidly increases the success rate of manipulation tasks such as raspberry harvesting and cloth wiping.
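A correction that is derived from an expert's spatial offset and exists only temporarily could be sketched as follows. The geometric decay schedule, the `decay` and `cutoff` parameters, and the function name `apply_decaying_correction` are assumptions for illustration; the abstract does not describe the decay law the authors use, and orientation components are treated here as plain vector entries for simplicity.

```python
from typing import List, Sequence


def apply_decaying_correction(
    policy_actions: Sequence[Sequence[float]],
    offset: Sequence[float],
    decay: float = 0.8,
    cutoff: float = 1e-3,
) -> List[List[float]]:
    """Add an expert offset to successive policy actions with a fading weight.

    The offset is applied at full strength on the first step and scaled by
    `decay` on each following step (an assumed geometric schedule). Once the
    weight drops below `cutoff`, the correction is switched off entirely.
    """
    corrected = []
    scale = 1.0
    for action in policy_actions:
        if scale < cutoff:
            scale = 0.0
        corrected.append([a + scale * d for a, d in zip(action, offset)])
        scale *= decay
    return corrected
```

Because the correction is relative to whatever the policy outputs, a single expert intervention keeps influencing the following steps without further input, which matches the stated goal of reducing the number of intervention steps.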

Abstract:
In previous work, model-based reinforcement learning was applied to a real-world labyrinth game to demonstrate sample-efficient learning using world models. In this paper, we further enhance sample efficiency and autonomy by introducing selective reconstruction: instead of reconstructing the full visual observation, our approach reconstructs only the low-dimensional physical state signals (e.g., marble position and plate inclination), while still leveraging the complete visual input for decision-making. This targeted reconstruction focuses the world model on learning dynamics-relevant information, thereby reducing computational overhead and model complexity. Additionally, we incorporate prioritized experience replay to accelerate learning in newly explored regions of the maze and implement an autonomous marble reloader to eliminate manual resets. Together, these enhancements reduce the required collected experience from 5 hours to 1.5 hours while achieving comparable performance, and enable fully autonomous learning without human supervision.

Abstract:
This paper presents a learning-based approach for accurately estimating the 3D shape of flexible continuum robots subjected to external loads. The proposed method introduces a spatiotemporal neural network architecture that fuses multimodal inputs, including current and historical tendon displacement data and RGB images, to generate point clouds representing the robot’s deformed configuration. The network integrates a recurrent neural module for temporal feature extraction, an encoding module for spatial feature extraction, and a multimodal fusion module to combine spatial features extracted from visual data with temporal dependencies from historical actuator inputs. Continuous 3D shape reconstruction is achieved by fitting Bézier curves to the predicted point clouds. Experimental validation demonstrates that our approach achieves high precision, with mean shape estimation errors of 0.08 mm (unloaded) and 0.22 mm (loaded), outperforming state-of-the-art methods in shape sensing for TDCRs. The results validate the efficacy of deep learning-based spatiotemporal data fusion for precise shape estimation under loading conditions.
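
The final curve-fitting step can be illustrated with a least-squares Bézier fit. This is a sketch under stated assumptions: the points are taken to be ordered along the robot backbone, a chord-length parameterization is used, and the degree is fixed at 3; the abstract specifies none of these choices.

```python
import numpy as np
from math import comb

def bernstein_matrix(t, degree):
    """Rows are parameter values, columns are Bernstein basis functions."""
    t = np.asarray(t, dtype=float)
    return np.stack(
        [comb(degree, i) * t**i * (1 - t) ** (degree - i) for i in range(degree + 1)],
        axis=1,
    )

def fit_bezier(points, degree=3):
    """Least-squares control points for ordered 3D points (chord-length parameters)."""
    points = np.asarray(points, dtype=float)
    seg = np.linalg.norm(np.diff(points, axis=0), axis=1)
    t = np.concatenate([[0.0], np.cumsum(seg)]) / seg.sum()
    control, *_ = np.linalg.lstsq(bernstein_matrix(t, degree), points, rcond=None)
    return control

def eval_bezier(control, t):
    """Evaluate the fitted curve at parameter values t in [0, 1]."""
    return bernstein_matrix(t, len(control) - 1) @ control
```

The fitted control points give a continuous shape that can be evaluated at any arc parameter, which is what turns a discrete predicted point cloud into a continuous reconstruction.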

Abstract:
Navigating autonomous robots through dense forests and rugged terrains is especially daunting when exteroceptive sensors, such as cameras and LiDAR sensors, fail under occlusions, low-light conditions, or sensor noise. We present Blind-Wayfarer, a probing-driven navigation framework inspired by maze-solving algorithms that relies primarily on a compass to robustly traverse complex, unstructured environments. In 1,000 simulated forest experiments, Blind-Wayfarer achieved a 99.7% success rate. In real-world tests in two distinct scenarios with rover platforms of different sizes, our approach successfully escaped forest entrapments in all 20 trials. Remarkably, our framework also enabled a robot to escape a dense woodland, traveling from 45 m inside the forest to a paved pathway at its edge. These findings highlight the potential of probing-based methods for reliable navigation in challenging perception-degraded field conditions. Videos and code are available on our website: https://sites.google.com/view/blind-wayfarer

Abstract:
This research addresses the challenge of achieving high-precision tracking of time-varying trajectories under nonlinear disturbances and motion constraints in microsurgical robots. A hybrid control framework integrating fuzzy adaptive sliding mode control with radial basis function neural networks is proposed. This framework dynamically adjusts the sliding mode gain to suppress high-frequency jitter and compensate for unmodeled disturbances such as joint friction and tissue contact forces. Experiments conducted on a self-developed microscopic ophthalmic robot platform demonstrated that the trajectory tracking error was reduced to 1.1 μm, representing improvements of 85.9%, 76.1%, and 66.7% compared to PID control, sliding mode control, and non-singular fast terminal sliding mode control, respectively. The tracking delay was 19 milliseconds. In experiments on living pigs with central retinal artery occlusion, the system successfully performed intravascular injection, with a maximum error of 3.97 μm. This solution, through optimization via fuzzy logic and neural networks, achieves micron-level precision and robustness, effectively suppressing high-frequency control noise and low-frequency environmental disturbances and ensuring both the accuracy and safety of the microsurgical robot.

Affiliations: Intelligent Transportation, Hub of Systems, The Hong Kong University of Science and Technology (Guangzhou), China; Intelligent Transportation Thrust and Robotics and Autonomous Systems Thrust, Systems Hub, The Hong Kong University of Science and Technology (Guangzhou), China; Intelligent Transportation Thrust, Systems Hub, Internet of Things Thrust, Information Hub, The Hong Kong University of Science and Technology (Guangzhou), China

Abstract:
Unmanned Aerial Vehicles (UAVs) have emerged as versatile tools across various sectors, driven by their mobility and adaptability. This paper introduces SkyVLN, a novel framework integrating vision-and-language navigation (VLN) with Nonlinear Model Predictive Control (NMPC) to enhance UAV autonomy in complex urban environments. Unlike traditional navigation methods, SkyVLN leverages Large Language Models (LLMs) to interpret natural language instructions and visual observations, enabling UAVs to navigate through dynamic 3D spaces with improved accuracy and robustness. We present a multimodal navigation agent equipped with a fine-grained spatial verbalizer and a history path memory mechanism. These components allow the UAV to disambiguate spatial contexts, handle ambiguous instructions, and backtrack when necessary. The framework also incorporates an NMPC module for dynamic obstacle avoidance, ensuring precise trajectory tracking and collision prevention. To validate our approach, we developed a high-fidelity 3D urban simulation environment using AirSim, featuring realistic imagery and dynamic urban elements. Extensive experiments demonstrate that SkyVLN significantly improves navigation success rates and efficiency, particularly in new and unseen environments.

Abstract:
Anomaly detection and localization in automated industrial manufacturing can significantly enhance production efficiency and product quality. Existing methods are capable of detecting surface defects in pre-defined or controlled imaging environments. However, accurately detecting workpiece defects in complex and unstructured industrial environments with varying views, poses and illumination remains challenging. We propose a novel anomaly detection and localization method specifically designed to handle inputs with perturbative patterns. Our approach introduces a new framework based on a collaborative distillation heterogeneous teacher network (HetNet), an adaptive local-global feature fusion module, and a local multivariate Gaussian noise generation module. HetNet can learn to model the complex feature distribution of normal patterns using limited information about local disruptive changes. We conducted extensive experiments on mainstream benchmarks. HetNet demonstrates superior performance with approximately 10% improvement across all evaluation metrics on MSC-AD under industrial conditions, while achieving state-of-the-art results on other datasets, validating its resilience to environmental fluctuations and its capability to enhance the reliability of industrial anomaly detection systems across diverse scenarios. Tests in real-world environments further confirm that HetNet can be effectively integrated into production lines to achieve robust and real-time anomaly detection. Codes, images and videos are published on the project website at: https://zihuatanejoyu.github.io/HetNet/

Abstract:
Drifting is an advanced driving technique where the wheeled robot’s tire-ground interaction breaks the common non-holonomic pure rolling constraint. This allows high-maneuverability tasks like quick cornering, and steady-state drifting control enhances motion stability under lateral slip conditions. While drifting has been successfully achieved in four-wheeled robot systems, its application to single-track two-wheeled (STTW) robots, such as unmanned motorcycles or bicycles, has not been thoroughly studied. To bridge this gap, this paper extends the drifting equilibrium theory to STTW robots and reveals the mechanism behind the steady-state drifting maneuver. Notably, the counter-steering drifting technique used by skilled motorcyclists is explained through this theory. In addition, an analytical algorithm based on intrinsic geometry and kinematics relationships is proposed, reducing the computation time by four orders of magnitude while maintaining less than 6% error compared to numerical methods. Based on equilibrium analysis, a model predictive controller (MPC) is designed to achieve steady-state drifting and equilibrium points transition, with its effectiveness and robustness validated through simulations.

Abstract:
Semi-supervised medical image segmentation with only a few annotated samples can provide significant help in robot-assisted surgery. This step plays a pivotal role in the identification of pathological regions, more appropriate planning of surgical procedures, and so on. In this work, we develop an evidence-based tri-branch cross-pseudo supervision model, which integrates evidence-based uncertainty estimation and multi-branch cross supervision to bolster the effectiveness of semi-supervised learning. The overall framework consists of a vanilla network and an evidential dual-branch network. Two evidential branches, EPB and ERB, are proposed to complement each other and improve the quality of pseudo-labels. The EPB places more focus on classification accuracy at the pixel level, and the ERB emphasizes the similarity and overall integrity of the segmented regions. Then, a novel cross-pseudo supervision strategy among the three branches is designed to guarantee that valuable and diverse unlabeled knowledge is explored and transferred for segmentation improvement. The effectiveness of the proposed method was verified on the ACDC dataset, achieving outstanding performance compared with other state-of-the-art methods. In addition, we conducted an ablation study to validate the effectiveness of the evidential branches (EPB and ERB) and the tri-branch cross-supervision strategy, respectively.

Abstract:
In this paper, we propose a framework, collective behavioral cloning (CBC), to learn the underlying interaction mechanism and control policy of a swarm system. Given the trajectory data of a swarm system, we propose a graph variational autoencoder (GVAE) to learn the local interaction graph. Based on the interaction graph and swarm trajectory, we use behavioral cloning to learn the control policy of the swarm system. To demonstrate the practicality of CBC, we deploy it on a real-world decentralized vision-based robot swarm system. A visual attention network is trained based on the learned interaction graph for online neighbor selection. Experimental results show that our method outperforms previous approaches in predicting both the interaction graph and swarm actions with higher accuracy. This work offers a promising approach for understanding interaction mechanisms and swarm dynamics in future swarm robotics research. Code and data are available.

Abstract:
Recent trends in SLAM and visual navigation have embraced 3D Gaussians as the preferred scene representation, highlighting the importance of estimating camera poses from a single image using a pre-built Gaussian model. However, existing approaches typically rely on an iterative render-compare-refine loop, where candidate views are first rendered using NeRF or Gaussian Splatting, then compared against the target image, and finally, discrepancies are used to update the pose. This multi-round process incurs significant computational overhead, hindering real-time performance in robotics. In this paper, we propose iGaussian, a two-stage feed-forward framework that achieves real-time camera pose estimation through direct 3D Gaussian inversion. Our method first regresses a coarse 6DoF pose using a Gaussian Scene Prior-based Pose Regression Network with spatial uniform sampling and guided attention mechanisms, then refines it through feature matching and multi-model fusion. The key contribution lies in our cross-correlation module that aligns image embeddings with 3D Gaussian attributes without differentiable rendering, coupled with a Weighted Multiview Predictor that fuses features from multiple strategically sampled viewpoints. Experimental results on the NeRF Synthetic, Mip-NeRF 360, and T&T+DB datasets demonstrate a significant performance improvement over previous methods, reducing median rotation errors to 0.2° while achieving 2.87 FPS tracking on mobile robots, which is an impressive 10× speedup compared to optimization-based approaches. Project page: https://github.com/pythongod-exe/iGaussian

Abstract:
Robots operating in human-centric or hazardous environments must proactively anticipate and mitigate dangers beyond basic obstacle detection. Traditional navigation systems often depend on static maps, which struggle to account for dynamic risks, such as a person emerging from a suddenly opening door. As a result, these systems tend to be reactive rather than anticipatory when handling dynamic hazards. Recent advancements in pre-trained large language models and vision-language models (VLMs) create new opportunities for proactive hazard avoidance. In this work, we propose a zero-shot language-as-cost mapping framework that leverages VLMs to interpret visual scenes, assess potential dynamic risks, and assign risk-aware navigation costs preemptively, enabling robots to anticipate hazards before they materialize. By integrating this language-based cost map with a geometric obstacle map, the robot not only identifies existing obstacles but also anticipates and proactively plans around potential hazards arising from environmental dynamics. Experiments in simulated and diverse dynamic environments demonstrate that the proposed method significantly improves navigation success rates and reduces hazard encounters, compared to reactive baseline planners. Code and supplementary materials are available at https://github.com/Taekmino/LaC.
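
Combining a language-derived risk layer with a geometric obstacle layer can be sketched as below. The weighted-sum rule, the `risk_weight` parameter, and the cost range are assumptions for illustration; the abstract does not specify how the two maps are merged.

```python
import numpy as np

def fuse_cost_maps(obstacle_cost, risk_cost, risk_weight=0.5, lethal=1.0):
    """Overlay a non-negative risk layer on an obstacle grid, capped at the lethal cost.

    Cells that are already lethal obstacles stay lethal; free cells near an
    anticipated hazard (for example, in front of a door) become more expensive.
    """
    obstacle_cost = np.asarray(obstacle_cost, dtype=float)
    risk_cost = np.asarray(risk_cost, dtype=float)
    return np.minimum(obstacle_cost + risk_weight * risk_cost, lethal)

# Hypothetical 2x2 grid: one wall cell and two cells flagged as risky by a VLM.
obstacles = [[0.0, 1.0], [0.0, 0.0]]
risk = [[0.0, 0.0], [1.0, 0.5]]
fused = fuse_cost_maps(obstacles, risk)
```

A planner that minimizes path cost on the fused grid would then route around the flagged cells before any hazard actually appears.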

Abstract:
This work introduces a novel method for the generation of process-aligned robotic pathways specifically designed for surface processing applications. The proposed approach integrates the interpretation of sensor data, computer vision algorithms, and process knowledge modeling to address the complexities inherent in robotic programming. To mitigate programming challenges, the method incorporates intuitive interaction techniques, including hand gestures and human-computer interaction (HCI), thereby facilitating the efficient generation of robotic paths. Additionally, it augments the user’s teaching experience by enabling the seamless deployment of the methodology in serial production settings, accommodating the variability of both workpieces and environmental conditions. The proposed framework ensures smooth integration of robotic systems into complex workflows by aligning robotic paths with the unique requirements of surface processing tasks.

Abstract:
Precision is a crucial performance indicator for robot arms. When interacting with humans, high precision enables a robot arm to be used effectively and safely, while low precision may lead to safety issues. Traditional methods for improving robot arm precision rely on error compensation. However, these methods are often not robust and lack adaptability. Learning-based methods offer greater flexibility and adaptability, but current research shows that they often fall short in achieving high precision and struggle to handle many scenarios requiring high precision. In this paper, we propose a novel high-precision robot arm manipulation framework based on online iterative learning and forward simulation, which can achieve a positioning error (precision) smaller than the end-effector's physical minimum displacement. In other words, our proposed method can compensate for the precision limitation of the hardware structure of the robot arm. Furthermore, we consider the joint angular resolution of the real robot arm, which is usually neglected in related works. A series of experiments on both simulation and real UR3 robot arm platforms demonstrate that our proposed method is effective and promising. The related code will be available soon.

Abstract:
The data-driven paradigm has shown great potential in solving many decision-making tasks. In the robot navigation realm, it has also sparked a new trend. People believe powerful data-driven methods can learn efficient and general navigation policies from a vast offline dataset. However, robot navigation tasks differ from common planning tasks and present unique challenges. They often involve multi-objective optimization to meet arbitrary and ever-changing human preferences. They must also overcome short-sightedness to obtain globally optimal performance. Furthermore, high planning frequency is needed to address real-time demands. These factors obstruct the application of data-driven methods in robot navigation. To address these challenges, we integrate one of the most powerful data-driven methods, the diffusion model, into robot navigation. Our proposed approach, NaviDiffuser, utilizes a novel classification label to guide the diffusion model in capturing the complex connections between navigation and human preferences. Its Transformer network backbone outputs action sequences to alleviate short-sightedness. It also includes special distillation techniques to boost the planning speed and quality. We conduct experiments in both simulated and real-world scenarios to evaluate our approach. In these experiments, NaviDiffuser not only demonstrates an extremely high arrival rate but also adjusts its navigation policy to align with different human preferences.

Abstract:
Learning to execute long-horizon mobile manipulation tasks is crucial for advancing robotics in household and workplace settings. However, current approaches are typically data-inefficient, underscoring the need for improved models that require realistically sized benchmarks to evaluate their efficiency. To address this, we introduce the LAMBDA (λ) benchmark (Long-horizon Actions for Mobile-manipulation Benchmarking of Directed Activities), which evaluates the data efficiency of models on language-conditioned, long-horizon, multi-room, multi-floor, pick-and-place tasks using a dataset of manageable size, more feasible for collection. Our benchmark includes 571 human-collected demonstrations that provide realism and diversity in simulated and real-world settings. Unlike planner-generated data, these trajectories offer natural variability and replay-verifiability, ensuring robust learning and evaluation. We leverage λ to benchmark current end-to-end learning methods and a modular neuro-symbolic approach that combines foundation models with task and motion planning. We find that learning methods, even when pretrained, yield lower success rates, while a neuro-symbolic method performs significantly better and requires less data.

Abstract:
Performing striking aerobatic flight in complex environments demands manual designs of key maneuvers in advance, which is intricate and time-consuming as the horizon of the trajectory performed becomes long. This paper presents a novel framework that leverages diffusion models to automate and scale up aerobatic trajectory generation. Our key innovation is the decomposition of complex maneuvers into aerobatic primitives, which are short frame sequences that act as building blocks, featuring critical aerobatic behaviors for tractable trajectory synthesis. The model learns aerobatic primitives using historical trajectory observations as dynamic priors to ensure motion continuity, with additional conditional inputs (target waypoints and optional action constraints) integrated to enable user-editable trajectory generation. During model inference, classifier guidance is incorporated with batch sampling to achieve obstacle avoidance. Additionally, the generated outcomes are refined through post-processing with spatial-temporal trajectory optimization to ensure dynamical feasibility. Extensive simulations and real-world experiments have validated the key component designs of our method, demonstrating its feasibility for deploying on real drones to achieve long-horizon aerobatic flight.

Abstract:
Preference-based Reinforcement Learning (PbRL) methods provide a solution to avoid reward engineering by learning reward models based on human preferences. However, poor feedback efficiency and sample efficiency remain problems that hinder the application of PbRL. In this paper, we present a novel efficient query selection and preference-guided exploration method, called SENIOR, which selects meaningful and easy-to-compare behavior segment pairs to improve human feedback efficiency and accelerates policy learning with the designed preference-guided intrinsic rewards. Our key idea is twofold: (1) We design a Motion-Distinction-based Selection scheme (MDS). It selects segment pairs with apparent motion and different directions through kernel density estimation of states, which are more task-related and easier for humans to label. (2) We propose a novel preference-guided exploration method (PGE). It encourages exploration towards states with high preference and low visit counts and continuously guides the agent towards valuable samples. The synergy between the two mechanisms significantly accelerates the progress of reward and policy learning. Our experiments show that SENIOR outperforms five other existing methods in both human feedback efficiency and policy convergence speed on six complex robot manipulation tasks in simulation and four real-world tasks. Videos can be found on our project website: https://2025senior.github.io/

Abstract:
Open-set semantic mapping is crucial for open-world robots. Current mapping approaches either are limited by the depth range or only map beyond-range entities in constrained settings; overall, they fail to combine within-range and beyond-range observations. Furthermore, these methods make a trade-off between fine-grained semantics and efficiency. We introduce RayFronts, a unified representation that enables both dense and beyond-range efficient semantic mapping. RayFronts encodes task-agnostic open-set semantics to both in-range voxels and beyond-range rays encoded at map boundaries, empowering the robot to reduce search volumes significantly and make informed decisions both within and beyond sensory range, while running at 8.84 Hz on an Orin AGX. Benchmarking the within-range semantics shows that RayFronts's fine-grained image encoding provides 1.34× zero-shot 3D semantic segmentation performance while improving throughput by 16.5×. Traditionally, online mapping performance is entangled with other system components, complicating evaluation. We propose a planner-agnostic evaluation framework that captures the utility for online beyond-range search and exploration, and show RayFronts reduces search volume 2.2× more efficiently than the closest online baselines.

Abstract:
This paper presents a novel snake robot, JiAo, equipped with elliptical wheels that enable both wheeled and body-based locomotion. First, the design of each module of the snake robot is described, which consists of the body link and the transmission system of the elliptical wheels. Second, distinct control systems for wheeled and body-based locomotion are proposed. Finally, the prototype has been successfully developed and various experiments have been conducted, including crossing grasslands, crossing gaps, climbing slopes, navigating pipelines and climbing cylinders. In conclusion, JiAo demonstrates its versatility by effectively performing a wide range of tasks in various challenging scenarios.

Abstract:
The concept of morphological computation (MC) is applied in the robotics field to improve the design and reduce the complexity of control systems. MC uses mechanical intelligence, where stiffness properties play an important role as constraints to enhance system flexibility and to store elastic energy. This can reduce the number of required actuators. Following the MC principle, this work proposes LITHE-joint: a variable stiffness compliant spherical contact joint in an under-actuated system. This compact design for a 2-degree-of-freedom (DOF) compliant spherical contact joint with controllable stiffness uses a pneumatic artificial muscle (PAM). This joint requires only one PAM actuator to control stiffness in a 2-DOF system, achieving a stiffness of up to 0.38 Nm/rad with a bandwidth of 0.1967 Nm/rad. With its variable stiffness properties, the joint is able to adapt its bending behavior, enabling energy redistribution of torque and angle. The modulation of torque and bending angle is governed by joint stiffness and the passive body dynamics. The benefits of the passive, compliant joint with a variable stiffness property are demonstrated by using it as the spine of an under-actuated robot, controlling the passive bending of the body and the robot’s walking direction using the adjustable stiffness.

Abstract:
Garment grasping in low-light environments is a critical challenge for domestic intelligent robots, yet existing research has not sufficiently addressed this issue. In low-light conditions, the scarcity of visual features due to insufficient illumination causes different categories of garments to exhibit ambiguous feature similarities, thereby hindering the robot’s ability to detect the categories of different garments. Although traditional methods can compensate for visual deficiencies in low-light scenarios by applying preprocessing strategies that fuse infrared multimodal features, their complex computational processes incur significant computational overhead. To address this limitation, we propose DarkSeg, a low-light garment detection model based on the student-teacher model. The innovation of DarkSeg lies in its replacement of complex multimodal feature fusion with an indirect feature alignment mechanism between the student and teacher models, thereby circumventing high computational demands. Through feature alignment, DarkSeg enables the student model to learn illumination-invariant structural representations from the infrared features provided by the teacher model, effectively correcting structural deficiencies in low-light environments. Furthermore, to evaluate DarkSeg’s feasibility for low-light clothing grasping, we propose a depth-perceptive grasping strategy and build a low-light multimodal garment detection dataset, DarkClothes. Extensive experiments deploying DarkSeg on a Baxter robot demonstrate that DarkSeg achieves a 22% improvement in the grasping success rate while reducing the model parameters by 99.08 million compared to traditional methods, validating the practical viability of DarkSeg for robotic garment grasping in low-light conditions. The code and dataset are available at https://github.com/Zhonghaifeng6/Darkseg

Affiliations: School of Engineering, University of Guelph, Guelph, Canada; Department of Electrical and Computer Engineering, Memorial University, St. John’s, Canada; Department of Electrical Engineering, Al-Zaytoonah University of Jordan, Amman, Jordan; Department of Aeronautical Engineering, Jordan University of Science and Technology, Irbid, Jordan; Bradley Department of Electrical and Computer Engineering, Virginia Tech, Blacksburg, VA, USA

Abstract:
The reliability of precision motion systems, such as semiconductor wafer scanners, is often influenced by nonlinear dynamics originating from components such as cable slabs. This paper introduces a data-driven framework for early fault diagnosis in these systems. Koopman operator theory is employed to derive a linear state-space model from experimental data, capturing the complex, hysteretic behavior of the cable slab. This model serves as a digital twin, and by comparing its predictions with real-time sensor measurements, operational anomalies can be detected. A systematic process for selecting observable functions yields a high-fidelity model with a tracking error of approximately ±1% across the operational range. When the proposed approach is tested against a state-of-the-art neural network model, it demonstrates a 75.4% reduction in reaction force prediction error. The framework successfully identifies an injected sensor noise fault (SNR of 20) in just 0.35 s using only force data, validating its potential to improve wafer scanner reliability.
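
The general recipe of fitting a linear model in a lifted space and flagging faults by prediction residual can be sketched as follows. This is a generic extended-DMD-style example, not the authors' model: the observable set in `lift`, the function names, and the toy system used in the usage note are all assumptions.

```python
import numpy as np

def lift(states):
    """Hypothetical observables: the state, its square, and a constant term."""
    states = np.asarray(states, dtype=float)          # shape (N, d)
    return np.hstack([states, states**2, np.ones((states.shape[0], 1))])

def fit_koopman(states, next_states):
    """Least-squares linear map K with lift(next_states) ≈ lift(states) @ K."""
    K, *_ = np.linalg.lstsq(lift(states), lift(next_states), rcond=None)
    return K

def predict(K, states):
    """One-step prediction, read back from the first d lifted coordinates."""
    states = np.asarray(states, dtype=float)
    return (lift(states) @ K)[:, : states.shape[1]]

def residual(K, states, measured_next):
    """Norm of the gap between the model prediction and the sensor measurement."""
    return np.linalg.norm(predict(K, states) - np.asarray(measured_next, dtype=float), axis=1)
```

For example, after fitting `K` on data from a healthy system, a measurement can be flagged as anomalous when `residual` exceeds a threshold chosen from the healthy-data residuals; the threshold itself is a design choice the abstract does not describe.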

Abstract:
This paper presents a compliant tensegrity robotic arm design that overcomes limitations related to stiffness variation and cascaded actuation. Due to the special design and actuation strategy, the system offers a large workspace using a small number of actuators and system parts. Key features include intrinsic compliance, enhanced stability in various configurations, and a modular tendon-driven actuation system that facilitates continuous stiffness adjustment for adaptive manipulation tasks. The system’s kinematics and actuation strategy are validated experimentally. Results demonstrate an increased workspace and precise control, offering potential applications in dynamic and human-interactive environments.

Abstract:
Occupancy mapping is crucial for distinguishing between known and unknown regions, which plays a significant role in the autonomous exploration of unmanned aerial vehicles (UAVs). However, the construction of high-quality maps is still a challenge, for the following reasons. The vast amount of data captured by UAV exploration in large-scale environments creates computing and storage bottlenecks. Additionally, sensor noise and obstacle occlusions affect the completeness of the map. To address these issues, this paper applies a lightweight mapping framework based on the Random Mapping Method (RMM) to the challenging task of real-time UAV exploration. This framework employs a linear parametric model, where RMM efficiently maps sensor data into a high-dimensional feature space, enabling rapid learning of occupancy states. We demonstrate that this approach is not only efficient in terms of computation and storage but is also particularly effective at inferring and completing unobserved map regions caused by sensor noise and obstacle occlusions. When the exploration is completed, the global occupancy grid map is stored and implicitly represented with limited parameters. Simulation and real-world experiments are conducted to verify its comprehensive performance compared to typical and state-of-the-art methods.
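
A linear model on top of a random high-dimensional mapping can be sketched as below. Random Fourier features and ridge regression are used here as stand-ins; whether RMM uses this particular mapping or training rule is not stated in the abstract, so the class and all of its parameters are assumptions.

```python
import numpy as np

class RandomFeatureOccupancy:
    """Occupancy regressor: fixed random cosine features plus a linear readout."""

    def __init__(self, input_dim=2, n_features=200, length_scale=0.5, ridge=1e-2, seed=0):
        rng = np.random.default_rng(seed)
        # The random projection is drawn once and never trained.
        self.W = rng.normal(scale=1.0 / length_scale, size=(input_dim, n_features))
        self.b = rng.uniform(0.0, 2.0 * np.pi, size=n_features)
        self.ridge = ridge
        self.weights = np.zeros(n_features)

    def features(self, points):
        points = np.asarray(points, dtype=float)
        return np.sqrt(2.0 / self.W.shape[1]) * np.cos(points @ self.W + self.b)

    def fit(self, points, occupied):
        # Only the linear weights are learned, via closed-form ridge regression.
        Phi = self.features(points)
        A = Phi.T @ Phi + self.ridge * np.eye(Phi.shape[1])
        self.weights = np.linalg.solve(A, Phi.T @ np.asarray(occupied, dtype=float))
        return self

    def predict(self, points):
        """Occupancy estimate in [0, 1] for arbitrary query points."""
        return np.clip(self.features(points) @ self.weights, 0.0, 1.0)
```

The whole map is then represented by `weights` alone (200 numbers in this sketch), and `predict` can be queried at locations that were never directly observed, which mirrors the compact, implicit representation described above.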

Abstract:
Teleoperation systems for mobile robots face significant challenges in achieving seamless coordination across dynamic environments. We present MobiExo, a teleoperation system that unlocks seamless indoor-outdoor mobile manipulation. Our approach tackles two fundamental challenges: robust cross-environment localization and intuitive full-body control. A novel self-adaptive federated filter unifies GPS and SLAM, delivering continuous centimeter-level positioning (4.5±0.8 cm indoor, 6.8±1.2 cm outdoor) and eliminating transition errors. Simultaneously, an integrated hand-foot coordination framework translates the operator’s natural gait and gestures into fluid robot actions, maintaining remarkable millimeter-level end-effector precision (3.5±0.4 mm) during navigation. Extensive field trials validate our design, demonstrating high task success (96.7% indoor, 94.3% outdoor) and a 5.9× efficiency improvement in multi-location tasks over stationary setups. Code is available at: https://github.com/wangjianpeng200/MobiExo.git

Abstract:
Reinforcement Learning (RL) has the potential to enable extreme off-road mobility by circumventing complex kinodynamic modeling, planning, and control through simulated end-to-end trial-and-error learning experiences. However, most RL methods are sample-inefficient when training in a large number of manually designed simulation environments and struggle to generalize to the real world. To address these issues, we introduce VertiSelector (VS), an automatic curriculum learning framework designed to enhance learning efficiency and generalization by selectively sampling training terrain. VS prioritizes vertically challenging terrain with higher Temporal Difference (TD) errors when revisited, thereby allowing robots to learn at the edge of their evolving capabilities. By dynamically adjusting the sampling focus, VS significantly boosts sample efficiency and generalization within the VW-Chrono simulator built on the Chrono multi-physics engine. Furthermore, we provide simulation and physical results using VS on a Verti-4-Wheeler platform. These results demonstrate that VS can achieve a 23.08% improvement in success rate by efficiently sampling during training and robustly generalizing to the real world.
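
Selecting training terrain according to TD error can be sketched as follows. The softmax rule, the `temperature` parameter, and the moving-average score update are assumptions for illustration; the abstract only states that terrain with higher TD errors is prioritized.

```python
import numpy as np

def terrain_probabilities(td_scores, temperature=1.0):
    """Softmax over per-terrain TD-error scores: higher score, sampled more often."""
    logits = np.asarray(td_scores, dtype=float) / temperature
    logits = logits - logits.max()            # subtract the max for numerical stability
    weights = np.exp(logits)
    return weights / weights.sum()

def update_score(td_scores, terrain_id, td_errors, momentum=0.9):
    """Blend the old score with the mean absolute TD error from the latest visit."""
    scores = np.array(td_scores, dtype=float)
    latest = np.mean(np.abs(td_errors))
    scores[terrain_id] = momentum * scores[terrain_id] + (1.0 - momentum) * latest
    return scores

def sample_terrain(td_scores, rng, temperature=1.0):
    """Draw the index of the next training terrain."""
    probs = terrain_probabilities(td_scores, temperature)
    return int(rng.choice(len(probs), p=probs))
```

As the policy improves on a terrain, its TD errors shrink and its sampling probability falls, so training effort drifts towards terrain the robot has not yet mastered.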

Abstract:
Mobile manipulators integrate the locomotion flexibility of quadruped robots with the operational capabilities of robotic manipulators. This integrated system is particularly effective for teleoperating explosive ordnance disposal (EOD) tasks in hazardous environments, enabling the safe handling of explosive devices. However, when the quadruped operates in narrow corridors or cluttered spaces, its ability to reposition is limited. This limitation, combined with targets located laterally relative to the robot, poses critical challenges for achieving rapid and intuitive teleoperation of the manipulator. Existing manipulator mapping methods either fail to support lateral teleoperation or lack proper coordinate transformations, leading to mismatches between the intended and actual movement directions of the leader and follower devices. This reduces operational intuitiveness and increases the cognitive load on human operators. To overcome these issues, we propose a hybrid mapping method that combines joint-space velocity control with Cartesian-space control. This method leverages joint-space velocity commands for rapid manipulator reorientation, while employing Cartesian-space commands to achieve precise end-effector teleoperation. Furthermore, we introduce a virtual base coordinate frame that adaptively adjusts in response to the manipulator’s reorientation. This adaptive compensation ensures that the visual feedback from the camera mounted on the end-effector remains consistent and intuitive. The proposed method was validated through experiments on a quadruped robot equipped with a manipulator in an EOD scenario. Results demonstrated significant improvements, including 100% success rate, 43.9% task duration reduction, and 31.7% NASA-TLX score decrease, indicating decreased cognitive load and enhanced task efficiency compared to baseline methods.

Abstract:
Collaborative learning enhances the performance and adaptability of multi-robot systems in complex tasks but faces significant challenges due to high communication overhead and data heterogeneity inherent in multi-robot tasks. To this end, we propose CoCoL, a Communication efficient decentralized Collaborative Learning method tailored for multi-robot systems with heterogeneous local datasets. Leveraging a mirror descent framework, CoCoL achieves remarkable communication efficiency with approximate Newton-type updates by capturing the similarity between objective functions of robots, and reduces computational costs through inexact sub-problem solutions. Furthermore, the integration of a gradient tracking scheme ensures its robustness against data heterogeneity. Experimental results on three representative multi-robot collaborative learning tasks show that the proposed CoCoL can significantly reduce both the number of communication rounds and total bandwidth consumption while maintaining state-of-the-art accuracy. These benefits are particularly evident in challenging scenarios involving non-IID (non-independent and identically distributed) data distribution, streaming data, and time-varying network topologies.
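
To give a feel for why gradient tracking helps with heterogeneous data, the sketch below runs plain first-order gradient tracking on toy scalar objectives. It is not CoCoL: the mirror descent and approximate Newton-type updates are omitted, and the quadratic objectives, mixing matrix, and step size are assumptions chosen for illustration.

```python
def gradient_tracking(targets, mixing, step=0.1, iters=300):
    """Decentralized gradient tracking on f_i(x) = 0.5 * (x - targets[i])**2.

    Each robot i keeps a local iterate x[i] and a tracker y[i] that
    estimates the network-average gradient. mixing must be doubly stochastic.
    """
    n = len(targets)

    def grad(i, x):
        return x - targets[i]

    x = [0.0] * n
    y = [grad(i, x[i]) for i in range(n)]
    for _ in range(iters):
        # Consensus step on x, then descend along the tracked gradient.
        x_new = [sum(mixing[i][j] * x[j] for j in range(n)) - step * y[i]
                 for i in range(n)]
        # Consensus step on y, corrected by the change in the local gradient.
        y = [sum(mixing[i][j] * y[j] for j in range(n))
             + grad(i, x_new[i]) - grad(i, x[i])
             for i in range(n)]
        x = x_new
    return x


# Three robots with very different local optima (heterogeneous data).
W = [[0.5, 0.25, 0.25], [0.25, 0.5, 0.25], [0.25, 0.25, 0.5]]
solutions = gradient_tracking([1.0, 5.0, 9.0], W)
```

Although each robot's local optimum differs, all iterates agree on the minimizer of the summed objective, here the mean of the targets.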

Affiliations: Laboratory of Robotics Mechanism and Cross Innovation, School of Intelligent Engineering and Automation, Beijing University of Posts and Telecommunications, Beijing, China; China Software Testing Center (MIIT Software and Integrated Circuit Promotion Center), Beijing, China; College of Electronic and Information Engineering & National Key Laboratory of Autonomous Intelligent Unmanned Systems, Tongji University, Shanghai, China

Abstract:
With the global ecological environment facing continuous deterioration, effective monitoring of arboreal birds in complex canopy environments remains challenging due to limitations of conventional drones in endurance, size, and habitat disturbance. To address these challenges, this paper presents an ant-inspired micro quadrotor UAV equipped with a lightweight bistable gripper system mimicking the mandibular morphology of leafcutter ants. The design integrates shape memory alloy (SMA)-driven actuation and thermoplastic polyurethane (TPU)-based adaptive grippers, enabling rapid deformation (71 ms switching time) and energy-efficient operation (zero power consumption during perching). Experimental results demonstrate exceptional adaptability in grasping irregular objects (e.g., branches, pen caps) with an 8:1 payload-to-weight ratio. Field tests confirm stable navigation through dense foliage and reliable perching at heights exceeding 5 meters. The system’s compact dimensions (7 cm diameter, 70.5 g weight) and biomimetic approach offer a non-invasive solution for prolonged wildlife observation. This work advances bistable actuator design by combining bio-inspired structural optimization with rapid energy transition principles, showing potential in agile robotics and environmental sensing.

Abstract:
Modular robotics offers a promising approach for developing versatile and adaptive robotic systems capable of autonomous reconfiguration. This paper presents a novel modular robotic system in which each module is equipped with independent actuation, battery power, and control, enabling both individual mobility and coordinated locomotion. The system employs a hierarchical Central Pattern Generator (CPG) framework, where a low-level CPG governs the motion of individual modules, while a high-level CPG facilitates inter-module synchronization, allowing for seamless transitions between independent and collective behaviors. To validate the proposed system, we conduct both simulations in MuJoCo and hardware experiments, evaluating the system’s locomotion capabilities under various configurations. We first assess the fundamental motion of a single module, followed by two-module and four-module cooperative locomotion. The results demonstrate the effectiveness of the CPG-based control framework in achieving robust, flexible, and scalable locomotion. The proposed modular architecture has potential applications in search-and-rescue operations, environmental monitoring, and autonomous exploration, where adaptability and reconfigurability are essential for mission success.
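
A common way to realize this kind of two-level CPG is with coupled phase oscillators, sketched below. The abstract does not state which oscillator model the authors use, so the Kuramoto-style coupling, the sinusoidal joint mapping, and all parameter values here are assumptions.

```python
import math


def step_phases(phases, omega, coupling, offsets, dt):
    """One Euler step of Kuramoto-style coupled phase oscillators.

    offsets[i][j] is the desired lag phases[j] - phases[i]; the coupling
    term plays the role of the high-level, inter-module synchronization.
    """
    n = len(phases)
    updated = []
    for i in range(n):
        rate = omega
        for j in range(n):
            if i != j:
                rate += coupling * math.sin(phases[j] - phases[i] - offsets[i][j])
        updated.append(phases[i] + dt * rate)
    return updated


def joint_command(phase, amplitude=0.5):
    """Low-level CPG output: map a module's phase to a joint angle (rad)."""
    return amplitude * math.sin(phase)


def simulate(steps=2000, dt=0.01):
    """Two modules that should settle into anti-phase (a lag of pi)."""
    phases = [0.0, 0.5]
    offsets = [[0.0, math.pi], [-math.pi, 0.0]]
    for _ in range(steps):
        phases = step_phases(phases, omega=2.0 * math.pi, coupling=1.0,
                             offsets=offsets, dt=dt)
    return phases
```

Setting `coupling=0.0` recovers independent module motion, so switching between individual and collective behavior amounts to changing one gain in this simplified model.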

Abstract:
Information exchange is crucial for optimal coordination of robots, but a link may not always be available among agents to share data. For this reason, this paper presents a decentralized solution for flocking control, leveraging state and uncertainty estimation of undetected robots. A neural network is trained to mimic a state estimator, also providing information about the uncertainty of the estimate. This uncertainty is used to weigh the contribution of the estimate in taking actions for coordination. Using Control Barrier Functions and Control Lyapunov Functions, we define an optimization problem to find an optimal control input to reproduce collective motion observed in nature. We evaluate both the learned estimator and the control strategy with extensive simulations.

Abstract:
As the airspace becomes increasingly congested, decentralized conflict resolution methods for airplane encounters have become essential. While decentralized safety controllers can prevent dangerous midair collisions, they do not always ensure prompt conflict resolution. As a result, airplane progress may be blocked for extended periods in certain situations. To address this blocking phenomenon, this paper proposes integrating bio-inspired nonlinear opinion dynamics into the airplane safety control framework, thereby guaranteeing both safety and blocking-free resolution. In particular, opinion dynamics enable the safety controller to achieve collaborative decision-making for blocking resolution and facilitate rapid, safe coordination without relying on communication or preset rules. Extensive simulation results validate the improved flight efficiency and safety guarantees. This study provides practical insights into the design of autonomous controllers for airplanes.

Abstract:
Open-world task planning, characterized by handling unstructured and dynamic environments, has been increasingly explored to integrate with long-horizon robotic manipulation tasks. However, existing evaluations of the capabilities of these planners primarily focus on single-arm systems in structured scenarios with limited skill primitives, which is insufficient for numerous bimanual dexterous manipulation scenarios prevalent in the real world. To this end, we introduce OBiMan-Bench, a large-scale benchmark designed to rigorously evaluate open-world planning capabilities in bimanual dexterous manipulation, including task-scenario grounding, workspace constraint handling, and long-horizon cooperative reasoning. In addition, we propose OBiMan-Planner, a vision-language model-based zero-shot planning framework tailored for bimanual dexterous manipulation. OBiMan-Planner comprises two key components, the scenario grounding module for grounding open-world task instructions with specific scenarios and the task planning module for generating sequential stages. Extensive experiments on OBiMan-Bench demonstrate the effectiveness of our method in addressing complex bimanual dexterous manipulation tasks in open-world scenarios. The code, benchmark, and supplementary material are released at https://github.com/Zixin-Tang/OBiMan.

Abstract:
Visual Place Recognition (VPR) plays a vital role in mobile robotics and autonomous navigation by retrieving reference images from a pre-established database. However, VPR systems frequently encounter performance degradation due to environmental variations. To overcome these challenges, we propose a re-ranking based VPR framework incorporating two key components: (1) A Visual Mamba Embedding (VME) module that optimizes spatial-channel feature interactions to generate discriminative global descriptors; and (2) A Spatial Graph Attentional Network (SGAN) that replaces conventional RANSAC-based verification with an efficient graph attention mechanism, improving matching accuracy while reducing computation. Comprehensive evaluations across multiple benchmark datasets demonstrate that the proposed method achieves superior performance compared to existing state-of-the-art methods, while maintaining advantages in computational efficiency and storage requirements.

Affiliations: Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen) and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong, Shenzhen, China; Institute of Medical Robotics, School of Biomedical Engineering, Shanghai Jiao Tong University, Shanghai, China; School of Science and Engineering (SSE), FNii-Shenzhen, and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, The Chinese University of Hong Kong, Shenzhen, China

Abstract:
Large Language Models (LLMs) exhibit remarkable capabilities in the hierarchical decomposition of complex tasks through semantic reasoning. However, their application in embodied systems faces challenges in ensuring reliable execution of subtask sequences and achieving one-shot success in long-term task completion. To address these limitations in dynamic environments, we propose Closed-Loop Embodied Agent (CLEA)—a novel architecture incorporating four specialized open-source LLMs with functional decoupling for closed-loop task management. The framework features two core innovations: (1) Interactive task planner that dynamically generates executable subtasks based on the environmental memory, and (2) Multimodal execution critic employing an evaluation framework to conduct a probabilistic assessment of action feasibility, triggering hierarchical re-planning mechanisms when environmental perturbations exceed preset thresholds. To validate CLEA’s effectiveness, we conduct experiments in a real environment with manipulable objects, using two heterogeneous robots for object search, manipulation, and search-manipulation integration tasks. Across 12 task trials, CLEA outperforms the baseline model, achieving a 67.3% improvement in success rate and a 52.8% increase in task completion rate. These results demonstrate that CLEA significantly enhances the robustness of task planning and execution in dynamic environments. Our code is available at https://sp4595.github.io/CLEA/.

Abstract:
This paper proposes the DnD Filter, a differentiable filter that utilizes diffusion models for state estimation of dynamic systems. Unlike conventional differentiable filters, which often impose restrictive assumptions on process noise (e.g., Gaussianity), DnD Filter enables a nonlinear state update without such constraints by conditioning a diffusion model on both the predicted state and observational data, capitalizing on its ability to approximate complex distributions. We validate its effectiveness on both a simulated task and a real-world visual odometry task, where DnD Filter consistently outperforms existing baselines. Specifically, it achieves a 25% improvement in estimation accuracy on the visual odometry task compared to state-of-the-art differentiable filters, and even surpasses differentiable smoothers that utilize future measurements. To the best of our knowledge, DnD Filter represents the first successful attempt to leverage diffusion models for state estimation, offering a flexible and powerful framework for nonlinear estimation under noisy measurements. The code is available at https://github.com/ZiyuNUS/DnDFilter.

Abstract:
The execution of long-distance load-carrying tasks across multiple terrains remains a frequent requirement. These tasks often involve heavy loads, resulting in fatigue, decreased efficiency, and potential safety risks. To address this issue, this paper proposes a wearable centaur robot with wheel-legged transformation for human load-carrying assistance. The key feature of this robotic mechanism is the independent wheel-legged transformable structure, enabling transitions between the wheeled and legged modes. The wheeled mode ensures high load-carrying efficiency, while in the legged mode, the wheels are laid flat, transforming the ankle joint into a locked support surface that provides stable gait support. This design enables efficient and stable load carriage over complex terrains, all while preserving the natural gait of the user. Next, we develop a unified control framework for human-robot collaborative locomotion across different terrains, which includes velocity control based on an admittance model for the wheeled mode, gait control using a Bézier trajectory for the legged mode, and the transition between the two modes. The preliminary experiments include wheeled-mode, legged-mode, mode transition and obstacle crossing under human-robot collaborative locomotion, validating the proposed robot’s adaptability to different terrains while assisting with human load carriage.
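
For the legged mode, a Bézier trajectory can be evaluated with de Casteljau's algorithm, as in the sketch below. The control points are hypothetical swing-foot waypoints (forward position and height in meters), not values from the paper.

```python
def bezier_point(control_points, t):
    """Evaluate a Bezier curve at t in [0, 1] with de Casteljau's algorithm.

    Repeatedly interpolates between neighbouring control points until a
    single point remains. Works for any degree and any dimension.
    """
    points = [tuple(p) for p in control_points]
    while len(points) > 1:
        points = [
            tuple((1.0 - t) * a + t * b for a, b in zip(p, q))
            for p, q in zip(points, points[1:])
        ]
    return points[0]


# Hypothetical swing-foot control points: (forward x, height z) in meters.
swing = [(0.0, 0.0), (0.05, 0.12), (0.25, 0.12), (0.30, 0.0)]
trajectory = [bezier_point(swing, k / 20.0) for k in range(21)]
```

The curve starts and ends exactly at the first and last control points, so lift-off and touchdown positions are fixed while the inner points shape the foot clearance.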

Abstract:
The integration of RGB and thermal data can significantly improve semantic segmentation performance in wild environments for field robots. Nevertheless, multi-source data processing (e.g., Transformer-based approaches) imposes significant computational overhead, presenting challenges for resource-constrained systems. To resolve this critical limitation, we introduce CM-SSM, an efficient RGB-thermal semantic segmentation architecture leveraging a cross-modal state space modeling (SSM) approach. Our framework comprises two key components. First, we introduce a cross-modal 2D-selective-scan (CM-SS2D) module to establish SSM between RGB and thermal modalities, which constructs cross-modal visual sequences and derives hidden state representations of one modality from the other. Second, we develop a cross-modal state space association (CM-SSA) module that effectively integrates global associations from CM-SS2D with local spatial features extracted through convolutional operations. In contrast with Transformer-based approaches, CM-SSM achieves linear computational complexity with respect to image resolution. Experimental results show that CM-SSM achieves state-of-the-art performance on the CART dataset with fewer parameters and lower computational cost. Further experiments on the PST900 dataset demonstrate its generalizability. Codes are available at https://github.com/xiaodonguo/CMSSM.

Abstract:
Autonomous exploration in unknown environments is a crucial challenge for various applications of unmanned aerial vehicles (UAVs). However, in large-scale scenarios, existing methods suffer from inefficient environmental information acquisition, computationally expensive exploration planning, and inconsistent motion. In this work, we present a novel method for rapid UAV autonomous exploration in large-scale environments. We develop a surface frontier guided viewpoint generation strategy that supports efficient coverage of the scenario. In addition, we introduce an incremental viewpoint clustering method to approximate distant viewpoints using fewer anchor points, decreasing the computational cost of exploration tour planning. Building upon this, we propose a history-informed tour planning method that incorporates information from the previous tour into the optimization process, maintaining motion consistency. Extensive simulation experiments validate that our method outperforms existing state-of-the-art methods in terms of exploration time, travel distance, and run time. Various real-world experiments are conducted to indicate the practicality of our approach. The source code will be released to benefit the community.

Abstract:
Closed-loop control remains an open challenge in soft robotics. The nonlinear responses of soft actuators under dynamic loading conditions limit the use of analytic models for soft robot control. Traditional methods of controlling soft robots underutilize their configuration spaces to avoid nonlinearity, hysteresis, large deformations, and the risk of actuator damage. Furthermore, episodic data-driven control approaches such as reinforcement learning (RL) are traditionally limited by sample efficiency and inconsistency across initializations. In this work, we demonstrate RL for reliably learning control policies for dynamic balancing tasks in real-time single-shot hardware deployments. We use a deformable Stewart platform constructed using parallel, 3D-printed soft actuators based on motorized handed shearing auxetic (HSA) structures. By introducing a curriculum learning approach based on expanding neighborhoods of a known equilibrium, we achieve reliable single-deployment balancing at arbitrary coordinates. In addition to benchmarking the performance of model-based and model-free methods, we demonstrate that in a single deployment, Maximum Diffusion RL is capable of learning dynamic balancing after half of the actuators are effectively disabled, by inducing buckling and by breaking actuators with bolt cutters. Training occurs with no prior data, in as fast as 15 minutes, with performance nearly identical to the fully-intact platform. Single-shot learning on hardware facilitates soft robotic systems reliably learning in the real world and will enable more diverse and capable soft robots.

Abstract:
This paper presents the concept of Variable Morphing Multi-Body AUVs (VMMAUVs), designed for underwater structure maintenance. The robot is capable of dynamically adjusting its structure to adapt to varying operational scenarios. The study explores two key stability mechanisms: buoyancy adjustment and aperture angle control, both aimed at optimizing the metacentric height. Through simulations and experiments with different buoyancy configurations and aperture angles, the results show that the proposed methods significantly enhance the system’s stability, enabling faster convergence and better posture retention. The feasibility of the control strategies is validated through various numerical simulations, demonstrating the effectiveness of angle tracking control and buoyancy adjustment in maintaining stability under dynamic oceanic conditions.

Abstract:
Motion planning for continuous contact-based aerial manipulators on complex unstructured surfaces remains a substantial challenge due to the sophisticated topology of unstructured surfaces. While direct planning in the high-dimensional configuration space manifolds faces efficiency limitations, simplified planning in the parametric space sacrifices trajectory quality. Therefore, this paper proposes a sampling-based motion planning method, namely, parameter-configuration space fast marching tree (PCS-FMT), which integrates both configuration and parameter space information. The proposed PCS-FMT introduces a reparameterization strategy that compresses the planning space into a low-dimensional parameter manifold while preserving metric consistency with the original configuration space. Thus, PCS-FMT can efficiently plan in the parameter space and optimize the motion trajectory. Simulations on challenging unstructured surfaces validate the effectiveness of PCS-FMT for aerial manipulators in contact with unstructured surfaces.

Abstract:
In this paper, we introduce a novel estimator for vision-aided inertial navigation systems (VINS), the Preconditioned Cholesky-based Square Root Information Filter (PC-SRIF). When solving linear systems, employing Cholesky decomposition offers superior efficiency but can compromise numerical stability. For this reason, existing VINS utilizing (Square Root) Information Filters often opt for QR decomposition on platforms where single precision is preferred, avoiding the numerical challenges associated with Cholesky decomposition. While these issues are often attributed to the ill-conditioned information matrix in VINS, our analysis reveals that this is not an inherent property of VINS but rather a consequence of specific parameterizations. We identify several factors that contribute to an ill-conditioned information matrix and propose a preconditioning technique to mitigate these conditioning issues. Building on this analysis, we present PC-SRIF, which exhibits remarkable stability in performing Cholesky decomposition in single precision when solving linear systems in VINS. Consequently, PC-SRIF achieves superior theoretical efficiency compared to alternative estimators. To validate the efficiency advantages and numerical stability of PC-SRIF based VINS, we have conducted well-controlled experiments, which provide empirical evidence in support of our theoretical findings. Remarkably, in our VINS implementation, PC-SRIF’s runtime is 41% faster than QR-based SRIF.
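
The abstract does not describe the preconditioner itself. To illustrate why parameterization alone can make an information matrix ill-conditioned, the sketch below applies generic Jacobi (diagonal) scaling to a toy 2x2 symmetric positive definite matrix whose two states have very different scales. Jacobi scaling is a stand-in chosen for illustration and should not be read as the PC-SRIF technique; the matrix entries are made up.

```python
import math


def cond_2x2_spd(a, b, d):
    """Condition number of the SPD matrix [[a, b], [b, d]]."""
    trace = a + d
    det = a * d - b * b
    lam_max = 0.5 * trace + math.sqrt(0.25 * trace * trace - det)
    lam_min = det / lam_max  # avoids cancellation for a tiny eigenvalue
    return lam_max / lam_min


def jacobi_scale(a, b, d):
    """Symmetric scaling D A D with D = diag(1/sqrt(a), 1/sqrt(d)).

    The result has a unit diagonal; only the off-diagonal entry changes.
    """
    return 1.0, b / math.sqrt(a * d), 1.0


# Toy information matrix: the two states differ in scale by 1e4.
raw = (1e8, 1e3, 1.0)
cond_before = cond_2x2_spd(*raw)
cond_after = cond_2x2_spd(*jacobi_scale(*raw))
```

Here the raw condition number is on the order of 1e8, beyond the roughly seven significant decimal digits of single precision, while the scaled matrix has a condition number close to 1.2.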

Abstract:
Augmented reality (AR) and virtual reality (VR) for human-machine interaction (HMI) have attracted great attention in industry. However, current interactive devices are still bulky and lack accurate tactile perception, which limits their applications in HMI. Here, we propose a novel flexible electronic device with multifunctional tactile sensing for interaction with robots. The electronic device was designed using a pressure sensing layer and a tactile pixel array layer for separate sensing of contact force, force angle, and sliding distance. Notably, the helical patterned tactile pixel array can achieve high-resolution force angle detection. Characterization tests showed that the flexible electronic device has a wide force sensing range of 0.8 to 6.0 N with a sensitivity of 0.358 N⁻¹, and a force angle detection resolution of 0.84°. The device was then integrated into a robotic interaction system for experimental tests. The results showed that our device can simultaneously measure the contact force, force angle, and sliding displacement when a finger touches the device. The tactile signals can be used for controlling robotic movements. The displacement detection accuracy of the robotic arm was within 1.0 mm, and the time delay was less than 200 ms, demonstrating that our device can offer effective interactions with robots and humans.

Abstract:
Mixed autonomy traffic systems face significant security challenges when malicious agents disrupt coordination between autonomous and human-driven vehicles. We present Malicious Agent Detection and Isolation (MADI), a framework addressing two critical forms of disruptive behavior: path order violations at coordination points and strategic congestion generation. MADI integrates dual-mechanism detection with temporal consistency analysis to identify sophisticated malicious behaviors while filtering transient anomalies that could trigger false positives. Upon detection, our framework employs adaptive isolation strategies including enlarged safety boundaries and dynamic priority adjustment. Extensive experiments in simulated highway and urban environments demonstrate that MADI achieves up to 91% detection accuracy with only 4% false positives, significantly outperforming rule-based, anomaly-based, and single-criterion methods. The framework reduces travel time impacts by 25.5% and near-collision events by 76.5% in adversarial conditions, demonstrating its effectiveness for enhancing safety and efficiency in mixed autonomy traffic.

Abstract:
Robots operating in agricultural environments require a robust, fast perception system to accurately identify picking points. This paper proposes a lightweight method for detecting sweet pepper peduncles, which uses an active illumination camera system and DUTU2-Net+ to achieve efficient and accurate peduncle detection. The camera system is used to overcome the influence of ambient light through flash and no-flash (FNF) image pairs, achieving more robust color detection and quickly locating the peduncle’s region of interest (ROI). The improved DUTU2-Net+ is then used for accurate peduncle detection within the ROI. It uses an encoder with depthwise separable convolution (DSC), dilated convolution, and a feature enhancement structure with a triple attention module (TAM) to reduce the computational load and parameters while ensuring detection accuracy. Experimental results show that the proposed method can effectively identify the position of the peduncle. The DUTU2-Net+ model achieves an average absolute error of 0.002, a maximum F1 score of 0.992, a frame rate of 36.3 FPS, and a model size of 6.9 MB. The source code is available at https://gitee.com/rosdx/detection-for-harvesting.git.

Abstract:
Task-aware robotic grasping is a challenging problem that requires the integration of semantic understanding and geometric reasoning. This paper proposes a novel framework that leverages Large Language Models (LLMs) and Quality Diversity (QD) algorithms to enable zero-shot task-conditioned grasp synthesis. The framework segments objects into meaningful subparts and labels each subpart semantically, creating structured representations that can be used to prompt an LLM. By coupling semantic and geometric representations of an object’s structure, the LLM’s knowledge about tasks and which parts to grasp can be applied in the physical world. The QD-generated grasp archive provides a diverse set of grasps, allowing us to select the most suitable grasp based on the task. We evaluated the proposed method on a subset of the YCB dataset with a Franka Emika robot. A consolidated ground truth for task-specific grasp regions is established through a survey. Our work achieves a weighted intersection over union (IoU) of 73.6% in predicting task-conditioned grasp regions in 65 task-object combinations. An end-to-end validation study on a smaller subset further confirms the effectiveness of our approach, with 88% of responses favoring the task-aware grasp over the control group. A binomial test shows that participants significantly prefer the task-aware grasp.

Abstract:
Recent trends in humanoid robot control have successfully employed imitation learning to enable the learned generation of smooth, human-like trajectories from human data. While these approaches make more realistic motions possible, they are limited by the amount of available motion data, and do not incorporate prior knowledge about the physical laws governing the system and its interactions with the environment. Thus they may violate such laws, leading to divergent trajectories and sliding contacts which limit real-world stability. We address such limitations via a two-pronged learning strategy which leverages the known physics of the system and fundamental control principles. First, we encode physics priors during supervised imitation learning to promote trajectory feasibility. Second, we minimize drift at inference time by applying a proportional-integral controller directly to the generated output state. We validate our method on various locomotion behaviors for the ergoCub humanoid robot, where a physics-informed loss encourages zero contact foot velocity. Our experiments demonstrate that the proposed approach is compatible with multiple controllers on a real robot and significantly improves the accuracy and physical constraint conformity of generated trajectories.
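
A physics-informed loss that encourages zero foot velocity during contact could look roughly like the sketch below. The squared-norm penalty, the averaging over contact frames, and the boolean contact flags are assumptions; the paper's actual loss formulation may differ.

```python
def contact_velocity_loss(foot_velocities, in_contact):
    """Mean squared foot speed over frames where the foot is in contact.

    foot_velocities: sequence of (vx, vy, vz) tuples, one per frame.
    in_contact: sequence of booleans, True when the foot touches the ground.
    Frames without contact are ignored, so swing motion is not penalized.
    """
    penalties = [
        sum(v * v for v in velocity)
        for velocity, contact in zip(foot_velocities, in_contact)
        if contact
    ]
    return sum(penalties) / len(penalties) if penalties else 0.0


# Hypothetical predicted foot velocities (m/s) and contact labels.
velocities = [(0.0, 0.0, 0.0), (0.3, 0.0, 0.4), (1.0, 0.0, 0.0)]
contacts = [True, True, False]
loss = contact_velocity_loss(velocities, contacts)
```

In training, a term like this would be added to the imitation loss with a weighting factor, so that predicted stance feet are discouraged from sliding.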

Abstract:
Keyframe selection plays a crucial role in balancing computational efficiency and localization accuracy in Visual Simultaneous Localization and Mapping (VSLAM) systems. Existing keyframe selection methods often struggle to capture high-level semantic information in environments where multiple semantic dimensions interact. In this paper, we propose the Multi-dimensional Semantic Analysis (MSA) module based on Visual-Language Models (VLMs). By leveraging the capability of VLMs to extract rich semantic features, we compute the similarity between each image frame and a set of textual descriptions, generating a scene descriptor that quantifies the semantic distance between frames across multiple dimensions (e.g., object count, texture, and lighting). We then introduce the Scene Change Assessment (SCA) module based on Bayesian On-line Changepoint Detection (BOCD), which identifies keyframes with significant semantic information gain, thereby reducing the total number of keyframes. Extensive experiments on an open dataset demonstrate that our method not only significantly reduces the number of keyframes but also maintains high localization accuracy. Furthermore, the inference speed of the MSA module satisfies the real-time requirements of VSLAM. These results underscore the potential of our approach to enhance the efficiency of keyframe selection.
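
The scene descriptor described above can be sketched as a vector of similarities between one frame embedding and several text embeddings. In this illustration the embeddings are toy 2D vectors rather than VLM outputs, cosine similarity and Euclidean distance are assumed choices, and the changepoint detection stage is not shown.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def scene_descriptor(frame_embedding, text_embeddings):
    """One similarity score per textual description (semantic dimension)."""
    return [cosine(frame_embedding, text) for text in text_embeddings]


def semantic_distance(descriptor_a, descriptor_b):
    """Euclidean distance between the descriptors of two frames."""
    return math.sqrt(sum((a - b) ** 2
                         for a, b in zip(descriptor_a, descriptor_b)))


# Toy stand-ins for embeddings of two text prompts and two frames.
texts = [(1.0, 0.0), (0.0, 1.0)]
desc_a = scene_descriptor((1.0, 0.0), texts)
desc_b = scene_descriptor((0.0, 1.0), texts)
change = semantic_distance(desc_a, desc_b)
```

A sequence of such distances over time is the kind of signal a changepoint detector could monitor to decide when a new keyframe carries enough semantic information gain.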

Abstract:
In domestic environments, robots require a comprehensive understanding of their surroundings to interact effectively and intuitively with untrained humans. In this paper, we propose DVEFormer – an efficient RGB-D Transformer-based approach that predicts dense text-aligned visual embeddings (DVE) via knowledge distillation. Instead of directly performing classical semantic segmentation with fixed predefined classes, our method uses teacher embeddings from Alpha-CLIP to guide our efficient student model DVEFormer in learning fine-grained pixel-wise embeddings. While this approach still enables classical semantic segmentation, e.g., via linear probing, it further enables flexible text-based querying and other applications, such as creating comprehensive 3D maps. Evaluations on common indoor datasets demonstrate that our approach achieves competitive performance while meeting real-time requirements, operating at 26.3 FPS for the full model and 77.0 FPS for a smaller variant on an NVIDIA Jetson AGX Orin. Additionally, we show qualitative results that highlight the effectiveness and possible use cases in real-world applications. Overall, our method serves as a drop-in replacement for traditional segmentation approaches while enabling flexible natural-language querying and seamless integration into 3D mapping pipelines for mobile robotics.

Abstract:
Dual-arm robotic systems hold great potential for complex bimanual tasks that require intricate and coordinated manipulation, such as holding and transporting a tray with a cup of coffee while navigating through cluttered environments. However, these tasks pose significant challenges due to the inherent closed-chain constraints between the arms and the object, as well as the need for real-time collision avoidance, especially in real-world applications. To address these challenges, we introduce a hierarchical framework that combines learning-based planning with classical control theory to ensure whole-body collision avoidance movement while maintaining the kinematic relationship. In addition, we present a novel, efficient, and cost-free data generation method specifically designed for dual-arm cooperative tasks, overcoming the lack of sufficient training data. Extensive experiments in both simulation and real-world scenarios demonstrate that our approach improves the success rate by 26.3% compared to existing planning methods and by 54.7% compared to end-to-end methods. These results highlight the advantages of our method in whole-body collision avoidance and environmental adaptability, making it a promising solution for dual-arm cooperative tasks.

Abstract:
We propose a neuromorphic tactile sensing framework for robotic texture classification that is inspired by human exploratory strategies. Our system utilizes the NeuroTac sensor to capture neuromorphic tactile data during a series of exploratory motions. We first tested six distinct motions for texture classification under a fixed environment: sliding, rotating, and tapping, as well as the combined motions sliding+rotating, tapping+rotating, and tapping+sliding. We chose sliding and sliding+rotating as the best motions based on final accuracy and the sample time length needed to reach converged accuracy. In the second experiment, designed to simulate complex real-world conditions, these two motions were further evaluated under varying contact depths and speeds. Under these conditions, our framework attained the highest accuracy of 87.33% with sliding+rotating while maintaining an extremely low power consumption of only 8.04 mW. These results suggest that the sliding+rotating motion is the optimal exploratory strategy for neuromorphic tactile sensing deployment in texture classification tasks and holds significant promise for enhancing robotic environmental interaction.

Affiliations: School of Computer Science, Carnegie Mellon University, Pittsburgh, PA, USA; Electrical and Computer Engineering, Carnegie Mellon University, Pittsburgh, PA, USA; Harvard John A. Paulson School of Engineering And Applied Sciences, Harvard University, Boston, MA, USA; Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, China; School of Mathematical Science, University of Nottingham, Nottingham, United Kingdom; Weill Cornell Medicine, Cornell University, New York, NY, USA; Meta Reality Labs, Pittsburgh, PA, USA

Abstract:
Calibrating large-scale camera arrays, such as those in dome-based setups, is time-intensive and typically requires dedicated captures of known patterns. While extrinsics in such arrays are fixed due to the physical setup, intrinsics often vary across sessions due to factors like lens adjustments or temperature changes. In this paper, we propose a dense-feature-driven multi-frame calibration method that refines intrinsics directly from scene data, eliminating the necessity for additional calibration captures. Our approach enhances traditional Structure-from-Motion (SfM) pipelines by introducing an extrinsics regularization term to progressively align estimated extrinsics with ground-truth values, a dense feature reprojection term to reduce keypoint errors by minimizing reprojection loss in the feature space, and an intrinsics variance term for joint optimization across multiple frames. Experiments on the Multiface dataset show that our method achieves nearly the same precision as dedicated calibration processes, and significantly enhances intrinsics and 3D reconstruction accuracy. Fully compatible with existing SfM pipelines, our method provides an efficient and practical plug-and-play solution for large-scale camera setups. Our code is publicly available at: https://github.com/YJJfish/Multi-Cali-Anything

Abstract:
A planner for autonomous vehicles must be capable of operating in diverse and complex real-world environments. However, learning-based planners often struggle with limited generalization due to the long-tail distribution in datasets. Moreover, the black-box nature of neural networks limits their interpretability and complicates the integration of explicit rules. In this work, we propose a hierarchical neural trajectory planner that takes the bird’s-eye view (BEV) rasters as input. The planner operates in two hierarchical phases: first, spatial proposals are sampled from a policy generated from interpretable learned reward maps, and second, learnable temporal velocity profiles are assigned to the spatial proposals using clothoid curves. We conduct training and closed-loop simulation on the nuPlan dataset. The results demonstrate that our proposed planner outperforms other learning-based methods, exhibiting superior adaptability in long-tail scenarios. Additionally, we explore the flexibility of our planner in integrating manually defined rule sets. Project website: https://iunone.github.io/HiTail

Abstract:
A multi-zone (typically 8×8) time-of-flight (ToF) sensor offers a low-cost, low-power, and compact solution for range measurement, making it ideal for specialized robotic applications. However, its low resolution limits its usability. Pairing a ToF sensor with a camera enhances depth perception and can solve the unscaled metric problem in monocular depth estimation. Advances in deep learning further enable high-quality depth map reconstruction from ToF-camera data, providing a cost-effective alternative. However, accurate ToF-camera calibration remains a challenge due to the ToF sensor’s coarse depth output. This work presents a simple yet effective method for the extrinsic calibration of a ToF sensor with an RGB camera using only a chessboard and two whiteboards. A tailored two-plane fitting algorithm is proposed specifically for the ToF sensor. Moreover, our approach leverages parallel lines with vanishing points and geometric constraints from plane intersections. This eliminates the need for robotic arm movements or SLAM-based sensor pose reconstruction, significantly reducing complexity while maintaining high accuracy. Experimental results demonstrate that our method lowers the root mean square (RMS) depth difference from 96.59 mm to 67.89 mm, underscoring its effectiveness in practical applications. Code is publicly available at https://github.com/Tianyou-Nottingham/ToF-Camera-Calibration.
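
To make the plane-fitting idea concrete, the following sketch fits a single plane z = a·x + b·y + c to a small set of 3D points (for instance, the 64 zone measurements of an 8×8 sensor) by ordinary least squares. This is not the tailored two-plane algorithm from the abstract, whose details are not given here; the function names `det3` and `fit_plane` and the synthetic data are assumptions for illustration only.

```python
def det3(m):
    """Determinant of a 3x3 matrix given as nested lists."""
    return (
        m[0][0] * (m[1][1] * m[2][2] - m[1][2] * m[2][1])
        - m[0][1] * (m[1][0] * m[2][2] - m[1][2] * m[2][0])
        + m[0][2] * (m[1][0] * m[2][1] - m[1][1] * m[2][0])
    )


def fit_plane(points):
    """Least-squares fit of z = a*x + b*y + c; returns (a, b, c).

    Solves the 3x3 normal equations with Cramer's rule, which is adequate
    for a handful of well-spread points such as an 8x8 zone grid.
    """
    n = len(points)
    sx = sum(p[0] for p in points)
    sy = sum(p[1] for p in points)
    sz = sum(p[2] for p in points)
    sxx = sum(p[0] * p[0] for p in points)
    syy = sum(p[1] * p[1] for p in points)
    sxy = sum(p[0] * p[1] for p in points)
    sxz = sum(p[0] * p[2] for p in points)
    syz = sum(p[1] * p[2] for p in points)

    normal = [[sxx, sxy, sx], [sxy, syy, sy], [sx, sy, n]]
    rhs = [sxz, syz, sz]
    d = det3(normal)
    if abs(d) < 1e-12:
        raise ValueError("points are degenerate (e.g. collinear); plane is not unique")

    coeffs = []
    for col in range(3):
        # Replace one column with the right-hand side (Cramer's rule).
        m = [row[:] for row in normal]
        for r in range(3):
            m[r][col] = rhs[r]
        coeffs.append(det3(m) / d)
    return tuple(coeffs)


# Synthetic 8x8 grid lying exactly on the plane z = 2x - y + 3.
grid = [(x, y, 2 * x - y + 3) for x in range(8) for y in range(8)]
a, b, c = fit_plane(grid)
```

With noisy measurements the same call returns the best-fit coefficients in the least-squares sense; a robust variant (for example with outlier rejection) would be a natural extension but is beyond this sketch.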

Abstract:
Online semantic 3D modeling from streaming RGB-D data fundamentally requires consistent fusion of 2D segmentation. Popular approaches address segmentation inconsistencies through histogram-based label aggregation, where each 3D element (point/voxel) maintains the frequency of candidate labels, which introduces prohibitive memory and computational overhead for resource-constrained devices. In response to this challenge, we propose MEFusion, a memory-efficient probabilistic fusion framework to avoid element-wise histogram aggregation. Specifically, we propose an element-wise probability update algorithm based on Bayesian Estimation, where each voxel stores only one instance label and updates it based on a posterior probability to maintain segmentation consistency. Following 3D segmentation, we establish a segment-wise voting framework to aggregate the semantic labels from historical data, where co-segment voxels share the semantic voting histogram, for semantic consistency. Our experiments demonstrate that our method achieves a memory reduction of 77% (85%) and a speed improvement of 58% (6.12x) on the desktop (embedded) platform while maintaining comparable reconstruction accuracy to the state-of-the-art point-cloud-based method.
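
One way to picture a voxel that stores a single label plus a probability, rather than a full histogram, is the simplified update below. It is an assumed binary-hypothesis model (either the stored label or the newly observed label is correct, and each 2D observation is right with a fixed `reliability`), not the update rule of the paper, which is not spelled out in the abstract. The name `update_voxel` and the default reliability of 0.8 are hypothetical.

```python
def update_voxel(label, prob, observed, reliability=0.8):
    """Bayesian update for a voxel that stores one label and its probability.

    label: currently stored instance label.
    prob: current belief (strictly between 0 and 1) that `label` is correct.
    observed: label reported by the latest 2D segmentation.
    reliability: assumed probability (strictly between 0 and 1) that an
        observation is correct.
    Returns the (possibly swapped) label and its updated probability.
    """
    if observed == label:
        # Observation agrees with the stored label.
        numerator = prob * reliability
        denominator = numerator + (1.0 - prob) * (1.0 - reliability)
    else:
        # Observation disagrees with the stored label.
        numerator = prob * (1.0 - reliability)
        denominator = numerator + (1.0 - prob) * reliability
    posterior = numerator / denominator

    if posterior < 0.5:
        # The competing label is now more likely: store it instead.
        return observed, 1.0 - posterior
    return label, posterior


# Example: a voxel labelled "A" with belief 0.6 receives two observations.
state = update_voxel("A", 0.6, "A")      # agreement raises the belief
state = update_voxel(*state, "B")        # one disagreement lowers it
```

The memory cost per voxel stays at one label and one float regardless of how many candidate labels have been observed, which is the property the abstract highlights.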

Abstract:
Vision-based 3D occupancy prediction has made significant advancements, but its reliance on cameras alone struggles in challenging environments. This limitation has driven the adoption of sensor fusion, among which camera-radar fusion stands out as a promising solution due to their complementary strengths. However, the sparsity and noise of the radar data limit its effectiveness, leading to suboptimal fusion performance. In this paper, we propose REOcc, a novel camera-radar fusion network designed to enrich radar feature representations for 3D occupancy prediction. Our approach introduces two main components, a Radar Densifier and a Radar Amplifier, which refine radar features by integrating spatial and contextual information, effectively enhancing spatial density and quality. Extensive experiments on the Occ3D-nuScenes benchmark demonstrate that REOcc achieves significant performance gains over the camera-only baseline model, particularly in dynamic object classes. These results underscore REOcc’s capability to mitigate the sparsity and noise of the radar data. Consequently, radar complements camera data more effectively, unlocking the full potential of camera-radar fusion for robust and reliable 3D occupancy prediction.

Abstract:
Indoor disaster relief and rescue missions require quadrotors to fully exploit their maneuverability in real-time. However, the computational complexity induced by the underactuated kinodynamics conflicts with the rapid replanning requirement. For agile trajectory planning in cluttered and unknown environments, we propose a real-time optimization-based quadrotor trajectory generation method that integrates kinodynamic constraints in both trajectory search and trajectory optimization phases to fully exploit maneuverability. To further improve efficiency, we introduce a waypoint selection strategy to reduce the computational burden of kinodynamic trajectory optimization by transforming obstacle avoidance constraints into waypoint constraints, thereby enabling safe trajectory optimization in real-time. Specifically, kinodynamic trajectories are searched under kinodynamic constraints, providing reliable initial values for subsequent numerical optimization. Next, a waypoint selection algorithm, based on an estimation of trajectory variation during optimization, is introduced to preserve the obstacle-avoidance properties obtained during the search phase by limiting the variation with waypoint constraints. Finally, the trajectory is segmented by waypoints with a fixed time interval for each segment and then optimized under kinodynamic constraints, ensuring real-time optimization at the cost of time allocation optimality. We evaluate our method in simulation and validate its performance in cluttered and unknown environments through real-world experiments.

Abstract:
Underwater debris is a significantly growing challenge that autonomous underwater vehicles (AUVs) can help alleviate, but robot-guided debris search and removal can also cause harm to the aquatic ecosystem or other humans engaged in cleanup missions if the AUVs are unable to assess the risks associated with their actions. We introduce a method for identifying such risks in an underwater scene in the context of AUV debris search and removal tasks. Our approach integrates a vision language model (VLM) with monocular depth estimation to effectively classify and localize objects in a marine scene, specifically submerged marine debris. We use the pixel distance and the depth difference from the monocular depth map to identify entities that are sensitive to harm in proximity to the debris. We collect and annotate a custom dataset containing images in three different marine and aquatic environments containing debris and other such sensitive entities, and compare classification performance for different types of prompts. We observe that the prompts describing the debris properties (e.g., "eroded trash") demonstrate a significant increase in accuracy compared to the use of object names directly as prompts. Our method successfully identifies debris that is safe to remove in complex scenes and turbid water conditions, highlighting the potential of using VLMs for risk assessment in AUV operations in the diverse underwater domain.
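
The proximity test described above (pixel distance combined with depth difference) can be sketched as follows. The thresholds, the dictionary layout, and the function names `risky_neighbors` and `safe_to_remove` are assumptions for illustration; the abstract does not state the actual values or data structures, and monocular depth is typically relative, so the depth units here are arbitrary.

```python
import math


def risky_neighbors(debris, entities, max_pixel_dist=50.0, max_depth_diff=0.5):
    """Names of sensitive entities close to the debris in image space AND in depth.

    debris / entities: dicts with pixel centre ("u", "v") and estimated "depth";
    entities additionally carry a "name". Thresholds are hypothetical.
    """
    close = []
    for entity in entities:
        pixel_dist = math.hypot(entity["u"] - debris["u"], entity["v"] - debris["v"])
        depth_diff = abs(entity["depth"] - debris["depth"])
        if pixel_dist <= max_pixel_dist and depth_diff <= max_depth_diff:
            close.append(entity["name"])
    return close


def safe_to_remove(debris, entities, **thresholds):
    """Debris counts as safe to remove when no sensitive entity is nearby."""
    return not risky_neighbors(debris, entities, **thresholds)


debris = {"u": 100, "v": 100, "depth": 2.0}
scene = [
    {"name": "coral", "u": 120, "v": 110, "depth": 2.2},  # near in pixels and depth
    {"name": "diver", "u": 400, "v": 300, "depth": 2.1},  # far away in the image
    {"name": "fish", "u": 110, "v": 100, "depth": 5.0},   # overlapping in 2D, far in depth
]
nearby = risky_neighbors(debris, scene)
```

Requiring both conditions avoids flagging an entity that merely overlaps the debris in the image but sits much farther from the camera.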

Abstract:
Recent advancements in 3D robotic manipulation have improved grasping of everyday objects, but transparent and specular materials remain challenging due to depth sensing limitations. While several 3D reconstruction and depth completion approaches address these challenges, they suffer from setup complexity or limited observation information utilization. To address this, leveraging the power of single-view 3D object reconstruction approaches, we propose a training-free framework SR3D that enables robotic grasping of transparent and specular objects from a single-view observation. Specifically, given single-view RGB and depth images, SR3D first uses external visual models to generate a reconstructed 3D object mesh based on the RGB image. Then, the key idea is to determine the 3D object’s pose and scale to accurately localize the reconstructed object back into its original depth-corrupted 3D scene. Therefore, we propose view matching and keypoint matching mechanisms, which leverage the inherent 2D and 3D semantic and geometric information in the observation to determine the object’s 3D state within the scene, thereby reconstructing an accurate 3D depth map for effective grasp detection. Experiments in both simulation and the real world show the reconstruction effectiveness of SR3D. More demonstrations can be found at: https://sites.google.com/view/sr3dtech/

Abstract:
6D pose estimation is a central problem in robot vision. Compared with pose estimation based on point correspondences or its robust versions, correspondence-free methods are often more flexible. However, existing correspondence-free methods often rely on feature representation alignment or end-to-end regression. For such a purpose, a new correspondence-free pose estimation method and its practical algorithms are proposed, whose key idea is the elimination of unknowns by process of addition to separate the pose estimation from correspondence. By taking the considered point sets as patterns, feature functions used to describe these patterns are introduced to establish a sufficient number of equations for optimization. The proposed method is applicable to nonlinear transformations such as perspective projection and can cover various pose estimations from 3D-to-3D points, 3D-to-2D points, and 2D-to-2D points. Experimental results on both simulation and actual data are presented to demonstrate the effectiveness of the proposed method.

Affiliations: School of Computer Science and Technology, Xinjiang University, Urumqi, China; Tsinghua Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; School of Computing Science, University of Glasgow, Glasgow, Scotland, United Kingdom; Dongguan Key Laboratory of Intelligent Equipment and Smart Industry, School of Advanced Engineering, Great Bay University, Dongguan, China; School of Information Engineering, Chang’an University, Changan, China

Abstract:
Multi-sensor fusion is a key technology in the field of autonomous driving and robotics. Traditional offline multi-sensor fusion calibration methods rely on manual operations and fail to meet real-time requirements, while recent online calibration technologies have limited generalization capabilities. This paper proposes CalibMutiL, an end-to-end calibration network that departs from conventional deep feature fusion by leveraging multi-level RGB image features to guide point cloud alignment. CalibMutiL introduces a Multi-level Fusion module (MLF) that effectively utilizes the rich visual features of the image. In addition, we regard the alignment process as a sequence prediction problem and further improve the performance through an Iterative Refinement module (IRM). Evaluation on the KITTI odometry and raw datasets demonstrates that the average calibration error reaches 0.81 cm and 0.09°. The generalization tests resulted in errors of 4.24 cm and 0.13°, outperforming existing methods. Our implementation will be publicly available at https://github.com/VIP-G/CalibMutiL.

Abstract:
Dynamicity is a critical and highly challenging aspect in Multi-Object Tracking and Segmentation (MOTS), significantly impeding the effective integration of diverse association cues. High dynamicity, such as severe occlusion or deformation, can distort appearance cues, leading to inaccurate inter-object relationships and misleading results. Conversely, in low dynamicity states, spatiotemporal consistency of appearance cues aids in recovering object states. To address this issue, we propose a straightforward, effective, and versatile Dynamicity Adaptation for Multi-object Tracking and Segmentation, named DA-Track. First, we leverage the sensitivity of appearance cues to dynamicity through pre-association, capturing dynamic behavior in objects. Second, Dynamicity Adaptation incorporates Dynamicity Selection to identify reliable appearance cues based on pre-association results and Occlusion Dynamicity Fusing to adaptively integrate appearance and motion cues by analyzing historical mask variations. Experiments on MOTS20 and KITTI MOTS datasets demonstrate DA-Track’s robust and reliable performance across diverse scenarios.

Abstract:
Predicting the near-term behavior of a reactive agent is crucial in many robotic scenarios, yet remains challenging when observations of that agent are sparse or intermittent. Vision-Language Models (VLMs) offer a promising avenue by integrating textual domain knowledge with visual cues, but their one-shot predictions often miss important edge cases and unusual maneuvers. Our key insight is that iterative, counterfactual exploration–where a dedicated module probes each proposed behavior hypothesis, explicitly represented as a plausible trajectory, for overlooked possibilities–can significantly enhance VLM-based behavioral forecasting. We present TRACE (Tree-of-thought Reasoning And Counterfactual Exploration), an inference framework that couples tree-of-thought generation with domain-aware feedback to refine behavior hypotheses over multiple rounds. Concretely, a VLM first proposes candidate trajectories for the agent; a counterfactual critic then suggests edge-case variations consistent with partial observations, prompting the VLM to expand or adjust its hypotheses in the next iteration. This creates a self-improving cycle where the VLM progressively internalizes edge cases from previous rounds, systematically uncovering not only typical behaviors but also rare or borderline maneuvers, ultimately yielding more robust trajectory predictions from minimal sensor data. We validate TRACE on both ground-vehicle simulations and real-world marine autonomous surface vehicles. Experimental results show that our method consistently outperforms standard VLM-driven and purely model-based baselines, capturing a broader range of feasible agent behaviors despite sparse sensing. Evaluation videos and code are available at trace-robotics.github.io.

Abstract:
This paper presents an uncertainty-aware exploration framework for cooperative underwater operations using multiple unmanned underwater vehicles (UUVs). The proposed framework leverages prior environmental information to iteratively integrate task planning, task assignment, and prior belief updating, enabling efficient exploration in unknown underwater environments. An interest area selection strategy is proposed to balance the exploration of uncharted regions and the exploitation of areas with high target likelihood. To optimize interest area task allocation, a simultaneous auction-based mechanism is developed that assigns each interest area to the most suitable UUV by maximizing potential information gain while minimizing operational costs. Additionally, to address the computational constraints of UUV systems, a Sparse Gaussian Process (SGP) with variationally optimized inducing points is employed, enabling rapid and accurate fusion of real-time observations with prior environmental information. This approach facilitates dynamic updates of the probabilistic environment representation and interest point selection without compromising accuracy. Experimental results in the HoloOcean simulator demonstrate the framework’s effectiveness in refining the probabilistic environment representation, achieving efficient exploration and accurate target detection in complex underwater scenarios. The results highlight the framework’s capability to dynamically adapt to environmental uncertainties, showcasing its potential for underwater exploration applications.
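
A heavily simplified picture of the auction step is sketched below: every UUV bids on every interest area at once, with a bid equal to the expected information gain minus a travel cost, and each area goes to the highest bidder. The bid formula, the `cost_weight` parameter, and the names `bid` and `auction` are assumptions for illustration; the mechanism in the paper may differ, for example in how it handles a UUV winning several areas.

```python
import math


def bid(uuv_pos, area, cost_weight=1.0):
    """Bid = expected information gain minus weighted straight-line travel cost."""
    travel = math.dist(uuv_pos, area["pos"])
    return area["gain"] - cost_weight * travel


def auction(uuvs, areas, cost_weight=1.0):
    """Assign every interest area to the UUV with the highest bid.

    uuvs: dict mapping a UUV name to its (x, y) position.
    areas: dict mapping an area name to {"pos": (x, y), "gain": float}.
    All areas are auctioned independently, so one UUV may win several.
    """
    return {
        area_name: max(uuvs, key=lambda name: bid(uuvs[name], area, cost_weight))
        for area_name, area in areas.items()
    }


uuvs = {"uuv1": (0.0, 0.0), "uuv2": (10.0, 0.0)}
areas = {
    "A": {"pos": (1.0, 0.0), "gain": 5.0},
    "B": {"pos": (9.0, 0.0), "gain": 5.0},
}
assignment = auction(uuvs, areas)
```

Raising `cost_weight` makes the allocation favour nearby vehicles more strongly, while lowering it lets information gain dominate.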

Abstract:
In contrast to single-skill tasks, long-horizon tasks play a crucial role in our daily life, e.g., a pouring task requires a proper concatenation of reaching, grasping and pouring subtasks. As an efficient solution for transferring human skills to robots, imitation learning has achieved great progress over the last two decades. However, when learning long-horizon visuomotor skills, imitation learning often demands a large amount of semantically segmented demonstrations. Moreover, the performance of imitation learning could be susceptible to external perturbation and visual occlusion. In this paper, we exploit dynamical movement primitives and meta-learning to provide a new framework for imitation learning, called Meta-Imitation Learning with Adaptive Dynamical Primitives (MiLa). MiLa allows for learning unsegmented long-horizon demonstrations and adapting to unseen tasks with a single demonstration. MiLa can also resist external disturbances and visual occlusion during task execution. Real-world robotic experiments demonstrate the superiority of MiLa, irrespective of visual occlusion and random perturbations on robots.

Abstract:
Task-oriented dexterous grasp generation aims to generate stable and functional grasps that enable a robotic hand to effectively interact with objects to accomplish specific tasks. However, generating high-dimensional hand configurations that seamlessly adapt to diverse task requirements and object geometries remains a significant challenge. In this paper, we propose a novel pipeline called ColaDex to address this challenging problem. The core idea of ColaDex is to leverage a vision-language model (VLM) to select the dexterous grasp from a set of candidates that aligns well with the task description. To this end, we first introduce a contact-guided optimization method to generate a set of high-quality grasp candidates around the object through analytical optimization. Subsequently, to effectively prompt VLMs with the numerous sampled grasp candidates, we propose an object-centric approach that adaptively represents a group of candidates as prototypical contact maps, learned based on the geometric relationships between the grasping hand and object shape. We then feed the task requirement and the generated prototypical contact maps into the VLM, enabling it to reason about grasp-object interactions and assess their alignment with the given task, ultimately selecting the grasp that best aligns with the task requirement. Extensive experiments demonstrate that our prototypical contact map is a more informative prompting mechanism than conventional RGB images, enabling ColaDex to consistently generate high-quality task-oriented grasps and achieve a high success rate across diverse objects and tasks.

Abstract:
With the rapid advancement of autonomous driving technology, a lack of data has become a major obstacle to enhancing perception model accuracy. Researchers are now exploring controllable data generation using world models to diversify datasets. However, previous work has been limited to studying image generation quality on specific public datasets. There is still relatively little research on how to build data generation engines for real-world application scenes to achieve large-scale data generation for challenging scenes. In this paper, a simulator-conditioned scene generation engine based on a world model is proposed. By constructing a simulation system consistent with real-world scenes, simulation data and labels, which serve as the conditions for data generation in the world model, can be collected for any scene. This yields a novel data generation pipeline that combines the powerful scene simulation capabilities of the simulation engine with the robust data generation capabilities of the world model. In addition, a benchmark with proportionally constructed virtual and real data is provided for exploring the capabilities of world models in real-world scenes. Quantitative results show that these generated images significantly improve the performance of downstream perception models. Finally, we explore the generative performance of the world model in urban autonomous driving scenarios. All the data and code will be available at https://github.com/Li-Zn-H/SimWorld.

Abstract:
Semantic Scene Completion (SSC) constitutes a pivotal element in autonomous driving perception systems, tasked with inferring the 3D semantic occupancy of a scene from sensory data. To improve accuracy, prior research has implemented various computationally demanding and memory-intensive 3D operations, imposing significant computational requirements on the platform during training and testing. This paper proposes L2COcc, a lightweight camera-centric SSC framework that also accommodates LiDAR inputs. With our proposed efficient voxel transformer (EVT) and cross-modal knowledge modules, including feature similarity distillation (FSD), TPV distillation (TPVD) and prediction alignment distillation (PAD), our method substantially reduces the computational burden while maintaining high accuracy. The experimental evaluations demonstrate that our proposed method surpasses the current state-of-the-art vision-based SSC methods in accuracy on both the SemanticKITTI and SSCBench-KITTI-360 benchmarks. Additionally, our method is more lightweight, exhibiting a reduction in both memory consumption and inference time by over 23% compared to the current state-of-the-art method. Code is available at our project page: https://studyingfufu.github.io/L2COcc/.

Abstract:
Curriculum learning has emerged as a promising approach for training complex robotics tasks, yet current applications predominantly rely on manually designed curricula, which demand significant engineering effort and can suffer from subjective and suboptimal human design choices. While automated curriculum learning has shown success in simple domains like grid worlds and games where task distributions can be easily specified, robotics tasks present unique challenges: they require handling complex task spaces while maintaining relevance to target domain distributions that are only partially known through limited samples. To this end, we propose Grounded Adaptive Curriculum Learning (GACL), a framework specifically designed for robotics curriculum learning with three key innovations: (1) a task representation that consistently handles complex robot task design, (2) an active performance tracking mechanism that allows adaptive curriculum generation appropriate for the robot’s current capabilities, and (3) a grounding approach that maintains target domain relevance through alternating sampling between reference and synthetic tasks. We validate GACL on wheeled navigation in constrained environments and quadruped locomotion in challenging 3D confined spaces, achieving 6.8% and 6.1% higher success rates, respectively, than state-of-the-art methods in each domain.

Abstract:
Robot-assisted endoluminal intervention is an emerging tool for treating luminal lesions. Vision-based endoluminal navigation, particularly through video-CT registration, is a tangible way of obtaining absolute camera position information. By using pre-operative CT data, accurate endoscope localization can be achieved without the need for additional tracking hardware intraoperatively. However, aligning preoperative CT with the intraoperative domain remains a challenge. Although approaches such as style transfer have been explored, patient-specific textures and intra-operative artifacts can significantly complicate the task. To overcome these challenges, we propose R2Nav, a robust, real-time test-time adaptation method for endoluminal navigation. R2Nav constructs a confidence buffer during the testing phase, refining the model only for frames with high uncertainty. We introduce a registration-augmented model refinement strategy, which enhances both accuracy and efficiency of the system by selecting relevant training samples from the virtual gallery. Additionally, we propose a novel warm-up strategy for the registration encoder during the initial testing phase, enabling the extraction of more robust features when the model is suboptimal. Extensive validation demonstrates that R2Nav outperforms the current state-of-the-art methods, offering significant advantages for real-time, intra-operative endoluminal navigation. Code is at: https://github.com/EndoluminalSurgicalVision-IMR/R2Nav.

Abstract:
Large language models (LLMs) represent a significant advancement in integrating physical robots with AI-driven systems. In this research, we present a framework that leverages Robotics Decision-Making Models (RDMM) for decision-making in domain-specific contexts, enhancing robotic autonomy. This framework incorporates agent-specific knowledge representation, allowing robots to recall and utilize their capabilities and past experiences for improved decision-making. Unlike other approaches, our method prioritizes real-time, on-device solutions, successfully operating on hardware with as little as 8GB of memory. The framework integrates visual perception models, providing robots with a better understanding of their environment. Additionally, real-time speech recognition capabilities are included, improving the human-robot interaction experience. Experimental results show that the RDMM framework achieves planning accuracy of 93%. Furthermore, we introduce a novel dataset consisting of 27k planning instances and 1.3k annotated text-image samples, specifically curated from real-world robotic tasks in competition scenarios. The framework, benchmarks, datasets, and models developed in this work are publicly available on our project website at https://github.com/shadynasrat/RDMM.

Abstract:
Reliable semantic segmentation of open environments is essential for intelligent systems, yet significant problems remain: 1) Existing RGB-T semantic segmentation models mainly rely on low-level visual features and lack high-level textual information, and therefore struggle with accurate segmentation when categories share similar visual characteristics. 2) While SAM excels in instance-level segmentation, integrating it with thermal images and text is hindered by modality heterogeneity and computational inefficiency. To address these, we propose TASeg, a text-aware RGB-T segmentation framework that uses Low-Rank Adaptation (LoRA) fine-tuning to adapt vision foundation models. Specifically, we propose a Dynamic Feature Fusion Module (DFFM) in the image encoder, which effectively merges features from multiple visual modalities while freezing SAM’s original transformer blocks. Additionally, we incorporate CLIP-generated text embeddings in the mask decoder to enable semantic alignment, which further rectifies the classification error and improves the semantic understanding accuracy. Experimental results across diverse datasets demonstrate that our method achieves superior performance in challenging scenarios with fewer trainable parameters.

Abstract:
Successful execution of dexterous robotic manipulation tasks in new environments, such as grasping, depends on the ability to proficiently segment unseen objects from the background and other objects. Previous works in unseen object instance segmentation (UOIS) train models on large-scale datasets, which often leads to overfitting on static visual features. This dependency results in poor generalization performance when confronted with out-of-distribution scenarios. To address this limitation, we rethink the task of UOIS based on the principle that vision is inherently interactive and occurs over time. We propose a novel real-time interactive perception framework, rt-RISeg, that continuously segments unseen objects by robot interactions and analysis of a designed body frame-invariant feature (BFIF). We demonstrate that the relative rotational and linear velocities of randomly sampled body frames, resulting from selected robot interactions, can be used to identify objects without any learned segmentation model. This fully self-contained segmentation pipeline generates and updates object segmentation masks throughout each robot interaction without the need to wait for an action to finish. We showcase the effectiveness of our proposed interactive perception method by achieving an average object segmentation accuracy rate 27.5% greater than state-of-the-art UOIS methods. Furthermore, although rt-RISeg is a standalone framework, we show that the autonomously generated segmentation masks can be used as prompts to vision foundation models for significantly improved performance.

Abstract:
The highly nonlinear dynamics of vehicles present a major challenge for the practical implementation of optimal control and Model Predictive Control (MPC) approaches in path planning and tracking applications. Koopman operator theory offers a global linear representation of nonlinear dynamical systems, making it a promising framework for optimization-based vehicle control. This paper introduces a novel deep learning-based Koopman modeling approach that employs deep neural networks to capture the full vehicle dynamics, from pedal and steering inputs to chassis states, within a curvilinear Frenet frame. The superior accuracy of the Koopman model compared to identified linear models is shown for a double lane change maneuver. Furthermore, it is shown that an MPC controller deploying the Koopman model provides significantly improved performance while maintaining computational efficiency comparable to a linear MPC.
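
To make the idea of a global linear representation concrete, the sketch below lifts a toy nonlinear system with a fixed dictionary of observables and fits a linear operator by least squares (extended dynamic mode decomposition). This is not the paper's deep-learning model or a vehicle model: the toy system `step`, the hand-picked dictionary `lift`, and all parameter values are assumptions for illustration, whereas the paper learns its lifting with neural networks.

```python
import numpy as np

def step(x, a=0.9, b=0.5):
    """Toy nonlinear discrete-time system (an assumption, not a vehicle model)."""
    x1, x2 = x
    return np.array([a * x1, b * x2 + (a**2 - b) * x1**2])

def lift(x):
    """Hand-picked observables; a learned encoder would replace this."""
    x1, x2 = x
    return np.array([x1, x2, x1**2])

def fit_koopman(states, next_states):
    """Least-squares fit of K such that lift(x_next) ~= K @ lift(x)."""
    Phi = np.stack([lift(x) for x in states])            # shape (N, d)
    Phi_next = np.stack([lift(x) for x in next_states])  # shape (N, d)
    K_T, *_ = np.linalg.lstsq(Phi, Phi_next, rcond=None)  # solves Phi @ K.T ~= Phi_next
    return K_T.T

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(200, 2))
Y = np.array([step(x) for x in X])
K = fit_koopman(X, Y)

# Roll out the linear model in lifted space and compare with the true system.
x = np.array([0.8, -0.3])
z = lift(x)
for _ in range(10):
    x = step(x)
    z = K @ z
koopman_error = float(np.linalg.norm(z[:2] - x))
```

For this particular toy system the chosen observables span an invariant subspace, so the linear rollout matches the nonlinear one up to numerical precision. In general the lifted model is only an approximation. A linear lifted model of this kind is what allows an MPC problem to be posed with linear dynamics constraints.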

Abstract:
Designing a whole-body controller for loco-manipulation in unstructured real-world environments remains a formidable challenge. Previous approaches have primarily focused on extending the workspace of robotic arms while maintaining quadrupedal landing postures. However, these methods fail to fully exploit the mobility of legged robots. To address these limitations, we propose a unified framework for collision-free loco-manipulation in real-world applications. The framework comprises two key modules: (1) a Loco-manipulation Motion Prior, which generates loco-manipulation skill trajectories via Trajectory Optimization (TO), and (2) a Collision-free Manipulation module using a Model Predictive Path Integral (MPPI)-based trajectory generator and a vector-based trajectory follower. Extensive experiments have been conducted in both simulation and real-world scenarios to evaluate our framework’s tracking accuracy, whole-body coordination, and workspace expansion capabilities. Supplementary and videos are available at: https://sites.google.com/view/loco-mani-amp/

Abstract:
This paper introduces FlipWalker, a novel underactuated robot locomotion system inspired by the Jacob’s Ladder illusion toy, designed to traverse challenging terrains where wheeled robots often struggle. Like the Jacob’s Ladder toy, FlipWalker features two interconnected segments joined by flexible cables, enabling it to pivot and flip around singularities in a manner reminiscent of the toy’s cascading motion. Actuation is provided by motor-driven legs within each segment that push off either the ground or the opposing segment, depending on the robot’s current configuration. A physics-based model of the underactuated flipping dynamics is formulated to elucidate the critical design parameters governing forward motion and obstacle clearance or climbing. The untethered prototype weighs 0.78 kg and achieves a maximum flipping speed of 0.2 body lengths per second. Experimental trials on artificial grass, river rocks, and snow demonstrate that FlipWalker’s flipping strategy, which relies on ground reaction forces applied normal to the surface, offers a promising alternative to traditional locomotion for navigating irregular outdoor terrain.

Abstract:
With the increasing presence of automated vehicles on open roads under driver supervision, disengagement cases are becoming more prevalent. While some data-driven planning systems attempt to directly utilize these disengagement cases for policy improvement, the inherent scarcity of disengagement data (often occurring as a single instance) restricts training effectiveness. Furthermore, some disengagement data should be excluded, since a disengagement does not always result from a failure of the driving policy; for example, the driver may casually intervene for a while. To this end, this work proposes disengagement-reason-augmented reinforcement learning (DRARL), which enhances the driving policy improvement process according to the reason for each disengagement case. Specifically, the reason for a disengagement is identified by an out-of-distribution (OOD) state estimation model. When no reason is found, the case is identified as a casual disengagement, which doesn’t require additional policy adjustment. Otherwise, the policy can be updated in a reason-augmented imagination environment, improving policy performance on disengagement cases with similar reasons. The method is evaluated using real-world disengagement cases collected by autonomous driving robotaxis. Experimental results demonstrate that the method accurately identifies policy-related disengagement reasons, allowing the agent to handle both original and semantically similar cases through reason-augmented training. Furthermore, the approach prevents the agent from becoming overly conservative after policy adjustments. Overall, this work provides an efficient way to improve driving policy performance with disengagement cases.

Abstract:
Accurately perceiving dynamic environments is a fundamental task for autonomous driving and robotic systems. Existing methods inadequately utilize temporal information, relying mainly on local temporal interactions between adjacent frames and failing to leverage global sequence information effectively. To address this limitation, we investigate how to effectively aggregate global temporal features from temporal sequences, aiming to achieve occupancy representations that efficiently utilize global temporal information from historical observations. For this purpose, we propose a global temporal aggregation denoising network named GTAD, introducing a global temporal information aggregation framework as a new paradigm for holistic 3D scene understanding. Our method employs an in-model latent denoising network to aggregate local temporal features from the current moment and global temporal features from historical sequences. This approach enables the effective perception of both fine-grained temporal information from adjacent frames and global temporal patterns from historical observations. As a result, it provides a more coherent and comprehensive understanding of the environment. Extensive experiments on the nuScenes and Occ3D-nuScenes benchmark and ablation studies demonstrate the superiority of our method.

Abstract:
Recent advancements in autonomous driving (AD) systems have highlighted the potential of world models in achieving robust and generalizable performance across both ordinary and challenging driving conditions. However, a key challenge remains: precise and flexible camera pose control, which is crucial for accurate viewpoint transformation and realistic simulation of scene dynamics. In this paper, we introduce PosePilot, a lightweight yet powerful framework that significantly enhances camera pose controllability in generative world models. Drawing inspiration from self-supervised depth estimation, PosePilot leverages structure-from-motion principles to establish a tight coupling between camera pose and video generation. Specifically, we incorporate self-supervised depth and pose readouts, allowing the model to infer depth and relative camera motion directly from video sequences. These outputs drive pose-aware frame warping, guided by a photometric warping loss that enforces geometric consistency across synthesized frames. To further refine camera pose estimation, we introduce a reverse warping step and a pose regression loss, improving viewpoint precision and adaptability. Extensive experiments on autonomous driving and general-domain video datasets demonstrate that PosePilot significantly enhances structural understanding and motion reasoning in both diffusion-based and auto-regressive world models. By steering camera pose with self-supervised depth, PosePilot sets a new benchmark for pose controllability, enabling physically consistent, reliable viewpoint synthesis in generative world models.

Abstract:
As marine ecosystems face rapid declines, field observations have become essential for better understanding our oceans. Fish-inspired robots are a promising solution, as they are less disruptive than propeller-based approaches in sensitive environments. However, in both fish and fish-inspired robots, there is a trade-off between speed (that favors rigid bodies) and maneuverability (that favors flexible bodies). In this work, we present BlueKoi, an untethered, fish-inspired robotic platform that leverages both a stiff tuna-inspired tail for efficient swimming and a koi-inspired rotating head for maneuvering, reaching speeds of 1.84 body lengths per second and a turn radius of 1.93 body lengths. We experimentally quantify the robot’s turn radius under varying conditions and develop a reduced-order model to both understand the turning behavior and inform future design decisions, without needing explicit measurements of hydrodynamic coefficients. Furthermore, we show that our model is not only accurate but also capable of extending simulations to account for future design modifications. By decoupling propulsion and maneuverability, BlueKoi is a scalable and modular platform that enables adaptability for diverse sensing and navigation needs.

Abstract:
3D point cloud registration is an essential problem in computer vision, robotics, surgical navigation and augmented reality. Accurate registration of partially overlapped intraoperative point clouds (e.g., femoral reconstruction) remains critical yet challenging in orthopedic navigation due to incomplete overlap and dynamic noise. In this study, we propose a partial-to-partial point cloud registration framework based on directional spatial consistency. First, we extract overlapped areas from partially overlapping point clouds and leverage the point registration graph matching module to calculate the hard point matching matrix. Second, we sample nodes from the source point cloud and generate translation-invariant edge vectors (direction/scale-preserving) via their k-nearest neighbors, guided by predicted point correspondences. This bypasses translation ambiguities by encoding spatial consistency through edges, reducing pose estimation to 3DoF alignment (rotation). The loss explicitly couples point-level matches with edge-level geometric constraints for dual optimization. Building upon this framework, we extract reliable overlapping edge representations and prune their similarity matrix by thresholding low-confidence scores, effectively suppressing spurious matches. The proposed edge-aware matching mechanism further exploits the translation invariance of local structures to refine point correspondences with enhanced accuracy. Finally, we introduce a bidirectional registration mechanism to reinforce optimization stability, achieving state-of-the-art performance across benchmarks. Extensive experiments on ModelNet40, ShapeNet, and MedShapeNet validate our method under diverse scenarios: partial-to-partial, unseen categories, partial-to-full, and cross-dataset generalization, surpassing existing methods in registration accuracy. The codes are available at https://github.com/pidan0824/DSCGM.

Abstract:
Safety is a critical concern in human-robot collaboration (HRC). As collaborative robots take on increasingly complex tasks in human environments, their systems have become more sophisticated through the integration of multimodal sensors, including force-torque sensors, cameras, LiDARs, and IMUs. However, existing studies on HRC safety primarily focus on ensuring safety under normal operating conditions, overlooking scenarios where internal sensor faults occur. While anomaly detection modules can help identify sensor errors and mitigate hazards, two key challenges remain: (1) no anomaly detector is flawless, and (2) not all sensor malfunctions directly threaten human safety. Relying solely on anomaly detection can lead to missed errors or excessive false alarms. To enhance safety in real-world HRC applications, this paper introduces a deep learning-based method that proactively predicts hazards following the detection of sensory anomalies. We simulate two common types of faults, bias and noise, affecting joint sensors and monitor abnormal manipulator behaviors that could pose risks in fenceless HRC environments. A dataset of 2,400 real-world samples is collected to train the proposed hazard prediction model. The approach leverages multimodal inputs, including RGB-D images, human pose, joint states, and planned robot paths, to assess whether sensor malfunctions could lead to hazardous events. Experimental results show that the proposed method outperforms state-of-the-art models, while offering faster inference speed. Additionally, cross-scenario testing confirms its strong generalization capabilities. The code and datasets are available at: DL-based-Hazard-Prediction.

Abstract:
Object Goal Navigation, which requires an agent to locate a specific object in an unseen environment, remains a core challenge in embodied AI. Although recent progress in Vision-Language Model (VLM)-based agents has demonstrated promising perception and decision-making abilities through prompting, none has yet established a fully modular world model design that reduces risky and costly interactions with the environment by predicting the future state of the world. We introduce WMNav, a novel World Model-based Navigation framework powered by Vision-Language Models (VLMs). It predicts possible outcomes of decisions and builds memories to provide feedback to the policy module. To retain the predicted state of the environment, WMNav proposes the online maintained Curiosity Value Map as part of the world model memory to provide dynamic configuration for navigation policy. By decomposing according to a human-like thinking process, WMNav effectively alleviates the impact of model hallucination by making decisions based on the feedback difference between the world model plan and observation. To further boost efficiency, we implement a two-stage action proposer strategy: broad exploration followed by precise localization. Extensive evaluation on HM3D and MP3D validates that WMNav surpasses existing zero-shot benchmarks in both success rate and exploration efficiency (absolute improvement: +3.2% SR and +3.2% SPL on HM3D, +13.5% SR and +1.1% SPL on MP3D). Project page: https://b0b8k1ng.github.io/WMNav/.
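
The SR and SPL figures quoted above are standard embodied-navigation metrics. As a reference, the sketch below computes success rate and Success weighted by Path Length under the commonly used definition, where each successful episode is weighted by the ratio of shortest-path length to the longer of the shortest-path and actual path lengths. The function names and the episode values are made up for illustration and are not results from the paper.

```python
def success_rate(successes):
    """Fraction of episodes that ended in success (entries are 0 or 1)."""
    return sum(successes) / len(successes)

def spl(successes, shortest_lengths, path_lengths):
    """Success weighted by Path Length, averaged over episodes."""
    total = 0.0
    for s, shortest, taken in zip(successes, shortest_lengths, path_lengths):
        denom = max(taken, shortest)
        if denom > 0:  # skip degenerate zero-length episodes
            total += s * shortest / denom
    return total / len(successes)

# Hypothetical episodes: (success flag, shortest path in m, path taken in m).
successes = [1, 0, 1]
shortest_lengths = [5.0, 4.0, 10.0]
path_lengths = [10.0, 8.0, 10.0]

sr_value = success_rate(successes)
spl_value = spl(successes, shortest_lengths, path_lengths)
```

In this toy example the first episode succeeds with a path twice the optimal length (contributing 0.5), the second fails (contributing 0), and the third succeeds optimally (contributing 1), which shows how SPL penalizes inefficient exploration that SR alone would not reveal.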

Abstract:
Deep reinforcement learning (DRL) has emerged as a promising approach for robotic control, but its real-world deployment remains challenging due to its vulnerability to environmental perturbations. Existing white-box adversarial attack methods, adapted from supervised learning, fail to effectively target DRL agents as they overlook temporal dynamics and indiscriminately perturb all state dimensions, limiting their impact on long-term rewards. To address these challenges, we propose the Adaptive Gradient-Masked Reinforcement (AGMR) Attack, a white-box attack method that combines DRL with a gradient-based soft masking mechanism to dynamically identify critical state dimensions and optimize adversarial policies. AGMR selectively allocates perturbations to the most impactful state features and incorporates a dynamic adjustment mechanism to balance exploration and exploitation during training. Extensive experiments demonstrate that AGMR outperforms state-of-the-art adversarial attack methods in degrading the performance of the victim agent and enhances the victim agent’s robustness through adversarial defense mechanisms.

Abstract:
Bimanual dexterous manipulation remains a significant challenge in robotics due to the high DoFs of each hand and their coordination. Existing single-hand manipulation techniques often leverage human demonstrations to guide RL methods but fail to generalize to complex bimanual tasks involving multiple sub-skills. In this paper, we propose VTAO-BiManip, a novel framework that integrates visual-tactile-action pre-training with object understanding, aiming to enable human-like bimanual manipulation via curriculum reinforcement learning (RL). We improve prior learning by incorporating hand motion data, providing more effective guidance for dual-hand coordination. Our pretraining model predicts future actions as well as object pose and size using masked multimodal inputs, facilitating cross-modal regularization. To address the multi-skill learning challenge, we introduce a two-stage curriculum RL approach to stabilize training. We evaluate our method on a bimanual bottle-cap twisting task, demonstrating its effectiveness in both simulated and real-world environments. Our approach achieves a success rate that surpasses existing visual-tactile pretraining methods by over 20%.

Abstract:
Recent advancements in Large Language Models (LLMs) have catalyzed numerous efforts to apply these technologies to embodied tasks, with a particular focus on high-level task planning and task decomposition. LLMs face challenges in understanding the physical world, especially regarding spatial, temporal, and causal relationships among objects and actions. Moreover, the current benchmarks for evaluating these relationships are limited. To further investigate this domain, we introduce a novel embodied task planning benchmark, ET-Plan-Bench. This benchmark features a controllable and diverse array of embodied tasks, varying in levels of difficulty and complexity. It is designed to evaluate two critical dimensions of LLMs’ application in embodied task understanding: spatial understanding (including relation constraints and occlusion of target objects) and temporal and causal comprehension of sequences of actions within an environment. Utilizing multi-source simulators as the backend simulator, ET-Plan-Bench provides immediate environmental feedback to LLMs, enabling dynamic interaction with the environment and the capacity for re-planning as necessary. We evaluated state-of-the-art open-source and closed-source foundational models, including GPT-4, Llama, and Mistral, using our proposed benchmark. While these models perform adequately on simple navigation tasks, their performance significantly deteriorates when confronted with tasks that demand a deeper understanding of spatial, temporal, and causal relationships. Consequently, our benchmark distinguishes itself as a large-scale, quantifiable, highly automated, and fine-grained diagnostic framework that presents a substantial challenge to the latest foundational models. We hope it will inspire and propel further research in embodied task planning utilizing foundational models. Code available at: https://github.com/ET-Plan-Bench/ET-Plan-Bench

Abstract:
Multi-Agent Motion Planning (MAMP) seeks collision-free trajectories for multiple agents from their respective start to goal locations among static obstacles, while minimizing a cost function over the trajectories. Existing approaches for this problem include graph-based, Mixed-Integer Programming (MIP)-based, and trajectory optimization-based methods, each with its own limitations. This paper introduces a new approach for MAMP based on a Mixed Integer Conic Programming (MICP) formulation that complements these existing approaches. We show that our formulation is valid and test our approach against various baselines, including a graph-based method that combines search and sampling, as well as different MIP formulations. The numerical results show that the solutions found by our approach are sometimes eight times closer to the true optimum than the ones found by the baseline when given the same runtime limit. We also verify our approach with multiple drones in a lab setting.

Abstract:
Autonomous driving (AD) continues to grapple with the complexity of dynamic and interactive traffic environments, where the primary difficulty stems from insufficient modeling of inter-vehicle interactions, particularly how autonomous agents should perceive and respond to surrounding vehicles’ influence. To address this, this paper proposes ExpliDrive, an explainable data-driven approach for interaction-aware autonomous driving. Its highlight lies in bridging Model Predictive Control (MPC) and Transformers. The proposed approach builds generalized system dynamics in which interaction effects between vehicles are explicitly modeled. Specifically, a Transformer encoder-decoder is employed to encode the interaction patterns among vehicles, and these learned effects are seamlessly embedded into the motion planning process. Hence, the proposed approach has the following features: i) enabling proactive, interaction-aware autonomous driving; ii) being data-driven yet explainable; iii) integrating prediction into motion planning. Open-loop evaluation demonstrates that the proposed approach achieves the lowest prediction errors, from ADE@1s (0.16 m) to ADE@5s (0.80 m). Closed-loop planning shows that the proposed approach has significant benefits in driving success rate and flexibility.
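
The ADE@1s and ADE@5s figures above refer to average displacement error at a prediction horizon. The sketch below shows one common way to compute it, as the mean Euclidean distance between predicted and ground-truth positions over all steps up to the horizon. The exact convention used in the paper is not stated in the abstract, so this definition, the function name `ade_at`, the 0.1 s sampling period, and the toy trajectories are all assumptions.

```python
import numpy as np

def ade_at(pred, gt, dt, horizon_s):
    """Mean Euclidean error over all predicted steps up to horizon_s seconds."""
    pred = np.asarray(pred, dtype=float)
    gt = np.asarray(gt, dtype=float)
    steps = int(round(horizon_s / dt))
    if pred.shape != gt.shape or steps < 1 or steps > len(pred):
        raise ValueError("trajectories must match and cover the horizon")
    errors = np.linalg.norm(pred[:steps] - gt[:steps], axis=1)
    return float(errors.mean())

# Toy data: straight-line ground truth, prediction drifting laterally at 0.1 m/s.
dt = 0.1
t = np.arange(1, 51) * dt                                   # 0.1 s ... 5.0 s
gt = np.stack([10.0 * t, np.zeros_like(t)], axis=1)
pred = gt + np.stack([np.zeros_like(t), 0.1 * t], axis=1)

ade_1s = ade_at(pred, gt, dt, 1.0)
ade_5s = ade_at(pred, gt, dt, 5.0)
```

Because the error is averaged over the whole horizon, a steadily drifting prediction yields a larger ADE at 5 s than at 1 s, which is the pattern reported in the abstract.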

Abstract:
Efficient navigation and search in unknown environments for multiple objects is a fundamental challenge in robotics, particularly in applications such as warehouse management, domestic assistance, and search-and-rescue. The Multi-Object Search (MOS) problem involves navigating to a sequence of locations to maximize the likelihood of finding target objects while minimizing travel costs. In this paper, we introduce a novel approach to the MOS problem, called Finder, which leverages vision language models (VLMs) to locate multiple objects across diverse environments. Specifically, our approach introduces multi-channel score maps to track and reason about multiple objects simultaneously during navigation, along with a score map technique that combines scene-level and object-level semantic correlations. We validate our approach through extensive experiments in both simulated and real-world environments. The results demonstrate that Finder outperforms existing multi-object search methods using deep reinforcement learning and VLMs. Additional ablation and scalability studies highlight the importance of our design choices and show the system’s robustness with an increasing number of target objects. Website: https://find-all-my-things.github.io/

Abstract:
Monocular depth estimation (MDE) is crucial for various computer vision applications, but existing methods often struggle to balance inference speed and accuracy when processing large-region visual information. This paper introduces LR2Depth, a novel MDE method that addresses this challenge by utilizing large-kernel convolution on low-resolution feature maps for efficient large-region feature aggregation. Our approach leverages the fact that each pixel on low-resolution feature maps corresponds to a larger region of the original image, allowing for fast and accurate depth predictions at a lower inference cost. Extensive experiments on the NYU-Depth-V2, KITTI, and SUN RGB-D datasets demonstrate that LR2Depth not only achieves state-of-the-art performance but also operates approximately twice as fast as previous MDE methods. Notably, at the time of submission, LR2Depth secured the top-1 position on the KITTI depth prediction online benchmark. The code is available on the project page.

Abstract:
Loop closures are essential for correcting odometry drift and creating consistent maps, especially in the context of large-scale navigation. Current methods using dense point clouds for accurate place recognition do not scale well due to computationally expensive scan-to-scan comparisons. Alternative object-centric approaches are more efficient but often struggle with sensitivity to viewpoint variation. In this work, we introduce REGRACE, a novel approach that addresses these challenges of scalability and perspective difference in re-localization by using LiDAR-based submaps. We introduce rotation-invariant features for each labeled object and enhance them with neighborhood context through a graph neural network. To identify potential revisits, we employ a scalable bag-of-words approach, pooling one learned global feature per submap. Additionally, we define a revisit with geometrical consistency cues rather than embedding distance, allowing us to recognize far-away loop closures. Our evaluations demonstrate that REGRACE achieves similar results compared to state-of-the-art place recognition and registration baselines while being twice as fast. Code and models are publicly available.

Abstract:
This paper proposes a simple controller for bipedal locomotion that can stabilize three-axis (roll, pitch, and yaw) rotation without relying on ground reaction moment manipulation. Extra acceleration of the CoM (center-of-mass) from the nominal DCM (divergent component of motion) dynamics generates a moment around the CoM. Based on this principle, the behavior of the desired DCM is modulated to carry the signal for rotation stabilization. A robust walking controller is synthesized by combining the proposed rotation stabilizer with continuous step adaptation. A simulation study shows that the proposed controller is capable of robust disturbance rejection and yaw-regulated walking of a point-foot robot.
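
As background for the nominal DCM dynamics mentioned above, the standard relations under the linear inverted pendulum model are sketched below. The abstract does not give the paper's own formulation, so the symbols are assumptions: $c$ is the horizontal CoM position, $z_c$ a constant CoM height, $g$ gravitational acceleration, $p$ the point of support (the contact point for a point-foot robot), and $\xi$ the DCM.

```latex
\omega = \sqrt{\frac{g}{z_c}}, \qquad
\xi = c + \frac{\dot{c}}{\omega}, \qquad
\ddot{c} = \omega^{2}\,(c - p)
\quad\Longrightarrow\quad
\dot{c} = \omega\,(\xi - c), \qquad
\dot{\xi} = \dot{c} + \frac{\ddot{c}}{\omega} = \omega\,(\xi - p)
```

In this nominal model the CoM converges toward the DCM while the DCM diverges from the support point, so a CoM acceleration that departs from $\omega^{2}(c - p)$ is the kind of "extra acceleration" the abstract refers to.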

Abstract:
Predicting the future motion of road participants is a critical task in autonomous driving. In this work, we address the challenge of low-quality generation of low-probability modes in multi-agent joint prediction. To tackle this issue, we propose a two-stage multi-agent interactive prediction framework named keypoint-guided joint prediction after classification-aware marginal proposal (JAM). The first stage is modeled as a marginal prediction process, which classifies queries by trajectory type to encourage the model to learn all categories of trajectories, providing comprehensive mode information for the joint prediction module. The second stage is modeled as a joint prediction process, which takes the scene context and the marginal proposals from the first stage as inputs to learn the final joint distribution. We explicitly introduce key waypoints to guide the joint prediction module in better capturing and leveraging the critical information from the initial predicted trajectories. We conduct extensive experiments on the real-world Waymo Open Motion Dataset interactive prediction benchmark. The results show that our approach achieves competitive performance. In particular, in the framework comparison experiments, the proposed JAM outperforms other prediction frameworks and achieves state-of-the-art performance in interactive trajectory prediction. The code is available at https://github.com/LinFunster/JAM to facilitate future research.

Abstract:
Multi-Agent Path Finding (MAPF) involves finding collision-free paths for multiple agents while minimizing a cost function, an NP-hard problem. Bounded suboptimal methods like Enhanced Conflict-Based Search (ECBS) and Explicit Estimation CBS (EECBS) balance solution quality with computational efficiency using focal search mechanisms. While effective, traditional focal search faces a limitation: the lower bound (LB) value determining which nodes enter the FOCAL list often increases slowly in early search stages, resulting in a constrained search space that delays finding valid solutions. In this paper, we propose a novel bounded suboptimal algorithm, double-ECBS (DECBS), to address this issue by first determining the maximum LB value and then employing a best-first search guided by this LB to find a collision-free path. Experimental results demonstrate that DECBS outperforms ECBS in most test cases and is compatible with existing optimization techniques. DECBS can reduce high-level Constraint Tree (CT) nodes by nearly 30% and low-level focal search nodes by 50%. When agent density is high, DECBS achieves a 23.5% average runtime improvement over ECBS with identical suboptimality bounds and optimizations.
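
The role of the lower bound in traditional focal search can be illustrated with a minimal sketch: FOCAL contains the OPEN nodes whose f-value is within a factor w of the current lower bound, and the next node is chosen from FOCAL by a secondary criterion such as the number of conflicts. This shows generic focal search only, not DECBS; the node fields, the helper names, and the example values are hypothetical.

```python
def focal_list(open_nodes, w):
    """Return (LB, FOCAL): FOCAL holds nodes whose f-value is at most w * LB."""
    lb = min(node["f"] for node in open_nodes)
    return lb, [node for node in open_nodes if node["f"] <= w * lb]

def select_node(open_nodes, w):
    """Pick the FOCAL node with the fewest conflicts; ties broken by lower f."""
    _, focal = focal_list(open_nodes, w)
    return min(focal, key=lambda node: (node["conflicts"], node["f"]))

# Hypothetical OPEN list: f is the cost estimate, conflicts the secondary heuristic.
open_nodes = [
    {"name": "a", "f": 10, "conflicts": 3},
    {"name": "b", "f": 11, "conflicts": 0},
    {"name": "c", "f": 13, "conflicts": 0},
]
lb, focal = focal_list(open_nodes, w=1.2)
chosen = select_node(open_nodes, w=1.2)
```

With LB equal to 10 and w equal to 1.2, node "c" is excluded from FOCAL even though it has no conflicts. While LB stays low, such nodes cannot be expanded, which is the restriction on the search space that the abstract describes.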

Abstract:
We propose an object-centric recovery (OCR) framework to address the challenges of out-of-distribution (OOD) scenarios in visuomotor policy learning. Previous behavior cloning (BC) methods rely heavily on a large amount of labeled data coverage, failing in unfamiliar spatial states. Without relying on extra data collection, our approach learns a recovery policy constructed by an inverse policy inferred from the object keypoint manifold gradient in the original training data. The recovery policy serves as a simple add-on to any base visuomotor BC policy, agnostic to a specific method, guiding the system back towards the training distribution to ensure task success even in OOD situations. We demonstrate the effectiveness of our object-centric framework in both simulation and real robot experiments, achieving an improvement of 77.7% over the base policy in OOD. Furthermore, we show OCR’s capacity to autonomously collect demonstrations for continual learning. Overall, we believe this framework represents a step toward improving the robustness of visuomotor policies in real-world settings. Project Website: https://sites.google.com/view/ocr-penn

Abstract:
Autonomous vehicle trajectory planning faces significant challenges in dynamic traffic environments due to the complex and mixed causal relationships between critical scene elements (e.g., pedestrians, vehicles, road markings) and safe decision-making. To identify the causal factors influencing planning outcomes, we propose Causal-Planner, which disentangles the scene interaction graph into causal and confounding components via attention-based adversarial graph learning. Additionally, we introduce a long-short-term episodic memory gating (LSTEM) module that enhances causal interaction disentangling by adaptively capturing evolving causal relationships in dynamic scenarios through bidirectional gated memory fusion. Extensive experiments on the nuPlan dataset suggest that Causal-Planner achieves competitive performance, performing well in both Test-random and Test-hard scenarios under open-loop and closed-loop evaluations. The code will be publicly available at https://github.com/Yyb-XJTU/Causal-Planner.

Abstract:
In this paper, we present DL-Clip, an innovative online learning approach for nonlinear stabilizing control that operates without prior knowledge of system dynamics or reward signals, while significantly improving training efficiency. DL-Clip introduces a novel integration of stabilizing control with efficient Reinforcement Learning (RL) training mechanisms. The algorithm uses Lyapunov functions to ensure system stability and employs clipping operations to optimize policy updates, achieving faster convergence. We evaluate the effectiveness of DL-Clip through experiments, including simulations of the inverted pendulum and the Image-Based Visual Servoing (IBVS) for multicopter position stabilization. In addition, we validate the approach through a real flight experiment based on the IBVS problem, demonstrating its practical applicability.

Abstract:
Accurately recognizing the structural regions of targeted objects is crucial for successful manipulation. In this study, we concentrate on the task of hanging crumpled garments on a rack, a common scenario in household environments. This context presents two primary challenges: (1) perceiving and grasping the structural regions of garments that exhibit severe deformations and self-occlusions; (2) adjusting the configuration of garments to fit the supporting components of the rack. To address these challenges, we propose a confidence-guided grasping strategy that actively seeks garment collars through handovers between dual robotic arms. In particular, we develop an autonomous data collection procedure in real-world settings to train the collar detection network. The exact grasping pose is determined through depth-aware contour extraction, and its success is evaluated based on a specially designed metric. Furthermore, we formulate the hanging task as one-shot imitation learning with an egocentric view. To precisely align the collar with the supporting item, we propose a two-step hanging strategy that involves coarse approaching followed by fine transformation. We perform comprehensive experiments and show that our framework notably enhances the success rate compared to existing methods.

Abstract:
Constrained motion planning is a common but challenging problem in robotic manipulation. In recent years, data-driven constrained motion planning algorithms have shown impressive planning speed and success rate. Among them, the latent motion method based on manifold approximation is the most efficient planning algorithm. Due to errors in manifold approximation and the difficulty in accurately identifying collision conflicts within the latent space, time-consuming path validity checks and path replanning are required. In this paper, we propose a method that trains a neural network to predict the minimum distance between the robot and obstacles using latent vectors as inputs. The learned distance gradient is then used to calculate the direction of movement in the latent space to move the robot away from obstacles. Based on this, a local path optimization algorithm in the latent space is proposed, and it is integrated with the path validity checking process to reduce the time of replanning. The proposed method is compared with state-of-the-art algorithms in multiple planning scenarios, demonstrating the fastest planning speed.
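
The idea of moving along the learned distance gradient can be sketched as follows. This is only an illustration under stated assumptions: `predicted_distance` is a hand-written, hypothetical stand-in for the trained distance network, the latent space is two-dimensional, and the gradient is taken by finite differences rather than by backpropagation. The step rule and safety margin are illustrative choices, not the authors' settings.

```python
import numpy as np

def predicted_distance(z):
    """Hypothetical stand-in for the learned network: clearance between the
    latent point z and a ball-shaped 'obstacle' region in latent space."""
    obstacle_center = np.array([0.0, 0.0])
    obstacle_radius = 0.5
    return np.linalg.norm(z - obstacle_center) - obstacle_radius

def numerical_gradient(f, z, eps=1e-5):
    """Central finite-difference gradient of the scalar function f at z."""
    grad = np.zeros_like(z)
    for i in range(z.size):
        dz = np.zeros_like(z)
        dz[i] = eps
        grad[i] = (f(z + dz) - f(z - dz)) / (2 * eps)
    return grad

def push_away(z, safe_distance=0.2, step=0.05, max_iters=100):
    """Step z along the distance gradient until the predicted clearance
    reaches safe_distance (or the iteration budget runs out)."""
    z = np.array(z, dtype=float)
    for _ in range(max_iters):
        if predicted_distance(z) >= safe_distance:
            break
        g = numerical_gradient(predicted_distance, z)
        norm = np.linalg.norm(g)
        if norm < 1e-9:
            break  # flat gradient: no preferred direction
        z += step * g / norm
    return z

z0 = np.array([0.4, 0.3])  # latent point with zero clearance
z1 = push_away(z0)
```

In a local path optimizer, a step like `push_away` would be applied to each latent waypoint that the validity check flags, instead of replanning the whole path.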

Abstract:
Scene graph generation (SGG) is a structured approach to understanding real-world scenes with complex relations, which can enhance UAV autonomy in unfamiliar environments. However, SGG typically has numerous model parameters that require considerable computational resources. This paper proposes a refined SGG model and reduces the model parameters for its UAV applications. First, we use subject-object query pairs to predict triplets directly, eliminating the need for separate entity predictions. Additionally, the cross-attention mechanism enhances the model’s ability to query triplets. We use a single decoder to process subject and object entities simultaneously, enhancing computational speed and reducing the number of parameters. Then, we map the entities to the relational semantic space before performing relations classification, which improves the model performance by adding a small number of parameters. Finally, the set prediction loss function is designed for relation prediction to strengthen the role of relation prediction in triplets. Real-world UAV experiments show that our model can extract more triplets per second with fewer parameters than the benchmarks. Github: https://github.com/SupersPig/myLGTR.

Abstract:
In this paper, a vision-based method is proposed for cable installation tasks in constrained environments. The main challenge of such tasks lies in the potential interference between the cable and surrounding obstacles. Model-based approaches are not well-suited for these industrial scenarios due to variations in the physical properties of workpieces. To address this, the proposed method integrates a potential field-based tip trajectory regulation with a shape deformation servo. In the shape deformation servo, a planner is employed to determine a feasible shape curve that avoids obstacles. This step is crucial, as an infeasible reference for shape control may result in unstable behavior. The effectiveness of the proposed method is validated through experiments. Notably, this approach does not rely on prior model information, making it highly adaptable for industrial deployment.

Abstract:
Designing 3D indoor layouts is a crucial task with significant applications in embodied robot intelligence, virtual reality, and interior design. Existing methods for 3D layout design either rely on diffusion models, which utilize spatial relationship priors, or heavily leverage the inferential capabilities of proprietary Large Language Models (LLMs), which require extensive prompt engineering and in-context exemplars via black-box trials. These methods often face limitations in generalization and dynamic scene editing. In this paper, we introduce LLplace, a novel 3D indoor scene layout designer based on a lightweight, fine-tuned, open-source LLM, Llama3. LLplace circumvents the need for spatial relationship priors and in-context exemplars, enabling efficient and credible room layout generation based solely on user inputs specifying the room type and desired objects. We curated a new dialogue dataset based on the 3D-Front dataset, expanding the original data volume and incorporating dialogue data for adding and removing objects. This dataset can enhance the LLM’s spatial understanding. Furthermore, through dialogue, LLplace activates the LLM’s capability to understand 3D layouts and perform dynamic scene editing, enabling the addition and removal of objects. Our approach demonstrates that LLplace can effectively generate and edit 3D indoor layouts interactively and outperform existing methods in delivering high-quality 3D design solutions.

Abstract:
Robust robotic grasping in cluttered environments presents a significant challenge, as existing methods often neglect the complex interactions between the gripper, objects, and obstacles, leading to collisions and grasping failures. To address this, we propose a framework that integrates collision avoidance as a core constraint within the grasp pose optimization process. Central to this framework is a Neural Collision Detection (NCD) network that takes scene configurations and grasp poses as inputs, producing a collision score that approximates traditional collision detection functions. The NCD network provides critical feedback for refining grasp predictions and demonstrates strong generalization across diverse environments, facilitating efficient collision detection and constrained grasp pose optimization. Additionally, we incorporate frictional force closure, geometric symmetry, and surface alignment as regularization terms within the optimization function, enhancing the physical stability and geometric plausibility of the generated grasps. Extensive experiments conducted in real-world environments show a significant improvement in grasp success rates, with robust generalization to previously unseen objects and scenarios. These results validate the efficacy of our framework, highlighting its potential for enabling reliable robotic manipulation in complex and cluttered environments.

Abstract:
Autonomous drone delivery to door relies on a popular computer vision technique called semantic segmentation (SS) to recognize meaningful house segments and determine a precise drop-off point near the door. While this SS-based approach is effective under specific environments, it fails to perform well across other common environmental factors such as different seasons, hours of the day, weather and illumination levels. In this work, we propose Robust Structural Semantic Segmentation (RSSS), a novel patch to the existing SS solution without requiring re-training for new environments. The core idea is to “Let Strong Help Weak”, where the results of semantic segmentation obtained under favorable/strong conditions are utilized to enhance the weaker ones in adverse settings. This improvement is achieved by leveraging house structures and spatial layouts, which remain largely invariant across various environments. Our evaluation shows that RSSS outperforms the state-of-the-art methods and significantly enhances the robustness of SS and drone delivery across various environments. The dataset collected for this study is released at Github [1].

Abstract:
The goal of RGB-T tracking is to enhance the accuracy and robustness by leveraging the complementary features of RGB and TIR modalities in complex scenarios. Previous methods have overlooked the power of semantic features in extracting valuable information from different modalities and improving interactions across them. Moreover, using Bounding Boxes (BBox) for target initialization can cause issues like bounding box blurring and tracking drift when the target’s appearance changes or gets occluded. To address these challenges, we propose the CLIP-based RGB-T tracking algorithm TIETracker, which aims to exploit the complementary advantages of multimodality more effectively using textual information. Textual descriptions direct the backbone network to learn target representations in multimodality and facilitate the interaction of multi-modal features. Additionally, in scenarios of occlusion and scale transformations that lead to missing or altered target features, textual information adaptively supplements the target representation. This approach also improves the response in the image region of the target, addressing issues with bounding box accuracy and tracking drift. Our extensive evaluation on three leading RGB-T tracking benchmarks demonstrates that TIETracker achieves competitive performance compared to state-of-the-art methods, effectively countering feature loss from changes in target appearance and occlusion.

Abstract:
Robot-Assisted Feeding (RAF) systems are essential for assisting individuals with disabilities or motor impairments in eating tasks. Manipulating granular food items, such as rice and beans, poses significant challenges due to their dynamic physical properties. Learning from human demonstrations offers a promising solution, but acquiring high-quality demonstrations is complex. To address this, we present VERAGMIL, a framework that combines a high-fidelity simulator with an intuitive Virtual Reality (VR) interface for recording demonstrations and supporting different imitation learning methods. VERAGMIL provides a realistic environment for training RAF systems to handle granular materials, including robots, sensors, and various food items with distinct physical characteristics. We evaluate VERAGMIL by training three imitation learning models—BC, BC-RNN, and BCQ—on granular scooping and transporting tasks using both VR interface and 3D space mouse demonstrations, comparing them with a human-expert baseline. The models are assessed on success rate, spillage, generalization to unseen food items, and task completion time. Results show that VR-based demonstrations significantly outperform 3D space mouse data, with BCQ achieving the best overall performance, particularly in reducing spillage and approaching human performance. These findings underscore the effectiveness of our framework for training RAF systems in granular material handling. The code for our framework is publicly available at: https://github.com/AmanuelErgogo/VERAGMIL.git.

Abstract:
While magnetic micro-robots have demonstrated significant potential across various applications, including drug delivery and microsurgery, the open issue of precise navigation and control in complex fluid environments is crucial for in vivo implementation. This paper introduces a novel flow-aware navigation and control strategy for magnetic micro-robots that explicitly accounts for the impact of fluid flow on their movement. First, the proposed method employs a Physics-Informed U-Net (PI-UNet) to refine the numerically predicted fluid velocity using local observations. The predicted velocity is then incorporated into a flow-aware A* path planning algorithm, ensuring efficient navigation while mitigating flow-induced disturbances. Finally, a control scheme is developed to compensate for the predicted fluid velocity, thereby optimizing the micro-robot’s performance. A series of simulation studies and real-world experiments are conducted to validate the efficacy of the proposed approach. This method enhances both planning accuracy and control precision, expanding the potential applications of magnetic micro-robots in fluid-affected environments typical of many medical scenarios.
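
To make the planning step concrete, here is a minimal sketch of a flow-aware grid search, assuming the planner is an A*-style search over a discretized velocity field. Everything in it is hypothetical: the grid layout of `flow`, the unit cell spacing, the travel-time cost, and the simplification that only the flow component along a move changes the robot's ground speed. It is not the authors' planner.

```python
import heapq
import math

def flow_aware_astar(flow, start, goal, robot_speed=1.0):
    """A* on a 4-connected grid where each move costs its travel time.

    flow[r][c] = (u, v) is the flow velocity in the +column and +row
    directions at that cell (hypothetical layout, unit cell spacing).
    """
    rows, cols = len(flow), len(flow[0])
    max_flow = max(math.hypot(u, v) for row in flow for (u, v) in row)
    best_speed = robot_speed + max_flow  # no move can be faster than this

    def heuristic(cell):
        # Straight-line distance at the best possible speed: never overestimates.
        return math.hypot(cell[0] - goal[0], cell[1] - goal[1]) / best_speed

    open_heap = [(heuristic(start), 0.0, start)]
    best_cost = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, cost, cell = heapq.heappop(open_heap)
        if cell == goal:
            path = []
            while cell is not None:
                path.append(cell)
                cell = parent[cell]
            return path[::-1], cost
        if cost > best_cost[cell]:
            continue  # stale heap entry
        r, c = cell
        u, v = flow[r][c]
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (r + dr, c + dc)
            if not (0 <= nxt[0] < rows and 0 <= nxt[1] < cols):
                continue
            # Ground speed = own speed plus the flow component along the move.
            ground_speed = robot_speed + u * dc + v * dr
            if ground_speed <= 1e-6:
                continue  # flow too strong to advance in this direction
            new_cost = cost + 1.0 / ground_speed
            if new_cost < best_cost.get(nxt, math.inf):
                best_cost[nxt] = new_cost
                parent[nxt] = cell
                heapq.heappush(open_heap, (new_cost + heuristic(nxt), new_cost, nxt))
    return None, math.inf

# Middle row carries a strong flow opposing motion in the +column direction.
calm, against = (0.0, 0.0), (-0.8, 0.0)
flow = [[calm] * 4, [against] * 4, [calm] * 4]
path, travel_time = flow_aware_astar(flow, start=(1, 0), goal=(1, 3))
```

In this toy field the straight route along the middle row fights the current on every step, so the search detours through a calm row even though that route has more moves.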

Abstract:
Countless underwater robots seek to monitor aquatic environments while minimizing their impact on fragile ecosystems. At mm-scales, these systems can be used in a range of waterways, from shallow streams and rivers, to larger ponds and lakes, and navigate around large obstacles or through tight spaces in coral reefs, mangroves, or pipe systems. They can also be more readily used as platforms for biological study, as small-scale robots can more easily be integrated into bench-top characterization systems to verify hydrodynamic performance. Here, we present a new robotic platform, the Daniobot, a 16.5 mm body length (BL) microrobotic fish that is capable of achieving top speeds of 2.84 BL s⁻¹. At 23.8 mm total length (TL), Daniobot is, to the best of our knowledge, the smallest fish-inspired robot propelled by onboard actuators. We present the design, fabrication, and assembly of this robot as well as detailed position and velocity results at varying tail amplitudes and frequencies, and compare their trends to a simple analytical model. This design uses a single PZT bimorph actuator operating at 175 V, enabling future untethered experiments.

Abstract:
Autonomous navigation is a fundamental task for robot vacuum cleaners in indoor environments. Since their core function is to clean entire areas, robots inevitably encounter dead zones in cluttered and narrow scenarios. Existing planning methods often fail to escape due to complex environmental constraints, high-dimensional search spaces, and high-difficulty maneuvers. To address these challenges, this paper proposes an embodied escaping model that leverages a reinforcement learning-based policy with an efficient action mask for dead zone escaping. To alleviate the issue of the sparse reward in training, we introduce a hybrid training policy that improves learning efficiency. In handling redundant and ineffective action options, we design a novel action representation to reshape the discrete action space with a uniform turning radius. Furthermore, we develop an action mask strategy to select valid actions quickly, balancing precision and efficiency. In real-world experiments, our robot is equipped with a LiDAR, an IMU, and two-wheel encoders. Extensive quantitative and qualitative experiments across varying difficulty levels demonstrate that our robot can consistently escape from challenging dead zones. Moreover, our approach significantly outperforms the compared path planning and reinforcement learning methods in terms of success rate and collision avoidance. A video showcasing our methodology and real-world demonstrations is available at https://youtu.be/kBaaYWGhNuE.

Abstract:
Wearable robotic devices have been demonstrated to reduce muscle activation and metabolic cost during walking, but conventional motorized systems often impose significant weight and bulk, leading to user discomfort and limited portability. To address these limitations, the Robotic Achilles Tendon (RAT) was developed as a lightweight, semi-passive spring system that delivers ankle assistance exclusively during the stance phase. The RAT integrates a double-acting pneumatic cylinder and a solenoid valve to emulate spring behavior when the valve is closed and to permit unrestricted ankle motion when the valve is open. Gait-phase detection is achieved via a single inertial measurement unit mounted on the wrist, exploiting the conserved angular momentum that couples arm and leg movements. System architecture was optimized by eliminating motors and minimizing sensor count, resulting in a device weight of 0.45 kg per leg and a total weight of 1.4 kg. Performance evaluation involved surface electromyography and metabolic cost measurements in a cohort of healthy young adults. Compared to unassisted walking, the RAT reduced plantar-flexor muscle activation by 16.9% and decreased metabolic cost by 10.6%. These findings confirm that intent-based actuation of a semi-passive spring can provide effective ankle assistance with minimal hardware complexity. Future work will investigate alternative sensor locations that remain synchronized with lower-limb kinematics, simplify battery and processing modules to further reduce device mass, and extend validation to elderly and pediatric populations.

Abstract:
Moving object segmentation (MOS) plays a vital role in understanding dynamic visual environments. While existing methods rely on multi-frame image sequences to identify moving objects, single-image MOS is critical for applications like motion intention prediction and handling camera frame drops. However, segmenting moving objects from a single image remains challenging for existing methods due to the absence of temporal cues. To address this gap, we propose MovSAM, the first framework for single-image moving object segmentation. MovSAM leverages a Multimodal Large Language Model (MLLM) enhanced with Chain-of-Thought (CoT) prompting to search for the moving object and generate text prompts based on deep thinking for segmentation. These prompts are cross-fused with visual features from the Segment Anything Model (SAM) and a Vision-Language Model (VLM), enabling logic-driven moving object segmentation. The segmentation results then undergo a deep thinking refinement loop, allowing MovSAM to iteratively improve its understanding of the scene context and inter-object relationships with logical reasoning. This innovative approach enables MovSAM to segment moving objects in single images by considering scene understanding. We implement MovSAM in the real world to validate its practical application and effectiveness for autonomous driving scenarios where the multi-frame methods fail. Furthermore, despite the inherent advantage of multi-frame methods in utilizing temporal information, MovSAM achieves state-of-the-art performance across public MOS benchmarks, reaching 92.5% on J&F. Our implementation will be available at https://github.com/IRMVLab/MovSAM.

Abstract:
In recent years, tactile sensors have become essential for robotic systems, particularly in tasks requiring high-precision interaction and manipulation. The Vision-Based Tactile Sensor (VBTS) represents a significant advancement in tactile sensing, utilizing cameras to monitor the deformation of soft materials at the sensor tip. Pressure applied to the sensor alters the light propagation path, thereby changing the image captured by the camera. By combining image processing and deep learning, VBTS provides highly accurate estimates of contact position and force, achieving micrometer-level resolution. This paper presents a novel VBTS design that leverages a light-conductive plate and a silicone membrane to enhance the sensor’s sensitivity to force perception. The soft, thin nature of the silicone membrane allows for precise detection of minimal forces, making it suitable for tasks involving highly deformable objects. Experimental results demonstrate the sensor’s capability in detecting contact areas and force distributions, which can be applied in diverse domains such as soft object assembly, medical assistance, and food processing. Moreover, the proposed VBTS outperforms traditional sensors by utilizing computationally efficient algorithms that maintain real-time performance without compromising resolution.

Abstract:
For safe and flexible navigation in multi-robot systems, this paper presents an enhanced and predictive sampling-based trajectory planning approach in complex environments, the Gradient Field-based Dynamic Window Approach (GF-DWA). Building upon the dynamic window approach, the proposed method utilizes gradient information of obstacle distances as a new cost term to anticipate potential collisions. This enhancement enables the robot to improve awareness of obstacles, including those with non-convex shapes. The gradient field is derived from the Gaussian process distance field, which generates both the distance field and gradient field by leveraging Gaussian process regression to model the spatial structure of the environment. Through several obstacle avoidance and fleet collision avoidance scenarios, the proposed GF-DWA is shown to outperform other popular trajectory planning and control methods in terms of safety and flexibility, especially in complex environments with non-convex obstacles.

Abstract:
Differential-drive robots are widely used in dynamic pedestrian environments, such as hospitals, for time-sensitive tasks like medication delivery, which require high navigation efficiency to ensure timely arrivals. However, existing methods tend to overemphasize safety, resulting in overly conservative behaviors and prolonged navigation times, which in turn lead to reduced efficiency. To address this issue, this paper proposes a novel navigation framework that integrates a pedestrian risk map, modeled using asymmetric Gaussian distributions, into B-spline trajectory optimization. Rather than strictly avoiding high-risk regions, the method balances collision risk and trajectory length minimization, leading to more effective navigation. Additionally, multiple planning modes enhance adaptability in complex environments, ensuring both safety and efficiency. Furthermore, kinematic constraints specific to differential-drive robots are incorporated to ensure the feasibility of the generated trajectories. Simulations and real-world experiments validate the proposed method’s effectiveness in achieving safe and efficient navigation in dynamic pedestrian environments. The video is available at https://youtu.be/S9qJmXyPEzw.
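
An asymmetric Gaussian risk field around a single pedestrian can be sketched as below. The parameterization is an assumption for illustration: the spread is larger ahead of the pedestrian than behind, and the three `sigma_*` values are made-up numbers, not values taken from the paper.

```python
import math

def asymmetric_gaussian_risk(px, py, ped_x, ped_y, heading,
                             sigma_front=2.0, sigma_rear=0.5, sigma_side=0.7):
    """Risk in (0, 1] at point (px, py) around a pedestrian at (ped_x, ped_y).

    heading is the walking direction in radians. The spread along the
    walking direction is wider in front than behind (illustrative values).
    """
    dx, dy = px - ped_x, py - ped_y
    # Express the offset in the pedestrian's frame.
    along = math.cos(heading) * dx + math.sin(heading) * dy
    across = -math.sin(heading) * dx + math.cos(heading) * dy
    sigma_along = sigma_front if along >= 0 else sigma_rear
    return math.exp(-0.5 * ((along / sigma_along) ** 2 + (across / sigma_side) ** 2))

# Pedestrian at the origin walking along +x: compare 1 m ahead with 1 m behind.
risk_ahead = asymmetric_gaussian_risk(1.0, 0.0, 0.0, 0.0, heading=0.0)
risk_behind = asymmetric_gaussian_risk(-1.0, 0.0, 0.0, 0.0, heading=0.0)
```

A trajectory optimizer could sum such a risk over sampled trajectory points and weigh it against path length, which is one way to trade collision risk against detour cost rather than forbidding high-risk regions outright.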

Abstract:
With the advancement of robot manipulator technologies, developing custom-made low-cost manipulator arms has gained increasing popularity in the field of robotics. However, these low-cost systems often lack precise sensing and control capabilities, making reliable collision detection particularly critical to ensure safe operation. Popular collision detection methods that estimate external disturbances, such as the generalized Momentum Observer (MO), critically rely on an accurate dynamics model, and precise system identification requires joint torque sensors that are expensive for low-cost manipulators. This paper presents a novel approach to improve collision detection for low-cost robot manipulators where modeling error is present and torque sensing is not available. We propose a probabilistic framework using Gaussian Mixture Models (GMM) to capture friction and other unmodeled dynamics of a robot manipulator. Instead of explicitly identifying model parameters, a GMM is fitted to the total residual torque of the MO while running an excitation trajectory. The GMM is then deployed in the main control loop to identify external disturbances from the residual torque of the same momentum observer. The approach is validated on a custom-made 4-Degrees-of-Freedom (DoF) robot arm with modeling error, unmodeled dynamics, and only joint position measurement. Our method achieves reliable detection of both hard and soft collisions, demonstrating a reduction in the false positive rate by more than 50% compared to conventional MO-based methods with the same true positive rate on the same hardware.

Abstract:
Soft, elastic platforms may pose an intricate challenge towards sensor fusion as forces acting on the structure render extrinsic transformations variable over time. The present paper tackles this problem by introducing an elastic deformation model and embedding it into a sensor fusion scheme. The core of our method is given by a neural representation mapping temporal deformation sequences onto mass-normalized restoring forces. By using continuous time trajectory models as well as Newton’s second law, the sensor fusion problem becomes solvable by enforcing the consistency between second-order trajectory differentials and network outputs. The approach is validated on a loosely-coupled, real-world fusion scenario: an elastically connected, non-overlapping stereo camera system. As demonstrated, our approach permits relative camera alignment, absolute scale recovery, as well as inertial alignment from individual visual odometry results.

Abstract:
We introduce COU: Common Objects Underwater, an instance-segmented image dataset of commonly found man-made objects in multiple aquatic and marine environments. COU contains approximately 10K segmented images, annotated from images collected during a number of underwater robot field trials in diverse locations. COU is created to address the lack of datasets with robust class coverage curated for underwater instance segmentation, which is particularly useful for training light-weight, real-time capable detectors for Autonomous Underwater Vehicles (AUVs). In addition, COU addresses the lack of diversity in object classes since the commonly available underwater image datasets focus only on marine life. Currently, COU contains images from both closed-water (pool) and open-water (lakes and oceans) environments, of 24 different classes of objects including marine debris, dive tools, and AUVs. To assess the efficacy of COU in training underwater object detectors, we use three state-of-the-art models to evaluate its performance and accuracy, using a combination of standard accuracy and efficiency metrics. The improved performance of COU-trained detectors over those solely trained on terrestrial data demonstrates the clear advantage of training with annotated underwater images. We make COU available for broad use under open-source licenses.

Abstract:
This paper proposes a Parallel Fuzzy Nonlinear Active Disturbance Rejection Control (FNLADRC) strategy to improve the precision and robustness of robotic manipulators in machining large, complex components. By decoupling the multi-degree-of-freedom dynamics and integrating Nonlinear ADRC with fuzzy logic, the method adaptively estimates and compensates for disturbances in real time, effectively mitigating machining vibrations and trajectory errors. A parallel control architecture enhances multi-joint coordination, improving adaptability and response speed. Additionally, a fuzzy logic-based tuning mechanism dynamically adjusts control parameters to boost robustness. Simulations on a UR5 robotic arm validate the method’s superior performance in dynamic, uncertain environments, demonstrating FNLADRC as a promising solution for high-precision robotic machining.

Abstract:
This paper presents the design and experimental validation of a torque-controlled robotic wrist for peg-in-hole tasks. The proposed wrist features a serial pitch-yaw joint configuration that enhances dexterity while maintaining compactness. The design integrates stepper motors, harmonic geartrains, and compliant mechanisms to optimize torque output and control accuracy. A compliant pulley and a compliant cap are introduced, enabling embedded torque sensing without the need for external sensors, thereby reducing system complexity and improving response time. Experimental results demonstrate the effectiveness of the wrist in torque accuracy and misalignment correction during peg-in-hole assembly, highlighting the benefits of the compliant-driven torque sensing approach. Compared to existing robotic wrists, the proposed design achieves a higher torque density. The findings contribute to advancing robotic wrist technology, particularly in applications requiring precise force modulation, high dexterity, and adaptable compliance.

Abstract:
Recent advancements in 3D Gaussian Splatting (3DGS) have significantly improved novel view synthesis, playing a crucial role in robotic vision and scene reconstruction. However, 3DGS relies heavily on precise camera poses and sharp images, which are often difficult to obtain in real-world robotic applications due to motion and defocus blur. Directly applying 3DGS to blurred images results in severe degradation, limiting its effectiveness in tasks such as autonomous navigation and manipulation. To address this challenge, we propose MAD-GS, a novel deblurring framework based on 3DGS, specifically designed for robotic vision tasks. MAD-GS effectively mitigates motion and defocus blur while refining imprecise camera poses, enhancing 3D scene reconstruction under real-world uncertainties. Additionally, we introduce a blur segmentation mask to identify and adaptively refine heavily blurred regions, improving visual quality and downstream robotic decision-making. Extensive experiments on synthetic and real-world datasets demonstrate that MAD-GS outperforms existing methods, leading to superior image clarity and fidelity, thereby advancing robust robot perception in dynamic environments.

Abstract:
The efficacy of targeted cancer drug therapy is significantly compromised by imprecise drug delivery mechanisms. Micro/nano robots (MNRs), characterized by their controllable motion, present a promising solution to this challenge. However, the non-Newtonian nature of blood, with its high viscosity and blood cells’ interference, poses substantial limitations on the upstream efficiency of MNRs. This paper discusses for the first time the effects of blood viscosity and blood cell interference on the motion of MNRs, investigating their upstream motion capabilities in blood through comprehensive theoretical modeling, simulation, and experimental validation. A dynamic model of MNR motion was developed, and the velocity formula for MNRs in non-Newtonian fluid was derived. Experiments were conducted using different magnetic fields in pure water, high-viscosity simulated blood, and diluted blood. Results indicated that under a gradient magnetic field, the upstream velocities of MNRs in pure water, simulated blood, and diluted blood were 45.0, 14.4, and 11.1 mm/s, respectively. Under a rotating magnetic field, the velocities of vortex swarms were 825, 240, and 145 µm/s, respectively. Increased fluid viscosity reduced MNR velocity by about 70%, while blood cells caused an additional reduction of about 10%. This research establishes a theoretical and experimental framework for the upstream motion of MNRs against blood flow, enhancing their potential in targeted drug delivery and broader biomedical applications.
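
The quoted percentage reductions can be checked against the reported velocities with a few lines of arithmetic. The sketch below assumes that both percentages are expressed relative to the pure-water velocity; the abstract does not state the baseline, so this reading is an interpretation.

```python
def reduction_vs_water(v_water, v_before, v_after):
    """Velocity drop from v_before to v_after, as a percentage of v_water."""
    return 100.0 * (v_before - v_after) / v_water

# Velocities reported in the abstract: (pure water, simulated blood, diluted blood).
gradient_field = (45.0, 14.4, 11.1)      # mm/s
rotating_field = (825.0, 240.0, 145.0)   # µm/s

results = {}
for name, (water, simulated, diluted) in (("gradient", gradient_field),
                                          ("rotating", rotating_field)):
    viscosity_drop = reduction_vs_water(water, water, simulated)
    cell_drop = reduction_vs_water(water, simulated, diluted)
    results[name] = (round(viscosity_drop, 1), round(cell_drop, 1))
    print(name, results[name])
```

Under this reading the viscosity-related drop is 68.0% (gradient field) and 70.9% (rotating field), and the further drop attributed to blood cells is 7.3% and 11.5%, which is in line with the rounded figures of 70% and 10%.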

Abstract:
The magnetic dipole model exhibits significant deviations from real-world sensor data due to neglected material nonlinearities and environmental interference. This paper proposed a Physics-Informed Residual Network (PIRNet) that adaptively corrected simulated magnetic field data by integrating dipole theory with deep residual learning. The network took a 5×5 triaxial magnetic matrix as input and employed a dual-branch architecture: a convolutional residual branch extracted local sensor-level distortion features, while a physics-encoding branch modeled systematic position and orientation-related deviations. A gated fusion mechanism dynamically combined these features, with a divergence-free constraint (∇·B = 0) incorporated as a regularization term. The corrected data was processed through Levenberg-Marquardt (LM) optimization for pose estimation, with subsequent hybrid lookup table compensation combining distance-weighted trilinear interpolation for spatial coordinates and spherical linear interpolation (Slerp) for orientation vectors. Experimental results showed that the positioning error was reduced from 2.1 mm to 1.15 mm, the orientation error was reduced from 3.23° to 1.01°, and the average magnet positioning time reached 44.7 ms per frame. This approach provides a high-precision, low-cost sim-to-real transfer solution for magnetic navigation robots.
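
For readers unfamiliar with Slerp on orientation vectors, the standard formula is slerp(u, v, t) = [sin((1 − t)θ) u + sin(tθ) v] / sin θ, where θ is the angle between the unit vectors u and v. The sketch below is a generic implementation of that formula; how the lookup table weights or chains such interpolations is not described in the abstract and is not modeled here.

```python
import numpy as np

def slerp(u, v, t):
    """Spherical linear interpolation between direction vectors u and v.

    Returns a unit vector that moves from u (t = 0) to v (t = 1) along the
    great-circle arc at constant angular rate.
    """
    u = np.asarray(u, dtype=float)
    v = np.asarray(v, dtype=float)
    u = u / np.linalg.norm(u)
    v = v / np.linalg.norm(v)
    theta = np.arccos(np.clip(np.dot(u, v), -1.0, 1.0))
    if theta < 1e-8:
        return u  # vectors (almost) coincide: nothing to interpolate
    if np.pi - theta < 1e-8:
        raise ValueError("slerp is undefined for opposite directions")
    return (np.sin((1.0 - t) * theta) * u + np.sin(t * theta) * v) / np.sin(theta)

# Halfway between the x and y axes.
mid = slerp([1.0, 0.0, 0.0], [0.0, 1.0, 0.0], 0.5)
```

Unlike component-wise linear interpolation, the result stays on the unit sphere, so an interpolated orientation vector needs no renormalization.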

Abstract:
Ethical dilemmas are a common challenge in everyday driving, requiring human drivers to balance competing priorities such as safety, efficiency, and rule compliance. However, much of the existing research in automated vehicles (AVs) has focused on high-stakes "trolley problems," which involve extreme and rare situations. Such scenarios, though rich in ethical implications, are rarely applicable in real-world AV decision-making. In practice, when AVs confront everyday ethical dilemmas, they often appear to prioritise strict adherence to traffic rules. By contrast, human drivers may bend the rules in context-specific situations, using judgement informed by practical concerns such as safety and efficiency. According to the concept of meaningful human control, AVs should respond to human reasons, including those of drivers, vulnerable road users, and policymakers. This work introduces a novel human reasons-based supervision framework that detects when AV behaviour misaligns with expected human reasons to trigger trajectory reconsideration. The framework integrates with motion planning and control systems to support real-time adaptation, enabling decisions that better reflect safety, efficiency, and regulatory considerations. Simulation results demonstrate that this approach could help AVs respond more effectively to ethical challenges in dynamic driving environments by prompting replanning when the current trajectory fails to align with human reasons. These findings suggest that our approach offers a path toward more adaptable, human-centered decision-making in AVs.

Abstract:
In microsurgery, surgeons frequently encounter challenges due to the need for exceptional precision and dexterity, the lack of depth perception for micro-scale surgical maneuvers, and the inevitable effects of fatigue and hand tremor. In surgical robotics, conventional intraoperative perception systems normally provide real-time image feedback, but depth and volumetric information is typically lacking. To overcome these challenges, we propose a teleoperated robotic system with two arms to provide high-fidelity intraoperative volumetric imaging during micro-scale tissue manipulation. This system incorporates an optical coherence tomography sensor for real-time 3D visualization and a dual-arm teleoperated robot system controlled by haptic input devices for accurate and precise manipulation. We characterize the system’s performance through a precision positioning task and a vessel following task in a retinal model, which shows average positioning errors of approximately 232 μm and 83 μm, respectively. We demonstrate the fully integrated system through the completion of an eggshell membrane peeling task that simulates retinal membrane peeling.

Abstract:
Vectorized maps are indispensable for precise navigation and the safe operation of autonomous vehicles. Traditional methods for constructing these maps fall into two categories: offline techniques, which rely on expensive, labor-intensive LiDAR data collection and manual annotation, and online approaches that use onboard cameras to reduce costs but suffer from limited performance, especially at complex intersections. To bridge this gap, we introduce the Multiple Roadside Camera-based Vectorized Map approach, MRC-VMap, a cost-effective, vision-centric, end-to-end neural network designed to generate high-definition vectorized maps directly at intersections. Leveraging existing roadside surveillance cameras, MRC-VMap directly converts time-aligned, multi-directional images into vectorized map representations. This integrated solution lowers the need for additional intermediate modules, such as separate feature extraction and Bird’s-Eye View (BEV) conversion steps, thus reducing both computational overhead and error propagation. Moreover, the use of multiple camera views enhances mapping completeness, mitigates occlusions, and provides robust performance under practical deployment constraints. Extensive experiments conducted on 4,000 intersections across 4 major metropolitan areas in China demonstrate that MRC-VMap not only outperforms state-of-the-art online methods but also achieves accuracy comparable to high-cost LiDAR-based approaches, thereby offering a scalable and efficient solution for modern autonomous navigation systems.

Abstract:
Monocular depth estimation is a fundamental problem for robotic perception systems and downstream applications. However, depth estimation from a single image is an inherently ill-posed problem due to data loss related to projection from 3D to 2D. Recent studies address the discrepancy between camera parameters by using learning-based methods and unifying the camera model to canonical camera space or bipolar representations, thus addressing the problem of training a metric depth model over different datasets with different camera parameters. In addition, the previous study, OrchardDepth, introduced the sparse-dense depth consistency loss function to learn the dense depth distribution from urban autonomous driving scenes and thereby improve model performance in the orchard. Instead of enforcing strict consistency between the sparse and dense depth, this work introduces a KL divergence term to encourage the network to adapt to the depth distributions of different sensors and to penalize deviations in reliable regions while tolerating errors in unreliable areas. We further enhance the depth consistency loss by integrating bins into the supervised discretised depth distribution. This method significantly improves the robustness and performance of our previous method. It improves the absolute relative error on the orchard dataset by 17.3% and 16.2% compared with the SILog loss and the OrchardDepth baseline, respectively, thereby strengthening the training paradigm for depth estimation in orchard scenes.
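
The idea of comparing binned depth distributions with a KL divergence can be sketched as follows. This is an illustrative example only: the bin edges, the direction of the divergence, and the helper names `depth_histogram` and `kl_divergence` are assumptions, and the abstract does not specify how the loss is actually formed.

```python
import math


def depth_histogram(depths, bin_edges):
    """Normalised histogram of depth values over half-open bins [edge_i, edge_{i+1})."""
    counts = [0] * (len(bin_edges) - 1)
    for d in depths:
        for i in range(len(counts)):
            if bin_edges[i] <= d < bin_edges[i + 1]:
                counts[i] += 1
                break
    total = sum(counts)
    if total == 0:
        raise ValueError("no depth values fall inside the bins")
    return [c / total for c in counts]


def kl_divergence(p, q, eps=1e-8):
    """KL(p || q) for two discrete distributions; eps guards against log(0)."""
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


# Hypothetical depths in metres from a sparse sensor and a dense prediction.
edges = [0.0, 5.0, 10.0]
sparse = depth_histogram([1.0, 2.0, 3.0, 9.0], edges)
dense = depth_histogram([1.5, 6.0, 7.0, 8.0], edges)
loss = kl_divergence(sparse, dense)
```

Because the divergence compares distributions rather than per-pixel values, two depth maps with similar histograms incur a small penalty even when individual pixels disagree.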

Abstract:
Recognizing various types of clothing is crucial for robotic clothing manipulation tasks, such as garment organization and robot-assisted dressing. Unlike rigid object recognition, clothing recognition remains a challenging task due to the diverse forms introduced by flexible deformations. However, existing classification models primarily focus on clothing color and texture while overlooking structural features, limiting their ability to distinguish between deformable clothing categories with similar color and texture. Moreover, due to the insufficient representation of structural features, these models heavily rely on manually annotated labels, making it difficult to accurately recognize unseen clothing items with new colors or textures. To address these challenges, we propose a novel topological structure representation and optimization strategy for category-level clothing structural feature learning. Additionally, we design a multi-clothing classification framework based on multiple mask generation to identify clothing regions within a scene. By leveraging our proposed structural feature learning strategy, our framework effectively generalizes to unseen clothing items. Finally, we introduce a fabric-specific grasping position estimation method and develop a corresponding robotic grasping system capable of selecting and grasping specified clothing items based on user instructions. Extensive real-world robotic experiments demonstrate the effectiveness of our system, and comprehensive comparisons with multiple baselines further validate the superiority of our approach.

Abstract:
To enhance robustness against noise resulting from velocity measurement and acceleration estimation in robot online identification and adaptive control, the robot dynamics should be filtered and parameterized to generate a filtered regression matrix regarding identifiable parameters. However, generating a filtered regression matrix is complicated for robots with high degrees of freedom (DoFs). The power balance model (PBM) of robots with spatial notations stands out as an effective option for online applications owing to its simplicity in generating an easily computed and acceleration-free filtered regression vector. This paper proposes a PBM-based recursive composite learning robot control (RCLRC) method to enhance parameter convergence so as to boost tracking control. Based on the PBM, a filtered regressor with a computational complexity of O(n) (instead of O(n²) to O(n⁴) for its dynamic model-based counterpart) is employed to calculate an excitation matrix, and a generalized regression equation for composite parameter update is normalized to provide more uniform convergence rates across all parameter components. Experiments on a 7-DoF robot manipulator have shown that the proposed PBM-RCLRC outperforms state-of-the-art methods on parameter estimation and tracking control.

Abstract:
Precise needle alignment is essential for percutaneous needle insertion in robotic ultrasound-guided procedures. However, inherent challenges such as speckle noise, needle-like artifacts, and low image resolution complicate robust needle detection, which is essential for alignment in ultrasound images. These issues become particularly problematic when visibility is reduced or lost, diminishing the effectiveness of visual-based needle alignment methods. In this paper, we propose a method that effectively restores alignment when the ultrasound imaging plane and the needle insertion plane are misaligned. Unlike many existing approaches that rely heavily on needle visibility in ultrasound images, our method uses a more robust feature by periodically vibrating the needle using a mechanical system. Specifically, we propose a new vibration-based energy metric that remains effective even when the needle is fully out of plane. Using this metric, we develop an elegant control strategy to reposition the ultrasound probe in response to misalignments between the imaging plane and the needle insertion plane in both translation and rotation. Experiments conducted on ex-vivo porcine tissue samples using a dual-arm robotic ultrasound-guided needle insertion system demonstrate the effectiveness of the proposed approach. The experimental results show a translational error of 0.41±0.27 mm and a rotational error of 0.51±0.19 degrees.

Abstract:
The complex nonlinear dynamics of hydraulic excavators, such as time delays and control coupling, pose significant challenges to achieving high-precision trajectory tracking. Traditional control methods often fall short in such applications due to their inability to effectively handle these nonlinearities, while commonly used learning-based methods require extensive interactions with the environment, leading to inefficiency. To address these issues, we introduce EfficientTrack, a trajectory tracking method that integrates model-based learning to manage nonlinear dynamics and leverages closed-loop dynamics to improve learning efficiency, ultimately minimizing tracking errors. We validate our method through comprehensive experiments both in simulation and on a real-world excavator. Comparative experiments in simulation demonstrate that our method outperforms existing learning-based approaches, achieving the highest tracking precision and smoothness with the fewest interactions. Real-world experiments further show that our method remains effective under load conditions and possesses the ability for continual learning, highlighting its practical applicability. For implementation details and source code, please refer to https://github.com/ZiqingZou/EfficientTrack.

Abstract:
Electromagnetic Navigation Systems enable remote actuation of untethered micro and nanorobots, as well as the precise control of magnetic surgical tools for minimally invasive medical procedures. Accurate modeling of the magnetic fields generated by the electromagnets composing these systems is essential for achieving reliable and precise navigation. Existing modeling approaches either neglect nonlinear effects such as electromagnet saturation or fail to ensure that the field predictions are physically consistent. These limitations can lead to significant prediction errors, particularly in the estimation of field gradients, which directly impacts force calculations. As a result, inaccurate gradient predictions degrade force control performance, limiting the precision of magnetic actuation. In this work, we investigate physics-informed and data-driven modeling techniques to improve the accuracy of magnetic field and gradient predictions. Additionally, we introduce an approach for solving the inverse problem, developing models capable of predicting the required electromagnet currents to generate a desired magnetic field and gradient based on this approach. By incorporating physical constraints into the models, we enhance the predictive accuracy and physical consistency of the field estimates. In the experimental section, we demonstrate the benefits of these methods to enable improved force control in open-loop for untethered robots using a small-scale Electromagnetic Navigation System.

Abstract:
The development of large Vision-Language-Action (VLA) models has enhanced the robot’s ability to manipulate objects in unseen scenarios based on language instructions. While existing VLAs have demonstrated promise in various scenarios, they still struggle with effective multi-modal data feature extraction and lack a closed-loop inference framework. In this paper, we propose an advanced VLA model. Unlike previous works that repurpose a VLM for action prediction using simple action quantization, we componentize the VLA architecture with a specialized action module conditioned on the model output and a critic module for inference. We demonstrate the performance improvement of diffusion action transformers in modeling continuous temporal actions, with the critic module applied during inference to form a closed-loop model. Extensive experiments on real robots demonstrate that our model significantly outperforms existing methods, with the ability to handle complex, high-precision tasks and generalize to unseen objects and backgrounds.

Abstract:
Scene graphs have emerged as a powerful tool for robots, providing a structured representation of spatial and semantic relationships for advanced task planning. Despite their potential, conventional 3D indoor scene graphs face critical limitations, particularly under- and over-segmentation of room layers in structurally complex environments. Under-segmentation misclassifies non-traversable areas as part of a room, often in open spaces, while over-segmentation fragments a single room into overlapping segments in complex environments. These issues stem from naive voxel-based map representations that rely solely on geometric proximity, disregarding the structural constraints of traversable spaces and resulting in inconsistent room layers within scene graphs. To the best of our knowledge, this work is the first to tackle segmentation inconsistency as a challenge and address it with Traversability-Aware Consistent Scene Graphs (TACS-Graphs), a novel framework that integrates ground robot traversability with room segmentation. By leveraging traversability as a key factor in defining room boundaries, the proposed method achieves a more semantically meaningful and topologically coherent segmentation, effectively mitigating the inaccuracies of voxel-based scene graph approaches in complex environments. Furthermore, the enhanced segmentation consistency improves loop closure detection efficiency in the proposed Consistent Scene Graph-leveraging Loop Closure Detection (CoSG-LCD), leading to higher pose estimation accuracy. Experimental results confirm that the proposed approach outperforms state-of-the-art methods in terms of scene graph consistency and pose graph optimization performance.

Abstract:
Large-scale, diverse datasets are essential for training robust learning-based robotic manipulation models; however, their acquisition typically requires controlled environments and specialized hardware in research laboratories. This paper presents VRobotix, a virtual reality (VR)-based framework that enables cost-effective and scalable robotic dataset generation through immersive human-in-the-loop control within a physics-accurate robot simulation. By leveraging off-the-shelf VR headsets (e.g., Oculus Quest 3), VRobotix eliminates the need for physical robots while supporting a URDF-compatible, physics-based simulator that accommodates adaptable robotic platforms and egocentric control interfaces, including handheld controllers and body posture tracking. Benefiting from the physics-based simulation, a unique contribution of VRobotix is the replay module, which can regenerate synchronized multi-modal datasets (kinematic states, RGB-D streams) in multiple dataset formats from a replayable trajectory, supporting various robotic applications. Additionally, an imitation learning module is developed to train control policies using the data collected by VRobotix. Experiments on three initial tasks (pushing, grasping, and stacking) demonstrate a high data collection success rate, averaging 92.0%. Furthermore, policies trained on just 50 trials achieve a 100% task success rate. VRobotix reduces infrastructure costs while generating ROS-compatible datasets, democratizing scalable robotic data acquisition.

Abstract:
Using 3D point clouds in odometry estimation in robotics often requires finding a set of correspondences between points in subsequent scans. While there are established methods for point clouds of sufficient quality, the state of the art still struggles when this quality drops. Thus, this paper presents a novel learning-based framework for predicting robust point correspondences between pairs of noisy, sparse and unstructured 3D point clouds from a light-weight, low-power, inexpensive, consumer-grade System-on-Chip (SoC) Frequency Modulated Continuous Wave (FMCW) radar sensor. Our network is based on the transformer architecture which allows leveraging the attention mechanism to discover pairs of points in consecutive scans with the greatest mutual affinity. The proposed network is trained in a self-supervised way using set-based multi-label classification cross-entropy loss, where the ground-truth set of matches is found by solving the Linear Sum Assignment (LSA) optimization problem, which avoids tedious hand annotation of the training data. Additionally, posing the loss calculation as multi-label classification permits supervising on point correspondences directly instead of on odometry error, which is not feasible for sparse and noisy data from the SoC radar we use. We evaluate our method with an open-source state-of-the-art Radar-Inertial Odometry (RIO) framework in real-world Unmanned Aerial Vehicle (UAV) flights and with the widely used public Coloradar dataset. Evaluation shows that the proposed method improves the position estimation accuracy by over 14% and 19% on average, respectively. The open source code and datasets can be found here: https://github.com/aau-cns/radar_transformer.
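
The Linear Sum Assignment step used to derive matches can be illustrated with a tiny brute-force solver. This sketch is not the authors' implementation: the Euclidean cost, the equal-size assumption, and the name `assign_matches` are choices made for the example, and a practical pipeline would use a polynomial-time solver such as SciPy's `scipy.optimize.linear_sum_assignment` instead of enumerating permutations.

```python
import math
from itertools import permutations


def assign_matches(points_a, points_b):
    """Minimum-cost one-to-one matching by exhaustive search (a handful of points only)."""
    if len(points_a) != len(points_b):
        raise ValueError("this sketch assumes equally sized point sets")
    n = len(points_a)
    # cost[i][j] is the Euclidean distance between point i of scan A and point j of scan B
    cost = [[math.dist(a, b) for b in points_b] for a in points_a]
    best_perm, best_cost = None, math.inf
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_cost:
            best_perm, best_cost = perm, total
    return list(enumerate(best_perm)), best_cost


# Hypothetical consecutive scans: scan_b is scan_a shuffled and slightly perturbed.
scan_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
scan_b = [(1.1, 0.0, 0.0), (0.0, 0.9, 0.0), (0.1, 0.0, 0.0)]
matches, total_cost = assign_matches(scan_a, scan_b)
```

Each returned pair `(i, j)` can then serve as a positive label for point `i` in a multi-label classification loss.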

Abstract:
Data scarcity has long been an issue in the robot learning community. Particularly, in safety-critical domains like surgical applications, obtaining high-quality data can be especially difficult. It poses challenges to researchers seeking to exploit recent advancements in reinforcement learning and imitation learning, which have greatly improved generalizability and enabled robots to conduct tasks autonomously. We introduce dARt Vinci, a scalable data collection platform for robot learning in surgical settings. The system uses Augmented Reality (AR) hand tracking and a high-fidelity physics engine to capture subtle maneuvers in primitive surgical tasks. By eliminating the need for a physical robot setup and providing flexibility in terms of time, space, and hardware resources (such as multiview sensors and actuators), specialized simulation is a viable alternative. At the same time, AR allows the robot data collection to be more egocentric, supported by its body tracking and content overlaying capabilities. Our user study, which uses widely used primitive tasks for training teleoperation with da Vinci surgical robots, confirms the proposed system’s efficiency and usability. Data throughput improves across all tasks compared to real robot settings by 41% on average. The total experiment time is reduced by an average of 10%. The temporal demand in the task load survey is improved. These gains are statistically significant. Additionally, the collected data is over 400 times smaller in size, requiring far less storage while achieving double the frequency. The source code for this project can be accessed at https://dartvinci.finite-state.com/.

Abstract:
Crop yield estimation is a relevant problem in agriculture, because an accurate yield estimate can support farmers’ decisions on harvesting or precision intervention. Robots can help to automate this process. To do so, they need to be able to perceive the surrounding environment to identify target objects such as trees and plants. In this paper, we introduce a novel approach to address the problem of hierarchical panoptic segmentation of apple orchards on 3D data from different sensors. Our approach is able to simultaneously provide semantic segmentation, instance segmentation of trunks and fruits, and instance segmentation of trees (a trunk with its fruits). This allows us to identify relevant information such as individual plants, fruits, and trunks, and to capture the relationships among them, for example by precisely estimating the number of fruits associated with each tree in an orchard. To efficiently evaluate our approach for hierarchical panoptic segmentation, we provide a dataset designed specifically for this task. Our dataset is recorded in Bonn, Germany, in a real apple orchard with a variety of sensors, spanning from a terrestrial laser scanner to an RGB-D camera mounted on different robotic platforms. The experiments show that our approach surpasses state-of-the-art approaches in 3D panoptic segmentation in the agricultural domain, while also providing full hierarchical panoptic segmentation. Our dataset is publicly available at https://www.ipb.uni-bonn.de/data/hops/. The open-source implementation of our approach is available at https://github.com/PRBonn/hapt3D.

Abstract:
We present a novel framework demonstrating zero-shot sim-to-real transfer of visual control policies learned in a Neural Radiance Field (NeRF) environment for quadrotors to fly through racing gates. Robust transfer from simulation to real flight poses a major challenge, as standard simulators often lack sufficient visual fidelity. To address this, we construct a photorealistic simulation environment of quadrotor racing tracks, called FalconGym, which provides effectively unlimited synthetic images for training. Within FalconGym, we develop a pipelined approach for crossing gates that combines (i) a Neural Pose Estimator (NPE) coupled with a Kalman filter to reliably infer quadrotor poses from single-frame RGB images and IMU data, and (ii) a self-attention-based multi-modal controller that adaptively integrates visual features and pose estimation. This multi-modal design compensates for perception noise and intermittent gate visibility. We train this controller purely in FalconGym with imitation learning and deploy the resulting policy to real hardware with no additional fine-tuning. Simulation experiments on three distinct tracks (circle, U-turn and figure-8) demonstrate that our controller outperforms a vision-only state-of-the-art baseline in both success rate and gate-crossing accuracy. In 30 live hardware flights spanning three tracks and 120 gates, our controller achieves a 95.8% success rate and an average error of just 10 cm when flying through 38 cm-radius gates.

Abstract:
Safe landing is essential in robotics applications, from industrial settings to space exploration. As artificial intelligence advances, we have developed PEACE (Prompt Engineering Automation for CLIPSeg Enhancement), a system that automatically generates and refines prompts for identifying landing zones in changing environments. Traditional approaches using fixed prompts for open-vocabulary models struggle with environmental changes and can lead to dangerous outcomes when conditions are not represented in the predefined prompts. PEACE addresses this limitation by dynamically adapting to shifting data distributions. Our key innovation is the dual segmentation of safe and unsafe landing zones, allowing the system to refine the results by removing unsafe areas from potential landing sites. Using only monocular cameras and image segmentation, PEACE can safely guide descent operations from 100 meters to altitudes as low as 20 meters. Testing shows that PEACE significantly outperforms the standard CLIP and CLIPSeg prompting methods, improving the successful identification of safe landing zones from 57% to 92%. We have also demonstrated enhanced performance when replacing CLIPSeg with FastSAM. The complete source code is available as open-source software.
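
The refinement step described above, removing unsafe areas from candidate landing sites, amounts to a per-pixel mask difference. The following is a minimal sketch under the assumption that both segmentations are already available as binary masks of equal size; the function name `refine_landing_mask` is hypothetical and the prompt generation itself is not shown.

```python
def refine_landing_mask(safe_mask, unsafe_mask):
    """Keep a pixel only if it is labelled safe and not labelled unsafe."""
    return [
        [bool(s) and not bool(u) for s, u in zip(safe_row, unsafe_row)]
        for safe_row, unsafe_row in zip(safe_mask, unsafe_mask)
    ]


# Hypothetical 2x3 masks: 1 marks pixels selected by the respective prompt set.
safe = [[1, 1, 0], [1, 0, 0]]
unsafe = [[0, 1, 0], [0, 0, 1]]
landing = refine_landing_mask(safe, unsafe)
```

A pixel claimed by both prompt sets is treated as unsafe, which is the conservative choice for descent guidance.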

Abstract:
Accurate spatial and motion understanding is critical for autonomous driving systems. While object-level perception models excel in structured environments, they struggle with open-set categories and often lack precise geometric representation. Occupancy-based, class-agnostic methods offer better scene expressiveness but typically ignore inter-agent interactions and fail to ensure physical consistency in motion predictions, limiting their reliability in complex traffic scenarios. In this paper, we propose LEGO-Motion, a novel class-agnostic motion prediction framework that bridges the gap between instance-level reasoning and occupancy-based modeling. Unlike conventional grid-based methods that treat each cell independently, LEGO-Motion introduces two key components: (1) the Interaction-Augmented Instance Encoder (IaIE), which models interactions among dynamic agents via cross-attention, and (2) the Instance-Enhanced BEV Encoder (IeBE), which improves motion consistency across instances through multi-stage feature fusion. These components enable our model to learn semantically coherent and physically plausible motion fields. Extensive experiments on the nuScenes dataset show that LEGO-Motion achieves an improvement of around 6% in motion prediction accuracy over the previous state-of-the-art, while maintaining real-time inference at 21 ms. Moreover, our method demonstrates strong generalization on a proprietary FMCW LiDAR benchmark. These results validate LEGO-Motion's effectiveness in capturing both global scene structure and fine-grained motion dynamics, making it a promising foundation for next-generation perception systems.

Abstract:
Automatic robotic facial expression generation is crucial for human–robot interaction (HRI), as handcrafted methods based on fixed joint configurations often yield rigid and unnatural behaviors. Although recent automated techniques reduce the need for manual tuning, they tend to fall short by not adequately bridging the gap between human preferences and model predictions, resulting in a lack of nuanced and realistic expressions due to limited degrees of freedom and insufficient perceptual integration. In this work, we propose a novel learning-to-rank framework that leverages human feedback to address this discrepancy and enhance the expressiveness of robotic faces. Specifically, we conduct pairwise comparison annotations to collect human preference data and develop the Human Affective Pairwise Impressions (HAPI) model, a Siamese RankNet-based approach that refines expression evaluation. Results obtained via Bayesian Optimization and an online expression survey on a 35-DOF android platform demonstrate that our approach produces significantly more realistic and socially resonant expressions of Anger, Happiness, and Surprise than those generated by baseline and expert-designed methods. This confirms that our framework effectively bridges the gap between human preferences and model predictions while robustly aligning robotic expression generation with human affective responses.
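
The pairwise objective behind a RankNet-style model can be sketched as follows. This shows only the standard RankNet loss on two scalar scores, assuming both scores come from the same (Siamese) scoring network; the scoring network itself and the function name `ranknet_loss` are not taken from the paper.

```python
import math


def ranknet_loss(score_preferred, score_other):
    """Negative log-likelihood that the preferred item outranks the other one.

    The preference probability is sigmoid(score_preferred - score_other),
    so the loss is log(1 + exp(-diff)), evaluated in a numerically stable way.
    """
    diff = score_preferred - score_other
    if diff >= 0:
        return math.log1p(math.exp(-diff))
    return -diff + math.log1p(math.exp(diff))


# Hypothetical scores for two candidate expressions; annotators preferred the first.
agree = ranknet_loss(2.0, 0.0)     # model agrees with the annotators: small loss
disagree = ranknet_loss(0.0, 2.0)  # model disagrees: larger loss
```

Training on many such annotated pairs pushes the scorer to order expressions the way human raters do, without requiring absolute ratings.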

Affiliations: The Shenzhen Future Network of Intelligence Institute (FNii-Shenzhen), School of Science and Engineering (SSE), and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, Chinese University of Hong Kong (Shenzhen), Guangdong, China; The State Key Laboratory of Photonics and Communications, School of Electronics, Peking University, Beijing, China; SSE, FNii-Shenzhen, and Guangdong Provincial Key Laboratory of Future Networks of Intelligence, Chinese University of Hong Kong (Shenzhen), Guangdong, China

Abstract:
We propose a continuous gathering scheme based on the swept volume to address the challenges involved in planning a tethered robot duo to efficiently collect marine debris. Specifically, we model the tethered robot duo by constructing a double-layer U-shape, and then apply an object-aware optimization approach that leverages the swept volume signed distance field (SVSDF) to guide trajectory optimization, promoting complete object collection while maintaining a continuous and collision-free gathering motion. Existing algorithms fail to fully address key challenges: they either assume an unrealistic infinite tether length or incur high computational costs. In contrast, our proposed method, by adopting the double-layer U-shape technique, effectively manages tether length constraints and preserves the tether shape, ensuring feasible collection. By utilizing the SVSDF technique to guide the trajectory optimization process, we maximize the swept coverage of objects while minimizing that of obstacles. This enables complete object coverage, avoids collisions, and prevents the tether from becoming trapped by obstacles during the collection process. Moreover, we propose a set of metrics for this gathering planning problem and validate the generated trajectories in simulation, using a collision-free multi-UAV information-gathering approach to efficiently estimate the target area. Simulations demonstrate that our proposed method achieves superior, resolution-independent gathering performance compared to existing algorithms.

Abstract:
Robot learning has produced remarkably effective "black-box" controllers for complex tasks such as dynamic locomotion on humanoids. Yet ensuring dynamic safety, i.e., constraint satisfaction, remains challenging for such policies. Reinforcement learning (RL) embeds constraints heuristically through reward engineering, and adding or modifying constraints requires retraining. Model-based approaches, like control barrier functions (CBFs), enable runtime constraint specification with formal guarantees but require accurate dynamics models. This paper presents SHIELD, a layered safety framework that bridges this gap by: (1) training a generative, stochastic dynamics residual model using real-world data from hardware rollouts of the nominal controller, capturing system behavior and uncertainties; and (2) adding a safety layer on top of the nominal (learned locomotion) controller that leverages this model via a stochastic discrete-time CBF formulation enforcing safety constraints in probability. The result is a minimally-invasive safety layer that can be added to the existing autonomy stack to give probabilistic guarantees of safety that balance risk and performance. In hardware experiments on a Unitree G1 humanoid, SHIELD enables safe navigation (obstacle avoidance) through varied indoor and outdoor environments using a nominal (unknown) RL controller and onboard perception.
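
A discrete-time CBF safety filter can be illustrated on a one-dimensional toy system. The sketch below is a deterministic simplification, not the SHIELD formulation: it assumes the model x_next = x + u * dt, the barrier h(x) = x_max - x, and a fixed `margin` standing in for the probabilistic tightening that a learned stochastic residual model would provide. All names are hypothetical.

```python
def cbf_filter(x, u_nom, x_max, alpha=0.5, dt=0.1, margin=0.0):
    """Clip a nominal velocity command so that h(x) = x_max - x stays non-negative.

    Enforces the discrete-time condition h(x_next) >= (1 - alpha) * h(x) + margin
    with 0 < alpha <= 1, which for x_next = x + u * dt reduces to
    u <= (alpha * h(x) - margin) / dt.
    """
    h = x_max - x
    u_limit = (alpha * h - margin) / dt
    return min(u_nom, u_limit)


# Hypothetical rollout: the nominal controller always pushes toward the wall at x = 1.
x, trajectory = 0.0, []
for _ in range(50):
    u = cbf_filter(x, u_nom=5.0, x_max=1.0)
    x = x + u * 0.1
    trajectory.append(x)
```

The filter leaves the nominal command untouched whenever it already satisfies the condition, which is what makes this kind of safety layer minimally invasive.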

Abstract:
Recent advancements in generative models have revolutionized video synthesis and editing. However, the scarcity of diverse, high-quality datasets continues to hinder video-conditioned robotic learning, limiting cross-platform generalization. In this work, we address the challenge of swapping a robotic arm in one video with another, a key step for cross-embodiment learning. Unlike previous methods that depend on paired video demonstrations in the same environmental settings, our proposed framework, RoboSwap, operates on unpaired data from diverse environments, alleviating the data collection needs. RoboSwap introduces a novel video editing pipeline integrating both GANs and diffusion models, combining their respective advantages. Specifically, we segment robotic arms from their backgrounds and train an unpaired GAN model to translate one robotic arm to another. The translated arm is blended with the original video background and refined with a diffusion model to enhance coherence, motion realism and object interaction. The GAN and diffusion stages are trained independently. Our experiments demonstrate that RoboSwap outperforms state-of-the-art video and image editing models on three benchmarks in terms of both structural coherence and motion consistency, thereby offering a robust solution for generating reliable, cross-embodiment data in robotic learning.

Abstract:
Precise alignment of the micropipette tip is crucial for robotic cell micromanipulation, enabling delicate procedures such as cell transfer, rotation, and immobilization. However, the limited field of view and depth perception under high-magnification microscopy make it challenging to accurately identify misalignment and effectively control micropipette motion for adjustment. This paper comprehensively analyzes and addresses the misalignment problem, particularly that caused by the improper inclination angle of the micropipette holder. A vision-guided robotic control strategy for automatic micropipette alignment is integrated into a 5-degree-of-freedom (5-DOF) micromanipulator, enabling autonomous detection, adjustment, and positioning of the micropipette. The proposed method ensures precise trajectory tracking and compensates for geometric uncertainties introduced by fabrication or installation errors. Experimental validation demonstrates that the proposed system achieved a mean absolute error below 3 µm for positioning the tip of the micropipette at the focal plane during adjustment. Meanwhile, the robotic method required significantly less time to stably rotate the micropipette compared to manual operation for cell manipulation. The vertical alignment error was less than 3 µm along a 250 µm micropipette tip segment. These results confirm that the proposed approach significantly enhances speed, accuracy, and repeatability in micropipette-based micro-manipulation, providing a robust solution for high-throughput biological experiments and clinical applications.

Abstract:
Robust bipedal locomotion in exoskeletons requires the ability to dynamically react to changes in the environment in real time. This paper introduces the hybrid data-driven predictive control (HDDPC) framework, an extension of the data-enabled predictive control, that addresses these challenges by simultaneously planning foot contact schedules and continuous domain trajectories. The proposed framework utilizes a Hankel matrix-based representation to model system dynamics, incorporating step-to-step (S2S) transitions to enhance adaptability in dynamic environments. By integrating contact scheduling with trajectory planning, the framework offers an efficient, unified solution for locomotion motion synthesis that enables robust and reactive walking through online replanning. We validate the approach on the Atalante exoskeleton, demonstrating improved robustness and adaptability.
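
Data-enabled predictive control replaces a parametric model with Hankel matrices built from recorded trajectories. The sketch below shows only that building block on a toy signal. The function name `block_hankel`, the `depth` parameter, and the sample data are illustrative assumptions, and the step-to-step transition handling described in the abstract is not modeled.

```python
from typing import List, Sequence


def block_hankel(signal: Sequence[Sequence[float]], depth: int) -> List[List[float]]:
    """Build a block Hankel matrix from a recorded vector signal.

    Column j stacks the `depth` consecutive samples signal[j], ..., signal[j + depth - 1].
    `signal` is a list of samples, each sample being a list of floats.
    """
    n_samples = len(signal)
    if depth < 1 or depth > n_samples:
        raise ValueError("depth must be between 1 and the number of samples")
    n_cols = n_samples - depth + 1
    dim = len(signal[0])
    rows: List[List[float]] = []
    for i in range(depth):        # block row index (time shift)
        for d in range(dim):      # component of the sample vector
            rows.append([signal[i + j][d] for j in range(n_cols)])
    return rows


# Toy scalar trajectory with six samples (hypothetical data).
u = [[0.0], [1.0], [2.0], [3.0], [4.0], [5.0]]
H = block_hankel(u, depth=3)
for row in H:
    print(row)
```

Each column of `H` is one length-3 window of the recorded data; in a data-driven predictive controller, candidate future trajectories are expressed as linear combinations of such columns.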

Abstract:
For several tasks, ranging from manipulation to inspection, it is beneficial for robots to localize a target object in their surroundings. In this paper, we propose an approach that utilizes coarse point clouds obtained from miniaturized VL53L5CX Time-of-Flight (ToF) sensors (tiny LiDARs) to localize a target object in the robot’s workspace. We first conduct an experimental campaign to calibrate the dependency of sensor readings on relative range and orientation to targets. We then propose a probabilistic sensor model, which we validate in an object pose estimation task using a Particle Filter (PF). The results show that the proposed sensor model improves the performance of the localization of the target object with respect to two baselines: one that assumes measurements are free from uncertainty and one in which the confidence is provided by the sensor datasheet.

Abstract:
This paper addresses the scarcity of affordable, fully-actuated five-fingered hands for dexterous teleoperation, which is crucial for collecting large-scale real-robot data within the "Learning from Demonstrations" paradigm. We introduce the prototype version of the RAPID Hand, the first low-cost, 20-degree-of-actuation (DoA) dexterous hand that integrates a novel anthropomorphic actuation and transmission scheme with an optimized motor layout and structural design to enhance dexterity. Specifically, the RAPID Hand features a universal phalangeal transmission scheme for the non-thumb fingers and an omnidirectional thumb actuation mechanism. Prioritizing affordability, the hand employs 3D-printed parts combined with custom gears for easier replacement and repair. We assess the RAPID Hand’s performance through quantitative metrics and qualitative testing in a dexterous teleoperation system, which is evaluated on three challenging tasks: multi-finger retrieval, ladle handling, and human-like piano playing. The results indicate that the RAPID Hand’s fully actuated 20-DoF design holds significant promise for dexterous teleoperation.

Abstract:
Place recognition is a crucial component that enables autonomous vehicles to obtain localization results in GPS-denied environments. In recent years, multimodal place recognition methods have gained increasing attention. They overcome the weaknesses of unimodal sensor systems by leveraging complementary information from different modalities. Most existing methods explore cross-modality correlations through feature-level or descriptor-level fusion. Conversely, the recently proposed 3D Gaussian Splatting provides a new perspective on multimodal spatio-temporal fusion by harmonizing temporally continuous multimodal data into an explicit scene representation. In this paper, we propose a 3D Gaussian Splatting-based multimodal place recognition network dubbed GSPR. It explicitly combines multi-view RGB images and LiDAR point clouds into a spatio-temporally unified scene representation with the proposed Multimodal Gaussian Splatting. A network composed of 3D graph convolution and transformer is designed to extract global descriptors from the Gaussian scenes for place recognition. Extensive evaluations on three datasets demonstrate that our method can effectively leverage complementary strengths of both multi-view cameras and LiDAR, achieving SOTA place recognition performance while maintaining solid generalization ability. Our open-source code will be released at https://github.com/QiZS-BIT/GSPR.

Abstract:
Human-robot collaborative manipulation with mobile, multiple manipulators is crucial for expanding robotic applications, requiring precise handling of coupled force-position constraints between partners. Current systems, however, exhibit end-effector oscillations and instability during dynamic interactions. To overcome these limitations, this work develops a collaborative framework integrating a collaborative controller and a whole-body controller. The collaborative controller employs the object’s center-of-mass dynamics model with real-time contact forces and motion states to predict trajectories while coordinating with an attitude stabilization controller to adjust the desired end-effector poses. The whole-body controller utilizes model predictive control to generate coordinated motions that strictly follow pose commands from the collaborative controller, ensuring stable transportation. Simulation and physical experiments validate the proposed framework’s effectiveness in real-world scenarios.

Abstract:
This paper presents a direct, targetless, and automatic LiDAR-camera joint calibration method that effectively overcomes the precision limitations of the intrinsic parameters. We propose an iterative two-stage optimization methodology that leverages 3D LiDAR measurements to simultaneously refine both intrinsic and extrinsic parameters. In the first stage, the intrinsic parameters are optimized using a normalized information distance (NID) metric, an information-theoretic measure that quantifies the statistical alignment between LiDAR and image intensities, while initial extrinsic parameters derived from CAD specifications facilitate the projection of LiDAR point clouds onto the camera image plane. In the second stage, the refined intrinsic parameters guide further optimization of the extrinsic parameters using the same NID-based evaluation metric. This alternating process iteratively enhances both intrinsic and extrinsic parameters through their mutual interdependence. Experiments across multiple datasets demonstrate that our method achieves sub-pixel intrinsic accuracy and extrinsic parameters that closely align with CAD specifications, validating the superior performance of our methodology for sensor fusion applications.
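
To make the NID metric concrete, the following minimal sketch computes it from paired, already-quantized intensity samples using the standard definition NID(X, Y) = (H(X, Y) - I(X; Y)) / H(X, Y). The function names, the binning, and the toy inputs are assumptions for illustration; the projection of LiDAR points into the image and the optimization loop are not shown.

```python
import math
from collections import Counter
from typing import Iterable, Sequence


def entropy(counts: Iterable[int], total: int) -> float:
    """Shannon entropy (in nats) of a histogram given as raw counts."""
    return -sum((c / total) * math.log(c / total) for c in counts if c > 0)


def nid(lidar_bins: Sequence[int], image_bins: Sequence[int]) -> float:
    """Normalized information distance between paired, quantized intensities.

    Returns a value in [0, 1]: 0 when one signal determines the other,
    1 when the two are statistically independent.
    """
    if len(lidar_bins) != len(image_bins) or len(lidar_bins) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(lidar_bins)
    h_x = entropy(Counter(lidar_bins).values(), n)
    h_y = entropy(Counter(image_bins).values(), n)
    h_xy = entropy(Counter(zip(lidar_bins, image_bins)).values(), n)
    if h_xy == 0.0:
        # Both signals are constant; treating this as distance 0 is a convention chosen here.
        return 0.0
    mutual_information = h_x + h_y - h_xy
    return (h_xy - mutual_information) / h_xy


# Perfectly aligned intensities versus statistically independent ones (toy data).
print(nid([0, 1, 2, 3], [0, 1, 2, 3]))
print(nid([0, 0, 1, 1], [0, 1, 0, 1]))
```

A calibration routine would minimize this value over the camera parameters, since a better projection makes LiDAR and image intensities more mutually predictable.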

Abstract:
Collision-free flight in cluttered environments is a critical capability for autonomous quadrotors. Traditional methods often rely on detailed 3D map construction, trajectory generation, and tracking. However, this cascade pipeline can introduce accumulated errors and computational delays, limiting flight agility and safety. In this paper, we propose a novel method for enabling collision-free flight in cluttered environments without explicitly constructing 3D maps or generating and tracking collision-free trajectories. Instead, we leverage Model Predictive Control (MPC) to directly produce safe actions from sparse waypoints and point clouds from a depth camera. These sparse waypoints are dynamically adjusted online based on nearby obstacles detected from point clouds. To achieve this, we introduce a dual KD-Tree mechanism: the Obstacle KD-Tree quickly identifies the nearest obstacle for avoidance, while the Edge KD-Tree provides a robust initial guess for the MPC solver, preventing it from getting stuck in local minima during obstacle avoidance. We validate our approach through extensive simulations and real-world experiments. The results show that our approach significantly outperforms the mapping-based methods and is also superior to imitation learning-based methods, demonstrating reliable obstacle avoidance at up to 12 m/s in simulations and 6 m/s in real-world tests. Our method provides a simple and robust alternative to existing methods. The code is publicly available at https://github.com/SJTU-ViSYS-team/avoid-mpc.

Abstract:
This paper presents a novel continuous-time gradient-proportional-integral flow (GPIF) for motion planning with obstacle avoidance. We first frame the motion planning task as a constrained optimization problem, which is relaxed to be an unconstrained optimization problem that can be locally solved via a gradient flow approach using functional analysis. To enforce constraints, the proposed GPIF augments the gradient flow dynamics with proportional and integral feedback terms. Under reasonable assumptions formulated as linear matrix inequalities, we prove that the GPIF can generate optimal control trajectories with guaranteed exponential convergence. Numerical simulations validate the algorithm's efficacy, focusing on simple car navigation in cluttered environments. Simulations show that even after discretization for practical implementation, the GPIF method retains computational efficiency, enabling both offline planning and real-time online execution.
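
The idea of augmenting a gradient flow with proportional and integral feedback on constraint violation can be illustrated on a finite-dimensional toy problem. The sketch below is not the paper's function-space formulation: the quadratic cost, the single equality constraint, the gains `kp` and `ki`, and the forward-Euler step are all assumptions chosen for illustration.

```python
from typing import Tuple


def pi_gradient_flow(kp: float = 1.0, ki: float = 1.0,
                     dt: float = 0.01, steps: int = 3000) -> Tuple[float, float]:
    """Forward-Euler integration of a PI-augmented gradient flow for

        minimize 0.5 * (x1**2 + x2**2)   subject to   x1 + x2 = 1.

    The constraint violation is fed back through a proportional term and an
    integral term, both applied along the constraint gradient (1, 1).
    """
    x1, x2 = 0.0, 0.0      # initial guess, which violates the constraint
    integral = 0.0         # running integral of the constraint violation
    for _ in range(steps):
        violation = x1 + x2 - 1.0
        feedback = kp * violation + ki * integral
        # Cost gradient is (x1, x2); the feedback acts along (1, 1).
        dx1 = -x1 - feedback
        dx2 = -x2 - feedback
        x1 += dt * dx1
        x2 += dt * dx2
        integral += dt * violation
    return x1, x2


x1, x2 = pi_gradient_flow()
print(x1, x2)  # approaches the constrained optimum (0.5, 0.5)
```

In this toy case the integral state plays the role of a Lagrange multiplier estimate, and the proportional term adds damping on the constraint error.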

Abstract:
Accurate spatial-temporal parameters of LiDAR, IMU, and camera, including extrinsic parameters and time offsets, are key to ensuring multi-sensor fusion performance for ground vehicles. Compared with target-based calibration methods, targetless methods do not need artificial targets, which makes them more convenient and flexible. However, most existing targetless calibration methods cannot be applied to ground vehicles with a combined LiDAR-IMU-camera system. To address this issue, we propose GLIC-Calib, a carefully designed targetless, one-shot spatial-temporal calibration method for LiDAR-IMU-camera systems on ground vehicles. First, we recover the real-time camera scale using the constant camera height and visual ground points. Then, we initialize the 6-DoF extrinsic parameters of LiDAR-IMU and camera-IMU, as well as their time offsets, from raw motion measurements and ground constraints. Finally, we refine the spatial-temporal parameters of LiDAR-IMU and camera-IMU using highly accurate LiDAR-camera extrinsic parameters obtained from environmental association. Experiments conducted on self-collected datasets with different sensor configurations show the effectiveness, efficiency, and robustness of the proposed method compared with others. We have open-sourced our method on GitHub as a contribution to the community.

Abstract:
This paper presents VisLanding, a monocular 3D perception-based framework for safe UAV (Unmanned Aerial Vehicle) landing. Addressing the core challenge of autonomous UAV landing in complex and unknown environments, this study innovatively leverages the depth-normal synergy prediction capabilities of the Metric3D V2 model to construct an end-to-end safe landing zones (SLZ) estimation framework. By introducing a safe zone segmentation branch, we transform the landing zone estimation task into a binary semantic segmentation problem. The model is fine-tuned and annotated using the WildUAV dataset from a UAV perspective, while a cross-domain evaluation dataset is constructed to validate the model’s robustness. Experimental results demonstrate that VisLanding significantly enhances the accuracy of safe zone identification through a depth-normal joint optimization mechanism, while retaining the zero-shot generalization advantages of Metric3D V2. The proposed method exhibits superior generalization and robustness in cross-domain testing compared to other approaches. Furthermore, it enables the estimation of landing zone area by integrating predicted depth and normal information, providing critical decision-making support for practical applications. The code and datasets are available at https://github.com/Victoire7/Vislanding.git.

Abstract:
Place recognition is an important component for autonomous robot navigation. Many existing LiDAR-based place recognition methods encode the structural information of 3D LiDAR data into 2D image representations. However, most of these intermediate representations only exploit the projection in a single view, ignoring a great amount of useful information. In this paper, a compact fusion-view image representation of the LiDAR point cloud is proposed to extract important structural information from different views. Our proposed method generates such fusion-view images using the corresponding geometric information among points, highlighting the edges of objects. It then extracts texture features encoding the shapes and layouts of scene elements into global descriptors, where regional features are designed to adapt to local discrepancies caused by seasonal changes. Extensive experiments on the Oxford RobotCar, NCLT, and UTBM datasets and our cross-season dataset validate the proposed method and demonstrate its superior generalization performance under different LiDAR sensors and season shifts. Moreover, our proposed method can operate online with a single CPU, making it suitable for resource-limited real robot platforms.

Abstract:
Recent advances in 3D scene reconstruction, such as Neural Radiance Fields (NeRF) and 3D Gaussian Splatting, have demonstrated remarkable results in novel view synthesis and dynamic scene representation. Despite these successes, existing approaches rely on time-synchronized multi-view imagery captured using specialized camera rigs in controlled environments. This reliance limits their applicability in uncontrolled, unbounded dynamic scenes. In this work, we propose a novel Unmanned Aerial Vehicle (UAV) based multi-view capture system that leverages GNSS Pulse Per Second (PPS) signals for precise frame synchronization across multiple cameras. Our system eliminates the need for fixed infrastructure, enabling flexible and scalable data collection for dynamic scene reconstruction in diverse environments. In addition to the system architecture, we also introduce a dataset of synchronized multi-view images captured in unbounded outdoor scenes from four synchronized UAVs, each carrying a stereo camera rig. We benchmark several 3D and 4D representation methods on our dataset and highlight the challenges associated with data collection in unstructured outdoor settings, such as sparse views, varied lighting conditions, and visual degradation. Our hardware configuration details, software details, and dataset are available at https://github.com/neufieldrobotics/Dynamic_Mapping.

Abstract:
This paper proposes a quality-driven adaptive control framework for robotic scanning of vascular anatomies to facilitate the acquisition of high-quality ultrasound (US) images. Specifically, a novel probability-based US image quality evaluation metric for vascular anatomies is introduced, leveraging an image segmentation network to establish a mapping between the controlled variables of the robot (e.g., pose and force) and US image quality. Furthermore, an adaptive US probe control strategy driven by US image quality is developed to optimize real-time image acquisition, with its stability rigorously proven. To assess the effectiveness of the proposed framework, two experiments were conducted on a human tissue-mimicking phantom, encompassing both static and dynamic scenarios. The experimental results demonstrate that the proposed framework ensures stable contact force and significantly enhances US image quality for robot-assisted vascular anatomy imaging, even in the presence of external disturbances.

Abstract:
Monocular 3D object tracking methods are widely employed in robotic applications; however, they often struggle with low-contrast image sequences. In this paper, we introduce a novel approach to filtering redundant edges in images by leveraging sparse interior correspondences. Our method features a sparse-flow-based probability segmentation model that comprises both coarse and fine components. The coarse model evaluates the ratio of interior correspondences within a circular region centered on each pixel, while the fine model employs a binary Gaussian kernel based on the nearest interior correspondences. This probability framework facilitates the identification of control points for object edges. Additionally, we implement a robust gradient consistency-based edge connection algorithm to generate refined object edges. Utilizing these filtered edges, we formulate an edge-based energy function that accounts for object contour shape and noise uncertainty and integrates seamlessly into a multi-feature pose optimization framework. Our multi-feature fusion strategy achieves state-of-the-art performance on both public datasets and real-world applications, operating at 60 Hz using only a CPU.

Abstract:
Unmanned ground vehicles operating in complex environments must adaptively adjust to modeling uncertainties and external disturbances to perform tasks such as wall following and obstacle avoidance. This paper introduces an adaptive control approach based on spiking neural networks for wall fitting and tracking, which learns and adapts to unforeseen disturbances. We propose real-time wall-fitting algorithms to model unknown wall shapes and generate corresponding trajectories for the vehicle to follow. A discretized linear quadratic regulator is developed to provide a baseline control signal based on an ideal vehicle model. Point matching algorithms then identify the nearest matching point on the trajectory to generate feedforward control inputs. Finally, an adaptive spiking neural network controller, which adjusts its connection weights online based on error signals, is integrated with the aforementioned control algorithms. Numerical simulations demonstrate that this adaptive control framework outperforms the traditional linear quadratic regulator in tracking complex trajectories and following irregular walls, even in the presence of partial actuator failures and state estimation errors.

Abstract:
We present a novel calibration method between single-point LiDAR and camera sensors utilizing an easy-to-build customized calibration board satisfying the Manhattan world (MW) assumption. Previous methods for LiDAR-camera (LC) calibration focus on line and plane correspondences. However, they require dense 3D point clouds from heavy and expensive LiDAR to simplify alignments; otherwise, these approaches fail for extremely sparse LiDAR. Compact, lightweight, and sparse LiDAR and camera sensors are essential for micro drones like Crazyflie with a maximum payload of 15 g, but there are no explicit calibration methods for them. To address these issues, we propose a new extrinsic calibration method with a new calibration board, which rotates like a door to capture geometric features and align them with images. Once we find an initial estimate, we refine the relative rotation by minimizing the angle difference between the grid orientation of the checkerboard and the MW axes. We demonstrate the effectiveness of the proposed method through various LC configurations, showing its capability and high accuracy compared to other state-of-the-art approaches. We release our calibration toolkit, source code, and instructions for building the calibration boards for the robotics community: https://SPLiCE-Calib.github.io/.

Abstract:
The heterogeneous cluster system holds significant application potential in scenarios such as collaborative logistics, disaster response operations, and precision agriculture, but achieving effective task planning for its subsystems remains a challenging issue due to specialized robotic hardware and distinct action spaces. To this end, an innovative framework called LLM-driven Closed-Loop Behavior Tree (LLM-CBT) is proposed. LLMs and behavior trees (BTs) are integrated for task planning in heterogeneous unmanned clusters, including Unmanned Aerial Vehicles (UAVs) and Unmanned Ground Vehicles (UGVs). Particularly, a novel mechanism, Generation-Refinement-Execution-Feedback (GREF), is introduced, in which an initial behavior tree is generated by LLM and iteratively refined. The refined behavior tree is then executed, and adjustments are made based on the execution results, forming a closed-loop process that ultimately achieves the task objectives. In this way, the executability of BTs is improved, and the robustness of task execution in dynamic environments is enhanced. Experiments were conducted across three scenarios with varying task complexity. The results show that the GREF closed-loop mechanism is essential for the effective operation of heterogeneous unmanned clusters.

Abstract:
The Entrance Dependent Vehicle Routing Problem (EDVRP) is a variant of the Vehicle Routing Problem (VRP) where the scale of cities influences routing outcomes, necessitating consideration of their entrances. This paper addresses EDVRP in agriculture, focusing on multi-parameter vehicle planning for irregularly shaped fields. To address the limitations of traditional methods, such as heuristic approaches, which often overlook field geometry and entrance constraints, we propose a Joint Probability Distribution Sampling Neural Network (JPDS-NN) to effectively solve the EDVRP. The network uses an encoder-decoder architecture with graph transformers and attention mechanisms to model routing as a Markov Decision Process, and is trained via reinforcement learning for efficient and rapid end-to-end planning. Experimental results indicate that JPDS-NN reduces travel distances by 48.4–65.4%, lowers fuel consumption by 14.0–17.6%, and computes two orders of magnitude faster than baseline methods, while demonstrating 15–25% superior performance in dynamic arrangement scenarios. Ablation studies validate the necessity of cross-attention and pre-training. The framework enables scalable, intelligent routing for large-scale farming under dynamic constraints.

Abstract:
This paper introduces Quaternion Approximate Networks (QUAN), a novel deep learning framework that leverages quaternion algebra for rotation equivariant image classification and object detection. Unlike conventional quaternion neural networks attempting to operate entirely in the quaternion domain, QUAN approximates quaternion convolution through Hamilton product decomposition using real-valued operations. This approach preserves geometric properties while enabling efficient implementation with custom CUDA kernels. We introduce Independent Quaternion Batch Normalization (IQBN) for training stability and extend quaternion operations to spatial attention mechanisms. QUAN is evaluated on image classification (CIFAR-10/100, ImageNet), object detection (COCO, DOTA), and robotic perception tasks. In classification tasks, QUAN achieves higher accuracy with fewer parameters and faster convergence compared to existing convolution and quaternion-based models. For object detection, QUAN demonstrates improved parameter efficiency and rotation handling over standard Convolutional Neural Networks (CNNs) while establishing the SOTA for quaternion CNNs in this downstream task. These results highlight its potential for deployment in resource-constrained robotic systems requiring rotation-aware perception and application in other domains.
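
The Hamilton product of two quaternions expands into sixteen real multiplications, which is why a quaternion-valued operation can be assembled from real-valued ones. The sketch below shows that expansion for single quaternions only; the function name and tuple layout `(w, x, y, z)` are illustrative assumptions, and it does not reproduce QUAN's convolution or its CUDA kernels.

```python
from typing import Tuple

Quaternion = Tuple[float, float, float, float]  # (w, x, y, z) = w + x*i + y*j + z*k


def hamilton_product(p: Quaternion, q: Quaternion) -> Quaternion:
    """Hamilton product p * q written out with real-valued operations only."""
    a1, b1, c1, d1 = p
    a2, b2, c2, d2 = q
    return (
        a1 * a2 - b1 * b2 - c1 * c2 - d1 * d2,  # real part
        a1 * b2 + b1 * a2 + c1 * d2 - d1 * c2,  # i component
        a1 * c2 - b1 * d2 + c1 * a2 + d1 * b2,  # j component
        a1 * d2 + b1 * c2 - c1 * b2 + d1 * a2,  # k component
    )


i = (0.0, 1.0, 0.0, 0.0)
j = (0.0, 0.0, 1.0, 0.0)
print(hamilton_product(i, j))  # i * j = k
print(hamilton_product(j, i))  # j * i = -k, so the product is not commutative
```

In a quaternion layer, each scalar multiplication above would be replaced by a real-valued convolution between the corresponding weight and feature components.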

Abstract:
Cooperative perception research is hindered by the limited availability of datasets that capture the complexity of real-world Vehicle-to-Everything (V2X) interactions, particularly under dynamic communication constraints. To address this gap, we introduce WHALES (Wireless enHanced Autonomous vehicles with Large number of Engaged agentS), the first large-scale V2X dataset explicitly designed to benchmark communication-aware agent scheduling and scalable cooperative perception. WHALES introduces a new benchmark that enables state-of-the-art (SOTA) research in communication-aware cooperative perception, featuring an average of 8.4 cooperative agents per scene and 2.01 million annotated 3D objects across diverse traffic scenarios. It incorporates detailed communication metadata to emulate real-world communication bottlenecks, enabling rigorous evaluation of scheduling strategies. To further advance the field, we propose the Coverage-Aware Historical Scheduler (CAHS), a novel scheduling baseline that selects agents based on historical viewpoint coverage, improving perception performance over existing SOTA methods. WHALES bridges the gap between simulated and real-world V2X challenges, providing a robust framework for exploring perception-scheduling co-design, cross-data generalization, and scalability limits. The WHALES dataset and code are available at https://github.com/chensiweiTHU/WHALES.

Abstract:
Azimuth thrusters are widely used for controlling marine vehicles, especially for dynamic positioning and hovering. However, including azimuth thrusters makes control allocation a nonlinear, non-convex problem, which is commonly solved using nonlinear programming methods, simplified by pairing azimuth thrusters (e.g., two thrusters always move at the same angle), or locally linearized using approximations such as Taylor series expansions and polynomial functions. In this paper, a new approach is presented that converts the azimuth thruster control allocation problem into a convex quadratic problem with a new force decomposition and linear first-order inequality constraints. As a result, the complexity of the control allocation increases linearly with the number of azimuth thrusters, so it can be implemented on marine vehicles with larger numbers of independently controlled azimuth thrusters and solved using constrained quadratic programming solvers. Case studies are presented to validate the proposed method on simulated Autonomous Underwater Vehicles (AUVs) with two and four azimuth thrusters configured with different azimuth angle limits (±45, ±90, and ±135 degrees). The results show excellent control performance of the proposed approach in controlling multiple states (surge, pitch, yaw, depth, and sway) simultaneously, even when experiencing a cross-track ocean current. Recommendations on hardware implementation are also discussed for real-world platform integration.
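
One common way to remove the trigonometric nonlinearity is to allocate each thruster's Cartesian force components instead of its magnitude and angle. The sketch below shows that generic decomposition and a linear sector test; it is an assumption-laden illustration and may differ from the decomposition proposed in the paper. The function names are hypothetical, and the linear sector test shown here only covers angle limits below 90 degrees.

```python
import math
from typing import Tuple


def decompose(thrust: float, angle_rad: float) -> Tuple[float, float]:
    """Convert (thrust, azimuth angle) into surge and sway force components."""
    return thrust * math.cos(angle_rad), thrust * math.sin(angle_rad)


def recover(fx: float, fy: float) -> Tuple[float, float]:
    """Map allocated force components back to a thrust magnitude and azimuth angle."""
    return math.hypot(fx, fy), math.atan2(fy, fx)


def within_sector(fx: float, fy: float, limit_rad: float) -> bool:
    """Check |angle| <= limit using two linear inequalities in (fx, fy).

    Valid only for limits below 90 degrees: |fy| <= tan(limit) * fx.
    """
    slope = math.tan(limit_rad)
    return fy <= slope * fx and -fy <= slope * fx


fx, fy = decompose(10.0, math.radians(30.0))
thrust, angle = recover(fx, fy)
print(thrust, math.degrees(angle))
print(within_sector(fx, fy, math.radians(45.0)))
print(within_sector(*decompose(10.0, math.radians(60.0)), math.radians(45.0)))
```

Because the generalized force on the vehicle is linear in `(fx, fy)`, a quadratic cost with linear inequality constraints of this kind can be passed to a standard quadratic programming solver.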

Abstract:
We present Kalman-Filter Assisted Reinforcement Learner (KARL) for dynamic object tracking and grasping over eye-on-hand (EoH) systems, significantly expanding such systems’ capabilities in challenging, realistic environments. In comparison to the previous state-of-the-art, KARL (1) incorporates a novel six-stage RL curriculum that doubles the system’s motion range, thereby greatly enhancing the system’s grasping performance, (2) integrates a robust Kalman filter layer between the perception and reinforcement learning (RL) control modules, enabling the system to maintain an uncertain but continuous 6D pose estimate even when the target object temporarily exits the camera’s field-of-view or undergoes rapid, unpredictable motion, and (3) introduces mechanisms to allow retries to gracefully recover from unavoidable policy execution failures. Extensive evaluations conducted in both simulation and real-world experiments qualitatively and quantitatively corroborate KARL’s advantage over earlier systems, achieving higher grasp success rates and faster robot execution speed. Source code and supplementary materials for KARL will be made available at: https://github.com/arc-l/karl.
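
The role of a Kalman filter during temporary loss of the target can be seen in a one-dimensional constant-velocity toy: when no measurement arrives, the filter keeps predicting and its uncertainty grows. The class below is a minimal sketch under that assumption; the state layout, noise values, and measurement sequence are hypothetical, and KARL itself tracks a 6D pose rather than a scalar position.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class ConstantVelocityKF:
    pos: float = 0.0
    vel: float = 0.0
    p00: float = 1.0   # position variance
    p01: float = 0.0   # position-velocity covariance
    p11: float = 1.0   # velocity variance
    q: float = 1e-3    # process noise added to both variances (assumed value)
    r: float = 1e-2    # measurement noise variance (assumed value)

    def step(self, dt: float, z: Optional[float]) -> None:
        """Predict over dt, then update only if a position measurement z is available."""
        # Prediction with the constant-velocity model.
        self.pos += self.vel * dt
        p00 = self.p00 + 2.0 * dt * self.p01 + dt * dt * self.p11 + self.q
        p01 = self.p01 + dt * self.p11
        p11 = self.p11 + self.q
        if z is not None:
            # Measurement update for a direct position observation.
            s = p00 + self.r
            k0, k1 = p00 / s, p01 / s
            innovation = z - self.pos
            self.pos += k0 * innovation
            self.vel += k1 * innovation
            p00, p01, p11 = (1.0 - k0) * p00, (1.0 - k0) * p01, p11 - k1 * p01
        self.p00, self.p01, self.p11 = p00, p01, p11


# The target is observed four times, then leaves the field of view (None).
kf = ConstantVelocityKF()
history: List[Tuple[float, float]] = []
for z in [0.0, 0.1, 0.2, 0.3, None, None, None]:
    kf.step(0.1, z)
    history.append((kf.pos, kf.p00))
for pos, var in history:
    print(round(pos, 4), round(var, 5))
```

During the last three steps the position estimate keeps advancing along the estimated velocity while its variance increases, which is the "uncertain but continuous" behavior a downstream controller can exploit.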

Abstract:
Magnetic helical micro-robots (microbots) have attracted strong interest due to their unique propulsion mechanisms and potential applications in biomedical fields, particularly in minimally-invasive surgical procedures. Earlier research primarily focused on studying helical microbots in viscous liquids, while their dynamic behavior in viscoelastic solids remains largely unexplored. Here, we present an experimental study of a helical microbot operating in a viscoelastic gelatin hydrogel. The robot is fabricated by two-photon polymerization and actuated by an external rotating magnetic field. We observe that in viscoelastic solids, the robot ruptures the gel and creates a three-dimensional (3D) helical trajectory, despite the rotational axis of the driving magnetic field being fixed. Largely distinct from the propulsion behavior in a Newtonian fluid, the precession angle of the helix is significantly enhanced in the viscoelastic gel and increases with a rising rotational frequency. A dynamic model is developed using the multipole expansion method, incorporating the gel’s complex viscosity and shear-thinning properties to capture the key characteristics of this dynamic response. These findings offer new insights into the behavior of helical microbots in viscoelastic media, expanding possible application scenarios of microbots in biomedicine.

Abstract:
Tactile sensing is essential for dexterous manipulation, yet large-scale human demonstration datasets lack tactile feedback, limiting their effectiveness in skill transfer to robots. To address this, we introduce TacCap, a wearable Fiber Bragg Grating (FBG)-based tactile sensor designed for seamless human-to-robot transfer. TacCap is lightweight, durable, and immune to electromagnetic interference, making it ideal for real-world data collection. We detail its design and fabrication, evaluate its sensitivity, repeatability, and cross-sensor consistency, and assess its effectiveness through grasp stability prediction and ablation studies. Our results demonstrate that TacCap enables transferable tactile data collection, bridging the gap between human demonstrations and robotic execution, with broad implications for fine-motor disciplines such as surgical training and musical performance. To support further research and development, we open-source our hardware design and software.

Abstract:
This paper introduces Multi-PrefDrive, a framework that significantly enhances LLM-based autonomous driving through multidimensional preference tuning. Aligning LLMs with human driving preferences is crucial yet challenging, as driving scenarios involve complex decisions where multiple incorrect actions can correspond to a single correct choice. Traditional binary preference tuning fails to capture this complexity. Our approach pairs each chosen action with multiple rejected alternatives, better reflecting real-world driving decisions. By implementing the Plackett-Luce preference model, we enable nuanced ranking of actions across the spectrum of possible errors. Experiments in the CARLA simulator demonstrate that our algorithm achieves an 11.0% improvement in overall score and an 83.6% reduction in infrastructure collisions, while showing perfect compliance with traffic signals in certain environments. Comparative analysis against DPO and its variants reveals Multi-PrefDrive’s superior discrimination between chosen and rejected actions, achieving a margin value of 25, and this ability translates directly into enhanced driving performance. We implement memory-efficient techniques including LoRA and 4-bit quantization to enable deployment on consumer-grade hardware and will open-source our training code and multi-rejected dataset to advance research in LLM-based autonomous driving systems. Project Page (https://liyun0607.github.io/).
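
Under the Plackett-Luce model, the probability of a ranking is a product of softmax terms, each taken over the items not yet ranked. The sketch below evaluates that log-likelihood for a list of scores already sorted from most to least preferred. The function name and the example scores are illustrative assumptions; how Multi-PrefDrive derives scores from the LLM and orders the rejected actions is not specified in the abstract and is not modeled here.

```python
import math
from typing import Sequence


def plackett_luce_log_likelihood(scores_in_rank_order: Sequence[float]) -> float:
    """Log-probability of a ranking under the Plackett-Luce model.

    scores_in_rank_order[0] is the score of the most preferred item (e.g. the
    chosen action), followed by the remaining items in decreasing preference.
    """
    total = 0.0
    # The last item contributes log(1) = 0, so it is skipped.
    for idx in range(len(scores_in_rank_order) - 1):
        remaining = scores_in_rank_order[idx:]
        peak = max(remaining)  # subtract the maximum for numerical stability
        log_norm = peak + math.log(sum(math.exp(s - peak) for s in remaining))
        total += scores_in_rank_order[idx] - log_norm
    return total


# With equal scores every ordering of three items is equally likely (1/6).
uniform_ll = plackett_luce_log_likelihood([0.0, 0.0, 0.0])
# A chosen action scored well above three rejected ones yields a likelihood closer to 1.
confident_ll = plackett_luce_log_likelihood([5.0, 0.0, -1.0, -2.0])
print(math.exp(uniform_ll), math.exp(confident_ll))
```

Training would minimize the negative of this quantity, which pushes the chosen action's score above all rejected alternatives rather than above a single one as in binary preference tuning.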

Abstract:
Semantic Scene Completion (SSC) aims to reconstruct the entire 3D scene in terms of both occupancy and semantics, serving as a fundamental task for autonomous driving and robotic systems. Camera-based methods have seen significant advancements due to their low cost and rich visual cues. However, previous approaches have predominantly focused on semantic recovery. This can lead to inaccurate occupancy predictions and, consequently, the failure of downstream tasks such as trajectory planning. To address this limitation, we propose a novel multi-frame matching framework, GeoScene, which reconstructs spatial structures through inter-frame geometric correlations of temporal images and subsequently infers scene semantic information. Specifically, we extract features from distinct frames in the depth dimension and derive depth features by constructing a cost volume. Following this, dot product and voxelization operations are applied between the extracted features and depth features to correct assignment errors. Furthermore, we introduce a surface normal-based regression loss to preserve fine-grained surface structures. Extensive experiments on the SemanticKITTI dataset demonstrate that GeoScene outperforms existing state-of-the-art methods.

Abstract:
Depth estimation is a cornerstone computer vision application that is critical for scene understanding and autonomous driving. In real-world scenarios, achieving reliable depth perception under adverse weather (e.g., fog and rain) is crucial to ensure safety and system robustness. However, quantitatively evaluating the performance of depth estimation methods in these scenarios is challenging due to the difficulty of obtaining ground truth data. A promising approach is using weather chambers to simulate diverse weather conditions in a controlled environment. However, current datasets are limited in distance and lack dense ground truth. To address this gap, we introduce a novel evaluation benchmark that extends depth evaluation up to 200 meters under clear, foggy, and rainy conditions. To this end, we employ a multimodal sensor setup, including state-of-the-art stereo RGB, RCCB, and gated camera systems, and a long-range LiDAR sensor. Moreover, we record a digital twin of the test facility sampled at a millimeter scale using a high-end geodesic laser scanner. This comprehensive benchmark allows for the evaluation of different models and multiple sensing modalities in a more precise and accurate manner, as well as at far distances. Data and code will be released upon publication.

Abstract:
Recent advancements in high-definition (HD) map construction have demonstrated the effectiveness of dense representations, which heavily rely on computationally intensive bird’s-eye view (BEV) features. While sparse representations offer a more efficient alternative by avoiding dense BEV processing, existing methods often lag behind due to the lack of tailored designs. These limitations have hindered the competitiveness of sparse representations in online HD map construction. In this work, we systematically revisit and enhance sparse representation techniques, identifying key architectural and algorithmic improvements that bridge the gap with—and ultimately surpass—dense approaches. We introduce a dedicated network architecture optimized for sparse map feature extraction, a sparse-dense segmentation auxiliary task to better leverage geometric and semantic cues, and a denoising module guided by physical priors to refine predictions. Through these enhancements, our method achieves state-of-the-art performance on the nuScenes dataset, significantly advancing HD map construction and centerline detection. Specifically, SparseMeXt-Tiny reaches a mean average precision (mAP) of 55.5% at 32 frames per second (fps), while SparseMeXt-Base attains 65.2% mAP. Scaling the backbone and decoder further, SparseMeXt-Large achieves an mAP of 68.9% at over 20 fps, establishing a new benchmark for sparse representations in HD map construction. These results underscore the untapped potential of sparse methods, challenging the conventional reliance on dense representations and redefining efficiency-performance trade-offs in the field.

Abstract:
The vehicle routing problem with drones (VRPD) involves determining the optimal routes for trucks and drones to collaboratively deliver parcels to customers, aiming to minimize total operational costs. While various heuristic algorithms have been developed to address the problem, existing solutions are built based on simplistic cost models, overlooking the temporal dynamics of the costs, which fluctuate depending on the dynamically changing traffic conditions. In this paper, we present a novel problem called the vehicle routing problem with drones under dynamically changing traffic conditions (Adapt-VRPD) to address the limitation of existing VRPD solutions. We design a novel cost model that factors in the actual travel distance and projected travel time, computed using a machine learning-driven travel time prediction algorithm. A variable neighborhood descent (VND) algorithm is developed to find the optimal truck-drone routes under the dynamics of traffic conditions through incorporation of the travel time prediction model. A simulation study was performed to compare our algorithm with a state-of-the-art VRPD heuristic. Our algorithm outperformed the benchmark, reducing the average and maximum discrepancies from the actual cost by 37.6% and 27.6%, respectively, across various delivery scenarios.
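
The variable neighborhood descent step named above can be sketched generically as follows. This is an illustration under stated assumptions, not the paper's algorithm: the two neighborhoods (`swap_neighbors`, `reverse_neighbors`) and the toy `line_cost` function are hypothetical stand-ins, since the abstract does not describe the actual neighborhoods, the truck-drone route encoding, or the learned travel-time cost.

```python
from itertools import combinations

def swap_neighbors(route):
    """All routes obtained by swapping two stops."""
    for i, j in combinations(range(len(route)), 2):
        r = list(route)
        r[i], r[j] = r[j], r[i]
        yield r

def reverse_neighbors(route):
    """All routes obtained by reversing one contiguous segment (2-opt style)."""
    for i, j in combinations(range(len(route) + 1), 2):
        if j - i >= 2:
            yield route[:i] + route[i:j][::-1] + route[j:]

def vnd(route, cost, neighborhoods):
    """Variable neighborhood descent.

    Move to the best neighbor if it improves the cost and restart from the
    first neighborhood; otherwise try the next neighborhood. Stop when no
    neighborhood yields an improvement.
    """
    best, best_cost = list(route), cost(route)
    k = 0
    while k < len(neighborhoods):
        candidate = min(neighborhoods[k](best), key=cost, default=best)
        candidate_cost = cost(candidate)
        if candidate_cost < best_cost:
            best, best_cost = candidate, candidate_cost
            k = 0
        else:
            k += 1
    return best, best_cost

# Toy cost (hypothetical): stops lie on a line and the route starts and ends at 0.
def line_cost(route):
    path = [0] + list(route) + [0]
    return sum(abs(a - b) for a, b in zip(path, path[1:]))

best_route, best_cost = vnd([5, 1, 4, 2, 3], line_cost, [swap_neighbors, reverse_neighbors])
```

In a traffic-aware setting, `cost` would be replaced by a function that queries a travel-time predictor for each leg; the descent loop itself would stay the same.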

Abstract:
This paper addresses the multi-robot pursuit problem for an unknown target, encompassing both target state estimation and pursuit control. First, in state estimation, we focus on using only bearing information, as it is readily available from vision sensors and effective for small, distant targets. Challenges such as instability due to the nonlinearity of bearing measurements and singularities in the two-angle representation are addressed through a proposed uniform bearing-only information filter. This filter integrates multiple 3D bearing measurements, provides a concise formulation, and enhances stability and resilience to target loss caused by limited field of view (FoV). Second, in target pursuit control within complex environments, where challenges such as heterogeneity and limited FoV arise, conventional methods like differential games or Voronoi partitioning often prove inadequate. To address these limitations, we propose a novel multiagent reinforcement learning (MARL) framework, enabling multiple heterogeneous vehicles to search, localize, and follow a target while effectively handling those challenges. Third, to bridge the sim-to-real gap, we propose two key techniques: incorporating adjustable low-level control gains in training to replicate the dynamics of real-world autonomous ground vehicles (AGVs), and proposing spectral-normalized RL algorithms to enhance policy smoothness and robustness. Finally, we demonstrate the successful zero-shot transfer of the MARL controllers to AGVs, validating the effectiveness and practical feasibility of our approach. The accompanying video is available at https://youtu.be/HO7FJyZiJ3E.

Abstract:
Depth estimation is a cornerstone of 3D reconstruction and plays a vital role in minimally invasive endoscopic surgeries. However, most current depth estimation networks rely on traditional convolutional neural networks, which are limited in their ability to capture global information. Foundation models offer a promising approach to enhance depth estimation, but those models currently available are primarily trained on natural images, leading to suboptimal performance when applied to endoscopic images. In this work, we introduce a novel fine-tuning strategy for the Depth Anything Model and integrate it with an intrinsic-based unsupervised monocular depth estimation framework. Our approach includes a low-rank adaptation technique based on random vectors, which improves the model’s adaptability to different scales. Additionally, we propose a residual block built on depthwise separable convolution to compensate for the transformer’s limited ability to capture local features. Our experimental results on the SCARED dataset and Hamlyn dataset show that our method achieves state-of-the-art performance while minimizing the number of trainable parameters. Applying this method in minimally invasive endoscopic surgery can enhance surgeons’ spatial awareness, thereby improving the precision and safety of the procedures.
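
To illustrate why a block built on depthwise separable convolution keeps the number of trainable parameters small, the sketch below compares weight counts for a standard convolution and its depthwise separable counterpart. The function names are hypothetical, biases are ignored, and the layer sizes in the example are arbitrary; the abstract does not give the actual block configuration.

```python
def standard_conv_params(c_in, c_out, k):
    """Weights of a standard k x k convolution (bias ignored)."""
    return k * k * c_in * c_out

def depthwise_separable_params(c_in, c_out, k):
    """Weights of a depthwise k x k conv followed by a 1 x 1 pointwise conv."""
    depthwise = k * k * c_in   # one k x k filter per input channel
    pointwise = c_in * c_out   # 1 x 1 convolution that mixes channels
    return depthwise + pointwise

# Example with arbitrary sizes: 64 -> 64 channels, 3 x 3 kernel.
full = standard_conv_params(64, 64, 3)
separable = depthwise_separable_params(64, 64, 3)
```

For these sizes the separable variant needs roughly one eighth of the weights of the standard convolution.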

Abstract:
Catheters and guidewires are increasingly used to navigate tortuous paths, offering minimally invasive access to deep-seated locations in the body. Steering these instruments is highly challenging, among other reasons because of poor awareness of the configuration the instrument takes on inside the body. To address this difficulty, research in physical intelligence has been conducted. The aim is to delegate part of the control problem locally and have the instrument itself determine how to interact effectively with its physical environment. To enable such smart behaviour, this paper presents a compact FBG (fiber Bragg grating) based drive system for controlling the bending of the distal tip of a steerable catheter. The design process establishes key constraints for selecting an appropriate FBG fiber based on the selected backbone’s characteristics. Force is estimated from the strain measured via the fiber with a root mean square error (RMSE) of 0.05 N. This estimate is then used to train a Long Short-Term Memory (LSTM) network, and the force prediction of the trained model is used to detect possible contact with the surroundings. The trained model was able to predict the force with an RMSE of 0.012 N in a non-contact scenario. The results indicate that the proposed system, incorporating FBG sensing, pneumatic artificial muscle (PAM) actuation, and LSTM-based contact detection, offers a promising pathway for more precise and versatile catheter manipulation in minimally invasive interventions.

Abstract:
Autonomous off-road navigation faces challenges due to diverse, unstructured environments, requiring robust perception with both geometric and semantic understanding. However, scarce densely labeled semantic data limits generalization across domains. Simulated data helps, but introduces domain adaptation issues. We propose COARSE, a semi-supervised domain adaptation framework for off-road semantic segmentation, leveraging sparse, coarse in-domain labels and densely labeled out-of-domain data. Using pretrained vision transformers, we bridge domain gaps with complementary pixel-level and patch-level decoders, enhanced by a collaborative pseudo-labeling strategy on unlabeled data. Evaluations on RUGD and Rellis-3D datasets show significant improvements of 9.7% and 8.4% respectively, versus only using coarse data. Tests on real-world off-road vehicle data in a multi-biome setting further demonstrate COARSE’s applicability.

Abstract:
This paper presents a learning-based online IMU compensation method (AGCNet) that can compensate for run-time errors of the accelerometer and gyroscope to improve inertial odometry. AGCNet employs a U-Net architecture with hybrid dilated convolutions to extract multiscale features. It also adopts skip connections and a patch-based processing strategy to aggregate local and global information. The network is trained to minimize absolute errors between integration results derived from compensated IMU data and ground truth motion states. The network utilizes IMU measurements from the current time window to correct errors in the subsequent time window, enabling sparser computations. Experiments on two public visual-inertial datasets show that AGCNet can accurately estimate the orientation from IMU measurements, outperforming existing learning-based methods. When applied to Open-VINS, AGCNet improves the accuracy of orientation estimation by an average of 29.8% and position estimation by an average of 37.3%.

Abstract:
Personalized driving refers to an autonomous vehicle’s ability to adapt its driving behavior or control strategies to match individual users’ preferences and driving styles while maintaining safety and comfort standards. However, existing works either fail to capture every individual’s preference precisely or become computationally inefficient as the user base expands. Vision-Language Models (VLMs) offer a promising solution to this problem through their natural language understanding and scene reasoning capabilities. In this work, we propose a lightweight yet effective on-board VLM framework that provides low-latency personalized driving performance while maintaining strong reasoning capabilities. Our solution incorporates a Retrieval-Augmented Generation (RAG)-based memory module that enables continuous learning of individual driving preferences through human feedback. Through comprehensive real-world vehicle experiments, our system has demonstrated the ability to provide safe, comfortable, and personalized driving experiences across various scenarios and to reduce takeover rates by up to 76.9%. To the best of our knowledge, this work represents the first personalized VLM motion control system in real-world autonomous vehicles. The demo video can be watched at https://tinyurl.com/4xsnz79n.

Abstract:
Preference-based reinforcement learning (PbRL) shows promise in aligning robot behaviors with human preferences, but its success depends heavily on the accurate modeling of human preferences through reward models. Most methods adopt Markovian assumptions for preference modeling (PM), which overlook the temporal dependencies within robot behavior trajectories that impact human evaluations. While recent works have utilized sequence modeling to mitigate this by learning sequential non-Markovian rewards, they ignore the multimodal nature of robot trajectories, which consist of elements from two distinctive modalities: state and action. As a result, they often struggle to capture the complex interplay between these modalities that significantly shapes human preferences. In this paper, we propose a multimodal sequence modeling approach for PM by disentangling state and action modalities. We introduce a multimodal transformer network, named PrefMMT, which hierarchically leverages intra-modal temporal dependencies and inter-modal state-action interactions to capture complex preference patterns. Our experimental results demonstrate that PrefMMT consistently outperforms state-of-the-art PM and direct preference-based policy learning baselines on locomotion tasks from the D4RL benchmark and manipulation tasks from the MetaWorld benchmark. Source code and supplementary information are available at https://sites.google.com/view/prefmmt.

Abstract:
Soft actuators are inherently flexible and compliant, traits that enhance their adaptability to diverse environments and tasks. However, their low structural stiffness can lead to unpredictable and uncontrollable complex deformations when substantial force is required, thereby compromising their load-bearing capacity. This work proposes a novel layer jamming method that uses bioinspired directional adhesives as interlayer films to adjust the stiffness of soft actuators. The mechanical behavior of a single tilted fibril was analyzed using the energy method to determine the adhesion force of the adhesives. The directional adhesive was designed under the guidance of the adhesion force model. Testing under various loads and directions revealed that the tilted characteristic of the fibrils can enhance the adhesion force in their grasping direction. A tunable stiffness actuator using directional adhesives (TSADA) was developed with these adhesives serving as interlayer films. The stiffness model of TSADA was derived by analyzing its axial compression force. The results of stiffness experiments indicate that the adhesives serving as interlayer films can adjust the stiffness in response to the applied load. TSADA was compared with other typical soft actuators to evaluate its stiffness performance, and the results indicate that TSADA exhibits the highest stiffness and the widest tunable stiffness range. This demonstrates the superior performance of the directional adhesives as interlayer films in terms of stiffness adjustment.

Abstract:
Robotic soil manipulation is essential for automated farming, particularly in excavation and levelling tasks. However, the nonlinear dynamics of granular materials challenge traditional control methods, limiting stability and efficiency. We propose Celebi, a causality-enhanced optimisation method that integrates differentiable physics simulation with adaptive step-size adjustments based on causal inference. To enable gradient-based optimisation, we construct a differentiable simulation environment for granular material interactions. We further define skill parameters with a differentiable mapping to end-effector motions, facilitating efficient trajectory optimisation. By modelling causal effects between task-relevant features extracted from point cloud observations and skill parameters, Celebi selectively adjusts update step sizes to enhance optimisation stability and convergence efficiency. Experiments in both simulated and real-world environments validate Celebi’s effectiveness, demonstrating robust and reliable performance in robotic excavation and levelling tasks.

Abstract:
In recent years, 3D Gaussian Splatting (3D-GS) based scene representation has demonstrated significant potential in real-time rendering and training efficiency. However, most existing methods primarily focus on single-map reconstruction, while the registration and fusion of multiple 3D-GS sub-maps remain underexplored. Existing methods typically rely on manual intervention to select a reference sub-map as a template and use point cloud matching for registration. Moreover, hard-threshold filtering of 3D-GS primitives often degrades rendering quality after fusion. In this paper, we present a novel approach for automated 3D-GS sub-map alignment and fusion, eliminating the need for manual intervention while enhancing registration accuracy and fusion quality. First, we extract geometric skeletons across multiple scenes and leverage ellipsoid-aware convolution to capture 3D-GS attributes, facilitating robust scene registration. Second, we introduce a multi-factor Gaussian fusion strategy to mitigate the scene element loss caused by rigid thresholding. Experiments on the ScanNet-GSReg and our Coord datasets demonstrate the effectiveness of the proposed method in registration and fusion. For registration, it achieves a 41.9% reduction in RRE on complex scenes, ensuring more precise pose estimation. For fusion, it improves PSNR by 10.11 dB, highlighting superior structural preservation. These results confirm its ability to enhance scene alignment and reconstruction fidelity, ensuring more consistent and accurate 3D scene representation for robotic perception and autonomous navigation.

Abstract:
Vision-and-Language Navigation in Continuous Environments (VLN-CE) presents challenges due to environmental variations and domain shifts, making it difficult for agents to generalize beyond seen environments. Most existing methods rely on learning correlations between observations and actions from training data, which leads to spurious dependencies on environmental biases. To address this, we propose CVLN-Think (CVT), a novel navigation model that incorporates causal inference to enhance robustness and adaptability. Specifically, Style Causal Adjuster (SCA) generates counterfactual style observations, enabling agents to learn invariant spatial structures rather than overfitting to dataset-specific visual patterns. Furthermore, Thinking Cause Navigation Engine (TCNE) applies causal intervention to adjust navigation decisions by identifying and mitigating biases from prior experience. Unlike conventional approaches that passively learn from data distributions, our model actively thinks along the "observation-action" chain to make more reliable navigation predictions. Experimental results demonstrate that our approach achieves satisfactory performance on VLN-CE tasks. Further analysis indicates that our method possesses stronger generalization capabilities, highlighting the superiority of our proposed approach.

Abstract:
In swarm robotics, confrontation scenarios, including strategic confrontations, require efficient decision-making that integrates discrete commands and continuous actions. Traditional task and motion planning methods separate decision-making into two layers, but their unidirectional structure fails to capture the interdependence between these layers, limiting adaptability in dynamic environments. Here, we propose a novel bidirectional approach based on hierarchical reinforcement learning, enabling dynamic interaction between the layers. This method effectively maps commands to task allocation and actions to path planning, while leveraging cross-training techniques to enhance learning across the hierarchical framework. Furthermore, we introduce a trajectory prediction model that bridges abstract task representations with actionable planning goals. In our experiments, it achieves a confrontation win rate of over 80% and a decision time under 0.01 seconds, outperforming existing approaches. Demonstrations through large-scale tests and real-world robot experiments further emphasize the generalization capabilities and practical applicability of our method.

Abstract:
Cluster tools are vital in semiconductor manufacturing, where multifunctional process modules (MPMs) enhance flexibility and efficiency. However, variable MPMs and processing times in dual-arm cluster tools (DACTs) complicate scheduling, as different MPM allocation patterns yield distinct productivity. This paper proposes a reinforcement learning-based method for DACTs with MPMs. First, an algorithm enumerates all valid MPM allocation patterns. Then, an adaptive deep Q-Network (DQN) with masking techniques selects the most efficient pattern and generates robot schedules, minimizing makespan and wafer post-processing residency time across diverse DACT configurations. Experiments validate that the proposed approach offers robust, flexible scheduling solutions to boost semiconductor manufacturing productivity.
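
The masking idea mentioned above can be sketched as a greedy action choice restricted to valid actions. This is a generic illustration, not the paper's network or scheduler: `masked_greedy_action`, the Q-values, and the validity mask are hypothetical, and how the mask is derived from the cluster-tool state is not specified in the abstract.

```python
import math

def masked_greedy_action(q_values, valid_mask):
    """Return the index of the highest-Q action among those flagged valid."""
    if not any(valid_mask):
        raise ValueError("no valid action available")
    # Invalid actions get -inf so they can never win the argmax.
    masked = [q if ok else -math.inf for q, ok in zip(q_values, valid_mask)]
    return max(range(len(masked)), key=masked.__getitem__)

# Action 0 has the largest Q-value but is invalid in this state, so action 2 is chosen.
action = masked_greedy_action([0.9, 0.2, 0.5], [False, True, True])
```

Masking at selection time keeps the agent from ever issuing an infeasible robot move, which avoids spending training effort on learning to steer clear of such moves through penalties alone.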

Abstract:
This paper considers the problem of controlling a 7 degree-of-freedom (DOF) redundant manipulator to accurately execute tasks along a desired trajectory with time-varying position and orientation while exhibiting constrained compliant behavior within the null space. The objective of this work is to extend null space impedance control from the traditional fixed point to any pose, to meet the needs of physical human-robot interaction in practical applications such as service and medical robotics. To track the desired trajectory, a Cartesian impedance controller containing the desired task variables is derived. Redundancy is then exploited to handle human-robot interaction behavior by using the designed null space impedance controller while constraining the range of elbow motion. In addition, an analytical inverse kinematics (IK) solution is employed to guarantee that the compliant behavior of the null space can be balanced at any elbow configuration. Finally, the performance of the proposed approach is verified through various experiments on a torque-controlled 7-DOF redundant manipulator.

Abstract:
Supervised behavioral cloning using robot visual-action data has been widely investigated in robot manipulation. However, these methods typically require simultaneous acquisition of visual and action data, which makes it difficult for them to utilize unpaired visual-action datasets, e.g., videos on the Internet or action-only data, which raises fewer privacy and security concerns. To take advantage of action data without synchronized visual observation, we propose UnVALe, a novel dexterous robotic manipulation RL framework that utilizes action data without paired images to learn priors of human dexterous manipulation skills. Specifically, an LSTM-based network is designed to learn the temporal action prior by reconstructing the input trajectories, and a VAE network is designed to learn the spatial action prior by reconstructing the input action. Novel rewards are proposed to incorporate the priors into reinforcement learning, encouraging the actions output by RL policies to maintain low reconstruction errors in the LSTM and VAE networks. We perform extensive validation on three dexterous robot manipulation tasks. The experimental results show that UnVALe can effectively improve robot manipulation performance. Compared with existing visual pretraining methods, our method achieves a more than 30% increase in success rates.

Abstract:
This paper presents a rigorous evaluation of Real-to-Sim parameter estimation approaches for fabric manipulation in robotics. The study systematically assesses three state-of-the-art approaches, namely two differential pipelines and a data-driven approach. We also devise a novel physics-informed neural network (PINN) approach for physics parameter estimation. These approaches are interfaced with two simulations across multiple Real-to-Sim scenarios (lifting, wind blowing, and stretching) for five different fabric types and evaluated on three unseen scenarios (folding, fling, and shaking). We found that the simulation engines and the choice of Real-to-Sim approaches significantly impact fabric manipulation performance in our evaluation scenarios. Moreover, the PINN achieves superior performance in quasi-static tasks but shows limitations in dynamic scenarios. Videos and source code are available at cvas-ug.github.io/real2sim-study.

Abstract:
Hybrid aerial underwater vehicles (HAUVs) hold great promise but face challenges like air-water integration and high underwater resistance. This paper presents Nezha-Morphing, a bio-inspired HAUV that emulates seabirds’ adaptive wing extension and retraction mechanisms for efficient movement in both aerial and aquatic domains. It has a servo-driven foldable-arm mechanism and is made of high-strength, low-mass materials with a double-system architecture for aerial and underwater control. This paper details the mechanical and electronic system design, dynamic analysis, and experimental results of Nezha-Morphing. The experimental findings are highly impressive: underwater, the folded-arm configuration effectively reduces resistance, enabling a maximum speed of 0.620 m/s and achieving a high average acceleration, significantly enhancing motion efficiency and flexible maneuverability in confined spaces. In the air, with its arms unfolded, the vehicle exhibits exceptional stability and strong wind resistance, maintaining steady flight even under level-4 wind conditions. Moreover, it completes the water-to-air cross-domain transition in just 1.5 seconds. Nezha-Morphing successfully integrates flight stability, cross-domain adaptability, and hydrodynamic efficiency, showcasing substantial potential for diverse applications.

Abstract:
Ensuring aircraft safety against terrain collisions in complex and dynamic environments remains a critical challenge in aviation. To address this, a parallel autonomy system is proposed that can take control from a human pilot to prevent a controlled flight into terrain (CFIT) collision. The proposed system operates in the background, continuously maintaining a forward-looking motion plan that can be executed immediately if a terrain collision is projected to occur unless the system intervenes in time. Terrain avoidance motion plans are rapidly generated based on the aircraft’s current state vector and a Digital Elevation Model of the surrounding terrain. The planning process involves two main steps. First, a sampling-based motion planner leverages prior knowledge acquired through generative adversarial learning to bias the search toward escape paths within the most favorable regions of Cartesian space. Second, differential flatness of the aircraft model is utilized to ensure the dynamic feasibility of the associated Cartesian space escape trajectory and to flatten it into a state-control trajectory. This converts the output tracking problem in Cartesian space into a ready-to-invoke state-feedback control.

Abstract:
Wireless Electromagnetic Tracking (WEMT) enables non-line-of-sight (NLoS) pose estimation in robotics but faces accuracy limitations from restricted operational range and environmental interference. This paper proposes a WEMT-inertial fusion system enhanced by a learning-based Interacting Multiple Model (IMMNN) to address these challenges. The framework integrates a multi-transmitter array with a WEMT-IMU fusion tracker, leveraging IMMNN to mitigate performance degradation caused by nonlinear spatial noise and motion uncertainty in dynamic, array-based environments. IMMNN employs a graph attention network to dynamically model spatial correlations among array units, adaptively optimizing state transition probabilities across motion models. A gated recurrent framework further enhances robustness by analyzing residual sequences to suppress transient noise and outliers. Experimental results demonstrate that the proposed system achieves a root-mean-square error (RMSE) of 30.4 mm over an expanded 1.9 × 1.9 m² operational area. The graph attention mechanism enables adaptive spatial noise suppression and ensures stable tracking under rapid motion and electromagnetic disturbances. By synergizing model-driven filtering with data-driven learning, IMMNN effectively improves accuracy and robustness, advancing high-precision WEMT solutions for complex robotic applications.

Abstract:
Accurate positioning is essential for autonomous driving, but localization using 2D maps is challenging due to the domain gap between the perspective view and the 2D map, while GNSS accuracy is often limited by atmospheric effects, multipath, and signal blockages. We propose a novel positioning method that combines perspective view images with satellite images retrieved based on rough GNSS positions to achieve precise three-degree-of-freedom (3-DoF) pose estimation. Our method leverages the Swin Transformer for satellite image processing and semantic completion for monocular image analysis. By extracting depth and semantic information from monocular images, we convert these to overhead projections, effectively bridging the gap between different viewpoints. This cross-view transformation allows for precise alignment of features from monocular images onto semantically enriched satellite images. Additionally, we integrate a robust global position estimator using the semantic information from satellite images to further enhance accuracy and robustness. The experimental results demonstrate that our method performs well in various complex scenarios: it raises the proportion of position estimates within 1 m to 80.67% and of heading estimates within 1° to 33.78%. However, longitudinal localization remains more challenging, with higher errors than lateral positioning.

Abstract:
Edge devices for robotics in hazardous environments, such as rescue drones, navigate complex terrains while transmitting images to remote servers for anomaly detection, including wildfires. However, these devices operate under strict resource constraints, prioritizing operational-critical tasks (e.g., autonomous navigation) while handling image-processing workloads with minimal overhead. Offloading computation to a remote server can alleviate this burden, but unstable network conditions can degrade accuracy and timeliness. To address these challenges, this paper presents a novel offloading framework that balances computational efficiency and accuracy in image-processing tasks. Specifically, it (R1) ensures a minimum accuracy level for individual image-processing tasks associated with different camera sensors and (R2) maximizes the overall image-processing accuracy across all sensors. Our approach builds on an edge-server collaborative image reconstruction architecture, where images are divided into patches and selectively reconstructed. To achieve R1 and R2, we introduce: (i) a hierarchical scheduler that effectively prioritizes patch transmissions under resource constraints and (ii) a feedback mechanism that adapts to network instability, ensuring reliable offloading and inference. Experimental results demonstrate that our framework maintains high accuracy and timely processing, even under network failures.

Abstract:
Flexible microelectrode (FME) implantation into the brain cortex is challenging due to the deformable fiber-like structure of the FME probe and its interaction with critical bio-tissue. To ensure reliability and safety, the implantation process should be monitored carefully. This paper develops an image-based anomaly detection framework based on the microscopic cameras of the robotic FME implantation system. The unified framework is utilized at four checkpoints to check the micro-needle, FME probe, hooking result, and implantation point, respectively. Exploiting the existing object localization results, the aligned regions of interest (ROIs) are extracted from the raw image and input to a pretrained vision transformer (ViT). Considering the task specifications, we propose a progressive granularity patch feature sampling method to address the sensitivity-tolerance trade-off at different locations. Moreover, we select a subset of feature channels with higher signal-to-noise ratios from the raw general ViT features to provide better descriptors for each specific scene. The effectiveness of the proposed methods is validated with the image datasets collected from our implantation system.

Abstract:
Repetitive motion control (RMC) for redundant manipulators has been extensively studied from the kinematic perspective, whereas security concerns under malicious adversaries have received limited attention. In network-controlled manipulators, when control commands sent from the control center to the remote manipulator are subject to false data injection attacks (FDIAs), serious incidents and potential harm to individuals can occur. This paper proposes a novel resilient controller such that the manipulator can successfully complete motion tracking tasks and address the non-repetitive motion problem, even in the presence of FDIAs. The problem is first reformulated as a convex optimization problem with an unknown parameter relative to FDIAs, where the RMC criterion serves as the objective function and physical limitations are incorporated as inequality constraints. A recurrent neural network (RNN) is then introduced to solve the problem, improving computational efficiency. Additionally, a detection mechanism is integrated to estimate the unknown attack parameter, allowing the RNN to find the optimal control command. Simulations and experiments are conducted on an RM65-B manipulator to validate the efficacy of the proposed method, and comparisons with existing approaches highlight its superior performance.

Abstract:
A 3-D multiple sound source localization (SSL) method is proposed in this work, which uses independent vector analysis (IVA) combined with an elaborately designed five-element microphone array. The proposed method separates the source signals by IVA. Then, for each separated signal, four time-difference-of-arrival (TDOA) values are obtained. With the five-element microphone array, the four TDOAs are used to localize each source separately by an analytical solution. To meet the prerequisite of IVA, the properties of the mixing matrix with the five-element microphone array are analyzed and studied for 2 and 3 sound sources. It is proved that the configuration of the five-element microphone array can avoid the ill condition where the mixing matrix at each frequency bin has linearly dependent columns. To reduce the computation cost of IVA, as many microphones as there are sound sources are selected from the array, and their signals are used for audio source separation. Meanwhile, for calculation stability, the selected microphones are required to minimize the condition number of the microphone signal covariance matrix. To investigate the effectiveness and the localization performance of the proposed method, a practical five-element microphone array is used and 3-D multiple SSL experiments are carried out. The TDOA values are obtained by the generalized cross correlation based on the phase transform (GCC-PHAT). The experimental results show that the proposed method is effective and the maximum root mean square error of localization is less than 3 cm. Compared with the conventional methods, the proposed method has the advantages of lower computation cost and fewer microphones, and can locate sources close to each other.
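
The GCC-PHAT step named in the abstract above can be illustrated with a minimal, dependency-free sketch. This is not the authors' implementation: the function names `dft` and `gcc_phat`, the sampling rate, and the test signal are all assumptions made for illustration, and the naive O(n^2) transform would normally be replaced by an FFT routine.

```python
import cmath
import random


def dft(x, inverse=False):
    """Naive O(n^2) discrete Fourier transform (swap in an FFT for real use)."""
    n = len(x)
    sign = 1 if inverse else -1
    out = []
    for k in range(n):
        acc = sum(x[t] * cmath.exp(sign * 2j * cmath.pi * k * t / n) for t in range(n))
        out.append(acc / n if inverse else acc)
    return out


def gcc_phat(sig, ref, fs):
    """Estimate the delay of `sig` relative to `ref` in seconds (positive: sig lags)."""
    n = len(sig) + len(ref)  # zero-pad so circular correlation acts like linear correlation
    spec_sig = dft(list(sig) + [0.0] * (n - len(sig)))
    spec_ref = dft(list(ref) + [0.0] * (n - len(ref)))
    # Cross-power spectrum, then PHAT weighting: keep the phase, discard the magnitude.
    cross = [a * b.conjugate() for a, b in zip(spec_sig, spec_ref)]
    whitened = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    corr = [v.real for v in dft(whitened, inverse=True)]
    # Index k is lag k for k < n/2 and lag k - n otherwise (negative lags wrap around).
    peak = max(range(n), key=lambda k: corr[k])
    lag = peak if peak < n // 2 else peak - n
    return lag / fs


# Hypothetical test signal: white noise and a copy delayed by 5 samples.
random.seed(0)
fs = 16000  # assumed sampling rate in Hz
ref = [random.gauss(0.0, 1.0) for _ in range(64)]
sig = [0.0] * 5 + ref
tau = gcc_phat(sig, ref, fs)
print(tau * fs)  # estimated delay expressed in samples
```

In a five-microphone setting such as the one described, one such estimate per microphone pair against a common reference microphone would give the four TDOA values; how those are combined into a position is specific to the paper's analytical solution and is not reproduced here.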

Abstract:
Point cloud classification and segmentation are fundamental tasks in 3D computer vision. Recently, deep learning-based methods, particularly 3D Transformers, have demonstrated their effectiveness across a variety of point cloud tasks. However, transformer-based methods embed position information into feature vectors, which can introduce a significant computational cost. Additionally, these approaches often struggle to adaptively extract different features across varying receptive fields, which limits their performance in various tasks. To address these challenges, we propose a novel Multi-Scale Parallel-Channel Self-Attention (MuSPaCSA) network, designed with a multi-scale feature extraction architecture by stacking Parallel-Channel Self-Attention (PaCSA) layers for classification and segmentation tasks. Specifically, our MuSPaCSA employs the PaCSA module to extract essential semantic and spatial features. The core components of the PaCSA module include the Semantic-Spatial Integration (SSI) and Adaptive Self-Attention (ASA) modules. The SSI module employs a parallel-channel approach to integrate semantic and spatial information, enabling the representation of high-dimensional structural features in point clouds. The ASA module calculates adaptive weights to aggregate rich, high-dimensional structural features from neighboring nodes in a lightweight manner. Through the multi-scale feature fusion architecture of MuSPaCSA, local and global features, as well as semantic and spatial features, are effectively integrated, significantly enhancing the model’s representational capacity. Extensive experiments demonstrate that our model achieves superior performance and results with lower computational cost compared to competing methods.

Abstract:
Microassembly plays an important role in fabricating complex structures with small basic components in industrial and biomedical fields. An inverted optical microscope can provide high-quality image feedback for microassembly with its continuously improving resolution. However, a compact stage capable of positioning and reorienting micro-objects while fitting within the limited space under an inverted optical microscope remains unavailable. This paper proposes a compact R-X-Y stage that can transport micro-objects over long distances in the X and Y directions and reorient the objects through 360-degree continuous rotation. Additionally, unlike the common approach of putting the rotational stage on the X-Y stage, we mount the thin X-Y stage on a rotational stage. Thus, after the centers of the visual field and the rotational stage are aligned at the beginning, none of the visible micro-objects will move out of the visual field during the rotation. We further integrate the R-X-Y stage with a dual-finger micromanipulator, and then use them to assemble 2-D patterns and a complex 3-D micromachine. The obtained results and preliminary demonstration indicate that the proposed compact R-X-Y stage has great potential in assembling complex micromachines.

Abstract:
Multi-vehicle trajectory planning (MVTP) is one of the key challenges in multi-robot systems (MRSs) and has broad applications across various fields. This paper presents ESCoT, an enhanced step-based coordinate trajectory planning method for multiple car-like robots. ESCoT incorporates two key strategies: collaborative planning for local robot groups and replanning for duplicate configurations. These strategies effectively enhance the performance of step-based MVTP methods. Through extensive experiments, we show that ESCoT 1) in sparse scenarios, significantly improves solution quality compared to the baseline step-based method, achieving up to 70% improvement in typical conflict scenarios and 34% in randomly generated scenarios, while maintaining high solving efficiency; and 2) in dense scenarios, outperforms all baseline methods, maintaining a success rate of over 50% even in the most challenging configurations. The results demonstrate that ESCoT effectively solves MVTP, further extending the capabilities of step-based methods. Finally, practical robot tests validate the algorithm’s applicability in real-world scenarios.

Abstract:
Contact-rich problems, such as snake robot locomotion, offer unexplored yet rich opportunities for optimization-based trajectory and acyclic contact planning. So far, a substantial body of control research has focused on emulating snake locomotion and replicating its distinctive movement patterns using shape functions that either ignore the complexity of interactions or focus on complex interactions with matter (e.g., burrowing movements). However, models and control frameworks that lie in between these two paradigms and are based on simple, fundamental rigid body dynamics, which alleviate the challenging contact and control allocation problems in snake locomotion, remain absent. This work makes meaningful contributions, substantiated by simulations and experiments, in the following directions: 1) introducing a reduced-order model based on Moreau’s stepping-forward approach from differential inclusion mathematics, 2) verifying model accuracy, and 3) validating the approach experimentally.

Abstract:
The existing language-driven grasping methods struggle to fully handle ambiguous instructions containing implicit intents. To tackle this challenge, we propose LangGrasp, a novel language-interactive robotic grasping framework. The framework integrates fine-tuned large language models (LLMs) to leverage their robust commonsense understanding and environmental perception capabilities, thereby deducing implicit intents from linguistic instructions and clarifying task requirements along with target manipulation objects. Furthermore, our designed point cloud localization module, guided by 2D part segmentation, enables partial point cloud localization in scenes, thereby extending grasping operations from coarse-grained object-level to fine-grained part-level manipulation. Experimental results show that the LangGrasp framework accurately resolves implicit intents in ambiguous instructions, identifying critical operations and target information that are unstated yet essential for task completion. Additionally, it dynamically selects optimal grasping poses by integrating environmental information. This enables high-precision grasping from object-level to part-level manipulation, significantly enhancing the adaptability and task execution efficiency of robots in unstructured environments. More information and code are available here: https://github.com/wu467/LangGrasp.

Abstract:
This paper presents MelumiTac, a vision-based tactile (ViTac) sensor enhanced with mechanoluminescent (ML) materials that emit green light under dynamic tactile stimuli. The integration of an ML elastomer generates self-illumination in response to dynamic tactile stimuli, enabling direct visualization of both dynamic tactile events and nociceptive responses while simultaneously tracking deformation in real time. Experimental evaluations involving cyclic loading, in-plane motion, and piercing reveal a strong correlation between ML emission, stress rate, and localized deformation, thereby validating its multi-modal tactile sensing capabilities. Additionally, frame-by-frame analysis offers rich insights into the contact dynamics during physical interactions. These improvements, implemented within the small form factor of a conventional ViTac sensor, render the approach highly accessible. Thus, we expect that the proposed solution will offer practical and unique advantages to engineers developing and applying vision-based multi-modal tactile sensors.

Abstract:
Performing shotcrete operations at construction sites can be hazardous to humans and inefficient. Robots can offer a safer and more efficient alternative to assist in these tasks. We present a new planning strategy for shotcrete robots, including both the spraying and surface finishing phases, that can plan for a general target area, whether flat or complexly curved. Our method uses learning from demonstrations and dynamical systems concepts to enable reactive and adaptive planning for robots, allowing them to effectively handle disturbances. We evaluated the effectiveness of the proposed planning and control framework in a laboratory setup using a velocity-controlled robot and curved targets both in the spraying and polishing phases. The results demonstrate the effectiveness of the proposed approach.

Abstract:
In this paper, we investigate the feasibility of using knowledge graphs to interpret actions and behaviors for robot manipulation control. Equipped with an uncalibrated visual servoing controller, we propose to use robot knowledge graphs to unify behavior trees and geometric constraints, conceptualizing robot manipulation control as semantic events. The robot knowledge graphs not only preserve the advantages of behavior trees in scripting actions and behaviors, but also offer additional benefits of mapping natural interactions between concepts and events, which enable knowledgeable explanations of the manipulation contexts. Through real-world evaluations, we demonstrate the flexibility of the robot knowledge graphs to support explainable robot manipulation control.

Abstract:
Zebrafish are widely used in the biomedical field as an ideal model for microinjection. In automated zebrafish microinjection, posture adjustment is the first and key step, which takes a lot of skill, and injection success assessment is a challenging task. Constrained by these two aspects, it is difficult to further enhance the efficiency and success rate of injection. In this study, we propose an automated dual-micropipette coordination microinjection system. Zebrafish are randomly arranged in our system, reducing the operational difficulty, and the yolk is positioned using a pose estimation algorithm, followed by injection accomplished with the dual micropipettes. Because the posture adjustment time is reduced by half, the proposed system achieves the shortest injection time of 15.2 s. Moreover, the simplicity of the system and the ease of operation contribute to the clinical feasibility of our system.

Abstract:
Continuum robots are widely used in medical scenarios due to their dexterity and flexibility. However, precise end-to-end control of continuum robots remains challenging, limited by kinematic or kinetostatic accuracy and by insufficient space for additional sensor configurations. This paper proposes a precise position control method for fiber-driven continuum robots using the shape reconstructed from distributed force sensing on the same fibers, where the optical fibers serve for both robot actuation and force sensing simultaneously without requiring additional sensors. First, we use single-core optical fibers (SCFs) as the actuation cables of the continuum robot, and each fiber has multiple fiber Bragg grating (FBG) sensors inscribed on it to sense the distributed force along the entire cable. Then, the forward kinetostatics model of the fiber-driven continuum robot is established using the known distributed forces as the inputs. Notably, the nonlinear friction between the cables and actuation channels does not require an additional estimation model. Benefiting from this, the shape can be accurately reconstructed after the stiffness calibration of the continuum robot. Finally, a position controller based on real-time shape feedback is developed to achieve tip position control of the continuum robot. Experimental results demonstrate that the proposed forward kinetostatics model can achieve shape reconstruction with errors of 0.45 mm and 0.57 mm in planar bending and spatial bending states, respectively. Compared with the traditional constant curvature kinematics-based control method, the proposed method achieves mean absolute errors of 0.37 mm and 0.6 mm in two distinct path tracking tests. The proposed method, using distributed force sensing, enables real-time, accurate position feedback control in combination with the kinetostatic model, without modelling the nonlinear friction or adding external sensors.

Abstract:
Although modular robots with snake-like structures have been proposed in previous studies, very few have achieved separation and docking without direct human intervention. Based on prior research on conventional snake robots and modular robots, we propose a modular snake robot composed of a minimum module consisting of five links and four joints. This robot is capable of rotating and translating in a two-dimensional plane even as a single module. Each module is equipped with a unique detachable and dockable structure using magnets and mechanical hooks, and reconfiguration between modules is realized through these motions. Through field experiments, we verified the reproducibility of the detachment and docking actions, as well as the locomotion capability of each module in the detached state.

Abstract:
This paper introduces an anthropomorphic robot hand built entirely using LEGO MINDSTORMS: the Educational SoftHand-A, a tendon-driven, highly-underactuated robot hand based on the Pisa/IIT SoftHand and related hands. To be suitable for an educational context, the design is constrained to use only standard LEGO pieces with tests using common equipment available at home. The hand features dual motors driving an agonist/antagonist opposing pair of tendons on each finger, which are shown to result in reactive fine control. The finger motions are synchronized through soft synergies, implemented with a differential mechanism using clutch gears. Altogether, this design results in an anthropomorphic hand that can adaptively grasp a broad range of objects using a simple actuation and control mechanism. Since the hand can be constructed from LEGO pieces and uses state-of-the-art design concepts for robotic hands, it has the potential to educate and inspire children to learn about the frontiers of modern robotics.

Abstract:
Localizing gas sources is a challenging task due to the complex nature of gas dispersion. Informative Path Planning (IPP) plays a crucial role in guiding robots to sample at high-information positions, thereby accelerating the estimation process. Existing probabilistic gas source localization methods often require robots to halt at sampling positions, averaging gas measurements over time. Consequently, when selecting the next sampling position, information gains are usually computed precisely through computationally heavy procedures, limiting evaluations to a small set of potential positions. In our previous work, we introduced a sense-in-motion strategy that eliminates the need for prolonged stops at sampling points, therefore allowing the incorporation of measurements taken during robot movement. Building upon this advancement, we propose to extend information gain evaluation in a more continuous manner, from a point evaluation to a path evaluation. However, existing IPP methods are too computationally expensive when transitioning from goal-based to region-based evaluations. To address this challenge, we first assess three lightweight information extraction metrics. Based on the selected metrics, we propose a novel IPP algorithm that computes cumulative information gain along the robot’s path and dynamically prioritizes exploration or exploitation based on the uncertainty of the source estimation. The proposed method is extensively evaluated through both high-fidelity simulations and physical experiments. Results show that our proposed method consistently outperforms a benchmark state-of-the-art method, achieving a 40% increase in source localization success rate and halving the experimental time in challenging environments.

Abstract:
Utilizing robots for autonomous target search in complex and unknown environments can greatly improve the efficiency of search and rescue missions. However, existing methods have shown inadequate performance due to hardware platform limitations, inefficient viewpoint selection strategies, and conservative motion planning. In this work, we propose HEATS, which enhances the search capability of mobile manipulators in complex and unknown environments. We design a target viewpoint planner tailored to the strengths of mobile manipulators, ensuring efficient and comprehensive viewpoint planning. Supported by this, a whole-body motion planner integrates global path search with local IPC optimization, enabling the mobile manipulator to safely and agilely visit target viewpoints, significantly improving search performance. We present extensive simulated and real-world tests, in which our method demonstrates reduced search time, higher target search completeness, and lower movement cost compared to classic and state-of-the-art approaches. Our method will be open-sourced for community benefit.

Abstract:
To address the challenges associated with complex manufacturing processes and the difficulties in batch production of insect-scale robots, a mechatronic origami mechanism applied to an insect-scale parallel-legged structure is designed, manufactured, and tested. The origami mechanism is constructed using a multilayer composite laminate, which allows for the integrated fabrication of robotic hinges, linkages, and actuators. With the origami mechanism, it becomes feasible to fold and create the insect-scale parallel-legged structure. This enables the rapid assembly of various types of insect-scale robots, including monopods, bipeds, quadrupeds, and hexapods. We built an experimental prototype and test environments to validate the kinematic performance of the insect-scale parallel-legged structure. The monopod robot, weighing 200 mg and featuring the parallel leg, can rotate around the center of an adaptive rotating platform at a speed of 5 cm/s. The bipedal robot can navigate the rotating platform by performing alternating leg swings. The quadrupedal robot, designed with four parallel-legged structures, exhibited a movement speed of 1.9 cm/s when actuated at a frequency of 20 Hz. In contrast, the hexapod robot achieved a higher speed of 3.25 cm/s under the same actuation frequency of 20 Hz. The origami mechanism and the insect-scale parallel-legged structure provide a new method for the design and fabrication of insect-scale robots.

Abstract:
Goal assignment is a critical challenge in multi-robot systems. The emergence of large language models (LLMs) has enabled the use of natural language commands for tackling goal assignment problems. However, applying LLMs directly to these tasks presents two limitations: 1) limited accuracy and 2) excessive decision delays due to their autoregressive nature, hindering adaptability to unexpected changes. To address these issues, inspired by dual-process theory, we propose a framework called Collaborative LLMs for dynamic Goal Assignment (CLGA). Specifically, we leverage LLMs for pre-planning tasks and invoke an external solver to generate an initial goal assignment solution, ensuring solution accuracy. During execution, small-scale models enable real-time adjustments to respond to dynamic environmental changes. This approach integrates the strengths of slow, precise pre-planning and fast, adaptive online adjustments, allowing agents to efficiently handle real-world challenges. Additionally, we introduce a benchmark dataset for NLP-based goal assignment to advance research in this domain. Simulation and real-world experiments demonstrate that CLGA significantly enhances task execution efficiency and flexibility in multi-robot systems. The prompt, experimental videos, and datasets associated with this work are available at https://sites.google.com/view/project-clga/.
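
To make the notion of an initial goal assignment concrete, the following is a minimal sketch under stated assumptions. The abstract does not say which external solver CLGA invokes or what cost it optimizes; here the hypothetical function `assign_goals` simply enumerates all assignments and minimizes the total Euclidean distance, which is only practical for a handful of robots.

```python
import math
from itertools import permutations


def assign_goals(robots, goals):
    """Return (assignment, cost): assignment[i] is the goal index given to robot i.

    Exhaustive search minimizing total straight-line distance. This stands in
    for an unspecified external solver and assumes len(goals) >= len(robots).
    """
    best_cost, best_perm = math.inf, None
    for perm in permutations(range(len(goals)), len(robots)):
        cost = sum(math.dist(robots[i], goals[g]) for i, g in enumerate(perm))
        if cost < best_cost:
            best_cost, best_perm = cost, perm
    return list(best_perm), best_cost


# Hypothetical 2-D positions for two robots and two goals.
robots = [(0.0, 0.0), (10.0, 0.0)]
goals = [(9.0, 0.0), (1.0, 0.0)]
assignment, cost = assign_goals(robots, goals)
print(assignment, cost)  # robot 0 takes goal 1, robot 1 takes goal 0
```

For larger teams, a polynomial-time assignment method such as the Hungarian algorithm would be the usual replacement for the exhaustive loop.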

Abstract:
Ensuring connectivity and coordination in heterogeneous multi-robot systems (MRS) navigating complex environments is a critical challenge, especially when communication constraints and obstacles cause robots to become lost or disconnected. This paper presents a novel approach integrating Model Predictive Control (MPC) with Generalized Connectivity Maintenance (GCM) to enable real-time path adaptation while preserving connectivity. We introduce a decentralized decision-making framework that enables robots to recover lost members dynamically. When reconnection is infeasible, the system adapts the mission to continue while accounting for disconnected robots. Our method is evaluated through extensive simulations, showing its scalability and effectiveness in maintaining connectivity and ensuring mission success. Additionally, we propose a new evaluation metric that comprehensively assesses system performance, considering connectivity, coordination, and mission success in challenging environments.

Abstract:
Autonomous fine manipulation in space for orbital assembly continues to present a critical challenge in the field of aerospace engineering. Under low-gravity conditions, during satellite manipulator operations on free-floating objects, the absence of significant gravitational forces and friction constraints leads to unpredictable relative motions between the manipulator’s end-effectors and objects, degrading the manipulation performance. This study proposes a fine manipulation method for satellite robots with floating platforms, grounded in multimodal perception and an enhanced Action Chunking with Transformer (ACT) architecture that enables multimodal state interaction. By integrating visual, tactile, and depth sensory data, the satellite robot’s space fine manipulation capabilities are substantially improved. In experiments conducted in simulated environments, the proposed method achieves a success rate exceeding 80% on peg-in-socket insertion tasks, outperforming conventional approaches, which reach a success rate of approximately 45%. Project Website: https://github.com/LSY0528/VDTF-ACT.

Abstract:
This paper introduces EmbodiedAgent, a hierarchical framework for heterogeneous multi-robot control. EmbodiedAgent addresses critical limitations of hallucination in impractical tasks. Our approach integrates a next-action prediction paradigm with a structured memory system to decompose tasks into executable robot skills while dynamically validating actions against environmental constraints. We present MultiPlan+, a dataset of more than 18,000 annotated planning instances spanning 100 scenarios, including a subset of impractical cases to mitigate hallucination. To evaluate performance, we propose the Robot Planning Assessment Schema (RPAS), combining automated metrics with LLM-aided expert grading. Experiments demonstrate EmbodiedAgent’s superiority over state-of-the-art models, achieving a 71.85% RPAS score. Real-world validation in an office service task highlights its ability to coordinate heterogeneous robots for long-horizon objectives.

Abstract:
Recent advancements in quadruped robot research have significantly improved their ability to traverse complex and unstructured outdoor environments. However, the issue of noise generated during locomotion is generally overlooked, which is critically important in noise-sensitive indoor environments, such as service and healthcare settings, where maintaining low noise levels is essential. This study aims to optimize the acoustic noise generated by quadruped robots during locomotion through the development of advanced motion control algorithms. To achieve this, we propose a novel approach that minimizes noise emissions by integrating optimized gait design with tailored control strategies. This method achieves an average noise reduction of approximately 8 dBA during movement, thereby enhancing the suitability of quadruped robots for deployment in noise-sensitive indoor environments. Experimental results demonstrate the effectiveness of this approach across various indoor settings, highlighting the potential of quadruped robots for quiet operation in noise-sensitive environments.

Abstract:
This paper presents a novel event-based depth sensing system with line laser scanning. Our main contribution involves both hardware and software improvements over previous state-of-the-art works. The polygon mirror scanner is designed to steer the line laser at a constant velocity, which minimizes the non-linearity of the projected time map to improve depth precision. A piecewise linear model, which is simple and easy to calibrate, is then proposed to model the behavior of the scanner. The corresponding reconstruction pipeline produces depth maps at high speed with an efficient plane-ray intersection-based depth calculation. Experimental results verify that the approach is capable of realizing 0.6 mm precision at a distance of 500 mm and 8.3 ms depth reconstruction runtime on embedded platforms.
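
The plane-ray intersection mentioned above can be sketched as follows. This is an illustrative reconstruction of the general geometry, not the paper's pipeline: it assumes a pinhole camera with intrinsics `(fx, fy, cx, cy)` and assumes the laser plane for a given event has already been recovered (for example from the event timestamp via the scanner model) as `n . X = d` in the camera frame. The function name and all numeric values are hypothetical.

```python
def ray_plane_point(pixel, intrinsics, plane_normal, plane_offset):
    """Intersect the camera ray through `pixel` with the plane n . X = d.

    Returns the 3-D point (x, y, z) in the camera frame, or None when the ray
    is parallel to the plane or the intersection lies behind the camera.
    """
    u, v = pixel
    fx, fy, cx, cy = intrinsics
    # Back-project the pixel to a ray direction with unit z component.
    ray = ((u - cx) / fx, (v - cy) / fy, 1.0)
    denom = sum(n * r for n, r in zip(plane_normal, ray))
    if abs(denom) < 1e-12:
        return None
    t = plane_offset / denom  # because ray z = 1, t is also the depth
    if t <= 0:
        return None
    return tuple(t * r for r in ray)


# Hypothetical numbers: a laser plane x = 0.1 m seen by a 640x480 pinhole camera.
intrinsics = (500.0, 500.0, 320.0, 240.0)
point = ray_plane_point((420.0, 240.0), intrinsics, (1.0, 0.0, 0.0), 0.1)
print(point)  # the third component is the depth in metres
```

Each event needs only one dot product and one division, which is consistent with the low reconstruction runtime the abstract reports, although the actual implementation details are not given there.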

Abstract:
Self-assembly planning for modular robots is critical for constructing functional structures, yet existing methods often suffer from inefficiency, poor scalability, or collision risks. This paper presents an innovative framework that formulates modular robot self-assembly as a time-varying online Multi-Agent Path Finding (MAPF) problem and resolves it through an enhanced Time-Expanded Network (TEN). Key modifications are introduced to handle the dynamic nature of the self-assembly process, including the varying number of agents and evolving target configurations. Simulations conducted with hexagonal modular robots demonstrate that the proposed algorithm significantly outperforms the benchmark A*-based approach in terms of both assembly efficiency and success rate across various target configurations. The proposed approach establishes a scalable planning framework for modular robot self-assembly, with future extensions toward real-world validation.

Abstract:
Agricultural harvesting requires careful handling of delicate crops, a challenge often unmet by traditional machinery. To address this need, this paper presents an intelligent baromorphic end-effector that marks a significant innovation in agricultural technology. This novel architecture integrates sensing elements that enable the system to dynamically adjust to different crop shapes and textures, ensuring a gentle touch that minimizes damage. The design and testing of the end-effector, which uses flexible materials, an embedded air channel, and an embedded sensor to modulate grip adaptability, are presented. The manufacturing approach focuses on using advanced techniques to ensure the end-effector’s durability and scalability. The embedded sensor provides real-time data feedback, allowing constant adjustments that enhance both safety and efficiency during the harvesting process. Through extensive simulation and experimental testing, the system has demonstrated its features.

Abstract:
Robust and accurate proprioceptive state estimation of the main body is crucial for legged robots to execute tasks in extreme environments where exteroceptive sensors, such as LiDARs and cameras, may become unreliable. In this paper, we propose DogLegs, a state estimation system for legged robots that fuses the measurements from a body-mounted inertial measurement unit (Body-IMU), joint encoders, and multiple leg-mounted IMUs (Leg-IMU) using an extended Kalman filter (EKF). The filter system contains the error states of all IMU frames. The Leg-IMUs are used to detect foot contact, thereby providing zero-velocity measurements to update the state of the Leg-IMU frames. Additionally, we compute the relative position constraints between the Body-IMU and Leg-IMUs by the leg kinematics and use them to update the main body state and reduce the error drift of the individual IMU frames. Field experimental results have shown that our proposed DogLegs system achieves better state estimation accuracy compared to the traditional leg odometry method (using only Body-IMU and joint encoders) across various terrains. We make our datasets publicly available to benefit the research community (https://github.com/YibinWu/leg-odometry).
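
The zero-velocity measurement idea above can be illustrated with a deliberately simplified linear Kalman update. DogLegs itself uses an error-state EKF over several IMU frames, which is not reproduced here; this sketch assumes a toy one-axis state `[position, velocity]`, a hypothetical function name, and made-up covariance and noise values.

```python
def zero_velocity_update(x, P, R=1e-2):
    """Kalman update for state x = [position, velocity] with measurement v = 0.

    x is a 2-element list, P a 2x2 covariance (list of lists), and R the
    assumed variance of the zero-velocity pseudo-measurement.
    """
    # Measurement matrix H = [0, 1] picks out the velocity component.
    innovation = 0.0 - x[1]
    S = P[1][1] + R                     # innovation variance H P H^T + R
    K = [P[0][1] / S, P[1][1] / S]      # Kalman gain P H^T / S
    x_new = [x[0] + K[0] * innovation, x[1] + K[1] * innovation]
    # Covariance update P <- (I - K H) P, written out for the 2x2 case.
    P_new = [[P[i][j] - K[i] * P[1][j] for j in range(2)] for i in range(2)]
    return x_new, P_new


# Hypothetical drifted estimate at the moment a foot contact is detected.
x = [1.0, 0.2]
P = [[0.5, 0.1], [0.1, 0.04]]
x_new, P_new = zero_velocity_update(x, P)
print(x_new, P_new)
```

Because position and velocity errors are correlated through `P`, the update pulls the velocity toward zero and also corrects the position, which is the mechanism by which contact-triggered zero-velocity measurements limit drift.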

Abstract:
Collaborative robots often exhibit limited absolute accuracy despite high repeatability, necessitating cost-effective calibration solutions. This paper presents a novel sensor-free self-calibration method for collaborative robots using position and distance constraints. A tri-sphere end-effector with precision balls and magnetic holders enables repeatable Tool Center Point (TCP) positioning (<0.01 mm) through hand-guiding, where the three-sphere configuration crucially enhances the orientation calibration accuracy compared to a single-sphere approach. The proposed device eliminates expensive external sensors while establishing geometric constraints through workspace-wide TCP engagements. By analyzing relative position/distance errors between multiple configurations, the method identifies kinematic parameters via a Local Product of Exponentials (Local POE) based error model. Experiments demonstrated a 91.7% position error reduction (7.98 mm to 0.66 mm) and a 69.6% orientation improvement (0.0069 rad to 0.0021 rad), achieving accuracy comparable to laser-tracker methods at <1% of the device cost. This approach offers a low-cost, mechanically robust solution for enhancing collaborative robot accuracy in industrial applications.
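
The two percentages quoted above follow directly from the before and after error values given in the abstract. The short check below uses only those four numbers; the helper name `percent_reduction` is introduced here for illustration.

```python
def percent_reduction(before, after):
    """Relative reduction from `before` to `after`, in percent."""
    return 100.0 * (before - after) / before


position = percent_reduction(7.98, 0.66)         # position error in mm
orientation = percent_reduction(0.0069, 0.0021)  # orientation error in rad
print(f"{position:.1f}% {orientation:.1f}%")     # prints "91.7% 69.6%"
```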

Abstract:
The use of autonomous systems in medical evacuation (MEDEVAC) scenarios is promising, but existing implementations overlook key insights from human-robot interaction (HRI) research. Studies on human-machine teams demonstrate that human perceptions of a machine teammate are critical in governing the machine’s performance. Consequently, it is essential to identify the factors that contribute to positive human perceptions in human-machine teams. Here, we present a mixed factorial design to assess human perceptions of a MEDEVAC robot in a simulated evacuation scenario. Participants were assigned to the role of casualty (CAS) or bystander (BYS) and subjected to three within-subjects conditions based on the MEDEVAC robot’s operating mode: autonomous-slow (AS), autonomous-fast (AF), and teleoperation (TO). During each trial, a MEDEVAC robot navigated an 11-meter path, acquiring a casualty and transporting them to an ambulance exchange point while avoiding an idle bystander. Following each trial, subjects completed a questionnaire measuring their emotional states, perceived safety, and social compatibility with the robot. Results indicate a consistent main effect of operating mode on reported emotional states and perceived safety. Pairwise analyses suggest that the employment of the AF operating mode negatively impacted perceptions along these dimensions. There were no persistent differences between CAS and BYS responses.

Abstract:
Long-horizon composite task planning for multi-robot systems in cross-regional complex scenarios faces dual challenges: spatial-semantic comprehension of tasks described in natural language and collaborative optimization of subtask allocation. To address these challenges, this paper proposes a progressive three-stage task planning framework. First, an augmented scene graph is constructed to enable large language models (LLMs) to comprehend environmental structures, thereby generating simplified Linear Temporal Logic (LTL) task sequences. Subsequently, a novel heuristic function is employed to select optimal task allocation plans. Finally, LLMs are used to generate low-level executable robot instructions based on robotic system instruction templates. We establish a long-horizon composite task dataset for experimental validation on real-world quadrupedal multi-robot systems. Experimental results demonstrate the effectiveness of our approach in resolving cross-regional composite tasks.

Abstract:
Existing robots designed for locomotion in granular media typically excel at a single purpose, either surface travel or subsurface digging, while lacking the ability to perform both within the same platform. In contrast, nature offers various examples of burrowing organisms that exhibit multi-functional digging behaviors by separating their body into two essential parts: a digger for substrate intrusion and the rest of the body as an anchor for stabilization and for controlling digger orientation. Inspired by these biological strategies, we present an extension to an existing Screw Propelled Vehicle (SPV) that incorporates an adjustable body anchor to reduce drag and enable orientation control. This integration allows the robot to transition between horizontal crawling and vertical digging. We also investigate the effect of local fluidization (LF), a bio-inspired technique that temporarily reduces the resistive forces in granular media. Experimental results show that integrating LF improves surface propulsion performance in terms of speed and depth, with an increase of over 5x compared to the baseline configuration. These findings support the hypothesis that bio-inspired design principles, specifically body–anchor separation and local fluidization, significantly enhance both the functionality and efficiency of granular locomotion robots, providing a pathway toward more versatile, autonomous, and high-performance subsurface exploration.

Abstract:
In warehouse environments, robots require robust picking capabilities to manage a wide variety of objects. Effective deployment demands minimal hardware, strong generalization to new products, and resilience in diverse settings. Current methods often rely on depth sensors for structural information, which suffer from high costs, complex setups, and technical limitations. Inspired by recent advancements in computer vision, we propose an innovative approach that leverages foundation models to enhance suction grasping using only RGB images. Trained solely on a synthetic dataset, our method generalizes its grasp prediction capabilities to real-world robots and a diverse range of novel objects not included in the training set. Our network achieves an 81.9% success rate in real-world applications. The project website with code and data will be available at http://optigrasp.github.io.

Abstract:
Stretchable sensors indicate promising prospects for soft robotics, medical devices, and human-machine interactions due to the high compliance of soft materials. Discrete sensing strategies, including sensor arrays and distributed sensors, are broadly involved in tactile sensors across versatile applications. However, it remains a challenge to achieve high spatial resolution with self-decoupled capacity and insensitivity to other off-axis stimuli for stretchable tactile sensors. Herein, we develop a stretchable tactile sensor based on the proposed continuous spectral-filtering principle, allowing superhigh resolution for applied stimuli. This proposed sensor enables a high-linear spatial response (R² > 0.996) even during stretching and bending, and high continuous spatial (7 μm) and force (5 mN) resolutions with design scalability and interaction robustness to survive piercing and cutting. We further demonstrate the sensors' performance by integrating them into a planar parallel mechanism for precise trajectory tracking (rotational resolution: 0.02°) in real time.

Abstract:
Conventional single LiDAR systems are inherently constrained by their limited field of view (FoV), leading to blind spots and incomplete environmental awareness, particularly on robotic platforms with strict payload limitations. Integrating a motorized LiDAR offers a practical solution by significantly expanding the sensor’s FoV and enabling adaptive panoramic 3D sensing. However, the high-frequency vibrations of the quadruped robot introduce calibration challenges: these oscillations continually disturb the LiDAR–motor extrinsics, so parameters calibrated once may drift during operation and degrade sensing accuracy. Existing calibration methods that use artificial targets or dense feature extraction lack feasibility for on-site applications and real-time implementation. To overcome these limitations, we propose LiMo-Calib, an efficient on-site calibration method that eliminates the need for external targets by leveraging geometric features directly from raw LiDAR scans. LiMo-Calib optimizes feature selection based on normal distribution to accelerate convergence while maintaining accuracy and incorporates a reweighting mechanism that evaluates local plane fitting quality to enhance robustness. We integrate and validate the proposed method on a motorized LiDAR system mounted on a quadruped robot, demonstrating significant improvements in calibration efficiency and 3D sensing accuracy, making LiMo-Calib well-suited for real-world robotic applications. We further demonstrate the accuracy improvements of the LiDAR Inertial Odometry (LIO) on the panoramic 3D sensing system using the calibrated parameters. The code will be available at: https://github.com/kafeiyin00/LiMo-Calib.

Abstract:
Imitation-based teleoperation enables intuitive robot control in hazardous or hard-to-reach environments. Existing methods, however, lack an effective and quickly deployable system that uses simple visual sensors to achieve end-effector control and human-like arm joint configuration imitation across various robotic arm structures. This paper therefore presents a teleoperation system that utilizes a single RGB camera and advanced computer vision techniques to capture human motion, coupled with a kinematic mapping method to transfer movements from human to robotic arms. The system generates robot motion that ensures both end-effector tracking and human-like joint configuration imitation, adaptable to diverse structures, including those with multiple offset links. Experiments demonstrate that the system produces robot arm poses more closely aligned with human configurations compared to traditional methods that overlook human pose. The performance of the end-effector tracking control and human arm shape imitation is evaluated: no noticeable error is observed when the robot completes its motion, while a maximum position error of 17.03% and a maximum orientation error of 0.0925 rad are observed during motion, which are likely attributable to delays caused by filters and communications. Additionally, the system’s ability to actively avoid obstacles via arm configuration imitation in specific scenarios is confirmed. Supplementary video is available.

Abstract:
Multi-Agent Systems (MAS) excel at accomplishing complex objectives through the collaborative efforts of individual agents. Among the methodologies employed in MAS, Multi-Agent Reinforcement Learning (MARL) stands out as one of the most efficacious algorithms. However, the complex objective of Formation Control with Collision Avoidance (FCCA) poses a key challenge: designing an effective reward function that facilitates swift convergence of the policy network to an optimal solution. In this paper, we introduce a novel framework that aims to overcome this challenge. By providing large language models (LLMs) with the prioritization of tasks and the observable information available to each agent, our framework generates reward functions that can be dynamically adjusted online based on evaluation outcomes, employing more advanced evaluation metrics rather than the rewards themselves. This mechanism enables the MAS to simultaneously achieve formation control and obstacle avoidance in dynamic environments with enhanced efficiency, requiring fewer iterations to reach superior performance levels. Our empirical studies, conducted in both simulation and real-world settings, validate the practicality and effectiveness of our proposed approach. Project Website: https://macsclab.github.io/LLM_FCCA

Abstract:
Autonomous navigation of car-like robots on uneven terrain poses unique challenges compared to flat terrain, particularly in traversability assessment and terrain-associated kinematic modelling for motion planning. This paper introduces SEB-Naver, a novel SE(2)-based local navigation framework designed to overcome these challenges. First, we propose an efficient traversability assessment method for SE(2) grids, leveraging GPU parallel computing to enable real-time updates and maintenance of local maps. Second, inspired by differential flatness, we present an optimization-based trajectory planning method that integrates terrain-associated kinematic models, significantly improving both planning efficiency and trajectory quality. Finally, we unify these components into SEB-Naver, achieving real-time terrain assessment and trajectory optimization. Extensive simulations and real-world experiments demonstrate the effectiveness and efficiency of our approach. The code is at https://github.com/ZJU-FAST-Lab/seb_naver.

Abstract:
Equipping quadruped robots with manipulators provides unique loco-manipulation capabilities, enabling diverse practical applications. This integration creates a more complex system that has increased difficulties in modeling and control. Reinforcement learning (RL) offers a promising solution to address these challenges by learning optimal control policies through interaction. Nevertheless, RL methods often struggle with local optima when exploring large solution spaces for motion and manipulation tasks. To overcome these limitations, we propose a novel approach that integrates an explicit kinematic model of the manipulator into the RL framework. This integration provides feedback on the mapping of the body postures to the manipulator’s workspace, guiding the RL exploration process and effectively mitigating the local optima issue. Our algorithm has been successfully deployed on a DeepRobotics X20 quadruped robot equipped with a Unitree Z1 manipulator, and extensive experimental results demonstrate the superior performance of this approach. We have established a project website to showcase our experiments.

Abstract:
In this paper, we propose a variant of the anytime hybrid A* algorithm that generates a fast but suboptimal solution before progressively optimizing the paths to find the shortest winding-constrained paths for a pair of tethered robots under curvature constraints. Specifically, our proposed algorithm uses a tangent graph as its underlying search graph and leverages an anytime A* search framework with appropriately defined cost metrics in order to reduce the overall computation and to ensure that a winding angle constraint is satisfied. Moreover, we prove the completeness and optimality of the algorithm for finding the shortest winding-constrained paths in an anytime fashion. The effectiveness of the proposed algorithm is demonstrated via simulation experiments.
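
The anytime idea of returning a fast suboptimal path and then refining it can be sketched with repeated weighted A* searches under a shrinking heuristic inflation factor. This is a generic illustration on a 4-connected grid, not the authors' tangent-graph algorithm, and it ignores winding and curvature constraints; the function names, the grid encoding (0 free, 1 blocked), and the inflation schedule are assumptions.

```python
import heapq


def weighted_astar(grid, start, goal, eps):
    """Weighted A* on a 4-connected grid; returns the path cost or None.

    With the admissible Manhattan heuristic, the returned cost is at most
    eps times the optimal cost.
    """
    rows, cols = len(grid), len(grid[0])

    def h(cell):
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_g = {start: 0}
    frontier = [(eps * h(start), 0, start)]
    while frontier:
        _, g, cell = heapq.heappop(frontier)
        if cell == goal:
            return g
        if g > best_g.get(cell, float("inf")):
            continue  # stale queue entry, a cheaper route was already found
        r, c = cell
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
                ng = g + 1
                if ng < best_g.get((nr, nc), float("inf")):
                    best_g[(nr, nc)] = ng
                    heapq.heappush(frontier, (ng + eps * h((nr, nc)), ng, (nr, nc)))
    return None


def anytime_search(grid, start, goal, eps_schedule=(3.0, 2.0, 1.0)):
    """Yield (eps, best_cost_so_far) after each progressively tighter search."""
    best = None
    for eps in eps_schedule:
        cost = weighted_astar(grid, start, goal, eps)
        if cost is not None and (best is None or cost < best):
            best = cost
        yield eps, best
```

A caller can stop consuming the generator whenever its time budget runs out and keep the best cost seen so far; the last stage with `eps = 1.0` reduces to plain A*.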

Abstract:
Keypoint detection and description are fundamental tasks for a variety of computer vision applications. Due to the limited receptive field of convolutional neural networks, most existing methods based on deep learning mainly focus on local features instead of taking into account the global context of the entire image. The purpose of this work is to enhance the detection and description process of keypoints by leveraging global information obtained from a Transformer, and to boost the consistency between keypoints and descriptors through their interaction. Specifically, these two improvements are implemented through the Local & Global Context Aggregation (LGCA) Module and the Point & Descriptor Cross Augmentation (PDCA) Module, respectively, both proposed in this article. The LGCA module, which can model the long-range context, is inserted into a Feature Pyramid Network (FPN) to extract features that contain diverse scales and different receptive fields. Moreover, the PDCA module enhances descriptors using the geometry information of the detected keypoints, while enhancing the keypoint detection process using the position coordinates of correctly matched descriptors. Finally, we design a lightweight model to improve the running efficiency. Extensive experiments on various tasks demonstrate that our method achieves a substantial performance improvement over the current feature extraction methods. Code is available at: https://github.com/meng152634/CA2Point.

Abstract:
Micropuncture is a critical step in drug injection during retinal vein cannulation (RVC) surgery. Minimizing deformation during the micropuncture process is beneficial to reduce mechanical damage. However, this goal is challenging due to the viscoelastic characteristics of retinal tissue. In this paper, a robotic micropuncture scheme for deformation optimization that incorporates a nonlinear force model is proposed. First, before micropuncture, a preload strategy is utilized to ensure stable contact between the needle and the retinal vein. Second, a nonlinear viscoelastic (NV) model is developed to characterize the nonlinearity and relaxation behavior of the tissue. Finally, a speed optimization framework, based on the NV model and physical constraints, is adopted to minimize deformation. The effectiveness of the proposed scheme is validated through in vitro experiments conducted on open-sky porcine eyes. With an average force error of 1.48 μN, stable contact can be achieved via a proportional-integral-derivative (PID) controller. The experimental results demonstrate that the NV model is more suitable for force modeling of retinal tissue. Furthermore, the optimized speed results in an average deformation of 0.5727 mm, which represents a reduction of at least 21.02% compared to the linear model. Thanks to the proposed scheme, the robotic micropuncture based on a varying speed trajectory can reduce deformation and enhance the safety of RVC surgery.

Abstract:
This paper investigates the problem of safe visual servoing control of manipulators using an uncalibrated eye-in-hand camera based on control barrier functions (CBFs). Traditional CBFs are defined in the workspace, corresponding to the global coordinates of the base frame. However, when the camera’s position or orientation is adjusted for a better field of view, it becomes uncalibrated, making it challenging to obtain the precise positions of the robot and obstacles using onboard sensors like a camera. To address this, we propose a novel visual servoing control barrier function (VS-CBF) for manipulators, which depends only on the image and depth data sensed by an RGB-D camera. Given an uncalibrated camera, we develop an adaptive estimator for the unknown camera parameters. Based on this estimator, we also design a kinematic visual servoing control law as a nominal controller, ensuring the convergence of the robotic system. The safe controller is then obtained by solving a quadratic programming problem that incorporates the designed VS-CBF and the nominal controller. Finally, experimental results conducted on a UR3 manipulator are presented to demonstrate the effectiveness of our approach.
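
The last step, a quadratic program that keeps the command close to the nominal controller while respecting a barrier constraint, has a closed-form solution when there is a single affine constraint. The sketch below shows that generic case only; it is not the VS-CBF from the paper, and the names `cbf_qp_filter`, `a`, and `b` are assumptions standing in for the constraint a·u + b ≥ 0 that a barrier function would supply.

```python
def cbf_qp_filter(u_nom, a, b):
    """Solve min ||u - u_nom||^2 subject to a.u + b >= 0 in closed form.

    u_nom -- nominal control input (list of floats)
    a, b  -- coefficients of the single affine safety constraint (assumed given)
    """
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) + b
    if slack >= 0.0:
        return list(u_nom)  # nominal input is already safe, leave it unchanged
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("constraint cannot be satisfied by any input")
    # Project u_nom onto the boundary of the half-space a.u + b >= 0.
    return [ui - slack * ai / norm_sq for ai, ui in zip(a, u_nom)]


# Hypothetical example: the constraint limits the first input to at most 0.5.
u_safe = cbf_qp_filter(u_nom=[1.0, 0.0], a=[-1.0, 0.0], b=0.5)
```

With several simultaneous constraints the projection no longer has this simple form and a numerical QP solver would be needed.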

Abstract:
Although significant progress has been made in coordinating multi-unmanned surface vehicle (multi-USV or USVs) systems over the past decades, consistent monitoring (or tracking) of such systems remains challenging as they do not share data with monitoring systems, further exacerbated by observed data inaccuracies and sparsity. To tackle the complicated issue, we hereby introduce the multi-unmanned aerial vehicle (multi-UAV or UAVs) system to monitor the multi-USV system. Therein, by introducing a sparse-Bayesian-learning-based (SBL-based) algorithm, the multi-UAV system can identify the potential coordinated dynamics of multi-USV system via only noisy and limited data. Then, by employing the Kalman filter (KF), the proposed approach can predict and update real-time data and optimize trajectory estimation for USVs, and enhance coordination control in the multi-UAV system to achieve coordinated monitoring. Finally, comparative simulations against the traditional control method, conducted under varying noise levels and data availability ratios, demonstrate the effectiveness and superiority of the proposed method.

Abstract:
Natural Orifice Transluminal Endoscopic Surgery (NOTES) holds great promise due to its ability to eliminate external incisions, reduce trauma, and accelerate recovery. However, the adoption of NOTES is hindered by the limited capabilities of existing instruments, particularly in achieving the required balance between compact size, dexterity, and load capacity. This paper introduces a novel robotic manipulator designed for NOTES, featuring a 5 mm diameter and 7 degrees of freedom (DoF). The manipulator incorporates an innovative 3-PRS flexible parallel mechanism combined with a continuum parallel structure, achieving enhanced dexterity and variable stiffness functionality within a miniaturized design. A kinematic and variable stiffness analysis is performed, and experimental validation demonstrates its bending performance and stiffness modulation. Additionally, the feasibility and practicality of the robotic system are confirmed through a peg-transfer experiment, proving its potential for real-world surgical applications. This research offers a viable solution for enhancing the performance of NOTES instruments.

Abstract:
Modern-day autonomous robots need high-level map representations to perform sophisticated tasks. Recently, 3D scene graphs (3DSGs) have emerged as a promising alternative to traditional grid maps, blending efficient memory use and rich feature representation. However, most efforts to apply them have been limited to static worlds. This work introduces REACT, a framework that efficiently performs real-time attribute clustering and transfer to relocalize object nodes in a 3DSG. REACT employs a novel method for comparing object instances using an embedding model trained on triplet loss, facilitating instance clustering and matching. Experimental results demonstrate that REACT is able to relocalize objects while maintaining computational efficiency. The REACT framework’s source code will be available as an open-source project, promoting further advancements in reusable and updatable 3DSGs.
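
For readers unfamiliar with the triplet loss mentioned above, a minimal sketch of its standard Euclidean form follows. It is not REACT's training code: the embedding vectors, the margin value, and the function name are assumptions for illustration.

```python
import math


def triplet_loss(anchor, positive, negative, margin=1.0):
    """Standard triplet loss: max(0, d(a, p) - d(a, n) + margin).

    The loss is zero once the negative is farther from the anchor than the
    positive by at least the margin.
    """
    d_pos = math.dist(anchor, positive)   # distance to the same-instance embedding
    d_neg = math.dist(anchor, negative)   # distance to the different-instance embedding
    return max(0.0, d_pos - d_neg + margin)


# Hypothetical 2D embeddings.
well_separated = triplet_loss([0.0, 0.0], [0.0, 1.0], [3.0, 4.0])
too_close = triplet_loss([0.0, 0.0], [0.0, 1.0], [0.0, 1.5])
```

Embeddings trained this way can be compared by distance, which is what makes instance clustering and matching possible.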

Abstract:
Artificial neural networks can be used to solve a variety of robotic tasks. However, they risk failing catastrophically when faced with out-of-distribution (OOD) situations. Several approaches have employed a type of synaptic plasticity known as Hebbian learning that can dynamically adjust weights based on local neural activities. Research has shown that synaptic plasticity can make policies more robust and help them adapt to unforeseen changes in the environment. However, networks augmented with Hebbian learning can suffer from weight divergence, resulting in network instability. Furthermore, such Hebbian networks have not yet been applied to solve legged locomotion in complex real robots with many degrees of freedom. In this work, we improve the Hebbian network with a weight normalization mechanism for preventing weight divergence, analyze the principal components of the Hebbian weights, and perform a thorough evaluation of network performance in locomotion control for real 18-DOF dung beetle-like and 16-DOF gecko-like robots. We find that the Hebbian-based plastic network can achieve zero-shot sim-to-real locomotion adaptation and generalize to unseen conditions, such as uneven terrain and morphological damage.
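
Why normalization prevents divergence can be seen in a toy update. The sketch below applies a plain Hebbian rule (weight change proportional to the product of pre- and postsynaptic activity) followed by a max-abs rescaling; both the specific rule and the rescaling are assumptions for illustration, since the abstract does not state which Hebbian variant or normalization the authors use.

```python
def hebbian_step(weights, pre, post, eta=0.1):
    """One plain Hebbian update followed by max-abs normalization.

    weights[i][j] connects presynaptic unit j to postsynaptic unit i.
    """
    updated = [
        [w + eta * post[i] * pre[j] for j, w in enumerate(row)]
        for i, row in enumerate(weights)
    ]
    peak = max(abs(w) for row in updated for w in row)
    if peak > 1.0:  # assumed normalization rule: rescale weights into [-1, 1]
        updated = [[w / peak for w in row] for row in updated]
    return updated


# Hypothetical layer with two inputs and one strongly active output.
w = hebbian_step([[0.5, -0.5]], pre=[1.0, 1.0], post=[10.0])
```

Without the rescaling, repeated updates with correlated activity would grow the weights without bound.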

Abstract:
Fast-legged humanoid robots are transforming industries from manufacturing to medical robotics, with the global market projected to grow from 0.67 billion in 2024 to 2.27 billion by 2033 at a 14.3% CAGR. Despite rapid advancements, challenges remain in navigating complex terrains, especially uneven, deformable, and high-friction surfaces. This paper presents the first minimally sensorised blade leg made by coupling soft and rigid materials for robots: an alternative approach for multimodal sensing and advanced control algorithms in terrain navigation. This incorporates a passive leg design embedded with barometric pressure sensors that are proven to retain high-dimensional spatio-temporal data. Hence, we hypothesized that barometric pressure sensors can capture multidimensional terrain data and subtle surface compliance changes through spatiotemporal pressure patterns. The blade was mounted on a UR5 robotic arm and tested on terrains of varied textures, including aluminium, pebble, coir, and sandpaper, materials spanning a diverse range of stiffness. Spatiotemporal data from the sensors were recorded and analyzed to assess terrain characteristics and leg-terrain interactions under different conditions. The results demonstrated that barometric pressure sensors could accurately recognize different terrains with as few as three sensors in a 2-second time frame. Recognition accuracy improved with more sensors, demonstrating the effectiveness of morphologically adapted composite structures with optimally placed minimal sensors.

Abstract:
High-dynamic tactile sensing and tactile servo control present challenges in robustness and real-time performance. This paper proposes a closed-loop tactile servo control strategy for robotic nail hammering, by allowing controlled hammer slide within a rigid robotic 2-finger gripper. The proposed approach detects tactile information of continuous sliding and sliding-induced vibrations in real time and modulates gripping force. The control encourages rotational sliding to enhance impact and reduce recoil while restricting parallel slippage to maintain grip stability. To achieve real-time processing and effective sliding feature extraction, we employ Short-Time Fourier Transform (STFT) and a dual-stream Physics-Informed Machine Learning (PIML) model, processing tactile data at 1 kHz with an average latency of 1.04 ms. Experimental results show that, compared to conventional methods, controlling hammer slippage reduces arm joint recoil by 64.26% (223.30 N → 79.81 N) while increasing hammer impact force by 179.97% (28.06 N → 78.56 N). The method adapts to hammers with varying mass distributions, significantly improving impact resilience and manipulation performance in high-dynamic interactions. These advancements pave the way for more dexterous and robust robotic systems with embodied intelligence.
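
The two reported percentages follow directly from the before and after values quoted in parentheses. The short check below recomputes them; the helper name `percent_change` is an assumption.

```python
def percent_change(before, after):
    """Relative change from before to after, in percent (negative = reduction)."""
    return (after - before) / before * 100.0


recoil_change = percent_change(223.30, 79.81)   # arm joint recoil, in N
impact_change = percent_change(28.06, 78.56)    # hammer impact force, in N
```

Rounded to two decimals, the recoil value is a reduction of 64.26% and the impact value is an increase of 179.97%, matching the abstract.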

Abstract:
Underwater assistance is crucial for individuals who depend on diving for their livelihood. In this paper, we propose a novel underwater exosuit actuator designed to assist with flutter kicking during diving, thereby decreasing the effort the diver has to exert. The actuator can provide bidirectional assistance to the up and downbeats when the diver kicks underwater, and has no restriction on leg movements when it is deactivated. Both the benchtop experiment and human subject tests were conducted to verify its performance. The benchtop experiment verified its kinematic features, while tests with five participants validated its assistive performance. The results indicate that the actuator delivers a peak torque of 0.0947 Nm/kg and a peak force of 100 N in both directions, while allowing free leg movement during walking or kicking when not powered, thus ensuring safety during diving.

Abstract:
This study presents a novel environment-aware reinforcement learning (RL) framework designed to augment the operational capabilities of autonomous underwater vehicles (AUVs) in underwater environments. Departing from traditional RL architectures, the proposed framework integrates an environment-aware network module that dynamically captures flow field data, effectively embedding this critical environmental information into the state space. This integration facilitates real-time environmental adaptation, significantly enhancing the AUV’s situational awareness and decision-making capabilities. Furthermore, the framework incorporates AUV structure characteristics into the optimization process, employing a large language model (LLM)-based iterative refinement mechanism that leverages both environmental conditions and training outcomes to optimize task performance. Comprehensive experimental evaluations demonstrate the framework’s superior performance, robustness and adaptability.

Abstract:
Offline reinforcement learning faces a critical challenge in synthesizing high-reward trajectories from suboptimal datasets while robustly handling the stochasticity inherent in real-world decision-making. While the combination of return-conditioned sequence models, such as Decision Transformers (DT), and dynamic programming critics shows great potential in trajectory synthesis, their deterministic action generation and scalar Q-value critic often fail to distinguish intentional behavioral variability from detrimental noise, leading to suboptimal policy collapse. To address this challenge, we propose the Distributional Decision Transformer (DDT), a novel framework that unifies probabilistic return distribution modeling with autoregressive action generation. DDT introduces two key innovations: (1) a Gaussian stochastic return mechanism that reparameterizes target returns as samplable distributions, enabling diverse action candidate generation; and (2) an Implicit Quantile Network (IQN) critic embedded within the decision loop, which evaluates actions across the full spectrum of return distributions (quantiles τ ~ U(0, 1)). On D4RL benchmarks, DDT achieves state-of-the-art performance, with a 91.6 average normalized score in MuJoCo locomotion and 69.3 in sparse-reward settings. The results establish DDT as a principled solution for risk-aware trajectory synthesis in offline RL.

Abstract:
In this work, we propose a novel quadrotor design capable of folding its arms vertically to grasp objects and navigate through narrow spaces. The transformation is controlled actively by a central servomotor, gears, and racks. The arms connect the motor bases to the central frame, forming a parallelogram structure that ensures the propellers maintain a constant orientation during morphing. In its stretched state, the quadrotor resembles a conventional design, and when contracted, it functions as a gripper with grasping components emerging from the motor bases. To mitigate disturbances while transforming and grasping payloads, we employ an adaptive sliding mode controller with a disturbance observer. When fully folded, the quadrotor frame shrinks to 67% of its original size. The control performance and versatility of the morphing quadrotor are validated through real-world experiments.

Abstract:
In an aging society, the need for rehabilitation treatment is expected to rise. As current healthcare systems have limited capacity and personnel, access to rehabilitation devices usable in households can help address the demand. A lower-limb rehabilitation robot designed for home use must be adaptable to accommodate acute and chronic rehabilitation phases. Existing devices are mechanically complex and require intricate, patient-specific adjustments. To address this, we propose a single degree of freedom (DoF) mechanism based on a chain drive that can be used in multiple configurations, inside and outside a patient bed. We model the gait pattern and construct a custom cost function that captures key features of natural human walking. This cost function is then used to optimize the design parameters of the robot via a direct-search solver to accommodate patients of varying sizes and achieve effective rehabilitation with a fixed trajectory. The outcome is validated experimentally by comparing two robot configurations with five healthy subjects.

Abstract:
Designing robotic systems to act autonomously in unforeseen environments is a challenging task. This work presents a novel approach to use formal verification, specifically Statistical Model Checking (SMC), to verify system properties of autonomous robots at design-time. We introduce an extension of the SCXML format, designed to model system components including both Robot Operating System 2 (ROS 2) and Behavior Tree (BT) features. Further, we contribute Autonomous Systems to Formal Models (AS2FM), a tool to translate the full system model into JANI. The use of JANI, a standard format for quantitative model checking, enables verification of system properties with off-the-shelf SMC tools. We demonstrate the practical usability of AS2FM both in terms of applicability to real-world autonomous robotic control systems, and in terms of verification runtime scaling. We provide a case study, where we successfully identify problems in a ROS 2-based robotic manipulation use case that is verifiable in less than one second using consumer hardware. Additionally, we compare to the state of the art and demonstrate that our method is more comprehensive in system feature support, and that the verification runtime scales linearly with the size of the model, instead of exponentially.
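
Statistical model checking estimates the probability that a property holds by sampling independent runs of the model. The sketch below shows the generic recipe with a Chernoff-Hoeffding sample bound; it does not use AS2FM or JANI, and the toy model `toy_run` with its 0.9 success probability is a hypothetical stand-in for simulating a real system model.

```python
import math
import random


def required_samples(epsilon, delta):
    """Chernoff-Hoeffding bound on the number of runs needed so that the
    estimate is within epsilon of the true probability with confidence
    at least 1 - delta."""
    return math.ceil(math.log(2.0 / delta) / (2.0 * epsilon ** 2))


def estimate_probability(simulate_once, epsilon, delta, rng):
    """Monte Carlo estimate of the probability that a run satisfies the property."""
    n = required_samples(epsilon, delta)
    successes = sum(1 for _ in range(n) if simulate_once(rng))
    return successes / n, n


def toy_run(rng):
    """Hypothetical model: a run satisfies the property with probability 0.9."""
    return rng.random() < 0.9


p_hat, runs = estimate_probability(toy_run, epsilon=0.05, delta=0.05,
                                   rng=random.Random(0))
```

The number of runs depends only on the requested precision and confidence, not on the size of the state space, which is one reason sampling-based checking can scale to models that are too large for exhaustive exploration.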

Abstract:
Humans interacting with robots often form predictions of what the robot will do next. For instance, based on the recent behavior of an autonomous car, a nearby human driver might predict that the car is going to remain in the same lane. It is important for the robot to understand the human’s prediction for safe and seamless interaction: e.g., if the autonomous car knows the human thinks it is not merging, but the car actually intends to merge, then the car can adjust its behavior to prevent an accident. Prior works typically assume that humans make precise predictions of robot behavior. However, recent research on human-human prediction suggests the opposite: humans tend to approximate other agents by predicting their high-level behaviors. We apply this finding to develop a second-order theory of mind approach that enables robots to estimate how humans predict they will behave. To extract these high-level predictions directly from data, we embed the recent human and robot trajectories into a discrete latent space. Each element of this latent space captures a different type of behavior (e.g., merging in front of the human, remaining in the same lane) and decodes into a vector field across the state space that is consistent with the underlying behavior type. We hypothesize that our resulting high-level and coarse predictions of robot behavior will correspond to actual human predictions. We provide initial evidence in support of this hypothesis through proof-of-concept simulations, testing our method’s predictions against those of real users, and experiments on a real-world interactive driving dataset.

Abstract:
In recent years, humanoid robots have garnered significant attention from both academia and industry due to their high adaptability to environments and human-like characteristics. With the rapid advancement of reinforcement learning, substantial progress has been made in the walking control of humanoid robots. However, existing methods still face challenges when dealing with complex environments and irregular terrains. In the field of perceptive locomotion, existing approaches are generally divided into two-stage methods and end-to-end methods. Two-stage methods first train a teacher policy in a simulated environment and then use distillation techniques, such as DAgger, to transfer the privileged information learned as latent features or actions to the student policy. End-to-end methods, on the other hand, forgo the learning of privileged information and directly learn policies from a partially observable Markov decision process (POMDP) through reinforcement learning. However, due to the lack of supervision from a teacher policy, end-to-end methods often face difficulties in training and exhibit unstable performance in real-world applications. This paper proposes an innovative two-stage perceptive locomotion framework that combines the advantages of teacher policies learned in a fully observable Markov decision process (MDP) to regularize and supervise the student policy. At the same time, it leverages the characteristics of reinforcement learning to ensure that the student policy can continue to learn in a POMDP, thereby enhancing the model’s upper bound. Our experimental results demonstrate that our two-stage training framework achieves higher training efficiency and stability in simulated environments, while also exhibiting better robustness and generalization capabilities in real-world applications.

Abstract:
Throwing is a fundamental skill that enables robots to manipulate objects in ways that extend beyond the reach of their arms. We present a control framework that combines learning and model-based control for prehensile whole-body throwing with legged mobile manipulators. Our framework consists of three components: a nominal tracking policy for the end-effector, a high-frequency residual policy to enhance tracking accuracy, and an optimization-based module to improve end-effector acceleration control. The proposed controller achieved an average landing error of 0.28 m when throwing at targets located 6 m away. Furthermore, in a comparative study with university students, the system achieved a velocity tracking error of 0.398 m/s and a success rate of 56.8%, hitting small targets randomly placed at distances of 3-5 m while throwing at a specified speed of 6 m/s. In contrast, the human participants achieved a success rate of only 15.2%. This work provides an early demonstration of prehensile throwing with quantified accuracy on hardware, contributing to progress in dynamic whole-body manipulation. A video summarizing the proposed method and the hardware tests is available at https://youtu.be/8ItiQrgN_fw.

Abstract:
Resection of pathological tissue is a common procedure in surgical oncology for treating tumors. In robot-assisted electrosurgery, the use of predefined markers to guide autonomous robotic resection is gaining traction. Accurate tracking of these markers and minimizing electrocautery damage are critical for the safe and effective autonomous resection of tumors. This paper introduces a safety-enhanced autonomous resection method for laparoscopic surgery, designed to mitigate the risks posed by tissue deformation during the resection process. First, we pre-plan the cutting path and design a switching strategy for navigation waypoints based on a preview tracking mechanism. Then, we develop a depth-fused navigation controller and a safe withdrawal motion controller. Next, an inertial tracking mechanism is established to evaluate tissue deformation over short periods. Finally, we develop a confidence generator to fuse the two controllers, ensuring that tissue deformation during the resection process does not cause additional electrocautery damage. Simulation and phantom experiments demonstrate the effectiveness of the proposed method. This work represents a significant step toward achieving autonomous robotic resection.

Abstract:
Human intention detection with hand motion prediction is critical for driving upper-extremity assistive robots in neurorehabilitation applications. However, traditional methods that rely on physiological signal measurement are restrictive and often lack environmental context. We propose a novel approach that predicts future sequences of both hand poses and joint positions. This method integrates gaze information, historical hand motion sequences, and environmental object data, adapting dynamically to the assistive needs of the patient without prior knowledge of the intended object for grasping. Specifically, we use a vector-quantized variational autoencoder for robust hand pose encoding with an autoregressive generative transformer for effective hand motion sequence prediction. We demonstrate the usability of these novel techniques in a pilot study with healthy subjects. To train and evaluate the proposed method, we collect a dataset consisting of various types of grasp actions on different objects from multiple subjects. Through extensive experiments, we demonstrate that the proposed method can successfully predict sequential hand movement. In particular, gaze information significantly improves prediction, especially with fewer input frames, highlighting the potential of the proposed method for real-world applications.

Abstract:
Needle puncture is a fundamental technique in minimally invasive surgical procedures. However, the limited flexibility of flexible needles and their complex interactions with tissues make it challenging to avoid critical organs along the puncture path. Preoperative path planning, which generates feasible collision-free trajectories, can effectively reduce repeated punctures and mitigate patient discomfort. To address this challenge, a flexible needle with increased maximum curvature is designed, which introduces more complex kinematic characteristics and poses greater challenges for trajectory planning under kinematic constraints. Then, for the first time, a convex feasible set (CFS)-based flexible needle trajectory planning method is developed to tackle the non-convex optimization problem posed by obstacle avoidance in unstructured surgical environments. Specifically, our method explicitly incorporates kinematic and curvature constraints, enabling direct generation of feasible trajectories without additional post-processing. Finally, comparative experiments on a self-developed robotic-assisted flexible needle system demonstrate the superior performance of the proposed algorithm. In particular, the proposed trajectory generation method allows the flexible needle to effectively avoid obstacles and accurately reach the target.

Abstract:
Due to recent advances in large language models and robotics, social robots may soon play an important role in people’s daily lives and are expected to improve dynamic multi-party group discussions in social scenarios. In this paper, we develop a system to assist dynamic group discussion with our social robot Haru. Our system is composed of three modules: a Dialogue Assistance module that integrates Haru with large language models, enabling Haru to act as an embodied chatbot; a Balancing and Welcoming Behavior module that improves users’ engagement and welcomes new users to the discussion with verbal behaviors; and an Autonomous Eye Gazing module that shows politeness during group discussion, e.g., gazing at the talking user or at the less-engaged user to encourage her, looking at the newcomer when she joins the discussion, and gazing via eyeball movement when the current speaker is close to the previous one. The autonomous eye gazing behavior was first trained via deep reinforcement learning in simulation and then transferred to the physical Haru in the real world. Results of our user study with 50 subjects indicate that our system is effective at assisting dynamic group discussion.

Abstract:
Soft shape-morphing technologies are explored in fields such as soft robotics, metamaterials and design, enabling systems to adapt dynamically through elastic deformations. Applied in mobile devices, actuators, interactive objects and environments, these systems can respond to functional needs and environmental stimuli, communicating information and enhancing human experiences. A primary design goal for these systems is achieving extensive and complex shape transformations. Traditionally, soft robotics employs a pneumatic active layer constrained by a passive layer, limiting the deformation range. However, using dual active layers can expand deformation potential. Expanding on these principles, this work introduces PNEUmorph: a pneumatic surface constrained by a network of variable-length tendons, allowing broader shape transformations than traditional single-layer systems. PNEUmorph’s dual-layer actuation overcomes fixed deformation limits, significantly enhancing shape-morphing capabilities. This paper presents PNEUmorph’s design and geometrical characterization, achieved through an interdisciplinary approach that merges design and soft robotics methods. This study details methods for simulation, fabrication, operation and evaluation, offering insights into experimental results and directions for advancing surface-based soft shape-morphing systems.

Abstract:
For individuals with limited mobility who are bedridden for extended periods, providing comfortable assisted feeding services is one of the most significant actions to enhance their quality of life. Despite the development of various feeding assistive robots, there remain limitations in terms of interaction convenience and safety, which restrict the overall feeding experience for users. To address these challenges, this study first establishes a feeding assistive robot system that integrates multimodal interaction methods. Furthermore, we propose an interactive feeding method that combines both safety and comfort. This method utilizes visual recognition to detect the user’s active meal intent, food selection preferences, and chewing status. Additionally, based on a Large Language Model (LLM), a monitoring thread is designed to conduct voice interactions regarding the user’s ambiguous intentions, temporary changes in intent, emergency situations, and risky behaviors throughout the feeding process. Comprehensive experimental results demonstrate that the proposed multimodal interaction method, which aligns with the natural eating patterns, incorporates both language and visual interactions, the two most convenient forms for users. It also matches force sensing and pose control techniques during the feeding stages, thereby enhancing the flexibility and safety of the assisted feeding system.

Abstract:
The ability to perform reliable long-horizon task planning is crucial for deploying robots in real-world environments. However, directly employing Large Language Models (LLMs) as action sequence generators often results in low success rates due to their limited reasoning ability for long-horizon embodied tasks. In the proposed STEP framework, we construct a subgoal tree through a pair of closed-loop models: a subgoal decomposition model and a leaf node termination model. Within this framework, we develop a hierarchical tree structure that spans from coarse to fine resolutions. The subgoal decomposition model leverages a foundation LLM to break down complex goals into manageable subgoals, thereby spanning the subgoal tree. The leaf node termination model provides real-time feedback based on environmental states, determining when to terminate the tree spanning and ensuring each leaf node can be directly converted into a primitive action. Experiments conducted both in the VirtualHome WAH-NL benchmark and on real robots demonstrate that STEP achieves long-horizon embodied task completion with success rates up to 34% (WAH-NL) and 25% (real robot), outperforming SOTA methods.
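
The coarse-to-fine tree expansion described above can be illustrated with a minimal sketch. This is not the STEP implementation: the names `SubgoalNode`, `build_subgoal_tree`, `toy_decompose`, and `toy_is_primitive` are hypothetical, a lookup table stands in for the LLM decomposition model, and a fixed set of primitive actions stands in for the termination model, which in the abstract also uses environmental state.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class SubgoalNode:
    """One node of the subgoal tree; leaves are meant to map to primitive actions."""
    goal: str
    children: List["SubgoalNode"] = field(default_factory=list)


def build_subgoal_tree(
    goal: str,
    decompose: Callable[[str], List[str]],
    is_primitive: Callable[[str], bool],
    max_depth: int = 5,
) -> SubgoalNode:
    """Expand `goal` from coarse to fine until the termination check accepts it."""
    node = SubgoalNode(goal)
    # Stop when the (hypothetical) termination check accepts the node, or when
    # the depth budget runs out, so the recursion always ends.
    if max_depth == 0 or is_primitive(goal):
        return node
    for subgoal in decompose(goal):
        node.children.append(
            build_subgoal_tree(subgoal, decompose, is_primitive, max_depth - 1)
        )
    return node


def leaves(node: SubgoalNode) -> List[str]:
    """Read the leaf goals left to right as the resulting action sequence."""
    if not node.children:
        return [node.goal]
    ordered: List[str] = []
    for child in node.children:
        ordered.extend(leaves(child))
    return ordered


# Toy stand-ins (assumptions): a table replaces the LLM, a set replaces the
# environment-aware termination model.
TOY_DECOMPOSITION = {
    "serve an apple": ["get the apple", "deliver the apple"],
    "get the apple": ["walk to fridge", "open fridge", "grab apple"],
    "deliver the apple": ["walk to table", "put apple on table"],
}
TOY_PRIMITIVES = {
    "walk to fridge", "open fridge", "grab apple",
    "walk to table", "put apple on table",
}


def toy_decompose(goal: str) -> List[str]:
    return TOY_DECOMPOSITION.get(goal, [])


def toy_is_primitive(goal: str) -> bool:
    return goal in TOY_PRIMITIVES


plan = leaves(build_subgoal_tree("serve an apple", toy_decompose, toy_is_primitive))
print(plan)
```

In this sketch the termination check is evaluated once per node before any decomposition, which keeps the two roles separate: one callable proposes subgoals, the other decides when a node is fine-grained enough to execute.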

Abstract:
The space-limited U-shaped bend is a safety-critical scenario that requires high vehicle maneuverability. However, due to the non-holonomic nature of the vehicle, it is difficult to perform flexible U-turns without intricate adjustments, which is detrimental to the efficient execution of tasks. To address these issues, this work incorporates the drifting maneuver of the vehicle and proposes a planning and control framework for time- and space-efficient passing in constrained U-shaped bends. First, a dual-track, 3-DOF vehicle model is developed, incorporating load transfer effects and nonlinear tire forces to enhance trajectory precision. Based on this model, a nonlinear optimization-based planner generates time-optimal, space-efficient, and drift-compatible trajectories while ensuring dynamic feasibility. Finally, a multilayer controller is designed for precise trajectory tracking, integrating a trajectory error feedback compensator, a dynamic state feedforward-feedback regulator, and a model inversion-based actuator controller. Simulation experiments in CarSim validate the proposed framework, demonstrating significant improvements in spatial efficiency and completion time. The results highlight its effectiveness in enhancing autonomous vehicle maneuverability for high-performance applications in constrained environments.

Abstract:
In nature, fish locomotion is primarily classified into the BCF (body and caudal fin) propulsion mode and the MPF (median and paired fin) propulsion mode. This paper presents a bio-inspired robotic electric ray that integrates a BCF-mode caudal fin with MPF-mode pectoral fins. The caudal fin consists of a set of wire-driven, multi-joint active segments coupled with a soft, compliant segment, while each symmetrical pectoral fin incorporates two sets of wire-driven joints and a soft fin structure. The undulatory motion of the MPF-mode pectoral fins enables the robotic ray to execute maneuvers such as forward swimming, backward swimming, and in-place turning, whereas the BCF-mode caudal fin enhances linear swimming and turning capabilities. Experimental results demonstrate that MPF-mode swimming achieves a maximum speed of 0.190 m/s (0.358 BL/s), while the cooperative propulsion of MPF and BCF modes enables speeds of up to 0.352 m/s (0.664 BL/s). Notably, the robotic electric ray's large pectoral fins can function as grippers, allowing it to grasp and transport objects using caudal fin propulsion, thereby facilitating object manipulation tasks.
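
As a quick consistency check on the reported speeds, dividing each speed in m/s by the same speed in body lengths per second (BL/s) should give the same body length. The body length itself is not stated in the abstract, so the value computed here is inferred, and the function name `implied_body_length` is hypothetical.

```python
def implied_body_length(speed_m_per_s: float, speed_bl_per_s: float) -> float:
    """Body length in metres implied by one speed given in both units."""
    return speed_m_per_s / speed_bl_per_s


# Speeds quoted in the abstract above.
mpf_only = implied_body_length(0.190, 0.358)   # MPF-mode swimming
combined = implied_body_length(0.352, 0.664)   # cooperative MPF + BCF swimming
speedup = 0.352 / 0.190                        # gain from adding the caudal fin

print(round(mpf_only, 3), round(combined, 3), round(speedup, 2))
```

Both ratios come out near 0.53 m, so the two unit pairs agree with each other, and the cooperative mode is roughly 1.85 times faster than MPF-only swimming.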

Abstract:
With the rapid development and widespread application of autonomous driving technology, the accurate analysis and prevention of traffic accidents have become critical challenges. However, current traffic accident datasets are often constrained by limited scale and diversity, impeding progress in this field. To address these limitations, we introduce AccidentX, a large-scale multimodal dataset specifically curated for comprehensive traffic accident analysis and prevention. Our AccidentX comprises over 10,000 bird’s-eye view (BEV) videos generated using the CARLA simulator, with detailed annotations covering a wide range of traffic scenarios. In comparison to existing datasets such as nuScenes, our AccidentX offers seven times more video frames and leverages Vision-Language Models (VLMs) and GPT-4o for enhanced scene understanding and decision-making. We also establish a benchmark for state-of-the-art Multimodal Large Language Models (MLLMs) on AccidentX, fostering further research and innovation within the community. AccidentX will be made available as a fully open source resource for the advancement of the autonomous driving safety algorithm community.

Abstract:
Low stiffness in 7-DOF cable-driven humanoid manipulators limits their precision, posing a significant challenge in complex human-robot interaction (HRI) scenarios. This paper presents a motion planning framework to enhance manipulator stiffness for both single and dual-arm configurations. For a single arm, we introduce a novel method that integrates dynamic obstacle avoidance with posture optimization to maximize end-effector stiffness. For dual-arm systems, we develop a coupled stiffness model that addresses inter-arm dynamics to improve performance in coordinated tasks. Experimental results on prototypes confirm that the proposed methods significantly reduce end-effector deviation under load, thereby improving the precision and reliability of these manipulators in sophisticated collaborative applications.

Abstract:
The code performance of industrial robots is typically analyzed through CPU metrics, which overlook the physical impact of code on robot behavior. This study introduces a novel framework for assessing robot program performance from an embodiment perspective by analyzing the robot’s electrical power profile. Our approach diverges from conventional CPU-based evaluations and instead leverages a suite of normalized metrics, namely, the energy utilization coefficient (fU), the energy conversion metric (fC), and the reliability coefficient (fR), to capture how efficiently and reliably energy is used during task execution. Complementing these metrics, the established robot wear metric (α) provides further insight into long-term reliability. Our approach is demonstrated through an experimental case study in machine tending, comparing four programs with diverse strategies using a UR5e robot. The proposed metrics directly compare and categorize different robot programs, regardless of the specific task, by linking code performance to its physical manifestation through power consumption patterns. Our results reveal the strengths and weaknesses of each strategy, offering actionable insights for optimizing robot programming practices. Enhancing energy efficiency and reliability through this embodiment-centric approach not only improves individual robot performance but also supports broader industrial objectives such as sustainable manufacturing and cost reduction.

Affiliations: Chaoquan Tang is with the State Key Laboratory of Intelligent Mining Equipment Technology, China University of Mining and Technology; School of Mechanical and Electrical Engineering, China University of Mining and Technology, Xuzhou, China; Institute of Machine Intelligence, University of Shanghai for Science and Technology; Thrust of Robotics and Autonomous Systems, The Hong Kong University of Science and Technology (Guangzhou); Information Institute, Ministry of Emergency Management of the People’s Republic of China; School of Mechanical Engineering and Automation, Harbin Institute of Technology and Nankai University

Abstract:
Improving motion speed and efficiency remains a critical challenge in snake robot gait control. This paper introduces the Loop gait, a novel locomotion gait designed to enhance both the speed and energy efficiency of snake robots without passive wheels. Compared to the more widely used Crawler gait and S-pedal gait, the Loop gait achieves a higher motion speed (1.8 times that of the Crawler gait under the same parameters) and a higher motion efficiency (1.6 times that of the Crawler gait under the same parameters) due to its more loop-like body morphology. A static stability model is developed to guide parameter optimization, addressing potential instability caused by the elevated center of mass of the snake robot. Experiments confirm the Loop gait’s exceptional energy efficiency and propulsion, validating the static stability model’s utility in selecting parameters.

Abstract:
Social navigation has become increasingly important for robots operating in human environments, yet many newly proposed navigation methods remain narrowly tailored or exist only as proof-of-concept prototypes. Building on our previous work with Arena, a social navigation development platform, we now propose Arena-Bench 2.0, a comprehensive social navigation benchmark of state-of-the-art planners, fully integrated into the Arena framework. To achieve this, we developed a novel plugin structure, implemented on ROS 2, to streamline the integration process and ensure straightforward, efficient workflows. As a demonstration, we integrated various learning-based and model-based navigation approaches and constructed a diverse set of social navigation scenarios to rigorously evaluate each planner. Specifically, we introduce a scenario generation node that allows users to construct complex, realistic social contexts through a web-based interface. We subsequently perform an extensive benchmark of all integrated planners, assessing both navigational and social metrics. Our evaluation also considers factors such as sensor input, reaction time, and latency, enabling insights into which planner may be most appropriate under different circumstances. The findings offer valuable guidance for selecting suitable planners for specific scenarios. The code is publicly available at https://github.com/Arena-Rosnav.

Abstract:
We present a large language model (LLM) based system to empower quadrupedal robots with problem-solving abilities for long-horizon tasks beyond short-term motions. Long-horizon tasks for quadrupeds are challenging since they require both a high-level understanding of the semantics of the problem for task planning and a broad range of locomotion and manipulation skills to interact with the environment. Our system builds a high-level reasoning layer with large language models, which generates hybrid discrete-continuous plans as robot code from task descriptions. It comprises multiple LLM agents: a semantic planner that sketches a plan, a parameter calculator that predicts arguments in the plan, a code generator that converts the plan into executable robot code, and a replanner that handles execution failures or human interventions. At the low level, we adopt reinforcement learning to train a set of motion planning and control skills to unleash the flexibility of quadrupeds for rich environment interactions. Our system is tested on long-horizon tasks that are infeasible to complete with one single skill. Simulation and real-world experiments show that it successfully figures out multi-step strategies and demonstrates non-trivial behaviors, including building tools or notifying a human for help. Demos are available on our project page: https://sites.google.com/view/long-horizon-robot.

Abstract:
We present CROSS-GAiT, a novel algorithm for quadruped robots that uses Cross Attention to fuse terrain representations derived from visual and time-series inputs, including linear accelerations, angular velocities, and joint efforts. These fused representations are used to continuously adjust two critical gait parameters (step height and hip splay), enabling adaptive gaits that respond dynamically to varying terrain conditions. To generate terrain representations, we process visual inputs through a masked Vision Transformer (ViT) encoder and time-series data through a dilated causal convolutional encoder. The Cross Attention mechanism then selects and integrates the most relevant features from each modality, combining terrain characteristics with robot dynamics for informed gait adaptation. This fused representation allows CROSS-GAiT to continuously adjust gait parameters in response to unpredictable terrain conditions in real time. We train CROSS-GAiT on a diverse set of terrains, including asphalt, concrete, brick pavements, grass, dense vegetation, pebbles, gravel, and sand, and validate its generalization ability on unseen environments. Our hardware implementation on the Ghost Robotics Vision 60 demonstrates superior performance in challenging terrains, such as high-density vegetation, unstable surfaces, sandbanks, and deformable substrates. We observe at least a 7.04% reduction in IMU energy density and a 27.3% reduction in total joint effort, which directly correlates with increased stability and reduced energy usage when compared to state-of-the-art methods. Furthermore, CROSS-GAiT demonstrates at least a 64.5% increase in success rate and a 4.91% reduction in time to reach the goal in four complex scenarios. Additionally, the learned representations perform 4.48% better than the state-of-the-art on a terrain classification task.
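
For readers unfamiliar with the fusion step, the following is a minimal sketch of generic scaled dot-product cross attention, in which tokens from one modality act as queries over tokens from another. It is not the CROSS-GAiT architecture: the learned projection matrices are omitted, the choice of visual tokens as queries is an assumption, and the names `cross_attention`, `visual_tokens`, and `imu_tokens` are hypothetical.

```python
import math
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain matrix product of a (m x k) and b (k x n)."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a: Matrix) -> Matrix:
    return [list(col) for col in zip(*a)]


def softmax(row: List[float]) -> List[float]:
    """Numerically stable softmax over one row of scores."""
    peak = max(row)
    exps = [math.exp(v - peak) for v in row]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attention(queries: Matrix, keys: Matrix, values: Matrix) -> Matrix:
    """softmax(Q K^T / sqrt(d_k)) V, with Q from one modality and K, V from another."""
    d_k = len(keys[0])
    scores = matmul(queries, transpose(keys))
    weights = [softmax([s / math.sqrt(d_k) for s in row]) for row in scores]
    return matmul(weights, values)


# Toy inputs (assumptions): two visual tokens attend over three time-series tokens.
visual_tokens = [[1.0, 0.0], [0.0, 1.0]]
imu_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
fused = cross_attention(visual_tokens, imu_tokens, imu_tokens)
print(fused)
```

Each output row is a weighted average of the time-series tokens, with weights set by how closely each one matches the corresponding visual token.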

Abstract:
Jumping constitutes an essential component of quadruped robots’ locomotion capabilities, which includes dynamic take-off and adaptive landing. Existing quadrupedal jumping studies mainly focus on the stance and flight phases and assume flat landing ground, which is impractical in many real-world cases. This work proposes a safe landing framework that achieves adaptive landing on rough terrains by combining Trajectory Optimization (TO) and Reinforcement Learning (RL). The RL agent learns to track the reference motion generated by TO in environments with rough terrains. To enable the learning of compliant landing skills on challenging terrains, a reward relaxation strategy is designed to encourage exploration during the landing recovery period. Extensive experiments validate the accurate tracking and safe landing skills achieved by our proposed method in various scenarios.

Abstract:
Collective microrobots enable controlled batch delivery, showing promising applications in the biomedical field. However, significant challenges remain in achieving long-distance delivery of collective microrobots in dynamic environments. This study proposes a magnetic actuation strategy for delivering collective cell microrobots in flowing conditions. A magnetic actuation method is developed, and a mobile actuation system with multiple coordinated coils is designed to generate spatially isotropic magnetic fields. Experiments on delivering collective microrobots are conducted in flowing conditions, both downstream and upstream, with an average flow velocity of up to 8.84 mm/s. Results demonstrate that the proposed actuation strategy enhances driving performance in dynamic environments, achieving long-distance delivery of collective microrobots (over 548 mm). The final access rate of microrobots reaches 90.63% and 94.79% in upstream and downstream conditions, respectively. Our strategy provides an efficient control method for delivering collective microrobots, showing potential for targeted delivery in biomedical applications.

Abstract:
Shared autonomy can improve teleoperating robotic systems in complex manufacturing and assembly tasks by combining human decision-making and robotic capabilities. A key aspect of seamless collaboration and trust in shared autonomy is the robot’s ability to interpret human intentions in a consistent and explainable manner. To achieve this, a graph neural network-based intention estimation framework is introduced, which generates dynamic graphs that capture spatial relationships evolving over time. The framework predicts human intentions at two hierarchical levels: low-level actions and high-level tasks. Furthermore, we empirically and anecdotally verify the correctness and consistency of the predictions using explainability metrics. The algorithm is demonstrated by teleoperating a bi-manual robot to assemble various block structures in a virtual reality simulation environment.

Affiliations: School of Mechanical Engineering, Jiangnan University, Wuxi, China; School of Artificial Intelligence, Beijing University of Posts and Telecommunications, Beijing, China; Wuhan University of Science and Technology, Wuhan, China; Department of Computer Science and Technology, Beijing National Research Center for Information Science and Technology, Institute for Artificial Intelligence, State Key Lab of Intelligent Technology and Systems, Tsinghua University, Beijing, China

Abstract:
This paper presents a novel multi-mode bionic robotic hand. Its bionic finger (BIF) ingeniously combines a magnetic-silica-gel skin with a rigid skeletal framework and integrates a vacuum suction cup at the fingertip. This design enables the bionic manipulator to execute multiple grasping modes, namely enveloping, parallel, and suction grasping. The proposed BIF emulates the skeletal structure of human fingers and equips the fingertip with suction-based grasping functionality, thus achieving both formal bionics and functional superiority. The overall grasping space range of the bionic manipulator can be determined through the computation of the offset of the steel wire, which corresponds to the bending angles of the three joints of the finger. Furthermore, by discerning the four phases within the bionic manipulator’s object-grasping process, in-depth exploration is carried out regarding the unique data characteristics of the magnetic-tactile sensing unit during the grasping operation. On this basis, an accurate prediction of the grasped object’s diameter is achieved. We constructed an autonomous grasping operation platform by integrating an external depth camera with the robotic arm to assess the fundamental performance of this robotic hand in grasping diverse objects.

Affiliations: Key Laboratory of Metallurgical Equipment and Control Technology Ministry of Education, Wuhan University of Science and Technology, Wuhan, China; Institute of Applied Artificial Intelligence of the Guangdong-Hong Kong-Macao Greater Bay Area, and School of Artificial Intelligence, Shenzhen Polytechnic University, Shenzhen, China; Shenzhen Institute of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; School of Mechanical and Electrical Engineering, Shenzhen Polytechnic University, Shenzhen, China

Abstract:
Live fish possess the ability to modulate their body stiffness to achieve diverse swimming characteristics, a feature that is largely absent in existing robotic fish designs. Most robotic fish are constrained by rigid or fixed-stiffness bodies, which limit their flexibility, axial modulation capabilities, and ability to navigate confined spaces. This paper presents a novel soft robotic fish capable of stiffness modulation and earthworm-inspired wriggling locomotion. The design incorporates a pneumatic stiffness modulation mechanism, a cable-driven actuation system, and two passive flapping foils near the caudal fin. Two distinct swimming modes are demonstrated: a body and/or caudal fin (BCF) mode with active stiffness modulation and an earthworm-inspired wriggling mode. Experimental results validate the effectiveness of the proposed design in both swimming modes. This work advances the development of soft robotic fish by introducing innovative structural and actuation mechanisms, enhancing flexibility and adaptability for future applications in underwater exploration and related fields.

Abstract:
Continuum robots, known for their compliance in unstructured environments, face limitations due to the lack of rotational degrees of freedom (DOFs) about the backbone. This prevents them from compensating undesired torsional deformation and performing 6-DOF control of the end-effector, thereby restricting their mobility. This paper presents a continuum robot with integrated dual rotational DOFs. One is integrated at the arm base to compensate for torsional deformation caused by external loads, while the other one, located at the arm tip, enables full 6-DOF control of the end-effector. To control the robot, a screw-theory-based kinematic model and a kinematic control framework are proposed to enable real-time, simultaneous control of the end-effector’s position and orientation. Experimental results show that the arm base rotational joint can fully compensate for undesired torsional deformation caused by a 1000 g payload. Thanks to the arm tip’s DOF and the proposed kinematic control framework, the robot’s end-effector can maintain a constant orientation while achieving open-loop path-tracking errors of only 3.3% of the arm’s length (930 mm), and successfully executing valve-closing tasks with coordinated 6-DOF motion, demonstrating the robot’s potential for industrial maintenance, human-robot interaction, and confined-space manipulation.

Abstract:
Accurate temporal segmentation of human actions is critical for intelligent robots in collaborative settings, where a precise understanding of sub-activity labels and their temporal structure is essential. However, the inherent noise in both human pose estimation and object detection often leads to over-segmentation errors, disrupting the coherence of action sequences. To address this, we propose a Multi-Modal Graph Convolutional Network (MMGCN) that integrates low-frame-rate (e.g., 1 fps) visual data with high-frame-rate (e.g., 30 fps) motion data (skeleton and object detections) to mitigate fragmentation. Our framework introduces three key contributions. First, a sinusoidal encoding strategy maps 3D skeleton coordinates into a continuous sin-cos space to enhance the robustness of the spatial representation. Second, a temporal graph fusion module aligns multi-modal inputs with differing resolutions via hierarchical feature aggregation. Third, inspired by the smooth transitions inherent to human actions, we design SmoothLabelMix, a data augmentation technique that mixes input sequences and labels to generate synthetic training examples with gradual action transitions, enhancing temporal consistency in predictions and reducing over-segmentation artifacts. Extensive experiments on the Bimanual Actions Dataset, a public benchmark for human-object interaction understanding, demonstrate that our approach outperforms state-of-the-art methods, especially in action segmentation accuracy, achieving F1@10: 94.5% and F1@25: 92.8%.
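
The sin-cos mapping of skeleton coordinates mentioned above can be illustrated with a minimal sketch. The abstract does not give the exact encoding, so the geometric frequency ladder, the `num_bands` parameter, and the function name `sincos_encode` are assumptions made purely for illustration, in the style of common positional encodings.

```python
import numpy as np

def sincos_encode(joints, num_bands=4):
    """Map 3D joint coordinates to sin/cos features.

    joints: array of shape (..., 3), e.g. (frames, joints, 3).
    num_bands: hypothetical number of frequency bands (not from the abstract).
    Returns an array of shape (..., 3 * 2 * num_bands) with values in [-1, 1].
    """
    joints = np.asarray(joints, dtype=np.float64)
    # Assumed frequency ladder: pi, 2*pi, 4*pi, ...
    freqs = (2.0 ** np.arange(num_bands)) * np.pi
    angles = joints[..., None] * freqs                      # (..., 3, num_bands)
    feats = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    # Flatten the per-coordinate features into one vector per joint.
    return feats.reshape(*joints.shape[:-1], -1)

# Example: 30 frames, 25 joints, 3D coordinates.
encoded = sincos_encode(np.random.rand(30, 25, 3))
print(encoded.shape)
```

Because every output lies in [-1, 1], the encoded features are bounded regardless of the scale of the raw coordinates, which is one plausible reason such an encoding could help robustness.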

Abstract:
In this article, we present the Layered Semantic Graphs (LSG), a novel actionable hierarchical scene graph, fully integrated with a multi-modal mission planner, the FLIE: A First-Look based Inspection and Exploration planner [1]. The novelty of this work stems from addressing the task of maintaining an intuitive and multi-resolution scene representation, while simultaneously offering a tractable foundation for planning and scene understanding during an ongoing inspection mission of a priori unknown targets-of-interest in an unknown environment. The proposed LSG scheme is composed of locally nested hierarchical graphs, at multiple layers of abstraction, with the abstract concepts grounded on the functionality of the integrated FLIE planner. Furthermore, LSG encapsulates real-time semantic segmentation models that offer extraction and localization of desired semantic elements within the hierarchical representation. This extends the capability of the inspection planner, which can then leverage LSG to make an informed decision to inspect a particular semantic of interest. We also emphasize the hierarchical and semantic path-planning capabilities of LSG, which could extend inspection missions by improving situational awareness for human operators in an unknown environment. The validity of the proposed scheme is demonstrated through extensive evaluations of the proposed architecture in simulations, as well as experimental field deployments on a Boston Dynamics Spot quadruped robot in urban outdoor environment settings.

Abstract:
Autonomous navigation of unmanned ground vehicles (UGVs) in dynamic scenes is a challenging task that requires them to avoid obstacles and move toward the goal simultaneously. This paper proposes ALVO, an adaptive learning policy that leverages velocity obstacles for UGV navigation. ALVO employs an adaptive gating-based mechanism for reactive obstacle avoidance, which enables the UGV to either slow down or proactively navigate around obstacles based on the relative importance of the environmental state and the goal. A reward function based on velocity obstacles is also designed to guide the UGV to navigate toward the goal while avoiding obstacles. Extensive experiments demonstrate that ALVO outperforms the competing approaches in various dynamic environments. We also implemented our method on a real UGV and showed that it performed well in real-world scenarios.
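
To make the velocity-obstacle concept concrete, the following is a minimal geometric sketch, not ALVO's learned policy or its reward function. It only tests whether a candidate robot velocity lies inside the classic velocity obstacle of one disc-shaped obstacle moving at constant velocity; the function name and the disc-shaped footprint are assumptions.

```python
import numpy as np

def in_velocity_obstacle(p_robot, v_robot, p_obs, v_obs, combined_radius):
    """Return True if keeping v_robot leads to a collision with the obstacle.

    Robot and obstacle are modeled as discs (assumption); combined_radius is
    the sum of both radii. The obstacle is assumed to keep a constant velocity.
    """
    rel_p = np.asarray(p_obs, float) - np.asarray(p_robot, float)
    rel_v = np.asarray(v_robot, float) - np.asarray(v_obs, float)
    # Already overlapping: treat as colliding.
    if np.dot(rel_p, rel_p) <= combined_radius ** 2:
        return True
    # Not closing in on the obstacle: the relative-velocity ray points away.
    closing = np.dot(rel_p, rel_v)
    if closing <= 0.0:
        return False
    # Distance from the obstacle centre to the relative-velocity ray,
    # compared with the combined radius (squared form avoids a sqrt).
    cross = rel_p[0] * rel_v[1] - rel_p[1] * rel_v[0]
    return bool(cross ** 2 < combined_radius ** 2 * np.dot(rel_v, rel_v))

# Heading straight at a static obstacle 5 m ahead is inside the VO.
print(in_velocity_obstacle([0, 0], [1, 0], [5, 0], [0, 0], 1.0))
```

A reward based on velocity obstacles could, for instance, penalize velocities for which this test is true, but the actual form used by ALVO is not stated in the abstract.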

Abstract:
Soft growing robots mimic plant-like growth to navigate complex environments thanks to their specific actuation and material. This class of robots can also be used for manipulation tasks. While manufacturing these robots for specific tasks, it is crucial to carefully design their length and placement of joints. In this work, we extend our state-of-the-art optimizer for planar soft growing manipulators design, which retrieves the optimal robot dimensions for a specific given task. While the first version of the optimizer only considered a base case (where targets were only points in space), in this work, we implement five target handling modalities based on real-case manipulation scenarios. Specifically, targets are treated as obstacles and, as such, occupy space in the environment. Depending on the modality, the way these targets are handled can change. Results show that with this extension, the optimizer can tackle different manipulation cases correctly.

Abstract:
Engineered skeletal muscle tissue (SMT) is an ideal driving unit for achieving fine movements in biosyncretic robots due to its excellent controllability and potentially large driving force. However, the selective stimulation of SMTs continues to pose a significant technical challenge. In this study, we propose a method for the selective stimulation of 3D SMT using thin-film interdigitated electrodes (IDEs). By optimizing the IDE geometry through finite element simulations, the electric field intensity at the electrode fingertips is effectively reduced. The thin-film IDEs are fabricated on a polyester (PET) substrate using screen printing technology and successfully enable selective activation and controlled contraction of the SMTs. Compared to conventional parallel-plate electrodes (PPEs) and rod-shaped electrodes (RSEs), the IDEs significantly improve the electric field distribution and enhance spatial resolution. This advancement provides a promising new approach for achieving high-precision motion control in biosyncretic (or biohybrid) robots.

Abstract:
To provide a natural and immersive interactive experience for teleoperation in the context of human-robot collaboration and interaction, this work introduces a natural human-robot interaction system for teleoperation based on ultrasonic haptic feedback. Specifically, our system can accurately capture an operator's hand movements and replicate these actions on the remote robot with low latency and high fidelity. It utilizes an ultrasonic phased array to achieve non-contact haptic feedback. We propose a dynamic acoustic field customization method for the ultrasonic array based on an interactive feature information image. This method can dynamically adjust the acoustic field according to the operator's hand characteristics, focus on multiple target points in real time, and project them onto the operator's fingertips, thereby providing force-controllable non-contact haptic feedback to the operator. The operator is integrated into the feedback loop of our system, controlling the system through multimodal feedback to form a high-quality human-in-the-loop closed-loop control system. The system's performance is validated in two classic robotic tasks: block pick-and-place and nut-tightening. The experimental results show that the system exhibits excellent accuracy and dexterity, and can complete tasks efficiently while providing a good interactive experience for operators.

Abstract:
Multi-Agent Reinforcement Learning (MARL) has shown promise in solving complex problems involving cooperation and competition among agents, such as an Unmanned Surface Vehicle (USV) swarm used in search and rescue, surveillance, and vessel protection. However, aligning system behavior with user preferences is challenging due to the difficulty of encoding expert intuition into reward functions. To address the issue, we propose a Reinforcement Learning with Human Feedback (RLHF) approach for MARL that resolves credit-assignment challenges through an Agent-Level Feedback system categorizing feedback into intra-agent, inter-agent, and intra-team types. To overcome the challenges of direct human feedback, we employ a Large Language Model (LLM) evaluator to validate our approach using feedback scenarios such as region constraints, collision avoidance, and task allocation. Our method effectively refines USV swarm policies, addressing key challenges in multi-agent systems while maintaining fairness and performance consistency.

Abstract:
We propose a Res-Mlp-based attention mechanism for robotic navigation where dynamic and static obstacles coexist, enhancing collision avoidance and robot navigation. Traditional approaches struggle with predictive foresight and real-world complexity, limiting their applicability. To address these challenges, we introduce an inverse attention-weighted module based on the Res-Mlp module to refine Robot-Human and Robot-Obstacle attention, improving model robustness and sensitivity to hazards. Our approach builds upon HH attn and integrates the Gumbel Social Transformer (GST), enabling more accurate and safer robot navigation. Additionally, we construct a heterogeneous spatio-temporal interaction graph and incorporate diverse obstacle shapes to create realistic scenarios. Using curriculum learning, we improve model efficiency and generalization. Experimental results show a 96% success rate in high-density crowds and 92% in complex environments with diverse obstacles, demonstrating enhanced feature extraction and safer navigation. Our method achieves a balance between conservatism and assertiveness, making it a reliable solution for real-world deployment.

Abstract:
In recent years, numerous studies have been conducted on dialogue robots powered by large language models, enabling sophisticated interactions such as providing guidance and engaging in small talk. However, the interaction performance remains imperfect, and the robots sometimes cause problems during interactions. In this study, we aim to automatically detect such anomalies in human-robot interactions by creating a dataset and developing anomaly detection models. To this end, we created a dataset by manually annotating videos of in-the-wild interactions collected from our field experiment designed to test a framework of parallel conversations in which a human intervenes when a problem occurs in the interaction. Using this dataset, we trained classification models to construct anomaly detection models. We then conducted another field experiment in which the model’s detection results were presented as alerts to operators within the parallel conversation framework. The results confirmed that providing alerts on the basis of the anomaly detection model was useful for facilitating operator intervention.

Abstract:
Humans possess delicate dynamic balance mechanisms that enable them to maintain stability across diverse terrains and under extreme conditions. However, despite significant recent advances, existing locomotion algorithms for humanoid robots still struggle to traverse extreme environments, especially in cases that lack external perception (e.g., vision or LiDAR). This is because current methods often rely on gait-based or perception-condition rewards, lacking effective mechanisms to handle unobservable obstacles and sudden balance loss. To address this challenge, we propose a novel whole-body locomotion algorithm based on dynamic balance and Reinforcement Learning (RL) that enables humanoid robots to traverse extreme terrains, particularly narrow pathways and unexpected obstacles, using only proprioception. Specifically, we introduce a dynamic balance mechanism by leveraging a novel Zero Moment Point (ZMP)-driven reward and task-driven rewards in a whole-body actor-critic framework, aiming to achieve coordinated actions of the upper and lower limbs for robust locomotion. Experiments conducted on a full-sized Unitree H1-2 robot verify the ability of our method to maintain balance on extremely narrow terrains and under external disturbances, demonstrating its effectiveness in enhancing the robot's adaptability to complex environments. The videos are available at https://whole-body-loco.github.io.

Abstract:
Slow-speed animals can also exhibit remarkable capabilities, as seen in snails that crawl while maintaining adhesion. Snails have inspired researchers to develop traveling wave-based robots and suction robots; however, the combination of traveling wave propulsion with suction ability remains a challenge. In this paper, we propose a snail-inspired robot that integrates a corkscrew propulsion mechanism with distributed suction cups, enabling it to crawl upside down on the ceiling. The propulsion model of the corkscrew generating the traveling wave is derived, and a temporal-spatial decomposition method is applied to validate the high efficiency of traveling wave generation. The trade-off between wave amplitude and suction cup depth is investigated to determine an optimized configuration. The results show that the robot’s speed aligns well with the propulsion model. The traveling wave ratio calculated from experiments is 0.938. The optimized configuration consists of a corkscrew with a 14 mm diameter and suction cups with a 2.5 mm depth, achieving a crawling speed of 3.02 ± 0.28 mm/s while moving upside down. The combination of the proposed smooth traveling wave generation method and distributed suction cups enables the robot to crawl upside down while carrying a 200 g load and to climb a vertical wall, like a natural snail.

Abstract:
Current hand exoskeleton interaction methods primarily focus on recognizing a limited range of hand motion intentions and rely on pre-programmed control to execute predefined commands. However, these approaches face significant limitations when confronted with unanticipated or non-predefined scenarios, such as performing various gestures or grasping different objects. To address this challenge, this paper proposes an embodied interaction control (EIC) framework for hand exoskeletons based on a multimodal large language model (MLLM). First, an embodied interaction method leveraging multi-modal fusion of speech and image information is developed, enabling more intuitive, hands-free, accurate, and robust human-robot interaction. By utilizing multi-modal data, the MLLM infers the user’s hand motion intentions and generates corresponding motion plans for the exoskeleton. The underlying control strategy is then used to execute the motion planning. Notably, leveraging the advanced reasoning and code-generation capabilities of MLLMs, the framework can generate undefined gestures and grasping actions. Finally, experimental results validate the effectiveness and generalizability of the EIC framework.

Abstract:
Robotic grasping faces challenges in adapting to objects with varying shapes and sizes. In this paper, we introduce MISCGrasp, a volumetric grasping method that integrates multi-scale feature extraction with contrastive feature enhancement for self-adaptive grasping. We propose a query-based interaction between high-level and low-level features through the Insight Transformer, while the Empower Transformer selectively attends to the highest-level features, which synergistically strikes a balance between focusing on fine geometric details and overall geometric structures. Furthermore, MISCGrasp utilizes multi-scale contrastive learning to exploit similarities among positive grasp samples, ensuring consistency across multi-scale features. Extensive experiments in both simulated and real-world environments demonstrate that MISCGrasp outperforms baseline and variant methods in tabletop decluttering tasks. More details are available at https://miscgrasp.github.io/.

Abstract:
Wearable exoskeleton robots play a crucial role in promoting upper limb function recovery. To enhance human-robot interaction and achieve precise control, continuous prediction of limb joint angles is required. This paper proposes a decoupled network model (VGANet) based on Variable Graph Convolutional Networks (V-GCN) and Temporal External Attention (TEA) for motion prediction in upper limb rehabilitation training. By establishing a mapping relationship between surface electromyography (sEMG) signals and upper limb movements, the model can predict future joint angles based on real-time sEMG signals. Experimental results demonstrate that this method can achieve continuous motion prediction for the shoulder joint and has been successfully applied to the control system of exoskeleton robots, providing an effective solution for the intelligent development of rehabilitation exoskeletons.

Abstract:
In this work, we introduce a predictive control framework enabling shipboard robots to execute highly dynamic maneuvers for motion compensation in rough sea conditions. Such offshore operations pose significant challenges to traditional feedback controllers in maintaining workspace constraints under extreme wave disturbances, while real-time feasibility remains a challenge for model-based planning methods, given the limited predictability of future dynamics and the computational demands of extended planning horizons. To address these challenges, we propose a hierarchical planner and a model predictive controller that integrate real-time deterministic wave forecasting with ship motion prediction to enable anticipatory maneuver planning and execution in dynamic offshore environments. We apply this framework to a Stewart-Gangway system onboard a service operation vessel, featuring a 6-DoF parallel mechanism designed for precise motion control in offshore operations. Numerical experiments demonstrate that our approach significantly outperforms traditional reactive methods in stabilizing shipboard platforms under mild sea states. Most importantly, it effectively extends the operational capabilities of the platform across a broader spectrum of sea conditions. Our study demonstrates how wave forecasting can be leveraged to enhance the operational capabilities of shipboard robotic platforms through predictive control and by exploiting their inherent agility.

Abstract:
The wheel-legged quadruped robot, equipped with leg and end-wheel structures, possesses the capability to traverse continuous surfaces at relatively high speeds while also being able to navigate unstructured terrains. However, designing its controller using traditional methods presents significant challenges, particularly under conditions of limited or lost external environmental perception and highly variable terrain complexity. In light of this issue, this paper proposes a novel, concise, and effective reinforcement learning framework. The framework employs an asymmetric actor-critic structure incorporating a velocity estimation network and leverages multi-contact states generated by a central pattern generator for fusion, thereby training a single control policy to address the robust and flexible traversal of complex terrains by wheel-legged robots relying solely on an inertial measurement unit and joint sensors. Our method enables the modified Unitree Go1-based wheel-legged robot to traverse various challenging terrains, such as steps, high obstacles, rough terrain, and low-adhesion surfaces, while ensuring efficient locomotion performance on smooth and continuous surfaces. The effectiveness of the framework’s training results has been validated through testing in both simulation and real-world environments.

Abstract:
We consider a large-scale data center where a fleet of heterogeneous mobile robots and human workers collaborate to handle various installation and maintenance tasks. We focus on the underlying multi-agent task assignment problem which is crucial to optimize the overall system. We formalize the problem as a Markov Decision Process and propose an end-to-end learning approach to solve it. We demonstrate the effectiveness of our approach in simulation with realistic data and in the presence of uncertainty.

Abstract:
In automated driving, predicting trajectories of surrounding vehicles supports reasoning about scene dynamics and enables safe planning for the ego vehicle. However, existing models handle predictions as an instantaneous task of forecasting future trajectories based on observed information. As time proceeds, the next prediction is made independently of the previous one, which means that the model cannot correct its errors during inference and will repeat them. To alleviate this problem and better leverage temporal data, we propose a novel retrospection technique. Through training on closed-loop rollouts the model learns to use aggregated feedback. Given new observations it reflects on previous predictions and analyzes its errors to improve the quality of subsequent predictions. Thus, the model can learn to correct systematic errors during inference. Comprehensive experiments on nuScenes and Argoverse demonstrate a considerable decrease in minimum Average Displacement Error of up to 31.9% compared to the state-of-the-art baseline without retrospection. We further showcase the robustness of our technique by demonstrating a better handling of out-of-distribution scenarios with undetected road-users.
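
For readers unfamiliar with the reported metric, minimum Average Displacement Error (minADE) is commonly computed as below. This is a generic sketch of the standard definition, not code from the paper; the array shapes and the function name `min_ade` are assumptions.

```python
import numpy as np

def min_ade(pred, gt):
    """Minimum Average Displacement Error over K candidate trajectories.

    pred: (K, T, 2) predicted positions for K modes over T future steps.
    gt:   (T, 2) ground-truth future positions.
    """
    pred = np.asarray(pred, dtype=np.float64)
    gt = np.asarray(gt, dtype=np.float64)
    # Euclidean error per mode and time step, shape (K, T).
    dists = np.linalg.norm(pred - gt[None], axis=-1)
    # Average over time for each mode, then keep the best mode.
    return float(dists.mean(axis=1).min())

gt = np.array([[0.0, 0.0], [1.0, 0.0], [2.0, 0.0]])
pred = np.stack([gt + [0.0, 1.0], gt + [0.0, 0.5]])  # two hypothetical modes
print(min_ade(pred, gt))  # 0.5
```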

Abstract:
Accurate mass estimation of table-top grown strawberries under field conditions remains challenging due to frequent occlusions and pose variations. This study proposes a vision-based pipeline integrating RGB-D sensing and deep learning to enable non-destructive, real-time, online mass estimation. The method employs YOLOv8-Seg for instance segmentation, a Cycle-Consistent generative adversarial network (CycleGAN) for occluded region completion, and tilt-angle correction to refine frontal projection area calculations. A polynomial regression model then maps the geometric features to mass. Experiments demonstrated mean mass estimation errors of 8.11% for non-occluded strawberries and 10.47% for occluded cases. CycleGAN outperformed the large mask inpainting (LaMa) model in occlusion recovery, achieving superior pixel area ratios (PAR) (mean: 0.978 vs. 1.112) and higher intersection over union (IoU) scores (92.3% vs. 47.7% in the [0.9–1] range). This approach addresses critical limitations of traditional methods, offering a robust solution for automated harvesting and yield monitoring with complex occlusion patterns.
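
The last two stages of the pipeline (tilt-angle correction followed by polynomial regression) can be sketched as follows. The abstract does not specify the correction formula, the polynomial degree, or the features used, so the cosine foreshortening model, the degree of 2, the helper names, and all numbers below are hypothetical and serve only to illustrate the idea.

```python
import numpy as np

def correct_area(projected_area, tilt_rad):
    """Undo foreshortening of a tilted fruit (assumed cosine model)."""
    return projected_area / np.cos(tilt_rad)

def fit_mass_model(areas, masses, degree=2):
    """Least-squares polynomial mapping area -> mass (degree is an assumption)."""
    return np.polyfit(areas, masses, degree)

def predict_mass(coeffs, area):
    return np.polyval(coeffs, area)

# Synthetic calibration data (made up, not measurements from the paper).
areas = np.linspace(4.0, 16.0, 7)              # frontal area, cm^2
masses = 0.05 * areas**2 + 0.8 * areas + 1.0   # mass, g

coeffs = fit_mass_model(areas, masses)
area = correct_area(5.0, np.deg2rad(60.0))     # 5 cm^2 seen at 60 deg tilt
print(round(float(predict_mass(coeffs, area)), 2))
```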

Abstract:
Robots whose shape and stiffness are determined by internal forces generally have complex shape-stiffness relationships that depend on their structure. As a result, there are difficulties such as a decrease in shape reproducibility when the robot is not stiff, and a decrease in the range of motion when the robot is stiff. In this study, we propose a motion planning method that balances shape and stiffness by learning forward and inverse kinematics using a stochastic neural network (NN) and using the uncertainty that can be evaluated by the NN. Through experiments using a tensegrity manipulator with 40 actuators and 20 degrees of freedom in bending posture, we verify the validity of the proposed method.

Abstract:
LiDAR semantic segmentation is crucial for autonomous vehicles and mobile robots, requiring high accuracy and real-time processing, especially on resource-constrained embedded systems. Previous state-of-the-art methods often face a trade-off between accuracy and speed. Point-based and sparse convolution-based methods are accurate but slow due to the complexity of neighbor searching and 3D convolutions. Projection-based methods are faster but lose critical geometric information during the 2D projection. Additionally, many recent methods rely on test-time augmentation (TTA) to improve performance, which further slows the inference. Moreover, the pre-processing phase across all methods increases execution time and is demanding on embedded platforms. Therefore, we introduce HARP-NeXt, a high-speed and accurate LiDAR semantic segmentation network. We first propose a novel pre-processing methodology that significantly reduces computational overhead. Then, we design the Conv-SE-NeXt feature extraction block to efficiently capture representations without deep layer stacking per network stage. We also employ a multi-scale range-point fusion backbone that leverages information at multiple abstraction levels to preserve essential geometric details, thereby enhancing accuracy. Experiments on the nuScenes and SemanticKITTI benchmarks show that HARP-NeXt achieves a superior speed-accuracy trade-off compared to all state-of-the-art methods, and, without relying on ensemble models or TTA, is comparable to the top-ranked PTv3, while running 24× faster. The code is available at https://github.com/SamirAbouHaidar/HARP-NeXt

Abstract:
Accurate dynamic modeling and external force estimation are crucial for high-precision robot control and applications. However, model incompleteness and external disturbances inevitably lead to a residual between the actual joint torque and the torque calculated by the identified dynamic model. To address this, this paper proposes a hierarchical fusion framework. First, a multi-layer perceptron neural network (MLPNN) is employed to systematically compensate for these joint torque residuals. Subsequently, a generalized momentum-based third-order external force observer is designed to enhance the accuracy of estimating external forces acting on the manipulator. This approach retains the interpretability inherent in physics-based models while augmenting generalization capability through data-driven correction. The advantages of the third-order external force observer are substantiated via comparative analysis with first- and second-order observers on a Simulink simulation platform using a 2-DOF planar manipulator. Furthermore, the effectiveness of the proposed method is validated through a dragging experiment conducted on a 6-DOF manipulator without an end-effector force/torque sensor, demonstrating its performance in practical applications.
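
As background for the observer comparison, the sketch below simulates the textbook first-order generalized momentum observer on a single joint with constant inertia. It is deliberately the simplest baseline, not the third-order observer proposed in the paper, and the plant parameters, gain, and function name are hypothetical.

```python
def simulate_momentum_observer(tau_ext=2.0, gain=50.0, inertia=1.5,
                               dt=1e-3, steps=2000):
    """First-order momentum observer on a 1-DOF joint (illustrative only).

    Plant:    inertia * dv/dt = tau_cmd + tau_ext
    Observer: r = gain * (p - p0 - integral(tau_cmd + r) dt), with p = inertia * v
    The residual r then follows dr/dt = gain * (tau_ext - r).
    """
    v = 0.0          # joint velocity
    integral = 0.0   # running integral of (tau_cmd + r)
    r = 0.0          # residual, i.e. the external torque estimate
    tau_cmd = 0.5    # commanded torque (hypothetical constant)
    p0 = inertia * v
    for _ in range(steps):
        # True plant, including the unknown external torque.
        v += dt * (tau_cmd + tau_ext) / inertia
        # Observer integrates the modeled momentum rate plus its own residual.
        integral += dt * (tau_cmd + r)
        r = gain * (inertia * v - p0 - integral)
    return r

print(round(simulate_momentum_observer(), 4))  # converges to tau_ext = 2.0
```

With this structure the estimate behaves like a first-order low-pass filter of the external torque; higher-order observers such as the one proposed in the paper modify these filter dynamics.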

Abstract:
The steerability of a catheter is critical to the success of an interventional procedure. In this paper, a magnetic continuum robot is assumed to be mounted at the distal end of a catheter to pull it through the narrow, bifurcated, tortuous pathways of the blood vessels. The continuum robot is actuated by a permanent magnet. It is linear and soft, which complicates its interaction with the magnet but allows omnidirectional deflection at its tip and thereby improves its steerability. In the non-uniform field generated by a permanent magnet, a magnetic force acts on the robot in addition to the magnetic torque, and both are used to derive the dynamic equations governing the interaction between the robot and the magnet, based on Euler-Bernoulli beam theory. An iterative algorithm for calculating the magnet pose is proposed to generate an optimal moving magnetic field and actuate the robot so that its tip follows the planned trajectory. A robot prototype is fabricated, and the experimental results show that it can accurately track the planned trajectories in 2D space when the magnet is manipulated by a robot arm.

Abstract:
In this paper, we present a user-friendly LiDAR-camera calibration toolkit that is compatible with various LiDAR and camera sensors and requires only a single pair of laser points and a camera image in targetless environments. Our approach eliminates the need for an initial transform and remains robust even with large positional and rotational LiDAR-camera extrinsic parameters. We employ the Gluestick pipeline to establish 2D-3D point and line feature correspondences for a robust and automatic initial guess. To enhance accuracy, we quantitatively analyze the impact of feature distribution on calibration results and adaptively weight the cost of each feature based on these metrics. As a result, extrinsic parameters are optimized by filtering out the adverse effects of inferior features. We validated our method through extensive experiments across various LiDAR-camera sensors in both indoor and outdoor settings. The results demonstrate that our method provides superior robustness and accuracy compared to SOTA techniques. Our code is open-sourced on GitHub to benefit the community.

Abstract:
We present a novel method, AutoSpatial, an efficient approach with structured spatial grounding to enhance VLMs’ spatial reasoning. By combining minimal manual supervision with large-scale Visual Question-Answering (VQA) pairs auto-labeling, our approach tackles the challenge of VLMs’ limited spatial understanding in social navigation tasks. By applying a hierarchical two-round VQA strategy during training, AutoSpatial achieves both global and detailed understanding of scenarios, demonstrating more accurate spatial perception, movement prediction, Chain of Thought (CoT) reasoning, final action, and explanation compared to other SOTA approaches. These five components are essential for comprehensive social navigation reasoning. Our approach was evaluated using both expert systems (GPT-4o, Gemini 2.0 Flash, and Claude 3.5 Sonnet) that provided cross-validation scores and human evaluators who assigned relative rankings to compare model performances across four key aspects. Augmented by the enhanced spatial reasoning capabilities, AutoSpatial demonstrates substantial improvements by averaged cross-validation score from expert systems in: perception & prediction (up to 10.71%), reasoning (up to 16.26%), action (up to 20.50%), and explanation (up to 18.73%) compared to baseline models trained only on manually annotated data.

Abstract:
This paper introduces Q8bot 2, an open-source, miniature quadruped designed for robotics research and education. We present the robot’s novel, zero-wire design methodology, which leads to its superior form factor, robustness, replicability, and high performance. With a size and weight similar to a modern smartphone, this standalone robot can walk for an hour on a single battery charge and can survive meter-high drops with simple repairs. Its $300 bill of materials contains minimal off-the-shelf components, readily available custom electronics from online vendors, and structural parts that can be manufactured on hobbyist 3D printers. A preliminary user assembly study confirms that Q8bot can be easily replicated, with an average assembly time of under one hour by a single person. With heuristic open-loop control, Q8bot is capable of a stable walking speed of 5.4 body lengths per second and a turning speed of 5 radians per second, along with other dynamic movements such as jumping and climbing moderate slopes.

Abstract:
Vision-based robotic object grasping is typically investigated in the context of isolated objects or unstructured object sets in bin picking scenarios. However, there are several settings, such as construction or warehouse automation, where a robot needs to interact with a structured object formation such as a stack. In this context, we define the problem of selecting suitable objects for grasping along with estimating an accurate 6DoF pose of these objects. To address this problem, we propose a camera-IMU based approach that prioritizes unobstructed objects on the higher layers of stacks and introduce a dataset for benchmarking and evaluation, along with a suitable evaluation metric that combines object selection with pose accuracy. Experimental results show that although our method can perform quite well, this is a challenging problem if a completely error-free solution is needed. Finally, we show results from the deployment of our method for a brick-picking application in a construction scenario.

Abstract:
Model-based robot control requires an accurate dynamics model, and a machine learning-based method can extract robot dynamics from motion data collected in simulation and experiments. A Gaussian process (GP) has been used as one of the learning methods to obtain robot dynamics. To avoid large training datasets for learning robot dynamics, we propose an active training data selection strategy. The data sampling criterion is to minimize the probability density difference between the actual model and the GP-based estimate. Using this criterion, the active training data strategy identifies where to sample the next data point for model training. We demonstrate the proposed active learning strategy with a 3-link robot arm in both fully actuated and underactuated modes. With the selected dataset containing 150 data points, the integrated probability density error compared with the entire dataset (over 30,000 data points) is less than 0.3. The experimental results confirm that the GP-based control performance is better than that of the model-based control.

Abstract:
In this study, we propose and implement a method for grasping only the topmost layer of stacked fabrics with a dual-arm robot and placing it at a target position. The proposed method employs a three-finger robot hand consisting of one adhesive finger and two non-adhesive grasping fingers to grasp only the top fabric layer. After grasping, feature matching is performed to determine the rotation angle and translation vector required to align the fabric with the target position. The fabric is then moved while being pressed by the fingers. The proposed method was implemented on a robot and validated through experiments using five different fabrics with varying structures, surface materials, masses, and thicknesses, thereby confirming its effectiveness.

Abstract:
Accurate and reliable online real-time localization and mapping are crucial for autonomous robot navigation. Dynamic objects within the perception field can affect the accuracy of registration and localization, and also introduce ghost trail artifacts in the map, hindering robot planning and decision-making. While semantic segmentation can assist in perceiving object categories, it struggles to accurately segment moving objects. In this paper, we present SOLO-SMap, a real-time localization and static map construction framework based solely on LiDAR point clouds. We leverage semantic inference to identify potential dynamic points. Instance-level removal of truly dynamic points is then achieved in the pre-alignment stage by applying geometric rules based on moving-point occlusion relationships and multi-object tracking (MOT) within a nearby temporal window. This design preserves stable static constraints while adhering to the static world model assumption of SLAM systems, benefiting accuracy and reducing drift, particularly in busy intersections. We evaluated the performance of SOLO-SMap in dynamic scenes on KITTI datasets and our self-made datasets, and conducted a comprehensive comparison with other methods, validating the effectiveness and robustness of the proposed method. A supplementary video can be accessed at https://www.youtube.com/watch?v=x-VKr3ag03M.

Abstract:
Considerable advancements have been achieved in SLAM methods tailored for structured environments, yet their robustness under challenging corner cases remains a critical limitation. Although multi-sensor fusion approaches integrating diverse sensors have shown promising performance improvements, the research community faces two key barriers. On the one hand, the lack of standardized and configurable benchmarks that systematically evaluate SLAM algorithms under diverse degradation scenarios hinders comprehensive performance assessment. On the other hand, existing SLAM frameworks primarily focus on fusing a limited set of sensor types, without effectively addressing adaptive sensor selection strategies for varying environmental conditions. To bridge these gaps, we make three key contributions. First, we introduce the M3DGR dataset: a sensor-rich benchmark with systematically induced degradation patterns including visual challenge, LiDAR degeneracy, wheel slippage and GNSS denial. Second, we conduct a comprehensive evaluation of forty SLAM systems on M3DGR, providing critical insights into their robustness and limitations under challenging real-world conditions. Third, we develop a resilient modular multi-sensor fusion framework named Ground-Fusion++, which demonstrates robust performance by coupling GNSS, RGB-D, LiDAR, IMU (Inertial Measurement Unit) and wheel odometry. Code and datasets are publicly available.

Abstract:
In this work, we model and control a cable-driven, remote-actuated system that includes both friction and compliance in its dynamics. The control objective is to solve a regulation problem using a Model Predictive Controller (MPC). Unlike the flexible-joint robot models, which typically assume frictionless compliant elements, the proposed model incorporates friction forces between two compliant cable-sheaths that connect the motor to the driven link. Three controllers are developed based on the cascade control principles integrated with the MPC framework. Their performance is evaluated through both simulations and experiments on a custom-designed testbed. The results demonstrate that the MPC-Cascade control scheme achieves the best overall performance, with fast convergence and low control effort.

Abstract:
Robotic instruction following tasks require seamless integration of visual perception, task planning, target localization, and motion execution. However, existing task planning methods for instruction following are either data-driven or underperform in zero-shot scenarios due to difficulties in grounding lengthy instructions into actionable plans under operational constraints. To address this, we propose FlowPlan, a structured multi-stage LLM workflow that elevates zero-shot pipeline and bridges the performance gap between zero-shot and data-driven in-context learning methods. By decomposing the planning process into modular stages—task information retrieval, language-level reasoning, symbolic-level planning, and logical evaluation—FlowPlan generates logically coherent action sequences while adhering to operational constraints and further extracts contextual guidance for precise instance-level target localization. Benchmarked on ALFRED and validated in real-world applications, our method achieves competitive performance relative to data-driven in-context learning methods and demonstrates adaptability across diverse environments. This work advances zero-shot task planning in robotic systems without reliance on labeled data. Project website: https://instruction-following-project.github.io/.

Abstract:
Model Predictive Path Integral (MPPI) controllers are drawing increasing attention for their ability to efficiently handle complex systems by leveraging GPU acceleration while supporting flexible prediction models and cost functions. However, their performance generally degrades with low-quality prediction models and unknown external disturbances. Existing methods that rely solely on feedforward disturbance compensation are limited by the assumption of matched disturbances, which rarely holds in practice due to the complex lumped disturbances. To this end, we propose a novel Disturbance-Aware (DA-) MPPI framework, which seamlessly integrates an Extended high-order Sliding Mode Observer (ESMO) into MPPI. The ESMO provides accurate estimates of uncertainties and external disturbances, which are directly incorporated into the MPPI rolling dynamics to improve prediction and therefore tracking control performance. The proposed algorithm is verified against the baseline MPPI in the AirSim simulation environment by stochastic simulation. Comparative statistical experiments show that incorporating ESMO within the MPPI framework significantly enhances tracking performance, with RMSE reductions in terms of mean of 8.0%, 17.7%, 6.17%, and 12.9% and in terms of standard variance of 11.5%, 26.0%, 10.4%, and 9.2% in four representative scenarios. The effects of target velocity and prediction horizon on control performance are also systematically evaluated. These results validate the robustness and accuracy of the DA-MPPI controller in complex and uncertain environments.

Abstract:
Robotic middleware is fundamental to ensuring reliable communication among system components and is crucial for intelligent robotics, autonomous vehicles, and smart manufacturing. However, existing robotic middleware often struggles to meet the diverse communication demands, optimize data transmission efficiency, and maintain scheduling determinism between Orin computing units in large-scale L4 autonomous vehicle deployments. This paper presents RIMAOS2C, a service discovery-based hybrid network communication middleware designed to tackle these challenges. By leveraging multi-level service discovery multicast, RIMAOS2C supports a wide variety of communication modes, including multiple cross-chip Ethernet protocols and PCIe communication capabilities. The core mechanism of the middleware, the Message Bridge, optimizes data flow forwarding and employs shared memory for centralized message distribution, reducing message redundancy and minimizing transmission delay uncertainty, thus improving both communication efficiency and scheduling stability. Tested and validated on L4 vehicles and Jetson Orin domain controllers, RIMAOS2C leverages TCP-based ZeroMQ to overcome the large-message transmission bottleneck inherent in native CyberRT middleware. In scenarios involving two cross-chip subscribers, RIMAOS2C eliminates message redundancy, enhancing transmission efficiency by 36%–40% for large data transfers while reducing callback time differences by 42%–906%. This research advances the communication capabilities of robotic operating systems and introduces a novel approach to optimizing communication in distributed computing architectures for autonomous driving systems.

Abstract:
Autonomous exploration in unknown environments requires estimating the information gain of an action to guide planning decisions. While prior approaches often compute information gain at discrete waypoints, pathwise integration offers a more comprehensive estimation but is often computationally challenging or infeasible and prone to overestimation. In this work, we propose the Pathwise Information Gain with Map Prediction for Exploration (PIPE) planner, which integrates cumulative sensor coverage along planned trajectories while leveraging map prediction to mitigate overestimation. To enable efficient pathwise coverage computation, we introduce a method to efficiently calculate the expected observation mask along the planned path, significantly reducing computational overhead. We validate PIPE on real-world floorplan datasets, demonstrating its superior performance over state-of-the-art baselines. Our results highlight the benefits of integrating predictive mapping with pathwise information gain for efficient and informed exploration. Website: pipe-planner.github.io
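
The idea of accumulating sensor coverage along a path, rather than scoring waypoints in isolation, can be illustrated with a toy grid example. This is only a sketch under strong assumptions and not the PIPE implementation: the disk-shaped sensor footprint, the absence of occlusion and map prediction, and the names `visible_cells`, `pathwise_gain`, and `summed_waypoint_gain` are all hypothetical.

```python
def visible_cells(pose, unknown, radius):
    """Unknown cells inside a disk of the given radius around the pose
    (assumed sensor model; occlusion is ignored)."""
    x0, y0 = pose
    return {(x, y) for (x, y) in unknown
            if (x - x0) ** 2 + (y - y0) ** 2 <= radius ** 2}


def pathwise_gain(path, unknown, radius):
    """Count each unknown cell once, however many poses observe it."""
    seen = set()
    for pose in path:
        seen |= visible_cells(pose, unknown, radius)
    return len(seen)


def summed_waypoint_gain(path, unknown, radius):
    """Score each waypoint independently and add the scores,
    so cells seen from several poses are counted repeatedly."""
    return sum(len(visible_cells(p, unknown, radius)) for p in path)


unknown = {(x, 0) for x in range(5)}   # a 5-cell corridor of unknown space
path = [(1, 0), (2, 0)]                # two adjacent poses with overlapping views
print(pathwise_gain(path, unknown, 1), summed_waypoint_gain(path, unknown, 1))
```

In this toy case the two poses share two cells, so the union counts four cells while the per-waypoint sum counts six.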

Abstract:
Autonomous flight in unknown environments requires precise spatial and temporal trajectory planning, often involving computationally expensive nonconvex optimization prone to local optima. To overcome these challenges, we present the Neural-Enhanced Trajectory Planner (NEO-Planner), a novel approach that leverages a Neural Network (NN) Planner to provide informed initial values for trajectory optimization. The NN-Planner is trained on a dataset generated by an expert planner using batch sampling, capturing multimodal trajectory solutions. It learns to predict spatial and temporal parameters for trajectories directly from raw sensor observations. NEO-Planner starts optimization from these predictions, accelerating computation speed while maintaining explainability. Furthermore, we introduce a robust online replanning framework that accommodates planning latency for smooth trajectory tracking. Extensive simulations demonstrate that NEO-Planner reduces optimization iterations by 20%, leading to a 26% decrease in computation time compared with pure optimization-based methods. It maintains trajectory quality comparable to baseline approaches and generalizes well to unseen environments. Real-world experiments validate its effectiveness for autonomous drone navigation in cluttered, unknown environments. Code: https://github.com/Amos-Chen98/neo-planner Video: https://youtu.be/UoroRe-euDk

Abstract:
Ultraviolet (UV) germicidal radiation is an established non-contact method for surface disinfection in medical environments. Traditional approaches require substantial human intervention to define disinfection areas, complicating automation, while deep learning-based methods often need extensive fine-tuning and large datasets, which can be impractical for large-scale deployment. Additionally, these methods often do not address scene understanding for partial surface disinfection, which is crucial for avoiding unintended UV exposure. We propose a solution that leverages foundation models to simplify surface selection for manipulator-based UV disinfection, reducing human involvement and removing the need for model training. Additionally, we propose a VLM-assisted segmentation refinement to detect and exclude thin and small non-target objects, showing that this reduces mis-segmentation errors. Our approach achieves over 92% success rate in correctly segmenting target and non-target surfaces, and real-world experiments with a manipulator and simulated UV light demonstrate its practical potential for real-world applications.

Abstract:
In this paper, we present an autonomous control system for Unmanned Aerial Vehicles (UAVs), specifically designed to inspect a detected suspicious vessel and capture information-rich images in a maritime environment. The maritime environment is ever-changing and uncertain, making it challenging to perform maritime monitoring tasks efficiently and reliably. The proposed UAV control system consists of multiple modules, including path planning, vessel searching, image processing, and image optimization. A novel image optimization approach utilizing deep reinforcement learning (DRL) is proposed to enhance the quality of the captured images by jointly controlling the movement of the UAV and camera orientation. The effectiveness and efficiency of the proposed system were validated and evaluated by searching the vessel and optimizing the captured images in the self-developed simulation environment in Gazebo.

Affiliations: Department of Electrical and Electronic Engineering, The University of Hong Kong, Hong Kong, China; Shenzhen International Center For Industrial And Applied Mathematics, Shenzhen Research Institute of Big Data, Shenzhen, China; Shenzhen Institutes of Advanced Technology, Chinese Academy of Sciences, Shenzhen, China; Department of Computer Science and Engineering, Southern University of Science and Technology, Shenzhen, China

Abstract:
Street Scene Semantic Understanding (denoted as S3U) is a crucial but complex task for autonomous driving (AD) vehicles. Their inference models typically face poor generalization due to domain shift. Federated Learning (FL) has emerged as a promising paradigm for enhancing the generalization of AD models through privacy-preserving distributed learning. However, these FL AD models face significant temporal catastrophic forgetting when deployed in dynamically evolving environments, where continuous adaptation causes abrupt erosion of historical knowledge. This paper proposes Federated Exponential Moving Average (FedEMA), a novel framework that addresses this challenge through two integral innovations: (I) server-side preservation of the model’s historical fitting capability by fusing the current FL round’s aggregated model with a proposed exponential moving average (EMA) model from the previous FL round; (II) vehicle-side negative entropy regularization to prevent FL models from overfitting to EMA-introduced temporal patterns. Together, these two strategies give FedEMA a dual-objective optimization that balances model generalization and adaptability. In addition, we conduct a theoretical convergence analysis for the proposed FedEMA. Extensive experiments on both the Cityscapes and CamVid datasets demonstrate FedEMA’s superiority over existing approaches, showing 7.12% higher mean Intersection-over-Union (mIoU).
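
The server-side step described above, combining the current round's aggregate with an EMA of earlier rounds, can be sketched as follows. This is a minimal illustration and not the FedEMA algorithm itself: the convex-combination fusion rule, the parameters `decay` and `alpha`, the plain unweighted averaging in `fedavg`, and the choice to update the EMA with the aggregate are all assumptions, and models are reduced to flat lists of floats.

```python
def fedavg(client_weights):
    """Unweighted average of client parameter vectors (assumed aggregation rule)."""
    n = len(client_weights)
    return [sum(ws) / n for ws in zip(*client_weights)]


def ema_update(ema_weights, new_weights, decay=0.9):
    """Exponential moving average: keep `decay` of the history, blend in the rest."""
    return [decay * e + (1.0 - decay) * w
            for e, w in zip(ema_weights, new_weights)]


def fuse(aggregated, ema_weights, alpha=0.5):
    """Hypothetical fusion: convex combination of the current aggregate
    and the previous round's EMA model."""
    return [alpha * a + (1.0 - alpha) * e
            for a, e in zip(aggregated, ema_weights)]


ema_prev = [0.0, 0.0]                        # EMA model carried over from earlier rounds
clients = [[1.0, 2.0], [3.0, 4.0]]           # parameters uploaded by two vehicles
aggregated = fedavg(clients)                 # current round's aggregate
global_model = fuse(aggregated, ema_prev)    # model sent back to the vehicles
ema_prev = ema_update(ema_prev, aggregated)  # EMA carried into the next round
print(global_model, ema_prev)
```

A larger `decay` makes the EMA forget more slowly, which is the knob that trades retention of historical knowledge against adaptation in this simplified picture.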

Abstract:
Unmanned Aerial Vehicles (UAVs) are gaining attention for inspections due to their improved safety, efficiency, and accuracy, alongside reduced costs and environmental risks. Visual servoing is crucial for autonomous UAV flight in GPS-degraded environments, guiding the UAV by minimizing errors between observed and desired visual features. This study focuses on Image-Based Visual Servoing (IBVS) control for quadrotor UAVs under complex dynamics and environmental disturbances. A nonlinear model predictive control (MPC) framework is first integrated with visual servoing to handle dynamics nonlinearity, control optimality, and constraints. To address uncertainties and disturbances, a Generalized Extended State Observer (GESO) is incorporated into the MPC, forming the Disturbance-Resilient (DR-) MPC. The GESO estimates the lumped disturbance to improve model predictions within the MPC horizon. The proposed algorithm is validated in a realistic Gazebo environment for UAV pipeline inspection in 3D scenarios, showing better control accuracy and reduced inspection time compared to three baseline methods: IBVS, IBVS-MPC(K) with kinematics, and IBVS-MPC(D) with dynamics.

Abstract:
The number of motors directly influences the dexterity, size, and cost of a robotic hand. In this paper, we present MuxHand, a robotic hand that utilizes a time-division multiplexing motor (TDMM) mechanism. This system enables independent control of 9 cables with just 4 motors, significantly reducing both cost and size while maintaining high dexterity. To enhance stability and smoothness during grasping and manipulation tasks, we integrate magnetic joints into the three 3D-printed fingers. These joints provide impact resistance and resetting capabilities. The three fingers together have a total of 30 degrees of freedom (DOF), 18 of which are passive DOF, allowing the hand to conform closely to the surface of an object during grasping. We conduct a series of experiments to assess the performance parameters of MuxHand, including its grasping and manipulation capabilities. The results show that the TDMM mechanism precisely controls each cable connected to the finger joints, enabling robust grasping and dexterous manipulation. Furthermore, compared to the traditional approach of assigning a motor to each active DOF, the cost is reduced by 42.06%. The maximum load of a single finger reaches 7.0 kg, the maximum load at the finger joint root is 12.0 kg, the maximum driving force at the joint root is 5.0 kg, and the maximum fingertip force is 10.0 N.

Abstract:
Quadruped manipulators require precise detection of external impact forces to ensure safe and compliant responses during environmental interactions. However, these systems often lack tactile sensors on their body surfaces or force/torque sensors at critical joints. This study introduces a whole-body admittance control framework for quadruped manipulators, utilizing a novel external impact force observer that estimates impact forces acting on the manipulator or the quadruped’s base without relying on dedicated force sensors. The observer leverages the robustness of a super-twisting algorithm (STA) based on the momentum model of quadruped manipulators. Model uncertainties are mitigated using a low-pass filter (LPF) and compensated by ground reaction forces, significantly reducing estimation oscillations during dynamic gaits. By integrating these estimated impact forces, the whole-body admittance control framework enables compliant interactions with the environment and mitigates unsafe behaviors caused by torque saturation through a set-valued feedback loop that constrains command torques within actuation limits, including joint torque boundaries and friction cone constraints of the ground reaction force. Experimental validation across diverse scenarios confirms the effectiveness of this approach in facilitating safe and adaptive interactions between quadruped manipulators and external forces.

Abstract:
Accurate 3D reconstruction is crucial for AR and VR applications. Compared with traditional pinhole camera-based methods, 360° image-based reconstruction can achieve higher precision with fewer input images, making it especially effective in low-texture environments. However, the severe distortion resulting from the wide field of view complicates feature extraction and matching, leading to geometric inconsistencies in multi-view reconstruction. To address these challenges, we propose 360Recon, a novel multi-view stereo (MVS) algorithm specifically designed for equirectangular projection (ERP) images. With the proposed spherical feature extraction module mitigating distortion, 360Recon integrates a 3D cost volume with multi-scale ERP features to deliver high-precision scene reconstruction while preserving local geometric consistency. Experimental results demonstrate that 360Recon outperforms existing methods in terms of accuracy, computational efficiency, and generalization capability. The source code will be released at https://github.com/LeonATP/360Recon.

Affiliations: Department of Control Science and Engineering, College of Electronic and Information Engineering, and Shanghai Institute of Intelligent Science and Technology, Tongji University, Shanghai, China; Department of Artificial Intelligence and Computer Science, School of Computer Science, University of Birmingham, Birmingham, United Kingdom; Robotics and Microsystems Center, School of Mechanical and Electric Engineering, Soochow University, Suzhou, China; School of Oriental Pan-Vascular Devices Innovation College, University of Shanghai for Science and Technology, Shanghai, China; Department of Electronic Engineering, Faculty of Engineering, The Chinese University of Hong Kong, Hong Kong SAR, China

Abstract:
Accurate three-dimensional (3D) reconstruction of guidewire shapes is crucial for precise navigation in robot-assisted endovascular interventions. Conventional 2D Digital Subtraction Angiography (DSA) is limited by the absence of depth information, leading to spatial ambiguities that hinder reliable guidewire shape sensing. This paper introduces a novel multimodal framework for real-time 3D guidewire reconstruction, combining preoperative 3D Computed Tomography Angiography (CTA) with intraoperative 2D DSA images. The method utilizes robust feature extraction to address noise and distortion in 2D DSA data, followed by deformable image registration to align the 2D projections with the 3D CTA model. Subsequently, the inverse projection algorithm reconstructs the 3D guidewire shape, providing real-time, accurate spatial information. This framework significantly enhances spatial awareness for robotic-assisted endovascular procedures, effectively bridging the gap between preoperative planning and intraoperative execution. The system demonstrates notable improvements in real-time processing speed, reconstruction accuracy, and computational efficiency. The proposed method achieves a projection error of 1.76±0.08 pixels and a length deviation of 2.93±0.15%, with a frame rate of 39.3±1.5 frames per second (FPS). These advancements have the potential to optimize robotic performance and increase the precision of complex endovascular interventions, ultimately contributing to better clinical outcomes.

Abstract:
Image matching is a fundamental task in computer vision, underpinning applications such as visual localization and structure-from-motion. While deep convolutional neural network (CNN)-based approaches have achieved high detection accuracy, their high computational cost limits their deployment on resource-constrained platforms such as mobile and embedded systems. This paper presents a lightweight image matching network that achieves a favorable trade-off between accuracy and efficiency. The proposed model further enhances robustness to large image rotations, a common challenge in aerial and robotics applications. Extensive experiments demonstrate that our method maintains competitive accuracy while significantly reducing inference time compared to existing CNN-based approaches.

Abstract:
This paper presents a novel high-level task planning and optimal coordination framework for autonomous masonry construction, using a team of heterogeneous aerial robotic workers, consisting of agents with separate skills for brick placement and mortar application. This introduces new challenges in scheduling and coordination, particularly due to the mortar curing deadline required for structural bonding and ensuring the safety constraints among UAVs operating in parallel. To address this, an automated pipeline generates the wall construction plan based on the available bricks while identifying static structural dependencies and potential conflicts for safe operation. The proposed framework optimizes UAV task allocation and execution timing by incorporating dynamically coupled precedence deadline constraints that account for the curing process and static structural dependency constraints, while enforcing spatio-temporal constraints to prevent collisions and ensure safety. The primary objective of the scheduler is to minimize the overall construction makespan while minimizing logistics, traveling time between tasks, and the curing time to maintain both adhesion quality and safe workspace separation. The effectiveness of the proposed method in achieving coordinated and time-efficient aerial masonry construction is extensively validated through Gazebo simulated missions. The results demonstrate the framework’s capability to streamline UAV operations, ensuring both structural integrity and safety during the construction process. A video of the framework is available at https://youtu.be/kGvFGDCUkDQ

Abstract:
Many real-world applications require legged robots to be able to carry variable payloads. Model-based controllers such as model predictive control (MPC) have become the de facto standard in research for controlling these systems. However, most model-based control architectures use fixed plant models, which limits their applicability to different tasks. In this paper, we present a Kalman filter (KF) formulation for online identification of the mass and center of mass (COM) of a four-legged robot. We evaluate our method on a quadrupedal robot carrying various payloads and find that it is more robust to strong measurement noise than classical recursive least squares (RLS) methods. Moreover, it improves the tracking performance of the model-based controller with varying payloads when the model parameters are adjusted at runtime.
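
To give a feel for online parameter identification with a Kalman filter, the scalar sketch below estimates only a mass from force and acceleration samples. It is a simplified stand-in and not the formulation of the paper: the one-dimensional measurement model `force = accel * mass`, the random-walk process model, the noise values `q` and `r`, and the function name `kf_mass_update` are assumptions, and COM estimation is left out entirely.

```python
def kf_mass_update(m, P, accel, force, q=1e-4, r=1.0):
    """One predict/update cycle of a scalar Kalman filter for the mass.

    m, P  : current mass estimate and its variance
    accel : known acceleration term (e.g. gravity plus body acceleration)
    force : measured total force along the same axis
    q, r  : assumed process and measurement noise variances
    """
    P = P + q                      # predict: mass modelled as a slow random walk
    H = accel                      # measurement model: force = H * m
    S = H * P * H + r              # innovation variance
    K = P * H / S                  # Kalman gain
    m = m + K * (force - H * m)    # correct the estimate with the innovation
    P = (1.0 - K * H) * P          # update the variance
    return m, P


true_mass, g = 12.0, 9.81          # hypothetical robot-plus-payload mass in kg
m_est, P_est = 10.0, 1.0           # initial guess: nominal mass without payload
for _ in range(50):
    m_est, P_est = kf_mass_update(m_est, P_est, g, true_mass * g)
print(round(m_est, 3))
```

Keeping `q` small but nonzero lets the estimate keep tracking if the payload changes later, at the cost of slightly noisier steady-state estimates.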

Abstract:
In robotics applications, event cameras provide low-latency and high-dynamic-range sensing by asynchronously detecting brightness changes, making them well-suited for capturing fast motions and subtle cues in dynamic environments. However, most existing Spiking Neural Network (SNN)-based methods enhance spatial information by stacking multiple frames of events, while neglecting the explicit modeling of high- and low-frequency components in the event stream. To address this limitation, we propose a 3D Wavelet Spiking Neural Network (3DWSNet), which integrates a 3D wavelet transform with a cascaded Wavelet Spiking Convolution (WSC) module as its core. Specifically, the 3D wavelet transform decomposes input data into eight frequency sub-bands across spatial and temporal dimensions, enabling the model to preserve fine-grained high-frequency details while enriching low-frequency motion representations. The cascaded WSC architecture further improves the extraction of multi-scale spatio-temporal features by integrating information from feature maps at different resolutions. Extensive experiments show that our 3DWSNet significantly outperforms state-of-the-art (SOTA) SNNs on the CIFAR-10, CIFAR-100, DVS128 Gesture, and CIFAR10-DVS datasets. The source code will be publicly released soon.
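
The eight sub-bands arise because a one-level separable 3D wavelet transform applies a low-pass (L) or high-pass (H) filter along each of the three axes, giving 2 × 2 × 2 combinations. The sketch below shows this with an orthonormal Haar wavelet on a nested-list volume. The Haar choice, the axis order (time, height, width), and the name `haar3d_level1` are assumptions for illustration; the abstract does not state which wavelet 3DWSNet uses.

```python
from itertools import product


def haar3d_level1(vol):
    """Split a (T, H, W) nested list with even dimensions into 8 Haar sub-bands.

    Returns a dict keyed by strings such as 'LLH', one letter per axis
    in the order (time, height, width).
    """
    T, H, W = len(vol), len(vol[0]), len(vol[0][0])
    bands = {}
    for key in product("LH", repeat=3):
        # Haar filter taps per axis: (+1, +1) for low-pass, (+1, -1) for high-pass
        signs = [1.0 if k == "L" else -1.0 for k in key]
        out = [[[0.0] * (W // 2) for _ in range(H // 2)] for _ in range(T // 2)]
        for t, y, x in product(range(T // 2), range(H // 2), range(W // 2)):
            acc = 0.0
            for dt, dy, dx in product((0, 1), repeat=3):
                w = ((signs[0] if dt else 1.0)
                     * (signs[1] if dy else 1.0)
                     * (signs[2] if dx else 1.0))
                acc += w * vol[2 * t + dt][2 * y + dy][2 * x + dx]
            out[t][y][x] = acc / 8 ** 0.5   # (1/sqrt(2))^3 keeps the transform orthonormal
        bands["".join(key)] = out
    return bands


volume = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]   # toy 2x2x2 input
bands = haar3d_level1(volume)
print(sorted(bands))
```

Each sub-band has half the resolution along every axis, and because the transform is orthonormal the total energy of the coefficients equals that of the input.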

Abstract:
Robotic fish face considerable challenges in natural environments due to the absence of a comprehensive and precise model that can depict the intricate fluid-structure interactions, particularly in the presence of background flow fields. To this end, we present a novel data-driven dynamic modeling framework capable of characterizing the swimming motions of the robotic fish under various background flow conditions without the necessity for explicit flow information. The model is synthesized by an internal model with an adaptive residual acceleration model to effectively isolate and address external flow effects. Notably, the residual model employs the innovative Domain Adversarially Invariant Meta-Learning (DAIML) approach, allowing the framework to adapt to fluctuating and previously unseen background flow scenarios, enhancing its robustness and scalability. Validation through high-fidelity Computational Fluid Dynamics (CFD) simulations demonstrates the framework’s effectiveness in improving the performance of robotic fish across diverse real-world aquatic environments.

Abstract:
In robot-assisted minimally invasive surgery (RMIS), the absence of haptic feedback presents a significant challenge for surgeons in accurately gauging the forces applied during procedures. However, obtaining precise force/torque (F/T) information at the surgical site is challenging. One key obstacle is distinguishing external forces from those induced by the cable-actuated kinematics of the surgical tool. We present a novel method to eliminate this interference by employing differential F/T measurement. We utilize two miniature 6-axis F/T sensors, positioned proximally and distally, to counterbalance the undesired forces and torques generated by the cable-driven system. To demonstrate the efficacy of this approach, we developed an experimental cable-actuated forceps with two degrees of freedom. We conducted a series of dynamic tests, attaching various weight configurations to the gripper to simulate external forces ranging from 0.5 N to 1.5 N. Subsequently, we evaluated three measurement methods: raw distal sensor readings, differential compensation, and a multilayer perceptron (MLP) that processes a sliding window of inputs from both sensors and actuators. Differential compensation improves performance by 70% over the distal sensor alone, achieving a root-mean-square error (RMSE) of 0.15 N and 3 mNm across the entire dataset. The MLP improves further, with a 90% lower RMSE relative to the distal sensor alone, achieving 0.05 N and 0.5 mNm on a test subset of the data not used for training.
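
The differential idea can be illustrated with a deliberately simple model. The sketch assumes that both sensors pick up the same cable-induced component, that only the distal sensor sees the external load, and that readings are already expressed in a common frame; none of this is stated in the abstract, and the function names and sample values are hypothetical.

```python
import math


def differential_compensation(distal, proximal):
    """Subtract the proximal reading from the distal one, sample by sample,
    to cancel the component assumed common to both sensors."""
    return [d - p for d, p in zip(distal, proximal)]


def rmse(estimate, reference):
    """Root-mean-square error between two equally long sequences."""
    n = len(estimate)
    return math.sqrt(sum((e - r) ** 2 for e, r in zip(estimate, reference)) / n)


external = [0.5, 1.0, 1.5]                    # hypothetical applied forces in N
cable = [0.3, -0.2, 0.4]                      # hypothetical cable-induced forces in N
distal = [e + c for e, c in zip(external, cable)]
proximal = cable                              # idealised: sees only the cable part

compensated = differential_compensation(distal, proximal)
print(rmse(distal, external), rmse(compensated, external))
```

In practice the two sensors will not see identical cable-induced loads, which is presumably why a learned model over both sensor streams can improve on plain subtraction.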

Abstract:
Tactile perception is crucial for robots to interact effectively with their environments, particularly in cluttered settings or when visual sensing is unavailable. However, a major limitation is the insufficient coverage of tactile sensors on current robots, which makes navigating cluttered spaces challenging due to the lack of capability to detect collisions. This limitation also hinders the use of teleoperation systems in such spaces by reducing the human operator’s situational awareness. To address this issue, this paper proposes the Touch-Linked Sleeve (TLS), a haptic mapping system that redirects contact on robot arms to human skin. The system consists of a tactile skin for contact detection and a haptic sleeve that enables human operators to experience telepresented contact. By establishing a transparent mapping between the robot’s tactile skin and the user’s haptic sleeve, operators can intuitively sense contacts from the robot’s perspective. To evaluate the system’s effectiveness, we conducted experiments demonstrating the functionality of both the tactile skin and the haptic sleeve. Moreover, we performed human studies using a virtual reality robot teleoperation interface to simulate navigation and manipulation in a cluttered scenario. The results indicate that the proposed system enhances perceptual transparency during object grasping tasks, leading to improved task completion times, fewer collisions, and improved overall usability.

Abstract:
Leveraging Large Language Models (LLMs) to write policy code for controlling robots has gained significant attention. However, in long-horizon implicative tasks, this approach often results in errors in API parameters, comments, and sequencing, leading to task failure. To address this problem, we propose a collaborative Triple-S framework that involves multiple LLMs. Through In-Context Learning, different LLMs assume specific roles in a closed-loop Simplification-Solution-Summary process, effectively improving success rates and robustness in long-horizon implicative tasks. Additionally, a novel demonstration library update mechanism that learns from successes allows the framework to generalize to previously failed tasks. We validate the framework on the Long-horizon Desktop Implicative Placement (LDIP) dataset across various baseline models, where Triple-S successfully executes 89% of tasks in both observable and partially observable scenarios. Experiments in both simulation and real-world robot settings further validated the effectiveness of Triple-S. Our code and dataset are available at: https://github.com/Ghbbbbb/Triple-S.

Abstract:
Modern robots are required to operate in complex environments and perform diverse tasks, resulting in redundant degrees of freedom (DoF) for flexibility. However, managing redundancy is challenging due to the high-dimensional and non-convex nature of robotic kinematics. When executing complex tracking tasks, redundant robots must handle non-convex constraints while maintaining many objectives, such as balancing and obstacle avoidance. This paper models the pathwise inverse kinematics of redundant mechanisms as a multi-objective nonlinear optimization problem. We propose an efficient gradient-free optimization method named MoeIK, which demonstrates strong multi-objective balance, rapid global convergence, and adaptability. Our approach enhances the method by integrating relaxation dominance, adaptive interval search strategies, and a restart strategy, significantly improving performance in overcoming many-objective optimization challenges. We compared MoeIK with RelaxedIK, Trac-IK, and BioIK across multiple trajectories on various redundant robots, and the experimental results demonstrate that our algorithm exhibits better multi-objective balance capabilities and supports real-time computation.

Abstract:
Physical interaction tasks within dense foliage such as leaf pruning and fruit harvesting are current challenges for agricultural robotics. This is due to the cluttered and unstructured environment in which these tasks are conducted, where complex stem structures present obstacles that obscure and constrain end-effector operational workspaces. Therefore, enabling robots to operate within a dense foliage environment requires a purpose-built end-effector, able to perform precision-based tasks despite workspace challenges. Whilst many tools have been implemented within related literature, most prior work evaluates their operational performance within reduced foliage testing environments. As such, this paper presents the performance evaluation of three end-effector mechanisms developed for robotic leaf pruning operations within unaltered dense foliage, referred to as the scissor-cutter, curved-cutter, and vacuum-cutter. End-effector mechanisms were chosen based on compact tool shapes, target approach direction range, and deployment within related literature. Evaluation criteria focused on the mechanisms' operational success rate and the damage caused to the plant. Through this paper we show that the vacuum-cutter had the highest success rate of 75% whilst the scissor-cutter caused the least plant damage. A comprehensive failure mechanism assessment and improvement recommendations for future prototypes are also provided.

Abstract:
This paper presents a Spatio-Temporal Transformer-based algorithm for underwater diver hand gesture recognition, forming a key component of diver-robot teaming. Existing computer vision-based approaches primarily rely on frame-wise gesture detection, which often fails to capture motion continuity and suffers under degraded underwater visibility. The presented method integrates temporal modeling to (i) improve recognition accuracy by capturing spatio-temporal patterns in hand motion, and (ii) increase robustness in challenging underwater environments by leveraging sequential image data, thereby mitigating the impact of intermittent misclassifications. The system is evaluated using real-world underwater footage, demonstrating high recognition accuracy and robustness to lighting fluctuations and partial occlusions. The results highlight the effectiveness and practicality of the presented method for real-world diver-robot collaboration, establishing a foundation for more reliable and intelligent underwater human-robot collaboration.

Abstract:
Robotic tactile sensors based on Electrical Impedance Tomography (EIT) have gained great attention in robotic sensing applications due to their features such as no internal wiring, "all-in-one" structure, and continuous sensing capabilities. However, the effectiveness of EIT-based tactile sensors is hampered by limited spatial resolution and artifacts in the reconstructed images. To address these challenges, various iterative optimization methods based on spatial regularizations and model-based methods have been proposed. In this study, a new EIT reconstruction method using null-space decomposition based on a diffusion model (NSDM) is proposed. Specifically, NSDM consists of a forward diffusion process that first gradually adds Gaussian noise to a clean conductivity image, followed by a backward process that learns to predict the noise that should be removed during each sampling step, utilizing a prior to ensure that the denoising process does not deviate from the correct direction. NSDM requires no training, no optimization, and only requires a pre-prepared diffusion model. Experimental results (both simulation and actual tests) demonstrate that the proposed method outperforms existing generation methods and provides higher quality reconstruction, providing a new solution for robotic tactile sensing in real scenarios.

Abstract:
Rare, yet critical, scenarios pose a significant challenge in testing and evaluating autonomous driving planners. Relying solely on real-world driving scenes requires collecting massive datasets to capture these scenarios. While automatic generation of traffic scenarios appears promising, data-driven models require extensive training data and often lack fine-grained control over the output. Moreover, generating novel scenarios from scratch can introduce a distributional shift from the original training scenes, which undermines the validity of evaluations, especially for learning-based planners. To sidestep this, recent work proposes to generate challenging scenarios by augmenting original scenarios from the test set. However, this involves the manual augmentation of scenarios by domain experts, an approach that is unable to meet the demands for scale in the evaluation of self-driving systems. Therefore, this paper introduces a novel LLM-agent based framework for augmenting real-world traffic scenarios using natural language descriptions, addressing the limitations of existing methods. A key innovation is the use of an agentic design, enabling fine-grained control over the output and maintaining high performance even with smaller, cost-effective LLMs. Extensive human expert evaluation demonstrates our framework’s ability to accurately adhere to user intent, generating high quality augmented scenarios comparable to those created manually.

Abstract:
Risk quantification is a critical component of safe autonomous driving; however, it is constrained by the limited perception range and occlusion of single-vehicle systems in complex and dense scenarios. The vehicle-to-everything (V2X) paradigm has been a promising solution for sharing complementary perception information; nevertheless, how to ensure risk interpretability while understanding multi-agent interaction with V2X remains an open question. In this paper, we introduce the first V2X-enabled risk quantification pipeline, CooperRisk, to fuse perception information from multiple agents and quantify the scenario driving risk at multiple future timestamps. The risk is represented as a scenario risk map to ensure interpretability based on risk severity and exposure, and the multi-agent interaction is captured by the learning-based cooperative prediction model. We carefully design a risk-oriented transformer-based prediction model with multi-modality and multi-agent considerations. It aims to ensure scene-consistent future behaviors of multiple agents and avoid conflicting predictions that could lead to overly conservative risk quantification and cause the ego vehicle to become overly hesitant to drive. The temporal risk maps can then serve to guide a model predictive control planner. We evaluate the CooperRisk pipeline on the real-world V2X dataset V2XPnP, and the experiments demonstrate its superior performance in risk quantification, showing a 44.35% decrease in conflict rate between the ego vehicle and background traffic participants.

Abstract:
Object detection systems must reliably perceive objects of interest without being overly confident to ensure safe decision-making in dynamic environments. Filtering techniques based on out-of-distribution (OoD) detection are commonly added as an extra safeguard to filter hallucinations caused by overconfidence in novel objects. Nevertheless, evaluating YOLO-family detectors and their filters under existing OoD benchmarks often leads to unsatisfactory performance. This paper studies the underlying reasons for performance bottlenecks and proposes a methodology to improve performance fundamentally. Our first contribution is a calibration of all existing evaluation results: although images in existing OoD benchmark datasets are claimed not to have objects within in-distribution (ID) classes (i.e., categories defined in the training dataset), around 13% of objects detected by the object detector are actually ID objects. Dually, an ID dataset containing OoD objects can also negatively impact the decision boundary of filters. These ultimately lead to a significantly imprecise performance estimation. Our second contribution is to consider the task of hallucination reduction as a joint pipeline of detectors and filters. By developing a methodology to carefully synthesize an OoD dataset that semantically resembles the objects to be detected, and using the crafted OoD dataset in the fine-tuning of YOLO detectors to suppress the objectness score, we achieve an 88% reduction in overall hallucination error with a combined fine-tuned detection and filtering system on the self-driving benchmark BDD-100K. Our code and dataset are available at: https://gricad-gitlab.univ-grenoble-alpes.fr/dnn-safety/m-hood.

Abstract:
Object Simultaneous Localization and Mapping (SLAM) systems struggle to correctly associate semantically similar objects in close proximity, especially in cluttered indoor environments and when scenes change. We present Semantic Enhancement for Object SLAM (SEO-SLAM), a novel framework that enhances semantic mapping by integrating heterogeneous multimodal large language model (MLLM) agents. Our method enables scene adaptation while maintaining a semantically rich map. To improve computational efficiency, we propose an asynchronous processing scheme that significantly reduces the agents’ inference time without compromising semantic accuracy or SLAM performance. Additionally, we introduce a multi-data association strategy using a cost matrix that combines semantic and Mahalanobis distances, formulating the problem as a Linear Assignment Problem (LAP) to alleviate perceptual aliasing. Experimental results demonstrate that SEO-SLAM consistently achieves higher semantic accuracy and reduces false positives compared to baselines, while our asynchronous MLLM agents significantly improve processing efficiency over synchronous setups. We also demonstrate that SEO-SLAM has the potential to improve downstream tasks such as robotic assistance. Our dataset is publicly available at: jungseokhong.com/SEO-SLAM.
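
The data association step described in the abstract above (a cost matrix mixing semantic and Mahalanobis distances, solved as a Linear Assignment Problem) can be illustrated with a minimal sketch. This is not the SEO-SLAM implementation: the 0/1 label distance, the weight `w_sem`, the diagonal covariance, and the brute-force solver are all assumptions chosen to keep the example self-contained, and every function name is hypothetical.

```python
import math
from itertools import permutations


def mahalanobis_diag(x, mean, var):
    """Mahalanobis distance assuming a diagonal covariance (simplification)."""
    return math.sqrt(sum((xi - mi) ** 2 / vi for xi, mi, vi in zip(x, mean, var)))


def semantic_distance(label_a, label_b):
    """Toy semantic distance (assumption): 0 for equal labels, 1 otherwise."""
    return 0.0 if label_a == label_b else 1.0


def build_cost_matrix(detections, landmarks, w_sem=2.0):
    """cost[i][j] = geometric distance + w_sem * semantic distance."""
    return [
        [
            mahalanobis_diag(d["pos"], l["pos"], l["var"])
            + w_sem * semantic_distance(d["label"], l["label"])
            for l in landmarks
        ]
        for d in detections
    ]


def solve_lap(cost):
    """Brute-force LAP for small square matrices; returns the landmark index per detection."""
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda p: sum(cost[i][p[i]] for i in range(n)),
    )
    return list(best)


# Two nearby landmarks and two detections whose positions are ambiguous.
landmarks = [
    {"label": "cup", "pos": (0.0, 0.0), "var": (0.04, 0.04)},
    {"label": "bottle", "pos": (0.3, 0.0), "var": (0.04, 0.04)},
]
detections = [
    {"label": "bottle", "pos": (0.1, 0.0)},
    {"label": "cup", "pos": (0.2, 0.0)},
]

geometry_only = solve_lap(build_cost_matrix(detections, landmarks, w_sem=0.0))
combined = solve_lap(build_cost_matrix(detections, landmarks, w_sem=2.0))
print(geometry_only, combined)
```

In this toy scene, geometry alone matches each detection to the nearest landmark regardless of label, while the combined cost matches labels correctly. A practical system would use full covariance matrices and a polynomial-time solver such as the Hungarian algorithm instead of enumerating permutations.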

Abstract:
A robust environment representation is critical for enabling robot systems to accomplish embodied navigation tasks. While offering efficient and sparse representations of environments compared to dense semantic maps, traditional 3D Scene Graphs often rely on multi-level semantic hierarchies that risk semantic discrepancies between high-level nodes and objects. Furthermore, separating semantic context from way topological relationships creates a disconnect between scene interpretation and actionable navigation strategies. To address these challenges, we propose CTSG, a hierarchical 3D scene graph mapping framework for zero-shot object navigation that supports both visual and textual queries. CTSG features a dual-layer structure: an object layer and a novel conway layer (contextual information + way topology). The conway layer integrates topological waypoints with rich multi-modal context, enhancing the continuity of environmental semantics. By aligning observations with navigation-centric perspectives, CTSG bridges the gap between scene understanding and task execution. We validate our method through simulation and real-world experiments across diverse environments, demonstrating robust performance in both visual target and language-guided navigation scenarios.

Abstract:
When robots replicate human actions in peg-in-hole assembly tasks, such as USB Type-A insertion and removal, the complexity of the process and frequent obstructions from the inner walls make it difficult for robots to handle collisions or avoid jamming. These difficulties contribute to a low success rate in assembly. This paper proposes a vision-guided reinforcement learning pre-assembly combined with tactile feedback-based pose estimation adjustment method for peg-in-hole assembly, achieving significant improvement in success rates for complex assembly tasks. First, during the pretraining process of reinforcement learning, high-reward sample data is collected, and a behaviour cloning (BC) algorithm is constructed based on sample data structure. The network is pretrained as a policy regression layer. Under sparse reward conditions, outputs of the twin delayed deep deterministic policy gradient (TD3) network and the BC network are combined to improve training stability and accelerate convergence, enhancing the efficiency of vision-based assembly. Then, to address the instability caused by collisions with the inner and outer walls of the hole when vision-based assembly remains incomplete, an in-hand pose estimation algorithm based on the Gelsight visuotactile sensor is integrated. This algorithm facilitates real-time adjustments to the position of the robot’s end-effector, improving the likelihood of successful peg-in-hole assembly. Finally, to validate the effectiveness of the proposed method, experiments were conducted using the V-REP simulation platform and the real Franka robot platform. In the experiments, success rates of 90-93% and 80-85%, respectively, were achieved.

Abstract:
In robot-assisted cell culture tasks, fluctuations in lighting conditions can result in blurred boundaries, intensified reflections, and pronounced refractions of transparent objects. These optical phenomena collectively escalate the complexity of image processing and target recognition. To address these challenges, this paper takes a dual-strategy approach. Firstly, it utilizes the Unity platform to construct a synthetic dataset (STTO-9k) containing 9,000 images of six types of transparent objects, providing abundant training samples for the detection and recognition of transparent objects. Secondly, it proposes an improved YOLOv8 visual detection algorithm (YOLO-Edge-Guided Lighting Adaptation, YL-EGLA). The algorithm realizes feature fusion by combining high-dimensional features of the input, extracted dynamically through a self-attention mechanism, with enhanced edge features extracted by an edge detection operator, and is equipped with an adaptive image enhancement module to ensure stable detection under different lighting conditions. Algorithm comparison results demonstrate that YL-EGLA can be fully trained on the synthetic dataset and directly applied to real-world scenarios without additional fine-tuning. Physical experiments further validate the efficiency and practicality of this algorithm in transparent object manipulation, showcasing its value in practical applications.

Abstract:
This paper tackles the challenging task of maintaining formation among multiple unmanned aerial vehicles (UAVs) while avoiding both static and dynamic obstacles during directed flight. The complexity of the task arises from its multi-objective nature, the large exploration space, and the sim-to-real gap. To address these challenges, we propose a two-stage reinforcement learning (RL) pipeline. In the first stage, we randomly search for a reward function that balances key objectives: directed flight, obstacle avoidance, formation maintenance, and zero-shot policy deployment. The second stage applies this reward function to more complex scenarios and utilizes curriculum learning to accelerate policy training. Additionally, we incorporate an attention-based observation encoder to improve formation maintenance and adaptability to varying obstacle densities. Experimental results in both simulation and real-world environments demonstrate that our method outperforms both planning-based and RL-based baselines in terms of collision-free rates and formation maintenance across static, dynamic, and mixed obstacle scenarios. Ablation studies further confirm the effectiveness of our curriculum learning strategy and attention-based encoder. Animated demonstrations are available at: https://sites.google.com/view/uav-formation-with-avoidance/.

Abstract:
Visual localization methods often present a trade-off between the high efficiency of specialized approaches, such as scene coordinate regression, and the need for rich, versatile scene representations for broader robotics tasks. To bridge this gap, we explore the use of 3D Gaussian Splatting (3DGS), which enables a unified, photorealistic encoding of 3D geometry and appearance. We propose GSplatLoc, a self-contained framework that tightly integrates structure-based keypoint matching with rendering-based pose refinement. Our two-stage procedure first distills robust descriptors from the lightweight XFeat extractor into the 3DGS model, enabling coarse pose estimation via 2D-3D correspondences without external dependencies. In the second stage, the initial pose is refined by minimizing a photometric warp loss, which leverages the fast, differentiable rendering of 3DGS. Benchmarking on widely used indoor and outdoor datasets demonstrates state-of-the-art performance among neural rendering-based localization methods and highlights the framework’s robustness in challenging dynamic scenes. Project page: https://gsplatloc.github.io

Abstract:
Diffusion models have been widely employed in the field of 3D manipulation due to their efficient capability to learn distributions, allowing for precise prediction of action trajectories. However, diffusion models typically rely on large parameter UNet backbones as policy networks, which can be challenging to deploy on resource-constrained devices. Recently, the Mamba model has emerged as a promising solution for efficient modeling, offering low computational complexity and strong performance in sequence modeling. In this work, we propose the Mamba Policy, a lighter but stronger policy that reduces the parameter count by over 80% compared to the original policy network while achieving superior performance. Specifically, we introduce the XMamba Block, which effectively integrates input information with conditional features and leverages a combination of Mamba and Attention mechanisms for deep feature extraction. Extensive experiments demonstrate that the Mamba Policy excels on the Adroit, Dexart, and MetaWorld datasets, requiring significantly fewer computational resources. Additionally, we highlight the Mamba Policy’s enhanced robustness in long-horizon scenarios compared to baseline methods and explore the performance of various Mamba variants within the Mamba Policy framework. Real-world experiments are also conducted to further validate its effectiveness. Our open-source project page can be found at https://andycao1125.github.io/mamba_policy/.

Abstract:
Garment folding is a common yet challenging task in robotic manipulation. The deformability of garments leads to a vast state space and complex dynamics, which complicates precise and fine-grained manipulation. In this paper, we present MetaFold, a unified framework that disentangles task planning from action prediction and learns each independently to enhance model generalization. It employs language-guided point cloud trajectory generation for task planning and a low-level foundation model for action prediction. This structure facilitates multi-category learning, enabling the model to adapt flexibly to various user instructions and folding tasks. We also construct a large-scale MetaFold dataset comprising folding point cloud trajectories for a total of 1210 garments across multiple categories, each paired with corresponding language annotations. Extensive experiments demonstrate the superiority of our proposed framework. Supplementary materials are available on our website: https://meta-fold.github.io/.

Abstract:
In this paper, we introduce S3D: A Spatial Steerable Surgical Drilling Framework for Robotic Spinal Fixation Procedures. S3D is designed to enable realistic steerable drilling while accounting for the anatomical constraints associated with vertebral access in spinal fixation (SF) procedures. To achieve this, we first enhanced our previously designed concentric tube Steerable Drilling Robot (CT-SDR) to facilitate steerable drilling across all vertebral levels of the spinal column. Additionally, we propose a four-phase calibration, registration, and navigation procedure to perform realistic SF procedures on a spine holder phantom by integrating the CT-SDR with a seven-degree-of-freedom robotic manipulator. The functionality of this framework is validated through planar and out-of-plane steerable drilling experiments in vertebral phantoms.

Abstract:
A thorough understanding of the interaction between the target agent and surrounding agents is a prerequisite for accurate trajectory prediction. Although many methods have been explored, they assign correlation coefficients to surrounding agents in a purely learning-based manner. In this study, we present ASPILin, which manually selects interacting agents and replaces the attention scores in Transformer with a newly computed physical correlation coefficient, enhancing the interpretability of interaction modeling. Surprisingly, these simple modifications can significantly improve prediction performance and substantially reduce computational costs. We intentionally simplified our model in other aspects, such as map encoding. Remarkably, experiments conducted on the INTERACTION, highD, and CitySim datasets demonstrate that our method is efficient and straightforward, outperforming other state-of-the-art methods.

Abstract:
Recent advances in Reinforcement Learning (RL) have demonstrated promising results in autonomous car racing. However, two fundamental challenges remain: sparse rewards, which hinder efficient learning, and the quality of demonstrations, which directly affects the effectiveness of RL from Demonstration (RLfD) approaches. To address these issues, we propose SAC(λ), a novel RLfD algorithm tailored for sparse-reward racing tasks with imperfect demonstrations. SAC(λ) introduces two key components: (1) a discriminator-augmented Q-function, which integrates prior knowledge from demonstrations into value estimation while maintaining off-policy learning benefits, and (2) a Positive-Unlabeled (PU) learning framework with adaptive prior adjustment, which enables the agent to progressively refine its understanding of positive behaviors while mitigating the overfitting problem. Through extensive experiments in the Assetto Corsa simulator, we demonstrate that SAC(λ) significantly accelerates training, surpasses the provided demonstrations, and achieves superior lap times over existing RL and RLfD approaches. Code and videos are available at https://heesungsung.github.io/AC-RLRacer/.

Abstract:
To address the challenge of short-term object pose tracking in dynamic environments with monocular RGB input, we introduce a large-scale synthetic dataset Omni-Pose6D, crafted to mirror the diversity of real-world conditions. We additionally present a benchmarking framework for a comprehensive comparison of pose tracking algorithms. We propose a pipeline featuring an uncertainty-aware keypoint refinement network, employing probabilistic modeling to refine pose estimation. Comparative evaluations demonstrate that our approach achieves performance superior to existing baselines on real datasets, underscoring the effectiveness of our synthetic dataset and refinement technique in enhancing tracking precision in dynamic contexts. Our contributions set a new precedent for the development and assessment of object pose tracking methodologies in complex scenes.

Abstract:
Dexterous telemanipulation critically relies on the continuous and stable tracking of the human operator’s commands to ensure robust operation. Vision-based tracking methods are widely used but have low stability due to anomalies such as occlusions, inadequate lighting, and loss of sight. Traditional filtering, regression, and interpolation methods are commonly used to compensate for explicit information such as angles and positions. These approaches are restricted to low-dimensional data and often result in information loss compared to the original high-dimensional image and video data. Recent advances in diffusion-based approaches, which can operate on high-dimensional data, have achieved remarkable success in video reconstruction and generation. However, these methods have not been fully explored in continuous control tasks in robotics. This work introduces the Diffusion-Enhanced Telemanipulation (DET) framework, which incorporates the Frame-Difference Detection (FDD) technique to identify and segment anomalies in video streams. These anomalous clips are replaced after reconstruction using diffusion models, ensuring robust telemanipulation performance under challenging visual conditions. We validated this approach in various anomaly scenarios and compared it with the baseline methods. Experiments show that DET achieves an average RMSE reduction of 17.2% compared to the cubic spline and 51.1% compared to FFT-based interpolation for different occlusion durations.
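
The abstract above does not specify how Frame-Difference Detection works internally, so the following is only a generic frame-differencing sketch of the idea of identifying and segmenting anomalous clips in a video stream. The mean-absolute-difference score, the fixed `threshold`, the comparison against the last normal frame, and all function names are assumptions made for illustration.

```python
def mean_abs_diff(frame_a, frame_b):
    """Mean absolute pixel difference between two equally sized grayscale frames."""
    total = 0.0
    count = 0
    for row_a, row_b in zip(frame_a, frame_b):
        for a, b in zip(row_a, row_b):
            total += abs(a - b)
            count += 1
    return total / count


def detect_anomalous_clips(frames, threshold):
    """Return inclusive (start, end) index pairs of frames flagged as anomalous.

    Each frame is compared with the most recent frame judged normal, so a
    long occlusion stays flagged until the view returns to normal.
    """
    clips = []
    start = None
    reference = frames[0]  # the first frame is assumed to be normal
    for i in range(1, len(frames)):
        if mean_abs_diff(frames[i], reference) > threshold:
            if start is None:
                start = i
        else:
            if start is not None:
                clips.append((start, i - 1))
                start = None
            reference = frames[i]
    if start is not None:
        clips.append((start, len(frames) - 1))
    return clips


def constant_frame(value, height=2, width=2):
    """Helper that builds a tiny uniform grayscale frame for the example."""
    return [[value] * width for _ in range(height)]


# Frames 3 and 4 simulate an occlusion (a sudden jump in brightness).
video = [constant_frame(v) for v in (10, 11, 12, 200, 205, 13, 14)]
clips = detect_anomalous_clips(video, threshold=50)
print(clips)
```

The flagged clip indices are what a reconstruction stage would then replace. One limitation of this reference-frame design is that a genuine abrupt scene change would remain flagged indefinitely, so a real system would need a timeout or a reference reset.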

Abstract:
Unmanned Aerial Vehicles (UAVs) play a crucial role in various scenarios ranging from disaster response to traffic surveillance. However, aerial video footage often suffers from severe motion blur due to rapid flight maneuvers, vibrations, and camera panning, which can significantly degrade downstream tasks such as target detection. Our goal is to explore a computationally-efficient and effective video deblurring approach to enhance UAV target detection performance. To reduce computational cost, we first propose an Adaptive Latent Scale Selector that dynamically adjusts the latent space resolution according to the intensity of UAV motion, thus balancing detail preservation with inference efficiency. To ensure temporal consistency, we introduce a Multi-Frame Alignment and Learnable Gating module to warp and gate the preceding frames, allowing the model to fuse only relevant temporal information and suppress misaligned or uninformative features. Our method can effectively recover sharp details from the UAV video stream. Extensive experiments on real UAV benchmarks demonstrate that our method not only yields superior deblurring performance but also significantly boosts target detection accuracy, making it highly applicable to robust aerial vision tasks. Code will be publicly available here.

Abstract:
Surgeons are frequently subjected to prolonged unnatural postures, such as sustained neck flexion during surgical procedures, increasing their susceptibility to work-related musculoskeletal disorders (WMSDs). Despite the prevalence of such issues, there remains a scarcity of ergonomic solutions that effectively mitigate neck pain while permitting full operational functionality across diverse environments. In this paper, we present a novel variable-stiffness neck exoskeleton with pneumatic-driven tensile actuators for prolonged head flexion assistance. Its innovative design facilitates unrestricted head movement while providing expected neck support, thereby minimizing strain on antagonist muscle groups. Comprehensive experimental evaluations were conducted to assess the system’s performance. Transitional response times between flexible and rigid states were 0.28 s and 0.46 s, respectively. Experimental trials involving five healthy subjects demonstrated that the average reductions in muscular activity of the splenius capitis and the sternocleidomastoid muscles were 38.8 ± 2.0% and 9.7 ± 2.5%, respectively. These experimental results demonstrate the great potential of the exoskeleton in practical applications for alleviating the physical burden during prolonged head flexion.

Abstract:
The 3D visual grounding task aims to establish correspondences between the 3D physical world and textual descriptions. Despite significant progress, it still faces several challenges: a) scene-agnostic text reasoning causes misaligned target region concentration; b) regional pseudo-center interference results in an inaccurate geometric center; c) multi-modal features overemphasize semantics, leading to degradation in geometric topological perception for size regression. To address these issues, we propose a Progressive Comprehension and Geometric-topology Perception Enhancement (PCGE) one-stage framework, which decouples the task into keypoint estimation and size regression under textual constraints. Specifically, to enable coarse-to-fine keypoint estimation, we propose the STAR module to focus the target region approximately with a scene-specific reasoning mechanism, while the K2C module performs geometric calibration to alleviate pseudo-center bias. For size regression, we propose GTE to enhance the geometric boundary perception during the decoding process, improving size regression via establishing topological matrices. Compared with previous methods, our approach achieves state-of-the-art performance on ScanRefer and Sr3D, leading by 3.94% in Acc@0.50 on ScanRefer and by 3.7% on Sr3D.

Abstract:
We introduce in this paper an original planning algorithm, Zoned Artificial Repulsion (ZAR∗), designed for multiple-robot micromanipulation. The algorithm combines a modified Artificial Potential Field (APF) with A∗ to efficiently compute micromanipulation trajectories. The motivation behind ZAR∗ is to integrate the advantages of APF and A∗ while avoiding their limitations. While APF is computationally efficient but prone to local minima, A∗ guarantees completeness but has high complexity. ZAR∗ employs a modified repulsive field to segment the configuration space into distinct zones, each converging to a unique local minimum. These zones are reduced to nodes in a graph, allowing A∗ to compute the inter-zone transitions. The modified APF then handles navigation within the zones. This method ensures the completeness of the algorithm, avoids local minima, and significantly reduces the number of nodes in the graph, leading to a highly efficient algorithm in terms of processing time and path cost. We compare ZAR∗ against A∗, APF, RRT, RRT∗, and PRM, as well as more recent hybrid algorithms. On average, ZAR∗ reduces the number of nodes by 909 times and improves path construction efficiency (time × cost) by 499 times, while maintaining a 100% success rate. It also performs more than 4 times better than the best hybrid alternative of standard algorithms, making it suitable for dexterous micromanipulation tasks.
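
The abstract above describes zones being reduced to graph nodes over which A∗ computes inter-zone transitions. The sketch below shows only that graph-search step on a hand-written zone graph. It does not reproduce the ZAR∗ zoning or the modified repulsive field: the zone centroids, the adjacency lists, and the use of centroid-to-centroid Euclidean distance as both edge cost and heuristic are illustrative assumptions.

```python
import heapq
import math


def astar_over_zones(centroids, adjacency, start, goal):
    """A* on a zone graph; returns (list of zone ids, path cost).

    Edge cost and heuristic are both Euclidean distances between zone
    centroids, which keeps the heuristic admissible in this toy setting.
    """

    def heuristic(zone):
        return math.dist(centroids[zone], centroids[goal])

    open_heap = [(heuristic(start), 0.0, start, [start])]
    best_cost = {start: 0.0}
    while open_heap:
        _, cost, zone, path = heapq.heappop(open_heap)
        if zone == goal:
            return path, cost
        if cost > best_cost.get(zone, math.inf):
            continue  # stale heap entry; a cheaper route was already found
        for neighbor in adjacency.get(zone, ()):
            new_cost = cost + math.dist(centroids[zone], centroids[neighbor])
            if new_cost < best_cost.get(neighbor, math.inf):
                best_cost[neighbor] = new_cost
                heapq.heappush(
                    open_heap,
                    (new_cost + heuristic(neighbor), new_cost, neighbor, path + [neighbor]),
                )
    return None, math.inf


# Hypothetical zones: each id maps to the centroid of one zone.
centroids = {"A": (0, 0), "B": (1, 0), "C": (1, 1), "D": (2, 1), "E": (0, 2)}
adjacency = {
    "A": ["B", "E"],
    "B": ["A", "C"],
    "C": ["B", "D"],
    "D": ["C", "E"],
    "E": ["A", "D"],
}

zone_path, zone_cost = astar_over_zones(centroids, adjacency, "A", "D")
print(zone_path, zone_cost)
```

Because the graph holds one node per zone rather than one per grid cell, the search space stays small. In the scheme the abstract describes, motion inside each zone along the returned sequence would be handled by the potential field rather than by graph search.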

Abstract:
In this paper, we present a new approach to improve the neural rendering fidelity of in-the-wild unmanned aerial vehicle (UAV)-based scenes. Our formulation is designed for dynamic scenes, consisting of small moving objects or human actions in particular. We propose an extension of K-Planes Neural Radiance Field (NeRF), wherein our algorithm stores a set of tiered high dimensional feature vectors. The tiered feature vectors are generated to effectively model conceptual information about a scene as well as to be processed by an image decoder that transforms output feature maps into RGB images. Our technique leverages the information among both static and dynamic objects within a scene and is able to capture salient scene attributes of high altitude videos. We evaluate its performance on challenging datasets, including Okutama Action and UG2, and observe considerable improvement in accuracy over state of the art neural rendering methods.

Abstract:
Unmanned Aerial Vehicles (UAVs) are limited by their onboard energy. Refining the navigation strategy directly affects both the flight velocity and the trajectory through the adjustment of key parameters in the UAV pipeline, thus reducing energy consumption. However, existing techniques tend to adopt static and conservative strategies in dynamic scenarios, leading to inefficient energy reduction. Dynamically adjusting the navigation strategy requires overcoming challenges including task pipeline interdependencies, environment-strategy correlations, and parameter selection. To solve these problems, this paper proposes a method to dynamically adjust the navigation strategy of a UAV by analyzing its dynamic characteristics and the temporal characteristics of the autonomous navigation pipeline, thereby reducing UAV energy consumption in response to environmental changes. We compare our method with the baseline through hardware-in-the-loop (HIL) simulation and real-world experiments, in which our method achieves 3.2× and 2.6× improvements in mission time and 2.4× and 1.6× improvements in energy, respectively.

Abstract:
Target singulation involves a rearrangement of surrounding obstacles to create space for grasping the target. However, when objects are tightly packed in a confined workspace (i.e., a limited table boundary), it is not easy to relocate obstacles. Non-prehensile manipulation, such as pushing, is suitable for creating space when objects are closely placed. However, it can be risky, as collisions between objects might unintentionally push them beyond the table boundary. On the other hand, prehensile motion ensures safe relocation of objects. However, the objects that can be safely grasped are limited when the packing density is too high, so a rearrangement plan for target singulation may not be found. To complement both methods, we suggest using collision-free push-stack synergy for rearrangement. Collision-free pushing prevents objects from moving out of the boundary while efficiently relocating them, and stacking creates space in advance for safe pushing. Furthermore, we propose a modified algorithm of Local Obstacle-based Backward Search (LOBS), which generates a global rearrangement plan using only pick-and-place actions. To evaluate our method, we set up challenging scenarios with a packing density of 50% and up to 70 objects. Compared to LOBS, the success rate increased significantly with no meaningful increase in planning time. Additionally, our method outperformed other baselines as well.

Abstract:
Accurate walking speed estimation in lower-limb prostheses is crucial for delivering biomechanically appropriate assistance across varying speeds. However, training robust models requires extensive domain-specific, user-dependent (DEP) data, which is impractical for every new prosthesis user. This study presents a transfer learning framework to simplify and enhance the training process. Convolutional neural networks were pre-trained on publicly available datasets from able-bodied (AB) individuals and transfemoral amputees using the Open Source Leg (OSL) knee-ankle prosthesis, then fine-tuned with data from a transfemoral amputee using the Power Knee (PK) prosthesis. The fine-tuned models, AB-PK and OSL-PK, were trained with varying data amounts and evaluated across constant and variable walking speed trials, with performance compared to DEP models trained from scratch on PK data. Training and testing were conducted on a per-subject basis, with performance averaged across subjects (N=7). The lowest post-fine-tuning error was observed in AB-PK, with RMSE values of 0.041 m/s for constant speeds, 0.072 m/s for variable speeds, and 0.088 m/s for novel speeds not included in the original training data. Significant error reductions were observed in both fine-tuned models compared to DEP when fewer than 30 gait cycles per speed of training data were available. Notably, AB datasets appeared highly viable for this application and may even outperform OSL datasets in transfer learning for walking speed estimation, perhaps due to the much larger original training dataset. This approach highlights the potential of transfer learning across different subject populations and devices, offering insights into the data needed to achieve state-of-the-art speed estimation.
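
For readers unfamiliar with the error metric quoted above, the following minimal sketch shows how a root-mean-square error (RMSE) in m/s could be computed from speed estimates. The sample values and the names `rmse`, `estimated`, and `measured` are hypothetical and are not taken from the study.

```python
import math
from typing import Sequence


def rmse(predicted: Sequence[float], actual: Sequence[float]) -> float:
    """Root-mean-square error between two equally long sequences."""
    if len(predicted) != len(actual) or len(predicted) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    squared = [(p - a) ** 2 for p, a in zip(predicted, actual)]
    return math.sqrt(sum(squared) / len(squared))


# Hypothetical walking speed estimates and reference speeds, in m/s.
estimated = [1.00, 1.10, 0.90, 1.20]
measured = [1.00, 1.00, 1.00, 1.00]
error = rmse(estimated, measured)
```

Under a per-subject protocol like the one described, such a value would be computed for each subject and then averaged.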

Abstract:
Guaranteeing constraint satisfaction is challenging in imitation learning (IL), particularly in tasks that require operating near a system’s handling limits. Traditional IL methods, such as Behavior Cloning (BC), often struggle to enforce constraints, leading to suboptimal performance in high-precision tasks. In this paper, we present a simple approach to incorporating safety into the IL objective. Through simulations, we empirically validate our approach on an autonomous racing task with both full-state and image feedback, demonstrating improved constraint satisfaction and greater consistency in task performance compared to BC.

Abstract:
A domain shift exists between the large-scale internet data used to train a Vision-Language Model (VLM) and the raw image streams collected by a robot. Existing adaptation strategies require the definition of a closed set of classes, which is impractical for a robot that must respond to diverse natural language queries. In response, we present QueryAdapter, a novel framework for rapidly adapting a pre-trained VLM in response to a natural language query. QueryAdapter leverages unlabelled data collected during previous deployments to align VLM features with semantic classes related to the query. By optimising learnable prompt tokens and actively selecting objects for training, an adapted model can be produced in a matter of minutes. We also explore how objects unrelated to the query should be dealt with when using real-world data for adaptation. In turn, we propose the use of object captions as negative class labels, helping to produce better calibrated confidence scores during adaptation. Extensive experiments on ScanNet++ demonstrate that QueryAdapter significantly enhances object retrieval performance compared to state-of-the-art unsupervised VLM adapters and 3D scene graph methods. Furthermore, the approach exhibits robust generalization to abstract affordance queries and other datasets, such as Ego4D.

Abstract:
Stand-alone Visual Place Recognition (VPR) systems have little defence against a well-designed adversarial attack, which can lead to disastrous consequences when deployed for robot navigation. This paper extensively analyzes the effect of two adversarial attacks common in other perception tasks and two novel VPR-specific attacks on VPR localization performance. We then propose how to close the loop between VPR, an Adversarial Attack Detector (AAD), and active navigation decisions by demonstrating the performance benefit of simulated AADs in a novel experiment paradigm – which we detail for the robotics community to use as a system framework. In the proposed experiment paradigm, we see the addition of AADs across a range of detection accuracies can improve performance over baseline; demonstrating a significant improvement – such as a ≈ 50% reduction in the mean along-track localization error – can be achieved with True Positive and False Positive detection rates of only 75% and up to 25% respectively. We examine a variety of metrics including: Along-Track Error, Percentage of Time Attacked, Percentage of Time in an ‘Unsafe’ State, and Longest Continuous Time Under Attack. Expanding further on these results, we provide the first investigation into the efficacy of the Fast Gradient Sign Method (FGSM) adversarial attack for VPR. The analysis in this work highlights the need for AADs in real-world systems for trustworthy navigation, and informs quantitative requirements for system design.
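
The Fast Gradient Sign Method mentioned above perturbs an input by a small step in the direction of the sign of the loss gradient, x_adv = x + ε · sign(∇ₓL). The sketch below applies it to a toy logistic-regression classifier rather than to a VPR model; the weights, input, and function names are assumptions chosen only to keep the example self-contained.

```python
import math
from typing import List


def sigmoid(z: float) -> float:
    return 1.0 / (1.0 + math.exp(-z))


def predict(x: List[float], w: List[float], b: float) -> float:
    """Probability of the positive class under a logistic-regression model."""
    return sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b)


def loss_gradient_wrt_input(x: List[float], y: int, w: List[float], b: float) -> List[float]:
    """Gradient of the binary cross-entropy loss with respect to the input x."""
    p = predict(x, w, b)
    return [(p - y) * wi for wi in w]


def sign(v: float) -> int:
    return (v > 0) - (v < 0)


def fgsm(x: List[float], y: int, w: List[float], b: float, eps: float) -> List[float]:
    """One FGSM step: move each input component by eps in the loss-increasing direction."""
    grad = loss_gradient_wrt_input(x, y, w, b)
    return [xi + eps * sign(g) for xi, g in zip(x, grad)]


# Hypothetical toy model and input with true label 1.
weights, bias = [2.0, -1.0], 0.0
x_clean, label = [0.5, 0.2], 1
x_adv = fgsm(x_clean, label, weights, bias, eps=0.1)
p_clean = predict(x_clean, weights, bias)
p_adv = predict(x_adv, weights, bias)
```

Here the perturbed input lowers the model's confidence in the true class, which is the effect an attacker seeks; in a VPR setting the loss would instead be defined on image descriptors.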

Abstract:
Optimizing the initial angle of the shoulder’s base frame is crucial for defining the workspace and enhancing the manipulation performance of humanoid robotic arms. Previous studies primarily emphasized geometric analyses, neglecting dynamic factors, which limits their practical applicability. This study presents a multi-metric optimization framework to enhance the dynamic performance of humanoid robotic arms by optimizing the shoulder’s initial angle. We formulate a cost function that incorporates torque efficiency, energy consumption, and overload ratio, utilizing the differential evolution (DE) algorithm for optimization. Furthermore, to address the limitations of conventional geometric workspace analysis, we introduce the concept of effective workspace, integrating dynamic constraints to quantitatively evaluate the effects of optimized shoulder angles. We validate the proposed framework in a hybrid simulation environment combining MuJoCo and RBDL, using the KIST humanoid and Unitree G1 robotic arms. Experimental results confirm that the optimized shoulder angles enhance torque distribution, expanding the effective workspace by 18.4% and 3.78% for the KIST and Unitree G1 robotic arms, respectively. These findings demonstrate that the proposed optimization framework enhances manipulation and dynamic performance as well as energy efficiency and system reliability, contributing to advancements in humanoid robotic arm design.
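
To make the optimization step above concrete, here is a minimal sketch of differential evolution (the common DE/rand/1/bin variant) minimizing a weighted-sum cost. The three quadratic surrogate terms, their weights, and the angle bounds are invented placeholders; the actual torque-efficiency, energy, and overload metrics are computed from robot dynamics and are not given in the abstract.

```python
import random
from typing import Callable, List, Sequence, Tuple


def differential_evolution(
    cost: Callable[[List[float]], float],
    bounds: Sequence[Tuple[float, float]],
    pop_size: int = 20,
    generations: int = 200,
    f: float = 0.7,
    cr: float = 0.9,
    seed: int = 0,
) -> Tuple[List[float], float]:
    """DE/rand/1/bin: mutate with a scaled difference vector, then binomial crossover."""
    rng = random.Random(seed)
    dim = len(bounds)
    pop = [[rng.uniform(lo, hi) for lo, hi in bounds] for _ in range(pop_size)]
    scores = [cost(ind) for ind in pop]
    for _ in range(generations):
        for i in range(pop_size):
            a, b, c = rng.sample([j for j in range(pop_size) if j != i], 3)
            j_rand = rng.randrange(dim)  # guarantees at least one mutated component
            trial = []
            for d, (lo, hi) in enumerate(bounds):
                if rng.random() < cr or d == j_rand:
                    value = pop[a][d] + f * (pop[b][d] - pop[c][d])
                    value = min(max(value, lo), hi)  # clip to the search bounds
                else:
                    value = pop[i][d]
                trial.append(value)
            trial_score = cost(trial)
            if trial_score <= scores[i]:  # greedy selection
                pop[i], scores[i] = trial, trial_score
    best = min(range(pop_size), key=scores.__getitem__)
    return pop[best], scores[best]


def shoulder_cost(params: List[float]) -> float:
    """Hypothetical weighted sum of three surrogate metrics of the shoulder angle (degrees)."""
    angle = params[0]
    torque_term = (angle - 20.0) ** 2
    energy_term = (angle - 35.0) ** 2
    overload_term = (angle - 40.0) ** 2
    return 0.5 * torque_term + 0.3 * energy_term + 0.2 * overload_term


best_params, best_cost = differential_evolution(shoulder_cost, bounds=[(-90.0, 90.0)])
```

For these placeholder terms the analytic minimizer is the weighted mean of the three targets, 28.5 degrees, which gives a simple check that the search behaves sensibly.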

Abstract:
A two-degree-of-freedom parallel mechanism control system based on an adaptive learning rate and radial basis function (RBF) neural network controller is studied in this paper. The mechanism is composed of four pneumatic artificial muscles (PAMs), forming two pairs of antagonistic single-degree-of-freedom joints, which enable two-degree-of-freedom motion along the X and Y axes. The core objective of the system is to automatically output the air pressure values for the X and Y axes based on the input desired angle, driving the joints to precisely reach the specified angle. In this research, dynamic modeling of the two pairs of driving joints composed of four pneumatic muscles was conducted, and the motion characteristics of the system were analyzed. Subsequently, an RBF neural network was employed to approximate system modeling errors and external disturbances, combined with a PID controller to optimize the driving performance of the pneumatic muscles. The stability of the controller was proven by designing a Lyapunov function, ensuring that the system remains stable during dynamic changes. Finally, simulation experiments were conducted using MATLAB/Simulink to verify the effectiveness of the proposed algorithm. The experimental results demonstrate that the control algorithm enables the actual angle to track the desired angle in real time, with high control accuracy and stability. This research provides a new solution for the precise control of pneumatic muscle-driven parallel joint systems, with broad application prospects, effectively addressing the limitations of traditional PAM control methods that require precise modeling and suffer from poor robustness.
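
The role of the RBF network described above, approximating an unknown modeling error online, can be sketched as follows. This is a simplified illustration under stated assumptions: a one-dimensional input, fixed Gaussian centers and width, a fixed learning rate (the paper's adaptive learning-rate law and Lyapunov-based design are not given in the abstract), and a made-up linear `disturbance` function standing in for the unmodeled dynamics.

```python
import math
from typing import List, Sequence


class RBFApproximator:
    """Gaussian RBF network: y(x) = sum_j w_j * exp(-(x - c_j)^2 / (2 * width^2))."""

    def __init__(self, centers: Sequence[float], width: float) -> None:
        self.centers = list(centers)
        self.width = width
        self.weights = [0.0] * len(self.centers)

    def features(self, x: float) -> List[float]:
        return [math.exp(-((x - c) ** 2) / (2.0 * self.width ** 2)) for c in self.centers]

    def predict(self, x: float) -> float:
        return sum(w * h for w, h in zip(self.weights, self.features(x)))

    def update(self, x: float, target: float, learning_rate: float) -> float:
        """One gradient step on the squared error; returns the error before the update."""
        h = self.features(x)
        error = target - sum(w * hj for w, hj in zip(self.weights, h))
        self.weights = [w + learning_rate * error * hj for w, hj in zip(self.weights, h)]
        return error


def disturbance(x: float) -> float:
    """Hypothetical unmodeled term that the network should learn."""
    return 0.5 * x


def mean_squared_error(model: RBFApproximator, xs: Sequence[float]) -> float:
    return sum((disturbance(x) - model.predict(x)) ** 2 for x in xs) / len(xs)


samples = [i / 10.0 - 1.0 for i in range(21)]  # inputs from -1.0 to 1.0
net = RBFApproximator(centers=[-1.0, -0.5, 0.0, 0.5, 1.0], width=0.5)
initial_mse = mean_squared_error(net, samples)
for _ in range(200):
    for x in samples:
        net.update(x, disturbance(x), learning_rate=0.1)
final_mse = mean_squared_error(net, samples)
```

In a controller of the kind described, the network output would be added to the PID command as a compensation term.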

Abstract:
Accurate prediction of pedestrian trajectories is crucial for improving the safety of autonomous driving. However, this task is generally nontrivial due to the inherent stochasticity of human motion, which naturally requires the predictor to generate multi-modal predictions. Previous works leverage various generative methods, such as GAN and VAE, for pedestrian trajectory prediction. Nevertheless, these methods may suffer from mode collapse and relatively low-quality results. The denoising diffusion probabilistic model (DDPM) has recently been applied to trajectory prediction due to its simple training process and powerful reconstruction ability. However, current diffusion-based methods do not fully utilize input information and usually require either many denoising iterations, which lead to a long inference time, or an additional network for initialization. To address these challenges and facilitate the use of diffusion models in multi-modal trajectory prediction, we propose GDTS, a novel Goal-Guided Diffusion Model with Tree Sampling for multi-modal trajectory prediction. Considering the "goal-driven" characteristics of human motion, GDTS leverages goal estimation to guide the generation of the diffusion network. A two-stage tree sampling algorithm is presented, which leverages common features to reduce the inference time and improve accuracy for multi-modal prediction. Experimental results demonstrate that our proposed framework achieves performance comparable to the state of the art with real-time inference speed on public datasets.

Abstract:
With the rapid advancement of smart farming towards large-scale livestock operations, the demand for model generalization in cross-pen behavior recognition has significantly increased. Traditional deep learning models suffer from substantial performance degradation due to variations in illumination and structure across different sheep pens, often necessitating the re-annotation of tens of thousands of frames for each new environment to mitigate domain shift issues. This severely limits the deployment of models in large-scale sheep farms. To achieve the goal of ‘annotate once, generalize across pens,’ we propose the SheepDA-YOLO framework, which innovatively integrates contrastive image translation and feature decoupling to address cross-domain adaptation challenges in agriculture. The core of our method consists of four parts: generating bidirectional pseudo-images for the source and target domains with the CUT method, which reduces image-level domain discrepancies through mixed training sets; employing a Mean Teacher architecture combined with a quadruple loss function to ensure stable knowledge transfer; proposing a DP-DMAF module, which suppresses illumination interference and feature confusion through dual-path feature decoupling and separable large-kernel attention; and adding a high-resolution detection layer to enhance small-target recognition accuracy. Experimental results demonstrate that SheepDA-YOLO achieves 89.7% mAP in cross-domain testing on target sheep pens, outperforming state-of-the-art methods by 3.4% and significantly reducing annotation costs. The study is the first to validate the feasibility of cross-pen adaptation, providing an efficient solution for the scalable implementation of smart livestock farming.

Abstract:
Object tracking in inland waterways plays a crucial role in safe and cost-effective applications, including waterborne transportation, sightseeing tours, environmental monitoring, and surface rescue. Our Unmanned Surface Vehicle (USV), equipped with a 4D radar, a monocular camera, a GPS, and an IMU, delivers robust tracking capabilities in complex waterborne environments. By leveraging these sensors, our USV collected comprehensive object tracking data, which we present as USVTrack, the first 4D radar-camera tracking dataset tailored for autonomous driving in new-generation waterborne transportation systems. Our USVTrack dataset presents rich scenarios, featuring diverse waterways, varying times of day, and multiple weather and lighting conditions. Moreover, we present a simple but effective radar-camera matching method, termed RCM, which can be plugged into popular two-stage association trackers. Experimental results utilizing RCM demonstrate the effectiveness of the radar-camera matching in improving object tracking accuracy and reliability for autonomous driving in waterborne environments. The USVTrack dataset is publicly available at https://usvtrack.github.io.

Abstract:
Liquid phase change pouch actuators (liquid pouch motors) hold great promise for a wide range of robotic applications, from artificial organs to pneumatic manipulators for dexterous manipulation. However, the usability of liquid pouch motors remains challenging due to the nonlinear intrinsic properties of liquids and their highly dynamic implications for liquid-gas phase changes, which complicate state modeling and estimation. To address these issues, we propose a reservoir computing-based method for modeling the inflation states of a customized liquid pouch motor, which serves as an actuator, featuring four Peltier heating junctions. We use a motion capture system to track the landmark movements on the pouch as a proxy for its volumetric profile. These movements represent the internal liquid-gas phase changes of the pouch at stable room temperature, atmospheric pressure, and in the presence of electrical noise. The motion coordinates are thus learned by our reservoir computing framework, PhysRes, to model the states based on prior observations. Through training, our model achieves excellent results on the test set, with a normalized root mean squared error of 0.0041 in estimating the states and a corresponding volumetric error of 0.0160%. To further demonstrate how such actuators could be implemented in the future, we also design a dual-pouch actuator-based robotic gripper to control the grasping of soft objects. Our design and source code are available at: https://github.com/tatung/liquidpouch_reservoir.

Abstract:
Robotic manipulation demands precise control over both contact forces and motion trajectories. While force control is essential for achieving compliant interaction and high-frequency adaptation, it is limited to operations in close proximity to the manipulated object and often fails to maintain stable orientation during extended motion sequences. Conversely, optimization-based motion planning excels in generating collision-free trajectories over the robot’s configuration space but struggles with dynamic interactions where contact forces play a crucial role. To address these limitations, we propose a multi-modal control framework that combines force control and optimization-augmented motion planning to tackle complex robotic manipulation tasks in a sequential manner, enabling seamless switching between control modes based on task requirements. Our approach decomposes complex tasks into subtasks, each dynamically assigned to one of three control modes: pure optimization for global motion planning, pure force control for precise interaction, or hybrid control for tasks requiring simultaneous trajectory tracking and force regulation. This framework is particularly advantageous for bimanual and multi-arm manipulation, where synchronous motion and coordination among arms are essential while considering both the manipulated object and environmental constraints. We demonstrate the versatility of our method through a range of long-horizon manipulation tasks, including single-arm, bimanual, and multi-arm applications, highlighting its ability to handle both free-space motion and contact-rich manipulation with robustness and precision. More information is available at https://sites.google.com/view/komo-force/home

Abstract:
Micro Autonomous Surface Vehicles (MicroASVs) offer significant potential for operations in confined or shallow waters and swarm robotics applications. However, achieving precise and robust control at such small scales remains highly challenging, mainly due to the complexity of modeling nonlinear hydrodynamic forces and the increased sensitivity to self-motion effects and environmental disturbances, including waves and boundary effects in confined spaces. This paper presents a physics-driven dynamics model for an over-actuated MicroASV and introduces a data-driven optimal control framework that leverages a weak formulation-based online model learning method. Our approach continuously refines the physics-driven model in real time, enabling adaptive control that adjusts to changing system parameters. Simulation results demonstrate that the proposed method substantially enhances trajectory tracking accuracy and robustness, even under unknown payloads and external disturbances. These findings highlight the potential of data-driven online learning-based optimal control to improve MicroASV performance, paving the way for more reliable and precise autonomous surface vehicle operations.

Abstract:
Endowing vision-based tactile fingers with curved, rounded surfaces is essential for dexterous robotic manipulation, as such surfaces offer fuller contact with the environment. However, current rounded designs are constrained by a low sensing frequency (30–60 Hz) and the need for recalibration when adapting to new sensors due to the reliance on multi-channel captures, which hinders their performance in dynamic robotic tasks and large-scale deployment. In this work, we introduce R-Tac0, a low-cost rounded vision-based tactile sensor (VBTS) engineered for high-resolution and high-speed perception. The key innovation is a monochrome vision-based sensing principle: utilizing a black-and-white camera to capture the reflection properties of the compound rounded elastomer under monochromatic illumination. This single-channel imaging significantly reduces data volume and simplifies computation, enabling 120 Hz tactile perception. A lightweight neural network can calibrate the sensor to achieve a depth reconstruction accuracy of 0.169 mm per pixel, while exhibiting surprisingly good transferability to new sensors. In experiments, we demonstrate the advantages of R-Tac0’s rounded design by evaluating its performance under different contact angles, its high-frequency perception in slip detection, and its effectiveness in robotic dynamic pose estimation.

Abstract:
Building an online 3D LiDAR mapping system that produces a detailed surface reconstruction while remaining computationally efficient is a challenging task. In this paper, we present PlanarMesh, a novel incremental, mesh-based LiDAR reconstruction system that adaptively adjusts mesh resolution to achieve compact, detailed reconstructions in real-time. It introduces a new representation, planar-mesh, which combines plane modeling and meshing to capture both large surfaces and detailed geometry. The planar-mesh can be incrementally updated considering both local surface curvature and free-space information from sensor measurements. We employ a multi-threaded architecture with a Bounding Volume Hierarchy (BVH) for efficient data storage and fast search operations, enabling real-time performance. Experimental results show that our method achieves reconstruction accuracy on par with, or exceeding, state-of-the-art techniques—including truncated signed distance functions, occupancy mapping, and voxel-based meshing—while producing smaller output file sizes (10 times smaller than raw input and more than 5 times smaller than mesh-based methods) and maintaining real-time performance (around 2 Hz for a 64-beam sensor).

Abstract:
Autonomous navigation in open-world outdoor environments faces challenges in integrating dynamic conditions, long-distance spatial reasoning, and semantic understanding. Traditional methods struggle to balance local planning, global planning, and semantic task execution, while existing large language models (LLMs) enhance semantic comprehension but lack spatial reasoning capabilities. Although diffusion models excel in local optimization, they fall short in large-scale long-distance navigation. To address these gaps, this paper proposes KiteRunner, a language-driven cooperative local-global navigation strategy that combines UAV orthophoto-based global planning with diffusion model-driven local path generation for long-distance navigation in open-world scenarios. Our method innovatively leverages real-time UAV orthophotography to construct a global probability map, providing traversability guidance for the local planner, while integrating large models like CLIP and GPT to interpret natural language instructions. Experiments demonstrate that KiteRunner achieves 5.6% and 12.8% improvements in path efficiency over state-of-the-art methods in structured and unstructured environments, respectively, with significant reductions in human interventions and execution time.

Abstract:
3D point cloud semantic segmentation (PCSS) is a cornerstone for environmental perception in robotic systems and autonomous driving, enabling precise scene understanding through point-wise classification. While unsupervised domain adaptation (UDA) mitigates label scarcity in PCSS, existing methods critically overlook the inherent vulnerability to real-world perturbations (e.g., snow, fog, rain) and adversarial distortions. This work first identifies two intrinsic limitations that undermine current PCSS-UDA robustness: (a) unsupervised feature overlap from unaligned boundaries in shared-class regions and (b) feature structure erosion caused by domain-invariant learning that suppresses target-specific patterns. To address these problems, we propose a tripartite framework consisting of: 1) a robustness evaluation model quantifying resilience against adversarial attack/corruption types through robustness metrics; 2) an invertible attention alignment module (IAAM) enabling bidirectional domain mapping while preserving discriminative structure via attention-guided overlap suppression; and 3) a quality-guided contrastive memory bank that progressively refines pseudo-labels with feature quality for more discriminative representations. Extensive experiments on SynLiDAR-to-SemanticPOSS adaptation demonstrate a maximum mIoU improvement of 14.3% under adversarial attack.
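
The mIoU figure reported above refers to mean intersection over union, the per-class ratio TP / (TP + FP + FN) averaged over classes. A minimal sketch on point-wise labels follows; the label lists are hypothetical, and skipping classes that appear in neither prediction nor ground truth is one common convention assumed here, not necessarily the one used in the paper.

```python
from typing import List, Sequence


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union across classes for point-wise labels."""
    if len(pred) != len(target):
        raise ValueError("pred and target must have the same length")
    ious: List[float] = []
    for c in range(num_classes):
        tp = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        fp = sum(1 for p, t in zip(pred, target) if p == c and t != c)
        fn = sum(1 for p, t in zip(pred, target) if p != c and t == c)
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both prediction and ground truth
            ious.append(tp / union)
    return sum(ious) / len(ious) if ious else 0.0


# Hypothetical labels for six points and three classes.
ground_truth = [0, 0, 1, 1, 2, 2]
predicted = [0, 1, 1, 1, 2, 0]
score = mean_iou(predicted, ground_truth, num_classes=3)
```

For these labels the per-class IoUs are 1/3, 2/3, and 1/2, so the mean is 0.5.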

Abstract:
Biomimetic and compliant robotic hands offer the potential for human-like dexterity, but controlling them is challenging due to high dimensionality, complex contact interactions, and uncertainties in state estimation. Sampling-based model predictive control (MPC), using a physics simulator as the dynamics model, is a promising approach for generating contact-rich behavior. However, sampling-based MPC has yet to be evaluated on physical (non-simulated) robotic hands, particularly on compliant hands with state uncertainties. We present the first successful demonstration of in-hand manipulation on a physical biomimetic tendon-driven robot hand using sampling-based MPC. While sampling-based MPC does not require lengthy training cycles like reinforcement learning approaches, it still necessitates adapting the task-specific objective function to ensure robust behavior execution on physical hardware. To adapt the objective function, we integrate a visual language model (VLM) with a real-time optimizer (MuJoCo MPC). We provide the VLM with a high-level human language description of the task and a video of the hand’s current behavior. The VLM gradually adapts the objective function, allowing for efficient behavior generation, with each iteration taking less than two minutes. We show the feasibility of ball rolling, flipping, and catching using both simulated and physical robot hands. Our results demonstrate that sampling-based MPC is a promising approach for generating dexterous manipulation skills on biomimetic hands without extensive training cycles.

Abstract:
We propose an integrated planning framework for quadrupedal locomotion over dynamically changing, unforeseen terrains. Existing approaches either rely on heuristics for instantaneous foothold selection–compromising safety and versatility–or solve expensive trajectory optimization problems with complex terrain features and long time horizons. In contrast, our framework leverages reactive synthesis to generate correct-by-construction controllers at the symbolic level, and mixed-integer convex programming (MICP) for dynamic and physically feasible footstep planning for each symbolic transition. We use a high-level manager to reduce the large state space in synthesis by incorporating local environment information, improving synthesis scalability. To handle specifications that cannot be met due to dynamic infeasibility, and to minimize costly MICP solves, we leverage a symbolic repair process to generate only necessary symbolic transitions. During online execution, re-running the MICP with real-world terrain data, along with runtime symbolic repair, bridges the gap between offline synthesis and online execution. We demonstrate, in simulation, our framework’s capabilities to discover missing locomotion skills and react promptly in safety-critical environments, such as scattered stepping stones and rebars.

Abstract:
Automated parking is a critical feature of Advanced Driver Assistance Systems (ADAS), where accurate trajectory prediction is essential to bridge perception and planning modules. Despite its significance, research in this domain remains relatively limited, with most existing studies concentrating on single-modal trajectory prediction of vehicles. In this work, we propose ParkDiffusion, a novel approach that predicts the trajectories of both vehicles and pedestrians in automated parking scenarios. ParkDiffusion employs diffusion models to capture the inherent uncertainty and multi-modality of future trajectories, incorporating several key innovations. First, we propose a dual map encoder that processes soft semantic cues and hard geometric constraints using a two-step cross-attention mechanism. Second, we introduce an adaptive agent type embedding module, which dynamically conditions the prediction process on the distinct characteristics of vehicles and pedestrians. Third, to ensure kinematic feasibility, our model outputs control signals that are subsequently used within a kinematic framework to generate physically feasible trajectories. We evaluate ParkDiffusion on the Dragon Lake Parking (DLP) dataset and the Intersections Drone (inD) dataset. Our work establishes a new baseline for heterogeneous trajectory prediction in parking scenarios, outperforming existing methods by a considerable margin.

Abstract:
Vision-based multi-drone multi-object tracking technology enables autonomous target situational awareness for unmanned aerial systems. Distributed observer drones dynamically estimate the spatio-temporal states of multiple targets through collaborative sensor fusion, enabling simultaneous localization and persistent following of the target of interest in cluttered airspaces. The challenge lies in distinguishing targets in different drones’ views and keeping the target of interest within the field of view. This paper proposes a factor graph method for joint multi-target association and localization with distributed drone following. Sensor measurements and control constraints are integrated into a probabilistic factor graph to solve the bundle adjustment and model predictive control, respectively. Both simulation and real-world experiments prove the effectiveness and robustness of our proposed approach. The source code will be available at: https://github.com/npu-ius-lab/MLMF.

Abstract:
Developing robotic systems for unstructured and contact-rich environments presents significant challenges, necessitating advanced dexterous motion planning, compliant interaction control, and spatio-temporal coordination. To address these, we introduce PACR (Point-Axis Constraint Reasoning), a unified framework that encodes robot trajectories and impedance profiles via constraint functions parameterized by point-axis primitives, extracted from multi-view RGB-D camera observations. This enables joint optimization of motion and impedance within a shared mathematical framework. For enhanced robustness, we implement a dual-agent Vision-Language Model (VLM) system: a Generator employs Chain-of-Thought reasoning to formulate constraints, while an adversarial Critic validates them, significantly mitigating hallucination risks. Integrated with the dual-agent system, the framework also features an error backtracking mechanism, enabling dynamic adaptation by learning from failures. Extensive experiments across diverse manipulation tasks reveal that PACR achieves a 61% success rate (compared to 37% for baseline methods) and reduces the average contact forces, demonstrating broad applicability through zero-shot generalization without task-specific training.

Abstract:
Achieving stable and natural locomotion in bipedal robots, comparable to that of humans and animals, remains a long-standing challenge in robotics. In this work, we propose a bio-inspired low-level control framework that streamlines the generation of naturalistic gait patterns while ensuring adaptability. Our approach begins with the design of a low-dimensional gait representation that captures key characteristics of human and animal locomotion. This representation is then integrated with the Linear Inverted Pendulum Model (LIPM) to form an abstract yet effective motion descriptor. Serving as a kinematic reference within a reinforcement learning (RL) framework, this descriptor enables the training of control policies that strike a balance between biomechanical realism and adaptability. Rather than strictly adhering to predefined gait trajectories, the learned policies dynamically adjust to optimize both stability and velocity tracking. As a result, our method enables bipedal robots to exhibit smooth, biomechanically realistic locomotion while enhancing stability and adaptability. We validate the proposed framework through real-world experiments on our bipedal robot, demonstrating its ability to achieve stable and efficient locomotion.
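
The Linear Inverted Pendulum Model referenced above has a well-known closed-form solution when the center of mass stays at constant height and the support point is at the origin: with ω = sqrt(g / z₀), x(t) = x₀ cosh(ωt) + (v₀ / ω) sinh(ωt). The sketch below evaluates it; the numerical values and function names are illustrative assumptions, and the paper's low-dimensional gait representation is not reproduced here.

```python
import math
from typing import Tuple

GRAVITY = 9.81  # m/s^2


def lipm_state(x0: float, v0: float, com_height: float, t: float) -> Tuple[float, float]:
    """Center-of-mass position and velocity after time t under the unforced LIPM."""
    omega = math.sqrt(GRAVITY / com_height)
    c, s = math.cosh(omega * t), math.sinh(omega * t)
    x = x0 * c + (v0 / omega) * s
    v = x0 * omega * s + v0 * c
    return x, v


def orbital_energy(x: float, v: float, com_height: float) -> float:
    """Conserved quantity of the LIPM: 0.5 * v^2 - 0.5 * (g / z0) * x^2."""
    return 0.5 * v ** 2 - 0.5 * (GRAVITY / com_height) * x ** 2


# Hypothetical initial state: CoM 5 cm behind the support point, moving forward at 0.4 m/s.
X0, V0, HEIGHT = -0.05, 0.4, 0.8
x_t, v_t = lipm_state(X0, V0, HEIGHT, t=0.3)
energy_start = orbital_energy(X0, V0, HEIGHT)
energy_end = orbital_energy(x_t, v_t, HEIGHT)
```

Because the orbital energy is constant along any LIPM trajectory, comparing `energy_start` and `energy_end` is a quick sanity check on the formulas.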

Abstract:
Air-ground robots have received increasing attention and seen growing application due to their air-to-ground motion performance and excellent energy efficiency. However, air-ground robots still have shortcomings, including complex structural mechanisms, low terrain adaptability, and low-precision controllers, which significantly limit practical application. In this paper, an air-ground robot, the wheel-leg unmanned aerial vehicle (WLuav), is proposed based on a five-link wheel-leg structure to obtain excellent ground-adaptive and air-maneuvering capabilities. Based on the improved structural mechanism, a hierarchical adaptive agile controller is proposed to improve its trajectory tracking accuracy and ground adaptability. In addition, a mode switching strategy based on the support force solver is proposed to provide smooth and rapid mode switching. Finally, comprehensive experiments and a benchmark comparison are carried out to validate the performance of the proposed system, where the WLuav system shows excellent ground-adaptive and trajectory tracking performance, and the energy efficiency can reach 79.46%.

Abstract:
Rapid generation of large-scale orthoimages from Unmanned Aerial Vehicles (UAVs) has been a long-standing focus of research in the field of aerial mapping. A multi-sensor UAV system, integrating the Global Positioning System (GPS), Inertial Measurement Unit (IMU), 4D millimeter-wave radar and camera, can provide an effective solution to this problem. In this paper, we utilize multi-sensor data to overcome the limitations of conventional orthoimage generation methods in terms of temporal performance, system robustness, and geographic reference accuracy. A prior-pose-optimized feature matching method is introduced to enhance matching speed and accuracy, reducing the number of required features and providing precise references for the Structure from Motion (SfM) process. The proposed method exhibits robustness in low-texture scenes like farmlands, where feature matching is difficult. Experiments show that our approach achieves accurate feature matching and orthoimage generation in a short time. The proposed drone system effectively aids in farmland management.

Abstract:
Accurate semantic segmentation of both tactile paving and obstacles is crucial for the safe mobility of visually impaired individuals. However, existing methods face two major challenges: (i) discontinuous segmentation fragments; (ii) inaccurate obstacle recognition. To address challenge (i), we propose incorporating appearance priors of complete tactile pavings to prevent the model from directly learning irregular ground truth masks. To tackle challenge (ii), we propose introducing cross-modal semantic priors to complement the semantic information of obstacles. We implement these strategies in the proposed Dual Prior knowledge induced tactile paving and obstacle joint Segmentation Network (DPSN). Based on a bilateral network architecture, DPSN merges obstacle category masks into tactile paving categories, constructing a complete tactile paving mask. Utilizing the complete mask, DPSN transfers appearance prior knowledge to detail features from boundary and structural perspectives. Concurrently, DPSN leverages the CLIP Text Encoder to guide visual feature decoding through attention mechanisms, transferring rich cross-modal semantic prior knowledge to the visual feature maps. Furthermore, we propose the TPO-Dataset, the first dataset for joint tactile paving and obstacle segmentation acquired from actual scenes. Experiments demonstrate that DPSN achieves state-of-the-art results on the TPO-Dataset, with relative gains of 27.16% in obstacle IoU and 30.53% in accuracy metrics compared to baseline methods. Notably, DPSN achieves real-time performance at 88.25 FPS at the maximum resolution of 2048×512.

Abstract:
The unique tilt-servo mechanism of the tilt-rotor unmanned aerial vehicle (UAV) facilitates seamless transitions between multi-rotor and fixed-wing modes, enhancing both flexibility and maneuverability. However, traditional modeling methods, which treat each flight mode independently, fail to provide a unified dynamic representation, limiting the accurate description of aerobatic maneuvers during mode transitions. This paper introduces a novel modeling approach based on transient Computational Fluid Dynamics (CFD) to capture the aerodynamics of the transition mode, resulting in a multimodal, consistent dynamics model. This model simplifies the mathematical representation for specific tilt angles, ensuring compatibility with both multi-rotor and fixed-wing dynamics, and accurately describes aerobatic maneuvers. An autonomous feedback motion planning method, utilizing third-order Bézier curves for angular velocity planning, is applied, along with a modal switching strategy to address the limitations of traditional fixed-wing UAVs. The feasibility of this method was validated through numerical simulations, hardware-in-the-loop simulations, and outdoor flight experiments of a tilt-rotor UAV performing the Cobra maneuver in transition mode.
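
The third-order Bézier curves used for angular velocity planning can be sketched as follows. This is only an illustration of evaluating a cubic Bézier profile for one scalar angular-velocity channel. The control-point values and function names are hypothetical, and the abstract does not state how the paper selects its control points.

```python
def cubic_bezier(p0, p1, p2, p3, s):
    """Evaluate a third-order (cubic) Bezier curve at s in [0, 1]."""
    u = 1.0 - s
    return u ** 3 * p0 + 3 * u ** 2 * s * p1 + 3 * u * s ** 2 * p2 + s ** 3 * p3


def cubic_bezier_derivative(p0, p1, p2, p3, s):
    """Derivative of the cubic Bezier curve with respect to s."""
    u = 1.0 - s
    return 3 * u ** 2 * (p1 - p0) + 6 * u * s * (p2 - p1) + 3 * s ** 2 * (p3 - p2)


if __name__ == "__main__":
    # Hypothetical pitch-rate profile (rad/s): ramp from 0 to 2 with zero slope
    # at both ends, obtained by repeating the first and last control points.
    control_points = (0.0, 0.0, 2.0, 2.0)
    for k in range(5):
        s = k / 4
        print(s, cubic_bezier(*control_points, s))
```

Repeating the end control points forces the derivative to zero at s = 0 and s = 1, which gives a smooth start and stop for the commanded rate.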

Abstract:
Tactile sensors can significantly enhance the perception of humanoid robotics systems by providing contact information that facilitates human-like interactions. However, existing commercial tactile sensors focus on improving the resolution and sensitivity of single-modal detection with high-cost components and densely integrated design, incurring complex manufacturing processes and unaffordable prices. In this work, we present Bio-Skin, a cost-effective multi-modal tactile sensor that utilizes single-axis Hall-Effect sensors for planar normal force measurement and bar-shaped piezo resistors for 2D shear force measurement. A thermistor coupled with a heating wire is integrated into a silicone body to achieve temperature sensation and thermostatic function analogous to human skin. We also present a cross-reference framework to validate the two modalities of the force sensing signal, improving the sensing fidelity in a complex electromagnetic environment. Bio-Skin has a multi-layer design, and each layer is manufactured sequentially and subsequently integrated, thereby offering a fast production pathway. After calibration, Bio-Skin demonstrates performance metrics (including signal-to-range ratio, sampling rate, and measurement range) comparable to current commercial products at one-tenth of the cost. The sensor’s real-world performance is evaluated using an Allegro hand in object grasping tasks, while its temperature regulation functionality is assessed in a material detection task.

Abstract:
Formulating a multi-robot obstacle avoidance policy is essential for enabling safe and efficient navigation in multi-robot environments, forming a critical component of the effective operation of multi-robot systems. Recently, reinforcement learning has been applied to improve the performance of decentralized, policy-driven robots in task execution. However, ensuring the safety of these agents during movement remains a significant challenge due to the inherent risks associated with the reinforcement learning process, such as frequent collisions. To address this issue and enhance the safety of policy-guided multi-robot navigation, we propose a novel policy based on imitation learning. This framework introduces a novel policy neural network that integrates a graph attention mechanism with the GRU network structure. The key innovation lies in utilizing the interactions between neighboring robots to enhance the safety of their movements. In a multi-robot simulation environment, robot behaviors are directed by the proposed policy. A comparative analysis was conducted between our approach and RL-RVO, one of the advanced methods in the field. The results demonstrate that our approach outperforms RL-RVO, achieving a higher success rate and significantly improving safety performance.

Abstract:
Underwater object detection (UOD) is crucial for monitoring marine ecosystems, underwater robotics, environmental protection, and autonomous underwater vehicles (AUVs). Despite progress, many models struggle under real-world conditions due to poor visibility, dynamic lighting, and domain shifts. Traditional methods like Faster R-CNN are computationally expensive, while YOLO-based models suffer in challenging underwater scenarios. The scarcity of large-scale annotated datasets further limits model generalization. To address these challenges, we introduce UOD-SZTU-2025, a new dataset of 3,133 high-quality underwater images, sourced primarily from video platforms. The dataset is used in EFCWM (Enhanced Feature Correction and Weighting Module) to extract and refine a feature material library for detection targets. We propose EFCWM-Mamba-YOLO, a lightweight, real-time detection model designed to enhance feature representation and adapt to diverse underwater environments. The EFCWM module incorporates domain adaptation for improved robustness. Additionally, a two-stage training strategy first trains on a source domain and fine-tunes with limited target domain samples to enhance generalization. Experiments show our approach surpasses existing lightweight UOD models in accuracy, real-time performance, and robustness. Our dataset, model, and benchmark establish a strong foundation for future UOD research. The dataset for EFCWM-Mamba-YOLO is available at https://github.com/wojiaosun/UOD-SZTU-2025.

Abstract:
Creating an intelligent surgical environment requires not only advanced robotic systems but also optimized microscopic imaging. However, autofocus remains a fundamental challenge, with current methods suffering from slow iterative processes or directional ambiguity, which compromises real-time performance. This paper presents an implicit disparity-blur alignment approach for robotic microsurgical autofocus, integrating stereo geometry’s monotonic depth cues with de-focus characteristics for rapid convergence. A novel physics-guided dual-stream network is developed to encode implicit depth representations through hierarchical cross-pathway feature fusion, enabling reliable focus prediction without explicit stereo matching in blur-degraded regions. An ROI-aware attention module is proposed to dynamically optimize focus-critical regions, coupled with learnable physics-guided kernel learning for precise Z-offset estimation. The approach achieves a top directional accuracy of 94.85% and a single-pass focus error of 0.20 mm with an inference time of 53 ms on a surgical dataset, which outperforms state-of-the-art methods in reducing iteration count by 22.8% and inference time by 51.8%. An intelligent robotic microscope prototype is developed, with validation through ex vivo tests demonstrating its ability to enable fast and precise multi-region focusing for microsurgeries.

Abstract:
Recognizing and identifying human locomotion is a critical step to ensuring fluent control of wearable robots, such as transtibial prostheses. In particular, classifying the locomotion mode and estimating the gait phase are key. In this work, a novel, interpretable, and computationally efficient algorithm is presented for simultaneously predicting locomotion mode and gait phase. Using able-bodied (AB) data and transtibial prosthesis (PR) data collected via a bypass adapter, seven locomotion modes are tested including slow, medium, and fast level walking (0.6, 0.8, and 1.0 m/s), ramp ascent/descent (5 degrees), and stair ascent/descent (20 cm height). Overall classification accuracy was 99.1% and 99.3% for the AB and PR conditions, respectively. The average gait phase error across all data was less than 4%. Exploiting the structure of the data, computational efficiency reached 2.91 µs per time step. The time complexity of this algorithm scales as O(N·M) with the number of locomotion modes M and samples per gait cycle N. This efficiency and high accuracy could accommodate a much larger set of locomotion modes (~700 on the Open-Source Leg Prosthesis) to handle the wide range of activities pursued by individuals during daily living.

Abstract:
Automating garment manipulation poses a significant challenge for assistive robotics due to the diverse and deformable nature of garments. Traditional approaches typically require separate models for each garment type, which limits scalability and adaptability. In contrast, this paper presents a unified approach using vision-language models (VLMs) to improve keypoint prediction across various garment categories. By interpreting both visual and semantic information, our model enables robots to manage different garment states with a single model. We created a large-scale synthetic dataset using advanced simulation techniques, allowing scalable training without extensive real-world data. Experimental results indicate that the VLM-based method significantly enhances keypoint detection accuracy and task success rates, providing a more flexible and general solution for robotic garment manipulation. This research also underscores the potential of VLMs to unify various garment manipulation tasks within a single framework, paving the way for broader applications in home automation and assistive robotics.

Abstract:
Galloping is a common high-speed gait in both animals and quadrupedal robots, yet its energetic characteristics remain insufficiently explored. This study systematically analyzes a large number of possible galloping gaits by categorizing them based on the number of flight phases per stride and the phase relationships between the front and rear legs, following Hildebrand’s framework for asymmetrical gaits. Using the A1 quadrupedal robot from Unitree, we model galloping dynamics as a hybrid dynamical system and employ trajectory optimization (TO) to minimize the cost of transport (CoT) across a range of speeds. Our results reveal that rotary and transverse gallop footfall sequences exhibit no fundamental energetic difference, despite variations in body yaw and roll motion. However, the number of flight phases significantly impacts energy efficiency: galloping with no flight phases is optimal at lower speeds, whereas galloping with two flight phases minimizes energy consumption at higher speeds. We validate these findings using a Quadratic Programming (QP)-based controller, developed in our previous work, in Gazebo simulations. These insights advance the understanding of quadrupedal locomotion energetics and may inform future legged robot designs for adaptive, energy-efficient gait transitions.
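
For readers unfamiliar with the metric, the cost of transport is commonly defined as the energy spent per unit weight per unit distance, CoT = E / (m g d). The sketch below assumes that common dimensionless definition. The abstract does not state which energy term the paper uses (for example mechanical work or electrical input), and the numbers in the example are hypothetical rather than A1 parameters.

```python
def cost_of_transport(energy_j, mass_kg, distance_m, g=9.81):
    """Dimensionless cost of transport: energy / (weight * distance travelled)."""
    if mass_kg <= 0 or distance_m <= 0:
        raise ValueError("mass and distance must be positive")
    return energy_j / (mass_kg * g * distance_m)


if __name__ == "__main__":
    # Hypothetical stride: 40 J spent by a 12 kg robot over 0.9 m.
    print(cost_of_transport(40.0, 12.0, 0.9))
```

Because the quantity is normalized by weight and distance, it allows gaits at different speeds and stride lengths to be compared on one scale.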

Abstract:
Reinforcement learning (RL) has gained attention for complex decision-making in uncertain environments. However, high costs and risks of real-world experimentation limit its direct application to marine vehicles. This motivates the use of simulation-based training and sim-to-real transfer techniques. Despite growing interest, a systematic understanding of how to design effective transfer strategies for marine contexts remains lacking. This paper presents a sim-to-real transfer framework tailored for marine vehicles, integrating high-fidelity, data-driven dynamics modeling with multi-factor domain randomization to address marine environmental uncertainties. Maneuvering data is utilized to extract nonlinear hydrodynamic characteristics of marine vehicles to enhance model realism. Additionally, domain randomization is explored across multiple environmental factors, including wind, wave, and current. To evaluate transferability, we construct a sim-to-sim platform with a pseudo-real environment that emulates the reality gap and adopt a path-following task using Soft Actor-Critic. We comprehensively assess the impacts of model fidelity and environmental randomization strategies on sim-to-real transfer performance. Results indicate that model accuracy positively impacts transfer performance, while aggressive domain randomization may reduce adaptability in calm conditions. Finally, a data-driven modeling and multi-factor randomization recipe is proposed for RL policy transfer in marine applications.

Abstract:
Target tracking under intermittent measurements is a fundamental challenge in autonomous systems. Traditional methods, including Kalman filters and deep learning-based models, often struggle when faced with sparse observations and high measurement noise. In this work, we present multiple transformer-based motion models designed to learn target dynamics from noisy sensor measurements and occluded portions. By leveraging self-attention mechanisms, these models effectively capture temporal dependencies and infer motion trajectories under uncertainty. We evaluate various architectural formulations, including time-encoded position inputs to better handle occlusions. These learned motion models are then integrated with a particle filter for target estimation and with an information-driven planner to guide the tracking agent. Since the models influence the guidance logic through their predictions, we assess their effectiveness based on overall target tracking performance. Extensive simulation and hardware experiments demonstrate that our approach improves tracking accuracy and robustness compared to existing methods.

Abstract:
Large-scale outdoor navigation is essential for unmanned ground vehicles (UGVs), but despite significant advancements, they still face two key challenges in practical applications. The first is how to ensure safe navigation in environments with dynamic and low-lying obstacles that LiDAR cannot detect. The second is how to adaptively re-plan target points when some of them are blocked by temporary obstacles. To address these challenges, this work proposes a Dynamic and Low-lying-obstacle Avoidance Navigation (DLAN) system to conduct perception, planning, and point correction for UGVs. To efficiently and accurately detect dynamic obstacles, it designs a lightweight Ensemble3D framework that integrates three fast but low-accuracy detection methods. A multi-criteria waypoint optimizer is used to assist UGVs in path planning. It ensures a balance between obstacle avoidance and path following. To adjust blocked target points through local re-planning, this work designs a checkpoint correction method. Extensive simulations and real-world experiments demonstrate that DLAN enables reliable navigation with high efficiency and robust obstacle avoidance in complex environments. More details can be found on our project homepage and video.

Abstract:
Robots that can physically interact with humans in a safe manner have the potential to revolutionize application domains like home assistance and nursing care. However, to become long-term companions, such robots must learn user-specific preferences and adapt their behaviors in real time. We propose a Constrained Partially Observable Markov Decision Process framework for modeling human safety preferences over representative variables like force, velocity, and proximity. These variables are modeled as adaptive linear constraints, with a belief over their upper bounds that is updated online based on noisy human feedback. By modeling the belief as phase dependent, the model captures varying preferences across different task phases. The robot then solves a hierarchical optimization to select actions that respect both the learned constraints and robot motion limits. Our method does not require offline training data and can be applied directly to diverse physical interaction tasks and operation modes (tele-operated or autonomous). A pilot study shows that our approach effectively learns user preferences and improves perceived safety while reducing user effort compared to baselines.
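
The idea of maintaining a belief over a constraint's upper bound and updating it from noisy feedback can be illustrated with a generic discrete Bayes update. This sketch is an assumption-laden simplification and not the paper's model: the grid of candidate bounds, the binary "complained" signal, and the fixed `noise` probability are all hypothetical.

```python
def update_belief(bounds, belief, value, complained, noise=0.1):
    """One Bayes update of a discrete belief over an unknown upper bound.

    bounds: candidate upper bounds (e.g. force limits in N).
    belief: prior probability of each candidate (sums to 1).
    value: the value the robot actually applied.
    complained: True if the human signalled discomfort.
    noise: assumed probability that the feedback is wrong.
    """
    posterior = []
    for bound, prior in zip(bounds, belief):
        violated = value > bound
        # Probability of a complaint if this candidate were the true bound.
        p_complain = 1.0 - noise if violated else noise
        likelihood = p_complain if complained else 1.0 - p_complain
        posterior.append(prior * likelihood)
    total = sum(posterior)
    return [p / total for p in posterior]


if __name__ == "__main__":
    bounds = [5.0, 10.0, 15.0]
    belief = [1 / 3, 1 / 3, 1 / 3]
    # The robot applied 12 N and the user complained: bounds below 12 N gain mass.
    print(update_belief(bounds, belief, value=12.0, complained=True))
```

A phase-dependent version, as described in the abstract, could keep one such belief per task phase. That extension is left out here.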

Abstract:
Minimally Invasive Surgery has advanced surgical practice, yet early-stage gastrointestinal cancer treatment remains challenging. Endoscopic Submucosal Dissection offers a solution but faces maneuverability constraints in complex anatomical environments. This paper presents a novel inflatable soft robotic manipulator with a biocompatible thin-film shell and tendon-driven antagonistic actuation. The robot remains compact at 6.5 mm in diameter, expanding to 11.7 mm to enhance stiffness for force exertion and precise manipulation. Featuring two decoupled wrists with four degrees of freedom, it enables dexterous motion for advanced endoscopic procedures. The study details design, fabrication, actuation modeling, workspace evaluation, and simulated retraction experiments in a constrained environment. Results demonstrate high repeatability, high master-slave control accuracy, effective workspace utilization, and feasibility for endoluminal applications, enhancing robotic-assisted endoscopic procedures with improved dexterity and adaptability under soft actuation constraints.

Abstract:
In this work, we introduce SPADE, a path planning framework designed for autonomous navigation in dynamic environments using 3D scene graphs. SPADE combines hierarchical path planning with local geometric awareness to enable collision-free movement in dynamic scenes. The framework splits the planning problem into two parts: (a) solving the sparse abstract global layer plan and (b) iterative path refinement across denser lower local layers in step with local geometric scene navigation. To ensure efficient extraction of a feasible route in dense multi-task domain scene graphs, the framework enforces informed sampling of traversable edges prior to path planning. This removes extraneous information not relevant to path planning and reduces the overall planning complexity over a graph. Existing approaches address the problem of path planning over scene graphs by decoupling hierarchical and geometric path evaluation processes. This results in inefficient replanning over the entire scene graph when obstructions block the original route. In contrast, SPADE prioritizes local layer planning coupled with local geometric scene navigation, enabling navigation through dynamic scenes while maintaining efficiency in computing a traversable route. We validate SPADE through extensive simulation experiments and real-world deployment on a quadrupedal robot, demonstrating its efficacy in handling complex and dynamic scenarios.

Abstract:
Visual augmentation has become a crucial technique for enhancing the visual robustness of imitation learning. However, existing methods are often limited by prerequisites such as camera calibration or the need for controlled environments (e.g., green screen setups). In this work, we introduce RoboEngine, the first plug-and-play visual robot data augmentation toolkit. For the first time, users can effortlessly generate physics- and task-aware robot scenes with just a few lines of code. To achieve this, we present a novel robot scene segmentation dataset, a generalizable high-quality robot segmentation model, and a fine-tuned background generation model, which together form the core components of the out-of-the-box toolkit. Using RoboEngine, we demonstrate the ability to generalize robot manipulation tasks across six entirely new scenes, based solely on demonstrations collected from a single scene, achieving a more than 200% performance improvement compared to the no-augmentation baseline. All datasets, model weights, and the toolkit are released at https://roboengine.github.io/.

Abstract:
Robotic hand tool use has garnered significant attention from robotics researchers, because it enhances dexterity beyond the limitations imposed by manipulators with fixed tool configurations and human-involved manual tool changes. Despite extensive research, current methodologies predominantly focus on imitating human hand trajectories, often neglecting the pivotal role of tool-environment interaction. This study addresses this gap by exploring the task of cucumber peeling as a case study to implement contact-based demonstration strategies in robotic tool use. Our approach concentrates on the subtle tool contact behaviors that manifest through contact dynamics. Specifically, we select appropriate tool stiffness for the peeling tasks, which is captured via a handheld teaching device equipped with optical tactile sensors. Subsequently, object-level stiffness control strategies are employed to emulate these behaviors using a three-fingered robotic hand. Experimental results from real-world cucumber peeling trials substantiate our methodology, illustrating that the robotic hand can adjust contact through finger movements, thereby achieving humanlike peeling efficiency without necessitating alterations to the tool structure. This study not only demonstrates the feasibility of sophisticated tool use by robotic hands, but also highlights the critical importance of integrating tactile feedback to refine interaction with the environment.

Abstract:
Dynamic manipulation enables efficient interaction tasks, such as throwing, which rely on finding one or more high-quality trajectories from the initial state to the goal state. While model-free learning methods have been used to acquire efficient robot manipulation configurations, traditional planning algorithms often struggle with multi-task specifications and with high-dimensional, multi-modal trajectory data. Prior generative model-based approaches have made significant progress in the field of motion planning. Diffusion models, as an emerging class of generative models, have been widely applied to planning tasks in various environments and have gained attention for their ability to encode multidimensional and multimodal trajectories. Here we propose a method that combines the diffusion model and model-free throwing methods. Specifically, we use a backward reachable tube to search for throwing configurations, and sample from the posterior trajectory distribution conditioned on the throwing configurations. Several trajectory optimization methods are used to ensure the generation of effective throwing trajectories. Experimental results show that our method is effective in generating feasible, smooth, and collision-free throwing trajectories in both simulated and real-world tasks. Additionally, different trajectories are provided to enhance the multimodality of the throwing task.

Abstract:
This paper presents a novel imitation learning framework, called constrained diffusion policy (CDP). The primary objective of CDP is to ensure that learned policies strictly adhere to safety constraints while imitating expert demonstrations. To achieve this, we define a polytopic constraint that represents the safe boundary of the obstacle-free region. We introduce a novel mirror map and its inverse function to incorporate a generalized polytopic constraint manifold into the mirror diffusion model. By mapping sampled data onto a constrained manifold, the mirror diffusion model generates actions that satisfy safety constraints. This approach successfully addresses the safety issues commonly encountered in conventional imitation learning models. We apply the proposed framework to mobile navigation tasks in robotics, using the Isaac Gym simulator and the Unitree Go2 quadrupedal robot. Experimental results demonstrate that the proposed framework can successfully train policies that imitate expert behaviors while strictly maintaining safety constraints, thereby achieving safety-assured imitation learning.

Abstract:
This paper focuses on planning robot navigation tasks from natural language specifications. We develop a modular approach, where a large language model (LLM) translates the natural language instructions into a linear temporal logic (LTL) formula with propositions defined by object classes in a semantic occupancy map. The LTL formula and the semantic occupancy map are provided to a motion planning algorithm to generate a collision-free robot path that satisfies the natural language instructions. Our main contribution is LTLCodeGen, a method to translate natural language to syntactically correct LTL using code generation. We demonstrate the complete task planning method in real-world experiments involving human speech to provide navigation instructions to a mobile robot. We also thoroughly evaluate our approach in simulated and real-world experiments in comparison to end-to-end LLM task planning and state-of-the-art LLM-to-LTL translation methods.
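
One way to see why code generation helps with syntactic correctness: if formulas are only ever built through constructor functions, every output is well-parenthesized by construction. The sketch below illustrates that idea with a tiny hypothetical builder. The names `prop`, `eventually`, `always`, and `until`, the operator symbols, and the example instruction are assumptions made for illustration and are not LTLCodeGen's actual interface.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Formula:
    """An LTL formula stored as text that is well-formed by construction."""
    text: str

    def __and__(self, other):
        return Formula(f"({self.text} & {other.text})")

    def __or__(self, other):
        return Formula(f"({self.text} | {other.text})")

    def __invert__(self):
        return Formula(f"!{self.text}")


def prop(name):
    """Atomic proposition, e.g. an object class in the semantic map."""
    if not name.isidentifier():
        raise ValueError(f"invalid proposition name: {name!r}")
    return Formula(name)


def eventually(f):
    return Formula(f"F({f.text})")


def always(f):
    return Formula(f"G({f.text})")


def until(a, b):
    return Formula(f"({a.text} U {b.text})")


if __name__ == "__main__":
    # Hypothetical instruction: "Visit the kitchen, then the couch,
    # and never go near the stairs."
    spec = eventually(prop("kitchen") & eventually(prop("couch"))) & always(~prop("stairs"))
    print(spec.text)
```

In such a setup the language model would be asked to emit the Python expression rather than the raw formula string, so malformed output surfaces as an ordinary Python error instead of an unparsable formula.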

Abstract:
Simultaneous Localization and Mapping (SLAM) in large-scale, complex, and GPS-denied underground coal mine environments presents significant challenges. Sensors must contend with abnormal operating conditions: GPS unavailability impedes scene reconstruction and absolute geographic referencing, uneven or slippery terrain degrades wheel odometer accuracy, and long, feature-poor tunnels reduce LiDAR effectiveness. To address these issues, we propose CoalMine-LiDAR-IMU-UWB-Wheel-Odometry (CM-LIUW-Odometry), a multi-modal SLAM framework based on the Iterated Error-State Kalman Filter (IESKF). First, LiDAR-inertial odometry is tightly fused with UWB absolute positioning constraints to align the SLAM system with a global coordinate frame. Next, wheel odometry is integrated through tight coupling, enhanced by nonholonomic constraints (NHC) and vehicle lever arm compensation, to address performance degradation in areas beyond UWB measurement range. Finally, an adaptive motion mode switching mechanism dynamically adjusts the robot’s motion mode based on UWB measurement range and environmental degradation levels. Experimental results validate that our method achieves superior accuracy and robustness in real-world underground coal mine scenarios, outperforming state-of-the-art approaches. We open-source the code for this work on GitHub to benefit the robotics community.

Abstract:
Robotic manipulators capable of regulating both compliance and stiffness offer enhanced operational safety and versatility. Here, we introduce Worm Gear-based Adaptive Variable Elasticity (WAVE), a variable stiffness actuator (VSA) that integrates a non-backdrivable worm gear. By decoupling the driving motor from external forces using this gear, WAVE enables precise force transmission to the joint, while absorbing positional discrepancies through compliance. WAVE is protected from excessive loads by converting impact forces into elastic energy stored in a spring. In addition, the actuator achieves continuous joint stiffness modulation by changing the spring’s precompression length. We demonstrate these capabilities, experimentally validate the proposed stiffness model, show that motor loads approach zero at rest–even under external loading–and present applications using a manipulator with WAVE. This outcome showcases the successful decoupling of external forces. The protective attributes of this actuator allow for extended operation in contact-intensive tasks, and for robust robotic applications in challenging environments.

Affiliations: Department of Computer Science, Vanderbilt University, Nashville, TN, USA; Department of Mechanical Engineering, Johns Hopkins University, Baltimore, MD, USA; Robotics Center and Kahlert School of Computing, University of Utah, Salt Lake City, UT, USA; Department of Mechanical Engineering, Vanderbilt University, Nashville, TN, USA; Virtuoso Surgical, Nashville, TN, USA; Department of Mechanical, Aerospace and Biomedical Engineering, University of Tennessee, Knoxville, TN, USA

Abstract:
Surgical automation requires precise guidance and understanding of the scene. Current methods in the literature rely on bulky depth cameras to create maps of the anatomy; however, this does not translate well to space-limited clinical applications. Monocular cameras are small and allow minimally invasive surgeries in tight spaces, but additional processing is required to generate 3D scene understanding. We propose a 3D mapping pipeline that uses only RGB images to create segmented point clouds of the target anatomy. To ensure the most accurate reconstruction, we compare different structure from motion algorithms’ performance on mapping the central airway obstructions, and test the pipeline on a downstream task of tumor resection. In several metrics, including post-procedure percentage tissue charring, our pipeline performs comparably to RGB-D cameras and, in some cases, even surpasses their downstream task performance. These promising results demonstrate that automation guidance can be achieved in minimally invasive procedures with monocular cameras. This study is a step toward the complete autonomy of surgical robots.

Abstract:
Cardiovascular and cerebrovascular diseases are significant health issues that threaten human life. They typically develop insidiously and progress gradually, but when an event occurs, the consequences can be severe. These conditions often manifest suddenly and acutely, necessitating prompt treatment due to a very short therapeutic window. Vascular interventional surgery is the preferred treatment because of its rapid efficacy and brief recovery time. However, the procedure demands a high level of expertise and exposes physicians to radiation, and there is an uneven distribution of skilled doctors across different regions. The advent of vascular interventional surgical robots not only protects physicians from radiation exposure and improves procedural accuracy and safety but also enables remote interventions. Based on a robotic platform for vascular interventions, our team has developed a remote vascular interventional system that leverages WebRTC and 5G networks. This system ensures that the transmission latency for control commands and imaging meets clinical requirements, and our remote clinical experiments have demonstrated its feasibility and safety.

Abstract:
The agile maneuvering control of autonomous vehicles (AVs) requires the tracking of reference trajectories characterized by high acceleration, sharp curvature, considerable disturbances, and significant time variation, all while ensuring stability and accuracy. The inherent uncertainty and time-varying nature of both the vehicle model and its environment pose significant challenges to achieving high-performance tracking during agile maneuvers. Developing a control algorithm that can solve for the optimal policy of nonlinear systems with uncertainties is critical. In this paper, we propose a learning-based predictive control approach, namely, an adaptive model predictive control (AMPC) with Actor-Critic Learning (ACL) for generating closed-loop MPC policies for agile maneuvering of AVs. The proposed approach leverages neural networks to model the dynamics uncertainties online. The control policy and model are updated simultaneously to realize performance optimization under time-varying uncertainties. Simulation results demonstrate that our proposed algorithm outperforms other leading ACL methods, as well as MPC and Linear Quadratic Regulator (LQR). Furthermore, field test experiment results validate its effectiveness on the HongQi-EHS3 electric vehicle, showing superior control performance compared to MPC both on paved roads and curved off-roads with excellent stability performance.

Abstract:
Deploying real-time spatial perception on edge devices requires efficient multi-task models that leverage complementary task information while minimizing computational overhead. In this paper, we introduce Multi-Mono-Hydra (M2H), a novel multi-task learning framework designed for semantic segmentation and depth, edge, and surface normal estimation from a single monocular image. Unlike conventional approaches that rely on independent single-task models or shared encoder-decoder architectures, M2H introduces a Window-Based Cross-Task Attention Module that enables structured feature exchange while preserving task-specific details, improving prediction consistency across tasks. Built on a lightweight ViT-based DINOv2 backbone, M2H is optimized for real-time deployment and serves as the backbone for monocular spatial perception systems, a framework for 3D scene graph construction in dynamic environments. Comprehensive evaluations demonstrate that M2H outperforms state-of-the-art (SOTA) multi-task models on NYUDv2, exceeds single-task depth and semantic baselines on Hypersim, and achieves superior performance on Cityscapes datasets, all while maintaining computational efficiency on laptop hardware. Beyond curated benchmarks, we validate M2H on real-world data, demonstrating its practicality in spatial perception tasks. We provide our implementation and pretrained models at https://github.com/UAV-Centre-ITC/M2H.git.

Abstract:
Motion detection in 3D LiDAR is crucial for autonomous systems. While deep learning dominates Moving Object Segmentation (MOS), the potential of learning-free approaches remains underexplored. Unlike problems such as semantic segmentation, motion can be explicitly modeled, potentially enabling efficient, interpretable, and computationally lightweight solutions. Motivated by this, we introduce a novel real-time, online, learning-free MOS method. We propose the novel Join Count Feature to extract motion cues from a local window of range images, and long-term filtering with efficient two-step association to enhance accuracy. Compared to learning-based models, we achieve superior precision and competitive IoU for saliently moving objects on SemanticKITTI. Further evaluation on HeLiMOS demonstrates stronger generalization by the proposed method across different LiDAR sensors. These results highlight the potential of learning-free methods for motion detection in 3D LiDAR data.

Abstract:
Modeling and controlling the musculoskeletal system are crucial for understanding human motor functions, optimizing human-robot interaction, and developing embodied intelligence. However, existing musculoskeletal models are mainly limited to specific body parts and muscle groups, and still face challenges in large-scale muscle coordination and the generation of diverse movements. In this study, we propose a musculoskeletal deep reinforcement learning (DRL) control model. This model integrates a Long Short-Term Memory (LSTM) network and a Multi-Head Self-Attention (MHSA) mechanism into the Proximal Policy Optimization (PPO) algorithm. The LSTM-MHSA-enhanced PPO control approach generates accurate muscle activation, motion trajectories, and torque control strategies to precisely control and replicate diverse human gaits based on target joint movements. Experimental results demonstrate that this LSTM-MHSA-enhanced PPO algorithm significantly improves the model accuracy compared to the traditional PPO algorithm, with a 43.75% and 34.14% reduction in Mean Absolute Error (MAE) for walking and running tasks, respectively. Furthermore, for complex tasks such as striking and dancing, the MAE decreases by 46.97% and 41.78%, respectively. These findings highlight that integrating LSTM and MHSA into the PPO algorithm not only enhances gait simulation accuracy but also improves the model’s generalization capability, particularly for complex motion patterns. This research provides an efficient tool for motion simulation and gait analysis, advancing the development of human musculoskeletal control systems.

Abstract:
The rapid advancement of drone technology has led to the widespread application of micro-object detection in Unmanned Aerial Vehicle (UAV) systems. However, under the constraint of real-time computation, critical challenges remain in addressing extreme scale variations, low-resolution signatures and dense occlusions. For the object detection task, although YOLO-based detectors outperform transformer models in efficiency-accuracy balance, their limited capacity for global context modeling and feature discriminability in complex aerial environments hinders optimal performance. To overcome these limitations, we introduce UAV-MaLO, a novel framework that incorporates state space modeling principles into YOLO’s architecture. By introducing long-range dependency modeling and adaptive spatial-frequency fusion, the proposed approach dynamically optimizes receptive fields while suppressing background interference, achieving robust micro-object localization in cluttered scenarios. Furthermore, the parallelized attention mechanism and the hierarchical feature refinement ensure real-time processing capabilities without compromising detection precision, establishing a new paradigm for UAV deployment. Our experimental results on the VisDrone-2019-DET dataset reveal a significant improvement across several variants of average precision (AP), indicating the strong performance of our UAV-MaLO.

Abstract:
We propose a new approach to solving the problem of intention estimation in human-robot teleoperation for assembly tasks, which includes task estimation and action prediction. Our approach uses probabilistic graphical models to represent the joint distribution of the task and the actions to be taken to complete the task. Both model learning and inference are implemented with Pyro, a state-of-the-art probabilistic programming language. The feature that distinguishes it from traditional hidden Markov model-type probabilistic methods is that our model takes the time information into account and explicitly models the individual distributions of all the variables under consideration. By doing this, we fully utilize the power of probabilistic programming, and achieve accurate distribution estimates and hence accurate uncertainty estimates. Working with a pretrained action recognition module, the proposed model can be trained solely on a tiny instruction manual of the assembly tasks and can be retrained with minimal overhead whenever the manual is changed or augmented, avoiding the costly data reannotation and retraining required by end-to-end learning-based methods. We also compare our method with a transformer-based model trained directly on the instruction manual, and our method shows superior accuracy in both intention estimation and the estimation of the associated distributions. We additionally identify failure cases of both our method and the transformer-based method, and envision methods for improvement.

Abstract:
Accurate dynamics modeling is essential for quadrotors to achieve precise trajectory tracking in various applications. Traditional physical knowledge-driven modeling methods face substantial limitations in unknown environments characterized by variable payloads, wind disturbances, and external perturbations. On the other hand, data-driven modeling methods suffer from poor generalization when handling out-of-distribution (OoD) data, restricting their effectiveness in unknown scenarios. To address these challenges, we introduce the Physics-Informed Wind-Adaptive Network (PI-WAN), which combines knowledge-driven and data-driven modeling methods by embedding physical constraints directly into the training process for robust quadrotor dynamics learning. Specifically, PI-WAN employs a Temporal Convolutional Network (TCN) architecture that efficiently captures temporal dependencies from historical flight data, while a physics-informed loss function applies physical principles to improve model generalization and robustness across previously unseen conditions. By incorporating real-time prediction results into a model predictive control (MPC) framework, we achieve improvements in closed-loop tracking performance. Comprehensive simulations and real-world flight experiments demonstrate that our approach outperforms baseline methods in terms of prediction accuracy, tracking precision, and robustness to unknown environments.

Abstract:
Localization is one of the core parts of modern robotics. Classic localization methods typically follow the retrieve-then-register paradigm, achieving remarkable success. Recently, the emergence of end-to-end localization approaches has offered distinct advantages, including a streamlined system architecture and the elimination of the need to store extensive map data. Although these methods have demonstrated promising results, current end-to-end localization approaches still face limitations in robustness and accuracy. Bird’s-Eye-View (BEV) image is one of the most widely adopted data representations in autonomous driving. It significantly reduces data complexity while preserving spatial structure and scale consistency, making it an ideal representation for localization tasks. However, research on BEV-based end-to-end localization remains notably insufficient. To fill this gap, we propose BEVDiffLoc, a novel framework that formulates LiDAR localization as a conditional generation of poses. Leveraging the properties of BEV, we first introduce a specific data augmentation method to significantly enhance the diversity of input data. Then, the Maximum Feature Aggregation Module and Vision Transformer are employed to learn robust features while maintaining robustness against significant rotational view variations. Finally, we incorporate a diffusion model that iteratively refines the learned features to recover the absolute pose. Extensive experiments on the Oxford Radar RobotCar and NCLT datasets demonstrate that BEVDiffLoc outperforms the baseline methods. Our code is available at https://github.com/nubot-nudt/BEVDiffLoc.

Abstract:
We introduce the Grasp EveryThing (GET) gripper, a novel 1-DoF, 3-finger design for securely grasping objects of many shapes and sizes. Mounted on a standard parallel jaw actuator, the design features three narrow, tapered fingers arranged in a two-against-one configuration, where the two fingers converge into a V-shape. The GET gripper is more capable of conforming to object geometries and forming secure grasps than traditional designs with two flat fingers. Inspired by the principle of self-similarity, these V-shaped fingers enable secure grasping across a wide range of object sizes. Further to this end, fingers are parametrically designed for convenient resizing and interchangeability across robotic embodiments with a parallel jaw gripper. Additionally, we incorporate a rigid fingernail for ease in manipulating small objects. Tactile sensing can be integrated into the standalone finger via an externally-mounted camera. A neural network was trained to estimate normal force from tactile images with an average validation error of 1.3 N across a diverse set of geometries. In grasping 15 objects and performing 3 tasks via teleoperation, the GET fingers consistently outperformed standard flat fingers. All finger designs, compatible with multiple robotic embodiments, both incorporating and lacking tactile sensing, are available on GitHub.

Abstract:
Micro blimps exhibit significant potential for applications in environmental monitoring and disaster rescue. Nonetheless, traditional propulsion methods for micro blimps encounter challenges such as complex mechanical structures, intricate attitude control, and large volumes. This paper presents a novel compact and lightweight bio-inspired flapping-wing-driven micro robotic blimp with piezoelectric (PZT) actuation, featuring a simplified structure and achieving three-degree-of-freedom (DOF) motion control with only two flapping-wing thruster units. We present a high-voltage drive-sense-control circuit and adaptive control strategy, enabling wireless remote control, onboard attitude sensing, and closed-loop yaw control. The proposed micro robotic blimp, powered by an onboard battery, measures 15 cm along its major axis, weighs 1.53 g, and achieves a maneuvering speed of 17 cm/s and an angular velocity of 12°/s with a yaw angle control accuracy of 0.5°. As the smallest and lightest known self-powered micro blimp capable of stable yaw control, the platform demonstrates excellent endurance and environmental stealth characteristics and advances the design of micro aerial vehicles by offering a novel and efficient approach.

Abstract:
When manipulating an object, a robot must recognize not only the parts it directly grasps but also the surfaces in contact with the environment, which we refer to as extrinsic contact surfaces. These surfaces directly affect how the object interacts with its environment, and accurate surface estimation is critical for precise robotic manipulation. This study presents a novel framework for extrinsic contact surface reconstruction using vision-based tactile sensing. By leveraging marker-based tracking and analyzing kinematic constraints, we classify contact types and estimate the locations of both point and line contacts. To reconstruct the extrinsic contact surface, we compare three data integration methods: Mixed Vector Approach (MVA), Orthogonal Distance Regression (ODR), and Random Sample Consensus (RANSAC). Experimental results demonstrate that MVA achieves the highest accuracy in most cases by effectively integrating contact data while minimizing randomness. Experiments conducted on various object geometries validated the robustness of the proposed method, achieving an average positional error of 4.15 mm and an angular deviation of 4.58°. The results confirm that extrinsic contact sensing enables more efficient and precise object shape estimation, providing a promising approach for robotic manipulation.

Abstract:
LiDAR and 4D radar are widely used in autonomous driving and robotics. While LiDAR provides rich spatial information, 4D radar offers velocity measurement and remains robust under adverse conditions. As a result, a growing number of studies have focused on 4D radar-LiDAR fusion to enhance perception. However, the misalignment between different modalities is often overlooked. To address this challenge and leverage the strengths of both modalities, we propose a LiDAR detection framework enhanced by 4D radar motion status and cross-modal uncertainty. The object movement information from 4D radar is first captured using a Dynamic Motion-Aware Encoding module during feature extraction to enhance 4D radar predictions. Subsequently, the instance-wise uncertainties of bounding boxes are estimated to mitigate the cross-modal misalignment and refine the final LiDAR predictions. Extensive experiments on the View-of-Delft (VoD) dataset highlight the effectiveness of our method, achieving state-of-the-art performance with an mAP of 74.89% in the entire area and 88.70% within the driving corridor while maintaining a real-time inference speed of 30.02 FPS.

Abstract:
Deploying a visual servo controller to novel scenes with uncertain parameters requires additional manual effort for calibration. Traditional methods tackle this problem by estimating the Jacobian matrix online. However, they struggle in challenging scenes due to intrinsic limitations. For instance, image-based uncalibrated visual servo requires tracking a fixed set of points, which is impractical in texture-less scenes. Position-based uncalibrated visual servo necessitates the absolute scale of translation, which requires a depth sensor or a model-based pose estimator, introducing extra hardware cost or model complexity. Recent advances in neural network-based visual servoing have shown improvement in convergence, precision and generalization compared to traditional methods. However, uncalibrated neural visual servo remains underexplored. In this paper, we propose a structured Jacobian estimator for neural visual servo controllers, enabling zero-shot transfer to novel environments with unknown extrinsics and scene scale. Stability of pose error is analyzed under the bounded calibration error assumption. Moreover, we propose an automatic control gain scheduler to accelerate the convergence while maintaining high success rate and precision. The scheduling behavior is analyzed through greedy optimal control. Our method is validated with simulated and real-world experiments.

Abstract:
Visual Odometry (VO) estimates the pose and motion trajectory of the camera based on visual input, serving as a fundamental technique for robotic positioning and navigation. However, existing VO methods face challenges in visual degradation in extreme environments, e.g., high dynamic range or fast-motion conditions. Although event-based sensing schemes offer partial solutions to this problem, they are limited by unstable features and noise. Recently, a novel brain-inspired vision sensor, Tianmouc, has been reported, incorporating two complementary pathways: a cognition-oriented pathway (COP) for precise color intensity and an action-oriented pathway (AOP) for fast spatiotemporal sensing, considered a promising visual input for VO tasks. Here, we develop Complementary Pathway Spatial Enhanced Visual Odometry (CSVO) to cope with extreme scenarios by fusing the COP and AOP information of Tianmouc. To leverage the dynamic range expansion brought about by dual-pathway fusion, as well as the low-latency spatial difference data in AOP to address high-speed motion, we design an asynchronous dual-pathway feature encoder considering synchronous multimodal fusion and asynchronous cross-modal feature matching. To train and evaluate CSVO, we transform two conventional VO datasets, TartanAir and Apollo, to Tianmouc modality through simulation and collect a real-world Tianmouc-VO dataset in challenging scenes. Our results demonstrate state-of-the-art performance over existing methods on these datasets. Our work sheds light on the generalizability of agents working in extreme scenarios. The codes and data sets are available at https://github.com/Tianmouc/CSVO.

Abstract:
Brain-computer interface (BCI) is an important technology in developing the closed-loop brain training system for cognitive functional rehabilitation. Most existing BCI paradigms do not ensure the desired immersion of mind and body, thereby limiting participants’ engagement in training tasks. In this paper, we propose a sensory-immersive BCI paradigm for decision-making with a novel motion-panoramic virtual-reality system, aiming for deep involvement of both mind and body in brain functional training. This paradigm integrates visual, auditory and motion multi-sensory stimulation by using the Gait Real-time Analysis Interactive Lab system to implement the modified ultimatum game for decision making. The designed paradigm is validated through three experimental studies: event-related potentials analysis, power spectral density analysis and brain network analysis. They demonstrate that the designed paradigm can achieve better performance in motor-cognitive interaction and multi-sensory coordination, by effectively enhancing brain activation in visual, auditory, and motor processing regions, which can result in more effective activation of decision-making areas like the prefrontal cortex. Compared to the existing paradigm, our paradigm increases the number of high-intensity functional connections in the brain regions of participants by 62.8% (from 86 to 140), and the number of effective functional connections by 90.5% (from 252 to 480).
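
The two relative increases quoted above can be checked directly from the connection counts given in the abstract. The short sketch below only redoes that arithmetic. The function name `percent_increase` is our own, not something from the paper.

```python
def percent_increase(before: float, after: float) -> float:
    """Relative increase from `before` to `after`, in percent."""
    return (after - before) / before * 100.0


# Connection counts quoted in the abstract.
high_intensity = percent_increase(86, 140)   # high-intensity functional connections
effective = percent_increase(252, 480)       # effective functional connections

print(f"high-intensity connections: +{high_intensity:.1f}%")
print(f"effective connections: +{effective:.1f}%")
```

Rounded to one decimal place, both values agree with the percentages reported in the abstract.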

Abstract:
Reinforcement learning (RL) holds significant promise for adaptive traffic signal control. While existing RL-based methods demonstrate effectiveness in reducing vehicular congestion, their predominant focus on vehicle-centric optimization leaves pedestrian mobility needs and safety challenges unaddressed. In this paper, we present a deep RL framework for adaptive control of eight traffic signals along a real-world urban corridor, jointly optimizing both pedestrian and vehicular efficiency. Our single-agent policy is trained using real-world pedestrian and vehicle demand data derived from Wi-Fi logs and video analysis. The results demonstrate significant performance improvements over traditional fixed-time signals, reducing average wait times per pedestrian and per vehicle by up to 67% and 52% respectively, while simultaneously decreasing total wait times for both groups by up to 67% and 53%. Additionally, our results demonstrate generalization capabilities across varying traffic demands, including conditions entirely unseen during training, validating RL’s potential for developing transportation systems that serve all road users.

Abstract:
The existence of variable factors within the environment can cause a decline in camera localization accuracy, as it violates the fundamental assumption of a static environment in Simultaneous Localization and Mapping (SLAM) algorithms. Recent semantic SLAM systems towards dynamic environments either rely solely on 2D semantic information, or solely on geometric information, or combine their results in a loosely integrated manner. In this research paper, we introduce 3DS-SLAM, 3D Semantic SLAM, tailored for dynamic scenes with visual 3D object detection. The 3DS-SLAM is a tightly-coupled algorithm resolving both semantic and geometric constraints sequentially. We designed a 3D part-aware hybrid transformer for point cloud-based object detection to identify dynamic objects. Subsequently, we propose a dynamic feature filter based on HDBSCAN clustering & Weighted Cluster selection to extract objects with significant absolute depth differences. When compared against ORB-SLAM2, CFP-SLAM, and DYNA-SLAM, 3DS-SLAM exhibits an average improvement of 98.01%, 28.54%, and 50.92%, respectively, across the dynamic sequences of the TUM RGB-D dataset. Furthermore, it surpasses the performance of the other four leading SLAM systems designed for dynamic environments. The code and pre-trained models are available at https://github.com/sai-krishnaghanta/3DS-SLAM

Abstract:
3D Multi-Object Tracking (MOT) provides the trajectories of surrounding objects, assisting robots or vehicles in smarter path planning and obstacle avoidance. Existing 3D MOT methods based on the Tracking-by-Detection framework typically use a single motion model to track an object throughout its entire tracking process. However, objects may change their motion patterns due to variations in the surrounding environment. In this paper, we introduce the Interacting Multiple Model filter in IMM-MOT, which accurately fits the complex motion patterns of individual objects, overcoming the limitation of single-model tracking in existing approaches. In addition, we incorporate a Damping Window mechanism into the trajectory lifecycle management, leveraging the continuous association status of trajectories to control their creation and termination, reducing the occurrence of overlooked low-confidence true targets. Furthermore, we propose the Distance-Based Score Enhancement module, which enhances the differentiation between false positives and true positives by adjusting detection scores, thereby improving the effectiveness of the Score Filter. On the NuScenes Val dataset, IMM-MOT outperforms most other single-modal models using 3D point clouds, achieving an AMOTA of 73.8%. Our project is available at https://github.com/Ap01lo/IMM-MOT.
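
As background for the Interacting Multiple Model filter mentioned above, the following is a minimal sketch of the textbook IMM mode-probability update, in which each motion model's probability is reweighted by how well it explains the latest measurement. It is not the IMM-MOT implementation. The two-model setup, the transition matrix and the likelihood values are made-up illustration values.

```python
from typing import List, Sequence


def update_model_probabilities(
    mu: Sequence[float],
    transition: Sequence[Sequence[float]],
    likelihoods: Sequence[float],
) -> List[float]:
    """One textbook IMM mode-probability update.

    mu[i]            prior probability that model i is active
    transition[i][j] probability of switching from model i to model j
    likelihoods[j]   measurement likelihood under model j's filter
    """
    n = len(mu)
    # Predicted mode probabilities after the Markov switching step.
    predicted = [sum(transition[i][j] * mu[i] for i in range(n)) for j in range(n)]
    # Bayes update with each model's measurement likelihood.
    unnormalized = [likelihoods[j] * predicted[j] for j in range(n)]
    total = sum(unnormalized)
    if total <= 0.0:
        raise ValueError("all model likelihoods are zero")
    return [u / total for u in unnormalized]


# Hypothetical example: model 0 = constant velocity, model 1 = turning.
prior = [0.5, 0.5]
switching = [[0.9, 0.1],
             [0.1, 0.9]]
measurement_likelihoods = [0.2, 0.8]  # the turning model fits the detection better

posterior = update_model_probabilities(prior, switching, measurement_likelihoods)
print(posterior)
```

In a full IMM filter these probabilities also drive the mixing of the per-model state estimates before prediction and the combination of the estimates afterwards. Both steps are omitted here.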

Abstract:
Global navigation satellite system (GNSS) positioning can be significantly degraded due to multipath and non-line-of-sight (NLOS) signals in urban areas. Cellular vehicle-to-everything (C-V2X) technology provides new opportunities to enhance GNSS performance from a single intelligent vehicle by leveraging roadside GNSS (RSG) and C-V2X. Inspired by this, we propose an RSG-aided GNSS/LiDAR/IMU (RSG-GLIO) method to achieve reliable odometry and mapping, which leverages the high-quality double-differenced (DD) measurements provided by nearby RSG, effectively mitigating shared random errors such as multipath and NLOS. Our RSG-GLIO first estimates the absolute state of the vehicle using onboard sensors. Utilizing this initial positioning estimate, the proposed method introduces a coarse-to-fine selection scheme to identify consistent DD observations from available RSG measurements. Finally, the consistent roadside DD constraints are jointly optimized in factor graph optimization (FGO). Static and dynamic data collected with multiple RSG receivers deployed in the Hong Kong C-V2X testbed are used to evaluate the effectiveness of roadside-aided positioning. The results demonstrate a significant 36.6% improvement in terms of absolute positioning accuracy compared to the state-of-the-art GLIO method. Furthermore, we showcase the potential for employing RSG as low-cost base stations in dense urban areas. The data of our work is publicly accessible at https://github.com/DarrenWong/RSG-GLIO.
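
As background on double-differenced measurements, the sketch below shows the standard construction: pseudoranges are differenced between two receivers (here a vehicle "rover" and a roadside "base") and then between two satellites, which cancels the receiver and satellite clock biases. The pseudorange model is deliberately simplified, and all ranges and clock values are hypothetical. This illustrates the textbook DD property only, not the RSG-GLIO selection scheme or factor graph.

```python
def pseudorange(geometric_range_m: float, receiver_clock_m: float, satellite_clock_m: float) -> float:
    """Simplified pseudorange: range + receiver clock bias - satellite clock bias (metres)."""
    return geometric_range_m + receiver_clock_m - satellite_clock_m


def double_difference(rover_i: float, base_i: float, rover_j: float, base_j: float) -> float:
    """(rover - base) for satellite i minus (rover - base) for satellite j."""
    return (rover_i - base_i) - (rover_j - base_j)


# Hypothetical geometric ranges (metres) for two receivers and two satellites.
geom = {
    ("rover", "i"): 20_200_000.0,
    ("base", "i"): 20_200_350.0,
    ("rover", "j"): 21_500_000.0,
    ("base", "j"): 21_499_800.0,
}
# Hypothetical clock biases, already converted to metres.
receiver_clock = {"rover": 1500.0, "base": -820.0}
satellite_clock = {"i": 45.0, "j": -130.0}

rho = {
    (rx, sat): pseudorange(g, receiver_clock[rx], satellite_clock[sat])
    for (rx, sat), g in geom.items()
}

dd_measured = double_difference(
    rho[("rover", "i")], rho[("base", "i")], rho[("rover", "j")], rho[("base", "j")]
)
dd_geometric = double_difference(
    geom[("rover", "i")], geom[("base", "i")], geom[("rover", "j")], geom[("base", "j")]
)

# Both values are -550 m: every clock term has cancelled.
print(dd_measured, dd_geometric)
```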

Abstract:
Achieving natural locomotion across diverse environments with prosthetic limbs remains a significant challenge for amputees. Intelligent prosthetics leverage motion planning techniques using phase variables to emulate natural gait aligned with human movement intentions. However, traditional phase variable-based planning, which utilizes geometric human motion models, often lacks robustness when encountering external disturbances. Additionally, models derived from human walking data can only approximate a limited set of discrete tasks, hindering the construction of a comprehensive model. In this study, we present an advanced prosthetic motion planning approach that integrates Dynamic Motion Primitives (DMPs) to ensure robust performance across multiple tasks. We demonstrate that DMPs with human-in-the-loop effectively simulate human joint movement trajectories under various task conditions. Furthermore, we introduce a novel Multi-Task Dynamic Motion Primitives with Singular Value Decomposition (DMPs-SVD) method, which incorporates multiple feature trajectory learning. This approach constructs a coherent task model using a limited dataset of typical human walking patterns, enabling joint motion planning across diverse task scenarios. Experimental results validate the viability and efficacy of the proposed human-in-loop DMPs and DMPs-SVD techniques in prosthetic applications.

Abstract:
Accurate localization plays an important role in high-level autonomous driving systems. Conventional map matching-based localization methods solve the poses by explicitly matching map elements with sensor observations, generally sensitive to perception noise, therefore requiring costly hyperparameter tuning. In this paper, we propose an end-to-end localization neural network which directly estimates vehicle poses from surrounding images, without explicitly matching perception results with HD maps. To ensure efficiency and interpretability, a decoupled BEV neural matching-based pose solver is proposed, which estimates poses in a differentiable sampling-based matching module. Moreover, the sampling space is hugely reduced by decoupling the feature representation affected by each DoF of poses. The experimental results demonstrate that the proposed network is capable of performing decimeter level localization with mean absolute errors of 0.19m, 0.13m and 0.39° in longitudinal, lateral position and yaw angle while exhibiting a 68.8% reduction in inference memory usage.

Abstract:
Morphing quadrotors are capable of adapting to constrained environments through geometric reconfiguration. However, existing systems are limited by mechanical complexity and rigid links, which affect both safety and performance in such environments. In this paper, we propose a strut-actuated tensegrity aerial vehicle that integrates shape adaptation with collision resilience. By incorporating deformable struts and a cable network, our vehicle enables real-time morphological adjustments during flight while maintaining stability. We present a hierarchical planning framework that ensures the entire vehicle remains confined within an icosahedral space, thereby guaranteeing full-body safety. An on-manifold Model Predictive Controller (MPC) is employed to track these optimized trajectories and compensate for inertia shifts during shape deformation. Simulation results validate the effectiveness of the proposed framework, demonstrating its capability to navigate in restricted scenarios.

Abstract:
The remote center of motion (RCM) constraint is a vital requirement in the design of robotic systems for transrectal ultrasound (TRUS) probe-scanning. This paper presents the design and development of a novel RCM-constrained manipulator specifically tailored for TRUS probe-scanning applications. The proposed system features a six-degree-of-freedom (6-DoF) parallel-serial hybrid mechanism that enables the TRUS probe to perform pivot and spin rotations while maintaining the RCM constraint. The kinematic model incorporating the RCM constraint is then derived. Next, a geometry-aware path planning method is introduced that accounts for variations in the desired rotation targets. This method parameterizes distance metrics on SO(3) (a Lie group) using coordinate-free Riemannian geometry, enabling the dynamic optimization of rotation orders to minimize the calculated Riemannian metrics. Furthermore, a smooth rotational trajectory generation method is proposed, constructing rotation curves between the ordered matrices on SO(3) while minimizing angular acceleration. Both simulations and experimental results validate the effectiveness and practicality of the proposed manipulator and its path planning method.
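
To make the notion of a distance on SO(3) concrete, here is a minimal sketch of the standard geodesic (angular) distance between two rotation matrices, computed as the rotation angle of R1^T R2. Whether the paper uses exactly this metric is an assumption on our part. The helper names `rot_z` and `geodesic_distance` are illustrative.

```python
import math
from typing import List

Matrix = List[List[float]]


def rot_z(theta: float) -> Matrix:
    """Rotation by `theta` radians about the z-axis."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s, 0.0],
            [s, c, 0.0],
            [0.0, 0.0, 1.0]]


def geodesic_distance(r1: Matrix, r2: Matrix) -> float:
    """Angle (radians) of the relative rotation r1^T r2.

    trace(r1^T r2) equals the element-wise (Frobenius) inner product of r1 and r2,
    and for a rotation by angle phi the trace is 1 + 2*cos(phi).
    """
    trace = sum(r1[i][j] * r2[i][j] for i in range(3) for j in range(3))
    cos_angle = (trace - 1.0) / 2.0
    # Clamp to guard against round-off slightly outside [-1, 1].
    cos_angle = max(-1.0, min(1.0, cos_angle))
    return math.acos(cos_angle)


d = geodesic_distance(rot_z(0.3), rot_z(1.0))
print(d)  # about 0.7 rad, the angle between the two rotations
```

Ordering a set of target rotations so that the sum of such distances is small is one way to read the "rotation order" optimization described above. The actual planner in the paper may differ.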

Abstract:
The recent development of foundation models for monocular depth estimation such as Depth Anything has paved the way for zero-shot monocular depth estimation. Since it returns an affine-invariant disparity map, the favored technique to recover the metric depth consists of fine-tuning the model. However, this stage is not straightforward: it can be costly and time-consuming because of the training and the creation of the dataset. The latter must contain images captured by the camera that will be used at test time and the corresponding ground truth. Moreover, fine-tuning may also degrade the generalization capacity of the original model. Instead, we propose in this paper a new method to rescale Depth Anything predictions using 3D points provided by sensors or techniques such as low-resolution LiDAR or structure-from-motion with poses given by an IMU. This approach avoids fine-tuning and preserves the generalization power of the original depth estimation model while being robust to noise in the sparse depth, in the camera-LiDAR calibration, or in the depth model. Our experiments highlight enhancements relative to zero-shot monocular metric depth estimation methods, competitive results compared to fine-tuned approaches, and better robustness than depth completion approaches. Code available at github.com/ENSTA-U2IS-AI/depth-rescaling.
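To make the rescaling idea concrete, here is a minimal sketch that fits a scale and shift mapping an affine-invariant disparity to inverse metric depth using a handful of sparse 3D points. It uses plain least squares on synthetic, noise-free data; the abstract's method is described as robust to noise, so this should be read as a simplified stand-in, and the function names are hypothetical.

```python
def fit_scale_shift(disparity, metric_depth):
    """Least-squares fit of 1/depth = s * disparity + t over sparse points."""
    inv_depth = [1.0 / z for z in metric_depth]
    n = len(disparity)
    mean_d = sum(disparity) / n
    mean_y = sum(inv_depth) / n
    var_d = sum((d - mean_d) ** 2 for d in disparity)
    cov = sum((d - mean_d) * (y - mean_y) for d, y in zip(disparity, inv_depth))
    s = cov / var_d
    t = mean_y - s * mean_d
    return s, t

def to_metric_depth(disparity, s, t):
    """Apply the fitted affine map and invert to obtain metric depth."""
    return [1.0 / (s * d + t) for d in disparity]

# Synthetic check: sparse depths generated from a known scale and shift.
true_s, true_t = 0.5, 0.02
sparse_disp = [0.1, 0.4, 0.9, 1.6]
sparse_depth = [1.0 / (true_s * d + true_t) for d in sparse_disp]
s, t = fit_scale_shift(sparse_disp, sparse_depth)
dense_depth = to_metric_depth([0.2, 0.8], s, t)  # rescale other pixels
```

With noisy LiDAR or structure-from-motion points, a robust estimator would likely be needed in place of the ordinary least-squares fit.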

Abstract:
Modeling expert driving behavior is crucial for the successful implementation of human-like autonomous driving. In this paper, we propose a new sampling-based Maximum Entropy Deep Inverse Reinforcement Learning (MEDIRL) framework. It leverages naturalistic human driving data to train the reward model and thus evaluates driving behaviors from the reward of sampled candidate trajectories. The proposed framework utilizes deep neural networks to learn the feature-reward mapping, which offers superior fitting capabilities compared to traditional linear reward functions. A polynomial trajectory sampler for long-term decision making and a dynamic window trajectory sampler for short-term planning are adopted to simplify the calculation of partition function in the MEDIRL algorithm. In addition, the proposed framework offers a solution to the probability estimation of driving behaviors by calculating the likelihood of sampled candidate trajectories based on their reward values. Comparative experiments are conducted on the NGSIM US-101 Highway dataset, and the experimental results demonstrate the superiority of the proposed model in personalizing reward functions, as well as the applicability of the proposed method in modeling driving behaviors across various time horizons.
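In maximum entropy IRL, the likelihood of a trajectory is proportional to the exponential of its reward, and a finite set of sampled candidates can stand in for the intractable partition function. The sketch below shows that computation for a list of reward values; the reward numbers are made up, and the function name is an assumption for illustration rather than the paper's code.

```python
import math

def trajectory_probabilities(rewards):
    """Softmax over sampled-trajectory rewards.

    The sum over samples approximates the partition function.
    """
    r_max = max(rewards)  # subtract the max for numerical stability
    weights = [math.exp(r - r_max) for r in rewards]
    z = sum(weights)
    return [w / z for w in weights]

# Hypothetical rewards assigned by a learned reward model to four candidates.
probs = trajectory_probabilities([1.2, 0.3, -0.5, 2.0])
most_likely = probs.index(max(probs))
```

The candidate with the highest reward receives the highest probability, which is how reward values can be turned into a behavior likelihood estimate.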

Abstract:
Virtual reality (VR) technology has enormous applications in education, entertainment, and healthcare. Haptic feedback can significantly enhance the immersive experience in VR. However, most commercial hand/fingertip wearable VR haptic devices rely on bulky rigid structures, which are limited in the offered stimuli and cause fatigue. This study introduces a novel soft wearable fingertip device, T-Touch, that provides both thermal and multi-frequency haptic feedback for more realistic VR experiences. A flexible electrohydraulic actuator (EHA) is adopted for multi-frequency mechanical stimuli, and a flexible thermoelectric array (Flex-TEA) is utilized for distinct thermal stimuli. The EHA and Flex-TEA can be independently controlled to activate simultaneously or independently, thereby rendering ON/OFF contact stimuli, vibrations, controlled temperature stimuli, or any combination of the three modalities. Our T-Touch device features a compact form factor of 35 mm × 25 mm × 22 mm and weighs only ∼8 g. It can generate mechanical stimuli with the maximum stroke of ∼1 mm, a force of 0.47 N, at a bandwidth >10 Hz, and can render precise thermal stimuli in the range of 20 to 40 °C. The main performance of the EHA and Flex-TEA modules is characterized in extensive experiments and the effects of the key design and actuation parameters are investigated to optimise performance. Preliminary user tests verify the efficacy of our T-Touch design in immersive VR applications.

Abstract:
Series elastic actuators (SEAs) achieve output torque and stiffness control by managing the deformation of a spring arranged between the motor and the output. SEAs are ideal for tasks involving human-robot interaction and unstructured environments. The stiffness, size, and torque capacity of the spring are crucial for the performance of an SEA. To increase the torque capacity of an SEA while maintaining a compact size, this paper proposes a novel helical flexure as the spring in an SEA. The helical flexure is formed on the thin wall of a tube to generate high torque output with minimal reaction forces and moments. The resulting compliant tube can be combined with other transmission components to reduce the number of components and allow the inner passage of cables and shafts. Simulation comparisons and experimental testing verify the merits of the proposed helical flexure. An SEA prototype is fabricated to demonstrate the performance of the new helical flexure during zero-torque and high-torque motion.

Affiliations: School of Control Science and Engineering, Shandong University, China; School of Electrical Engineering, Shandong University, Jinan, China; School of Information Science and Engineering, Shandong University, Qingdao, China; Department of Electronic and Electrical Engineering, Southern University of Science and Technology, Shenzhen, China; The Department of Medical Physics and Biomedical Engineering, UCL Hawkes Institute, University College London, London, UK

Abstract:
In this paper, we propose a novel unsupervised intraoperative liver deformation correction method, called Learning Coherent point drift Network (LCNet), for image-guided liver surgery (IGLS). We first estimate the correspondences between the preoperative and intraoperative point sets in the optimal transport (OT) module by leveraging both original points and extracted features. Afterwards, we compute the point-wise displacement vector by solving the involved matrix equation in the Transformation module, where the point localisation noise is explicitly considered and modeled. Additionally, we present three variants of the proposed approach, i.e., LCNet, LCNet-ED and LCNet-WD, where the better registration performance of LCNet compared with the other two demonstrates the superiority of the utilised Chamfer loss. We have extensively evaluated LCNet on the MedShapeNet dataset consisting of 615 different liver shapes of real patients, and the 3Dircadb dataset comprising 20 liver models of real patients. Extensive experimental results under different deformation and noise magnitudes demonstrate that LCNet outperforms existing state-of-the-art registration algorithms and holds significant application potential in IGLS. For example, when the overlapping ratio between the preoperative and intraoperative point sets is 25%, the deformation magnitude is 8 mm, the maximum point localization noise magnitude is 2 mm and the rotation angle lies in the range of [−45°, 45°], LCNet achieves a root-mean-square error (RMSE) of 3.21 mm on the MedShapeNet dataset, significantly outperforming Lepard and RoITr, which achieve 5.41 mm (p < 0.001) and 4.90 mm (p < 0.001), respectively.
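For readers unfamiliar with the Chamfer loss mentioned above, the following sketch computes one common symmetric form: the mean squared nearest-neighbour distance from each set to the other, summed over both directions. The exact variant used by LCNet is not stated in the abstract, so the convention here (squared distances, averaged per set) is an assumption.

```python
def chamfer_distance(a, b):
    """Symmetric Chamfer distance between two point sets (squared-distance form)."""
    def sq_dist(p, q):
        return sum((pi - qi) ** 2 for pi, qi in zip(p, q))

    def one_way(src, dst):
        # Mean squared distance from each source point to its nearest target point.
        return sum(min(sq_dist(p, q) for q in dst) for p in src) / len(src)

    return one_way(a, b) + one_way(b, a)

set_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0)]
set_b = [(0.0, 0.0, 0.0), (1.0, 0.0, 1.0)]
loss = chamfer_distance(set_a, set_b)
```

Because the loss needs no point-to-point correspondences, it suits unsupervised registration where ground-truth matches are unavailable.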

Abstract:
The paper proposes a novel Economic Model Predictive Control (EMPC) scheme for Autonomous Surface Vehicles (ASVs) to simultaneously address path following accuracy and energy constraints under environmental disturbances. By formulating lateral deviations as energy-equivalent penalties in the cost function, our method enables explicit trade-offs between tracking precision and energy consumption. Furthermore, a motion-dependent decomposition technique is proposed to estimate terminal energy costs based on vehicle dynamics. Simulations with real-world ocean disturbance data show that, compared with the existing EMPC method, the proposed controller incurs only a 0.06% increase in energy consumption while reducing cross-track errors by up to 18.61%. Field experiments conducted on an ASV equipped with an Intel N100 CPU in natural lake environments validate practical feasibility, achieving 0.22 m average cross-track error at nearly 1 m/s and 10 Hz control frequency. The proposed scheme provides a computationally tractable solution for ASVs operating under resource constraints.

Abstract:
Monocular depth estimation and ego-motion estimation are significant tasks for scene perception and navigation in stable, accurate and efficient robot-assisted endoscopy. To tackle lighting variations and sparse textures in endoscopic scenes, multiple techniques including optical flow, appearance flow and intrinsic image decomposition have been introduced into the existing methods. However, an effective training strategy for multiple modules is still critical to deal with both illumination issues and information interference for self-supervised depth estimation in endoscopy. Therefore, a novel framework with multistep efficient finetuning is proposed in this work. In each epoch of end-to-end training, the process is divided into three steps, including optical flow registration, multiscale image decomposition and multiple transformation alignments. At each step, only the related networks are trained, without interference from irrelevant information. Based on parameter-efficient finetuning of the foundation model, the proposed method achieves state-of-the-art performance on self-supervised depth estimation on the SCARED dataset and zero-shot depth estimation on the Hamlyn dataset, with 4% to 10% lower error. The evaluation code of this work has been published at https://github.com/BaymaxShao/EndoMUST.

Abstract:
This paper proposes Constrained Sampling Cluster Model Predictive Path Integral (CSC-MPPI), a novel constrained formulation of MPPI designed to enhance trajectory optimization while enforcing strict constraints on system states and control inputs. Traditional MPPI, which relies on a probabilistic sampling process, often struggles with constraint satisfaction and generates suboptimal trajectories due to the weighted averaging of sampled trajectories. To address these limitations, the proposed framework integrates a primal-dual gradient-based approach and Density-Based Spatial Clustering of Applications with Noise (DBSCAN) to steer sampled input trajectories into feasible regions while mitigating risks associated with weighted averaging. First, to ensure that sampled trajectories remain within the feasible region, the primal-dual gradient method is applied to iteratively shift sampled inputs while enforcing state and control constraints. Then, DBSCAN groups the sampled trajectories, enabling the selection of representative control inputs within each cluster. Finally, among the representative control inputs, the one with the lowest cost is chosen as the optimal action. As a result, CSC-MPPI guarantees constraint satisfaction, improves trajectory selection, and enhances robustness in complex environments. Simulation and real-world experiments demonstrate that CSC-MPPI outperforms traditional MPPI in obstacle avoidance, achieving improved reliability and efficiency. The experimental videos are available at https://cscmppi.github.io
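The risk of weighted averaging can be seen in a toy one-dimensional case: if samples split into "steer left" and "steer right" groups around an obstacle, the MPPI average lands near zero, i.e. straight at the obstacle. The sketch below contrasts that average with a cluster-then-select step. The gap-based 1-D grouping is a deliberately simple stand-in for DBSCAN, the within-cluster weighted mean as "representative" is an assumption, and all numbers are invented for illustration.

```python
import math

def mppi_weights(costs, temperature):
    """Standard MPPI importance weights: softmin of trajectory costs."""
    c_min = min(costs)
    w = [math.exp(-(c - c_min) / temperature) for c in costs]
    total = sum(w)
    return [x / total for x in w]

def group_1d(values, gap):
    """Group scalar samples whose sorted neighbours lie within `gap`.

    Simple stand-in for DBSCAN in one dimension.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    clusters = [[order[0]]]
    for prev, cur in zip(order, order[1:]):
        if values[cur] - values[prev] <= gap:
            clusters[-1].append(cur)
        else:
            clusters.append([cur])
    return clusters

# Steering samples: about -1 passes left, about +1 passes right, 0 hits the obstacle.
controls = [-1.05, -1.0, -0.95, 0.95, 1.0, 1.05]
costs = [1.0, 0.9, 1.1, 1.0, 0.8, 1.1]
weights = mppi_weights(costs, temperature=1.0)

# Plain MPPI: weighted average over all samples.
averaged = sum(w * u for w, u in zip(weights, controls))

# Cluster first, build one representative per cluster, pick the cheapest.
representatives = []
for cluster in group_1d(controls, gap=0.2):
    w_sum = sum(weights[i] for i in cluster)
    u_rep = sum(weights[i] * controls[i] for i in cluster) / w_sum
    c_rep = sum(weights[i] * costs[i] for i in cluster) / w_sum
    representatives.append((c_rep, u_rep))
selected = min(representatives)[1]
```

Here `averaged` falls between the two modes, while `selected` stays inside the lower-cost mode.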

Abstract:
Cross-modal knowledge distillation (CMKD) in monocular 3D object detection transfers LiDAR’s accurate depth information to compensate for the limitations of camera model. However, current methods directly align the intermediate features of the teacher and student networks, in which the modality gap between LiDAR and camera hinders their effectiveness. To mitigate this issue, we design two modules, namely, Consistent Alignment Module (CAM) and Deformable Adapter Module (DAM) to reduce the modality gap of CMKD. The CAM transforms intermediate features of LiDAR and camera into some consistent features through a lightweight Target Head. It is based on the observation that some high-level features such as heatmaps and depths are highly correlated in CMKD, though modality gap appears between LiDAR and camera. Therefore, these features can be effectively transferred from teacher to student in CMKD. The DAM introduces a deformable adapter for the intermediate features of the student network to reduce background noise in CMKD. This helps to dynamically align its intermediate features with the teacher network. We then propose a Consistent Feature Alignment network (MonoCFA) for CMKD to boost monocular 3D object detection. Our network integrates the two designed modules at different levels of the teacher and student networks, in order to align the intermediate features of LiDAR and camera more accurately and reliably. Our model can be widely applied to existing monocular 3D object detection models. For validation, we choose the representative MonoDLE, GUPNet, and DID-M3D as base models. Experiments on the KITTI benchmark show that our method significantly outperforms the three base models by 39%, 15.5%, and 15%, respectively, and achieves state-of-the-art when compared to other CMKD models.

Abstract:
Microrobotic swarms have shown promising features due to their collective and flexible behaviours, but achieving precise swarm control and autonomous navigation in complex environments remains a challenge. Here, we propose a Transformer-based reinforcement learning strategy that integrates Proximal Policy Optimization for autonomous swarm control in obstacle environments. By incorporating domain randomization, this strategy enables direct transfer from simulation to the real world without fine-tuning. Experimental results demonstrate robust control performance in avoiding static obstacles and in tracking a dynamic target, a task that is not encountered during training. The swarm autonomously navigates and adjusts its velocity and trajectory in obstacle environments with an intact swarm pattern. Our work presents a scalable strategy for the deployment of microrobotic swarms with adaptive navigation capability through complex, constrained environments.

Abstract:
Imitation learning (IL) has shown immense promise in enabling autonomous dexterous manipulations, including in learning surgical tasks. To fully unlock the potential of IL for surgery, access to clinical datasets is needed, which unfortunately lack the kinematic data required for current IL approaches. A promising source of large-scale surgical demonstrations is monocular surgical videos available online, making monocular pose estimation a crucial step toward enabling large-scale robot learning. Towards this end, we propose SurgiPose, a differentiable rendering-based approach to estimate kinematic information from monocular surgical videos, eliminating the need for direct access to ground-truth kinematics. Our method infers tool trajectories and joint angles by optimizing tool pose parameters to minimize the discrepancy between rendered and real images. To evaluate the effectiveness of our approach, we conduct experiments on two robotic surgical tasks—tissue lifting and needle pickup—using the da Vinci Research Kit Si (dVRK Si). We train imitation learning policies with both ground-truth measured kinematics and with estimated kinematics from video and compare their performance. Our results show that policies trained on estimated kinematics achieve comparable success rates to those trained on ground-truth data, demonstrating the feasibility of using monocular video-based kinematic estimation for surgical robot learning. By enabling kinematic estimation from monocular surgical videos, our work lays the foundation for large-scale learning of autonomous surgical policies from online surgical data.

Abstract:
4D radar-based object detection has garnered great attention for its robustness in adverse weather conditions and capacity to deliver rich spatial information across diverse driving scenarios. Nevertheless, the sparse and noisy nature of 4D radar point clouds poses substantial challenges for effective perception. To address the limitation, we present CORENet, a novel cross-modal denoising framework that leverages LiDAR supervision to identify noise patterns and extract discriminative features from raw 4D radar data. Designed as a plug-and-play architecture, our solution enables seamless integration into voxel-based detection frameworks without modifying existing pipelines. Notably, the proposed method only utilizes LiDAR data for cross-modal supervision during training while maintaining full radar-only operation during inference. Extensive evaluation on the challenging Dual-Radar dataset, which is characterized by elevated noise level, demonstrates the effectiveness of our framework in enhancing detection robustness. Comprehensive experiments validate that CORENet achieves superior performance compared to existing mainstream approaches. The code is available at https://github.com/charlesuv/corenet.git.

Abstract:
Distributed task allocation in UAV swarms is sensitive to excessive communication overhead and frequent transmissions. Combining reinforcement learning with task allocation shows great potential for enhancing algorithm performance and optimizing communication. However, existing studies rely on ideal communication assumptions and nonphysical environments, making training and validation impractical for real networked swarms. This paper proposes Communication-Aware Task Allocation, which trains a gating mechanism policy to coordinate transmission timing, improving the robustness and timeliness of task allocation. First, the policy learning problem is formalized as a POMDP, in which channel access and other features are designed as observations, actions are inter-agent adaptive gating mechanisms, and the shared reward reflects global task conflicts. Second, to address asynchronous learning under CTDE, an asynchronous experience collection and splicing method is proposed to align trajectories. Then, MOCPPO is proposed, which combines a primal-dual operator with proximal policy optimization, updating the Lagrange multiplier and policy parameters to simultaneously minimize task conflicts and communication overhead. Finally, sim-to-real experiments are conducted in the HIL environment, and the results show that the proposed method achieves the best trade-off among the compared state-of-the-art approaches.

Abstract:
As one of the most fundamental control modes in robotics, force/torque (F/T) control plays an essential role in a wide range of applications. However, classical F/T control fails to offer effective means to regulate the convergence sequence of the controlled states, which is beneficial in many real-world tasks, e.g., unknown surface contact, where the force should preferably converge later than the alignment angles to ensure sufficient contact and avoid dangerous misalignment. In this work, a novel nested fast terminal sliding mode control approach is proposed. This approach establishes a hierarchical structure for the controlled states, such that the Lyapunov stabilities of controlled states can be achieved in both a sequential and a time-synchronized manner within finite time, which is named as ‘Sequen-Sync’. Extensive experiments are conducted for various tasks in two different environments. The experimental results show that the proposed approach successfully achieves Sequen-Sync stability, which leads to improved contact quality and enhanced safety.

Abstract:
To address the challenges of localization drift and perception-planning coupling in unmanned aerial vehicles (UAVs) operating in open-top scenarios (e.g., collapsed buildings, roofless mazes), this paper proposes EAROL, a novel framework with a downward-mounted tilted LiDAR configuration (20° inclination), integrating a LiDAR-Inertial Odometry (LIO) system and a hierarchical trajectory-yaw optimization algorithm. The hardware innovation enables constraint enhancement via dense ground point cloud acquisition and forward environmental awareness for dynamic obstacle detection. A tightly-coupled LIO system, empowered by an Iterative Error-State Kalman Filter (IESKF) with dynamic motion compensation, achieves high 6-DoF localization accuracy in feature-sparse environments. The environment-augmented planner balances environmental exploration, target tracking precision, and energy efficiency. Physical experiments demonstrate 81% tracking error reduction, 22% improvement in perceptual coverage, and near-zero vertical drift across indoor maze and 60-meter-scale outdoor scenarios. This work proposes a hardware-algorithm co-design paradigm, offering a robust solution for UAV autonomy in post-disaster search and rescue missions. We will release our software and hardware as an open-source package for the community. Video: https://youtu.be/7av2ueLSiYw.

Abstract:
Repetitive motion control of redundant manipulators typically requires precise kinematic models to construct Jacobian matrices. However, model-based approaches are inherently limited when manipulator parameters are unavailable or only partially known. This paper introduces a novel data-driven discrete zeroing neurodynamics (DDZN) model for repetitive motion control. Specifically, a data-driven Jacobian matrix estimation method is proposed, which eliminates the need for prior models by leveraging historical input-output information. By integrating the Jacobian matrix estimation with a discrete zeroing neurodynamics (DZN) model, the approach enables simultaneous trajectory tracking and repeatable configuration recovery without relying on structural parameters. Theoretical analysis verifies the performance of the DDZN model in noisy environments. Furthermore, extensive experimental results validate its reliability and superior performance compared with various models.

Abstract:
Over the past decade, there has been a remarkable surge in the use of quadrotors for various purposes due to their simple structure and aggressive maneuverability. One of the key challenges is online time-optimal trajectory generation and control. This paper proposes an imitation learning-based online solution to efficiently navigate the quadrotor through multiple waypoints with near-time-optimal performance. The neural networks (WN&CNets) are trained to learn the control law from a dataset generated by the time-consuming CPC algorithm and then deployed to generate the optimal control commands online to guide the quadrotor. To address the challenge of limited training data and the hover maneuver at the final waypoint, we propose a transition phase strategy that utilizes MINCO trajectories to help the quadrotor ‘jump over’ the stop-and-go maneuver when switching waypoints. Our method is demonstrated in both simulation and real-world experiments, achieving a maximum speed of 5.6 m/s while navigating through 7 waypoints in a confined space of 5.5 m × 5.5 m × 2.0 m [video]. The results show that, with a slight loss in optimality, the WN&CNets significantly reduce the processing time and enable online control for multi-point flight tasks.

Abstract:
Because non-destructive testing (NDT) techniques are both expensive and inconvenient in dynamic detection scenarios, innovative alternatives are urgently needed to address cost-efficiency and deployment challenges. To address this dilemma, we first design TacRoller, a tactile sensor roller for automated characterization of surface defects in composite materials. It collects tactile images of defects on the composite’s plies by using an internal camera to capture changes caused by deformation of the outer elastomer. It reduces the cost of inspection by 80% to 90% compared to NDT equipment such as radiographic testing while ensuring detection efficiency. It takes 58.86 seconds to inspect a 35 cm × 18 cm × 0.5 mm dry-woven fabric. Moreover, we collect a total of 2,744 images of samples of dry-woven fabric unidirectional prepreg through TacRoller to form a dataset, including wrinkles, foreign objects and debris (FODs), broken fibre, voids and healthy textures. Subsequently, we propose a multi-order gated aggregation (MOGA)-U-Net to tackle critical challenges of noise sensitivity and multi-scale defect recognition in tactile images, enabling robust segmentation and multi-category classification. The results show that the MOGA-U-Net achieves a test Dice coefficient of 76.0% and a classification accuracy of 98.9%, outperforming DeepLabV3 and other benchmarks. By providing a scalable and effective NDT substitute, our system realises autonomous defect identification and classification on composite surfaces, thus improving quality control in the production of composites.
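The Dice coefficient reported for segmentation measures the overlap between a predicted mask and a ground-truth mask as 2|A∩B| / (|A| + |B|). A minimal sketch on flattened binary masks follows; the masks are invented, and returning 1.0 for two empty masks is a convention chosen for this example.

```python
def dice_coefficient(pred, target):
    """Dice overlap between two flattened binary masks (lists of 0/1)."""
    intersection = sum(p & t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Convention assumed here: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total

pred_mask = [1, 1, 0, 0, 1, 0]
true_mask = [1, 0, 0, 0, 1, 1]
score = dice_coefficient(pred_mask, true_mask)  # 2 * 2 / (3 + 3)
```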

Abstract:
Physical parameters of the intracellular environment such as mass density, intracellular pressure and elasticity have significant effects on the physiological activities of the cell and on the results of intracellular operations. However, these parameters are measured using significantly different principles, which makes in situ measurement of all of them on the same cell challenging and significantly limits the study of how they jointly regulate cell physiological activities and intracellular operation results. In this paper, a robotic in situ measurement system for multiple intracellular physical parameters is proposed for the first time, based on a self-developed three-micropipette system. Using this system, the mass density, elasticity and intracellular pressure of the same cell are measured automatically in sequence, according to a robotic in situ measurement process. Experimental results on sheep oocytes demonstrate an 83.3% measurement success rate at an average speed of 97.75 s/cell. The measured values of the three parameters are close to those reported for individual measurements, while the operation time is significantly shorter than the combined time of the individual methods reported in the literature. Our system lays a solid foundation for future research on the comprehensive regulation mechanism of these parameters on cell physiological activities and intracellular operation results.

Abstract:
The suspension system, through effective damping of vibrations and shocks, can enhance the stability of wheeled robots traversing challenging terrain. Because the suspension system decouples the rigid correspondence between terrain changes and robot vibrations, considering suspension modeling in trajectory planning offers the advantage of more accurate prediction of the robot’s response to terrain. This improved predictive capability facilitates the planning of safer trajectories and may reduce tracking errors in the subsequent control process. In this work, inspired by the structure of Physics-Informed Neural Network (PINN), we propose a physics-informed planning method that considers the vibrational effects of complex nonlinear suspension systems. In addition, we design a two-stage process to accelerate training. By incorporating PINN, our method can better guarantee the physical feasibility of the planned trajectories. The proposed approach has been evaluated on a real robot platform. Compared to state-of-the-art baseline methods, our proposed approach achieves a 15.38% reduction in hazardous planning for mobile robots in wild environments.

Abstract:
In this paper, we investigate a multi-robot task allocation problem where a team of heterogeneous robots operates in a discrete workspace to achieve a set of tasks expressed by linear temporal logic formulas. In contrast to existing works, we further consider inter-task-time-order constraints, which are imposed on the start or end times of each task. Solving such problems generally requires combinatorial search, which is not scalable. Inspired by the efficiency of max-plus algebra in handling time constraints, we propose a novel approach called MaxAuc, which integrates Auction-based task allocation with Max-plus algebra in a novel manner. Specifically, max-plus computations are performed to approximate task priorities in the auction without explicitly solving the constraint optimization problem. Our numerical results demonstrate that MaxAuc is highly scalable with respect to both the number of robots and the number of tasks, while maintaining a tolerable performance trade-off compared to the baseline’s optimal yet exhaustive solution.
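Max-plus algebra replaces addition with max and multiplication with addition, which makes precedence-style timing constraints linear. The sketch below computes earliest task start times by repeatedly applying a max-plus matrix-vector product. The constraint matrix is a made-up example, and this is only the basic algebraic operation, not the MaxAuc priority approximation itself.

```python
NEG_INF = float("-inf")  # max-plus "zero": no constraint between two tasks

def maxplus_matvec(a, x):
    """Max-plus product: result[i] = max_j (a[i][j] + x[j])."""
    return [max(a_ij + x_j for a_ij, x_j in zip(row, x)) for row in a]

# a[i][j] = minimum delay between the start of task j and the start of task i.
a = [
    [0.0, NEG_INF, NEG_INF],  # task 0 has no predecessor
    [3.0, 0.0, NEG_INF],      # task 1 starts at least 3 after task 0
    [5.0, 4.0, 0.0],          # task 2: at least 5 after task 0, 4 after task 1
]

start = [0.0, 0.0, 0.0]
for _ in range(len(a)):  # n iterations reach the fixed point for acyclic constraints
    start = maxplus_matvec(a, start)
```

Task 2 ends up bound by the chain through task 1 (3 + 4) rather than by its direct constraint on task 0 (5).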

Abstract:
In complex multi-robot application scenarios, particularly in dynamically adversarial, hazardous, or disaster environments, traditional cooperation paradigms face significant challenges due to unreliable or absent communication links. Achieving efficient cooperation in the absence of communication has become a key bottleneck limiting the performance of multi-robot systems. In this paper, we propose a Global Intent Prediction and Decomposition (GIPD) framework that enables robots to perform cooperative behavior without relying on communication. Each robot independently infers a globally consistent intent based solely on its local observations, ensuring implicit alignment across the system. Given the inferred global intent, robots autonomously determine their responsibilities and select the most appropriate tasks. They then base their local decision-making on the global intent, selected tasks, and individual observations, thereby facilitating effective execution and cooperation. We validate our approach using the MPE and SMAC benchmarks. Additionally, real-world experiments involving multiple ships demonstrate the effectiveness and practical applicability of the proposed GIPD method.

Abstract:
This paper studies the impact of finite word-length effects on the Markov parameters of tensegrity robots during digital simulations. First, the round-off noise models are introduced, where round-off noise is applied to the system’s inputs, outputs, and states. The deterministic and stochastic definitions of the Markov parameters are then presented. It is proven that stochastic Markov parameters remain invariant under finite word-length effects in linear time-invariant (LTI) systems, regardless of the round-off noise in inputs, outputs, or states. The nonlinear tensegrity dynamics and a linearization approach are introduced, with a tensegrity morphing airfoil studied as an illustrative example. The results indicate that, using twisted input and output signals, the Markov parameters converge correctly in white noise experiments, which supports the theoretical findings. The proposed approach allows accurate Markov parameter generation via simulation tests, and the resulting parameters can be further used for model reduction or linearization of tensegrity robots to eliminate the distortions caused by round-off errors.
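For a discrete-time LTI system x[k+1] = A x[k] + B u[k], y[k] = C x[k] + D u[k], the deterministic Markov parameters are the impulse-response terms D, CB, CAB, CA²B, and so on. The sketch below computes them for a small made-up system; it illustrates only this textbook definition, not the stochastic definition or the round-off analysis in the paper.

```python
def matmul(a, b):
    """Multiply two matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def markov_parameters(a, b, c, d, count):
    """Return the first `count` Markov parameters [D, CB, CAB, CA^2B, ...]."""
    params = [d]
    a_power_b = b  # holds A^(k-1) B for the current k
    for _ in range(count - 1):
        params.append(matmul(c, a_power_b))
        a_power_b = matmul(a, a_power_b)
    return params

# Hypothetical two-state, single-input, single-output system.
a = [[0.5, 0.0], [0.0, 0.25]]
b = [[1.0], [1.0]]
c = [[1.0, 2.0]]
d = [[0.0]]
h = markov_parameters(a, b, c, d, count=4)
impulse_response = [m[0][0] for m in h]
```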

Abstract:
Autonomously navigating social robots need to track pedestrian movements in real time with high precision to optimize path planning and avoid collisions. However, the main challenge of pedestrian tracking lies in the significant variations in human posture, which distinguish pedestrians from rigid-body objects such as vehicles. In this paper, we propose an Efficient LiDAR-based 3D Pedestrian Tracking Network (ELPTNet). First, ELPTNet employs a 3D object detector to extract directional 3D pedestrian bounding boxes from LiDAR point clouds. Then, it employs a Constant Acceleration (CA) model and prediction confidence for target trajectory prediction. During the data association process, it integrates geometric, appearance, and motion features to enhance the robustness and real-time performance of 3D multi-object tracking (MOT) when targets are temporarily occluded. Experimental results demonstrate that ELPTNet achieves the highest ranking on the large-scale JRDB dataset for the 3D tracking task, outperforming previous state-of-the-art (SOTA) methods with improvements of 8.4% in MOTA and 6.6% in HOTA. Additionally, ELPTNet attains an inference speed of 61 frames per second (FPS) on a single CPU. Therefore, our method enables accurate and real-time tracking of multiple pedestrians. The code is publicly available at https://github.com/jinzhengguang/ELPTNet.
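A constant acceleration motion model predicts the next state from position, velocity and acceleration using basic kinematics. The sketch below shows one prediction step for a 2D ground-plane state; the function name and the numbers are illustrative assumptions, and the confidence weighting mentioned in the abstract is not modeled here.

```python
def predict_constant_acceleration(position, velocity, acceleration, dt):
    """One CA prediction step: p' = p + v*dt + 0.5*a*dt^2, v' = v + a*dt."""
    new_position = [p + v * dt + 0.5 * a * dt * dt
                    for p, v, a in zip(position, velocity, acceleration)]
    new_velocity = [v + a * dt for v, a in zip(velocity, acceleration)]
    return new_position, new_velocity

# Hypothetical pedestrian state on the ground plane (metres, m/s, m/s^2).
pos, vel = predict_constant_acceleration(
    position=[1.0, 2.0], velocity=[0.5, 0.0], acceleration=[0.2, -0.4], dt=0.5)
```

In a tracker, the predicted position would typically be compared against new detections during data association.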

Abstract:
Elephants cannot intentionally control individual trunk muscles but instead generate coordinated movements. This coordination, known as kinematic synergy, allows muscles to work together, reducing the controlled degrees of freedom (DOF) while enhancing motion efficiency. In this paper, we develop an elephant trunk robot that generates coordinated motions with only a few actuators inspired by kinematic synergy. First, we analyze elephant trunk motion using video data and extract the principal components that closely approximate actual movements. Next, we design a robot based on an underactuated tendon-driven transmission to reproduce the selected principal components and a control system to reduce tracking errors. In particular, we focus on designing tendons to confine the motion within the space spanned by the principal components. Finally, we implement the control system and evaluate the robot’s performance in replicating trunk motion with a limited number of actuators. By generating coordinated motion, we show the effectiveness of the transmission system design method and the improvement in motion accuracy provided by the control system.

Abstract:
Traditional data-driven control methods often require large amounts of training data, posing significant challenges for continuum robots. Recently, neural ordinary differential equation (NODE) methods have demonstrated impressive capabilities for data-efficient modeling of continuum robots. However, existing NODE-based control methods still face limitations in terms of convergence and robustness. In this paper, we propose a data-driven iterative learning control system for continuum robots, leveraging NODE for modeling. Within this framework, by incorporating online parameter learning, the proposed control system continuously adapts to various uncertainties associated with continuum robots, resulting in improved convergence and robustness in repetitive tasks. The effectiveness of the proposed method is validated through simulations and physical experiments, and comparative analysis highlights its superior accuracy over existing approaches.

Abstract:
Deep learning models have significantly advanced dexterous manipulation techniques for multi-fingered hand grasping. However, contact information-guided grasping in cluttered environments remains largely underexplored. To address this gap, we have developed ContactDexNet, a method for generating multi-fingered hand grasp samples in cluttered settings through a contact semantic map. We introduce a contact semantic conditional variational autoencoder network (CoSe-CVAE) for creating a comprehensive contact semantic map from an object point cloud. We utilize a grasp detection method to estimate hand grasp poses from the contact semantic map. Finally, a unified grasp evaluation model, PointNetGPD++, is designed to assess grasp quality and collision probability, substantially improving the reliability of identifying optimal grasps in cluttered scenarios. Our grasp generation method outperforms state-of-the-art (SOTA) methods by at least 4.7%, with an 81.0% average grasping success rate in real-world single-object grasping using a known hand, and by at least 9.0% when using an unknown hand. Moreover, in cluttered scenes, our method attains a 76.7% success rate, outperforming the SOTA method by 6.3%. We also propose a multi-modal multi-fingered grasping dataset generation method. Our multi-fingered hand grasping dataset outperforms previous datasets in scene diversity and modality diversity. More details and supplementary materials can be found at https://sites.google.com/view/contact-dexnet.

Abstract:
Attention serves as a critical antecedent to social presence, which fundamentally influences acceptance, trust, and overall interaction quality in human-robot interaction (HRI). This paper investigates the development of a gaze modulation framework that enables robots to strategically influence human attention through two complementary Q-learning-based modules: Gaze-Garnering Modulation (GGM) and Gaze-Avoidance Modulation (GAM). To measure gaze feedback, we introduce a novel metric—the Dynamic Gaze Engagement Index (DGEI)—that integrates attention ratio with stationary gaze entropy (SGE) to evaluate not just the quantity but also the quality of visual attention. This feedback allows the system to continuously adapt to each individual’s unique attentional patterns and thresholds, providing personalised interaction. In two experiments, 20 participants interacted with a Pepper robot that dynamically adjusted its behaviours (lights, movements, and voice volume) based on real-time gaze feedback. Results demonstrated that GGM significantly enhanced gaze engagement, fostering strong mutual interaction, while GAM effectively redirected attention when appropriate, with participants reporting lower perceived gaze engagement in this condition. Post-experiment questionnaires using the "Psycho-behavioural Interaction - Perceived Attentional Engagement" section of the Networked Minds Social Presence Inventory (NMSPI) revealed significant differences between conditions (t(18)=2.47, p=0.0238), validating the attention modulation by each module and corroborating the behavioural observations. These findings underscore the importance of adaptive robotic behaviours in facilitating dynamic and unobtrusive interactions.
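
The two ingredients named for the DGEI can be sketched separately. The snippet below assumes that stationary gaze entropy is the Shannon entropy of the distribution of gaze samples over discrete regions, and that the attention ratio is the fraction of samples directed at the robot; both function names are hypothetical. The abstract does not state how the two quantities are combined into the DGEI, so no combination is shown.

```python
import math
from collections import Counter


def stationary_gaze_entropy(gaze_regions):
    """Shannon entropy (in bits) of how gaze samples are spread over regions."""
    if not gaze_regions:
        raise ValueError("need at least one gaze sample")
    counts = Counter(gaze_regions)
    total = sum(counts.values())
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def attention_ratio(on_robot_flags):
    """Fraction of gaze samples that fall on the robot."""
    if not on_robot_flags:
        raise ValueError("need at least one gaze sample")
    return sum(1 for flag in on_robot_flags if flag) / len(on_robot_flags)
```

Under this reading, gaze that is spread evenly over two regions gives an entropy of 1 bit, while gaze fixed on a single region gives 0.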

Abstract:
Traditional robot navigation systems primarily utilize occupancy grid maps and laser-based sensing technologies, as demonstrated by the popular move_base package in ROS. Unlike robots, humans navigate not only through spatial awareness and physical distances but also by integrating external information, such as elevator maintenance updates from public notification boards and experiential knowledge, like the need for special access through certain doors. With the development of Large Language Models (LLMs), which possess text understanding and intelligence close to human performance, there is now an opportunity to infuse robot navigation systems with a level of understanding akin to human cognition. In this study, we propose using osmAG (Area Graph in OpenStreetMap textual format), an innovative semantic topometric hierarchical map representation, to bridge the gap between the capabilities of ROS move_base and the contextual understanding offered by LLMs. Our methodology employs LLMs as an actual copilot in robot navigation, enabling the integration of a broader range of informational inputs while maintaining the robustness of traditional robotic navigation systems. Our code, demo, map, and experiment results can be accessed at https://github.com/xiexiexiaoxiexie/Intelligent-LiDAR-Navigation-LLM-as-Copilot.

Abstract:
In recent years, the demand for social robots has grown, requiring them to adapt their behaviors based on users’ states. Accurately assessing user experience (UX) in human-robot interaction (HRI) is crucial for achieving this adaptability. UX is a multi-faceted measure encompassing aspects such as sentiment and engagement, yet existing methods often focus on these individually. This study proposes a UX estimation method for HRI by leveraging multimodal social signals. We construct a UX dataset and develop a Transformer-based model that utilizes facial expressions and voice for estimation. Unlike conventional models that rely on momentary observations, our approach captures both short- and long-term interaction patterns using a multi-instance learning framework. This enables the model to capture temporal dynamics in UX, providing a more holistic representation. Experimental results demonstrate that our method outperforms third-party human evaluators in UX estimation.

Abstract:
Wheeled-legged robots offer significant mobility advantages, yet their control is complicated by the coupled dynamics of the wheel and leg systems. To address this challenge, we propose a whole-body control framework built upon a decoupled architecture. In this structure, a two-wheeled inverted pendulum (TWIP) template exclusively manages wheel motion, freeing the whole-body controller to focus solely on the leg dynamics. To validate the generality of our approach, we conducted extensive simulations across various robot configurations, including both closed-loop and open-loop leg structures. The results demonstrate the robot’s ability to maintain stability across several challenging scenarios: a high-speed (5 m/s) slalom on flat ground, a low-speed (0.5 m/s) slalom on terrain with 10 cm height variations, and immediate stabilization after a 2 m free-fall. These findings highlight the potential of decoupled control as a promising direction for developing more agile and resilient robotic systems.

Abstract:
In robot-assisted minimally invasive surgery, optimal camera positioning is crucial for effective visualization and manipulation of tissues, which impacts the success of procedures. Traditional camera control can increase cognitive workload, lead to suboptimal camera viewpoints, and complicate surgical tasks. We propose an autonomous camera system that uses situational awareness and real-time prediction of user intent from kinematic data, and adjusts the camera position dynamically during a simulated suturing task. Such a system reduces the need to manually adjust the camera, allowing users to stay focused on the procedure. We demonstrated the framework in a user study with eight non-expert participants. They used the da Vinci Research Kit to control a simulated camera and instruments in a suturing task. We compared the performances in the suturing task with the autonomous and the teleoperated camera control. The autonomous system reduced execution time by 43%, shortened path length by 30%, and decreased completion cost by 35%. These results serve as a proof of concept for a situationally aware camera system and suggest that autonomous camera control can improve efficiency and simplify surgical workflow.

Abstract:
Soft robotics leverages highly elastic, deformable materials to enable sensitive human-centered applications in medicine, rehabilitation, and assistance, as well as industrial applications like robotic grasping and load-bearing. One particular type of soft robotic actuator - the pneumatically-actuated, soft robotic, telescopic structure (PASTS) - is a relatively new concept that utilises compliant material and geometry reminiscent of traditional telescopes to produce linear motion and force exertion. Previous works on telescopic soft actuators have focused on specific applications rather than fundamental mechanics, creating a clear knowledge gap in understanding their design, dynamics, and dependencies. This paper provides an in-depth study of the fundamental design parameters of soft telescopic actuators, and examines the impacts of certain critical dimensions and geometries on the physical behaviour and capabilities of these soft actuators. It explores the influence of the telescopic structure’s length, wall thickness, and number of rings on its inflation and deflation behaviour, lifting and lowering speed, and motion smoothness. By experimentally verifying these relationships using several design variations, it demonstrates that telescopic structures are capable of repeatable linear extension within 0.85 mm of precision. It also determines the varying degrees by which each critical parameter affects the desirable properties of the actuator, allowing for telescopic structures to be tailored and optimised for specific applications.

Abstract:
Image stitching in heavy occlusion scenarios faces the dual challenges of accurate alignment and occlusion removal. On one hand, occlusion causes the loss of key texture and structural information in the image. On the other hand, it affects the image’s integrity. Existing stitching methods perform well in cases with small occlusion coverage, but they often fail in heavy occlusion. This failure is mainly due to three reasons: 1) they cannot identify occluded regions, 2) they cannot suppress interference from the occluded regions, 3) they cannot remove the occluded regions. To address these issues, we propose an unsupervised deep image stitching and de-occlusion method. First, to solve the issue of occluded region identification, we design an Occlusion-Aware Feature Weighted module (OAFW) that explicitly distinguishes between occluded and non-occluded regions by learning the occlusion masks of the images. Second, to address the issue of interference from occlusion, we use the learned occlusion masks to filter out features from the occluded regions. To further suppress the impact of occlusion-induced errors, we design a Mask-Guided Dual-Granularity Alignment loss function (MGDGA) that only calculates alignment errors for non-occluded regions, effectively reducing occlusion error interference during network training. Finally, to resolve the content gap in the occluded regions, we replace the pixels in the occluded areas with those from the aligned overlapping regions and incorporate a Progressive Content Inpainting module (PCI) to recover the missing content in the non-overlapping regions caused by occlusion, ultimately achieving a complete and natural de-occlusion stitched image. Experimental results show that our method improves the mean squared error metric by 17.45% compared to the state-of-the-art stitching method.
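
The masking idea behind an alignment loss that "only calculates alignment errors for non-occluded regions" can be sketched as a masked mean squared error. This is an illustrative simplification: the function name is hypothetical, images are plain nested lists of grayscale values, the mask convention (1 = occluded) is an assumption, and the dual-granularity aspect of MGDGA is not reproduced.

```python
def masked_alignment_loss(warped, reference, occlusion_mask):
    """Mean squared error over pixels whose mask value is 0 (assumed non-occluded)."""
    total = 0.0
    count = 0
    for row_w, row_r, row_m in zip(warped, reference, occlusion_mask):
        for w, r, m in zip(row_w, row_r, row_m):
            if m == 0:  # skip pixels flagged as occluded
                total += (w - r) ** 2
                count += 1
    # With no visible pixels there is nothing to align, so report zero error.
    return total / count if count else 0.0


warped = [[1.0, 5.0], [2.0, 2.0]]
reference = [[1.0, 0.0], [0.0, 2.0]]
mask = [[0, 1], [0, 0]]  # the top-right pixel is treated as occluded
loss = masked_alignment_loss(warped, reference, mask)
```

Here the large mismatch at the occluded pixel does not contribute, so the loss reflects only the three visible pixels.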

Abstract:
The development of bionic hands is a crucial area of robotics research, aiming to achieve human-like dexterity, adaptability, and efficiency in manipulation tasks. This study presents the Skeleton Bionic Hand (SkB-Hand), which features a dual elastic-tendon mechanism for the extensor and volar plate in a single finger. The skeleton structure ensures the SkB-Hand is lightweight (< 600 g) and low-cost (< 150 USD) while maintaining performance in various grasping tasks. We evaluated the SkB-Hand through experiments such as the Kapandji test, GRASP Taxonomy, and dynamic scenarios. Results showed the SkB-Hand scored 7/10 in the Kapandji test and 33/33 in the GRASP Taxonomy test. The hand demonstrated resistance to deformation under external forces while ensuring flexibility. Integrated into the Techman Robot (TM-Robot), the SkB-Hand demonstrates its potential for general robotic grasping tasks in daily life.

Abstract:
This paper presents an algorithm to improve state estimation for legged robots. Among existing model-based state estimation methods for legged robots, the contact-aided invariant extended Kalman filter defines the state on a Lie group to preserve invariance, thereby significantly accelerating convergence. It achieves more accurate state estimation by leveraging contact information as measurements for the update step. However, when the model exhibits strong nonlinearity, the estimation accuracy decreases. Such nonlinearities can cause initial errors to accumulate and lead to large drifts over time. To address this issue, we propose compensating for errors by augmenting the Kalman filter with an artificial neural network serving as a nonlinear function approximator. Furthermore, we design this neural network to respect the Lie group structure to ensure invariance, resulting in our proposed Invariant Neural-Augmented Kalman Filter (InNKF). The proposed algorithm offers improved state estimation performance by combining the strengths of model-based and learning-based approaches. Project webpage: https://seokju-lee.github.io/innkf_webpage

Abstract:
Ground and aerial robots, with distinct sensing perspectives, acquire heterogeneous point clouds that exhibit limited overlap, presenting significant challenges for collaborative mapping. To address these challenges, this article proposes a robust LiDAR-based aerial-ground collaborative mapping framework for large-scale outdoor environments. Firstly, to perform reliable cross-source place recognition and detect loop closure between aerial-ground robots, a deep network that fuses multi-level bird’s-eye view (BEV) and geometric features is developed to ensure consistent feature extraction and emphasize overlaps between heterogeneous point clouds. Next, an overlap-aware registration method is proposed to align point clouds within a detected loop closure. This method can strategically perform point cloud sparsification based on overlap ratio estimation, and mitigate the adverse effects of interfering points in non-overlapping regions. Furthermore, a graph optimization is implemented to consider all loop closure constraints simultaneously and ensure global map consistency. Comparative experiments on public and self-collected datasets demonstrate the superiority of the proposed approach. We open-source our code on GitHub to benefit the community.

Abstract:
This paper presents a novel approach for robot navigation in environments containing deformable obstacles. By integrating Learning from Demonstration (LfD) with Dynamical Systems (DS), we enable adaptive and efficient navigation in complex environments where obstacles consist of both soft and hard regions. We introduce a dynamic modulation matrix within the DS framework, allowing the system to distinguish between traversable soft regions and impassable hard areas in real-time, ensuring safe and flexible trajectory planning. We validate our method through extensive simulations and robot experiments, demonstrating its ability to navigate deformable environments. Additionally, the approach provides control over both trajectory and velocity when interacting with deformable objects, including at intersections, while maintaining adherence to the original DS trajectory and dynamically adapting to obstacles for smooth and reliable navigation.

Abstract:
Large Language Models (LLMs) are now transforming the way robots learn to work in unpredictable environments, such as homes or small enterprises. A growing number of approaches are combining LLMs with Behavior Trees (BTs). Not only do user commands need to be interpreted into BTs that contain the task’s goal, but external disturbances also need to be handled during the process when BT planners dynamically expand BTs based on action databases. However, in these approaches, the action database is manually pre-built and requires the capability for incremental learning and expansion. To address this issue, we propose a human-in-the-loop learning mechanism. First, we design a context for the LLM and then use it to generate action knowledge through in-context learning. In addition, we keep a human in the loop: user feedback is used to guide the LLM to correct and refine the action knowledge, ensuring its accuracy and safety. Finally, the generated action knowledge can be directly used for adaptive manipulation without the need for knowledge transfer effort, enabling the robot to complete tasks and handle external disturbances. Experiments are conducted across various tasks, and the results validate our method.

Abstract:
Differential-driven wheeled robots (DWR) represent the quintessential type of mobile robots and find extensive applications across the robotic field. Most high-performance control approaches for DWR explicitly utilize the linear and angular velocities of the trajectory as control references. However, existing research on time-optimal path parameterization (TOPP) for mobile robots usually neglects the angular velocity and joint velocity constraints, which can result in degraded control performance in practical applications. In this article, a systematic and practical TOPP algorithm named TOPP-DWR is proposed for DWR and other mobile robots. First, the non-uniform B-spline is adopted to represent the initial trajectory in the task space. Second, the piecewise-constant angular velocity, as well as joint velocity, linear velocity, and linear acceleration constraints, are incorporated into the TOPP problem. During the construction of the optimization problem, the aforementioned constraints are uniformly represented as linear velocity constraints. To boost the numerical computational efficiency, we introduce a slack variable to reformulate the problem into second-order-cone programming (SOCP). Subsequently, comparative experiments are conducted to validate the superiority of the proposed method. Quantitative performance indexes show that TOPP-DWR achieves TOPP while adhering to all constraints. Finally, field autonomous navigation experiments are carried out to validate the practicability of TOPP-DWR in real-world applications.
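
To illustrate how angular velocity and joint (wheel) velocity limits can be expressed as limits on linear velocity, the sketch below uses standard differential-drive kinematics along a path of known curvature. It is a hedged illustration rather than the TOPP-DWR formulation: the function name, parameter names, and the assumption of forward motion (v >= 0) are all introduced here. With angular velocity omega = kappa * v, the faster wheel turns at v * (1 + |kappa| * b / 2) / r for track width b and wheel radius r.

```python
def linear_velocity_bound(curvature, v_max, omega_max, wheel_speed_max,
                          wheel_radius, track_width):
    """Upper bound on forward speed v at a path point with the given curvature.

    All limits are magnitudes: v_max [m/s], omega_max [rad/s],
    wheel_speed_max [rad/s] (per-wheel joint velocity limit).
    """
    kappa = abs(curvature)
    bounds = [v_max]
    if kappa > 0.0:
        # |omega| = kappa * v <= omega_max
        bounds.append(omega_max / kappa)
    # Outer wheel: (v + kappa * v * b / 2) / r <= wheel_speed_max
    bounds.append(wheel_radius * wheel_speed_max / (1.0 + kappa * track_width / 2.0))
    return min(bounds)


# Straight segment versus a tight turn (curvature 4 1/m), same robot limits.
v_straight = linear_velocity_bound(0.0, 1.0, 2.0, 20.0, 0.1, 0.5)
v_turn = linear_velocity_bound(4.0, 1.0, 2.0, 20.0, 0.1, 0.5)
```

Evaluating such a bound at every path sample would yield a velocity ceiling that a time-optimal parameterization could then respect alongside acceleration constraints.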

Abstract:
This paper presents a novel Expressive Facial Control (ExFace) method based on Diffusion Transformers, which achieves precise mapping from human facial blendshapes to bionic robot motor control. By incorporating an innovative model bootstrap training strategy, our approach not only generates high-quality facial expressions but also significantly improves accuracy and smoothness. Experimental results demonstrate that the proposed method outperforms previous methods in terms of accuracy, frames per second (FPS), and response time. Furthermore, we develop the ExFace dataset driven by human facial data. ExFace shows excellent real-time performance and natural expression rendering in applications such as robot performances and human-robot interactions, offering a new solution for bionic robot interaction.

Abstract:
Proprioceptive sensing plays a crucial role in robotics, enabling closed-loop control approaches that are essential for autonomous applications. In soft robotics, the development and integration of sensors is even more challenging due to the compliant nature of soft bodies. Moreover, underwater environments pose additional difficulties, as sensors need to be properly embedded and sealed into the soft body of the robot. The novelty of this work lies in benchmarking different sensing technologies on a continuum soft robot to systematically assess their suitability as effective sensing approaches, both in air and underwater environments. This work presents two proprioceptive sensors (an FBG optical sensor and an IMU system) embedded in an octopus-inspired robotic arm and tested using our proposed experimental protocol. The results underscore the system’s ability to reliably and repeatably capture data and provide a valuable guideline for the community to adopt in order to test novel sensing modalities in soft robotics. These developments are pivotal in advancing the deployment of soft robotic systems in both above- and underwater settings, facilitating tasks ranging from infrastructure inspection to marine life studies.

Abstract:
Human motion retargeting for humanoid robots, which transfers human motion data to robots for imitation, presents significant challenges but offers considerable potential for real-world applications. Traditionally, this process relies on human demonstrations captured through pose estimation or motion capture systems. In this paper, we explore a text-driven approach to obtain imitation motion data more flexibly and simply. To address the inherent discrepancies between the generated motion representations and the kinematic constraints of humanoid robots, we propose an angle signal network based on norm-position and rotation loss (NPR Loss). It generates joint angles, which serve as inputs to a reinforcement learning based whole-body motion control policy. The policy ensures tracking of the generated motions while maintaining the robot’s stability during execution. Our experimental results demonstrate the efficacy of this approach, successfully transferring text-driven human motion to a real NAO humanoid robot.

Abstract:
Trajectory prediction is an essential component of the perception stack in autonomous mobile robots (AMRs). AMRs operate in complex environments where their movements are influenced by various environment elements, such as racks and storage locations. Therefore, accurate and efficient trajectory prediction for intralogistics requires detailed environment modeling that goes beyond the lane-based context mainly used in road traffic methods. We propose the addition of a new environment context encoder module that can be seamlessly integrated into state-of-the-art autonomous driving systems. Our approach, tailored to the specific challenges of intralogistics, achieves highly accurate predictions using compact and efficient baseline networks.

Affiliations: Robotics Institute, Carnegie Mellon University (CMU), Pittsburgh, USA; Automation and Robotics Laboratory, Universidade de Brasília, Brazil; Munich Institute of Robotics and Machine Intelligence (MIRMI), Technical University of Munich (TUM), Germany; Indian Institute of Technology (IIT), Kharagpur, India; Indian Institute of Science (IISc), Bangalore, India; Mohamed Bin Zayed University of Artificial Intelligence (MBZUAI), Abu Dhabi, UAE

Abstract:
Learning from Demonstration (LfD) techniques enable robots to learn and generalize tasks from user demonstrations, eliminating the need for coding expertise among end-users. One established technique to implement LfD in robots is to encode demonstrations in a stable Dynamical System (DS). However, finding a stable dynamical system entails solving an optimization problem with bilinear matrix inequality (BMI) constraints, a non-convex problem which, depending on the number of scalar constraints and variables, demands significant computational resources and is susceptible to numerical issues such as floating-point errors. To address these challenges, we propose a novel compositional approach that enhances the applicability and scalability of learning stable DSs with BMIs.

Abstract:
Deformable continuum robots (DCRs) present unique planning challenges due to nonlinear deformation mechanics and partial state observability, violating the Markov assumptions of conventional reinforcement learning (RL) methods. While Jacobian-based approaches offer theoretical foundations for rigid manipulators, their direct application to DCRs remains limited by time-varying kinematics and underactuated deformation dynamics. This paper proposes Jacobian Exploratory Dual-Phase RL (JEDP-RL), a framework that decomposes planning into phased Jacobian estimation and policy execution. During each training step, we first perform small-scale local exploratory actions to estimate the deformation Jacobian matrix, then augment the state representation with Jacobian features to restore approximate Markovianity. Extensive SOFA surgical dynamic simulations demonstrate JEDP-RL’s three key advantages over proximal policy optimization (PPO) baselines: 1) Convergence speed: 3.2× faster policy convergence, 2) Navigation efficiency: 25% fewer steps to reach the target, and 3) Generalization ability: a 92% success rate under material property variations and an 83% success rate (33% higher than PPO) in the unseen tissue environment.
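
The two phases described above (probe locally to estimate a Jacobian, then append Jacobian features to the state) can be sketched with finite differences. This is a minimal sketch under stated assumptions: the function names, the `LinearToy` stand-in environment, and the choice of one probe per action axis are hypothetical and are not taken from JEDP-RL.

```python
def estimate_jacobian(observe, apply_action, n_actions, step=1e-2):
    """Estimate d(observation)/d(action) with one small probe per action axis."""
    base = observe()
    columns = []
    for i in range(n_actions):
        delta = [0.0] * n_actions
        delta[i] = step
        apply_action(delta)                      # small exploratory action
        moved = observe()
        columns.append([(m - b) / step for m, b in zip(moved, base)])
        apply_action([-d for d in delta])        # undo the probe
    # Transpose so that rows index observation dimensions.
    return [list(row) for row in zip(*columns)]


def augment_state(state, jacobian):
    """Append the flattened Jacobian to the state vector."""
    return list(state) + [x for row in jacobian for x in row]


class LinearToy:
    """Stand-in environment whose observation is J_true times the actuation q."""

    def __init__(self, j_true):
        self.j_true = j_true
        self.q = [0.0] * len(j_true[0])

    def apply(self, dq):
        self.q = [q + d for q, d in zip(self.q, dq)]

    def observe(self):
        return [sum(a * q for a, q in zip(row, self.q)) for row in self.j_true]


toy = LinearToy([[1.0, 2.0], [0.0, -1.0]])
jac = estimate_jacobian(toy.observe, toy.apply, n_actions=2)
policy_input = augment_state(toy.observe(), jac)
```

For a real continuum robot the probes would not be exactly reversible and the estimate would be noisy, so a least-squares fit over several probes may be preferable.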

Abstract:
Tele-homecare has become a promising approach to meet the growing demand for elderly and disability care. In such a context, ensuring human-robot interaction safety during teleoperation poses a critical challenge. Existing teleoperation control approaches focus solely on the robot’s end-effector trajectory, failing to handle inevitable or even desirable contacts on other robot links. This paper proposes a teleoperated shared-control strategy to deal with this challenge. A lightweight exoskeleton is developed to teleoperate the robot and give force feedback to the operator. Additionally, an exoskeleton-based shared-control strategy is proposed to integrate operator commands with real-time proximity sensing information, allowing the robot to avoid collisions while executing tasks. To react to inevitable contact, the force feedback function is incorporated into the proposed strategy to enable the operator to experience intuitive contact. Comparative experiments and a demonstration are designed to evaluate the feasibility and reliability of the proposed strategy in a tele-homecare scenario. Compared to the traditional teleoperation strategy, the proposed method can greatly reduce the contact forces on the robot’s links, indicating the potential of the proposed strategy in advancing safety in tele-homecare systems.

Abstract:
For scene understanding in unstructured environments, accurate and uncertainty-aware metric-semantic mapping is required to enable informed action selection by autonomous systems. Existing mapping methods often suffer from overconfident semantic predictions and from sparse and noisy depth sensing, leading to inconsistent map representations. In this paper, we therefore introduce EvidMTL, a multitask learning framework that uses evidential heads for depth estimation and semantic segmentation, enabling uncertainty-aware inference from monocular RGB images. To enable uncertainty-calibrated evidential multi-task learning, we propose a novel evidential depth loss function that jointly optimizes the belief strength of the depth prediction in conjunction with evidential segmentation loss. Building on this, we present EvidKimera, an uncertainty-aware semantic surface mapping framework, which uses evidential depth and semantics prediction for improved 3D metric-semantic consistency. We train and evaluate EvidMTL on NYUDepthV2 and assess its zero-shot performance on ScanNetV2, demonstrating superior uncertainty estimation compared to conventional approaches while maintaining comparable depth estimation and semantic segmentation. In zero-shot mapping tests on ScanNetV2, EvidKimera outperforms Kimera by 30% in semantic surface mapping accuracy and consistency, highlighting the benefits of uncertainty-aware mapping and underscoring its potential for real-world robotic applications.

Affiliations: Center for Artificial Intelligence and Robotics, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Jianghuai Advanced Technology Center, Hefei, China; School of Advanced Manufacturing, Sun Yat-Sen University, Shenzhen, China; School of Electrical and Automation Engineering, Hefei University of Technology, Hefei, China; Department of Automation, Navigation and Control Research Center, Tsinghua University, Beijing, China

Abstract:
Catching flying objects with a cushioning process is a skill commonly performed by humans, yet it remains a significant challenge for robots. In this paper, we present a framework that combines optimization and learning to achieve compliant catching on mobile manipulators (CCMM). First, we propose a high-level capture planner for mobile manipulators (MM) that calculates the optimal capture point and joint configuration. Next, the pre-catching (PRC) planner ensures the robot reaches the target joint configuration as quickly as possible. To learn compliant catching strategies, we propose a network that leverages the strengths of LSTM for capturing temporal dependencies and positional encoding for spatial context (P-LSTM). This network is designed to effectively learn compliant strategies from human demonstrations. Following this, the post-catching (POC) planner tracks the compliant sequence output by the P-LSTM while avoiding potential collisions due to structural differences between humans and robots. We validate the CCMM framework through both simulated and real-world ball-catching scenarios, achieving a success rate of 98.70% in simulation, 92.59% in real-world tests, and a 28.7% reduction in impact torques. The open source code will be released for the reference of the community.

Abstract:
This paper introduces a systematic approach to identifying a physically feasible set of robot dynamics parameters. The framework consists of four steps: 1) Identification of robot dynamics parameters using least squares combined with a linear friction model. 2) Construction of a weighting matrix based on the least squares identification error, and performing weighted least squares identification combined with the linear friction model. 3) Introduction of a nonlinear friction model to fit joint friction. 4) Optimization of the remaining robot dynamics parameters to adhere to physical feasibility constraints. Various combinations of identification methods with linear or nonlinear friction models are analyzed experimentally, using a 6-DoF industrial robot and a 7-DoF collaborative robot, respectively, to demonstrate the effectiveness of the proposed identification framework. Experimental results affirm that the proposed method provides accurate estimates of the robot joint torques while maintaining the physical feasibility of the dynamics.
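
Steps 1 and 2 above can be sketched on a toy problem. The example assumes a 1-DoF joint with torque tau = I*qdd + fv*qd + fc*sign(qd) (viscous plus Coulomb friction as the linear friction model) and weights each sample by the inverse residual variance of its measurement group, which stands in for a per-joint weighting matrix. All names are hypothetical; steps 3 and 4 (nonlinear friction and physical feasibility constraints) are not covered.

```python
import math


def regressor_row(qdd, qd):
    """Regressor for the toy joint: parameters are [inertia, viscous, Coulomb]."""
    sign = math.copysign(1.0, qd) if qd != 0.0 else 0.0
    return [qdd, qd, sign]


def weighted_least_squares(rows, torques, weights):
    """Solve min sum_i w_i * (rows[i] . x - torques[i])^2 via the normal equations."""
    n = len(rows[0])
    m = [[0.0] * (n + 1) for _ in range(n)]      # augmented [A^T W A | A^T W b]
    for row, tau, w in zip(rows, torques, weights):
        for i in range(n):
            for j in range(n):
                m[i][j] += w * row[i] * row[j]
            m[i][n] += w * row[i] * tau
    for col in range(n):                          # Gauss-Jordan with partial pivoting
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def identify(rows, torques, group_ids):
    """Step 1: ordinary least squares. Step 2: reweight by residual variance per group."""
    ols = weighted_least_squares(rows, torques, [1.0] * len(rows))
    residuals = [tau - sum(a * x for a, x in zip(row, ols))
                 for row, tau in zip(rows, torques)]
    variance = {}
    for g in set(group_ids):
        errs = [e for e, gid in zip(residuals, group_ids) if gid == g]
        variance[g] = max(sum(e * e for e in errs) / len(errs), 1e-12)
    weights = [1.0 / variance[g] for g in group_ids]
    wls = weighted_least_squares(rows, torques, weights)
    return ols, wls
```

On a multi-joint robot the rows would come from the full dynamics regressor and the groups would correspond to joints, so noisier joints contribute less to the final estimate.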

Abstract:
Independent mobility is essential for visually impaired individuals, yet existing mobility aids such as white canes and guide dogs have limitations in accessibility and effectiveness. To address this, we propose an assistive robotic guidance system that replicates key functionalities of guide dogs. Our system operates without pre-built maps and dynamically recognizes road structures using LiDAR, allowing the robot to navigate based on the user’s orientation commands. The system consists of a modular architecture with separate guidance and mobility modules, ensuring adaptability across various environments and robotic platforms. Real-world experiments using both wheeled and quadrupedal robots demonstrate high accuracy in path recognition and effective navigation based on user-provided directional inputs. The results validate the feasibility of our approach in providing accessible and reliable mobility support for the visually impaired.

Abstract:
The human palm demonstrates spatial reconfigurability during the gripping process and forms a spherical grasping envelope. Based on these observations, this study designs a reconfigurable spherical palm that incorporates a spatial scissor mechanism, which only requires a single actuator to reshape the palm into a range of spherical forms. We conduct a kinematic analysis and modelling of the structure, abstracting three key parameters and analysing their influence on the motion characteristics of the palm. Through multi-objective optimisation, a set of dimensional parameters is derived to balance workspace, human-like motion, and mechanical performance. The performance of the reconfigurability and the grasping capability of the proposed palm is compared to a planar folding palm by superquadrics, and the results show that the spherical design and the reconfigurable characteristics provide larger grasping arrangement and stronger grasping capability of the palm on most of the testing surfaces.

Abstract:
Soft robots have become increasingly popular for complex manipulation tasks requiring gentle and safe contact. However, their softness makes accurate control challenging, and high-fidelity sensing is a prerequisite to adequate control performance. To this end, many flexible and embedded sensors have been created over the past decade, but they inevitably increase the robot’s complexity and stiffness. This study demonstrates a novel approach that uses simple bending strain gauges embedded inside a modular arm to extract complex information regarding its deformation and working conditions. The core idea is based on physical reservoir computing (PRC): A soft body’s rich nonlinear dynamic responses, captured by the inter-connected bending sensor network, could be utilized for complex multi-modal sensing with a simple linear regression algorithm. Our results show that the soft modular arm reservoir can accurately predict body posture (bending angle), estimate payload weight, determine payload orientation, and even differentiate two payloads with only minimal difference in weight — all using minimal digital computing power.
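
The linear-readout idea behind physical reservoir computing can be sketched as below. Everything here is an assumption made for illustration: the soft body is replaced by a toy simulation in which eight strain signals are saturating functions of a single bending angle, and the readout is a ridge-regularized linear regression. The function names, gains, offsets, and noise level are hypothetical and do not come from the paper.

```python
import numpy as np

def simulate_sensors(angle, noise_std=0.005, seed=0):
    """Toy stand-in for the soft body: 8 nonlinear strain signals per bending angle."""
    rng = np.random.default_rng(seed)
    gains = np.linspace(0.5, 2.0, 8)
    offsets = np.linspace(-0.5, 0.5, 8)
    states = np.tanh(np.outer(angle, gains) + offsets)
    return states + rng.normal(scale=noise_std, size=states.shape)

def _with_bias(states):
    """Append a constant column so the readout can learn an offset."""
    return np.hstack([states, np.ones((states.shape[0], 1))])

def fit_linear_readout(states, targets, ridge=1e-6):
    """Ridge regression: the only trained part of a reservoir computer."""
    X = _with_bias(states)
    A = X.T @ X + ridge * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ targets)

def predict(states, weights):
    return _with_bias(states) @ weights

rng = np.random.default_rng(1)
angle = rng.uniform(-1.0, 1.0, 400)          # normalized bending angle
states = simulate_sensors(angle)
weights = fit_linear_readout(states[:300], angle[:300])
rmse = np.sqrt(np.mean((predict(states[300:], weights) - angle[300:]) ** 2))
print("held-out RMSE:", rmse)
```

Only the readout weights are fitted; the nonlinear transformation is left to the (here simulated) body, which is why the digital computation stays small.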

Abstract:
Legged aerial-terrestrial robots have garnered significant research attention in recent years due to their enhanced environmental adaptability through combined aerial and terrestrial locomotion. However, existing passive spring-legged aerial robots exhibit limited motion versatility, demonstrating single stance gait during ground impacts, which constrains their task adaptability and creates substantial challenges in hybrid trajectory optimization and switching control. To address these difficulties, this work presents a systematic solution to achieve diverse hybrid locomotion. We innovatively establish the differential flatness property for spring-legged quadrotors in both aerial and terrestrial domains, and propose a unified hybrid trajectory optimization framework that generates smooth, agile, and dynamically feasible multi-modal trajectories incorporating diverse stance gait patterns. Furthermore, a hybrid nonlinear model predictive controller with a trajectory extension strategy is developed to enhance hybrid tracking precision and mode transition execution. Compared to existing methods, we achieve a 27% reduction in tracking error during hybrid locomotion while maintaining high-precision foot placement. The source code will be released to benefit the community.

Abstract:
Robotic needle positioning tasks in neurosurgery often face challenges due to insufficient perception of planar guidance images during surgery. In this work, we propose an Augmented Reality (AR) interface to help perform the robotic needle positioning tasks by learning from demonstration (LfD). Enhanced immersion in the workflow is achieved by displaying surgical scenes and calculated navigation information. The framework utilizes mixed interactive interfaces in virtual and real environments, enhancing demonstration efficiency and quality. A head-mounted display and an optical tracking system are utilized to perform the visualization and needle tracking. Gaussian Mixture Model (GMM) and Gaussian Mixture Regression (GMR) are employed to learn a robust and smooth trajectory policy from demonstrations. Experiments on robot reproduction of the needle positioning task achieved a final positioning error of 0.6 mm and an average trajectory error of 1.07 mm. Comparative user studies with haptic device-based teleoperation exhibit a low completion time of 62.76 s and reduced workload of the proposed system.
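
The GMR step can be sketched as conditioning an already-fitted GMM over (time, position) on time. This is the textbook formulation under the assumption that the first dimension of every component is time; how the authors fit the GMM and which variables they condition on is not stated here. The function name and the example parameters are hypothetical.

```python
import numpy as np

def gmr(t_query, priors, means, covs):
    """Gaussian Mixture Regression: expected position given time.

    priors: (K,), means: (K, 1 + D), covs: (K, 1 + D, 1 + D);
    dimension 0 of each component is time, the remaining D are positions.
    Returns an array of shape (N, D).
    """
    t_query = np.atleast_1d(np.asarray(t_query, dtype=float))
    n_comp = len(priors)
    dim = means.shape[1] - 1
    h = np.empty((len(t_query), n_comp))          # component responsibilities
    cond = np.empty((len(t_query), n_comp, dim))  # per-component conditional means
    for k in range(n_comp):
        mu_t, mu_x = means[k, 0], means[k, 1:]
        s_tt = covs[k, 0, 0]
        s_xt = covs[k, 1:, 0]
        h[:, k] = priors[k] * np.exp(-0.5 * (t_query - mu_t) ** 2 / s_tt) / np.sqrt(2 * np.pi * s_tt)
        cond[:, k, :] = mu_x + np.outer(t_query - mu_t, s_xt) / s_tt
    h /= h.sum(axis=1, keepdims=True)
    return (h[:, :, None] * cond).sum(axis=1)

# Two illustrative components describing a 1-D motion from x = 0 to x = 10.
priors = np.array([0.5, 0.5])
means = np.array([[0.0, 0.0], [1.0, 10.0]])
covs = np.array([[[0.05, 0.0], [0.0, 0.1]], [[0.05, 0.0], [0.0, 0.1]]])
print(gmr(np.linspace(0.0, 1.0, 5), priors, means, covs).ravel())
```

Because the output is a responsibility-weighted blend of linear conditional means, the reproduced trajectory is smooth even when the demonstrations are noisy.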

Abstract:
We explore how scalable robot data can address real-world challenges for generalized robotic manipulation. Introducing AgiBot World, a large-scale platform comprising over 1 million trajectories across 217 tasks in five deployment scenarios, we achieve an order-of-magnitude increase in data scale compared to existing datasets. Accelerated by a standardized collection pipeline with human-in-the-loop verification, AgiBot World guarantees high-quality and diverse data distribution. It is extensible from grippers to dexterous hands and visuo-tactile sensors for fine-grained skill acquisition. Building on this data, we introduce Genie Operator-1 (GO-1), a novel generalist policy that leverages latent action representations to maximize data utilization, demonstrating predictable performance scaling with increased data volume. Policies pre-trained on our dataset achieve an average performance improvement of 30% over those trained on Open X-Embodiment, both in in-domain and out-of-distribution scenarios. GO-1 exhibits exceptional capability in real-world dexterous and long-horizon tasks, achieving over 60% success rate on complex tasks and outperforming the prior RDT approach by 32%. By open-sourcing the dataset, tools, and models, we aim to democratize access to large-scale, high-quality robot data, advancing the pursuit of scalable and general-purpose intelligence.

Abstract:
Driving decision-making in mixed traffic, characterized by high-dynamic interactions and stochastic behaviors of human-driven vehicles, poses significant challenges for autonomous driving systems. To address these issues, we propose a novel Transformer-based Spatial Temporal Fusion (TSTF) module integrated with an auxiliary contrastive learning task within a multi-agent reinforcement learning (MARL) framework. The TSTF module captures interaction-aware behaviors and long-term temporal dependencies that tackle mixed cooperative driving scenarios, while the auxiliary contrastive learning task refines feature representations to enhance exploration efficiency and decision stability. Experimental evaluations on the MetaDrive platform demonstrate that the proposed approach outperforms baseline algorithms in safety, adaptability and robustness to dynamic traffic scenarios. The results highlight the effectiveness of the TSTF module in enabling robust and context-aware collaborative driving behaviors, offering a scalable solution for real-world mixed traffic. This work advances MARL by addressing key challenges in interaction modeling and driving decision-making under uncertainty, with significant implications for the development of intelligent transportation systems.

Abstract:
Achieving controlled jumping behaviour for a quadruped robot is a challenging task, especially when introducing passive compliance in mechanical design. This study addresses this challenge via imitation-based deep reinforcement learning with a progressive training process. To start, we learn the jumping skill by mimicking a coarse jumping example generated by model-based trajectory optimization. Subsequently, we generalize the learned policy to broader situations, including various distances in both forward and lateral directions, and then pursue robust jumping on ground with unknown unevenness. In addition, without tuning the reward much, we learn the jumping policy for a quadruped with parallel elasticity. Results show that using the proposed method, i) the robot learns versatile jumps by learning only from a single demonstration, ii) the robot with parallel compliance reduces the landing error by 11.1%, saves energy cost by 15.2% and reduces the peak torque by 15.8%, compared to the rigid robot without parallel elasticity, iii) the robot can perform jumps of variable distances with robustness against ground unevenness (maximal ±4 cm height perturbations) using only proprioceptive perception.

Abstract:
This paper introduces a novel method for end-to-end crowd detection that leverages object density information to enhance existing transformer-based detectors. We present CrowdQuery (CQ), whose core component is our CQ module that predicts and subsequently embeds an object density map. The embedded density information is then systematically integrated into the decoder. Existing density map definitions typically depend on head positions or object-based spatial statistics. Our method extends these definitions to include individual bounding box dimensions. By incorporating density information into object queries, our method utilizes density-guided queries to improve detection in crowded scenes. CQ is universally applicable to both 2D and 3D detection without requiring additional data. Consequently, we are the first to design a method that effectively bridges 2D and 3D detection in crowded environments. We demonstrate the integration of CQ into both a general 2D and 3D transformer-based object detector, introducing the architectures CQ2D and CQ3D. CQ is not limited to the specific transformer models we selected. Experiments on the STCrowd dataset for both 2D and 3D domains show significant performance improvements compared to the base models, outperforming most state-of-the-art methods. When integrated into a state-of-the-art crowd detector, CQ can further improve performance on the challenging CrowdHuman dataset, demonstrating its generalizability. The code is released at https://github.com/mdaehl/CrowdQuery.

Abstract:
The utilization of Unmanned Ground Vehicles (UGVs) for patrolling industrial sites has expanded significantly. These UGVs are typically equipped with perception systems, e.g., computer vision, with limited range due to sensor limitations or site topology. High-level control of the UGVs requires Coverage Path Planning (CPP) algorithms that navigate all relevant waypoints and promptly start the next cycle. In this paper, we propose the novel Fast-Revisit Coverage Path Planning (FaRe-CPP) algorithm using a greedy heuristic approach to propose waypoints for maximum coverage area and a random search-based path optimization technique to obtain a path along the proposed waypoints with minimum revisit time. We evaluated the algorithm in a simulated environment using Gazebo and a camera-equipped TurtleBot3 against a number of existing algorithms. Compared to their average path lengths and revisit times, our FaRe-CPP algorithm showed a reduction of at least 21% and 33%, respectively, in these highly relevant performance indicators.
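
The two stages named above, greedy waypoint selection followed by random-search path optimization, can be sketched generically as follows. This is not the FaRe-CPP implementation: it assumes a grid of cells, a circular sensing footprint, and closed-tour length as a stand-in for revisit time, and it uses random segment reversals as the search move. All function names and parameters are hypothetical.

```python
import math
import random

def covered_cells(waypoint, cells, radius):
    """Cells within sensing range of a waypoint (circular footprint assumed)."""
    return {c for c in cells if math.dist(waypoint, c) <= radius}

def greedy_waypoints(candidates, cells, radius):
    """Repeatedly pick the candidate that covers the most still-uncovered cells."""
    uncovered = set(cells)
    chosen = []
    while uncovered:
        best = max(candidates, key=lambda w: len(covered_cells(w, uncovered, radius)))
        gain = covered_cells(best, uncovered, radius)
        if not gain:
            break  # remaining cells are not visible from any candidate
        chosen.append(best)
        uncovered -= gain
    return chosen, uncovered

def tour_length(order):
    """Length of the closed patrol loop through the waypoints in the given order."""
    return sum(math.dist(order[i], order[(i + 1) % len(order)]) for i in range(len(order)))

def random_search_tour(waypoints, iterations=2000, seed=0):
    """Random search: reverse a random segment and keep the change if the loop gets shorter."""
    rng = random.Random(seed)
    best = list(waypoints)
    best_len = tour_length(best)
    if len(best) < 3:
        return best, best_len
    for _ in range(iterations):
        i, j = sorted(rng.sample(range(len(best)), 2))
        trial = best[:i] + best[i:j + 1][::-1] + best[j + 1:]
        trial_len = tour_length(trial)
        if trial_len < best_len:
            best, best_len = trial, trial_len
    return best, best_len

cells = [(x, y) for x in range(6) for y in range(6)]
waypoints, missed = greedy_waypoints(cells, cells, radius=1.5)
tour, length = random_search_tour(waypoints)
print(len(waypoints), "waypoints, loop length", round(length, 2))
```

At constant speed a shorter closed loop means every waypoint is revisited sooner, which is why loop length is used here as a simple proxy for revisit time.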

Abstract:
This study investigates the robust finite-time stabilization of an autonomous underwater vehicle (AUV) with disturbance rejection, where the finite settling time can be predetermined. The AUV is modeled as a rigid body moving within fluids, and the system dynamics involve uncertain parameters arising from the hydrodynamic coupling between the AUV and fluid, along with unknown external disturbances. Direct compensation for the system’s dynamics, a common approach in controller design, is ineffective for AUVs with uncertainties. Therefore, a robust anti-disturbance control method without compensating for unknown dynamics is proposed. To begin, a time rescaling method is introduced to convert the specified finite-time stabilization of the system into an asymptotic stabilization for a time rescaling system. Then, an exponential PID controller is designed for the time rescaling system to handle unknown constant disturbances while a high-gain control strategy is used to suppress the uncertain dynamics, which also enhances the robustness against the uncertain dynamics. The specified finite-time stabilization control law is ultimately derived from the designed exponential control law of the time rescaling system. Numerical simulations are conducted to verify the results.
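
To illustrate why a time rescaling turns prescribed finite-time stabilization into an asymptotic problem, consider one common choice of rescaling. This is an assumption made for illustration and not necessarily the map used in the paper; the symbols T (prescribed settling time), \tau (rescaled time), \lambda (decay rate), and c (a constant) are introduced only for this sketch. Exponential convergence in \tau then implies convergence to zero as t approaches T:

```latex
\tau = -\ln\!\left(1 - \frac{t}{T}\right), \qquad
\frac{d\tau}{dt} = \frac{1}{T - t}, \qquad t \in [0, T)
\quad\Longrightarrow\quad
\|x\| \le c\, e^{-\lambda \tau} = c \left(1 - \frac{t}{T}\right)^{\lambda}
\xrightarrow[\; t \to T^{-} \;]{} 0 .
```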

Abstract:
Deploying mobile robots safely among humans requires the motion planner to account for the uncertainty in the other agents’ predicted trajectories. This remains challenging in traditional approaches, especially with arbitrarily shaped predictions and real-time constraints. To address these challenges, we propose a Dynamic Risk-Aware Model Predictive Path Integral control (DRA-MPPI), a motion planner that incorporates uncertain future motions modelled with potentially non-Gaussian stochastic predictions. By leveraging MPPI’s gradient-free nature, we propose a method that efficiently approximates the joint Collision Probability (CP) among multiple dynamic obstacles for several hundred sampled trajectories in real-time via a Monte Carlo (MC) approach. This enables the rejection of samples exceeding a predefined CP threshold or the integration of CP as a weighted objective within the navigation cost function. Consequently, DRA-MPPI mitigates the freezing robot problem while enhancing safety. Real-world and simulated experiments with multiple dynamic obstacles demonstrate DRA-MPPI’s superior performance compared to state-of-the-art approaches, including Scenario-based Model Predictive Control (S-MPC), Frenet planner, and vanilla MPPI. Videos of the experiments can be found at https://autonomousrobots.nl/paper_websites/dra-mppi.
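
A Monte Carlo estimate of the joint collision probability, followed by threshold-based rejection of sampled trajectories, can be sketched as below. This is a simplified illustration rather than the DRA-MPPI implementation: it assumes point robots and obstacles with a single combined collision radius, and that the predictor supplies M joint samples of all obstacle futures. The function names and array layout are hypothetical.

```python
import numpy as np

def joint_collision_probability(robot_traj, obstacle_samples, radius):
    """Fraction of prediction samples in which the robot hits any obstacle at any step.

    robot_traj: (T, 2) planned positions.
    obstacle_samples: (M, K, T, 2), M Monte Carlo samples of K obstacles over T steps.
    radius: combined collision radius (assumed circular footprints).
    """
    dist = np.linalg.norm(obstacle_samples - robot_traj[None, None, :, :], axis=-1)
    collided = (dist < radius).any(axis=(1, 2))  # one boolean per MC sample
    return collided.mean()

def filter_rollouts(rollouts, obstacle_samples, radius, cp_threshold):
    """Estimate CP for each sampled rollout and flag those within the risk budget."""
    cps = np.array([joint_collision_probability(r, obstacle_samples, radius) for r in rollouts])
    return cps, cps <= cp_threshold

# One obstacle whose prediction is bimodal: it either stays at the origin or far away.
samples = np.zeros((4, 1, 5, 2))
samples[2:] = 10.0
through_origin = np.stack([np.linspace(-2.0, 2.0, 5), np.zeros(5)], axis=1)
detour = through_origin + np.array([0.0, 5.0])
cps, keep = filter_rollouts([through_origin, detour], samples, radius=0.5, cp_threshold=0.1)
print(cps, keep)
```

Counting a sample as a collision if any obstacle is hit at any time step gives the joint probability directly, without assuming Gaussian predictions or independence between obstacles.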

Abstract:
In robotic operation, asymmetric tasks requiring dual-arm cooperation are a highly challenging research direction. Autonomous operation generally has a low success rate or poor generalization because of its excessive dependence on the accuracy of sub-goals from asymmetric tasks. Although teleoperation can significantly improve on both, operators are prone to neglecting crucial intermediate sub-goals that are conducive to fine-grained dual-arm cooperation. Therefore, we propose a dual-arm shared control framework that first introduces a Sub-goal Generation module to concentrate on the intermediate states, thus improving fine-grained dual-arm cooperation and reducing the amount of adjustment during asymmetric task operation. We also integrate a Trajectory Prediction module that computes the future trajectory from historical movement information to enhance the smoothness of robot motion. Finally, by dynamically combining the Sub-goal Generation module, the Trajectory Prediction module, and operator movement in the shared control framework, we effectively decrease the sensitivity to the accuracy of sub-goals, thus significantly improving the success rate. In simulation, we conduct comparative experiments against autonomous operation and teleoperation on four common asymmetric tasks to validate the advantages of our shared control framework. The effect of each element in our framework is verified by an ablation study. Our shared control framework can also be applied in real-world scenarios.

Abstract:
This paper investigates how different levels of information impact user trust and cognitive load in teleoperated robotic systems. Participants performed tasks under three conditions: (C1) minimal information after initial visual feedback, (C2) verbal guidance through a graphical interface, and (C3) a combination of visual and verbal guidance via a graphical user interface (GUI) that shows the direction in which the user should move the robot to complete the task. Measurements included physiological responses such as galvanic skin response (GSR), eye blink rate, and facial temperature, along with task performance. The findings revealed that increased cognitive load reduced user trust and performance. When only minimal information was provided, participants experienced the highest cognitive load and lowest trust levels. Verbal guidance significantly reduced cognitive load and increased trust, whereas the combination of visual and verbal guidance caused cognitive overload, counteracting the expected increase in trust. This study underscores the importance of balancing information quantity and quality to enhance user experience and the efficiency of teleoperated robotic systems.

Abstract:
Robots operating in complex and uncertain environments face considerable challenges. Advanced robotic systems often rely on extensive datasets to learn manipulation tasks. In contrast, when humans are faced with unfamiliar tasks, such as assembling a chair, a common approach is to learn by watching video demonstrations. In this paper, we propose a novel method for learning robot policies by Retrieving-from-Video (RfV), using analogies from human demonstrations to address manipulation tasks. Our system constructs a video bank comprising recordings of humans performing diverse daily tasks. To enrich the knowledge from these videos, we extract mid-level information, such as object affordance masks and hand motion trajectories, which serve as additional inputs to enhance the robot model's learning and generalization capabilities. We further feature a dual-component system: a video retriever that taps into an external video bank to fetch task-relevant video based on task specification, and a policy generator that integrates this retrieved knowledge into the learning cycle. This approach enables robots to craft adaptive responses to various scenarios and generalize to tasks beyond those in the training data. Through rigorous testing in multiple simulated and real-world settings, our system demonstrates a marked improvement in performance over conventional robotic systems, showcasing a significant breakthrough in the field of robotics.

Abstract:
Rehabilitation robotics has attracted increasing attention due to its ability to provide continuous, precise, and adaptive treatment programs for stroke patients during their recovery. Accurately assessing lower-limb motor function is crucial in effectively implementing robot-assisted rehabilitation. This study proposes a novel application of a hybrid learning framework that leverages a single-ear-worn inertial measurement unit (IMU) combined with deep learning techniques to predict the Berg Balance Scale (BBS) scores. Participants performed a 3-meter Timed Up and Go (TUG) test while wearing the e-AR sensor. The collected 6-axis IMU data were processed through a CNN-LSTM framework, where we integrated time-domain, frequency-domain, and static features to enhance the model’s regression performance. Experimental results demonstrate that our proposed method achieves a mean absolute error (MAE) of 1.074, surpassing previous studies’ reported results and outperforming traditional machine learning and conventional deep learning algorithms when applied to ear-worn sensor data. The proposed framework is simple to operate yet accurate, making it suitable for patients’ self-assessment even in a home environment.

Abstract:
Continuum robots offer exceptional adaptability and dexterity for tasks in confined and complex spaces. However, their flexible structures inherently limit payload capacity and structural robustness. To overcome this trade-off, many previous studies have proposed variable stiffness continuum robots, but these robots have slow response times that hinder rapid tasks. In this regard, we propose a novel continuum robot with electro-permanent magnet (EPM)-based ball joints. This robot has two key components: an EPM actuator and a variable stiffness ball joint (VSBJ). The EPM-embedded VSBJ constitutes discrete segments of the robot and dynamically transitions between low- and high-stiffness states by controlling the EPM’s magnetic field. This provides the robot with fast-response and variable-stiffness capabilities. In experiments, the VSBJ achieved a 205-fold stiffness variation ratio with a response time of 0.019 s, surpassing the performance of many previously studied robots. Leveraging these capabilities, the proposed robot demonstrated single or multiple bending motions by controlling the stiffness of discrete segments. Finally, the potential of the fast-response variable-stiffness continuum robot for real-world applications was confirmed.

Abstract:
The six-degree-of-freedom (6-DOF) robotic arm has gained widespread application in human-coexisting environments. While previous research has predominantly focused on functional motion generation, the critical aspect of expressive motion in human-robot interaction remains largely unexplored. This paper presents a novel real-time motion generation planner that enhances interactivity by creating expressive robotic motions between arbitrary start and end states within predefined time constraints. Our approach involves three key contributions: first, we develop a mapping algorithm to construct an expressive motion dataset derived from human dance movements; second, we train motion generation models in both Cartesian and joint spaces using this dataset; third, we introduce an optimization algorithm that guarantees smooth, collision-free motion while maintaining the intended expressive style. Experimental results demonstrate the effectiveness of our method, which can generate expressive and generalized motions in under 0.5 seconds while satisfying all specified constraints.

Abstract:
In this paper, we consider a first step to bridge a gap in coordinating task planning robots. Specifically, we study the automatic construction of languages that are maximally flexible while being sufficiently explicative for coordination. To this end, we view language as a machinery for specifying temporal-state constraints of plans. Such a view enables us to reverse-engineer a language from the ground up by mapping these composable constraints to words. Our language expresses a plan for any given task as a "plan sketch" to convey just-enough details while maximizing the flexibility to realize it, leading to robust coordination with optimality guarantees among other benefits. We formulate and analyze the problem, provide approximate solutions, and validate our approach under various scenarios to shed light on its applications.

Abstract:
Energy efficiency is a fundamental goal in robotic control. Various components within a robot, such as mechanical systems, computational units, and sensors, consume energy, all powered by the battery unit. Each component features several actuators and individual controllers that optimize energy usage locally, often without regard to one another. In this paper, we highlight a significant phenomenon indicating a considerable dependency between the mechanical and computational parts of the robot as energy consumers and the battery state of charge (SOC) as the energy provider. We demonstrate that as the battery SOC fluctuates, the behavior of energy consumption also varies, necessitating a unified controller with awareness of this relationship. Motivated by this observation, we propose a battery-aware co-optimization strategy for the mechanical and computational units, leveraging configuration space exploration to optimize the motor speed and the CPU frequency under different environmental conditions and battery SOC levels. Experimental results demonstrate the effectiveness of our approach in extending the operational lifetime of a robot under varying battery SOC and workload conditions, enhancing the energy efficiency of a case study rover by up to 53.93% w.r.t. selected baselines and similar past approaches.

Abstract:
To achieve en bloc resection of bladder tumor and the anterior tumor resection in transurethral resection of bladder tumor (TURBT), a cystoscope transurethral continuum robotic system has been proposed. A continuum cystoscope in the system needs to bend more than 180° and its base has translation, axial rotation, and tilt degrees of freedom to achieve full bladder accessibility. Under the constant-curvature assumption, the analytical solution of inverse kinematics already exists for multi-segment continuum robots with variable segment lengths and continuum robots with two inextensible segments. However, there is a lack of analytical inverse kinematics solution for continuum robots with features of the continuum cystoscope. Therefore, this paper proposes a novel and efficient inverse kinematics solving algorithm for the continuum cystoscope used in TURBT. The proposed method simplifies the inverse kinematics problem by constructing a robot plane coordinate and uses geometric relationships to derive a non-linear constraint equation containing only one intermediate variable. By solving this non-linear equation, the solution to the entire inverse kinematics problem is obtained. Additionally, based on this inverse kinematics algorithm, the length of the continuum segment is designed to ensure full bladder accessibility. In the comparative experiments with the Jacobian-based method, which involves 12500 target poses, the proposed method solves 100% of the inverse kinematics problems with a much greater computational efficiency.

Abstract:
Due to the bounded thrust, motion acceleration, and external disturbances inherent in quadrotor UAVs, traditional hierarchical control methods for multi-UAV aerial transportation systems (MUATS) with cable-suspended payloads often struggle to guarantee dynamic performance and payload wrench feasibility. To address these challenges, this paper proposes a robust wrench-feasible control framework. First, we design a payload controller based on an extended state observer to estimate and compensate for the total disturbance acting on the payload. Next, a piecewise wrench adjustment strategy (PWAS) is proposed to ensure that the payload wrench balances feasibility and path-tracking ability. Finally, we propose an adaptive cable configuration strategy (ACCS) inspired by capacity margin. When external disturbances approach the system’s wrench capacity, ACCS can dynamically adjust cable configuration to maximize the capacity margin. Experimental results demonstrate the effectiveness and superiority of the proposed method.

Abstract:
Understanding hand motion from a single RGB image is challenging due to occlusions and high articulation. This paper presents 3D-AMTA, a transformer-based framework with Auto Mask and Token-specific Attention for occlusion-aware 3D hand pose estimation (HPE). We propose two novel architectural enhancements: auto mask for high-occlusion scenarios, and token-specific attention for fine-grained hand articulations. These modules seamlessly integrate into transformer-based architectures that enhance real-time performance in interactive systems. To enable efficient deployment on robotic and embedded platforms, we propose 3D-AMTA-Mobile, a lightweight variant optimized for on-device processing. It achieves 267 FPS on an NVIDIA RTX 2080 Ti GPU while maintaining high accuracy, making it well-suited for resource-constrained robotic applications. Extensive evaluations on FreiHAND and HO3D demonstrate that our approach consistently outperforms state-of-the-art methods in terms of accuracy, efficiency, and inference speed. These advancements contribute to robust hand perception for interactive robotics and AR-based teleoperation.

Abstract:
Integrating robotics into everyday scenarios like tutoring or physical training requires robots capable of adaptive, socially engaging, and goal-oriented interactions. While Large Language Models show promise in human-like communication, their standalone use is hindered by memory constraints and contextual incoherence. This work presents a multimodal, cognitively inspired framework that enhances LLM-based autonomous decision-making in social and task-oriented Human-Robot Interaction. Specifically, we develop an LLM-based agent for a robot trainer, balancing social conversation with task guidance and goal-driven motivation. To further enhance autonomy and personalization, we introduce a memory system for selecting, storing and retrieving experiences, facilitating generalized reasoning based on knowledge built across different interactions. A preliminary HRI user study and offline experiments with a synthetic dataset validate our approach, demonstrating the system’s ability to manage complex interactions, autonomously drive training tasks, and build and retrieve contextual memories, advancing socially intelligent robotics.

Abstract:
We propose SGLoc, a novel localization system that directly regresses camera poses from 3D Gaussian Splatting (3DGS) representation by leveraging semantic information. Our method utilizes the semantic relationship between the 2D image and the 3D scene representation to estimate the 6DoF pose without prior pose information. In this system, we introduce a multilevel pose regression strategy that progressively estimates and refines the pose of the query image from the global 3DGS map, without requiring initial pose priors. Moreover, we introduce a semantic-based global retrieval algorithm that establishes correspondences between 2D (image) and 3D (3DGS map). By matching the extracted scene semantic descriptors of the 2D query image and the 3DGS semantic representation, we align the image with the local region of the global 3DGS map, thereby obtaining a coarse pose estimation. Subsequently, we refine the coarse pose by iteratively optimizing the difference between the query image and the rendered image from 3DGS. Our SGLoc demonstrates superior performance over baselines on the 12-Scenes and 7-Scenes datasets, showing excellent capabilities in global localization without initial pose prior. Code will be available at https://github.com/IRMVLab/SGLoc.

Abstract:
Caregiving is a vital role for domestic robots; repositioning care in particular has immense societal value, critically improving the health and quality of life of individuals with limited mobility. However, the repositioning task is a challenging area of research, as it requires robots to adapt their motions while interacting flexibly with patients. The task involves several key challenges: (1) applying appropriate force to specific target areas; (2) performing multiple actions seamlessly, each requiring different force application policies; and (3) motion adaptation under uncertain positional conditions. To address these, we propose a deep neural network (DNN)-based architecture utilizing proprioceptive and visual attention mechanisms, along with impedance control to regulate the robot's movements. Using the dual-arm humanoid robot Dry-AIREC, the proposed model successfully generated motions to insert the robot's hand between the bed and a mannequin's back without applying excessive force, and it supported the transition from a supine to a lifted-up position. The project page is here: https://sites.google.com/view/caregiving-robot-airec/repositioning

Abstract:
Generating intelligent robot behavior in contact-rich settings is a research problem where zeroth-order methods currently prevail. A major contributor to the success of such methods is their robustness in the face of non-smooth and discontinuous optimization landscapes that are characteristic of contact interactions, yet zeroth-order methods remain computationally inefficient. It is therefore desirable to develop methods for perception, planning and control in contact-rich settings that can achieve further efficiency by making use of first and second order information (i.e., gradients and Hessians). To facilitate this, we present a joint formulation of collision detection and contact modelling which, compared to existing differentiable simulation approaches, provides the following benefits: i) it results in forward and inverse dynamics that are entirely analytical (i.e. do not require solving optimization or root-finding problems with iterative methods) and smooth (i.e. twice differentiable), ii) it supports arbitrary collision geometries without needing a convex decomposition, and iii) its runtime is independent of the number of contacts. Through simulation experiments, we demonstrate the validity of the proposed formulation as a "physics for inference" that can facilitate future development of efficient methods to generate intelligent contact-rich behavior.

Abstract:
Enabling robotic systems to perform long-horizon manipulation planning in real-world environments based on multimodal embodied perception and comprehension remains a longstanding challenge. Recent advancements in large language models (LLMs) have spurred the development of LLM-based planners; however, these approaches often rely on human-provided textual representations or extensive prompt engineering, lacking the ability to quantitatively interpret the environment. To overcome these limitations, we propose a novel framework that leverages LLMs and vision-language models (VLMs) to perform abstract reasoning and extract task-relevant representations from the environment using grounding mechanisms. To further enhance robotic capabilities, we introduce a systematic approach to constructing robotic skill libraries, enabling efficient generation of feasible and optimal actions. Unlike prior work, our LLM-based task planner reformulates user instructions into Planning Domain Description Language (PDDL) problems and employs Behavior Trees to represent the hierarchical structure of tasks, offering interpretable and modular task execution. Extensive evaluations on diverse real-world long-horizon manipulation tasks demonstrate the effectiveness of the proposed method, achieving an average success rate exceeding 80%. Furthermore, the framework functions as a high-level planner, empowering robots with substantial autonomy in unstructured environments by leveraging multimodal sensor inputs.

Abstract:
Model Predictive Control (MPC) enables agile and robust locomotion in quadruped robots but is sensitive to model uncertainties and environmental variations. This paper presents a Tube-Based Robust MPC (TR-MPC) framework for quadruped locomotion under uncertainties, modeled as parameter mismatches and additive disturbances. TR-MPC constructs an Invariant Ellipsoid to bound errors induced by uncertainties, ensuring convergent error trajectories. A Semi-Definite Programming (SDP) problem with Linear Matrix Inequality (LMI) constraints is solved offline to minimize the ellipsoid size, while a linear feedback term stabilizes error dynamics, guaranteeing stability within uncertainty bounds. Simulations and experiments demonstrate TR-MPC’s robustness: the robot achieves stable trotting under a 14 kg load (123% of its weight) and recovers from a 1.4 m/s impact while carrying 10 kg (88% of its weight). This framework significantly enhances robustness in dynamic and uncertain environments.
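The linear feedback term mentioned above can be illustrated with a minimal sketch of the generic tube-MPC control law, u = u_nom + K (x - x_nom). The matrices A, B and the gain K below are hypothetical values chosen only so that the closed loop A + BK is stable; they are not the robot model or the gain from the paper, and the offline SDP that sizes the invariant ellipsoid is not reproduced here.

```python
def mat_vec(M, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


def tube_control(u_nom, x, x_nom, K):
    """Nominal MPC input plus linear feedback on the deviation from the nominal state."""
    error = [xi - xn for xi, xn in zip(x, x_nom)]
    return [un + c for un, c in zip(u_nom, mat_vec(K, error))]


def propagate_error(A_cl, e0, disturbances):
    """Roll out the error dynamics e[k+1] = (A + B K) e[k] + w[k]."""
    e = list(e0)
    history = [e]
    for w in disturbances:
        e = [a + b for a, b in zip(mat_vec(A_cl, e), w)]
        history.append(e)
    return history


# Hypothetical double-integrator-like model (assumption, for illustration only).
A = [[1.0, 0.1], [0.0, 1.0]]
B = [[0.0], [0.1]]
K = [[-10.0, -5.0]]  # assumed stabilizing gain, not taken from the paper

# Closed-loop error matrix A + B K (B has a single column here).
A_CL = [[A[i][j] + B[i][0] * K[0][j] for j in range(2)] for i in range(2)]

if __name__ == "__main__":
    no_disturbance = [[0.0, 0.0]] * 50
    final_error = propagate_error(A_CL, [1.0, 0.0], no_disturbance)[-1]
    print("error after 50 steps:", final_error)
```

With a bounded disturbance sequence in place of the zeros, the same rollout shows the error staying in a bounded neighborhood of the nominal trajectory, which is the set the invariant ellipsoid is meant to enclose.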

Abstract:
As a critical branch of robotics, mobile platforms have seen extensive applications in industrial automation, social services, and military industry sectors in recent years. However, conventional wheeled platforms exhibit limited obstacle-crossing capability, while legged robots, despite superior terrain traversability, demand excessive power consumption for high payloads and face significant challenges in maintaining platform stability on complex terrain due to intricate control requirements and hardware complexity. This study presents a wheel-footed hybrid robot that integrates a compound wheel-foot mechanism to achieve high payload capacity, exceptional terrain adaptability, and enhanced stability, enabling adaptive embodied intelligence in complex scenarios. First, a novel mechanical architecture and hardware system for the wheel-foot module were designed and constructed. Then, focusing on high dynamic response platform stabilization control, an autonomous planning framework and active stability control system were developed, accompanied by kinematic modeling of the prototype. Finally, experimental validation was conducted on the prototype, demonstrating the ability to carry an adult weighing approximately 60 kg while maintaining platform horizontality (maximum posture errors: 1.5° on slopes, 1.2° over speed bumps, 6.1° during stair climbing), verifying the practicality of both the mechanical design and control strategy.

Abstract:
This paper presents FGLoc6D, a novel approach for robust global localisation and online 6DoF pose estimation of ground robots in forest environments by leveraging deep semantically-guided re-localisation and cross-view factor graph optimisation. The proposed method addresses the challenges of aligning aerial and ground data for pose estimation, which is crucial for accurate point-to-point navigation in GPS-degraded environments. By integrating information from both perspectives into a factor graph framework, our approach effectively estimates the robot’s global position and orientation. Additionally, we enhance the repeatability of deep-learned keypoints for metric localisation in forests by incorporating a semantically-guided regression loss. This loss encourages greater attention to wooden structures, e.g., tree trunks, which serve as stable and distinguishable features, thereby improving the consistency of keypoints and increasing the success rate of global registration, a process we refer to as re-localisation. The re-localisation module, along with the factor-graph structure populated by odometry and ground-to-aerial factors over time, allows global localisation under dense canopies. We validate the performance of our method through extensive experiments in three forest scenarios, demonstrating its global localisation capability and superiority over alternative state-of-the-art methods in terms of accuracy and robustness in these challenging environments. Experimental results show that our proposed method can achieve drift-free localisation with bounded positioning errors, ensuring reliable and safe robot navigation through dense forests.

Abstract:
The ability to use random objects as tools in a generalizable manner is a missing piece in robots’ intelligence today to boost their versatility and problem-solving capabilities. State-of-the-art robotic tool usage methods focus on procedurally generating or crowd-sourcing datasets of tools for a task to learn how to grasp and manipulate them for that task. However, these methods assume that only one object is provided and that it is possible, with the correct grasp, to perform the task; they are not capable of identifying, grasping, and using the best object for a task when many are available, especially when the optimal tool is absent. In this work, we propose GeT-USE, a two-step procedure that learns to perform real-robot generalized tool usage by learning first to extend the robot’s embodiment in simulation and then transferring the learned strategies to real-robot visuomotor policies. Our key insight is that by exploring a robot’s embodiment extensions (i.e., building new end-effectors) in simulation, the robot can identify the general tool geometries most beneficial for a task. This learned geometric knowledge can then be distilled to perform generalized tool usage tasks by selecting and using the best available real-world object as a tool. On a real robot with 22 degrees of freedom (DOFs), GeT-USE outperforms state-of-the-art methods by 30-60% in success rate across three vision-based bimanual mobile manipulation tool-usage tasks.

Abstract:
For amputees with powered transfemoral prosthetics, navigating obstacles or complex terrain remains challenging. This study addresses this issue by using an inertial sensor on the sound ankle to guide obstacle-crossing movements. A genetic algorithm computes the optimal neural network structure to predict the required angles of the thigh and knee joints. A gait progression prediction algorithm determines the actuation angle index for the prosthetic knee motor, ultimately defining the necessary thigh and knee angles and gait progression. Results show that when the standard deviation of Gaussian noise added to the thigh angle data is less than 1, the method can effectively eliminate noise interference, achieving 100% accuracy in gait phase estimation under 150 Hz, with thigh angle prediction error being 8.71% and knee angle prediction error being 6.78%. These findings demonstrate the method’s ability to accurately predict gait progression and joint angles, offering significant practical value for obstacle negotiation in powered transfemoral prosthetics.

Abstract:
LiDAR map matching (LMM) faces two key challenges: the enormous number of point clouds imposes constraints on storage and computation, and traditional two-stage frameworks suffer from initial guess errors during degeneration. This paper presents LHMM, a hybrid-map framework that first compresses the prior map and then performs tightly coupled pose estimation within a Maximum A Posteriori (MAP) estimation formulation. First, a skeletonization-based prior map compression method is proposed, which retains only stable structural features, reducing the map storage while enabling fast runtime association through a dual-mode map representation. Second, constraints from IMU, skeleton-feature prior map, and local voxel map are jointly optimized within a unified MAP formulation, recovering the full system state in a single step and preventing error cascades. The local map benefits from a hole-aware keyframe mechanism, focusing on regions with environmental changes or areas with partial map coverage, thereby reducing computation compared to full mapping. Extensive evaluations across multiple datasets demonstrate that LHMM not only reduces storage and computational overhead but also outperforms state-of-the-art methods in terms of localization accuracy and robustness. We will open-source the code.

Abstract:
This paper presents TransSoft, a novel soft robotic hand with a reconfigurable design for grasping objects of varying properties. While recent soft robotic hands have improved grasping capabilities, they often struggle with a limited range of manipulable object categories and tasks due to hardware constraints. TransSoft addresses these limitations with a scalable, low-cost, and highly adaptable structure that significantly expands the diversity of graspable objects and executable tasks. Unlike previous designs, TransSoft features a unique kinematic structure that enhances radial reconfigurability, allowing it to adjust grasping strategies dynamically based on object size, shape, and material properties. The hand is cost-effective, built from off-the-shelf components in three hours for just 200. We evaluated TransSoft through extensive real-world grasping experiments, and the results demonstrate its superior adaptability and grasping performance. Supplementary materials, including design details and experiment results, are available on our project website.

Abstract:
The soft robotic hand exhibits a wide range of manipulation capabilities, which are attributed to the dexterity of its soft fingers and their coordinated movements. Therefore, designing a versatile soft hand requires careful consideration of both the characteristics of the individual fingers, such as degree of freedom (DOF), and their strategic arrangements to improve performance for specific target tasks. This work presents a modularized high DOF tendon-driven soft finger and a customized design of a soft robotic hand for diverse dexterous manipulation tasks. Furthermore, an all-in-one module is developed that integrates both the 4-way tendon-driven soft finger body and drive parts. Its high DOF enables multidirectional actuations with a wide actuation range, thereby expanding possible manipulation modes. The modularity of the system expands the design space for finger arrangements, which enables the diverse configuration of robotic hands and facilitates the customization of task-oriented platforms. To achieve sophisticated control of these complex configurations, we employ neural network-planned trajectories, enabling the precise execution of complicated tasks. The performance of a single finger is validated, including dexterity and payload, and several real-world manipulation tasks are demonstrated, including writing, grasping, rotating, and spreading, using motion primitives of diverse soft hands with distinctive finger arrangements. These demonstrations showcase the system’s versatility and precision in various tasks. We expect that our system will contribute to the expansion of possibilities in the field of soft robotic manipulation.

Abstract:
Unmanned aerial-aquatic vehicles (UAAVs) can operate both in the air and underwater, giving them broad application prospects. Inspired by the dual-function wings of puffins, we propose a UAAV with amphibious wings to address the challenge that the difference between the two media poses for the vehicle’s propulsion system. The amphibious wing, redesigned based on a fixed-wing structure, features a single degree of freedom in pitch and requires no additional components. It can generate lift in the air and function as a flapping wing for propulsion underwater, reducing disturbance to marine life and making it environmentally friendly. Additionally, an artificial central pattern generator (CPG) is introduced to enhance the smoothness of the flapping motion. This paper presents the prototype, design details, and practical implementation of this concept.
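To illustrate how a CPG can yield a smooth flapping command, the sketch below integrates a Hopf oscillator, a common CPG building block whose state converges to a limit cycle of radius sqrt(mu). The abstract does not say which oscillator the authors use, so the Hopf form, the parameter values, and the function name hopf_cpg are all assumptions made for illustration.

```python
import math


def hopf_cpg(mu=1.0, alpha=10.0, freq_hz=1.0, dt=0.001, duration=10.0,
             x0=0.1, y0=0.0):
    """Integrate a Hopf oscillator with forward Euler.

    Returns the x trajectory (usable as a normalized pitch command)
    and the final (x, y) state.
    """
    omega = 2.0 * math.pi * freq_hz
    x, y = x0, y0
    xs = []
    for _ in range(int(round(duration / dt))):
        r2 = x * x + y * y
        # The radial term pulls the state toward radius sqrt(mu);
        # the omega terms rotate it at the flapping frequency.
        dx = alpha * (mu - r2) * x - omega * y
        dy = alpha * (mu - r2) * y + omega * x
        x, y = x + dt * dx, y + dt * dy
        xs.append(x)
    return xs, (x, y)


if __name__ == "__main__":
    xs, (x, y) = hopf_cpg()
    max_pitch_deg = 30.0  # hypothetical flapping amplitude
    pitch_cmd = [max_pitch_deg * v for v in xs]
    print("final radius:", math.hypot(x, y))
    print("last pitch command (deg):", pitch_cmd[-1])
```

Because the amplitude and frequency are state variables of a continuous system, changing mu or freq_hz during operation changes the output gradually rather than with a step, which is one reason oscillator-based CPGs are used for smooth periodic motion.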

Abstract:
This paper presents a soft Cable-Driven Parallel Robot for gastrointestinal surgery. The robot consists of an inflatable scaffold and a hydraulic variable stiffness end-effector and features six degrees of freedom. Experiments involving passage through a colon model, knot tying, and retraction have demonstrated its flexibility and the concept of navigating through the colon in a soft configuration, then increasing rigidity at the lesion site to collaborate with the endoscope in performing surgery. Meanwhile, the robot can sense the contact force through hydraulic pressure variations within the end-effector shaft, providing haptic feedback, which reduces the effects of Coulomb friction. When using the robot to calculate pressing forces, it achieved accuracy with a mean error of 0.051 N and a standard deviation (STD) of 0.066 N. For lifting forces, it achieved a mean error of 0.066 N and a STD of 0.083 N. These results demonstrate the potential of the robot for tissue palpation applications.

Abstract:
Biohybrid actuators are biological machines that combine living cells with soft materials to sense the environment and generate driving force, powered by bioenergy. With the development of tissue engineering and organoid technology, researchers have applied biohybrid actuator technology to precision medicine and targeted drug delivery, but feedback on and evaluation of biohybrid actuation performance remain limited to visual observation and simulation. We therefore aim to develop an intelligent crawling skeleton with a sensing function, which can be used to evaluate the actuation ability of muscle actuators and eventually realize high-precision control of biohybrid actuators. In this work, an intelligent crawling skeleton based on three-dimensional liquid metal is proposed to detect and report the crawling of C2C12 muscle actuators. Three-dimensional muscle tissue was formed by mixing hydrogels and cells, and the functionalization of muscle rings was promoted using static mechanical forces and external electric field stimulation. The composite crawling skeleton is fabricated by mold inversion and soft lithography. The skeleton can adapt to large deformations above 90 degrees, and its sensitivity to deformation can be increased by selecting materials with different elastic moduli. Inspired by the tendon-bone structure, the intelligent crawling skeleton can measure the degree of deformation of the biohybrid actuator during crawling from the deformation of the muscle tissue, offering a promising approach to feedback and closed-loop control of biohybrid actuators.

Abstract:
Human fingertips are densely populated with sensory nerve endings, allowing them to perceive various physical characteristics, including pressure and roughness. In this work, we develop a rigid-flexible coupled bionic robotic finger with perception decoupling and slip detection capabilities. Slip perception is particularly important in grasping operations: timely prediction of slippage and adjustment of the gripping force can improve gripping stability. Fiber Bragg gratings (FBGs) are embedded within both the rigid skeleton and flexible shell of the bionic fingertip. The fibers within the flexible shell are capable of sensing slight pressure, while the optical fibers embedded in the rigid skeleton can measure temperature changes. First, this paper introduces the principles of distributed fiber optic sensors and the morphological design of the bionic fingertip. Then, the fabrication process of the bionic fingertip is described. Finally, we verify the multimodal sensory capabilities of the bionic fingertip through a series of experiments. The results demonstrate that the bionic finger can successfully sense whether slip has occurred during the grasping process. In summary, this rigid-flexible bionic finger is expected to play a significant role in applications such as dexterous manipulation and fruit picking.

Abstract:
Modular Aerial Robot Systems (MARS) consist of multiple drone units that can self-reconfigure to adapt to various mission requirements and fault conditions. However, existing fault-tolerant control methods exhibit significant oscillations during docking and separation, impacting system stability. To address this issue, we propose a novel fault-tolerant control reallocation method that adapts to an arbitrary number of modular robots and their assembly formations. The algorithm redistributes the expected collective force and torque required for MARS to individual units according to their moment arms relative to the MARS center of mass. Furthermore, we propose an agile trajectory planning method for MARS of arbitrary configurations, which is collision-avoiding and dynamically feasible. Our work represents the first comprehensive approach to enable fault-tolerant and collision-avoiding flight for MARS. We validate our method through extensive simulations, demonstrating improved fault tolerance, enhanced trajectory tracking accuracy, and greater robustness in cluttered environments. The videos and source code of this work are available at https://github.com/RuiHuangNUS/MARS-FTCP/
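The idea of redistributing a collective force and torque according to moment arms can be sketched with a generic least-norm allocation over the healthy units. This is a textbook construction for a simplified one-axis case, not the reallocation algorithm of the paper; the function name allocate_thrust, the single torque axis, and the example geometry are assumptions, and actuator limits are ignored.

```python
def allocate_thrust(arms, healthy, total_force, total_torque):
    """Least-norm split of a collective force and torque over healthy units.

    arms[i] is the signed moment arm of unit i about the assembly's
    center of mass (one torque axis only, for simplicity).
    Failed units receive zero thrust.
    """
    active = [r for r, ok in zip(arms, healthy) if ok]
    n = len(active)
    s1 = sum(active)
    s2 = sum(r * r for r in active)
    det = n * s2 - s1 * s1
    if n < 2 or abs(det) < 1e-12:
        raise ValueError("need at least two healthy units with distinct moment arms")
    # Closed-form solution of min ||f||^2  s.t.  sum(f) = F, sum(r * f) = tau,
    # which gives f_i = lam_f + lam_t * r_i.
    lam_f = (s2 * total_force - s1 * total_torque) / det
    lam_t = (n * total_torque - s1 * total_force) / det
    return [lam_f + lam_t * r if ok else 0.0 for r, ok in zip(arms, healthy)]


if __name__ == "__main__":
    arms = [-1.0, 0.0, 1.0, 2.0]          # hypothetical four-unit row assembly
    healthy = [True, True, False, True]   # third unit has failed
    thrusts = allocate_thrust(arms, healthy, total_force=30.0, total_torque=0.0)
    print(thrusts)
```

When a unit fails, calling the function again with the updated healthy mask shifts thrust toward the remaining units so that the requested collective force and torque are still met, provided the remaining geometry allows it.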

Abstract:
We present the Duke Humanoid, an open-source 10-degrees-of-freedom humanoid, as an extensible platform for locomotion research. The design mimics human physiology, with symmetrical body alignment in the frontal plane to maintain static balance with straight knees. We develop a reinforcement learning policy that can be deployed zero-shot on the hardware for velocity-tracking walking tasks. Additionally, to enhance energy efficiency in locomotion, we propose an end-to-end reinforcement learning algorithm that encourages the robot to leverage passive dynamics. Our experimental results show that our passive policy reduces the cost of transport by up to 50% in simulation and 31% in real-world tests. Our website is http://generalroboticslab.com/DukeHumanoidv1/.
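For readers unfamiliar with the metric, the sketch below computes the standard dimensionless cost of transport, CoT = P / (m g v), and a relative reduction between two policies. The abstract does not state which exact definition or which power measurement the authors use, so the formula choice and the numbers in the example are assumptions for illustration only.

```python
G = 9.81  # gravitational acceleration in m/s^2


def cost_of_transport(power_w, mass_kg, speed_m_s, g=G):
    """Dimensionless cost of transport: power / (weight * forward speed)."""
    if mass_kg <= 0 or speed_m_s <= 0:
        raise ValueError("mass and speed must be positive")
    return power_w / (mass_kg * g * speed_m_s)


def relative_reduction(baseline, improved):
    """Fractional reduction of a metric relative to its baseline value."""
    return (baseline - improved) / baseline


if __name__ == "__main__":
    # Hypothetical numbers, not measurements from the paper.
    cot_baseline = cost_of_transport(power_w=200.0, mass_kg=30.0, speed_m_s=0.5)
    cot_passive = cost_of_transport(power_w=140.0, mass_kg=30.0, speed_m_s=0.5)
    print("baseline CoT:", cot_baseline)
    print("passive CoT:", cot_passive)
    print("reduction:", relative_reduction(cot_baseline, cot_passive))
```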

Abstract:
Generating robot demonstrations through simulation is widely recognized as an effective way to scale up robot data. Previous work often trained reinforcement learning agents to generate expert policies, but this approach lacks sample efficiency. Recently, a line of work has attempted to generate robot demonstrations via differentiable simulation, which is promising but heavily relies on reward design, a labor-intensive process. In this paper, we propose DiffGen, a novel framework that integrates differentiable physics simulation, differentiable rendering, and a vision-language model to enable automatic and efficient generation of robot demonstrations. Given a simulated robot manipulation scenario and a natural language instruction, DiffGen can generate realistic robot demonstrations by minimizing the distance between the embedding of the language instruction and the embedding of the simulated observation after manipulation in representation space. The embeddings are obtained from the vision-language model, and the optimization is achieved by calculating and descending gradients through the differentiable simulation, differentiable rendering, and vision-language model components. Experiments demonstrate that with DiffGen, we could efficiently and effectively generate robot data with minimal human effort or training time. The videos of the results can be accessed at https://sites.google.com/view/diffgen.

Abstract:
Achieving generalizability in solving out-of-distribution tasks is one of the ultimate goals of learning robotic manipulation. Recent progress in Vision-Language Models (VLMs) has shown that VLM-based task planners can alleviate the difficulty of solving novel tasks by decomposing compound tasks into a plan that sequentially executes primitive-level skills that have already been mastered. It is also promising for robotic manipulation to adopt such composable generalization ability, in the form of composable generalization agents (CGAs). However, the community lacks a reliable design of primitive skills and a sufficient amount of primitive-level data annotations. Therefore, we propose RH20T-P, a primitive-level robotic manipulation dataset, which contains about 38k video clips covering 67 diverse manipulation tasks in real-world scenarios. Each clip is manually annotated according to a set of meticulously designed primitive skills that are common in robotic manipulation. Furthermore, we standardize a plan-execute CGA paradigm and implement an exemplar baseline called RA-P on our RH20T-P, whose positive performance on solving unseen tasks validates that the proposed dataset can offer composable generalization ability to robotic manipulation agents. Project homepage: https://sites.google.com/view/rh20t-primitive/main.

Abstract:
Cross-embodiment robotic manipulation synthesis for complicated tasks is challenging, partially due to the scarcity of paired cross-embodiment datasets and the difficulty of designing intricate controllers. Inspired by robotic learning via guided human expert demonstration, we propose a novel cross-embodiment robotic manipulation algorithm via CycleVAE and a human behavior transformer. First, we utilize unsupervised CycleVAE together with a bidirectional subspace alignment algorithm to align latent motion sequences between embodiments. Second, we propose a causal human behavior transformer design to learn the intrinsic motion dynamics of human expert demonstrations. At test time, we leverage the proposed transformer to generate human expert demonstrations, which are then aligned using CycleVAE for the final human-robotic manipulation synthesis. We validated our proposed algorithm through extensive experiments using a dexterous robotic manipulator with a robotic hand. Our method successfully generates smooth trajectories across intricate tasks, outperforming prior learning-based robotic motion planning algorithms. These results have implications for performing unsupervised cross-embodiment alignment and future autonomous robotics design. Complete video demonstrations of our experiments can be found at https://sites.google.com/view/humanrobots/home.

Abstract:
Negative pressure (NP) therapy with sliding suction is an effective method for treating limb lymphedema. Given the shortage of caregivers and the growing number of patients, a robotic NP therapeutic system with a variable-sized suction head can assist lymphedema therapy. However, the varying complexity of different limb regions can affect the accuracy of the suction path. Moreover, the moving suction path should remain smooth to ensure therapeutic efficacy. Therefore, finding a smooth sliding path with highly accurate suction poses on the unstructured limb surface is a significant challenge for robotic therapy. In this paper, a smooth sliding path planning method is proposed for robotic continuous suction in limb lymphedema therapy. The easily-sealed region is identified by comparing point normals to the lip’s suction angle, simplifying path planning to a 2D plane due to lip and limb flexibility. The conjugate gradient method optimizes the path with centroid distance and smoothness constraints. Finally, after the suction poses are generated under the constraints of the lip shape, a smooth sliding path, along with lip pressure commands, is obtained to regulate the robot in performing continuous suction therapy. In the experiment, the manipulator with a variable-sized head was used to perform 10 sliding suctions along different planned paths. The results show that the robot completed 6 sliding suctions on the phantom arm.

Abstract:
This paper presents a novel approach to multi-robot collision avoidance that integrates global path planning with local navigation strategies, utilizing attentive graph neural networks to manage dynamic interactions among agents. We introduce a local navigation model that leverages pre-planned global paths, allowing robots to adhere to optimal routes while dynamically adjusting to environmental changes. The model’s robustness is enhanced through the introduction of noise during training, resulting in superior performance in complex, dynamic environments. Our approach is evaluated against established baselines, including NH-ORCA, DRL-NAV, and GA3C-CADRL, across various structurally diverse simulated scenarios. The results demonstrate that our model achieves consistently higher success rates, lower collision rates, and more efficient navigation, particularly in challenging scenarios where baseline models struggle. This work offers an advancement in multi-robot navigation, with implications for robust performance in dynamic environments of varying complexity, such as those encountered in logistics, where adaptability is essential for accommodating unforeseen obstacles and unpredictable changes.

Abstract:
Single-medium, multi-degree-of-freedom robots often face limitations in aerial-aquatic tasks due to structural and weight constraints, which compromise their mobility in both air and water. To address this, we introduce a 6-degree-of-freedom fully actuated aerial-aquatic robot that employs thrust vectoring for enhanced performance. This innovative design incorporates four servos and four motors to facilitate coordinated operation. In air mode, the robot achieves decoupled control of attitude and position through servo angle feedforward compensation combined with dual-loop control. In underwater mode, it ensures high maneuverability by utilizing a dynamic model similar to a "weightless" state, employing single-loop control. Experimental results demonstrate that the robot can perform fully actuated movements in both air and water, successfully navigate the air-water boundary, and deploy sensors on inclined surfaces. These capabilities highlight the robot’s significant future application prospects.

Abstract:
Online extrinsic calibration is crucial for building "power-on-and-go" moving platforms, like robots and AR devices. However, blindly performing online calibration for unobservable parameters may lead to unpredictable results. In the literature, extensive studies have been conducted on the extrinsic calibration between IMU and camera, from theory to practice. It is well known that the observability of the extrinsic parameters can be guaranteed under sufficient motion excitation. Furthermore, the impacts of degenerate motions have also been investigated. Despite these successful analyses, we identify an issue with the existing observability conclusion. This paper focuses on the observability investigation for straight line motion, which is a common and fundamental degenerate motion in applications. We analytically prove that pure translational straight line motion can lead to the unobservability of the rotational extrinsic parameter between IMU and camera (at least one degree of freedom). By correcting the existing observability conclusion, our novel theoretical finding offers the research community a more precise principle and provides an explainable calibration guideline for practitioners. Our analysis is validated by rigorous theory and experiments.

Abstract:
Minimally Invasive Surgery (MIS) reduces surgical risks and recovery times by enabling precise interventions around lesion sites. However, accessing these sites requires flexible instruments, which often compromise the precision afforded by rigid devices. This study proposes a thermally drawn fiber-based, tendon-driven soft robotic instrument with a two-stage hybrid motion control framework to enhance accuracy. Several learning-based inverse kinematic (IK) models were developed, with the LSTM-based model showing the best performance and used to guide open-loop large-scale motion, while a closed-loop controller refines accuracy via real-time feedback. The system is validated by path-following tasks and a simulated endometrial ablation in a vaginal phantom. Results show that the IK model realizes stable open-loop control with Euclidean error below 2 mm, while hybrid control further reduces errors to below 1 mm. This combination offers a promising MIS solution with high precision in difficult-to-reach surgical sites.
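The two-stage idea (a learned inverse model for the large open-loop move, then feedback to remove the residual error) can be shown with a one-dimensional toy. The plant, the imperfect inverse model, and the proportional update below are hypothetical stand-ins chosen for clarity; they are not the fiber-based instrument, the LSTM model, or the controller from the paper.

```python
def plant(u):
    """Hypothetical 1-D instrument: tip position produced by command u."""
    return 2.0 * u + 0.5  # the 0.5 offset is unknown to the inverse model


def ik_model(target):
    """Imperfect inverse model (stand-in for a learned IK model)."""
    return target / 2.0


def hybrid_control(target, gain=0.25, steps=20):
    """Return (open-loop error, error after closed-loop refinement)."""
    # Stage 1: open-loop move using the inverse model only.
    u = ik_model(target)
    open_loop_error = abs(target - plant(u))
    # Stage 2: closed-loop refinement from measured tip position.
    for _ in range(steps):
        u += gain * (target - plant(u))
    return open_loop_error, abs(target - plant(u))


if __name__ == "__main__":
    open_err, closed_err = hybrid_control(target=3.0)
    print("open-loop error:", open_err)
    print("hybrid error:", closed_err)
```

The split keeps the feedback loop small: the inverse model brings the tip close to the target in one move, so the closed-loop stage only has to correct the model's residual error.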

Abstract:
Human locomotion exhibits extraordinary adaptability and robustness, yet the mechanisms by which lower limbs adjust to sudden environmental disruptions remain poorly understood. To address this, we employed the bioinspired human-sized EPA-Hopper II robot to examine how lower-limb joints recover from an abrupt drop in ground height, mimicking unexpected perturbations encountered in natural settings. Our study investigates the roles of the monoarticular soleus (SOL) and biarticular gastrocnemius (GAS) muscle configurations, focusing on how their compliance influences the robot’s hopping stability. Experiments reveal that a coordinated interplay between SOL and GAS markedly improves recovery from disturbances, enhancing energy distribution and joint synchronization. Detailed kinematic and power analyses show that GAS facilitates energy transfer across joints, while SOL’s spring-like properties support rapid recovery. These results highlight how bioinspired muscle arrangements enable robust locomotion through intrinsic mechanical interactions. By leveraging a robotic platform to probe these dynamics, this work deepens our understanding of biological locomotion and informs the design of bioinspired bipedal robots and prosthetics capable of thriving in unpredictable environments.

Abstract:
In water tunnels, autonomous navigation of autonomous underwater vehicles (AUVs) is challenging because of accumulated localization errors and severe acoustic perturbations. An AUV water tunnel navigation (AUV-WTN) framework is proposed to address these challenges. AUV-WTN integrates a forward-looking sonar (FLS) image segmentation method based on the refined Mask R-CNN (RM R-CNN) network with real-time trajectory planning that employs the dynamic trajectory homotopy method (DTHM). RM R-CNN is optimized by combining a mixed-frequency block (MFB) with a weighted loss function. Additionally, precise region of interest pooling (PrRoI Pooling) is incorporated to mitigate the impact of false targets, blurred edges, and noise on segmentation accuracy. DTHM is proposed to reduce trajectory drift by dynamically updating path generation based on segmented FLS images. Experimental results demonstrate that RM R-CNN outperforms state-of-the-art (SOTA) methods, achieving a 10.9% improvement over Mask R-CNN in mask segmentation. Simulation and real AUV experiments indicate that the AUV-WTN framework is effective in generating precise paths and ensuring collision-free navigation in tunnel environments.

Abstract:
Pneumatic soft everting robotic structures have the potential to facilitate human transfer tasks due to their ability to grow underneath humans without sliding friction and their utility as a flexible sling when deflated. Tubular structures naturally yield circular cross-sections when inflated, whereas a robotic sling must be both thin enough to grow between a human and their resting surface and wide enough to cradle the human. Recent works have achieved flattened cross-sections by including rigid components into the structure, but this reduces conformability to the human. We present a method of mechanically programming the cross-section of soft everting robotic structures using flexible strips that constrain radial expansion between points along the outer membrane. Our method enables simultaneously wide and thin inflated profiles, and maintains the full multi-axis flexibility of traditional slings when deflated. We develop and validate a model relating geometric design specifications to fabrication parameters, and experimentally characterize their effects on growth rate. Finally, we prototype a soft growing robotic sling system and demonstrate its use for assisting a single caregiver in bed-to-chair patient transfer.

Abstract:
Speech recognition in Human-Robot Interaction (HRI) fully relies on audio-based Automatic Speech Recognition. However, speech recognition that relies solely on audio faces significant challenges in noisy environments and may lead to poor performance in such environments. One approach to address this is to also use lipreading in combination with traditional speech recognition. Recent work has shown that audiovisual speech recognition (AVSR) can achieve a Word Error Rate (WER) of only 0.9% on the dataset LRS3. In this paper, we assess the potential of combining audio with lipreading on a social robot platform, Pepper, which has not yet been widely tested for AVSR. Given that prior research has focused on non-robotic domains, it remains unclear whether such models can generalize well to social robot environments. We systematically evaluate and compare the performance of established offline and real-time audiovisual models with their audio-only counterparts. The experiments were conducted in both a controlled laboratory setting and a dynamic and noisy public environment. We evaluated the data using WER and also measured the inference latency of real-time models via Real-Time Factor and Words Per Second rates. The results demonstrate real-time performance for audio-only speech recognition across all latency metrics and near real-time performance for models that combine audio with lipreading. We also explored factors that might influence the inference performance of these models to understand how much video contributes to the audio. This includes factors related to (1) environmental and temporal variations, (2) model behavior, and (3) implementation choices. Our findings indicate that for now the audio-only models outperform the audiovisual models on a social robot platform, in contrast to what has been reported in the benchmarked literature. We conclude that more work is still needed to benefit from lipreading in HRI.
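
The metrics named in this abstract can be illustrated with a minimal sketch. WER is computed here with the standard word-level edit distance, and the Real-Time Factor as processing time divided by audio duration. The Words Per Second definition (words decoded per second of processing time) is an assumption, and all function names are hypothetical; none of this is taken from the paper's evaluation code.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the processed reference prefix and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Values below 1.0 mean the model runs faster than real time."""
    return processing_seconds / audio_seconds


def words_per_second(num_words: int, processing_seconds: float) -> float:
    """Assumed definition: decoded words per second of processing time."""
    return num_words / processing_seconds


if __name__ == "__main__":
    wer = word_error_rate("the robot picks up the cup", "the robot pick up cup")
    print(f"WER = {wer:.3f}")
    print(f"RTF = {real_time_factor(0.5, 2.0):.2f}")
```

In the example, one substitution and one deletion against a six-word reference give a WER of 2/6.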

Abstract:
RGB-D has gradually become a crucial data source for understanding complex scenes in assisted driving. However, existing studies have paid insufficient attention to the intrinsic spatial properties of depth maps. This oversight significantly impacts the attention representation, leading to prediction errors caused by attention shift issues. To this end, we propose a novel learnable Depth interaction Pyramid Transformer (DiPFormer) to explore the effectiveness of depth. Firstly, we introduce Depth Spatial-Aware Optimization (Depth SAO) as an offset to represent real-world spatial relationships. Secondly, the similarity in the feature space of RGB-D is learned by Depth Linear Cross-Attention (Depth LCA) to clarify spatial differences at the pixel level. Finally, an MLP Decoder is utilized to effectively fuse multi-scale features while meeting real-time requirements. Comprehensive experiments demonstrate that the proposed DiPFormer significantly addresses the issue of attention misalignment in both road detection (+7.5%) and semantic segmentation (+4.9% / +1.5%) tasks. DiPFormer achieves state-of-the-art performance on the KITTI (97.57% F-score on KITTI road and 68.74% mIoU on KITTI-360) and Cityscapes (83.4% mIoU) datasets.

Abstract:
Compliance is a useful parametrization of tactile information that humans often utilize in manipulation tasks. It can be used to inform low-level contact-rich actions or characterize objects at a high-level. In robotic manipulation, existing approaches to estimate compliance have struggled to generalize across both object shape and material. Using camera-based tactile sensors, proprioception, and force measurements, we present a novel approach to estimate object compliance as Young’s modulus E from parallel grasps. We evaluate our method over a novel dataset of 285 common objects, including a wide array of shapes and materials with Young’s moduli ranging from 5.0 kPa to 250 GPa. Combining analytical and data-driven approaches, we develop a hybrid system using a multi-tower neural network to analyze a sequence of tactile images from grasping. This system is shown to estimate the Young’s modulus of unseen objects within an order of magnitude at 74.2% accuracy across our dataset. This is an improvement over purely analytical and data-driven baselines which exhibit 28.9% and 65.0% accuracy respectively. Importantly, this estimation system performs irrespective of object geometry and demonstrates increased robustness across material types. Code is available on GitHub and collected data is available on HuggingFace.
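
To make the "within an order of magnitude" accuracy figure concrete, the sketch below counts a prediction as correct when the base-10 logarithm of the ratio between predicted and true modulus has magnitude at most 1. This criterion is an assumed reading of the abstract, the function names are hypothetical, and the sample values are invented for illustration only.

```python
import math


def within_order_of_magnitude(predicted_pa: float, true_pa: float) -> bool:
    """True when the prediction is within a factor of 10 of the ground truth."""
    return abs(math.log10(predicted_pa / true_pa)) <= 1.0


def order_of_magnitude_accuracy(pairs) -> float:
    """Fraction of (predicted, true) pairs that fall within one decade."""
    hits = sum(within_order_of_magnitude(pred, true) for pred, true in pairs)
    return hits / len(pairs)


if __name__ == "__main__":
    # Illustrative (predicted, true) Young's moduli in pascals; not real data.
    samples = [(1e6, 5e6), (1e4, 5e6), (2e9, 3e8)]
    print(order_of_magnitude_accuracy(samples))
```

Working in log space suits the range quoted in the abstract (5.0 kPa to 250 GPa), since an absolute error threshold would be meaningless across so many decades.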

Abstract:
This work aims to interpret human behavior to anticipate potential user confusion when a robot provides explanations for failure, allowing the robot to adapt its explanations for more natural and efficient collaboration. Using a dataset [1] that included facial emotion detection, eye gaze estimation, and gestures from 55 participants in a user study [2], we analyzed how human behavior changed in response to different types of failures and varying explanation levels. Our goal is to assess whether human collaborators are ready to accept less detailed explanations without inducing confusion. We formulate a data-driven predictor to predict human confusion during robot failure explanations. We also propose and evaluate a mechanism, based on the predictor, to adapt the explanation level according to observed human behavior. The promising results from this evaluation indicate the potential of this research in adapting a robot’s explanations for failures to enhance the collaborative experience.

Abstract:
6-DoF pose estimation is a fundamental task in computer vision with wide-ranging applications in augmented reality and robotics. Existing single RGB-based methods often compromise accuracy due to their reliance on initial pose estimates and susceptibility to rotational ambiguity, while approaches requiring depth sensors or multi-view setups incur significant deployment costs. To address these limitations, we introduce SplatPose, a novel framework that synergizes 3D Gaussian Splatting (3DGS) with a dual-branch neural architecture to achieve high-precision pose estimation using only a single RGB image. Central to our approach is the Dual-Attention Ray Scoring Network (DARS-Net), which innovatively decouples positional and angular alignment through geometry-domain attention mechanisms, explicitly modeling directional dependencies to mitigate rotational ambiguity. Additionally, a coarse-to-fine optimization pipeline progressively refines pose estimates by aligning dense 2D features between query images and 3DGS-synthesized views, effectively correcting feature misalignment and depth errors from sparse ray sampling. Experiments on three benchmark datasets demonstrate that SplatPose achieves state-of-the-art 6-DoF pose estimation accuracy in single RGB settings, rivaling approaches that depend on depth or multi-view images.

Abstract:
Achieving zero-shot peg insertion, in which an arbitrary peg is inserted into an unseen hole without task-specific training, remains a fundamental challenge in robotics. This task demands a highly generalizable perception system capable of detecting potential holes, selecting the correct mating hole from multiple candidates, estimating its precise pose, and executing insertion despite uncertainties. While learning-based methods have been applied to peg insertion, they often fail to generalize beyond the specific peg-hole pairs encountered during training. Recent advancements in Vision-Language Models (VLMs) offer a promising alternative, leveraging large-scale datasets to enable robust generalization across diverse tasks. Inspired by their success, we introduce a novel zero-shot peg insertion framework that utilizes a VLM to identify mating holes and estimate their poses without prior knowledge of their geometry. This approach assumes a known peg pose and a leveled surface for insertion. Extensive experiments demonstrate that our method achieves 90.2% accuracy, significantly outperforming baselines in identifying the correct mating hole across a wide range of previously unseen peg-hole pairs, including 3D-printed objects, toy puzzles, and industrial connectors. Furthermore, we validate the effectiveness of our approach in a real-world connector insertion task on the back panel of a PC, where our system successfully detects holes, identifies the correct mating hole, estimates its pose, and completes the insertion with a success rate of 88.3%. These results highlight the potential of VLM-driven zero-shot reasoning for enabling robust and generalizable robotic assembly.

Abstract:
Conventional soft sensors often suffer from challenges such as crosstalk, hysteresis, and limited sensitivity, which hinder their performance and broader applicability. This paper presents a multi-axis piezoresistive soft force sensor with a square-column-shaped sensing structure designed to reduce the spatial footprint and mitigate partial axial coupling effects. By integrating a Wheatstone bridge-based resistive compensation strategy, the sensor achieves self-decoupling in multi-axis force measurements. Furthermore, a generalized Preisach hysteresis model is implemented to effectively compensate for hysteresis-induced nonlinearities and input-output loop effects, significantly enhancing sensing accuracy and precision. Extensive experimental validations confirm the effectiveness of the proposed self-decoupling and hysteresis compensation methodologies, demonstrating notable improvements in sensor reliability and performance. The findings of this study establish a comprehensive framework for advancing multi-dimensional soft force sensing technologies, with promising implications for high-precision engineering and biomedical applications.

Abstract:
3D object generation from a single unposed RGB image is essential for robotic perception: reconstructing complete geometry and texture underpins precise manipulation, grasping, and scene understanding, which are key to autonomous navigation and dexterous interaction. Recent advancements in image-to-3D employ Gaussian Splatting with pre-trained 2D or 3D diffusion models, but a disparity exists: 2D models generate high-fidelity textures yet lack geometric consistency, while 3D models ensure structural coherence but produce overly smooth textures. To address this, we introduce a two-stage frequency-based distillation loss integrated with Gaussian Splatting, leveraging geometric priors from a 3D diffusion model’s low-frequency spectrum for structural consistency and a 2D diffusion model’s high-frequency details for sharper textures. Our approach achieves state-of-the-art 3D reconstruction quality, significantly improving robotic perception pipelines. Additionally, we demonstrate the easy adaptability of our method for highly accurate object pose estimation and tracking, which is critical for precise robotic grasping, manipulation, and scene understanding. Additional results can be found in the supplementary file.

Abstract:
Robot navigation in large, complex, and unknown indoor environments is a challenging problem. The existing approaches, such as traditional sampling-based methods, struggle with resolution control and scalability, while imitation learning-based methods require a large amount of demonstration data. Active Neural Time Fields (ANTFields) have recently emerged as a promising solution by using local observations to learn cost-to-go functions without relying on demonstrations. Despite their potential, these methods are hampered by challenges such as spectral bias and catastrophic forgetting, which diminish their effectiveness in complex scenarios. To address these issues, our approach decomposes the planning problem into a hierarchical structure. At the high level, a sparse graph captures the environment’s global connectivity, while at the low level, a planner based on neural fields navigates local obstacles by solving the Eikonal PDE. This physics-informed strategy overcomes common pitfalls like spectral bias and neural field fitting difficulties, resulting in a smooth and precise representation of the cost landscape. We validate our framework in large-scale environments, demonstrating its enhanced adaptability and precision compared to previous methods, and highlighting its potential for online exploration, mapping, and real-world navigation. https://sites.google.com/view/mntfields/home

Abstract:
This paper presents a novel dual tiltrotor UAV design featuring foldable wings and a strategically positioned center of gravity (CG) to enable passive perching and multi-modal flight. Traditional UAVs rely on additional mechanical components for operations such as takeoff and perching, which increase weight and complexity. Inspired by the mechanics of a balanced bird toy, our design achieves stability in horizontal flight and secure power-off perching on branches or cables. The proposed blade-tip plane-based controller facilitates belly/back takeoff without landing gear, enabling seamless transitions between hovering and horizontal flight within 2 seconds. Wind resistance tests were conducted indoors to assess disturbance rejection capabilities during perching, while the transition performance was evaluated outdoors.

Abstract:
In this paper, we present SeGMan, a hybrid motion planning framework that integrates sampling-based and optimization-based techniques with a guided forward search to address complex, constrained sequential manipulation challenges, such as pick-and-place puzzles. SeGMan incorporates an adaptive subgoal selection method that adjusts the granularity of subgoals, enhancing overall efficiency. Furthermore, the proposed generalizable heuristics guide the forward search in a more targeted manner. Extensive evaluations in maze-like tasks populated with numerous objects and obstacles demonstrate that SeGMan not only generates consistent and computationally efficient manipulation plans but also outperforms state-of-the-art approaches. https://sites.google.com/view/segman-lira/

Abstract:
Integrating Large Language Models (LLMs) into modern robotic systems presents significant computational and energy constraint challenges, particularly for human-centered robotic applications. This paper presents a novel hardware optimization technique for deploying LLMs on resource-constrained embedded devices, achieving up to a 77% reduction in computational latency through an FPGA implementation in comparison to other popular embedded computing devices (e.g., CPUs and GPUs). Additionally, we demonstrate our methodology by deploying a LLaMA 2-7B model on a Unitree Go2 robotic dog integrated with the proposed FPGA platform. The proposed optimization framework preserves real-time interaction capabilities while significantly reducing computational and energy overhead, facilitating efficient natural language processing for human-robot interaction in safety-critical and dynamic environments. Experimental results demonstrate that the FPGA-based LLaMA 2-7B implementation achieves up to 6.06-fold and 1.95-fold higher throughput compared to baseline CPU and GPU implementations, respectively, while maintaining comparable inference accuracy. Furthermore, the proposed FPGA design surpasses existing state-of-the-art FPGA implementations, delivering a 30% improvement in computational efficiency.

Abstract:
A LiDAR-IMU fusion system utilizing adaptive scanning is developed for high-resolution deformation monitoring of underground coal mine infrastructure, such as sealed walls. The system integrates data from a LiDAR scanner and an IMU, employing a penalty function-based scanning strategy to optimize point cloud quality. Following feature extraction and state estimation, a 3D point cloud model of the sealed wall is constructed. Deformation monitoring is achieved through point cloud segmentation, registration, and error analysis across multiple time intervals. A methodology for optimizing equipment placement on walls of varying dimensions is proposed to efficiently capture deformation details. Two metrics, PATD and PARE, are introduced to evaluate system performance. Calibration experiments using standardized boards and blocks are designed to determine optimal monitoring parameters, including distance, height, and sampling frequency. Simulated deformation experiments under real-world conditions validate the system’s rationality and accuracy.

Abstract:
Extrinsic calibration is essential for multi-sensor fusion, but existing methods rely on structured targets or fully-excited data, limiting real-world applicability. Online calibration further suffers from weak excitation, leading to unreliable estimates. To address these limitations, we propose a reinforcement learning (RL)-based extrinsic calibration framework that formulates extrinsic calibration as a decision-making problem and directly optimizes SE(3) extrinsics to enhance odometry accuracy. Our approach leverages a probabilistic Bingham distribution to model 3D rotations, ensuring stable optimization while inherently retaining quaternion symmetry. A trajectory alignment reward mechanism enables robust calibration without structured targets by quantitatively evaluating the estimated tightly-coupled trajectory against a reference trajectory. Additionally, an automated data selection module filters uninformative samples, significantly improving efficiency and scalability for large-scale datasets. Extensive experiments on UAVs, UGVs, and handheld platforms demonstrate that our method outperforms traditional optimization-based approaches, achieving high-precision calibration even under weak excitation conditions. Our framework simplifies deployment on diverse robotic platforms by eliminating the need for high-quality initial extrinsics and enabling calibration from routine operating data. The code is available at https://github.com/APRIL-ZJU/learn-to-calibrate.

Abstract:
With the increasing use of surgical robots in clinical practice, enhancing their ability to process multimodal medical images has become a key research challenge. Although traditional medical image fusion methods have made progress in improving fusion accuracy, they still face significant challenges in real-time performance, fine-grained feature extraction, and edge preservation. In this paper, we introduce TTTFusion, a Test-Time Training (TTT)-based image fusion strategy that dynamically adjusts model parameters during inference to efficiently fuse multimodal medical images. By adapting the model during the test phase, our method optimizes the parameters based on the input image data, leading to improved accuracy and better detail preservation in the fusion results. Experimental results demonstrate that TTTFusion significantly enhances the fusion quality of multimodal images compared to traditional fusion methods, particularly in fine-grained feature extraction and edge preservation. This approach not only improves image fusion accuracy but also offers a novel technical solution for real-time image processing in surgical robots.

Abstract:
We present GOEN, an advanced navigation and path planning framework specifically engineered to tackle the complexities of dynamic and unstructured environments through real-time 3D pointcloud processing. Our approach integrates pointcloud downsampling, collision risk assessment, and obstacle endpoint extraction to generate intermediate waypoints. These waypoints are refined iteratively through multi-stage safety validation and cubic-spline interpolation, resulting in kinematically feasible and collision-free trajectories. The system exhibits minimal computational overhead, achieving a planning latency below 10 milliseconds, thereby demonstrating suitability for deployment in resource-constrained scenarios. Extensive empirical evaluations in simulated environments and on quadruped robots demonstrate the framework’s robustness in dynamically identifying optimal navigation paths across unstructured terrains. Compared with baseline navigation methods, GOEN shows significant advantages across navigation performance metrics. Experimental validation confirms GOEN’s capability to balance trajectory optimality, safety constraints, and real-time responsiveness, providing an innovative solution for autonomous navigation in complex environments.
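
The cubic-spline interpolation step mentioned in this abstract can be sketched as follows. This is a textbook natural cubic spline applied independently to the x and y coordinates of 2D waypoints, parameterized by waypoint index. The boundary conditions, the parameterization, and the function names are assumptions for illustration, not details from GOEN.

```python
from bisect import bisect_right


def natural_cubic_spline(ts, ys):
    """Return a function evaluating the natural cubic spline through (ts, ys)."""
    n = len(ts) - 1
    if n < 1 or len(ys) != len(ts):
        raise ValueError("need at least two knots and matching lengths")
    h = [ts[i + 1] - ts[i] for i in range(n)]
    if any(step <= 0 for step in h):
        raise ValueError("knots must be strictly increasing")
    # Right-hand side of the tridiagonal system for the quadratic coefficients c.
    alpha = [0.0] * (n + 1)
    for i in range(1, n):
        alpha[i] = 3.0 / h[i] * (ys[i + 1] - ys[i]) - 3.0 / h[i - 1] * (ys[i] - ys[i - 1])
    # Forward sweep; natural boundary conditions fix c[0] = c[n] = 0.
    l = [1.0] * (n + 1)
    mu = [0.0] * (n + 1)
    z = [0.0] * (n + 1)
    for i in range(1, n):
        l[i] = 2.0 * (ts[i + 1] - ts[i - 1]) - h[i - 1] * mu[i - 1]
        mu[i] = h[i] / l[i]
        z[i] = (alpha[i] - h[i - 1] * z[i - 1]) / l[i]
    # Back substitution for the per-segment coefficients b, c, d.
    b = [0.0] * n
    c = [0.0] * (n + 1)
    d = [0.0] * n
    for j in range(n - 1, -1, -1):
        c[j] = z[j] - mu[j] * c[j + 1]
        b[j] = (ys[j + 1] - ys[j]) / h[j] - h[j] * (c[j + 1] + 2.0 * c[j]) / 3.0
        d[j] = (c[j + 1] - c[j]) / (3.0 * h[j])

    def evaluate(t):
        # Pick the segment containing t, clamping to the first/last segment.
        j = min(max(bisect_right(ts, t) - 1, 0), n - 1)
        dt = t - ts[j]
        return ys[j] + b[j] * dt + c[j] * dt ** 2 + d[j] * dt ** 3

    return evaluate


def smooth_path(waypoints, samples_per_segment=10):
    """Interpolate 2D waypoints with one spline per coordinate."""
    ts = [float(i) for i in range(len(waypoints))]
    sx = natural_cubic_spline(ts, [p[0] for p in waypoints])
    sy = natural_cubic_spline(ts, [p[1] for p in waypoints])
    total = (len(waypoints) - 1) * samples_per_segment
    return [(sx(k / samples_per_segment), sy(k / samples_per_segment))
            for k in range(total + 1)]


if __name__ == "__main__":
    path = smooth_path([(0, 0), (1, 2), (3, 3), (4, 0)], samples_per_segment=5)
    print(len(path), path[0], path[-1])
```

The resulting path passes through every waypoint with continuous first and second derivatives. A real planner would still need to re-check the interpolated samples for collisions, since the spline can bulge away from the straight segments between waypoints.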

Abstract:
To address the challenges posed by lighting variations and fruit occlusion in open orchard environments, which significantly affect the performance of apple-harvesting robots, this study proposes an apple detection method based on the fusion of infrared thermal images and visible-light images. First, an edge feature-based registration technique was employed to achieve precise alignment of infrared and visible-light images. Subsequently, an improved YOLOv8s model integrated with the SeAFusion framework was utilized to facilitate efficient apple detection. Experimental results revealed that the proposed method achieved 94.9% mean accuracy and 89.5% mean recall across diverse illumination scenarios (normal light, strong light, and backlight), surpassing visible-light-only detection by 0.6%, 4.8%, and 3.5% in apple count accuracy under the respective conditions. The proposed method establishes a robust framework for vision-based harvesting robots, significantly improving operational reliability in complex orchard environments and providing technical foundations for scalable agricultural automation.

Abstract:
With the increasing demand for human-computer interaction (HCI), flexible wearable gloves have emerged as a promising solution in virtual reality, medical rehabilitation, and industrial automation. However, current technology still suffers from problems such as insufficient sensitivity and limited durability, which hinder its wide application. This paper presents a highly sensitive, modular, and flexible capacitive sensor based on line-shaped electrodes and liquid metal (EGaIn), integrated into a sensor module tailored to the human hand’s anatomy. The proposed system independently captures bending information from each finger joint, while additional measurements between adjacent fingers enable the recording of subtle variations in inter-finger spacing. This design enables accurate gesture recognition and dynamic hand morphological reconstruction of complex movements using point clouds. Experimental results demonstrate that our classifier based on a Convolutional Neural Network (CNN) and a Multilayer Perceptron (MLP) achieves an accuracy of 99.15% across 30 gestures. Meanwhile, a transformer-based Deep Neural Network (DNN) accurately reconstructs dynamic hand shapes with an Average Distance (AD) of 2.076±3.231 mm, with the reconstruction accuracy at individual key points surpassing SOTA benchmarks by 9.7% to 64.9%. The proposed glove shows excellent accuracy, robustness, and scalability in gesture recognition and hand reconstruction, making it a promising solution for next-generation HCI systems.

Abstract:
Online path planning for magnetic microrobots actuated by an electromagnetic system in dynamic flow fields presents significant challenges due to time-varying fluid dynamics, energy constraints, and collision risks. Traditional path planning approaches, which often rely on static flow assumptions or simplified geometric models, struggle to balance energy efficiency, path continuity, and adaptability in real-world scenarios. This paper introduces an end-to-end path planner for energy-efficient and collision-free navigation of magnetic helical microrobots, integrating flow field feature extraction with a reinforcement learning (RL) framework. Our method employs a transformer encoder to capture contextual correlations of the flow field and uses a Soft Actor-Critic (SAC) framework to optimize energy consumption while ensuring dynamic obstacle avoidance. Simulations and experiments in dynamic flow environments validate our approach, demonstrating 14.7% lower energy consumption and robust collision avoidance in several different test scenarios.

Abstract:
Reinforcement learning (RL) agents need to explore their environment to learn optimal behaviors and achieve maximum rewards. However, exploration can be risky when training RL directly on real systems, while simulation-based training introduces the tricky issue of the sim-to-real gap. Recent approaches have leveraged safety filters, such as control barrier functions (CBFs), to penalize unsafe actions during RL training. However, the strong safety guarantees of CBFs rely on a precise dynamic model. In practice, uncertainties always exist, including internal disturbances from the errors of dynamics and external disturbances such as wind. In this work, we propose a novel safe RL framework built on a robust CBF, where the discrepancy between the nominal and true dynamic models is quantified through a combination of disturbance observation and residual model learning. We demonstrate our results on the Safety-gym benchmark for Point and Car robots on all tasks where we can outperform state-of-the-art approaches that use only residual model learning or a disturbance observer (DOB). We further validate the efficacy of our framework using a physical F1/10 racing car. Videos: https://sites.google.com/view/res-dob-cbf-rl
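
A minimal sketch of the safety-filter idea follows, for a one-dimensional single integrator with a bounded additive disturbance. The barrier function h(x) = x_max - x, the gain `alpha`, and the fixed worst-case bound `d_bar` are illustrative assumptions. The paper estimates the model discrepancy with a disturbance observer and residual model learning, whereas this sketch substitutes a constant bound for that estimate.

```python
def robust_cbf_filter(u_nominal, x, x_max, alpha, d_bar):
    """Clip the nominal input so that h(x) = x_max - x stays non-negative.

    Dynamics (assumed): x_dot = u + d with |d| <= d_bar.
    The CBF condition h_dot >= -alpha * h reads -(u + d) >= -alpha * (x_max - x),
    which holds for every admissible d when u <= alpha * (x_max - x) - d_bar.
    """
    u_upper = alpha * (x_max - x) - d_bar
    return min(u_nominal, u_upper)


def simulate(x0, x_max, u_nominal, disturbance, alpha, d_bar, dt, steps):
    """Forward-Euler rollout with the filter applied at every step."""
    x = x0
    trajectory = [x]
    for _ in range(steps):
        u = robust_cbf_filter(u_nominal, x, x_max, alpha, d_bar)
        x = x + dt * (u + disturbance)
        trajectory.append(x)
    return trajectory


if __name__ == "__main__":
    # The nominal policy pushes toward the boundary while the worst-case
    # disturbance acts in the same direction.
    traj = simulate(x0=0.0, x_max=1.0, u_nominal=1.0, disturbance=0.2,
                    alpha=2.0, d_bar=0.2, dt=0.01, steps=1000)
    print(f"max position reached: {max(traj):.4f}")
```

With `alpha * dt <= 1`, the discrete-time rollout keeps the state at or below `x_max` even under the worst-case disturbance. A tighter disturbance estimate would allow a smaller `d_bar` and therefore less conservative clipping.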

Abstract:
Smart sensors and Vehicle-To-Everything (V2X) modules are commonly utilized in automotive perception systems, which primarily provide processed object lists rather than raw data. However, high-level fusion approaches suffer from significant information loss and representational misalignment due to the inherently abstract and sparse nature of these high-level outputs. We propose a novel cross-level fusion paradigm that enables bidirectional information flow between object lists and raw vision features within an end-to-end Transformer framework for 3D object detection and tracking. Our approach extracts inherent positional and dimensional cues from object lists to generate two outputs: structured query features that are fused with the initial learnable queries in the Transformer decoder, and soft Gaussian attention masks that guide feature extraction. This integrated mechanism not only improves tracking accuracy by synergistically combining object priors with fine-grained vision data but also promotes hardware economy and AI model sustainability by adapting legacy sensors to evolving sensor setups. To overcome the lack of dedicated datasets, we develop a pseudo object list generation pipeline that simulates realistic sensor tracking behavior. Experiments on the nuScenes dataset demonstrate significant performance gains over vision-only baselines and robust generalization across diverse noise levels, validating the efficacy of our cross-level fusion strategy. The code is available at: https://github.com/CesarLiu/DNF.git.

Abstract:
In this paper, we address the problem of tracking high-speed agile trajectories for Unmanned Aerial Vehicles (UAVs), where model inaccuracies can lead to large tracking errors. Existing Nonlinear Model Predictive Controller (NMPC) methods typically neglect the dynamics of the low-level flight controllers, such as the underlying PID controllers present in many flight stacks, which results in suboptimal tracking performance at high speeds and accelerations. To this end, we propose a novel NMPC formulation, LoL-NMPC, which explicitly incorporates low-level controller dynamics and motor dynamics in order to minimize trajectory tracking errors while maintaining computational efficiency. By leveraging linear constraints inside low-level dynamics, our approach inherently accounts for actuator constraints without requiring additional reallocation strategies. The proposed method is validated in both simulation and real-world experiments, demonstrating improved tracking accuracy and robustness at speeds up to 98.57 km/h and accelerations of 3.5 g. Our results show an average 21.97% reduction in trajectory tracking error over the standard NMPC formulation, with LoL-NMPC maintaining real-time feasibility at 100 Hz on an embedded ARM-based flight computer.

Abstract:
Ensuring the safety of various real-world applications based on reinforcement learning (RL), such as quadcopter control, robotic manipulators, and autonomous robots, remains a critical challenge, despite RL’s remarkable success in solving complex decision-making tasks. Existing on-policy Lagrangian optimization methods in safe RL typically use a single policy to balance the trade-off between safety and return without taking the potential benefits of adopting multiple policies into account. In this paper, a new on-policy method is proposed, named Safe-Adjusted Policy Optimization (SAPO), which is a dual-policy framework designed to address safety constraint violations in RL. By incorporating a cost-oriented policy to dynamically adjust a reward-oriented policy, SAPO effectively resolves the trade-off between safety and return. Moreover, to enhance performance on high-dimensional tasks, the Kullback-Leibler (KL) divergence and a Gaussian kernel are employed in the distance functions to facilitate training. In addition, a quadcopter safe-navigation task is designed to overcome a drawback of previous research on quadcopter safe navigation with RL, which attends only to reward function design without considering policy-level optimization. Finally, experimental results verify the feasibility of the designed task. Tests on a real device further indicate that the proposed algorithm is easy to implement, offers performance guarantees, and outperforms existing safe RL baselines.

Abstract:
This paper presents a novel spherical target-based LiDAR-camera extrinsic calibration method designed for outdoor environments with multi-robot systems, considering both target and sensor corruption. The method extracts the 2D ellipse center from the image and the 3D sphere center from the pointcloud, which are then paired to compute the transformation matrix. Specifically, the image is first decomposed using the Segment Anything Model (SAM). Then, a novel algorithm extracts an ellipse from a potentially corrupted sphere, and the extracted ellipse’s center is corrected for errors caused by the perspective projection model. For the LiDAR pointcloud, points on the sphere tend to be highly noisy due to the absence of flat regions. To accurately extract the sphere from these noisy measurements, we apply a hierarchical weighted sum to the accumulated pointcloud. Through experiments, we demonstrated that the sphere can be robustly detected even under both types of corruption, outperforming other targets. We evaluated our method using three different types of LiDARs (spinning, solid-state, and non-repetitive) with cameras positioned in three different locations. Furthermore, we validated the robustness of our method to target corruption by experimenting with spheres subjected to various types of degradation. These experiments were conducted in both a planetary test and a field environment. Our code is available at https://github.com/sparolab/MARSCalib.
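
The 3D sphere-center extraction mentioned above can be illustrated with a plain algebraic least-squares sphere fit. This is only a minimal sketch of the general idea and is not the hierarchical weighted-sum procedure the abstract describes; the function names `solve_linear` and `fit_sphere` are hypothetical, and the fit assumes the input points are not coplanar.

```python
import math


def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for i in range(n - 1, -1, -1):
        s = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - s) / m[i][i]
    return x


def fit_sphere(points):
    """Algebraic least-squares sphere fit; returns (center, radius).

    Model (an assumption of this sketch, not taken from the paper):
    x^2 + y^2 + z^2 = 2a x + 2b y + 2c z + d, with d = r^2 - a^2 - b^2 - c^2.
    """
    ata = [[0.0] * 4 for _ in range(4)]  # normal-equation matrix A^T A
    atb = [0.0] * 4                      # right-hand side A^T b
    for x, y, z in points:
        row = (2 * x, 2 * y, 2 * z, 1.0)
        rhs = x * x + y * y + z * z
        for i in range(4):
            atb[i] += row[i] * rhs
            for j in range(4):
                ata[i][j] += row[i] * row[j]
    a, b, c, d = solve_linear(ata, atb)
    return (a, b, c), math.sqrt(d + a * a + b * b + c * c)
```

With noisy LiDAR returns, an unweighted fit like this is sensitive to outliers, which is presumably why a weighted scheme over an accumulated pointcloud is used in the paper.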

Abstract:
Loop closure is critical in Simultaneous Localization and Mapping (SLAM) systems to reduce accumulative drift and ensure global mapping consistency. However, conventional methods struggle in perceptually aliased environments, such as narrow pipes, due to vector quantization, feature sparsity, and repetitive textures, while existing solutions often incur high computational costs. This paper presents Bag-of-Word-Groups (BoWG), a novel loop closure detection method that achieves superior precision-recall, robustness, and computational efficiency. The core innovation lies in the introduction of word groups, which capture the spatial co-occurrence and proximity of visual words to construct an online dictionary. Additionally, drawing inspiration from probabilistic transition models, we incorporate temporal consistency directly into similarity computation with an adaptive scheme, substantially improving precision-recall performance. The method is further strengthened by a feature distribution analysis module and dedicated post-verification mechanisms. To evaluate the effectiveness of our method, we conduct experiments on both public datasets and a confined-pipe dataset we constructed. Results demonstrate that BoWG surpasses state-of-the-art methods, including both traditional and learning-based approaches, in terms of precision-recall and computational efficiency. Our approach also exhibits excellent scalability, achieving an average processing time of 16 ms per image across 17,565 images in the Bicocca25b dataset. The source code is available at: https://github.com/EdgarFx/BoWG.

Abstract:
When pushing the speed limit for aggressive off-road navigation on uneven terrain, it is inevitable that vehicles may become airborne from time to time. During time-sensitive tasks, being able to fly over challenging terrain can also save time, instead of cautiously circumventing or slowly negotiating through. However, most off-road autonomy systems operate under the assumption that the vehicles are always on the ground and therefore limit operational speed. In this paper, we present a novel approach for in-air vehicle maneuver during high-speed off-road navigation. Based on a hybrid forward kinodynamic model using both physics principles and machine learning, our fixed-horizon, sampling-based motion planner ensures accurate vehicle landing poses and their derivatives within a short airborne time window using vehicle throttle and steering commands. We test our approach in extensive in-air experiments both indoors and outdoors, compare it against an error-driven control method, and demonstrate that precise and timely in-air vehicle maneuver is possible through existing ground vehicle controls.

Abstract:
Long bone fractures are common clinical conditions, yet the development of robot systems for closed reduction surgery remains in its early stages. The key challenge in this field is the lack of an efficient and precise path planning algorithm. To address this issue, this study proposes an improved A-star (A*) algorithm for path planning to enhance the accuracy and efficiency of fracture reduction. The algorithm begins by expanding a random node using the fundamental A* algorithm. An artificial potential field (APF) algorithm is then incorporated to optimize the generation of sample nodes and enhance obstacle avoidance. Additionally, a cylindrical bounding box method is employed for collision detection, and a B-spline curve is utilized to smooth the generated path. The experimental validation is conducted on a fracture reduction robot system, demonstrating that the optimized path achieves clinically acceptable accuracy, significantly enhancing the precision and reliability of the reduction procedure.
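
One simple way to combine A* with an artificial potential field is to add a repulsive penalty to the step cost, so the search prefers cells farther from obstacles. The sketch below does this on a toy 2D grid; it is an illustrative assumption, not the authors' algorithm (which also involves random node expansion, cylindrical bounding boxes, and B-spline smoothing). The names `repulsive` and `astar_apf` and the parameters `radius` and `gain` are hypothetical.

```python
import heapq


def repulsive(cell, obstacles, radius=2.0, gain=1.0):
    """APF-style repulsive cost: grows as a cell gets closer to an obstacle."""
    cost = 0.0
    for ox, oy in obstacles:
        d = ((cell[0] - ox) ** 2 + (cell[1] - oy) ** 2) ** 0.5
        if 0 < d < radius:
            cost += gain * (1.0 / d - 1.0 / radius)
    return cost


def astar_apf(start, goal, obstacles, size):
    """A* on a size x size 4-connected grid with a potential-field penalty."""
    blocked = set(obstacles)

    def h(c):
        # Manhattan distance; admissible because every step costs at least 1.
        return abs(c[0] - goal[0]) + abs(c[1] - goal[1])

    open_heap = [(h(start), 0.0, start)]
    best = {start: 0.0}
    parent = {start: None}
    while open_heap:
        _, g, cur = heapq.heappop(open_heap)
        if cur == goal:
            path = []
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            return path[::-1]
        if g > best[cur]:
            continue  # stale heap entry, a cheaper route was already found
        for dx, dy in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nxt = (cur[0] + dx, cur[1] + dy)
            if not (0 <= nxt[0] < size and 0 <= nxt[1] < size) or nxt in blocked:
                continue
            ng = g + 1.0 + repulsive(nxt, obstacles)
            if ng < best.get(nxt, float("inf")):
                best[nxt] = ng
                parent[nxt] = cur
                heapq.heappush(open_heap, (ng + h(nxt), ng, nxt))
    return None  # goal unreachable
```

For example, `astar_apf((0, 0), (4, 4), [(2, 2)], 5)` returns a list of grid cells from start to goal that avoids the blocked cell, or `None` when no route exists.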

Abstract:
This paper presents a novel approach for time-optimal path parameterization based on reachability analysis for robotic systems with viscous friction in the dynamics and jerk constraints. The main step of the method is the backward propagation of controllable sets through a linear second-order system. In order to avoid the unbounded growth of the number of constraints, the sets are approximated by a ray shooting algorithm. Using a convex relaxation, the required set expansion can be solved with second-order cone programming. Evaluation results for a 6-degree of freedom (DOF) robot arm highlight the advantages of the method for computing jerk-limited trajectories.

Abstract:
Floorplan reconstruction provides structural priors essential for reliable indoor robot navigation and high-level scene understanding. However, existing approaches either require time-consuming offline processing with a complete map, or rely on expensive sensors and substantial computational resources. To address these problems, we propose Floorplan-SLAM, which incorporates floorplan reconstruction tightly into a multi-session SLAM system by seamlessly interacting with plane extraction, pose estimation, back-end optimization, and loop & map merging, achieving real-time, high-accuracy, and long-term floorplan reconstruction using only a stereo camera. Specifically, we present a robust plane extraction algorithm that operates in a compact plane parameter space and leverages spatially complementary features to accurately detect planar structures, even in weakly textured scenes. Furthermore, we propose a floorplan reconstruction module tightly coupled with the SLAM system, which uses continuously optimized plane landmarks and poses to formulate and solve a novel optimization problem, thereby enabling real-time and high-accuracy floorplan reconstruction. Note that by leveraging the map merging capability of multi-session SLAM, our method supports long-term floorplan reconstruction across multiple sessions without redundant data collection. Experiments on the VECtor and the self-collected datasets indicate that Floorplan-SLAM significantly outperforms state-of-the-art methods in terms of plane extraction robustness, pose estimation accuracy, and floorplan reconstruction fidelity and speed, achieving real-time performance at 25–45 FPS without GPU acceleration, which reduces the floorplan reconstruction time for a 1000 m² scene from 16 hours and 44 minutes to just 9.4 minutes.

Abstract:
Collision-free robotic manipulation is extremely important for all safety-critical applications of robots. Especially for large-scale automation in modern manufacturing facilities where numerous hardware and software systems collaborate in relatively structured environments, accomplishing effectiveness, efficiency, and safety in not only repetitive tasks but also their sporadic reconfigurations is ideal. Yet, existing online and offline Motion Planning (MP) algorithms do not meet such a unique combination of harsh demands, since most of the advances in MP aim for a subset of the requirements. To bridge the gap, we introduce a novel implicit neural function (EASEIR) designed for efficient offline safe set composition for robotic manipulators operating in structured environments. Addressing the challenges of managing high-dimensional configuration spaces (C-space), EASEIR leverages Implicit Neural Representations (INR) to relate coordinates of a discretized robot operation space with collision sets in C-space. EASEIR then utilizes the mapping to actively compose a collision-free set in response to arbitrary occupancy of the operation space by obstacles. The proposed method comprises three core modules: (a) Latent Key Generator (LKG) that maps the coordinates of the space to intermediate latent keys, (b) Latent Key Decoder (LKD) that reconstructs collision sets from the keys, and (c) Full Set Compositor (FSC) that generates a full collision-free set using set operations. On a 6 Degrees of Freedom (DoF) arm, EASEIR generates safe configuration sets nearly 43 times faster than the state-of-the-art analytical method while maintaining comparable accuracy (∼ 0.2% collision) during evaluations in a simulation environment.

Abstract:
Robotic manipulation of deformable objects remains challenging due to the high-dimensional configuration space and complex dynamics. In this work, we demonstrate how the abstraction level used for modeling deformable objects can significantly impact the difficulty of the motion planning problem. We specifically focus on buckling, a nonlinear instability phenomenon that arises in response to compression of slender deformable objects. Using deformable linear objects (DLOs) as a case study, we show that eliminating resistance to compression in the simulation model while penalizing compressed states in the planning objective increases both robustness and performance. We demonstrate our approach on a set of simulation examples and validate our results through physical robot experiments.

Abstract:
Imitation learning frameworks for robotic manipulation have drawn attention in the recent development of language model grounded robotics. However, the success of the frameworks largely depends on the coverage of the demonstration cases: When the demonstration set does not include examples of how to act in all possible situations, the action may fail and can result in cascading errors. To solve this problem, we propose a framework that uses serialized Finite State Machine (FSM) to generate demonstrations and improve the success rate in manipulation tasks requiring a long sequence of precise interactions. To validate its effectiveness, we use environmentally evolving and long-horizon puzzles that require long sequential actions. Experimental results show that our approach achieves a success rate of up to 98% in these tasks, compared to the controlled condition using existing approaches, which only had a success rate of up to 60%, and, in some tasks, almost failed completely. The source code for this project can be accessed at https://imitate.finite-state.com/.

Abstract:
Cotraining with demonstration data generated both in simulation and on real hardware has emerged as a promising recipe for scaling imitation learning in robotics. This work seeks to elucidate basic principles of this sim-and-real cotraining to inform simulation design, sim-and-real dataset creation, and policy training. Our experiments confirm that cotraining with simulated data can dramatically improve performance, especially when real data is limited. We show that these performance gains scale with additional simulated data up to a plateau; adding more real-world data increases this performance ceiling. The results also suggest that reducing physical domain gaps may be more impactful than visual fidelity for non-prehensile or contact-rich tasks. Perhaps surprisingly, we find that some visual gap can help cotraining – binary probes reveal that high-performing policies must learn to distinguish simulated domains from real. We conclude by investigating this nuance and mechanisms that facilitate positive transfer between sim-and-real. Focusing narrowly on the canonical task of planar pushing from pixels allows us to be thorough in our study. In total, our experiments span 50+ real-world policies (evaluated on 1000+ trials) and 250 simulated policies (evaluated on 50,000+ trials). Videos and code can be found at https://sim-and-real-cotraining.github.io/.

Abstract:
While the capabilities of autonomous driving have advanced rapidly, merging into dense traffic remains a significant challenge. Many motion planning methods for this scenario have been proposed, but they are hard to evaluate. Most existing closed-loop simulators rely on rule-based controls for other vehicles, which results in a lack of diversity and randomness, thus failing to accurately assess motion planning capabilities in highly interactive scenarios. Moreover, traditional evaluation metrics are insufficient for comprehensively evaluating the performance of merging in dense traffic. In response, we propose a closed-loop evaluation benchmark for assessing motion planning capabilities in merging scenarios. In our approach, the other vehicles are trained on large-scale datasets with micro-behavioral characteristics that significantly enhance complexity and diversity. Additionally, we have restructured the evaluation mechanism by leveraging Large Language Models (LLMs) to assess each autonomous vehicle merging onto the main lane. Extensive experiments and test-vehicle deployment have demonstrated the advanced nature of this benchmark. Through this benchmark, we have obtained an evaluation of existing methods and identified common issues. The simulation environment and evaluation process can be accessed at https://github.com/WZM5853/Bench4Merge.

Abstract:
Autonomous suturing is a critical challenge in robot-assisted surgery, where accurate segmentation and pose estimation of suturing threads are essential prerequisites. However, suturing threads are easily occluded by moving instruments and embedded in deformable tissues, which makes the task much more challenging. To address this, we propose a coarse-to-fine network for detailed segmentation and pose estimation of suturing threads. The coarse stage aims to capture the global thread structure, while the fine stage refines the detailed structure through error residual correction. A spatial context fusion module is incorporated to improve the perception of occluded regions, and a weighted balanced cross-entropy loss and a hard sample mining strategy are employed to enhance small-target segmentation performance. To deal with severe occlusions, topological constraints are utilized to effectively identify and reconstruct invisible thread segments. Experiments have been conducted on three datasets collected from different surgical scenes, including phantom, endoscopy, and microsurgery. Both quantitative and qualitative results demonstrate that our proposed framework outperforms baseline methods on segmentation and pose estimation of suturing threads, particularly in detecting occluded threads. Our proposed framework generalizes well across different surgical scenarios, showing its potential for automatic suturing.

Abstract:
Low-light image enhancement (LLIE) aims to enhance the illumination of images that are captured under dark conditions, which is critical for various applications in dim environments, such as robotics and autonomous driving. Existing convolutional neural network (CNN)-based methods usually struggle to capture long-range dependencies, while transformer-based methods, despite their effectiveness, are resource-consuming. Besides, the frequency domain includes important lightness degradation information. To this end, we propose a Mamba-based framework called MambaSFLNet to effectively address LLIE by integrating spatial and frequency features. Our approach utilizes the Visual State Space Module to establish relationships across different regions of the input image while maintaining low model complexity. Furthermore, the spatial module not only balances illumination distribution but also suppresses noise and artifacts during enhancement. In addition, the frequency module enhances image contrast and sharpness by leveraging frequency-domain information. Extensive experiments on nine widely used benchmarks demonstrate that our approach achieves superior performance and exhibits strong generalization capabilities compared to existing methods. The code is available at https://github.com/MingyuLiu1/MambaSFLNet.git.

Abstract:
Extreme weather and infrastructure vulnerabilities pose significant challenges to urban mobility, particularly at intersections where signals become inoperative. To address this growing concern, we introduce Beacon, a naturalistic driving dataset capturing traffic dynamics during blackouts at two major intersections in Memphis, TN, USA. The dataset provides detailed traffic movements, including timesteps, origin, and destination lanes for each vehicle over four hours of peak periods. We analyze traffic demand, vehicle trajectories, and density across different scenarios, demonstrating high-fidelity reconstruction under unsignalized, signalized, and mixed traffic conditions. We find that integrating robot vehicles (RVs) into traffic flow can substantially reduce intersection delays, with wait time improvements of up to 82.6%. However, this enhanced traffic efficiency comes with varying environmental impacts, as decreased vehicle idling may lead to higher overall CO2 emissions. To the best of our knowledge, Beacon is the first publicly available traffic dataset for naturalistic driving behaviors during blackouts at intersections.

Abstract:
Reliable navigation of autonomous vessels critically depends on robust situational awareness, particularly object detection. For this, an accurate, 360-degree perception of the surrounding environment is essential. However, most existing datasets lack the comprehensive multi-view data required for this full environmental coverage. This absence of large-scale, multi-view image datasets specifically designed for maritime situational awareness on vessels presents a significant challenge. To address this, we introduce the Multi-View Maritime Vision (MV2) dataset, comprising 159,386 visible-light images captured from six distinct viewpoints around a vessel. MV2 provides a complete 360-degree omnidirectional perspective, offering critical support for maritime situational awareness applications. The dataset includes object bounding boxes, along with semantic, instance, and panoptic segmentation labels, and encompasses a wide range of environmental conditions, supporting diverse computer-vision tasks. Additionally, we benchmarked state-of-the-art object-detection and panoptic-segmentation models on MV2, demonstrating its contribution to advancing maritime autonomy research. The dataset is available at https://sites.google.com/view/multi-view-maritime-vision.

Abstract:
Autonomous valet parking (AVP) aims to help human drivers navigate to the desired location in the parking lot. Currently, the AVP task is not flexible enough to perform open-vocabulary navigation tasks such as "navigate to the exit" or "park near the elevator". The widely used map formats for AVP, like vectorized maps, have limitations including limited semantics, high cost, and poor human-machine interaction, restricting the flexible application of AVP in complex scenarios. To address these problems, we propose AVP Scene Graph (AVP-SG), a hierarchical visual language mapping and navigation framework for open-vocabulary AVP tasks, which enables autonomous navigation from multi-modal human instructions. Our framework consists of two parts: a bottom-up mapping module and a top-down navigation module. In the mapping pipeline, assisted by a vision-language model (VLM) and an optical character recognition (OCR) model, we first extract open-vocabulary conceptual semantics from images and project them onto the elements of the map. Next, a bottom-up scheme performs feature fusion layer by layer to build the scene graph hierarchically, with slot, lane, block, and garage layers. In the top-down navigation pipeline, the navigation goal can be efficiently found by an LLM-enhanced graph retrieval approach. Experiments on real-world AVP tasks show that the self-driving vehicle can perform open-vocabulary AVP tasks successfully using AVP-SG.

Abstract:
Plant factory cultivation is widely recognized for its ability to optimize resource use and boost crop yields. To further increase the efficiency in these environments, we propose a mixed-integer linear programming (MILP) framework that systematically schedules and coordinates dual-arm harvesting tasks, minimizing the overall harvesting makespan based on pre-mapped fruit locations. Specifically, we focus on a specialized dual-arm harvesting robot and employ pose coverage analysis of its end effector to maximize picking reachability. Additionally, we compare the performance of the dual-arm configuration with that of a single-arm vehicle, demonstrating that the dual-arm system can nearly double efficiency when fruit densities are roughly equal on both sides. Extensive simulations show a 10–20% increase in throughput and a significant reduction in the number of stops compared to non-optimized methods. These results underscore the advantages of an optimal scheduling approach in improving the scalability and efficiency of robotic harvesting in plant factories.
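
The claim that a dual-arm system can nearly double efficiency when fruit densities are roughly equal on both sides can be checked with an idealized back-of-the-envelope model. The sketch below assumes (these assumptions are not from the abstract) that each arm serves only its own side, every pick takes the same time, and travel and stops are ignored; `dual_arm_speedup` is a hypothetical helper, not part of the MILP framework.

```python
def dual_arm_speedup(n_left, n_right, pick_time=1.0):
    """Ratio of single-arm to dual-arm harvesting time in an idealized model.

    Single arm: picks all fruits sequentially.
    Dual arm: both sides are picked in parallel, so the busier side
    determines the makespan.
    """
    if n_left + n_right == 0:
        raise ValueError("at least one fruit is required")
    single = (n_left + n_right) * pick_time
    dual = max(n_left, n_right) * pick_time
    return single / dual
```

Under these assumptions the speedup is exactly 2 for a balanced layout such as `dual_arm_speedup(10, 10)` and drops toward 1 as the fruit distribution becomes one-sided, which matches the qualitative statement above.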

Abstract:
Magnetic-based tactile sensors (MBTS) combine the advantages of compact design and high-frequency operation but suffer from limited spatial resolution due to their sparse taxel arrays. This paper proposes SuperMag, a tactile shape reconstruction method that addresses this limitation by leveraging high-resolution vision-based tactile sensor (VBTS) data to supervise MBTS super-resolution. Co-designed, open-source VBTS and MBTS with identical contact modules enable synchronized data collection of high-resolution shapes and magnetic signals via a symmetric calibration setup. We frame tactile shape reconstruction as a conditional generative problem, employing a conditional variational auto-encoder to infer high-resolution shapes from low-resolution MBTS inputs. The MBTS achieves a sampling frequency of 125 Hz, whereas the shape reconstruction sustains an inference time within 2.5 ms. This cross-modality synergy advances tactile perception of the MBTS, potentially unlocking its new capabilities in high-precision robotic tasks.

Abstract:
Self-selected walking speed is a key outcome for exercise-based rehabilitation programs following lower-extremity trauma. This work introduces a novel reinforcement learning-based assist-as-needed (RL-AAN) controller for ankle exoskeletons, aimed at gait speed training. Built on an actor–critic architecture, the RL-AAN controller integrates a control objective that balances the trade-off between expected stride velocity (SV) errors and exoskeleton assistance. This approach allows the exoskeleton to progressively reduce ankle plantar- and dorsiflexion (PDF) assistance as the user’s performance improves, promoting active participation. The desired assistive torque is computed as the product of the actor output and the wearer’s biomechanical ankle PDF moment, estimated by a subject-agnostic model, thereby ensuring personalized and biomechanically relevant assistance. In a proof-of-concept study with healthy individuals walking on a self-paced treadmill with ankle weights, the RL-AAN controller outperformed a conventional Fixed-K controller—achieving greater immediate speed increases during assisted walking (14.2% vs. 10.0% relative to unassisted perturbed walking) and inducing short-term gait speed adaptation post-training, not observed with the conventional controller. These findings highlight the potential of RL-AAN control for subject-tailored gait training, with promising clinical implications for exercise-based rehabilitation in individuals with neurological or musculoskeletal gait impairments.
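
The torque rule stated above can be written compactly as follows. The symbols are assumptions introduced here for illustration: $a(t)$ denotes the actor output and $\hat{M}_{\mathrm{PDF}}(t)$ the ankle plantar- and dorsiflexion moment estimated by the subject-agnostic model; the abstract does not specify the range of $a(t)$.

```latex
\tau_{\mathrm{des}}(t) = a(t)\,\hat{M}_{\mathrm{PDF}}(t)
```

Read this way, a smaller $a(t)$ means the exoskeleton supplies a smaller share of the estimated biomechanical moment, which is consistent with assistance being reduced as the user's performance improves.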

Abstract:
Large language models (LLMs) have demonstrated powerful reasoning capabilities, and their integration with robotics has revolutionized human-computer interaction and automated task planning. However, LLMs are unaware of environmental knowledge and possible state changes in the environment during planning, which makes the generated tasks unexecutable, particularly when dealing with complex long-horizon tasks involving crowded objects and dynamic relations. In this paper, we propose an LLM-based robot task planning framework with support for environmental knowledge injection, called DRP (Decomposition-Reflection-Prediction). The DRP framework combines LLMs with rule-based task decomposition, multi-perspective reflection, and environmental prediction to generate admissible actions for complex long-horizon tasks. We leverage only few-shot prompting to implement our framework, which avoids the need for additional model training. Experiments on the VirtualHome household task dataset show that the task plans generated by our method improve executability by 25.23%, the subgoal success rate by 64.29%, and the success rate by 58.06% in comparison to state-of-the-art baseline methods. The complete code of our framework has been made public at https://github.com/lab-bj/taskplanning.

Abstract:
Multi-Agent Path Finding (MAPF) is the problem of finding collision-free paths for a group of agents in a shared discrete environment. The agents often need to wait in place for one or more discrete time steps to avoid each other, and they frequently have multiple locations where they can wait. While the locations may be equally good from the waiting agent’s perspective, they impact the rest of the fleet because no other agent can pass through the location in the meantime. Where exactly an agent waits is decided while planning its path and only takes into consideration the agent’s own preferences. Giving the other agents the option to influence the waiting location can improve the quality of solutions found by MAPF solvers and, in the case of solvers which do not re-plan, even improve their success rate. We present the Partially Safe Interval (PSI), which allows the decision about exact waiting locations to be postponed while preserving safety. PSI can be obtained by a simple post-processing procedure, and by following an exact set of rules, the waiting locations can be decided whenever an exact path is necessary or when there is only one remaining option. We demonstrate the benefits of PSI using an extension of the Prioritized Safe Interval Path Planning algorithm, which improves the average Sum of Delays by up to 4.12% and the success rate by up to 5% on benchmark maps. We also provide context for the improvement by comparing the results with the state-of-the-art suboptimal methods PIBT and LaCAM.
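
For background, safe-interval planners reason about maximal time ranges during which a location is free of other agents' reservations. The sketch below computes such ordinary safe intervals for a single location; it illustrates the underlying notion only and does not implement the Partially Safe Interval itself. The function name `safe_intervals` and the discrete `horizon` parameter are assumptions made for this example.

```python
def safe_intervals(occupied, horizon):
    """Return maximal inclusive (start, end) timestep ranges with no reservation.

    occupied: iterable of timesteps at which another agent holds the location.
    horizon:  number of timesteps considered, i.e. timesteps 0 .. horizon - 1.
    """
    intervals = []
    start = 0
    for t in sorted(set(occupied)):
        if t >= horizon:
            break
        if t > start:
            intervals.append((start, t - 1))
        start = t + 1  # the next free timestep can begin right after t
    if start < horizon:
        intervals.append((start, horizon - 1))
    return intervals
```

For a location reserved at timesteps 2, 3, and 7 with a horizon of 10, `safe_intervals([2, 3, 7], 10)` yields `[(0, 1), (4, 6), (8, 9)]`; an agent may wait at the location anywhere inside one of these ranges.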

Abstract:
Existing multi-session visual simultaneous localization and mapping (SLAM) systems struggle to achieve robust localization and map merging under extreme viewpoint and illumination variations, particularly when handling completely opposite viewpoints and drastic day-night lighting changes. These challenges stem largely from the limited viewpoint/illumination invariance of conventional low-level visual features and their inability to capture a global structural context. In this paper, we make the critical observation that a life-long floorplan not only encodes rich geometric and semantic information, serving as a robust high-level structural representation, but is also inherently more robust to severe viewpoint and illumination variations than purely visual data. Building on this insight, we propose a novel hierarchical framework for multi-session SLAM that integrates a floorplan-based map as a global feature to achieve robust indoor localization and map merging under drastic viewpoint and illumination shifts. In particular, we formulate floorplan association as a maximum clique problem augmented with trajectory data to achieve robust floorplan-level global localization. We further introduce a novel coarse-to-fine localization and map merging strategy that seamlessly integrates floorplan alignment, multistage point cloud registration, and feature matching, fully leveraging the macro-level stability of global features and the micro-level precision of local features to achieve keyframe-level fine localization. Extensive experiments on both public and self-collected datasets demonstrate that our method consistently outperforms state-of-the-art (SOTA) approaches reliant solely on low-level visual or geometric features. Crucially, it delivers superior accuracy and robustness even in the face of completely opposite viewpoints and extreme day–night illumination changes. This work underscores the promise of fusing macro-level floorplan representations with conventional SLAM frameworks to advance long-term, robust indoor localization and map merging under the most challenging conditions.

Abstract:
Robot navigation systems are critical for various real-world applications such as delivery services, hospital logistics, and warehouse management. Although classical navigation methods provide interpretability, they rely heavily on expert manual tuning, limiting their adaptability. Conversely, purely learning-based methods offer adaptability but often lead to instability and erratic robot behaviors. Recently introduced parameter tuners aim to balance these approaches by integrating data-driven adaptability into classical navigation frameworks. However, the parameter tuning process currently suffers from training inefficiencies and redundant sampling, with critical regions in the environment often underrepresented in training data. In this paper, we propose EffiTune, a novel framework designed to diagnose and mitigate training inefficiency for parameter tuners in robot navigation systems. EffiTune first performs robot-behavior-guided diagnostics to pinpoint critical bottlenecks and underrepresented regions. It then employs a targeted up-sampling strategy to enrich the training dataset with critical samples, significantly reducing redundancy and enhancing training efficiency. Our comprehensive evaluation demonstrates that EffiTune achieves more than a 13.5% improvement in navigation performance, enhanced robustness in out-of-distribution scenarios, and a 4× improvement in training efficiency within the same computational budget.

Abstract:
This study describes a microfluidics experiment with ready classroom applications, designed to enhance students' understanding of fluid dynamics, controlled mixing, and bubble formation. The materials employed are safe and readily accessible, such as vinegar and baking soda, combined with PDMS microfluidic chips and a high-resolution microscope, to provide real-time observation of gas-liquid interactions. A syringe pump delivers the reactants into a micro-channel through which the fluid flow behavior and bubble formation can be visualized and quantified. The focus of the experiment is on elucidating the effects of different soda concentrations on bubble generation in a controlled laminar flow. The results show a nonlinear trend between soda concentration and bubble features: lower concentrations produce fewer but larger bubbles, whereas moderate concentrations produce small bubbles more frequently. At 0.2 M, the average bubble area was approximately 389 μm², and at 0.4 M, bubbles were smaller but more frequent. As concentrations increased above 0.6 M, bubbles became more uniform in size and more circular. Flow rates were varied from 3 to 15 μL/min to assess bubble behavior. Most bubbles functioned as wall bubbles in the micro-channel and were not perfectly spherical because of the influence of the local flow field and concentration gradients. The size distribution and circularity of the bubbles were measured using image analysis tools developed in Python. This affordable and visually appealing platform provides an alternative hands-on experience for students to learn the fundamental principles of microfluidics, thereby connecting classroom concepts with real-world observations. The lab activity promotes data analysis, hypothesis testing, and a deeper understanding of concepts, skills essential for both academic and applied research.

Abstract:
This paper presents an active, model-based FTC (fault tolerant control) method for the dynamic positioning of underwater vehicles with thruster redundancy. Unlike conventional approaches that rely heavily on state and parameter estimation, the proposed scheme directly utilizes the vehicle’s motion control error (MCE) to construct a residual for detecting thruster faults and failures during the steady-state operation of the control system. One of the primary challenges in thruster fault identification is the unavailability of the actual control input under fault conditions. However, by conducting a detailed and rigorous analysis of the MCE variation trends associated with thruster faults, valuable information about this unknown control input can be extracted. This insight forms the foundation of the proposed FTC strategy. Control reconfiguration is then straightforward, since the thrust losses can be directly estimated through the fault identification process. Numerical studies with a real-world vehicle model are carried out to validate the effectiveness of the proposed method.

Abstract:
In Human-Robot Collaboration (HRC), which encompasses physical interaction and remote cooperation, accurate estimation of human intentions and seamless switching of collaboration modes to adjust robot behavior remain paramount challenges. To address these issues, we propose an Intent-Driven Adaptive Generalized Collaboration (IDAGC) framework that leverages multimodal data and human intent estimation to facilitate adaptive policy learning across multi-tasks in diverse scenarios, thereby facilitating autonomous inference of collaboration modes and dynamic adjustment of robotic actions. This framework overcomes the limitations of existing HRC methods, which are typically restricted to a single collaboration mode and lack the capacity to identify and transition between diverse states. Central to our framework is a predictive model that captures the interdependencies among vision, language, force, and robot state data to accurately recognize human intentions with a Conditional Variational Autoencoder (CVAE) and automatically switch collaboration modes. By employing dedicated encoders for each modality and integrating extracted features through a Transformer decoder, the framework efficiently learns multi-task policies, while force data optimizes compliance control and intent estimation accuracy during physical interactions. Experiments highlight our framework’s practical potential to advance the comprehensive development of HRC.

Abstract:
Crouch gait is one of the key characteristics of children with cerebral palsy, and early detection of gait changes is crucial for subsequent exoskeleton-assisted therapy. This study uses the Vicon 3D motion capture system to collect experimental data on four gait phases of children with cerebral palsy and introduces a CNN-LSTM hybrid model. The model combines the spatial feature extraction strengths of CNN with the temporal sequence modeling capabilities of LSTM, enabling it to effectively identify the complex dynamic changes in gait specific to children with cerebral palsy. By integrating these two components, the model not only accurately extracts key gait features but also captures the temporal dependencies within the gait cycle, allowing for precise recognition of crouch gait. Experimental results demonstrate that the proposed model exhibits good robustness and achieves high accuracy in both overall gait recognition and distinguishing the four individual gait phases. It significantly outperforms traditional machine learning architectures.

Abstract:
Commercial plant phenotyping systems using fixed cameras cannot perceive many plant details due to leaf occlusion. In this paper, we present Botany-Bot, a system for building detailed “annotated digital twins” of living plants using two stereo cameras, a digital turntable inside a lightbox, an industrial robot arm, and 3D segmented Gaussian Splat models. We also present robot algorithms for manipulating leaves to take high-resolution indexable images of occluded details such as stem buds and the underside/topside of leaves. Results from experiments suggest that Botany-Bot can segment leaves with 90.8% accuracy, detect leaves with 86.2% accuracy, lift/push leaves with 77.9% accuracy, and take detailed overside/underside images with 77.3% accuracy. Code, videos, and datasets are available at https://berkeleyautomation.github.io/Botany-Bot/.

Abstract:
Magnetic manipulation has been adopted as a method of actuation in both wireless capsule endoscopy and soft-tethered endoscopy, with the goal of improving gastrointestinal procedures. However, by nature of magnetic manipulation, these endoscopes are typically limited to a maximum of five degrees of freedom (DoF). With the need to introduce additional contact-based sensing modalities for subsurface investigation into these systems as well as to improve overall dexterity, it is both practically and clinically beneficial to recover the lost DoF i.e. the roll around the main axis. This paper presents a method of achieving the magnetic manipulation of an underactuated device by leveraging developable surfaces, specifically, the oloid shape. The design of a clinically relevant magnetic endoscope with all its ancillary elements, as well as contact sensors, is proposed and demonstrated in vivo. The contact sensor data from the in vivo experiments show that for sweeping motions over 100° of roll, contact between the endoscope’s sensor region and the colon wall can be maintained for 74% of the motion.

Abstract:
Long-duration, off-road, autonomous missions require robots to continuously perceive their surroundings regardless of the ambient lighting conditions. Most existing autonomy systems heavily rely on active sensing, e.g., LiDAR, RADAR, and Time-of-Flight sensors, or use (stereo) visible light imaging sensors, e.g., color cameras, to perceive environment geometry and semantics. In scenarios where fully passive perception is required and lighting conditions are degraded to an extent that visible light cameras fail to perceive, most downstream mobility tasks such as obstacle avoidance become impossible. To address such a challenge, this paper presents a Multi-Modal Passive Perception dataset, M2P2, to enable off-road mobility in low-light to no-light conditions. We design a multi-modal sensor suite including thermal, event, and stereo RGB cameras, GPS, two Inertial Measurement Units (IMUs), as well as a high-resolution LiDAR for ground truth, with a multi-sensor calibration procedure that can efficiently transform multi-modal perceptual streams into a common coordinate system. Our 10-hour, 32 km dataset also includes mobility data such as robot odometry and actions and covers well-lit, low-light, and no-light conditions, along with paved, on-trail, and off-trail terrain. Our results demonstrate that off-road mobility and scene understanding under degraded visual environments are possible through only passive perception in extreme low-light conditions. The project website can be found at https://cs.gmu.edu/~xiao/Research/M2P2/.

Abstract:
This paper introduces a soft active surface gripper designed to manipulate fragile objects safely. This gripper consists of two fingers, each equipped with two compliant pneumatic actuators and a soft active surface. The gripper utilizes the elastic belt as its soft active surface, which is driven by a motor, and the opening angle of the elastic band is controlled by pneumatic actuators. The novel design allows for the passive deformation of both the soft active surface and the compliant pneumatic actuator, enabling adaptation to various object shapes and demonstrating superior handling capabilities for delicate items. By synchronizing the opening and closing of the pneumatic fingers with the conveying motion of the active surface, the active surface gripper realizes three degrees of freedom (DOF) for in-plane manipulation, specifically two translational movements and one rotational movement. A prototype gripper has been designed and fabricated for in-plane manipulation experiments with fragile objects, including strawberries, miniature cupcakes, and pears. Experimental results demonstrate that the gripper can execute in-plane in-hand manipulation of fragile objects with varying geometries and dimensions while maintaining secure and robust handling, preventing object slippage and preserving surface integrity without causing damage.

Abstract:
Long-term monitoring of numerous dynamic targets can be tedious for a human operator and infeasible for a single robot, e.g., to monitor wild flocks, detect intruders, or perform search and rescue. Fleets of autonomous robots can be effective by acting collaboratively and concurrently. However, the online coordination is challenging due to the unknown behaviors of the targets and the limited perception of each robot. Existing work often deploys all robots available without minimizing the fleet size, or neglects the constraints on their resources such as battery and memory. This work proposes an online coordination scheme called LOMORO for collaborative target monitoring, path routing and resource charging. It includes three core components: (I) the modeling of the multi-robot task assignment problem under the constraints on resources and monitoring intervals; (II) a resource-aware task coordination algorithm that iterates between the high-level assignment of dynamic targets and the low-level multi-objective routing via Martin's algorithm; (III) an online adaptation algorithm for unpredictable target behaviors and robot failures. It ensures explicitly upper-bounded monitoring intervals for all targets and lower-bounded resource levels for all robots, while minimizing the average number of active robots. The proposed methods are validated extensively via large-scale simulations against several baselines, under different road networks, robot velocities, charging rates and monitoring intervals.

Abstract:
In robotics, the effective integration of environmental data into actionable knowledge remains a significant challenge due to the variety and incompatibility of data formats commonly used in scene descriptions, such as MJCF, URDF, and SDF. This paper presents a novel approach that addresses these challenges by developing a unified scene graph model that standardizes these varied formats into the Universal Scene Description (USD) format. This standardization facilitates the integration of these scene graphs with robot ontologies through semantic reporting, enabling the translation of complex environmental data into actionable knowledge essential for cognitive robotic control. We evaluated our approach by converting procedural 3D environments into USD format, which is then annotated semantically and translated into a knowledge graph to effectively answer competency questions, demonstrating its utility for real-time robotic decision-making. Additionally, we developed a web-based visualization tool to support the semantic mapping process, providing users with an intuitive interface to manage the 3D environment.

Abstract:
This paper presents a guaranteed model-based approach for monitoring drone trajectory, providing real-time guarantees with a simplified dynamic model. ROS components are introduced for real-time implementation, enabling monitoring and adjustments in both simulations and actual systems. We extend the application of set-based simulation by formalizing timing conditions with Signal Temporal Logic (STL) and incorporating Boolean interval arithmetic to handle undetermined behaviors. The method compares model-based fault prediction using a stochastic approach with a set-based method, which manages bounded uncertainties and offers guarantees. Experimental validation, including comparisons against Monte Carlo methods, demonstrates the approach's ability to ensure safety in worst-case scenarios while remaining suitable for real-time processing.
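
The Boolean interval idea mentioned above can be illustrated with a small three-valued check. This is a minimal sketch, not the authors' implementation: it assumes a Boolean interval is encoded as a pair (lower, upper) of truth values, and all names are hypothetical.

```python
# A Boolean interval is a pair (lower, upper): (True, True) means certainly
# true, (False, False) certainly false, (False, True) undetermined.
TRUE = (True, True)
FALSE = (False, False)
UNKNOWN = (False, True)


def leq(interval, bound):
    """Evaluate the predicate x <= bound for every x in [lo, hi]."""
    lo, hi = interval
    if hi <= bound:
        return TRUE      # the whole set satisfies the predicate
    if lo > bound:
        return FALSE     # no point of the set satisfies it
    return UNKNOWN       # the set straddles the bound


def b_and(a, b):
    """Conjunction of two Boolean intervals, applied bound by bound."""
    return (a[0] and b[0], a[1] and b[1])


def always(verdicts):
    """STL-style 'always' over a finite sequence of Boolean intervals."""
    result = TRUE
    for verdict in verdicts:
        result = b_and(result, verdict)
    return result


# Usage: altitude enclosures at three time steps, checked against a 2.0 m limit.
enclosures = [(0.0, 1.0), (1.5, 2.5), (1.0, 1.8)]
verdict = always(leq(box, 2.0) for box in enclosures)
```

Here the second enclosure straddles the limit, so the overall verdict is undetermined rather than falsely reported as safe or unsafe.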

Abstract:
This paper investigates the control problem of underwater vehicles subject to time-varying external disturbances and actuator faults. A novel passive fault-tolerant control (PFTC) scheme is developed to address the coupled disturbance-fault dynamics inherent in underwater vehicle systems. The proposed dual-mode architecture comprises: 1) a robust fault-tolerant control scheme based on high-order sliding mode observers (HOSMOs) for minor fault scenarios, which effectively compensates for bounded disturbances and partial actuator degradation; 2) a conditionally triggered estimation mechanism integrated with fault-tolerant control allocation (FTCA) and HOSMOs for severe fault conditions, enabling fault estimation and model compensation via event-triggered parameter updating. The hybrid architecture ensures computational efficiency by activating the estimation module only when predefined triggering conditions are violated. Comprehensive experimental results validate the superiority of the proposed method in maintaining stability and performance under various fault conditions. This work provides a systematic solution for underwater vehicle control under coupled disturbance-fault conditions, with verified real-time performance and implementation feasibility.

Abstract:
Soft bellows actuators (SBAs), a particular type of soft pneumatic actuators (SPAs), are widely used in various applications, such as climbing robots, industrial grippers, and wearable devices. Despite their advantages of uniform motion and high efficiency, the design of SBAs often relies on experiential methods rather than standardized guidelines. This results in unclear optimization pathways and a misalignment between SBA performance and specific application requirements. This study identifies six critical parameters of linear pneumatic SBAs: Shore hardness (SH), number of units (N), thickness (t), mid-diameter (Rm), unit width (x), and unit depth (h). We explore how these parameters influence load capacity, displacement efficiency, and bending resistance. Experimental findings indicate that increasing SH, t, x, and h and decreasing N enhance load capacity. Moreover, increases in N, Rm, x, and h, along with decreases in SH and t, improve displacement efficiency. Furthermore, enhancing SH, t, and Rm and reducing N, x, and h strengthen bending resistance. Based on these insights, we design three types of SBAs tailored to specific tasks, which are implemented in a high-load pneumatic gripper, a high-efficiency displacement table, and a pneumatic worm-inspired climbing robot. This research contributes to the targeted design of SBAs, offering a novel approach for the effective optimization and performance prediction of particular SPAs, thereby facilitating the broader application of soft robots.

Abstract:
This paper presents a novel strategy for offline estimation of the spatial motion of a multi-link mechanism using Inertial Measurement Unit (IMU) sensors. Accelerometers, gyroscopes and magnetometers are strategically mounted and modeled to maximize measurement accuracy through the past work of inertial sensor fusion. The core contribution of this paper is the development of the Factor Graph Optimization (FGO) Preconditioned by the Extended Kalman Filter (EKF), which is termed FGOPreEKF in this paper, and its integration with the inertial sensor fusion. Since the online EKF efficiently derives the initial guess using the same motion and sensor models, the FGO estimates the motion of a multi-link mechanism efficiently and accurately. The proposed approach was experimentally validated on a two-link system mounted on a fast-moving linear axis, demonstrating superior accuracy compared to standalone EKF or FGO. These results demonstrate the potential of this approach for estimating multi-link motion in more complex scenarios.

Abstract:
This paper introduces RoboDexVLM, an innovative framework for robot task planning and grasp detection tailored for a collaborative manipulator equipped with a dexterous hand. Previous methods focus on simplified and limited manipulation tasks, which often neglect the complexities associated with grasping a diverse array of objects in a long-horizon manner. In contrast, our proposed framework utilizes a dexterous hand capable of grasping objects of varying shapes and sizes while executing tasks based on natural language commands. The proposed approach has the following core components: First, a robust task planner with a task-level recovery mechanism that leverages vision-language models (VLMs) is designed, which enables the system to interpret and execute open-vocabulary commands for long sequence tasks. Second, a language-guided dexterous grasp perception algorithm is presented based on robot kinematics and formal methods, tailored for zero-shot dexterous manipulation with diverse objects and commands. Comprehensive experimental results validate the effectiveness, adaptability, and robustness of RoboDexVLM in handling long-horizon scenarios and performing dexterous grasping. These results highlight the framework’s ability to operate in complex environments, showcasing its potential for open-vocabulary dexterous manipulation. Our open-source project page can be found at https://henryhcliu.github.io/robodexvlm.

Abstract:
The goal of extrinsic calibration is the alignment of sensor data to ensure an accurate representation of the surroundings and enable sensor fusion applications. From a safety perspective, sensor calibration is a key enabler of autonomous driving. In the current state of the art, a trend from target-based offline calibration towards targetless online calibration can be observed. However, online calibration is subject to strict real-time and resource constraints which are not met by state-of-the-art methods. This is mainly due to the high number of parameters to estimate, the reliance on geometric features, or the dependence on specific vehicle maneuvers. To meet these requirements and ensure the vehicle's safety at any time, we propose a miscalibration detection framework that shifts the focus from the direct regression of calibration parameters to a binary classification of the calibration state, i.e., calibrated or miscalibrated. Therefore, we propose a contrastive learning approach that compares embedded features in a latent space to classify the calibration state of two different sensor modalities. Moreover, we provide a comprehensive analysis of the feature embeddings and challenging calibration errors that highlight the performance of our approach. As a result, our method outperforms the current state-of-the-art in terms of detection performance, inference time, and resource demand. The code is open source and available on https://github.com/TUMFTM/MiscalibrationDetection.
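
The classification step described above, comparing embedded features in a latent space, could look roughly as follows. This is a sketch under stated assumptions: cosine similarity and a fixed threshold are illustrative choices, the embeddings would come from learned encoders that are not shown, and the function names are hypothetical rather than taken from the released code.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("embeddings must be non-zero")
    return dot / (norm_a * norm_b)


def is_calibrated(embedding_a, embedding_b, threshold=0.8):
    """Binary calibration state: True if the two modality embeddings agree.

    The threshold value is an assumption; in practice it would be tuned
    on validation data.
    """
    return cosine_similarity(embedding_a, embedding_b) >= threshold


# Usage with toy 2-D embeddings from two sensor modalities.
aligned = is_calibrated([1.0, 0.0], [1.0, 0.1])
misaligned = is_calibrated([1.0, 0.0], [0.0, 1.0])
```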

Abstract:
We present DynamicPose, a retraining-free 6D pose tracking framework that improves tracking robustness in fast-moving camera and object scenarios. Previous work is mainly applicable to static or quasi-static scenes, and its performance significantly deteriorates when both the object and the camera move rapidly. To overcome these challenges, we propose three synergistic components: (1) A visual-inertial odometry compensates for the shift in the Region of Interest (ROI) caused by camera motion; (2) A depth-informed 2D tracker corrects ROI deviations caused by large object translation; (3) A VIO-guided Kalman filter predicts object rotation, generates multiple candidate poses, and then obtains the final pose by hierarchical refinement. The 6D pose tracking results guide subsequent 2D tracking and Kalman filter updates, forming a closed-loop system that ensures accurate pose initialization and precise pose tracking. Simulation and real-world experiments demonstrate the effectiveness of our method, achieving real-time and robust 6D pose tracking for fast-moving cameras and objects.

Abstract:
We propose a hierarchical reinforcement learning (HRL) framework for efficient Navigation Among Movable Obstacles (NAMO) using a mobile manipulator. Our approach combines interaction-based obstacle property estimation with structured pushing strategies, facilitating the dynamic manipulation of unforeseen obstacles while adhering to a preplanned global path. The high-level policy generates pushing commands that consider environmental constraints and path-tracking objectives, while the low-level policy precisely and stably executes these commands through coordinated whole-body movements. Comprehensive simulation-based experiments demonstrate improvements in performing NAMO tasks, including higher success rates, shortened traversed path length, and reduced goal-reaching times, compared to baselines. Additionally, ablation studies assess the efficacy of each component, while a qualitative analysis further validates the accuracy and reliability of the real-time obstacle property estimation.

Abstract:
In teleoperated surgery, the motion scaling factor directly influences both the operator’s control precision of surgical instruments and operational comfort. Previous studies have revealed that the master manipulator state and operator’s gaze information can reflect the complexity of surgical operations and the operator’s intention to some extent. Although enabling real-time adjustment of scaling factors, they were limited by the narrow range of core parameters and the results were significantly influenced by subjective factors. To tackle these challenges, this paper presents a multi-dimensional adaptive motion scaling strategy based on Bayesian optimization. The prediction of the operator’s intention and attention is achieved by integrating multiple dimensional parameters, including master-slave manipulator states, gaze information, as well as pupillary data, all of which have been experimentally validated. Specifically, there exists a significant temporal synchronization between the Index of Pupillary Activity (IPA) and teleoperation tasks, which aligns with research on the correlation between IPA and attention levels. Furthermore, to evaluate the proposed adaptive scaling strategy, we combine subjective questionnaire surveys with objective metric assessments, effectively reducing the excessive influence of operators’ personal conditions and proficiency levels on optimization results.

Abstract:
Robotic navigation plays a pivotal role in a wide range of real-world applications. While traditional navigation systems focus on efficiency and obstacle avoidance, their inability to model complex human behaviors in shared spaces has underscored the growing need for socially aware navigation. In this work, we explore a novel paradigm of socially aware robot navigation empowered by large language models (LLMs), and propose HSAC-LLM, a hybrid framework that seamlessly integrates deep reinforcement learning with the reasoning and communication capabilities of LLMs. Unlike prior approaches that passively predict pedestrian trajectories or issue pre-scripted alerts, HSAC-LLM enables bidirectional natural language interaction, allowing robots to proactively engage in dialogue with pedestrians to resolve potential conflicts and negotiate path decisions. Extensive evaluations across 2D simulations, Gazebo environments, and real-world deployments demonstrate that HSAC-LLM consistently outperforms state-of-the-art DRL baselines under our proposed socially aware navigation metric, which covers safety, efficiency, and human comfort. By bridging linguistic reasoning and interactive motion planning, our results highlight the potential of LLM-augmented agents for robust, adaptive, and human-aligned navigation in real-world settings. Project page: https://hsacllm.github.io/.

Abstract:
Real-time morphological perception and precise end force feedback prediction of surgical robots constitute critical technical elements for ensuring safety and efficacy in complex interventional procedures such as Endoscopic Retrograde Cholangiopancreatography (ERCP). In this paper, we design a miniature flexible surgical robot (FSR) with a nested spring structure and propose a physics-informed deep learning approach to simultaneously predict both the FSR's shape and 2D contact forces at its end-effector. The physical constraints were derived from a quasi-static model of the FSR, which is capable of characterizing persistent environmental interactions. Our method eliminates the need for end-effector sensors, not only ensuring high accuracy in both shape and contact force predictions but also maintaining consistent predictive performance under continuous environmental interactions. Experimental validation of the method revealed a high consistency between predicted values and reference data, achieving a 34.97% improvement in computational speed and a maximum prediction accuracy enhancement of 71.64% compared to conventional LSTM approaches.

Abstract:
Mobile robots necessitate advanced natural language understanding capabilities to accurately identify locations and perform tasks such as package delivery. However, traditional visual place recognition (VPR) methods rely solely on single-view visual information and cannot interpret human language descriptions. To overcome this challenge, we bridge text and vision by proposing a multiview (360° views of the surroundings) text-vision registration approach called Text4VPR for the place recognition task, which is the first method that exclusively utilizes textual descriptions to match a database of images. Text4VPR employs the frozen T5 language model to extract global textual embeddings. Additionally, it utilizes the Sinkhorn algorithm with a temperature coefficient to assign local tokens to their respective clusters, thereby aggregating visual descriptors from images. During the training stage, Text4VPR emphasizes the alignment between individual text-image pairs for precise textual description. In the inference stage, Text4VPR uses the Cascaded Cross-Attention Cosine Alignment (CCCA) to address the internal mismatch between text and image groups. Subsequently, Text4VPR performs precise place matching based on the descriptions of text-image groups. On Street360Loc, the first text-to-image VPR dataset we created, Text4VPR builds a robust baseline, achieving a leading top-1 accuracy of 56% and a leading top-10 accuracy of 91% within a 5-meter radius on the test set, which indicates that localization from textual descriptions to images is not only feasible but also holds significant potential for further advancement, as shown in Figure 1. Our code is available at https://github.com/nuozimiaowu/Text4VPR.
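
The token-to-cluster assignment mentioned above can be sketched with the generic Sinkhorn normalisation. The abstract does not give the exact formulation, so this assumes the standard entropic variant: scores are exponentiated with a temperature, then columns and rows are rescaled alternately. The function name and the uniform cluster-mass target are assumptions for illustration.

```python
import math


def sinkhorn_assign(scores, temperature=0.1, n_iters=50):
    """Soft-assign tokens (rows) to clusters (columns) by Sinkhorn iterations.

    Returns a matrix whose rows sum to 1 and whose columns approach an equal
    share of the total mass. A lower temperature gives sharper assignments,
    but very low values can underflow to zero in this plain-float version.
    """
    n_rows, n_cols = len(scores), len(scores[0])
    peak = max(max(row) for row in scores)  # subtract the max for stability
    plan = [[math.exp((s - peak) / temperature) for s in row] for row in scores]
    col_target = n_rows / n_cols
    for _ in range(n_iters):
        # Rescale each column so every cluster receives the same total mass.
        for j in range(n_cols):
            col_sum = sum(plan[i][j] for i in range(n_rows))
            for i in range(n_rows):
                plan[i][j] *= col_target / col_sum
        # Rescale each row so every token distributes exactly one unit.
        for i in range(n_rows):
            row_sum = sum(plan[i])
            plan[i] = [value / row_sum for value in plan[i]]
    return plan


# Usage: four local tokens scored against two clusters.
token_scores = [[1.0, 0.0], [0.9, 0.1], [0.2, 0.8], [0.0, 1.0]]
assignment = sinkhorn_assign(token_scores, temperature=0.5, n_iters=100)
```

The resulting soft assignment could then weight each token's descriptor when aggregating per-cluster features.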

Abstract:
This study addresses the challenge of multi-robot cooperative exploration under limited local observations in environments with dynamic robot populations. To achieve efficient area coverage within constrained timeframes, we propose the Multi-Robot Informative Planner (MIP), a novel reinforcement learning (RL)-based planning module. The core component of MIP is the Neighborhood Information Aggregator, which employs a graph neural network (GNN) to integrate local neighborhood information for each robot. Our design enhances sample efficiency by minimizing information requirements while ensuring scalability across environments with varying robot numbers. To generate high-quality, expressive neighborhood feature representations, we utilize Graphical Mutual Information (GMI) to maximize the correlation between neighboring robots’ input features and their high-level hidden representations. Furthermore, MIP incorporates the Spatial-Neighborhood Transformer, which captures spatial features and inter-robot interactions through spatial self-attention mechanisms. These components collectively form the Multi-Robot Neural Informative Mapping (MRNIM) framework, which outperforms traditional benchmarks in the Habitat simulator.

Abstract:
This paper presents an integrated Reinforcement Learning (RL) and Model Predictive Control (MPC) framework for autonomous satellite docking with a partially filled fuel tank. Traditional docking control faces challenges due to fuel sloshing in microgravity, which induces unpredictable forces affecting stability. To address this, we integrate Proximal Policy Optimization (PPO) and Soft Actor-Critic (SAC) RL algorithms with MPC, leveraging MPC’s predictive capabilities to accelerate RL training and improve control robustness. The proposed approach is validated through Zero-G Lab of SnT experiments for planar stabilization and high-fidelity numerical simulations for 6-DOF docking with fuel sloshing dynamics. Simulation results demonstrate that SAC-MPC achieves superior docking accuracy, higher success rates, and lower control effort, outperforming standalone RL and PPO-MPC methods. This study advances fuel-efficient and disturbance-resilient satellite docking, enhancing the feasibility of on-orbit refueling and servicing missions.

Abstract:
3D Gaussian splatting has advanced simultaneous localization and mapping (SLAM) technology by enabling real-time positioning and the construction of high-fidelity maps. However, the uncertainty in Gaussian position and initialization parameters introduces challenges, often requiring extensive iterative convergence and resulting in redundant or insufficient Gaussian representations. To address this, we introduce a novel adaptive densification method based on Fourier frequency domain analysis to establish Gaussian priors for rapid convergence. Additionally, we propose constructing independent and unified sparse and dense maps, where a sparse map supports efficient tracking via Generalized Iterative Closest Point (GICP) and a dense map creates high-fidelity visual representations. This is the first SLAM system leveraging frequency domain analysis to achieve high-quality Gaussian mapping in real time. Experimental results demonstrate an average frame rate of 36 FPS on Replica and TUM RGB-D datasets, achieving competitive accuracy in both localization and mapping. The source code is publicly available at https://github.com/3DV-Coder/FGS-SLAM.

Abstract:
This paper presents a lightweight bidirectional cable-driven ankle exoskeleton system (total mass: 2.6 kg) based on a series elastic actuation architecture (actuator module mass: 1.05 kg). The system utilizes a waist-mounted drive unit and Bowden cables to deliver bidirectional assistance to the ankle joint (nominal force: 460 N, peak force: 680 N). By integrating a dynamically coupled adaptive oscillator (AO), the system achieves robust gait synchronization across a range of walking speeds (0.6–1.8 m/s, phase estimation RMSE < 2.48%, stride frequency estimation RMSE < 0.1 Hz). This is complemented by a Gaussian Process (GP)-based torque planner and a cascaded torque control framework, ensuring seamless coordination with natural gait. Experimental characterization of the actuator demonstrates its high dynamic performance (torque bandwidth: 12.5 Hz) and low-impedance characteristics (peak passive backdrive torque: 0.97 N·m). Human trials involving five participants show that the system significantly expands the ankle joint range of motion (up to [-15.64°, 20.67°] at high speeds) while reducing peak muscle activation levels in the tibialis anterior (18.54%–30.21%) and gastrocnemius (19.34%–25.45%). This design, combining lightweight construction with adaptive control strategies, provides a highly effective solution for daily mobility assistance and rehabilitation applications.

Abstract:
General-purpose robots should possess humanlike dexterity and agility to perform tasks with the same versatility as us. A human-like form factor further enables the use of vast datasets of human-hand interactions. However, the primary bottleneck in dexterous manipulation lies not only in software but arguably even more in hardware. Robotic hands that approach human capabilities are often prohibitively expensive, bulky, or require enterprise-level maintenance, limiting their accessibility for broader research and practical applications. What if the research community could get started with reliable dexterous hands within a day? We present the open-source ORCA hand, a reliable and anthropomorphic 17-DoF tendon-driven robotic hand with integrated tactile sensors, fully assembled in less than eight hours and built for a material cost below 2,000 CHF. We showcase ORCA’s key design features such as popping joints, auto-calibration, and tensioning systems that significantly reduce complexity while increasing reliability, accuracy, and robustness. We benchmark the ORCA hand across a variety of tasks, ranging from teleoperation and imitation learning to zero-shot sim-to-real reinforcement learning. Furthermore, we demonstrate its durability, withstanding more than 10,000 continuous operation cycles—equivalent to approximately 20 hours—without hardware failure, the only constraint being the duration of the experiment itself. Video is here: youtu.be/kUbPSYMmOds. Design files, source code, and documentation are available at srl.ethz.ch/orcahand.

Abstract:
Nasotracheal intubation (NTI) is critical for establishing artificial airways in clinical anesthesia and critical care. Current manual methods face significant challenges, including cross-infection, especially during respiratory infection care, and insufficient control of endoluminal contact forces, increasing the risk of mucosal injuries. While existing studies have focused on automated endoscopic insertion, the automation of NTI remains unexplored despite its unique challenges: Nasotracheal tubes exhibit greater diameter and rigidity than standard endoscopes, substantially increasing insertion complexity and patient risks. We propose a novel autonomous NTI system with two key components to address these challenges. First, an autonomous NTI system is developed, incorporating a prosthesis embedded with force sensors, allowing for safety assessment and data filtering. Then, the Recurrent Action-Confidence Chunking with Transformer (RACCT) model is developed to handle complex tube-tissue interactions and partial visual observations. Experimental results demonstrate that the RACCT model outperforms the ACT model in all aspects and achieves a 66% reduction in average peak insertion force compared to manual operations while maintaining equivalent success rates. This validates the system’s potential for reducing infection risks and improving procedural safety.

Affiliations: Shenzhen Research Institute of Big Data, The Chinese University of Hong Kong (Shenzhen), Shenzhen, China; Shenzhen Institutes of Advanced Technology (SIAT), Chinese Academy of Sciences, Shenzhen, China; State Key Laboratory of Internet of Things for Smart City (SKL-IOTSC), University of Macau, Macau, China; Manifold Tech Limited, Hong Kong, China; School of Electrical Engineering and Telecommunications, the University of New South Wales, Australia

Abstract:
Navigating autonomous vehicles in open scenarios is a challenge due to the difficulties in handling unseen objects. Existing solutions either rely on small models that struggle with generalization or large models that are resource-intensive. While collaboration between the two offers a promising solution, the key challenge is deciding when and how to engage the large model. To address this issue, this paper proposes opportunistic collaborative planning (OCP), which seamlessly integrates efficient local models with powerful cloud models through two key innovations. First, we propose large vision model guided model predictive control (LVM-MPC), which leverages the cloud for LVM perception and decision making. The cloud output serves as a global guidance for a local MPC, thereby forming a closed-loop perception-to-control system. Second, to determine the best timing for large model query and service, we propose collaboration timing optimization (CTO), including object detection confidence thresholding (ODCT) and cloud forward simulation (CFS), to decide when to seek cloud assistance and when to offer cloud service. Extensive experiments show that the proposed OCP outperforms existing methods in terms of both navigation time and success rate.
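
The object detection confidence thresholding (ODCT) idea can be illustrated with a minimal sketch. The abstract does not give the actual rule, so the `Detection` type, the 0.5 threshold, and the "any detection below threshold" trigger are assumptions made for illustration only; cloud forward simulation (CFS) is not shown.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Detection:
    """Hypothetical output of the local (small) detector."""
    label: str
    confidence: float  # detector score in [0, 1]


def should_query_cloud(detections: Iterable[Detection], threshold: float = 0.5) -> bool:
    """Return True when at least one local detection is below the threshold.

    Assumed rule: a low-confidence detection suggests an unseen object,
    so the vehicle asks the cloud model for help.
    """
    return any(d.confidence < threshold for d in detections)


if __name__ == "__main__":
    frame = [Detection("car", 0.92), Detection("unknown", 0.31)]
    print(should_query_cloud(frame))
```

A single scalar threshold keeps the local decision cheap; a deployed system would likely tune it per object class.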

Abstract:
Recent advancements in imitation learning have shown promising results in robotic manipulation, driven by the availability of high-quality training data. To improve data collection efficiency, some approaches focus on developing specialized teleoperation devices for robot control, while others directly use human hand demonstrations to obtain training data. However, the former requires both a robotic system and a skilled operator, limiting scalability, while the latter faces challenges in bridging the visual gap between human hand demonstrations and the deployed robot observations. To address this, we propose a human hand data collection system combined with our hand-to-gripper generative model, which translates human hand demonstrations into robot gripper demonstrations, effectively bridging the observation gap. Specifically, a GoPro fisheye camera is mounted on the human wrist to capture human hand demonstrations. We then train a generative model on a self-collected dataset of paired human hand and UMI gripper demonstrations, which have been processed using a tailored data pre-processing strategy to ensure alignment in both timestamps and observations. Therefore, given only human hand demonstrations, we are able to automatically extract the corresponding SE(3) actions and integrate them with high-quality generated robot demonstrations through our generation pipeline for training a robotic policy model. In experiments, the robust manipulation performance demonstrates not only the quality of the generated robot demonstrations but also the efficiency and practicality of our data collection method. More demonstrations can be found at: https://rwor.github.io/.

Abstract:
This paper investigates a framework (CATCH-FORM-3D) for the precise contact force control and surface deformation regulation in viscoelastic material manipulation. A partial differential equation (PDE) is proposed to model the spatiotemporal stress-strain dynamics, integrating 3D Kelvin-Voigt (stiffness-damping) and Maxwell (diffusion) effects to capture the material’s viscoelastic behavior. Key mechanical parameters (stiffness, damping, diffusion coefficients) are estimated in real time via a PDE-driven observer. This observer fuses visual-tactile sensor data and experimentally validated forces to generate rich regressor signals. Then, the framework employs an inner-outer loop control structure. In the outer loop, the reference deformation is updated by a novel admittance control law, i.e., a proportional-derivative (PD) feedback law with contact force measurements, ensuring that the system responds adaptively to external interactions. In the inner loop, a reaction-diffusion PDE for the deformation tracking error is formulated and then exponentially stabilized by conforming the contact surface to analytical geometric configurations (i.e., defining Dirichlet boundary conditions). This dual-loop architecture enables the effective deformation regulation in dynamic contact environments. Experiments using a PaXini robotic hand demonstrate sub-millimeter deformation accuracy and stable force tracking (±5% deviation). The framework advances compliant robotic interactions in applications like industrial assembly, polymer shaping, surgical treatment, and household service.
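
The outer-loop idea, a PD feedback law on the contact force that updates the reference deformation, can be sketched in discrete time. This is a generic textbook-style admittance update, not the control law of the paper; the function name, the gains `kp` and `kd`, and the scalar deformation are all assumptions for illustration.

```python
def admittance_step(d_ref, force_meas, force_des, prev_err, dt, kp=1e-3, kd=1e-4):
    """One discrete update of a scalar reference deformation.

    Assumed form: d_ref <- d_ref + dt * (kp * e + kd * de/dt),
    where e = force_des - force_meas is the contact force error.
    Returns the new reference and the current error (for the next call).
    """
    err = force_des - force_meas
    d_err = (err - prev_err) / dt  # finite-difference derivative of the force error
    d_ref_new = d_ref + dt * (kp * err + kd * d_err)
    return d_ref_new, err


if __name__ == "__main__":
    d_ref, prev_err = 0.0, 0.0
    # Hypothetical force readings (N) approaching a 2 N target.
    for f in [0.0, 0.8, 1.5, 1.9, 2.0]:
        d_ref, prev_err = admittance_step(d_ref, f, 2.0, prev_err, dt=0.01)
        print(round(d_ref, 8))
```

The inner loop of the paper (PDE-based deformation tracking) is not represented here.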

Abstract:
Despite recent remarkable achievements in quadruped control, it remains challenging to ensure robust and compliant locomotion in the presence of unforeseen external disturbances. Existing methods prioritize locomotion robustness over compliance, often leading to stiff, high-frequency motions, and energy inefficiency. This paper, therefore, presents a two-stage hierarchical learning framework that can learn to take active reactions to external force disturbances based on force estimation. In the first stage, a velocity-tracking policy is trained alongside an auto-encoder to distill historical proprioceptive features. A neural network-based estimator is learned through supervised learning, which estimates body velocity and external forces based on proprioceptive measurements. In the second stage, a compliance action module, inspired by impedance control, is learned based on the pre-trained encoder and policy. This module is employed to actively adjust velocity commands in response to external forces based on real-time force estimates. With the compliance action module, a quadruped robot can robustly handle minor disturbances while appropriately yielding to significant forces, thus striking a balance between robustness and compliance. Simulations and real-world experiments have demonstrated that our method has superior performance in terms of robustness, energy efficiency, and safety. Experiment comparison shows that our method outperforms the state-of-the-art RL-based locomotion controllers. Ablation studies are given to show the critical roles of the compliance action module.

Abstract:
Visual servoing enables robots to precisely position their end-effector relative to a target object. While classical methods rely on hand-crafted features and thus are universally applicable without task-specific training, they often struggle with occlusions and environmental variations, whereas learning-based approaches improve robustness but typically require extensive training. We present a visual servoing approach that leverages pretrained vision transformers for semantic feature extraction, combining the advantages of both paradigms while also being able to generalize beyond the provided sample. Our approach achieves full convergence in unperturbed scenarios and surpasses classical image-based visual servoing by up to 31.2% relative improvement in perturbed scenarios. Even the convergence rates of learning-based methods are matched despite requiring no task- or object-specific training. Real-world evaluations confirm robust performance in end-effector positioning, industrial box manipulation, and grasping of unseen objects using only a reference from the same category. Our code and simulation environment are available at: https://alessandroscherl.github.io/ViT-VS/

Abstract:
Unmanned Aerial Vehicles (UAVs) play an important role in various applications, where precise trajectory tracking is crucial. However, conventional control algorithms for trajectory tracking often exhibit limited performance due to the underactuated, nonlinear, and highly coupled dynamics of quadrotor systems. To address these challenges, we propose HBO-PID, a novel control algorithm that integrates the Heteroscedastic Bayesian Optimization (HBO) framework with the classical PID controller to achieve accurate and robust trajectory tracking. By explicitly modeling input-dependent noise variance, the proposed method can better adapt to dynamic and complex environments, and therefore improve the accuracy and robustness of trajectory tracking. To accelerate the convergence of optimization, we adopt a two-stage optimization strategy that allows us to find the optimal controller parameters more efficiently. Through experiments in both simulation and real-world scenarios, we demonstrate that the proposed method significantly outperforms state-of-the-art (SOTA) methods. Compared to SOTA methods, it improves the position accuracy by 24.7% to 42.9%, and the angular accuracy by 40.9% to 78.4%.

Abstract:
The European Space Agency (ESA) and the European Space Resources Innovation Centre (ESRIC) created the Space Resources Challenge to invite researchers and companies to propose innovative solutions for Multi-Robot Systems (MRS) space prospection. This paper proposes the Resilient Exploration And Lunar Mapping System 2 (REALMS2), an MRS framework for planetary prospection and mapping. Based on Robot Operating System version 2 (ROS 2) and enhanced with Visual Simultaneous Localisation And Mapping (vSLAM) for map generation, REALMS2 uses a mesh network for a robust ad hoc network. A single graphical user interface (GUI) controls all the rovers, providing a simple overview of the robotic mission. This system is designed for heterogeneous multi-robot exploratory missions, tackling the challenges presented by extraterrestrial environments. REALMS2 was used during the second field test of the ESA-ESRIC Challenge and mapped around 60% of the area using three homogeneous rovers while handling communication delays and blackouts.

Abstract:
This paper introduces a novel distributed approach for forming UAV-based multi-hop relay networks by adapting traditional flocking models to create relay chains between remote points. Our method modifies the standard flocking paradigm by incorporating dynamic agent roles, allowing UAVs to self-organize based solely on local state and neighbor information, and integrates networking information such as routing decisions directly into mobility control. A side contribution is the introduction of a Line-of-Sight (LOS) conservation force, which mitigates communication failures due to obstacles and is easily adaptable to the flocking model. The proposed algorithm is evaluated using a joint robotics and network co-simulator that combines realistic multi-rotor physics with ns-3-based network simulations. Simulation results across diverse environments and varying mission complexities demonstrate that our approach effectively maintains connectivity, enhances Quality of Service (QoS), and scales robustly, thereby bridging the gap between robotic control and aerial wireless network design.

Abstract:
Microrobotics implies actuation-related constraints that make safe telemanipulation particularly challenging. We present a haptic shared control system for electromagnetic-based telemanipulation of a pair of microrobots using a constrained optimization framework. Our contributions include: (1) a Quadratic Programming formulation with Control Lyapunov Functions and Control Barrier Functions, for safe and stable navigation in cluttered environments; (2) a shared control architecture, combining a haptic interface and simulation environment, to teleoperate the microrobots and enable micromanipulation capabilities; and (3) haptic shared control strategies offering visuo-haptic cues for task execution. The approach is validated through a user study, highlighting better navigation accuracy, control stability and task efficiency.
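
To illustrate how a Control Barrier Function constraint enters a Quadratic Program, the sketch below filters a nominal velocity command for a single-integrator agent avoiding one circular obstacle. It is a simplification under stated assumptions: the paper combines Control Lyapunov and Control Barrier Functions for a pair of microrobots, whereas here the nominal command stands in for the stabilizing term, the dynamics x' = u are assumed, and the function name and `alpha` gain are hypothetical. With a single affine constraint the QP has a closed-form solution, so no solver is needed.

```python
def cbf_safety_filter(x, u_nom, center, radius, alpha=1.0):
    """Solve min ||u - u_nom||^2  s.t.  grad_h(x) . u >= -alpha * h(x).

    Barrier: h(x) = ||x - center||^2 - radius^2 (h >= 0 means safe).
    Dynamics assumed to be a single integrator, x' = u.
    """
    dx = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in dx) - radius ** 2
    a = [2.0 * d for d in dx]            # gradient of h
    b = -alpha * h                       # right-hand side of the constraint
    a_dot_u = sum(ai * ui for ai, ui in zip(a, u_nom))
    if a_dot_u >= b:
        return list(u_nom)               # nominal command is already safe
    a_sq = sum(ai * ai for ai in a)
    if a_sq == 0.0:
        raise ValueError("constraint infeasible: state is at the obstacle center")
    scale = (b - a_dot_u) / a_sq         # projection onto the constraint boundary
    return [ui + scale * ai for ui, ai in zip(u_nom, a)]


if __name__ == "__main__":
    # Agent at (2, 0) commanded straight toward an obstacle of radius 1 at the origin.
    print(cbf_safety_filter((2.0, 0.0), (-1.0, 0.0), (0.0, 0.0), 1.0))
```

The filter only modifies the command when the barrier condition would be violated, which is the usual minimally invasive behavior of CBF-based safety filters.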

Abstract:
Grasp detection methods typically target the detection of a set of free-floating hand poses that can grasp the object. However, not all of the detected grasp poses are executable due to physical constraints. Even though it is straightforward to filter invalid grasp poses in the post-process, such a two-staged approach is computationally inefficient, especially when the constraint is hard. In this work, we propose an approach to take the following two constraints into account during the grasp detection stage, namely, (i) the picked object must be able to be placed with a predefined configuration without in-hand manipulation, and (ii) it must be reachable by the robot under the joint limit and collision-avoidance constraints for both pick and place cases. Our key idea is to train an SE(3) grasp diffusion network to estimate the noise in the form of spatial velocity, and constrain the denoising process by a multi-target differential inverse kinematics with an inequality constraint, so that the states are guaranteed to be reachable and placement can be performed without collision. In addition to an improved success ratio, we experimentally confirmed that our approach is more efficient and consistent in computation time compared to a naive two-stage approach.

Abstract:
Generating overtaking trajectories in autonomous racing is a challenging task, as the trajectory must satisfy the vehicle’s dynamics and ensure safety and real-time performance running on resource-constrained hardware. This work proposes the Fast and Safe Data-Driven Planner to address this challenge. Sparse Gaussian predictions are introduced to improve both the computational efficiency and accuracy of opponent predictions. Furthermore, the proposed approach employs a bi-level quadratic programming framework to generate an overtaking trajectory leveraging the opponent predictions. The first level uses polynomial fitting to generate a rough trajectory, from which reference states and control inputs are derived for the second level. The second level formulates a model predictive control optimization problem in the Frenet frame, generating a trajectory that satisfies both kinematic feasibility and safety. Experimental results on the F1TENTH platform show that our method outperforms the state of the art, achieving an 8.93% higher overtaking success rate, allowing the maximum opponent speed, ensuring a smoother ego trajectory, and reducing computational time by 74.04% compared to the Predictive Spliner method. The code is available at: https://github.com/ZJU-DDRX/FSDP.

Abstract:
This paper proposes a method for generating maps in indoor environments that include transparent objects by using a stereo polarization camera and projector. Conventional sensors like LiDAR and stereo cameras struggle with glass, as they rely on diffuse reflection, while glass allows light to pass through. In contrast, polarization cameras can measure light polarization and estimate surface normals, enabling depth estimation by combining polarization and RGB information. However, when measuring transparent objects, reflected and transmitted light cancel each other out, reducing polarization contrast, and the RGB information causes the depth estimation to output the depth of objects behind the glass. To address this issue, this paper proposes a novel method that (1) improves the S/N ratio in polarization measurement via diffuse reflection on non-glass regions and (2) masks out the RGB color from polarimetric depth estimation so that the depth of objects behind the glass is not computed, yielding depth images that include glass surfaces. Additionally, (3) in the mapping part, depth estimation is repeated at multiple locations, and the results are integrated using self-localization to generate a complete environmental map. Experiments in an indoor environment confirmed the effectiveness of the proposed method, enabling glass-inclusive depth estimation and successful map generation on a mobile robot.

Abstract:
Thermal cameras capture environmental data through heat emission, a fundamentally different mechanism compared to visible light cameras, which rely on pinhole imaging. As a result, traditional visual relocalization methods designed for visible light images are not directly applicable to thermal images. Despite significant advancements in deep learning for camera relocalization, approaches specifically tailored for thermal camera-based relocalization remain underexplored. To address this gap, we introduce ThermalLoc, a novel end-to-end deep learning method for thermal image relocalization. ThermalLoc effectively extracts both local and global features from thermal images by integrating EfficientNet with Transformers, and performs absolute pose regression using two MLP networks. We evaluated ThermalLoc on both the publicly available thermal-odometry dataset and our own dataset. The results demonstrate that ThermalLoc outperforms existing representative methods employed for thermal camera relocalization, including AtLoc, MapNet, PoseNet, and RobustLoc, achieving superior accuracy and robustness.

Abstract:
This paper presents a novel framework for Human Activity Recognition (HAR) by unifying all sensor streams, visual, audio, and inertial, into a single textual domain, enabling the direct application of GPT-3 for multimodal data classification. Unlike traditional pipelines that use dedicated encoders for each modality, we show that converting sensor outputs into text tokens offers both simplicity and a powerful proof-of-concept for large language models (LLMs). To further boost performance, we introduce a composite loss function combining cross-entropy, Kullback-Leibler divergence, total variation, and multimodal consistency terms, ensuring both temporal smoothness and cross-modal alignment. We conduct extensive experiments on the CMU-MMAC dataset, achieving up to 98% accuracy and significantly outperforming baseline methods. We also demonstrate robustness under missing sensor streams via partial tokenization, maintaining strong performance despite sensor failures. These results highlight the potential of LLM-driven HAR for enhanced human-robot interaction in real-world scenarios, and pave the way for broader multimodal applications of next-generation language models.

Abstract:
One-shot global localization is crucial in many robotic applications, providing significant advantages during initialization and relocalization processes. However, LiDAR-based one-shot global localization methods encounter challenges, including local feature matching errors, sensitivity to dynamic objects, and computational complexity in the absence of an initial pose. To address these issues, we propose a one-shot LiDAR-semantic-graph-based global localization method. To mitigate the interference of dynamic objects on localization, we extract stable semantic objects from LiDAR point clouds using dynamic curved voxel clustering and subsequently construct a semantic graph. Furthermore, we leverage the distribution characteristics of the semantic objects to quickly filter candidate retrievals and construct a cost matrix for the Hungarian algorithm, utilizing a semantic topological histogram to solve vertex matching. This yields a coarse pose estimate, which is subsequently refined using Fast-GICP. We demonstrate the superior localization performance compared to existing state-of-the-art methods on multiple large-scale outdoor datasets, including MulRan, MCD, and Apollo. Our method will be open-sourced and accessible at: https://github.com/Hfx-J/SGGL.

Abstract:
This paper presents a novel robotic system designed to address the challenges of robotic-assisted Laparo-Endoscopic Single-Site (LESS) surgery. While numerous studies have explored mechanisms enabling Remote Center of Motion (RCM) for minimally invasive procedures, few have investigated the interactions of multiple (≥2) manipulators within an integrated system. The proposed robotic system features an arc-shaped mainframe for positioning and mounting of a five-bar Spherical Parallel Mechanism (SPM). A conventional straight-shaft surgical tool is inserted through a linear guide rail aligned with the pointing axis leading into the RCM. To account for the physical dimensions of the closed-chain linkages, a modified Denavit-Hartenberg parameterization is adopted to assign spherical linkage frames. This design approach ensures self-collision avoidance within the parallel mechanism and enables systematic evaluation of the distal end-effector’s linear motion characteristics. Furthermore, we investigate the basic functions of SPM manipulators through a prototype experiment, providing preliminary insights that could inform future enhancement of surgical techniques for robotic-assisted LESS procedures.

Abstract:
Event-based cameras offer significant advantages due to their high temporal resolution and low power consumption. However, when deploying multiple such cameras, a critical challenge emerges: each camera operates on an independent time system, resulting in temporal misalignment that severely degrades performance in multi-event camera applications. Traditional hardware-based synchronization methods face significant limitations in compatibility and are impractical for wide-baseline configurations. We introduce EventSync, a software-based algorithm that achieves millisecond-level synchronization by exploiting the motion of objects in the cameras’ shared field of view, while simultaneously estimating the relative orientation between cameras. Our approach eliminates the need for physical connections, making it particularly valuable for wide-baseline deployments. Through comprehensive evaluation in both simulated environments and real-world indoor/outdoor scenarios, we demonstrate robust synchronization accuracy and precise extrinsic calibration across varying camera configurations, significantly outperforming existing methods. Code: https://github.com/wlxing1901/event-sync

Abstract:
Path planning is usually solved by addressing either the (high-level) route planning problem (waypoint sequencing to achieve the final goal) or the (low-level) path planning problem (trajectory prediction between two waypoints avoiding collisions). However, real-world problems usually require simultaneous solutions to the route and path planning subproblems with a holistic and efficient approach. In this paper, we introduce NaviFormer, a deep reinforcement learning model based on a Transformer architecture that solves the global navigation problem by predicting both high-level routes and low-level trajectories. To evaluate NaviFormer, several experiments have been conducted, including comparisons with other algorithms. Results show competitive accuracy from NaviFormer since it can understand the constraints and difficulties of each subproblem and act accordingly to improve performance. Moreover, its superior computation speed proves its suitability for real-time missions.

Abstract:
Hand motion monitoring plays a crucial role in fields such as human-machine interaction and rehabilitation training. Currently, electronic sensors are commonly used for hand motion monitoring. However, they are confronted with issues such as susceptibility to electromagnetic interference and sweat stains. Fiber Bragg Grating (FBG) sensors are small in size, highly sensitive, and possess good biocompatibility. In this paper, a flexible distributed Fiber Bragg Grating sensor is introduced. Emphasis is laid on the optimization and fabrication of the sensor, and performance tests are carried out on the fabricated sensor. To verify the potential of the sensor in hand motion monitoring, experiments are conducted. In the gesture recognition experiment, the Vision Transformer (ViT) model is utilized to classify eight types of gestures, and the final accuracy reaches 96.5%. In the wrist joint angle measurement experiment, the Pearson correlation coefficient between the physical angle and the measured angle is 0.985. In the grasping experiment, individual differences are reflected by the standard deviation during the grasping process. The experiments have demonstrated that the proposed sensor has the potential for monitoring hand motions.
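
For reference, the Pearson correlation coefficient used to compare physical and measured wrist angles can be computed as in the sketch below. The angle values in the usage example are made up for illustration and are not data from the paper.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0.0 or var_y == 0.0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / math.sqrt(var_x * var_y)


if __name__ == "__main__":
    # Hypothetical physical vs. sensor-derived wrist angles in degrees.
    physical = [0.0, 10.0, 20.0, 30.0, 40.0]
    measured = [0.5, 9.2, 21.1, 29.4, 41.0]
    print(pearson_r(physical, measured))
```

A value close to 1 indicates that the measured angle tracks the physical angle almost linearly; it does not by itself bound the absolute angle error.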

Abstract:
This paper presents a wireless collaborative inference framework optimized for deep learning-based weed instance segmentation on resource-limited weeding robots. Traditional Mask R-CNN struggles with detecting small weeds, suffers from low recall rates, and exhibits the checkerboard effect in segmentation results. To address these challenges, we introduce three key improvements: a feature fusion strategy in the backbone network to enhance small object detection, an improved Region Proposal Network (RPN) with Soft-NMS to reduce false positives and missed detections in complex environments, and a refined mask branch incorporating fully connected upsampling to mitigate checkerboard effects. Additionally, knowledge distillation is employed to compress the model, significantly improving inference speed while maintaining segmentation accuracy. To further enhance inference efficiency, we propose a two-stage approach for determining the optimal partition point and develop a resource-aware optimization algorithm that dynamically adjusts to fluctuating network bandwidth and computational constraints. Experimental evaluations confirm that the proposed approach surpasses existing methods and remains stable across varying resource conditions. A real-world implementation of a drone-server system validates the feasibility of the framework, showcasing its potential for robust and scalable weed detection and segmentation in precision agriculture applications.
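
Soft-NMS, mentioned above for the improved RPN, decays the scores of overlapping boxes instead of discarding them outright. The sketch below shows the Gaussian variant in plain Python. The abstract does not say which variant or hyperparameters are used, so the Gaussian decay, `sigma`, and `score_thresh` are assumptions; boxes are `(x1, y1, x2, y2)` tuples.

```python
import math


def iou(a, b):
    """Intersection over union of two (x1, y1, x2, y2) boxes."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def soft_nms(boxes, scores, sigma=0.5, score_thresh=0.001):
    """Gaussian Soft-NMS: returns (box, score) pairs in selection order."""
    candidates = list(zip(boxes, scores))
    kept = []
    while candidates:
        # Select the highest-scoring remaining box.
        best = max(range(len(candidates)), key=lambda i: candidates[i][1])
        best_box, best_score = candidates.pop(best)
        kept.append((best_box, best_score))
        # Decay the scores of the rest according to their overlap with it.
        decayed = []
        for box, score in candidates:
            new_score = score * math.exp(-(iou(best_box, box) ** 2) / sigma)
            if new_score >= score_thresh:
                decayed.append((box, new_score))
        candidates = decayed
    return kept


if __name__ == "__main__":
    boxes = [(0, 0, 10, 10), (1, 1, 11, 11), (20, 20, 30, 30)]
    for box, score in soft_nms(boxes, [0.9, 0.8, 0.7]):
        print(box, round(score, 4))
```

Compared with hard NMS, a heavily overlapping box survives with a reduced score, which is why Soft-NMS tends to help recall when small objects sit close together.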

Abstract:
Autonomous error correction is critical for domestic robots to achieve reliable execution of complex long-horizon tasks. Prior work has explored self-reflection in Large Language Models (LLMs) for task planning error correction; however, existing methods are constrained by inflexible self-reflection mechanisms that limit their effectiveness. Motivated by these limitations and inspired by human cognitive adaptation, we propose the Flexible Constructivism Reflection Framework (FCRF), a novel Mentor-Actor architecture that enables LLMs to perform flexible self-reflection based on task difficulty, while constructively integrating historical valuable experience with failure lessons. We evaluated FCRF on diverse domestic tasks through simulation in AlfWorld and physical deployment in the real-world environment. Experimental results demonstrate that FCRF significantly improves overall performance and self-reflection flexibility in complex long-horizon robotic tasks. Website at https://mongoosesyf.github.io/FCRF.github.io/

Abstract:
Pedestrian localization in Non-Line-of-Sight (NLoS) regions within urban environments poses a significant challenge for autonomous driving systems. While mmWave radar has demonstrated potential for detecting objects in such scenarios, the 2D radar point cloud (PCD) data is susceptible to distortions caused by multipath reflections, making accurate spatial inference difficult. Additionally, although camera images provide high-resolution visual information, they lack depth perception and cannot directly observe objects in NLoS regions. In this paper, we propose a novel framework that interprets radar PCD through road layout inferred from camera for localization of NLoS pedestrians. The proposed method leverages visual information from the camera to interpret 2D radar PCD, enabling spatial scene reconstruction. The effectiveness of the proposed approach is validated through experiments conducted using a radar-camera system mounted on a real vehicle. The localization performance is evaluated using a dataset collected in outdoor NLoS driving environments, demonstrating the practical applicability of the method.

Abstract:
Animal musculoskeletal systems are renowned for their ability to dynamically regulate stiffness and achieve energy-efficient motion. Inspired by the biological control structure, this study presents a hybrid control framework that uses a two-stage learning process for body movement planning and muscle force computation. This methodology simplifies learning under joint and muscle redundancy and enhances the interpretability of the generated behaviors. The framework incorporates a reinforcement learning (RL)-trained joint controller to optimize joint torques, in conjunction with an LSTM-based muscle controller that translates these torques into muscle activations. Two control variants are proposed: one prioritizes energy efficiency, and the other enhances adaptability to environmental perturbations through co-contraction control. Validation with MuJoCo physics simulations demonstrates the framework’s capacity to autonomously learn and refine different gait modes without dependence on external motion datasets. The second variant demonstrates superior robustness and energy efficiency compared to conventional motor-driven models. This framework contributes to the enhancement of adaptability in complex scenarios dealing with the redundancy problem of musculoskeletal system coordination and holds potential for the development of bio-inspired locomotion control through the optimization of muscle activity composition.

Abstract:
Visual actionable affordance has emerged as a transformative approach in robotics, focusing on perceiving interaction areas prior to manipulation. Traditional methods rely on pixel sampling to identify successful interaction samples or on processing point clouds for affordance mapping. However, these approaches are computationally intensive and struggle to adapt to diverse and dynamic environments. This paper introduces ManipGPT, a framework designed to predict optimal interaction areas for articulated objects using a large pretrained vision transformer (ViT). We create a dataset of 9.9k simulated and real images to bridge the visual sim-to-real gap and enhance real-world applicability. By fine-tuning the vision transformer on this small dataset, we significantly improve part-level affordance segmentation, adapting the model’s in-context segmentation capabilities to robot manipulation scenarios. This enables effective manipulation across simulated and real-world environments by generating part-level affordance masks, paired with an impedance adaptation policy, eliminating the need for complex datasets or perception systems. Our project page is available at: https://lxkim814.github.io/ManipGPT_website/

Abstract:
High-fidelity personalized human musculoskeletal models are crucial for simulating realistic behavior of physically coupled human-robot interactive systems and verifying their safety-critical applications in simulations before actual deployment, such as human-robot co-transportation and rehabilitation through robotic exoskeletons. Identifying subject-specific Hill-type muscle model parameters and bone dynamic parameters is essential for a personalized musculoskeletal model, but it is very challenging because the internal biomechanical variables, especially the joint torques, are difficult to measure directly in vivo. In this paper, we propose using a Differentiable MusculoSkeletal Model (Diff-MSM) to simultaneously identify its muscle and bone parameters with an end-to-end automatic differentiation technique that differentiates from the measurable muscle activation, through the joint torque, to the resulting observable motion, without the need to measure the internal joint torques. Extensive comparative simulations show that our proposed method significantly outperforms the state-of-the-art baseline methods, especially in terms of accurate estimation of the muscle parameters (with initial guesses sampled from a normal distribution whose mean is the ground truth and whose standard deviation is 10% of the ground truth, the average percentage error of the estimated values can be as low as 0.05%). In addition to human musculoskeletal modeling and simulation, the new parameter identification technique with the Diff-MSM has great potential to enable new applications in muscle health monitoring, rehabilitation, and sports science.

Abstract:
Autonomous visual navigation is an essential element in robot autonomy. Reinforcement learning (RL) offers a promising policy training paradigm. However, existing RL methods suffer from high sample complexity, poor sim-to-real transfer, and limited runtime adaptability. These problems are particularly challenging for drones, which have complex nonlinear and unstable dynamics and strong dynamic coupling between control and perception. In this paper, we propose a novel framework that integrates 3D Gaussian Splatting (3DGS) with differentiable deep reinforcement learning (DDRL) to train vision-based drone navigation policies. By leveraging high-fidelity 3D scene representations and differentiable simulation, our method improves sample efficiency and sim-to-real transfer. Additionally, we incorporate a Context-aided Estimator Network (CENet) to adapt to environmental variations at runtime. Moreover, by curriculum training in a mixture of different surrounding environments, we achieve in-task generalization, the ability to solve new instances of a task not seen during training. Drone hardware experiments demonstrate our method’s high training efficiency compared to state-of-the-art RL methods, zero-shot sim-to-real transfer for real robot deployment without fine-tuning, and the ability to adapt to new instances within the same task class (e.g., flying through a gate at different locations with different distractors in the environment). Our simulator and training framework are open-sourced at: https://github.com/Qianzhong-Chen/grad_nav.

Abstract:
Articulated object manipulation remains a critical challenge in robotics due to the complex kinematic constraints and the limited physical reasoning of existing methods. In this work, we introduce ArtGS, a novel framework that extends 3D Gaussian Splatting (3DGS) by integrating visual-physical modeling for articulated object understanding and interaction. ArtGS begins with multi-view RGB-D reconstruction, followed by reasoning with a vision-language model (VLM) to extract semantic and structural information, particularly the articulated bones. Through dynamic, differentiable 3DGS-based rendering, ArtGS optimizes the parameters of the articulated bones, ensuring physically consistent motion constraints and enhancing the manipulation policy. By leveraging dynamic Gaussian splatting, cross-embodiment adaptability, and closed-loop optimization, ArtGS establishes a new framework for efficient, scalable, and generalizable articulated object modeling and manipulation. Experiments conducted in both simulation and real-world environments demonstrate that ArtGS significantly outperforms previous methods in joint estimation accuracy and manipulation success rates across a variety of articulated objects. Additional images and videos are available on the project website: sites.google.com/view/artgs.

Abstract:
Deep learning-based semantic segmentation models achieve impressive results yet remain limited in handling distribution shifts between training and test data. In this paper, we present SDGPA (Synthetic Data Generation and Progressive Adaptation), a novel method that tackles zero-shot domain adaptive semantic segmentation, in which no target images are available, but only a text description of the target domain’s style is provided. To compensate for the lack of target domain training data, we utilize a pretrained off-the-shelf text-to-image diffusion model, which generates training images by transferring source domain images to target style. Directly editing source domain images introduces noise that harms segmentation because the layout of source images cannot be precisely maintained. To address inaccurate layouts in synthetic data, we propose a method that crops the source image, edits small patches individually, and then merges them back together, which helps improve spatial precision. Recognizing the large domain gap, SDGPA constructs an augmented intermediate domain, leveraging easier adaptation subtasks to enable more stable model adaptation to the target domain. Additionally, to mitigate the impact of noise in synthetic data, we design a progressive adaptation strategy, ensuring robust learning throughout the training process. Extensive experiments demonstrate that our method achieves state-of-the-art performance in zero-shot semantic segmentation. The code is available at https://github.com/ROUJINN/SDGPA

Abstract:
In recent years, lightweight large language models (LLMs) have garnered significant attention in the robotics field due to their low computational resource requirements and suitability for edge deployment. However, in task planning—particularly for complex tasks that involve dynamic semantic logic reasoning—lightweight LLMs have underperformed. To address this limitation, we propose a novel task planner, LightPlanner, which enhances the performance of lightweight LLMs in complex task planning by fully leveraging their reasoning capabilities. Unlike conventional planners that use fixed skill templates, LightPlanner controls robot actions via parameterized function calls, dynamically generating parameter values. This approach allows for fine-grained skill control and improves task planning success rates in complex scenarios. Furthermore, we introduce hierarchical deep reasoning. Before generating each action decision step, LightPlanner thoroughly considers three levels: action execution (feedback verification), semantic parsing (goal consistency verification), and parameter generation (parameter validity verification). This ensures the correctness of subsequent action controls. Additionally, we incorporate a memory module to store historical actions, thereby reducing context length and enhancing planning efficiency for long-term tasks. We train the LightPlanner-1.5B model on our LightPlan-40k dataset, which comprises 40,000 action controls across tasks with 2 to 13 action steps. Experiments demonstrate that our model achieves the highest task success rate despite having the smallest number of parameters. In tasks involving spatial semantic reasoning, the success rate exceeds that of ReAct by 14.9%. Moreover, we demonstrate LightPlanner’s potential to operate on edge devices.

Abstract:
We present a novel multi-altitude camera pose estimation system, addressing the challenges of robust and accurate localization across varied altitudes when only considering sparse image input. The system effectively handles diverse environmental conditions and viewpoint variations by integrating the cross-view transformer, deep features, and structure-from-motion into a unified framework. To benchmark our method and foster further research, we introduce two newly collected datasets specifically tailored for multi-altitude camera pose estimation; datasets of this nature remain rare in the current literature. The proposed framework has been validated through extensive comparative analyses on these datasets, demonstrating that our system achieves superior performance in both accuracy and robustness for multi-altitude sparse pose estimation tasks compared to existing solutions, making it well suited for real-world robotic applications such as aerial navigation, search and rescue, and automated inspection.

Abstract:
Invasive flexible neural electrodes are becoming increasingly prevalent in monitoring and modulating brain neural activity, necessitating the precise and minimally invasive implantation of these electrodes to a depth of a few millimeters beneath the cerebral surface. Although Neuralink has pioneered robot-assisted neural electrode implantation guided by microscopy, it currently lacks the ability to detect non-cerebral surface microvessels that are invisible under the white-light microscope, leading to inaccurate implantation planning and a high risk of trauma. To address this limitation, we introduce a vascular-enhanced strategy that fuses intraoperative white-light microscopy and preoperative photoacoustic microscopy and applies the fusion results to our established microsurgical robotic system for brain electrode implantation. Specifically, a multi-modality data preprocessing pipeline is devised to extract representative features, and a 2.5D fusion network that incorporates a depth encoding mechanism is proposed to predict cross-modality correspondence. The enhanced fusion results are utilized for implantation planning and intraoperative guidance during in vivo surgical procedures. Both quantitative and qualitative results are presented to demonstrate the effectiveness of our proposed cross-modality fusion methods. Furthermore, in vivo surgical implementations on mice underscore the potential of the proposed approach for achieving more precise and minimally invasive brain electrode implantation.

Abstract:
Rolling contact kinematics plays a vital role in dexterous manipulation and rolling-based locomotion. Yet, in practical applications, the environments and objects involved are often captured as discrete point clouds, creating substantial difficulties for traditional motion control and planning frameworks that rely on continuous surface representations. In this work, we propose a differential geometry-based framework that models point cloud data for continuous rolling contact using locally parameterized representations. Our approach leverages skeletonization to define a rotational reference structure for rolling interactions and applies a Fourier-based curve fitting technique to extract and represent meaningful controllable local geometric structure. We further introduce a novel 2D manifold coordinate system tailored to arbitrary surface curves, enabling local parameterization of complex shapes. The governing kinematic equations for rolling contact are then derived, and we demonstrate the effectiveness of our method through simulations on various object examples.
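
As a rough illustration of the Fourier-based curve fitting step mentioned above, the sketch below fits a truncated Fourier series to samples of a closed planar curve given in polar form. It assumes the curve can be written as a radius function r(theta); the function names, the polar parameterization, and the sample curve are illustrative assumptions, not the representation used in the paper.

```python
import numpy as np

def fit_fourier(theta, r, order):
    """Least-squares fit of r(theta) = a0 + sum_k (a_k cos(k theta) + b_k sin(k theta)).

    Returns coefficients ordered as [a0, a1, b1, a2, b2, ...].
    """
    cols = [np.ones_like(theta)]
    for k in range(1, order + 1):
        cols.append(np.cos(k * theta))
        cols.append(np.sin(k * theta))
    design = np.column_stack(cols)
    coeffs, *_ = np.linalg.lstsq(design, r, rcond=None)
    return coeffs

def eval_fourier(coeffs, theta):
    """Evaluate the fitted series at the given angles."""
    order = (len(coeffs) - 1) // 2
    out = np.full_like(theta, coeffs[0], dtype=float)
    for k in range(1, order + 1):
        out += coeffs[2 * k - 1] * np.cos(k * theta)
        out += coeffs[2 * k] * np.sin(k * theta)
    return out

# Hypothetical sampled cross-section: a circle with two harmonic bumps.
theta = np.linspace(0.0, 2.0 * np.pi, 200, endpoint=False)
r = 1.0 + 0.3 * np.cos(2 * theta) + 0.1 * np.sin(3 * theta)
coeffs = fit_fourier(theta, r, order=3)
```

Because the fitted curve is an analytic function of theta, its derivatives (and hence local curvature) can be evaluated in closed form, which is the property that makes such a representation attractive for contact kinematics.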

Abstract:
Robot skill acquisition processes driven by reinforcement learning often rely on simulations to efficiently generate large-scale interaction data. However, the absence of simulation models for tactile sensors has hindered the use of tactile sensing in such skill learning processes, limiting the development of effective policies driven by tactile perception. To bridge this gap, we present TwinTac, a system that combines the design of a physical tactile sensor with its digital twin model. Our hardware sensor is designed for high sensitivity and a wide measurement range, enabling high quality sensing data essential for object interaction tasks. Building upon the hardware sensor, we develop the digital twin model using a real-to-sim approach. This involves collecting synchronized cross-domain data, including finite element method results and the physical sensor’s outputs, and then training neural networks to map simulated data to real sensor responses. Through experimental evaluation, we characterized the sensitivity of the physical sensor and demonstrated the consistency of the digital twin in replicating the physical sensor’s output. Furthermore, by conducting an object classification task, we showed that simulation data generated by our digital twin sensor can effectively augment real-world data, leading to improved accuracy. These results highlight TwinTac’s potential to bridge the gap in cross-domain learning tasks.

Abstract:
Graph Convolutional Networks (GCNs) have proven to be highly effective for skeleton-based action recognition, primarily due to their ability to leverage graph topology for feature aggregation, a key factor in extracting meaningful representations. However, despite their success, GCNs often struggle to effectively distinguish between ambiguous actions, revealing limitations in the representation of learned topological and spatial features. To address this challenge, we propose Gaussian Topology Refinement Gated Graph Convolution (G3CN), a novel approach for distinguishing ambiguous actions in skeleton-based action recognition. G3CN incorporates a Gaussian filter to refine the skeleton topology graph, improving the representation of ambiguous actions. Additionally, Gated Recurrent Units (GRUs) are integrated into the GCN framework to enhance information propagation between skeleton points. Our method shows strong generalization across various GCN backbones. Extensive experiments on NTU RGB+D, NTU RGB+D 120, and NW-UCLA benchmarks demonstrate that G3CN effectively improves action recognition, particularly for ambiguous samples.

Abstract:
This paper presents the design, modeling, and experimental validation of a biomimetic robotic butterfly (BRB) that integrates a compliant mechanism to achieve coupled wing-abdomen motion. Drawing inspiration from the natural flight dynamics of butterflies, a theoretical model is developed to investigate the impact of abdominal undulation on flight performance. To validate the model, motion capture experiments are conducted on three configurations: a BRB without an abdomen, with a fixed abdomen, and with an undulating abdomen. The results demonstrate that abdominal undulation enhances lift generation, extends flight duration, and stabilizes pitch oscillations, thereby improving overall flight performance. These findings underscore the significance of wing-abdomen interaction in flapping-wing aerial vehicles (FWAVs) and lay the groundwork for future advancements in energy-efficient biomimetic flight designs.

Abstract:
Vehicle localization is a critical component in the planning and navigation of autonomous driving systems. Traditional vehicle localization methods generally rely on the Global Navigation Satellite System (GNSS) for self-localization. Unfortunately, GNSS can become unreliable and may fail in urban canyons, under trees, and beneath overpasses. To address this problem, we propose a visual localization framework assisted by offline Google satellite maps for GNSS-weak or GNSS-denied environments. We also introduce a learning-based ground-to-satellite map feature matching method to mitigate the long-term cumulative drift of visual odometry. To reduce the negative impact of cross-view matching errors on localization accuracy, we propose a novel cross-view pose selection method to build two pose uncertainty models. Moreover, we combine the proposed method with classical SLAM methods to develop a vehicle localization framework. To verify the performance of the proposed method, we compare its accuracy against state-of-the-art fusion localization methods and feature matching methods. Experimental results indicate that the proposed method achieves the best localization performance among the compared state-of-the-art methods, with root mean square errors of 0.290 m and 0.014 rad on KITTI-05. The implementation code of this paper will be open-sourced at https://github.com/NEU-REAL/visualLocalization-with-satelliteMap.

Abstract:
Bearing-only Target Motion Analysis (TMA) is a promising technique for passive tracking in various applications as a bearing angle is easy to measure. Despite its advantages, bearing-only TMA is challenging due to the nonlinearity of the bearing measurement model and the lack of range information, which impairs observability and estimator convergence. This paper addresses these issues by proposing a Recursive Total Least Squares (RTLS) method for online target localization and tracking using mobile observers. The RTLS approach, inspired by previous results on Total Least Squares (TLS), mitigates biases in position estimation and improves computational efficiency compared to pseudo-linear Kalman filter (PLKF) methods. Additionally, we propose a circumnavigation controller to enhance system observability and estimator convergence by guiding the mobile observer in orbit around the target. Extensive simulations and experiments are performed to demonstrate the effectiveness and robustness of the proposed method. The proposed algorithm is also compared with the state-of-the-art approaches, which confirms its superior performance in terms of both accuracy and stability.
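
To make the estimation problem concrete, here is a minimal batch sketch for a stationary target, assuming 2D positions and bearings measured as atan2(dy, dx) from observer to target. It shows the pseudo-linear least-squares form and a batch total least squares (TLS) solution via SVD; it is not the recursive RTLS algorithm of the paper, and all function names are illustrative.

```python
import numpy as np

def bearing_rows(observers, bearings):
    """Build the pseudo-linear system A x = b from bearing measurements.

    Each bearing theta from observer p to target x satisfies
    sin(theta) * (x0 - p0) - cos(theta) * (x1 - p1) = 0.
    """
    s, c = np.sin(bearings), np.cos(bearings)
    A = np.column_stack([s, -c])
    b = s * observers[:, 0] - c * observers[:, 1]
    return A, b

def ls_estimate(A, b):
    """Ordinary least squares (treats only b as noisy)."""
    return np.linalg.lstsq(A, b, rcond=None)[0]

def tls_estimate(A, b):
    """Batch total least squares: smallest right singular vector of [A | b]."""
    _, _, vt = np.linalg.svd(np.column_stack([A, b]))
    v = vt[-1]
    return -v[:-1] / v[-1]

# Hypothetical scenario: the observer orbits the target, as a
# circumnavigation controller would make it do.
target = np.array([3.0, 4.0])
phi = np.linspace(0.0, 2.0 * np.pi, 12, endpoint=False)
observers = target + 5.0 * np.column_stack([np.cos(phi), np.sin(phi)])
delta = target - observers
bearings = np.arctan2(delta[:, 1], delta[:, 0])

A, b = bearing_rows(observers, bearings)
estimate_ls = ls_estimate(A, b)
estimate_tls = tls_estimate(A, b)
```

With noise-free bearings both estimators recover the target. The two differ once the bearings are noisy, because the noise then enters A as well as b, which is the situation TLS is designed for.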

Abstract:
Autonomous valet parking enables vehicles to identify parking spaces and park without human intervention, with accurate localization being a fundamental prerequisite. Existing methods typically rely on visual feature maps or semantic maps for localization. However, visual feature maps often lack robustness in underground parking environments due to similar structures, weak textures, and fluctuating lighting conditions. Semantic maps require complex post-processing and suffer from heterogeneous data association problems. In this paper, we propose to directly regress the semantic corner points to build the semantic map. Furthermore, we introduce a novel map update and merge method, which is unaffected by environmental and temporal changes, allowing continuous update and refinement of the map. To establish a global semantic map, we use four fisheye cameras to synthesize surround-view images, combined with an IMU (Inertial Measurement Unit) and wheel encoders. Real-world experiments validate the localization accuracy of the proposed system. The experimental results demonstrate the robustness and practicability of the proposed system.

Abstract:
Sampling-based model predictive controllers generate trajectories by sampling control inputs from a fixed, simple distribution such as the normal or uniform distribution. This sampling method yields trajectory samples that are tightly clustered around a mean trajectory. This clustering behavior, in turn, limits the exploration capability of the controller and reduces the likelihood of finding feasible solutions in complex environments. Recent work has attempted to address this problem by either reshaping the resulting trajectory distribution or increasing the sample entropy to enhance diversity and promote exploration. In our recent work, we introduced the concept of C-Uniform trajectory generation [1], which allows the computation of control input probabilities to generate trajectories that sample the configuration space uniformly. In this work, we first address the main limitation of this method: lack of scalability due to computational complexity. We introduce Neural C-Uniform, an unsupervised C-Uniform trajectory sampler that mitigates scalability issues by computing control input probabilities without relying on a discretized configuration space. Experiments show that Neural C-Uniform achieves a similar uniformity ratio to the original C-Uniform approach and generates trajectories over a longer time horizon while preserving uniformity. Next, we present CU-MPPI, which integrates Neural C-Uniform sampling into existing MPPI variants. We analyze the performance of CU-MPPI in simulation and real-world experiments. Our results indicate that in settings where the optimal solution has high curvature, CU-MPPI leads to drastic improvements in performance. Additionally, it performs as well as or better than baseline methods in dynamic environments. Additional results can be found at the project website.
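
For readers unfamiliar with the baseline being modified, the sketch below shows the standard MPPI update: sampled control perturbations are averaged with weights that decay exponentially in rollout cost. It is a generic illustration of that weighting, not the Neural C-Uniform sampler or CU-MPPI; the function names and the `temperature` parameter name are assumptions.

```python
import numpy as np

def mppi_weights(costs, temperature):
    """Softmin weights over sampled rollouts: w_k is proportional to exp(-cost_k / temperature)."""
    costs = np.asarray(costs, dtype=float)
    shifted = costs - costs.min()  # subtract the minimum for numerical stability
    weights = np.exp(-shifted / temperature)
    return weights / weights.sum()

def mppi_update(nominal, perturbations, costs, temperature):
    """Shift the nominal control sequence by the cost-weighted mean perturbation.

    nominal:       (H,) control sequence over horizon H
    perturbations: (K, H) sampled perturbations for K rollouts
    costs:         (K,) total cost of each rollout
    """
    weights = mppi_weights(costs, temperature)
    return nominal + np.tensordot(weights, perturbations, axes=1)

# Two hypothetical rollouts over a two-step horizon: the first is cheap,
# the second is expensive, so the update follows the first perturbation.
nominal = np.zeros(2)
perturbations = np.array([[1.0, 1.0], [-1.0, -1.0]])
updated = mppi_update(nominal, perturbations, costs=[0.0, 100.0], temperature=1.0)
```

Since the update is a weighted mean of the sampled perturbations, samples clustered around the nominal sequence can only move it locally, which is the exploration limitation the abstract describes.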

Abstract:
This paper presents a rapid 3D printing framework that enhances the efficiency of commercially available layered manufacturing-based 3D printers. Unlike traditional methods that simplify printing regions to a single point or rely on predefined entry and exit points, our approach utilizes an improved Traveling Salesman Problem (TSP) algorithm to autonomously generate an optimized, cyclic printing path, while automatically assigning entry and exit points for each region. This minimizes non-printing paths and improves efficiency. Additionally, we propose a principal axis calculation method for irregular shapes, aligning better with geometric orientation. This optimization enhances infill uniformity and surface smoothness. Simulations and experimental results demonstrate that the proposed framework improves printing efficiency while maintaining print quality, with promising applicability to large-scale and complex 3D printing models.
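
To illustrate why region ordering matters for non-printing travel, the sketch below orders region entry points with a plain nearest-neighbor heuristic and measures the resulting cyclic tour length. This is a deliberately simple stand-in, not the improved TSP algorithm of the paper, and it reduces each region to a single point, which is exactly the simplification the paper avoids.

```python
import math

def nearest_neighbor_tour(points, start=0):
    """Greedy visiting order: always move to the closest unvisited point."""
    unvisited = set(range(len(points))) - {start}
    order = [start]
    while unvisited:
        last = points[order[-1]]
        # Ties are broken by index so the result is deterministic.
        nxt = min(unvisited, key=lambda i: (math.dist(last, points[i]), i))
        order.append(nxt)
        unvisited.remove(nxt)
    return order

def tour_length(points, order):
    """Length of the cyclic tour that returns to the starting point."""
    n = len(order)
    return sum(
        math.dist(points[order[i]], points[order[(i + 1) % n]]) for i in range(n)
    )

# Hypothetical region centers (mm) listed in slicer output order.
regions = [(0.0, 0.0), (10.0, 0.0), (1.0, 0.0), (9.0, 0.0)]
order = nearest_neighbor_tour(regions)
travel_in_given_order = tour_length(regions, [0, 1, 2, 3])
travel_reordered = tour_length(regions, order)
```

In this toy case the reordered tour visits nearby regions consecutively and so travels a shorter distance than the order in which the regions were listed.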

Abstract:
To achieve human-like haptic perception in anthropomorphic grippers, the compliant sensing surfaces of vision-based tactile sensors (VTSs) must evolve from conventional planar configurations to biomimetically curved topographies with continuous surface gradients. However, planar VTSs face challenges when extended to curved surfaces, including insufficient lighting of surfaces, blurring in reconstruction, and complex spatial boundary conditions for surface structures. With the end goal of constructing a human-like fingertip, our research (i) develops GelSplitter3D by expanding imaging channels with a prism and a near-infrared (NIR) camera, (ii) proposes a photometric stereo neural network with a CAD-based normal ground truth generation method to calibrate tactile geometry, and (iii) devises a normal integration method with boundary constraints from depth prior information to correct the cumulative error of surface integrals. We demonstrate better tactile sensing performance, a 40% improvement in normal estimation accuracy, and the benefits of sensor shapes in grasping and manipulation tasks.

Abstract:
This paper introduces an upper limb postural optimization method for enhancing physical ergonomics and force manipulability during bimanual human-robot co-carrying tasks. Existing research typically emphasizes human safety or manipulative efficiency, whereas our proposed method uniquely integrates both aspects to strengthen collaboration across diverse conditions (e.g., different grasping postures of humans, and different shapes of objects). Specifically, the joint angles of a simplified human skeleton model are optimized by minimizing the cost function to prioritize safety and manipulative capability. To guide humans towards the optimized posture, the reference end-effector poses of the robot are generated through a transformation module. A bimanual model predictive impedance controller (MPIC) is proposed for our human-like robot, CURI, to recalibrate the end effector poses through planned trajectories. The proposed method has been validated through various subjects and objects during human-human collaboration (HHC) and human-robot collaboration (HRC). The experimental results demonstrate significant improvement in muscle conditions by comparing the activation of target muscles before and after optimization.

Abstract:
Drawing inspiration from the ability of fish to maintain efficient swimming over a wide range of speeds by tuning the stiffness of their tails, researchers have explored stiffness adjustment mechanisms in fish-like robots. Typically, existing mechanisms require extra actuators or power sources only for tuning stiffness, resulting in additional energy consumption and more complex structures. To address this, our study introduces an innovative fishtail featuring an online stiffness modulation mechanism that does not require additional actuators or power sources solely for stiffness adjustment. Through model-based simulations and experimental testing, we evaluated the effectiveness of the proposed method. The results demonstrate that the designed mechanism enables efficient swimming across a broader frequency range (0–4 Hz) compared to most servo-actuated platforms with adjustable stiffness reported in existing studies. The robot achieves a maximum average speed of 1.4 BL/s and a minimum cost of transport of 9.5 J/(m·kg).

Abstract:
In this paper, we propose a framework based on velocity obstacles to address the dynamic obstacle avoidance problem for constrained mobile robots. The framework establishes a nonlinear mapping from the control domain to the velocity space based on the robot’s kinematic model and input constraints. This mapping defines the Velocity Feasible Region (VFR) as the set of reachable velocities at the next time step. Utilizing the VFR, we propose a gradient field, called the Dynamically Constrained Gradient Velocity Obstacle (DGVO), to represent the feasible motion region for mobile robots. DGVO preserves the original feasible region of the mobile robot. Based on DGVO, we formulate an unconstrained gradient descent optimization problem to compute collision-free velocities in real time. This framework enables real-time online computation of collision-free velocities for any constrained mobile robot, and it exhibits strong robustness to sensor noise. Extensive simulations and real-world experiments have validated the effectiveness of the proposed method. An introduction to the entire work can be found at the following link: https://youtu.be/HrTNTSOhKvE.
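
For context, the classical velocity obstacle test that this line of work builds on can be sketched as follows, assuming disc-shaped robot and obstacle, constant velocities, and an infinite time horizon. It is the textbook geometric check, not the DGVO gradient field, and the function name is illustrative.

```python
import math

def in_velocity_obstacle(p_rel, v_rel, radius):
    """Return True if the relative velocity leads to a collision.

    p_rel:  obstacle position minus robot position
    v_rel:  robot velocity minus obstacle velocity
    radius: sum of robot and obstacle radii
    """
    px, py = p_rel
    vx, vy = v_rel
    if math.hypot(px, py) <= radius:
        return True  # the discs already overlap
    speed = math.hypot(vx, vy)
    if speed == 0.0:
        return False  # no relative motion, so the gap never closes
    if px * vx + py * vy <= 0.0:
        return False  # moving away from (or tangentially to) the obstacle
    # Perpendicular distance from the obstacle center to the velocity ray.
    miss_distance = abs(px * vy - py * vx) / speed
    return miss_distance < radius

# Obstacle 5 m ahead with a combined radius of 1 m.
head_on = in_velocity_obstacle((5.0, 0.0), (1.0, 0.0), 1.0)
diagonal = in_velocity_obstacle((5.0, 0.0), (1.0, 1.0), 1.0)
```

A planner built on this test would restrict candidate velocities to those the robot can reach at the next time step and pick a collision-free one, which is the role the VFR plays in the abstract.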

Abstract:
A common task in cinematography is tracking a subject or character through a scene. For complex setups, multiple cameras must track the subject simultaneously to attain sufficient coverage. Recently, researchers have considered using multiple camera-mounted autonomous mobile robots for this task. Existing work is limited to UAVs, which may be unavailable due to cost, safety requirements, or flight restrictions. Therefore, in this paper we present a tracking approach for complex and unstructured environments using differential-drive robots with gimbal-mounted cameras. Differential-drive robots pose a challenge, as their movement is more restricted than that of UAVs. To address this, we introduce a novel hierarchical planning framework which ensures safety and visibility while maximizing shot diversity. We begin by synthesizing a set of paths using sequential greedy viewpoint planning and conflict-based search under a set of optimal viewpoint constraints. These paths then form an initial guess for joint trajectory optimization, which synthesizes stable trajectories under the motion constraints of the robots and gimbals. Empirically, we show that our approach outperforms approaches aimed at UAVs, which may synthesize infeasible trajectories when applied to differential-drive robots.

Abstract:
An improved artificial bee colony (IABC) algorithm is presented for bias correction of fiber optic gyroscopes in inertial navigation systems. Its effectiveness is verified through an eight-orientation heading test conducted across two independent trials. In the absence of correction, heading deviations are evident, reaching approximately 0.056° and 0.079°, which reflect substantial bias accumulation. Conventional methods offer limited improvement and struggle to maintain consistency, yielding deviations near 0.06° and 0.041°. By contrast, the IABC algorithm reduces these deviations to below 0.046° and around 0.038°, respectively. The results confirm that the proposed approach provides more reliable compensation and enhances heading accuracy across diverse orientation scenarios.

Abstract:
Human joints enable precise bending for fine manipulation and complex movements. Similarly, robotic flexibility relies on bending structures, where accurate bending perception is crucial for precise control and enhanced human-robot interaction. This paper proposes a C-shaped fiber optic array that embeds a fiber Bragg grating sensor array into a 2 mm thick silicone layer, yielding a highly sensitive (300 pm/N) and electromagnetic interference-resistant bending sensor. The flexible sensor can sensitively detect external stimuli, such as the touch of a 1 g weight or a feather, and exhibits a good linear relationship with curvature, facilitating accurate curvature classification. Additionally, leveraging the wearable nature of the sensor, we achieved the detection of finger bending angles. Finally, by attaching the sensor to the wrist and combining it with deep learning algorithms, we achieved 100% gesture recognition accuracy. This sensor holds significant potential for applications in fields such as fruit size classification, rehabilitation healthcare, and human-robot interaction.

Abstract:
Recent advances in research have demonstrated that Vision-Language Models (VLMs) are a promising technology for robot task planning. This paper presents a novel approach that leverages visual prompts and VLMs to generate feasible robot action sequences for achieving shared tasks through human-robot collaboration while simultaneously estimating human intentions. Our method enhances VLMs’ understanding of the environment by utilizing annotations (bounding boxes and labels) and dynamically infers human intentions based on changing environmental conditions to generate optimal robot action sequences to achieve common goals. Additionally, the system incorporates a mechanism to regenerate new sequences through VLM analysis when action failures or external interference occur. Furthermore, by designing prompts as versatile modules for diverse tasks, our proposed technology offers a new approach to robot action planning that excels in both efficiency and adaptability.

Abstract:
Multi-task decoding from electroencephalogram (EEG) signals is of great value for brain-computer interaction (BCI) applications in natural scenes. Although most existing studies have concentrated on decoding significantly different tasks, few studies have explored the various cognitive processes that individuals may exhibit when elicited by the same stimulus. In practice, however, the diversity and complexity of individuals' cognitive responses to the same stimulus cannot be ignored. In this paper, we aimed to construct a paradigm of the various cognitive processes elicited by the same stimulus, explore the neural signatures, and decode the multiple cognitive processes from EEG signals. Experimental results show that the regularized linear discriminant analysis (RLDA) classifier with event-related spectral perturbation (ERSP) features yielded a decoding accuracy of 96.30%±3.40% for the multi-cognitions. In-depth research on signatures and decoding of various cognitive processes elicited by the same stimulus is of great significance for improving the naturalness and intelligence of BCI.

Abstract:
Pedestrian trajectory prediction ensures safe navigation in autonomous driving and intelligent robots. Existing methods have shown promising results but still face challenges in handling dynamic environments, social interactions, and high-dimensional data. In this paper, we propose a novel PhysGCN-DL within the iTransformer framework to address these challenges. Our model incorporates physically inspired dynamic interaction modeling by representing physical interactions between pedestrians as edge weights in graph convolution. This approach captures the heterogeneity of pedestrian movement and improves the interpretability of social interactions. Moreover, we design a novel loss function to jointly enhance prediction diversity and accuracy, thereby improving the model’s robustness across both dense and sparse scenarios. Empirical evaluations confirm that our approach outperforms existing methods in generating accurate and diverse pedestrian trajectories.

Abstract:
Swarm robotics plays a crucial role in enabling autonomous operations in dynamic and unpredictable environments. However, a major challenge remains: ensuring safe and efficient navigation in environments shared by both dynamic animate obstacles (e.g., humans) and dynamic inanimate obstacles (e.g., non-living objects). In this paper, we propose ImpedanceGPT, a novel system that leverages a Vision-Language Model (VLM) with a Retrieval-Augmented Generation (RAG) framework to enable real-time reasoning for adaptive navigation of a mini-drone swarm in complex environments. The key innovation of ImpedanceGPT lies in merging the VLM-RAG system with an impedance control method, which is an active compliance strategy. This system provides drones with an enhanced semantic understanding of their surroundings and allows them to dynamically adjust impedance control parameters in response to obstacle types and environmental conditions. Our approach not only ensures safe and precise navigation but also improves coordination between drones in the swarm. Experimental evaluations demonstrate the effectiveness of our system. The VLM-RAG framework achieved an obstacle detection and retrieval accuracy of 80% under optimal lighting. In static environments, drones navigated dynamic inanimate obstacles at 1.4 m/s but slowed to 0.7 m/s with an increased safety margin around humans. In dynamic environments, speed adjusted to 1.0 m/s near hard obstacles, while reducing to 0.6 m/s with a higher deflection region to safely avoid moving humans. Video of ImpedanceGPT: https://youtu.be/JTdeg9bAzL4 GitHub: https://github.com/Faryal-Batool/ImpedanceGPT
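
The idea of switching impedance parameters by obstacle type can be illustrated with a minimal sketch. It assumes a one-dimensional virtual mass-spring-damper driven by a constant virtual force; the parameter values, the `PARAMS_BY_OBSTACLE` table, and the `simulate_deflection` helper are hypothetical and are not taken from ImpedanceGPT.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpedanceParams:
    mass: float       # virtual mass [kg]
    damping: float    # virtual damping [N*s/m]
    stiffness: float  # virtual stiffness [N/m]


# Hypothetical parameter sets: a softer spring near humans yields a larger
# deflection for the same virtual force than the stiffer inanimate setting.
PARAMS_BY_OBSTACLE = {
    "human": ImpedanceParams(mass=1.0, damping=4.0, stiffness=4.0),
    "inanimate": ImpedanceParams(mass=1.0, damping=8.0, stiffness=16.0),
}


def simulate_deflection(params, force, dt=0.001, steps=10_000):
    """Integrate m*x'' + d*x' + k*x = force with semi-implicit Euler.

    Returns the deflection x after steps * dt seconds.
    """
    x, v = 0.0, 0.0
    for _ in range(steps):
        acc = (force - params.damping * v - params.stiffness * x) / params.mass
        v += acc * dt
        x += v * dt
    return x


human_deflection = simulate_deflection(PARAMS_BY_OBSTACLE["human"], force=1.0)
inanimate_deflection = simulate_deflection(PARAMS_BY_OBSTACLE["inanimate"], force=1.0)
```

In this sketch the deflection settles near force / stiffness, so the softer "human" setting moves the drone farther away from its nominal path for the same virtual force.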

Abstract:
Achieving autonomous navigation for biologically inspired jumping robots remains a long-standing challenge, due to the inherent instability of jumping motions and the limitations in onboard sensor capabilities. This paper proposes a skill-based hierarchical framework with dangerous action masking (SH-DAM) for autonomous navigation of a jumping robot. The framework, based on hierarchical reinforcement learning, includes a low-level controller that learns locomotion skills (crawling, turning, and jumping) to overcome various obstacles. A high-level controller selects and coordinates these skills, while also incorporating curriculum learning to enhance the performance of navigation tasks. For safe navigation, we utilize dangerous action masking to suppress the probability of selecting jump motions in dangerous regions. We improved the locust-inspired jumping robot platform JumpBot-S by integrating a lightweight time-of-flight (ToF) sensor, and constructed a range of complex environments for experiments. Simulation results demonstrate that SH-DAM enables the robot to autonomously complete challenging navigation tasks. Compared to baseline algorithms, our method achieves a 12.57% increase in success rate, a 55.88% reduction in stuck rate, and a 57.89% reduction in rollover rate. Finally, we deployed our framework in real-world environments and conducted experiments in both normally lit and dimly lit conditions. This framework provides a new paradigm for jumping robot navigation in complex environments.
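
Action masking of the kind described above can be sketched as follows. This is a generic hard-masking example in which a masked action receives zero probability. The skill names come from the abstract, but the logits, the `in_danger_zone` flag, and the `masked_action_probs` helper are illustrative assumptions, and SH-DAM may suppress probabilities differently.

```python
import math


def masked_action_probs(logits, dangerous_mask):
    """Softmax over logits in which masked actions get zero probability."""
    masked = [-math.inf if m else z for z, m in zip(logits, dangerous_mask)]
    if all(v == -math.inf for v in masked):
        raise ValueError("at least one action must remain available")
    peak = max(masked)  # subtract the maximum for numerical stability
    exps = [math.exp(v - peak) for v in masked]
    total = sum(exps)
    return [e / total for e in exps]


SKILLS = ["crawl", "turn", "jump"]
logits = [0.2, 0.1, 1.5]   # hypothetical high-level policy outputs
in_danger_zone = True      # hypothetical flag, e.g. derived from ToF readings

# Mask only the jump skill, and only while the robot is in a dangerous region.
mask = [in_danger_zone and skill == "jump" for skill in SKILLS]
probs = masked_action_probs(logits, mask)
```

The remaining probability mass is renormalized over the unmasked skills, so the policy still samples a valid action.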

Abstract:
Tactile sensing plays a crucial role in empowering robotic hands with improved grasping and manipulation abilities. In this paper, we propose an anthropomorphic robotic hand design with soft fingertips that integrate dual air bag sensors to achieve tactile sensing. The air bag sensor is low-cost, easy to build, and deformable, and can be embedded in the fingertip, which endows the hand with the ability to perceive and gives it mechanical compliance similar to that of the human fingertip. The air bag sensor exhibits high performance metrics, including a sensitivity of ~1.65 kPa/N, a minimum detection force of < 0.01 N, a response time of < 10 ms, and good stability and repeatability. The experimental results show that the proposed robotic hand performs well in surface texture detection, hard inclusion depth detection, and object softness detection, as well as grasping tasks. By applying a machine learning algorithm to the experimental data, accuracies of 0.767 and 0.898 were achieved in predicting hard inclusion depth and object hardness, respectively. This study provides a simple and effective tactile sensing solution for the design of anthropomorphic robotic hands, and may have applications such as end-effectors for humanoid robots or robotic palpation.

Abstract:
Reinforcement learning (RL) has demonstrated remarkable capability in acquiring robot skills, but learning each new skill still requires substantial data collection for training. The pretrain-and-finetune paradigm offers a promising approach for efficiently adapting to new robot entities and tasks. Inspired by the idea that acquired knowledge can accelerate learning new tasks with the same robot and help a new robot master a trained task, we propose a latent training framework where a transferable latent-to-latent locomotion policy is pretrained alongside diverse task-specific observation encoders and action decoders. This policy in latent space processes encoded latent observations to generate latent actions to be decoded, with the potential to learn general abstract motion skills. To retain essential information for decision-making and control, we introduce a diffusion recovery module that minimizes information reconstruction loss during the pretraining stage. During the fine-tuning stage, the pretrained latent-to-latent locomotion policy remains fixed, while only the lightweight task-specific encoder and decoder are optimized for efficient adaptation. Our method allows a robot to leverage its own prior experience across different tasks as well as the experience of other morphologically diverse robots to accelerate adaptation. We validate our approach through extensive simulations and real-world experiments, demonstrating that the pretrained latent-to-latent locomotion policy effectively generalizes to new robot entities and tasks with improved efficiency.

Abstract:
The practical implementation of the Anticipatory Steam Detection for Steamer-Filling (ASDSF) process in baijiu intelligent distillation systems, which involves predicting and precisely spreading distillers’ grains before steam emerges, remains a critical unresolved challenge. In this study, we introduce the SLU model, which utilizes SwinLSTM as the core feature extraction module and adopts a U-shaped structure. This model achieves spatiotemporal feature extraction and dynamic change prediction. It is further enhanced by integrating a U-Net module for multi-scale feature fusion and optimized through a Deep Q-Network (DQN)-based decision-making process. The SLU-DQN model, specifically designed for anticipatory material spreading planning in the baijiu Steamer-Filling (SF) distillation system, predicts future steam emission areas. Finally, both quantitative and qualitative experimental results demonstrate the excellent performance of the SLU-DQN model in solving the ASDSF problem. The model achieved 91.1% reward accuracy, an F1-Score of 91% for material spreading point prediction, an MSE of 19.02, and an SSIM of 95.8%. These results not only highlight the model’s superior accuracy in predicting future steam emission areas but also provide a significant technical breakthrough for intelligent baijiu distillation systems, filling a crucial gap in the field.

Abstract:
Continuum robots, with their slender, flexible structures, are increasingly utilized for navigating confined spaces via follow-the-leader (FTL) motion to minimize trauma. However, traditional FTL algorithms fail to adjust the entire robot configuration, including the tail, and struggle with navigation in dynamic environments requiring real-time re-planning. We propose a novel FTL motion planning framework inspired by snake hole-digging, enabling optimal shape configuration during insertion. Our approach outperforms the baseline FTL by 75.36% in three environments (circular, confined circular, maze) with six tests. The framework integrates control barrier functions (CBF) and quadratic programming (QP) for real-time obstacle avoidance, with control Lyapunov functions (CLF) ensuring minimal deviation. The target reaching error decreases by 48.43% when using CLF-CBF-QP compared to CBF-QP in four dynamic circular environments with obstacles from different directions. This work, with open-source code, provides a robust solution for continuum robot FTL motion planning in dynamic environments.
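
The role of the CBF constraint in a QP-based safety filter can be shown with a minimal sketch. It assumes a single-integrator point, one circular obstacle, and a single CBF constraint, for which the QP has a closed-form solution; the `cbf_qp_filter` name and all numbers are hypothetical, and the CLF term and the continuum-robot kinematics of the proposed framework are not modeled.

```python
def cbf_qp_filter(x, u_nom, center, radius, alpha=1.0):
    """Minimally modify u_nom so that the CBF condition holds.

    Barrier: h(x) = ||x - center||^2 - radius^2 (h >= 0 means safe).
    For x' = u the condition dh/dt + alpha*h >= 0 reads a.u >= b with
    a = 2*(x - center) and b = -alpha*h. The QP
        min ||u - u_nom||^2  subject to  a.u >= b
    has one linear constraint, so its solution is a projection.
    """
    dx = [xi - ci for xi, ci in zip(x, center)]
    h = sum(d * d for d in dx) - radius ** 2
    a = [2.0 * d for d in dx]
    b = -alpha * h
    slack = sum(ai * ui for ai, ui in zip(a, u_nom)) - b
    if slack >= 0.0:
        return list(u_nom)  # nominal command already satisfies the constraint
    norm_sq = sum(ai * ai for ai in a)
    if norm_sq == 0.0:
        raise ValueError("state coincides with the obstacle center")
    # Project u_nom onto the constraint boundary a.u = b.
    return [ui - (slack / norm_sq) * ai for ui, ai in zip(u_nom, a)]


# Hypothetical example: heading straight at an obstacle at the origin.
u_safe = cbf_qp_filter(x=[2.0, 0.0], u_nom=[-5.0, 0.0],
                       center=[0.0, 0.0], radius=1.0)
# Moving away from the obstacle leaves the command unchanged.
u_free = cbf_qp_filter(x=[2.0, 0.0], u_nom=[1.0, 0.0],
                       center=[0.0, 0.0], radius=1.0)
```

With several obstacles or an added CLF constraint the projection no longer has this simple form, and a numerical QP solver would be used instead.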

Abstract:
Detection of deformable linear objects (DLOs) in three-dimensional space is essential for robotic manipulation of DLOs. However, their complex deformations and high degrees of freedom make perception highly susceptible to occlusions, noise, and missing data. To address these challenges, we propose a deep learning-based method that leverages the topological properties of DLOs to robustly detect keypoints from incomplete point clouds while preserving the topological order of keypoints. Our approach initializes a sequence of keypoints that adheres to the topological structure of DLOs. Then, these ordered keypoints are refined through bidirectional sequence learning. Simulation results demonstrate that our method generates accurate, uniform, and smooth keypoint sequences under varying levels of occlusion. Compared to existing baselines, our approach achieves superior performance. Real-world experiments further validate the generalization capability of our method in unseen and challenging scenarios involving occlusion and self-occlusion while maintaining real-time performance.

Abstract:
Manufacturing applications increasingly integrate visually aided robotic systems. Such systems must rely on excellent kinematic parameter calibration and hand-eye matrix estimation to perform according to standards. The latter is only as precise as the camera pose estimation capability and the robot’s forward kinematic precision. To enhance the overall system’s precision, one must simultaneously improve the robot’s kinematic parameters and the hand-eye transformation, because the two estimates depend on each other. This work exploits standard 2D camera systems to simultaneously estimate the kinematic parameters and the hand-eye transformation matrix through a method based on the Unscented Kalman Filter (UKF) and the propagation of parameter uncertainty through the robot’s kinematics. The method employs data gathered during the robot movements and camera readings and iteratively improves the estimate of the system parameters. The method is applied to industrial mobile manipulators and tested on both synthetic data and real experimental data, showing a great improvement in the estimation of the kinematic parameters.

Abstract:
Cable-driven serpentine manipulators hold great potential in unstructured environments, offering obstacle avoidance, multi-directional force application, and a lightweight design. By placing all motors and sensors at the base and employing plastic links, we can further reduce the arm’s weight. To demonstrate this concept, we developed a 9-degree-of-freedom cable-driven serpentine manipulator with an arm length of 545 mm and a total mass of only 308 g. However, this design introduces flexibility-induced variations, such as cable slack, elongation, and link deformation. These variations result in discrepancies between analytical predictions and actual link positions, making pose estimation more challenging. To address this challenge, we propose a physical reservoir computing based pose estimation method that exploits the manipulator’s intrinsic nonlinear dynamics as a high-dimensional reservoir. Experimental results show a mean pose error of 4.3 mm using our method, compared to 4.4 mm with a baseline long short-term memory network and 39.5 mm with an analytical approach. This work provides a new direction for control and perception strategies in lightweight cable-driven serpentine manipulators leveraging their intrinsic dynamics.

Abstract:
Current trunk-like continuum robots face limitations in actuator capabilities, hindering the realization of adjustable multi-turn helical deformations. In our previous research [1], [2], we proposed the Twisted String and Spiral Hose (TSSH) mechanism, which utilizes both the tensile and torsional forces of twisted string actuators (TSAs) to generate helical deformations. However, the deformation principles of the TSSH mechanism are still not fully understood, motivating further investigation. The main contribution of this study is the analysis of TSSH deformation principles using a conventional mathematical model of twisted strings, supported by experimental validation. Building on this analysis, we developed a static simulation model based on potential energy minimization to predict TSSH deformation. The proposed model provides insights into the deformation behavior of the TSSH mechanism, enabling parameter optimization for enhanced performance. Furthermore, this static analysis can be extended to other TSA-based and tendon-driven systems, providing a valuable reference for string-actuated robotic mechanisms.

Abstract:
The integration of high-quality datasets, a generalized network model, and robust evaluation strategies sets a significant benchmark for advancing policy development in industrial bin-picking. This paper introduces the concept of region-aware grasping, a cutting-edge simulation to reality system designed to generate and evaluate 6D poses, empowering robots to grasp novel workpieces in stacked environments. The proposed system comprises two core components: the Sim2Real dataset, a large-scale synthetic point cloud dataset for grasp analysis, and Semantic-GraspNet, a policy framework that predicts full 6D grasp poses for stacked objects. By encoding and decoding point cloud data, Semantic-GraspNet innovatively transforms the pose prediction into a semantic categorization problem. Furthermore, we present a hybrid evaluation strategy that integrates pose assessment with mechanical grasp performance analysis, thereby enhancing both grasp success rates and sorting efficiency. To extend its capabilities, Semantic-GraspNet is combined with multi-modal large models, enabling accurate object-category-specific grasping in complex bin-picking scenarios. In real-world industrial applications, the system achieves a grasp completion rate of 91.3% in cluttered scenes and 89.2% in densely stacked environments, showcasing state-of-the-art performance in robotic picking and placing tasks.

Abstract:
Robotics presents a promising opportunity for enhancing bathing assistance, potentially alleviating labor shortages and reducing care costs, while offering consistent and gentle care for individuals with physical disabilities. However, ensuring flexible and efficient cleaning of the human body poses challenges as it involves direct physical contact between the human and the robot, and necessitates simple, safe, and effective control. In this paper, we introduce a soft, expandable robotic manipulator with embedded capacitive proximity sensing arrays, designed for safe and efficient bed bathing assistance. We conduct a thorough evaluation of our soft manipulator, comparing it with a baseline rigid end effector in a human study involving 12 participants across 96 bathing trials. Our soft manipulator achieves an average cleaning effectiveness of 88.8% on arms and 81.4% on legs, far exceeding the performance of the baseline. Participant feedback further validates the manipulator’s ability to maintain safety, comfort, and thorough cleaning. https://sites.google.com/view/softbathing.

Abstract:
Soft robotic systems heavily depend on accurate sensor data for perception and control; however, this data is often corrupted by missing observations, due to partial sensor coverage, communication failures, or occlusions and noisy measurements stemming from hardware imperfections, environmental disturbances, and the intrinsic compliance of soft materials. Such corruption can obscure critical state information, causing unreliable modeling of soft robotics and degrading control accuracy. To address these challenges, we propose a Contrastive Dual-Latent Autoencoder (CDLAE) that jointly handles missing and noisy data in a single end-to-end framework. Our approach leverages an attention based autoencoder architecture with dual latent pathways, where one focuses on capturing the underlying clean signals while the other isolates noise-related components. A contrastive loss encourages strong separation between these pathways, enhancing the model’s ability to filter noise while reconstructing missing values. Additionally, the autoencoder is trained jointly with a downstream predictive network, ensuring that signal imputation is optimized with respect to the ultimate control task. Experimental evaluations on a pneumatic soft robot platform and multiple public time-series datasets demonstrate that CDLAE consistently outperforms existing methods in handling corrupted data, offering robust, high-fidelity reconstructions that significantly improve soft robot perception and control in real-world conditions.

Abstract:
This paper presents a novel control framework that integrates reservoir computing (RC) with Tube model predictive control (Tube-MPC) for robust path following in quadrotor autonomous underwater vehicles (QAUVs) under sudden fault conditions. The proposed RC-Tube-MPC leverages the dynamic modeling capabilities of RC to efficiently approximate complex nonlinear behaviors, while Tube correction ensures robust performance despite model uncertainties and external disturbances. Comparative simulations demonstrate that RC-Tube-MPC outperforms alternative approaches in terms of path following accuracy and computational efficiency. Additionally, the influence of training data length on learning performance is analyzed, revealing that the proposed method maintains superior performance across various data regimes. Notably, in severe fault scenarios, such as a fault factor of 0.3, RC-Tube-MPC uniquely restores convergence to the reference path. These results underscore the potential of the integrated RC-Tube-MPC approach for real-time control applications in dynamic, fault-prone underwater environments.

Abstract:
Behavior cloning (BC) has become a staple imitation learning paradigm in robotics due to its ease of teaching robots complex skills directly from expert demonstrations. However, BC suffers from an inherent generalization issue. To solve this, the status quo solution is to gather more data. Yet, regardless of how much training data is available, out-of-distribution performance is still sub-par, lacks any formal guarantee of convergence and success, and is incapable of allowing and recovering from physical interactions with humans. These are critical flaws when robots are deployed in ever-changing human-centric environments. Thus, we propose Elastic Motion Policy (EMP), a one-shot imitation learning framework that allows robots to adjust their behavior based on the scene change while respecting the task specification. Trained from a single demonstration, EMP follows the dynamical systems paradigm where motion planning and control are governed by first-order differential equations with convergence guarantees. We leverage Laplacian editing in full end-effector space, ℝ³ × SO(3), and online convex learning of Lyapunov functions, to adapt EMP online to new contexts, avoiding the need to collect new demonstrations. We extensively validate our framework in real robot experiments, demonstrating its robust and efficient performance in dynamic environments, with obstacle avoidance and multi-step task capabilities. https://elastic-motion-policy.github.io/EMP/
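
The dynamical systems paradigm mentioned above can be illustrated with a minimal sketch. It assumes the simplest possible stand-in, a linear first-order system x' = -gain * (x - goal) in position space only, with the quadratic Lyapunov function V(x) = ||x - goal||^2; the `rollout` helper and all values are hypothetical, and EMP's learned nonlinear dynamics, orientation handling, and Laplacian editing are not modeled.

```python
def rollout(start, goal, gain=2.0, dt=0.01, steps=500):
    """Euler-integrate x' = -gain * (x - goal) and record V(x) at each step.

    Returns the final state and the list of Lyapunov values
    V(x) = ||x - goal||^2 evaluated before each update.
    """
    x = list(start)
    energies = []
    for _ in range(steps):
        err = [xi - gi for xi, gi in zip(x, goal)]
        energies.append(sum(e * e for e in err))
        x = [xi - gain * e * dt for xi, e in zip(x, err)]
    return x, energies


# Hypothetical end-effector start and goal positions in metres.
goal = [0.4, 0.2, 0.1]
final_state, energies = rollout(start=[1.0, -0.5, 0.3], goal=goal)
```

Because the error shrinks by a fixed factor at every step, V decreases monotonically along the trajectory, which is the kind of convergence property that a Lyapunov function certifies.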

Abstract:
We present Phaser, a flexible system that directs narrow-beam laser light to moving robots for concurrent wireless power delivery and communication. We design a semiautomatic calibration procedure to enable fusion of stereo-vision-based 3D robot tracking with high-power beam steering, and a low-power optical communication scheme that reuses the laser light as a data channel. We fabricate a Phaser prototype using off-the-shelf hardware and evaluate its performance with battery-free autonomous robots. Phaser delivers optical power densities of over 110 mW/cm² and error-free data to mobile robots at multi-meter ranges, with on-board decoding drawing 0.3 mA (97% less current than Bluetooth Low Energy). We demonstrate Phaser fully powering gram-scale battery-free robots to nearly 2x higher speeds than prior work while simultaneously controlling them to navigate around obstacles and along paths. Code, an open-source design guide, and a demonstration video of Phaser are available at: mobilex.cs.columbia.edu/phaser

Abstract:
The employment of walking excavator robots, endowed with hybrid locomotion capabilities, holds considerable promise in facilitating the execution of intricate tasks in challenging terrain environments. A critical skill for such a system pertains to traversing obstacles through stepping locomotion, a process entailing the momentary disengagement of the end-effectors from the ground. Existing solutions are encumbered by two significant limitations. First, they are often too cumbersome to develop and implement due to the complexity of the problem formulations. Second, they present restrictions on the available avenues to influence the behavior, hindering the effective leveraging of domain knowledge to achieve the intended objective. This research proposes an alternative approach to learning the stepping locomotion. The proposed method employs a hierarchical reinforcement learning strategy, wherein the complex control task is decomposed into multiple subtasks, each aligned with a sub-objective defined as a reward function. The training of these subtasks is conducted individually, starting from the lowest level and progressing to the higher levels, predominantly utilizing deep reinforcement learning. Additionally, the masking of invalid actions is utilized to guide the controller during training, offering enhanced opportunities to influence behavior while using only simple formulations. Notably, the proposed approach has been successfully trained for three distinct stepping scenarios: obstacle, step, and gap, underscoring the versatility of the controller. The design of the controller, along with the results of training and evaluation in simulation, is presented herein.

Abstract:
Magnetic levitation control provides a promising solution for wireless capsule endoscopy by minimizing tissue pressure and reducing patient discomfort and risks. Compared to electromagnetic actuation systems, using permanent magnets as the actuation source provides stronger magnetic fields at a lower cost. However, permanent magnet-based actuation systems are highly nonlinear and necessitate complex system modeling. In this study, we propose a deep reinforcement learning (DRL)-based control method for permanent magnet levitation. This approach utilizes DRL to learn optimal control strategies in complex and dynamic environments, without the need for detailed modeling of nonlinear physical phenomena such as magnetic interactions, manipulator dynamics, and capsule-environment interactions. A simulation environment was developed where a manipulator equipped with a permanent magnet actuates the internal magnet of a capsule. A multistage reward function and a recurrent neural network with memory capabilities were designed to improve control stability and accuracy. After Sim-to-Sim transfer, the proposed method successfully controlled five degrees of freedom, achieving navigation accuracies of 1.68 mm in the training environment and 9.18 mm in the testing environment. The system maintained stable performance and high accuracy while supporting dynamic tracking at speeds of up to 30 mm/s. Additionally, the method demonstrated significant resistance to disturbances.

Abstract:
In cluttered environments, a human-following mobile robot must predict the motion intention of the followed human and take environmental obstacles into consideration. Consequently, it brings several challenges, such as the human’s detour direction prediction problem and the visibility maintenance problem for route planning. To overcome these problems, this paper proposes an integrated follow-ahead framework, in which the human’s detour behavior is predicted by the Leg Motion Model-based EKF (LMM-EKF) and the iterative human route search algorithm, followed by the Safe Corridor-based Model Predictive Controller (SCMPC) used to obtain the optimal control solution. Also, a new perspective about visibility is provided in this paper that, via placing multiple obstacle-free safe regions along the human’s intended direction without any complex preprocessing for the point cloud, SCMPC prevents the robot from collision and occlusion simultaneously based on the basic properties of the convex set. The validity of the proposed method is comprehensively verified through real-world experiments.

Abstract:
Efficient robotic disassembly of end-of-life products is often impeded by inherent uncertainties in product condition and unknown internal structures. Conventional disassembly methods face challenges when adaptive exploration is required—particularly in cap-shaft disassembly, where connection mechanisms are frequently concealed. This paper proposes a novel robotic twist-and-pull disassembly strategy that integrates compliance control with reinforcement learning (RL). By enabling the robot to adapt to unknown connection geometries and systematic misalignments, the approach enhances the capabilities of robotic skill and reduces dependence on precisely pre-programmed trajectories. Experimental results confirm that the proposed strategy substantially improves robotic disassembly performance, improves RL training success rate, and demonstrates strong domain transferability, supporting its application across varied disassembly contexts.

Abstract:
The task of deploying a large number of autonomous vehicles is challenging, risky, and often overlooked in the literature. These vehicles are typically deployed from a single location, and their underactuated nature, close proximity, and susceptibility to external disturbances make it difficult to achieve a mission-ready configuration without collisions. In this paper, we address the problem of transitioning a set of underactuated Autonomous Surface Vehicles (ASVs) from arbitrary and inconvenient initial conditions to a deconflicted set of deployed vehicles. We propose a decentralized and scalable method that assigns the vehicles to their target positions, generates optimal paths given minimum turning radii, and ensures collision avoidance between the vehicles. Performance is verified through simulation and extensive field trials. Results demonstrate that our approach reduces the time to decluster by 58% compared to the current manual method. By improving efficiency and robustness while eliminating human involvement, this work streamlines ASV fleet deployments, enabling more effective multi-agent field operations.

Abstract:
This paper presents a three-dimensional (3D) control framework for bevel-tip steerable needles that combines model predictive control (MPC) with hierarchical supervisory logic. The MPC layer uses a reduced-order two-mode switching model to generate the desired control actions, while the supervisory logic adaptively prioritizes in-plane and out-of-plane corrections based on real-time error magnitudes. This hierarchical approach smoothly modulates the axial rotation commands to minimize abrupt needle flips, thereby reducing the so-called "drilling effect", a key source of tissue trauma. The simulation results show that the proposed approach reduces tissue trauma by more than 50% compared to conventional pulse-width-modulated sliding mode controllers while achieving mean absolute error and targeting errors in the submillimeter range.

Abstract:
The ability to generate subtasks autonomously is important for robots to effectively cope with unforeseen situations during indoor search and rescue missions. While prior work mainly focused on improving individual low-level skills of the rescue robot, this paper proposes AutoExpand: a high-level framework that takes advantage of the extensive knowledge and reasoning abilities inherent in large language models (LLMs) to understand human instructions and the environmental situation. By tightly coupling the LLM with a behavior tree, our method enables the robot to autonomously generate reactive, context-aware operational subtasks on-site without human intervention or additional training. A series of real-world experiments demonstrates that AutoExpand can effectively generate appropriate tasks for search and rescue missions, leading to a search scope increased by 34.45% when compared with traditional methods. The sample code is available at https://github.com/nubot-nudt/AutoExpand.

Abstract:
Aligning robot navigation with human preferences is essential for ensuring comfortable and predictable robot movement in shared spaces. While preference-based learning methods, such as reinforcement learning from human feedback (RLHF), enable this alignment, the choice of the preference collection interface may influence the process. Traditional 2D interfaces provide structured views but lack spatial depth, whereas immersive VR offers richer perception, potentially affecting preference articulation. This study systematically examines how the interface modality impacts human preference collection and navigation policy alignment. We introduce a novel dataset of 2,325 human preference queries collected through both VR and 2D interfaces, revealing significant differences in user experience, preference consistency, and policy outcomes. Our findings highlight the trade-offs between immersion, perception, and preference reliability, emphasizing the importance of interface selection in preference-based robot learning. The dataset is available to support future research.

Abstract:
Autonomous exploration is essential for the effective deployment of quadrotors in various applications. However, existing approaches face significant challenges in large-scale environments, particularly in balancing global coverage efficiency and computational overhead. These limitations often result in poor adaptability to environmental changes and redundant revisits to previously explored areas, reducing overall exploration efficiency. To address these issues, we propose FASTEX, a fast UAV exploration framework designed for large-scale environments, using dynamically expanding grids and coverage paths to improve exploration efficiency. To support efficient exploration planning in large-scale scenarios, we introduce an efficient environment preprocessing method, including a dynamic grid expansion mechanism and a sparse roadmap. Furthermore, we present a hierarchical exploration planning framework that integrates an incremental global planner with a local planner, ensuring high coverage and computational efficiency. Extensive simulation tests demonstrate the superior performance and robustness of the proposed method compared to the state-of-the-art methods. In addition, we conduct various real-world experiments to validate the feasibility of our autonomous exploration system.

Abstract:
The human somatosensory system integrates multimodal sensory feedback, including tactile, proprioceptive, and thermal signals, to enable comprehensive perception and effective interaction with the environment. Inspired by this biological mechanism, we present a sensorized soft anthropomorphic hand equipped with diverse sensors designed to emulate the sensory modalities of the human hand. This system incorporates biologically inspired encoding schemes that convert multimodal sensory data into spike trains, enabling highly efficient processing through Spiking Neural Networks (SNNs). By utilizing these neuromorphic signals, the proposed framework achieves 97.14% accuracy in object recognition across varying poses, significantly outperforming previous studies on soft hands. Additionally, we introduce a novel differentiator neuron model to enhance material classification by capturing dynamic thermal responses. Our results demonstrate the benefits of multimodal sensory fusion and highlight the potential of neuromorphic approaches for achieving efficient, robust, and human-like perception in robotic systems.
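
Converting a sampled sensor signal into a spike train can be sketched with a generic delta (threshold-crossing) encoder. This is a common textbook scheme and is only assumed here for illustration; the `delta_encode` helper, the temperature samples, and the threshold are hypothetical, and the encoding schemes and differentiator neuron model of the proposed system may differ.

```python
def delta_encode(samples, threshold):
    """Emit +1 or -1 when the signal moves by at least `threshold`
    away from the value at the last spike, otherwise 0."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    if not samples:
        return []
    spikes = []
    reference = samples[0]
    for value in samples[1:]:
        if value - reference >= threshold:
            spikes.append(1)    # rising change large enough: ON spike
            reference = value
        elif reference - value >= threshold:
            spikes.append(-1)   # falling change large enough: OFF spike
            reference = value
        else:
            spikes.append(0)    # change too small: no spike
    return spikes


# Hypothetical fingertip temperature readings in degrees Celsius.
temperature = [20.0, 20.2, 20.6, 21.2, 21.1, 20.4]
spikes = delta_encode(temperature, threshold=0.5)
```

Because spikes occur only when the signal changes, a steady reading produces no events, which is one reason event-based encodings can be processed efficiently.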

Abstract:
Recently, many attempts have been made to integrate the foundation model with robotics. In most of those attempts, the model recognition results were treated as unique; however, the recognition results required for real robot tasks vary with the task goal. The recognition results of the foundation model are determined from the detection query; therefore, in the case of an ambiguous query, the query must be modified to match the purpose of the robot task. In this study, we propose an object recognition method that considers the task goal through application of an interactive task planning method using a large language model. The proposed method clarifies the purpose of the robot task by asking the user a question. Hence, uncertainty in the task plan due to ambiguous operation instructions is mitigated. During the task plan update arising from the dialog process, the object-detection results obtained from the query in the planning results are also updated to match the task goal. In our experiments, the proposed method’s effectiveness is verified quantitatively and qualitatively via object-detection tasks conducted on a custom-built verification dataset.

Abstract:
Multimodal Unmanned Aerial Vehicles (UAVs), capable of operating in different locomotion modes, offer greater versatility and optimized energy usage. This paper presents a novel trimodal UAV that integrates ground locomotion, hovering, and fixed-wing flight using a shared actuator system. The design features a quadcopter frame with modular components, including passive wheels for ground mobility and fixed wings for forward flight. A control system for ground locomotion was implemented within the ArduPilot framework, enabling autonomous waypoint navigation across all modes. The prototype was extensively tested, with a comprehensive energy efficiency evaluation conducted through wind tunnel experiments and flight trials. In forward flight, the vehicle’s range increased, although its endurance decreased. Ground mode saw significant gains in both. Wing incidence tuning in hover improved endurance and range but reduced controllability. Additionally, the vehicle was shown to be capable of climbing inclined surfaces, such as walls.

Abstract:
In indoor parking lots, the use of RTK/GNSS for vehicle localization is often impractical due to the significantly smaller space compared to outdoor roads, which demands higher precision in both mapping and localization. Although feature point based visual SLAM algorithms have achieved high localization accuracy, they impose significant storage demands on embedded systems, and the visual feature point maps are not time-stable and are sensitive to lighting conditions. In this paper, we propose a real-time mapping and localization system for multi-floor parking lots. For the mapping part, we introduce a map-free SLAM method for precise ego-pose estimation, along with an efficient incremental map update framework that supports loop closure and multi-session mapping tasks. In the localization part, a semantic map is reused for vehicle localization based on bidirectional incentive descriptors. We incorporate degenerate cases into our optimization process, which greatly enhances the localization results. To the best of our knowledge, this is the first comprehensive system proposed for multi-floor parking lots. Experimental results demonstrate that our approach achieves state-of-the-art mapping and localization accuracy in multi-floor environments on embedded platforms.

Abstract:
The control of locomotion in walking robots with various architectural designs presents significant challenges. Existing approaches primarily rely on low-level state information and isolated visual features, and lack the high-level semantic understanding that humans use to reason about movement and posture. We propose HeStIa, a novel framework that bridges visual perception, natural language understanding, and robotic control through multimodal learning. By leveraging multimodal large language models (MLLMs), HeStIa establishes a semantic connection between visual observations and motion control, enabling robots to comprehend and adapt their locomotion through both visual and linguistic modalities. Our approach extracts spatiotemporal visual features from robot movements and transforms them into a cross-modal embedding space shared with textual descriptions. HeStIa incorporates an innovative vision-language-motion fusion mechanism to provide informed, context-aware feedback during the dynamic learning process. Through an asynchronous design, HeStIa effectively mitigates the inference delays typically associated with MLLMs while maintaining real-time performance in dynamic scenarios. The cross-modal representations learned by HeStIa facilitate more intuitive and efficient locomotion learning by grounding visual observations in natural language descriptions. Our comprehensive evaluation shows substantial improvements in motion naturalness, stability, and adaptability across diverse environmental conditions.

Abstract:
Human-like embodied tactile perception is crucial for next-generation intelligent robotics. Achieving large-area, full-body soft coverage with high sensitivity and rapid response, akin to human skin, remains a formidable challenge due to critical bottlenecks in encoding efficiency and wiring complexity in existing flexible tactile sensors, which significantly hinder the scalability and real-time performance required for human skin-level tactile perception. Herein, we present a new architecture employing code division multiple access (CDMA)-inspired orthogonal digital encoding to overcome these challenges. Our decentralized encoding strategy transforms conventional serial signal transmission by enabling parallel superposition of energy-orthogonal base codes from distributed sensing nodes, drastically reducing wiring requirements and increasing data throughput. We implemented and validated this strategy with an off-the-shelf 16-node sensing array to reconstruct the pressure distribution, achieving a temporal resolution of 12.8 ms using only a single transmission wire. Crucially, the architecture can maintain sub-20 ms latency across orders-of-magnitude variations in node number (to thousands of nodes). By fundamentally redefining signal encoding paradigms in soft electronics, this work opens new frontiers in developing scalable embodied intelligent systems with human-like sensory capabilities.
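The idea of superposing orthogonal codes on one shared wire can be illustrated with a minimal sketch. It assumes Walsh-Hadamard codes and ideal, noise-free summation on the line. The abstract does not specify the actual base codes or the hardware signal chain, so the function names and the pressure values below are purely illustrative.

```python
def walsh_codes(order):
    """Build 2**order mutually orthogonal +/-1 codes (Sylvester construction)."""
    codes = [[1]]
    for _ in range(order):
        codes = ([row + row for row in codes]
                 + [row + [-x for x in row] for row in codes])
    return codes


def encode(pressures, codes):
    """Superpose every node's code, scaled by its pressure, on one shared line."""
    length = len(codes[0])
    return [sum(p * code[t] for p, code in zip(pressures, codes))
            for t in range(length)]


def decode(signal, codes):
    """Recover each node's pressure by correlating the line signal with its code."""
    length = len(codes[0])
    return [sum(s * c for s, c in zip(signal, code)) / length for code in codes]


codes = walsh_codes(4)                         # 16 codes of length 16, one per node
pressures = [float(i % 5) for i in range(16)]  # made-up pressure readings
line_signal = encode(pressures, codes)         # what the single wire would carry
recovered = decode(line_signal, codes)         # per-node values after correlation
```

Because the codes are mutually orthogonal, correlating the summed signal with one node's code cancels every other node's contribution, which is why a single wire can carry all nodes at once in this idealized setting.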

Abstract:
4D imaging radar is a type of low-cost millimeter-wave radar (costing merely 10-20% of lidar systems) capable of providing range, azimuth, elevation, and Doppler velocity information. Accurate extrinsic calibration between millimeter-wave radar and camera systems is critical for robust multimodal perception in robotics, yet remains challenging due to inherent sensor noise characteristics and complex error propagation. This paper presents a systematic calibration framework that addresses these challenges through a spatial 3D uncertainty-aware PnP algorithm (3DUPnP), which explicitly models spherical coordinate noise propagation in radar measurements and then compensates for non-zero error expectations during coordinate transformations. Experimental validation demonstrates significant performance improvements over the state-of-the-art CPnP baseline, including improved consistency in simulations and enhanced precision in physical experiments. This study provides a robust calibration solution for robotic systems equipped with millimeter-wave radar and cameras, tailored specifically for autonomous driving and robotic perception applications.
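To see why a spherical-to-Cartesian conversion can have a non-zero error expectation, consider a textbook case. The derivation below assumes independent, zero-mean Gaussian noise on range, azimuth, and elevation. The symbols and the noise model are illustrative assumptions, not necessarily those used in 3DUPnP.

```latex
% True radar point: range r, azimuth \theta, elevation \varphi
x = r\cos\varphi\cos\theta, \qquad
y = r\cos\varphi\sin\theta, \qquad
z = r\sin\varphi .

% Measurements with independent zero-mean Gaussian noise
r_m = r + \varepsilon_r, \quad
\theta_m = \theta + \varepsilon_\theta, \quad
\varphi_m = \varphi + \varepsilon_\varphi, \qquad
\varepsilon_\theta \sim \mathcal{N}(0, \sigma_\theta^2), \;
\varepsilon_\varphi \sim \mathcal{N}(0, \sigma_\varphi^2).

% For Gaussian \varepsilon: E[\cos(\theta + \varepsilon)] = \cos\theta \, e^{-\sigma^2/2},
% and likewise for the sine, so the converted point is biased toward the origin:
\mathbb{E}[x_m] = x \, e^{-(\sigma_\theta^2 + \sigma_\varphi^2)/2}, \qquad
\mathbb{E}[y_m] = y \, e^{-(\sigma_\theta^2 + \sigma_\varphi^2)/2}, \qquad
\mathbb{E}[z_m] = z \, e^{-\sigma_\varphi^2/2}.
```

Under these assumptions the converted Cartesian measurement is systematically shrunk even though every spherical noise term is zero-mean, which is the kind of bias a compensation step would need to remove.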

Abstract:
Robotic system development must adopt a holistic approach for tactile and dynamic tasks, shifting from the decoupled design of end-effectors and robot manipulators for traditional sequential tasks. Although established metrics exist for traditional tasks, such as pick-and-place, they lack the nuanced evaluation required for dynamic and tactile operations. Accordingly, this paper introduces an integrated framework that defines and unifies decoupled and coupled gripper metrics into a single perspective. We categorise gripper metrics based on their interaction with the robot manipulator, which can be entirely decoupled, coupled by time-sequence, or coupled. Using this classification, we propose 16 metrics to evaluate force control, force reaction, and efficiency. We introduce three new experimental setups and describe the corresponding procedures to quantify these metrics. Results from three commercial finger grippers demonstrate the efficacy of the proposed metrics, revealing each gripper’s strengths and limitations when integrated into different manipulator systems. Incorporating these metrics into performance reviews provides a comprehensive evaluation of robotic system fitness, considering dynamic, real-time challenges. This supports informed design choices and enhances tactile manipulation tasks.

Abstract:
Learning a controller directly on the robot requires extreme sample efficiency. Model-based reinforcement learning (RL) methods are the most sample efficient, but their inference time is often too long to meet the robot control frequency requirements. In this paper, we address the sample efficiency and inference time challenges with two contributions. First, we define a general framework to deal with inference delays where the slow-inference robot controller provides a sequence of actions to feed the control-hungry robotic platform without execution gaps. Then, we compare several RL algorithms in light of this framework and propose RT-HCP, an algorithm that offers an excellent trade-off between performance, sample efficiency and inference time. We validate the superiority of RT-HCP with experiments where we learn a controller directly on a simple but high-frequency FURUTA pendulum platform. Code: github.com/elasriz/RTHCP
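The notion of feeding a fast control loop from a slow controller "without execution gaps" can be sketched as a small simulation. This is a hypothetical illustration of the general idea, not the RT-HCP algorithm. The names `run_control_loop` and `plan_fn`, and the way the delay is counted in control periods, are assumptions made for the example.

```python
from collections import deque


def run_control_loop(plan_fn, total_steps, inference_steps, horizon):
    """Simulate a fixed-rate control loop fed by a slow planner.

    plan_fn(start_step, horizon) returns `horizon` actions meant for control
    steps start_step, start_step + 1, ... The planner is assumed to need
    `inference_steps` control periods, so each plan is requested that many
    steps before the buffered actions run out.
    """
    if horizon < inference_steps:
        raise ValueError("horizon must cover the inference delay to avoid gaps")
    buffer = deque(plan_fn(0, horizon))  # first plan is ready before the robot starts
    pending = None                       # (ready_step, actions) of the running inference
    executed = []
    for step in range(total_steps):
        # Release the pending plan once its simulated inference time has elapsed.
        if pending is not None and pending[0] == step:
            buffer.extend(pending[1])
            pending = None
        # Start the next inference early enough that it finishes before the buffer empties.
        if pending is None and len(buffer) <= inference_steps:
            next_start = step + len(buffer)
            pending = (step + inference_steps, plan_fn(next_start, horizon))
        if not buffer:
            raise RuntimeError(f"execution gap at control step {step}")
        executed.append(buffer.popleft())
    return executed


# Toy planner that labels each action with the step it is meant for.
actions = run_control_loop(lambda start, h: list(range(start, start + h)),
                           total_steps=20, inference_steps=3, horizon=5)
```

In this sketch the only requirement for gap-free execution is that each plan covers at least as many control steps as one inference takes. A real system would also have to account for the plan being computed from a state that is already several steps old.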

Abstract:
The application scenarios of automated robots are undergoing a paradigm shift from structured environments to unstructured, complex settings. In highly constrained settings like factory inspections or disaster rescue, conventional steering systems show clear drawbacks. In contrast, the four-wheel independent drive and independent steering (4WS) robot provides a variety of steering modes that can effectively meet the needs of complex environments. However, how a 4WS robot autonomously selects different steering modes based on trajectory point information during trajectory tracking remains a challenging problem. This paper proposes a multi-modal trajectory tracking method considering the switch of steering modes, which decomposes the trajectory tracking task into two parts: mode decision-making and tracking control. The corresponding method is designed based on deep reinforcement learning. Additionally, a target trajectory random generator and corresponding training interaction environment are designed to train the model in a data-driven manner. In the designed scenario, our tracker achieves more than a 30% improvement in average tracking error across all motion modes compared with model predictive control, and the decider’s average decision position error is less than 2 cm. Extensive experiments demonstrate that our method achieves superior tracking performance and real-time capabilities compared to current methods.

Abstract:
Incorporating formal methods into reinforcement learning (RL) has the potential to result in the best of both worlds, combining the robustness of formal guarantees with the adaptability and learning capabilities of RL, though careful design is needed to balance safety and exploration. In this work, we propose a framework that mitigates the loss of exploration caused by conservative formal guarantees while still ensuring the safety of the system. Specifically, we introduce a less restrictive method that reduces the conservativeness of formal methods by refining a disturbance model using data collected online. The method evaluates the safety of a learning-based controller using computationally efficient zonotopic reachability analysis, which facilitates real-time implementation. We validate the framework in a real-world drone flight through a canyon, where the drone is subjected to unknown external disturbances and the framework is tasked with learning those disturbances online and adjusting the safety guarantees accordingly. The results show that the framework enables a less restrictive online training of learning-based controllers without compromising the safety of the system.

Abstract:
In this work, we propose a novel, fast, and memory-efficient unsupervised statistical method, combined with an unsupervised deep learning (DL) model, for de-snowing 3D LiDAR point clouds in a fully unsupervised fashion. The results obtained on the real-scanned Winter Adverse Driving dataSet (WADS) show that our DL model achieves a 6.3% improvement in mIoU over the current state-of-the-art unsupervised DL methods and performs comparably to supervised counterparts, substantially narrowing the performance gap between supervised and unsupervised approaches. In addition, our model outperforms its closest competitor by 12.8% mIoU when tested on our Canadian Adverse Driving Conditions (CADC) dataset annotations. Additionally, our de-snowing algorithm enhances downstream semantic segmentation and object detection tasks without requiring any modifications to the base segmentation and detection models. The source code, trained models, and the online supplementary information are available at the following URL: https://sporsho.github.io/3DUnOutDet.

Abstract:
Offline reinforcement learning (RL) has emerged as a promising framework for addressing robot social navigation challenges. However, inherent uncertainties in pedestrian behavior and limited environmental interaction during training often lead to suboptimal exploration and distributional shifts between offline training and online deployment. To overcome these limitations, this paper proposes a novel offline-to-online fine-tuning RL algorithm for robot social navigation by integrating Return-to-Go (RTG) prediction into a causal Transformer architecture. Our algorithm features a spatiotemporal fusion model designed to precisely estimate RTG values in real-time by jointly encoding temporal pedestrian motion patterns and spatial crowd dynamics. This RTG prediction framework mitigates distribution shift by aligning offline policy training with online environmental interactions. Furthermore, a hybrid offline-online experience sampling mechanism is built to stabilize policy updates during fine-tuning, ensuring balanced integration of pre-trained knowledge and real-time adaptation. Extensive experiments in simulated social navigation environments demonstrate that our method achieves a higher success rate and lower collision rate compared to state-of-the-art baselines. These results underscore the efficacy of our algorithm in enhancing navigation policy robustness and adaptability. This work paves the way for more reliable and adaptive robotic navigation systems in real-world applications.

Abstract:
Magnetic soft continuum robots are capable of bending with remote control in confined space environments, and they have been applied in various bioengineering contexts. As one type of ferromagnetic soft continuums, the Magnetically Induced Metamorphic Materials (MIMMs)-based continuum (MC) exhibits similar bending behaviors. Based on the characteristics of its base material, MC is flexible in modifying unit stiffness and convenient in molding fabrication. However, recent studies on magnetic continuum robots have primarily focused on one or two design parameters, limiting the development of a comprehensive magnetic continuum bending model. In this work, we constructed graded-stiffness MCs (GMCs) and developed a numerical model for GMCs’ bending performance, incorporating four key parameters that determine their performance. The simulated bending results were validated with real bending experiments in four different categories: varying magnetic field, cross-section, unit stiffness, and unit length. The graded-stiffness design strategy applied to GMCs prevents sharp bending at the fixed end and results in a more circular curvature. We also trained an expansion model for GMCs’ bending performance that is highly efficient and accurate compared to the simulation process. An extensive library of bending prediction for GMCs was built using the trained model.

Abstract:
The virtual pivot point (VPP), a theoretical convergence point of ground reaction forces during gait, has gained attention for its potential to uncover underlying locomotor control strategies. Here, we present the first investigation of VPP in individuals with unilateral above-knee amputation, using a publicly available dataset of 18 participants. Subjects were categorized into K2 (walking speeds 0.4–0.8 m/s) and K3 (0.6–1.4 m/s) functional levels. Our findings show that both groups demonstrate high sagittal-plane VPP quality, comparable to that of healthy individuals, with R² > 95%, indicating a strong relationship between VPP formation and sagittal plane dynamics. Conversely, in the frontal plane, VPP analysis reveals greater variability and lower quality, indicating the absence of a well-defined pivot during gait. Notably, frontal-plane VPP quality deteriorates with increasing walking speed, particularly in K3 ambulators. While this speed-dependency is observed in healthy individuals as well, the rate of decline is significantly steeper in amputees. Additionally, spatial analysis of VPP positions reveals a consistent elevation of the amputated leg’s VPP compared to the intact leg. These findings emphasize the importance of frontal plane dynamics in amputee gait and suggest improvements in prosthetic design to enhance control and promote more symmetrical, natural gait.

Abstract:
Exoskeleton technology holds significant promise within the human-centric paradigm of Industry 5.0 for mitigating work-related musculoskeletal disorders (WMSDs). However, existing systems often struggle with mismatched assistive torque and inefficient human-machine collaboration under dynamic loading conditions, largely due to insufficient motion intent recognition accuracy. This study proposes a dual-model-based multimodal fusion control strategy that integrates a bidirectional LSTM neural network (Bi-LSTM) with a transformer-based multi-task learning model (MTL) to enable real-time torque compensation and accurate prediction of dynamic load mass under varying conditions. The team developed a lightweight elbow joint exoskeleton prototype, leveraging multi-modal information to enhance assistive torque prediction accuracy. Experimental results show an 83.7% reduction in agonist muscle activation under a 3.5 kg load compared to conditions without the exoskeleton, underscoring its potential for industrial material handling scenarios.

Abstract:
Path planning in uneven terrain scenarios is one of the core capabilities of intelligent off-road robots and vehicles. The complex terrain undulations often cause bumpy motion and sharp turns along the planned path, making smooth and safe path planning challenging. Most path planning methods in this community rely on dense point cloud maps as direct inputs, which inevitably incur high computational overhead for map representation and terrain assessment. To address these problems, we propose a novel path planning method for uneven terrains, SEM-RRT, which balances planning quality and computational efficiency. First, we propose a map representation, namely the Statistical Elevation Map (SEM), which is lightweight to store and compute. Then, to enable fast terrain risk assessment, a terrain risk filter with omnidirectional and multi-scale characteristics is designed. Finally, we incorporate multi-objective cost evaluation, backward search, and rolling optimization strategies into the Informed RRT framework, leveraging its path optimality on large-scale maps. Extensive experiments in challenging terrain scenarios, such as hills, canyons, and volcanic landscapes, show that SEM-RRT outperforms existing methods in both path quality and computational time.

Abstract:
The ability to perform cross-category object perception and manipulation is highly desirable in building intelligent robots. One promising approach is to define the concept of Generalizable and Actionable Parts (GAParts), such as buttons and handles, on both seen and unseen object categories. However, the accurate cross-category perception of GAParts is still challenging due to the large inter-category object shape variations. To address this issue, we introduce SAMIR, a novel framework using SAM-rectified segmentation and Iterative pose Refinement for GAPart detection and manipulation. Firstly, we introduce a Segment Anything (SAM) segmentation prior to rectify the unconfident, fragmented GAPart instance proposals. Secondly, in addition to the zero-shot generalization of the SAM foundation model, we further finetune it with a lightweight adaptor model on our task dataset. Finally, we propose an iterative pose refinement procedure that improves the accuracy of GAPart pose estimation. Our perception experiments on the GAPartNet dataset show that SAMIR consistently outperforms the baseline method on instance segmentation and pose estimation tasks. Our manipulation experiments in the Sapien simulator illustrate that SAMIR leads to an improved manipulation success rate. We also deploy our method to a real robot for real-world manipulation. Our code and video are available at sites.google.com/view/samir-gapart.

Abstract:
Lower limb exoskeletons (LLEs) play a crucial role in assisting paraplegic patients with walking in outdoor environments characterized by complex terrains, including various stairs, slopes, and uneven grounds. However, most existing control methods for LLEs rely on predefined joint angles, lacking the flexibility to adapt to diverse terrains. This deficiency often leads to unexpected contacts between the feet of the LLEs and the ground, thereby disrupting the walking balance of the LLEs. In this paper, a novel force-sensor-free contact estimation method is proposed to tackle this problem. This method utilizes only the sensors already present on the LLEs, eliminating the need for any additional force sensors. The proposed approach is founded on the probabilistic modeling of gait phases, knee joint torques, foot heights, and the displacement of the center of mass. Moreover, Kalman filtering is employed to enhance the contact estimation accuracy by integrating multiple probabilistic models. Experiments were carried out on both robot simulation platforms and real exoskeleton robots. The experimental results demonstrate that the proposed approach can accurately estimate contacts during walking on flat ground and stairs. Specifically, it achieves an accuracy of 99% with a time deviation of 8 ms on the flat ground and an accuracy of 95% with a time deviation of 10 ms on stairs.

Abstract:
Insects and hummingbirds exhibit remarkable agility, including full body flip maneuvers. Achieving similar maneuvers with bio-inspired tailless flapping-wing robots (FWRs) is challenging due to the complex dynamics, inherent nonlinearities and control issues. This paper presents a nonlinear model predictive control (NMPC) algorithm to enable the 360-degree flip maneuver for the developed X-wing tailless FWR, which weighs 30.8 g and has a wingspan of 14.5 cm. We first introduce a high-fidelity model of the FWR, which incorporates the aerodynamics of the wings, dynamics of the motors and servos, body kinodynamic model, and the model of thrust and torques generation. This high-fidelity model allows for testing the FWR in simulation environments, thereby reducing the damage and cost associated with flip maneuvers in real-world experiments. Based on this high-fidelity model, we propose an NMPC controller to compute offline the optimal state trajectories and corresponding control inputs, which are then used as state references and the feedforward control for the FWR during its 360-degree flip maneuvers. Next, we present an online basic feedback controller that integrates the feedforward control for the FWR’s flip control. Experimental results demonstrate the successful execution of the flip maneuvers without any mechanical modifications, highlighting the effectiveness of the proposed control strategy.

Abstract:
Detecting human actions is a crucial task for autonomous robots and vehicles, often requiring the integration of various data modalities for improved accuracy. In this study, we introduce a novel approach to Human Action Recognition (HAR) using language supervision named LS-HAR based on skeleton and visual cues. Our method leverages a language model to guide the feature extraction process in the skeleton encoder. Specifically, we employ learnable prompts for the language model conditioned on the skeleton modality to optimize feature representation. Furthermore, we propose a fusion mechanism that combines dual-modality features using a salient fusion module, incorporating attention and transformer mechanisms to address the modalities’ high dimensionality. This fusion process prioritizes informative video frames and body joints, enhancing the recognition accuracy of human actions. Additionally, we introduce a new dataset tailored for real-world robotic applications in construction sites, featuring visual, skeleton, and depth data modalities, named VolvoConstAct. This dataset serves to facilitate the training and evaluation of machine learning models to instruct autonomous construction machines for performing necessary tasks in real-world construction sites. To evaluate our approach, we conduct experiments on our dataset as well as three widely used public datasets: NTU-RGB+D, NTU-RGB+D 120, and NW-UCLA. Results reveal that our proposed method achieves promising performance across all datasets, demonstrating its robustness and potential for various applications. The code, dataset, and demonstration of real-machine experiments are available at: https://mmahdavian.github.io/ls_har/

Abstract:
Traditional cable-driven parallel robots (T-CDPRs) typically employ multiple driving cables extending from fixed anchor points to connect with the end-effector. Once assembled, the anchor positions and the wrench feasible workspace of the CDPR are fixed and cannot be readily modified. This paper presents a spatial movable anchor winch cable-driven parallel robot (M-CDPR), whose distinctive feature lies in its ability to freely move anchor winches along enclosed circular tracks. Through modelling of the proposed CDPR, this study proposes an anchor winch reconfiguration methodology that significantly enhances the kinematic flexibility of the M-CDPR. Comparative simulation analyses of the wrench feasible workspace and orientation variation range between the M-CDPR and its fixed anchor counterpart with identical dimensions demonstrate that the incorporation of movable anchor winches expands both the wrench feasible workspace and orientation variation range of the M-CDPR. Finally, a prototype was constructed, and relevant position control and impedance control experiments were conducted using the proposed anchor winch reconfiguration method.

Abstract:
Visual Place Recognition (VPR) is a crucial capability for long-term autonomous robots, enabling them to identify previously visited locations using visual information. However, existing methods remain limited in indoor settings due to the highly repetitive structures inherent in such environments. We observe that scene texts frequently appear in indoor spaces and can help distinguish visually similar but different places. This inspires us to propose TextInPlace, a simple yet effective VPR framework that integrates Scene Text Spotting (STS) to mitigate visual perceptual ambiguity in repetitive indoor environments. Specifically, TextInPlace adopts a dual-branch architecture within a local parameter sharing network. The VPR branch employs attention-based aggregation to extract global descriptors for coarse-grained retrieval, while the STS branch utilizes a bridging text spotter to detect and recognize scene texts. Finally, the discriminative texts are filtered to compute text similarity and re-rank the top-K retrieved images. To bridge the gap between current text-based repetitive indoor scene datasets and the typical scenarios encountered in robot navigation, we establish an indoor VPR benchmark dataset, called Maze-with-Text. Extensive experiments on both custom and public datasets demonstrate that TextInPlace achieves superior performance over existing methods that rely solely on appearance information. The dataset, code, and trained models are publicly available at https://github.com/HqiTao/TextInPlace.
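The final re-ranking step, where text similarity reorders the top-K retrieved images, can be sketched as follows. This is a hypothetical illustration: the fuzzy string matching via `difflib`, the weighted-sum fusion, the `weight` parameter, and the example image identifiers are assumptions, not the scoring actually used by TextInPlace.

```python
from difflib import SequenceMatcher


def text_similarity(query_texts, candidate_texts):
    """Average, over query strings, of the best fuzzy match among candidate strings."""
    if not query_texts or not candidate_texts:
        return 0.0
    best = [max(SequenceMatcher(None, q.lower(), c.lower()).ratio()
                for c in candidate_texts)
            for q in query_texts]
    return sum(best) / len(best)


def rerank(query_texts, candidates, weight=0.5):
    """Reorder top-K candidates by a weighted sum of visual and text scores.

    candidates: list of (image_id, visual_score, recognized_texts) tuples,
    with visual_score assumed to lie in [0, 1].
    """
    scored = [((1.0 - weight) * visual + weight * text_similarity(query_texts, texts),
               image_id)
              for image_id, visual, texts in candidates]
    scored.sort(key=lambda item: item[0], reverse=True)
    return [image_id for _, image_id in scored]


# Made-up example: three visually similar corridors, told apart by door signs.
top_k = [
    ("corridor_a", 0.82, ["Room 214"]),
    ("corridor_b", 0.80, ["Room 301"]),
    ("corridor_c", 0.78, []),
]
ranking = rerank(["Room 301"], top_k)
```

Here the candidate whose door sign matches the query text moves to the front even though its appearance score is slightly lower, which is the intended effect of using scene text to resolve visually repetitive places.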

Abstract:
Multi-Agent Path Finding (MAPF), which focuses on finding collision-free paths for multiple robots, is crucial for applications ranging from aerial swarms to warehouse automation. Solving MAPF is NP-hard, so learning-based approaches for MAPF have gained attention, particularly those leveraging deep neural networks. Nonetheless, despite the community’s continued efforts, all learning-based MAPF planners still rely on decentralized planning due to variability in the number of agents and map sizes. We have developed the first centralized learning-based policy for the MAPF problem, called RAILGUN. RAILGUN is not an agent-based policy but a map-based policy. By leveraging a CNN-based architecture, RAILGUN can generalize across different maps and handle any number of agents. We collect trajectories from rule-based methods to train our model in a supervised way. In experiments, RAILGUN outperforms most baseline methods and demonstrates great zero-shot generalization capabilities on various tasks, maps and agent numbers that were not seen in the training dataset.

Abstract:
We introduce a novel self-improving framework that enhances Embodied Visual Tracking (EVT) with Vision-Language Models (VLMs) to address the limitations of current active visual tracking systems in recovering from tracking failure. Our approach combines off-the-shelf active tracking methods with VLMs’ reasoning capabilities, deploying a fast visual policy for normal tracking and activating VLM reasoning only upon failure detection. The framework features a memory-augmented self-reflection mechanism that enables the VLM to progressively improve by learning from past experiences, effectively addressing VLMs’ limitations in 3D spatial reasoning. Experimental results demonstrate significant performance improvements, with our framework boosting success rates by 72% with state-of-the-art RL-based approaches and 220% with PID-based methods in challenging environments. This work represents the first integration of VLM-based reasoning to assist EVT agents in proactive failure recovery, offering substantial advances for real-world robotic applications that require continuous target monitoring in dynamic, unstructured environments. Project website: https://sites.google.com/view/evt-recovery-assistant.

Abstract:
We explore symbolic policy optimization for various legged locomotion challenges, specifically walker environments ranging from bipedal to highly redundant systems with 128 legs. These represent a broad range of action space dimensionalities. We find that state-of-the-art symbolic policy optimization approaches struggle to scale to these higher dimensional problems, due to the need to iterate over action dimensions and their reliance on a neural network anchor policy. We thus propose Fast Symbolic Policy (FSP) to accelerate the training of symbolic locomotion policies. This approach avoids the need to iterate over the action dimensions and does not require a pre-trained neural network anchor. We also propose Dim-X, a method for effectively reducing the action space dimensionality using the inductive priors of legged locomotion. We demonstrate that FSP with Dim-X can learn symbolic policies with improved scaling performance compared to the baselines, vastly exceeding that possible with previous symbolic techniques. We further show that Dim-X on its own can also be integrated into neural network policies to shorten their training time and improve scaling performance.

Abstract:
High-definition (HD) maps are crucial for autonomous driving systems. Despite recent advances in learning-based HD map prediction methods, these approaches experience significant performance degradation when encountering unseen weather or lighting conditions due to feature distribution discrepancies (domain gaps) of input images. To address this issue, we propose SAMap, a novel map learning framework that enhances domain generalization capabilities of existing models by reducing domain discrepancies in input images. SAMap introduces a Semantic Aligner, an image-to-image transformation module that aligns images from different domains into a unified domain space while preserving semantic consistency. To train this aligner, we leverage Vision-Language Models (VLMs) that have acquired image-text alignment capabilities. Specifically, we first train a Prompt Learner that combines handcrafted and learnable prompts to capture domain-invariant semantic information. We then train the Semantic Aligner through dual supervision mechanisms: a content preservation loss that maintains feature consistency across transformations and a semantic alignment loss that leverages the VLM’s encoders to align transformed images with domain-invariant textual representations. Extensive experiments on the NuScenes dataset demonstrate that when integrated with three existing HD map prediction methods, SAMap achieves a performance improvement of up to 11.6% on unseen domains (rain or night conditions), effectively validating its generalization capabilities across domains.

Abstract:
This paper presents a novel, compact overactuated tricopter featuring a servo-driven twisting and tilting mechanism, preventing the adverse effects of internal force contradiction during flight. Each arm’s vectored thrust is provided by a single motor, with the twisting and tilting angles controlled by two vertically mounted servos. These components are collectively mounted within a 3D-printed semi-ring structure, and are rigidly attached to the fuselage via carbon tubes at the twist end. To address the asymmetry inherent in the tricopter configuration, we conducted a qualitative analysis of the disturbances introduced by the actuators. Additionally, we emphasize the need to include gyroscopic torque effects caused by arm rotations. This issue is addressed using a control allocation method, with our proposed improved Force Decomposition (FD)-based iteration offering a low-cost computational solution. The dynamic models of the motion of rotational joints, identified and employed as virtual sensors, contribute to the estimation of the improved control effective matrix. This overactuated tricopter can operate like a conventional rotorcraft, with the added capability of achieving attitude adjustments through manual control inputs. Finally, we demonstrate the tricopter’s advantages by comparing simulations and flight experiments, both with and without the application of the improved method.

Abstract:
Semantic scene understanding in bird's-eye view (BEV) plays a crucial role in autonomous driving. A common approach to generating BEV maps from LiDAR point-cloud data involves constructing a pillar-level representation by projecting 3D point clouds onto a 2D plane. This process partially discards spatial geometric information and produces sparse semantic maps. However, downstream tasks (e.g., trajectory planning and prediction) typically require dense grid-like semantic BEV maps rather than sparse segmentation outputs. To bridge this gap, we propose PointDenseBEV, an end-to-end, distribution-aware feature fusion framework. It takes as input sparse LiDAR point clouds and directly generates dense semantic BEV maps. Spatial geometric information and temporal context are embedded as auxiliary semantic cues within the BEV grid representation to enhance semantic density. Extensive experiments on the SemanticKITTI dataset demonstrate that our method achieves competitive performance compared to existing approaches.
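To illustrate why pillar-level projection tends to produce sparse maps, the following minimal Python sketch bins LiDAR points into BEV grid cells. It is not PointDenseBEV itself: the function names `pillarize` and `max_height_map`, the 0.5 m cell size, and the sample points are all assumptions made for this example.

```python
from collections import defaultdict


def pillarize(points, cell_size=0.5):
    """Group (x, y, z) points into BEV pillars keyed by integer grid index.

    cell_size is a hypothetical grid resolution in meters.
    """
    pillars = defaultdict(list)
    for x, y, z in points:
        # Floor division maps each point to the grid cell that contains it;
        # only the height z is kept as a per-pillar feature here.
        key = (int(x // cell_size), int(y // cell_size))
        pillars[key].append(z)
    return dict(pillars)


def max_height_map(pillars):
    """Collapse each pillar to one value (its maximum height)."""
    return {cell: max(zs) for cell, zs in pillars.items()}


# Three made-up points: two share a cell, one lands two cells away.
points = [(0.1, 0.1, 1.0), (0.2, 0.3, 2.0), (1.2, 0.1, 0.5)]
pillars = pillarize(points)
heights = max_height_map(pillars)
```

In this toy input only cells (0, 0) and (2, 0) receive points, so the cell between them stays empty. That is the kind of gap a dense BEV output has to fill.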

Abstract:
Underwater target tracking is a critical challenge in marine exploration and defense applications due to the unknown maneuvers of the target and the complex marine environment. To overcome this challenge, this paper develops a digital twin (DT)-driven tracking strategy for targets with unknown maneuvers using remotely operated vehicles (ROVs). To capture the maneuver characteristics of the target, a state prediction-based DT framework is constructed, in which a neural network learning strategy is designed to estimate the unknown state transition matrix of the target. Based on the predicted target state, a reinforcement learning (RL)-based tracking controller is designed for the virtual ROVs in the DT model, such that the optimal tracking policy from the DT model can be transferred to the physical ROVs. To reduce the matching error between virtual and physical ROVs, an RL-based optimization algorithm is run using the data exchanged between the DT model and the ROVs. Notably, the DT-driven target tracking strategy not only reduces communication energy consumption by periodically feeding back real data from the ROVs to the DT model, but also relaxes the dependence on a target maneuver model through the state prediction method. Finally, experimental results are provided to verify the effectiveness of our strategy.

Abstract:
Robotic motion planning faces formidable challenges in constrained environments, particularly in rapidly searching for feasible solutions and converging toward the optimum. This study introduces Multi-Sets Tree (MST), a sampling-based planner designed to accelerate path searching and solution optimization. MST integrates estimated guided incremental local densification (GuILD) sets that are based on prior estimated solution costs before finding the initial solution. For path optimization, MST integrates novel beacon selectors to define problem subsets, thereby guiding exploration and effectively exploiting high-potential areas. This multi-set strategy ensures balanced exploration and exploitation, enabling MST to handle sparse free space. Moreover, MST utilizes adaptive sampling techniques via the Lebesgue measure of domain subsets for rapid search. MST improves search efficiency and path optimality, particularly in constrained high-dimensional environments. It extends the informed sampling concept by refining the search region and batch sampling. Experimental results demonstrate that MST outperforms single-query planners across ℝ⁴ to ℝ¹⁶ benchmarks and in real-world robotic navigation tasks. A video showcasing our experimental results is available at: https://youtu.be/obftvS0a41M.
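For context on the informed sampling concept that MST extends, the sketch below computes the Lebesgue measure of the standard informed set used in Informed RRT*, a prolate hyperspheroid whose foci are the start and goal. This is the well-known baseline formula, not MST's own subset construction. The function names and the symbols `c_best` (current solution cost) and `c_min` (straight-line start-to-goal distance) are assumptions for this example.

```python
import math


def unit_ball_volume(n):
    """Lebesgue measure of the unit ball in R^n."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)


def informed_set_measure(c_best, c_min, n):
    """Measure of the prolate hyperspheroid with foci at start and goal.

    c_best: cost of the current best solution (transverse diameter).
    c_min:  Euclidean distance between start and goal.
    """
    if c_best < c_min:
        raise ValueError("c_best cannot be smaller than c_min")
    semi_major = c_best / 2.0
    semi_minor = math.sqrt(c_best ** 2 - c_min ** 2) / 2.0
    # One major axis and (n - 1) equal minor axes scale the unit ball.
    return unit_ball_volume(n) * semi_major * semi_minor ** (n - 1)
```

As the solution cost `c_best` approaches `c_min`, the minor axes shrink and the measure tends to zero. This is why sampling weighted by subset measure concentrates effort as the path improves.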

Abstract:
For direct teleoperation tasks, the follower robot accomplishes tasks by strictly executing the inputs from the operator. However, the operator's physiological tremor seriously reduces the smoothness of the trajectory, especially in tasks that rely on the operator’s experience, such as gluing, and the random tremor is hard to describe and suppress online. To address this challenge, this paper proposes an Online Hidden Markov Model with Two-Layer Bayesian (TLB-OHMM) method that suppresses the tremor by estimating the operator's expected speed; it can recognize new intentions online and adapts to complex trajectories. First, the actual moving speed of the human hand is modeled as an HMM. Then, an Online-HMM method based on the online expectation-maximization (EM) algorithm is introduced to shorten the training time and update the HMM parameters online. Finally, a two-layer Bayesian method is proposed to estimate the expected speed of the human hand. Experimental results in simulation and in a real teleoperated gluing task show that the proposed method greatly reduces computation time and improves trajectory quality, especially for complex curved trajectories, compared with the traditional HMM-based method.

Abstract:
In human communication, referring to a specific object within an environment often involves the combination of a pointing gesture to indicate the object’s direction and linguistic descriptions specifying its name and attributes, thereby enabling precise object identification. Inspired by this natural multimodal interaction, we formalize the zero-shot object navigation driven by language and pointing gesture (LG-ZSON) task, which aims to more closely approximate real-world human-agent communication scenarios. To address this task, we propose LGNav, an open-set, training-free navigation framework. LGNav estimates the pointing gesture direction by extracting human body landmarks and integrates this directional information with depth images to initialize a versatile candidate position map (VCPM). The framework further employs open-vocabulary object detection to identify all potential candidate objects in the environment, projecting them onto the VCPM. Guided by a motion policy derived from the VCPM, LGNav continuously explores the unknown environment, sequentially visits candidate objects, and utilizes a large vision-language model (LVLM) to verify whether each candidate object satisfies the given navigation instruction. Extensive experimental results validate the effectiveness of LGNav, demonstrating its strong performance in the LG-ZSON task. Furthermore, even in the absence of pointing gestures, LGNav achieves competitive results on standard object navigation benchmarks, including the Gibson and HM3D datasets, outperforming a range of strong baseline methods.

Abstract:
The application of humanoid robots is gaining attention as a solution to the caregiver shortage caused by an aging population. In this study, we automated the process of changing a patient’s body position from supine to lateral, a key aspect of nursing care. We proposed a method for recognizing a patient’s 3D posture at close range by simultaneously using a fisheye camera and an RGB-D camera. For robot motion, we developed a trajectory generation method that adapts to the patient’s posture by converting measurement data into a mathematical model. Additionally, we identified the optimal timing for the movement of robot arms with minimal physical strain by considering human body dynamics. In all experiments using mannequins of different body shapes, the robot successfully reached the target joint and lifted one side of the body by more than 48 degrees. Future work will include detection of joints unaffected by body bulges and application of the method to other repositioning movements.

Abstract:
Precise motion tracking control with unknown structural knowledge and noise disturbance for redundant robots remains a critical and unresolved challenge. This article proposes a novel data-driven fuzzy discrete recurrent neural network (D2-FDRNN) model to address two fundamental limitations of existing models: dependency on known kinematic knowledge and fixed sampling schemes. First, a Jacobian pseudo-inverse estimator is developed to reconstruct the manipulator’s necessary kinematic knowledge using input and output data, eliminating the need for explicit Jacobian inversion. Second, a fuzzy logic-based adaptive sampling strategy dynamically adjusts the step size to balance computational efficiency and tracking precision. In addition, a Kalman filter algorithm is applied to reduce the impact of noise. Rigorous proofs confirm the model’s exponential convergence and noise immunity. To validate the proposed D2-FDRNN model, simulations and physical experiments are carried out. The source code is available at https://github.com/YingluckZ/DD-FDRNN.git.

Abstract:
Autonomous Underwater Vehicles (AUVs) require energy-efficient and responsive attitude control for underwater operations. We present UVS, a novel underwater vehicle that combines Variable Center of Mass System (VCMS) and thrusters for hybrid attitude regulation. Through multi-objective optimization of the VCMS structure, we achieved a 5.19% larger pitch angle range while reducing space occupation by 15.72%. Pool experiments demonstrated near-linear pitch control from 17.5° to 172.5° with stable horizontal-vertical mode transitions. Our proposed collaborative control method integrates VCMS and thruster advantages, enabling rapid convergence to target attitudes with long-term stability. The results show UVS’s potential for energy-efficient, wide-range attitude control in mobile ocean sensing applications.

Abstract:
Solving the Simultaneous Localization and Mapping (SLAM) problem is essential for most mobile robotics applications that do not have an a priori environment representation. The SLAM problem is well-studied, with previous works demonstrating impressive results with a variety of robots, sensors, and environments. However, despite the widespread need for SLAM solutions across a broad spectrum of robotics disciplines and applications, it remains challenging to quickly and easily scale existing solutions to new robots, environments, and tasks. In this paper, we address this problem by introducing Hyla-SLAM, a framework for 3D LiDAR-based SLAM that uses dynamic memory management to efficiently create and manage maps of environments of arbitrary size and density. Hyla-SLAM also scales to diverse systems and applications by using behavior trees to maximize runtime flexibility and extensibility. We demonstrate the scalability of Hyla-SLAM in experiments using datasets collected in North America, Europe, and Asia to generate a single, unified global-scale map thousands of kilometers across that can be efficiently accessed and expanded. We also show experiments using the behavior tree interface to make robot- or task-informed modifications that enable deployment on heterogeneous robots with varying system constraints. These results demonstrate the framework’s ability to efficiently create and manage huge maps while generalizing to a wide range of systems and applications with minimal reconfiguration. We release Hyla-SLAM’s code implementation as open source.

Abstract:
Aiming to enhance the consistency and thus long-term accuracy of Extended Kalman Filters for terrestrial vehicle localization, this paper introduces the Manifold Error State Extended Kalman Filter (M-ESEKF). By representing the robot’s pose in a space with reduced dimensionality, the approach ensures feasible estimates on generic smooth surfaces, without introducing artificial constraints or simplifications that may degrade a filter’s performance. The accompanying measurement models are compatible with common loosely- and tightly-coupled sensor modalities and also implicitly account for the ground geometry. We extend the formulation by introducing a novel correction scheme that embeds additional domain knowledge into the sensor data, giving more accurate uncertainty approximations and further enhancing filter consistency. The proposed estimator is seamlessly integrated into a validated modular state estimation framework, demonstrating compatibility with existing implementations. Extensive Monte Carlo simulations across diverse scenarios and dynamic sensor configurations show that the M-ESEKF outperforms classical filter formulations in terms of consistency and stability. Moreover, it eliminates the need for scenario-specific parameter tuning, enabling its application in a variety of real-world settings.

Abstract:
Multi-agent adversarial tasks, such as swarm robotics and autonomous vehicle coordination, demand efficient decentralized collaboration under partial observability. While model-free multi-agent RL (MF-MARL) methods require extensive environment interactions, most existing multi-agent model-based RL (MA-MBRL) methods fail to align with the Centralized Training with Decentralized Execution (CTDE) paradigm, which limits system flexibility. This paper proposes a novel modular architecture with refined observations (MARO) to achieve the CTDE paradigm by decoupling agents from the world model. Key innovations include: 1) an enhanced world model with weighted loss and history-augmented rollout for high-quality data generation; 2) a dual-stream semantic decomposition network (DSDN) that performs fine-grained decomposition of observations to refine action mapping and mitigate performance degradation from information loss. Extensive experiments on the StarCraft Multi-Agent Challenge (SMAC) demonstrate superior performance over opponents, validating the effectiveness and advancement of MARO.

Abstract:
Cable-driven soft manipulators, with inherent compliance and hyper-redundancy, offer significant advantages in unstructured environments but present formidable challenges in modeling of inverse kinematics due to nonlinear deformations and underactuation. In this paper, building on a modified forward kinematic model, a physics-informed neural networks (PINN) framework based on spatiotemporal data is proposed for efficient inverse kinematics computation of cable-driven soft robotic manipulators. A geometrically exact forward kinematic model is constructed under the Piecewise Constant Curvature (PCC) assumption, extended to multi-section configurations, and enhanced by cable deflection compensation to account for practical routing constraints. Experimental validation shows a 40.11% reduction in end-effector positioning error (average 15.98 mm) when deflection effects are included. The proposed PINN architecture takes time and section count as inputs and outputs the corresponding manipulator configuration, enabling unified spatiotemporal trajectory tracking by minimizing elastic energy while satisfying kinematic constraints. Compared to particle swarm optimization (PSO), which requires iterative computation for each trajectory sample, the proposed method reduces computational time by over 71.9%, demonstrating superior efficiency in solving redundant inverse kinematics problems. This work bridges data-driven and mechanics-based approaches, offering a scalable solution for real-time control of soft manipulators.
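As a reference point for the Piecewise Constant Curvature (PCC) assumption, the sketch below evaluates the textbook tip position of a single constant-curvature section. It deliberately omits the cable deflection compensation and multi-section extension described in the abstract. The function name `pcc_tip_position` and the parameter names (`kappa` for curvature, `phi` for the bending-plane angle, `length` for arc length) are assumptions for this example.

```python
import math


def pcc_tip_position(kappa, phi, length):
    """Tip position of one constant-curvature section with its base at the origin.

    The section starts along +z and bends in the plane rotated by phi about z.
    kappa is in 1/m, phi in radians, length in meters.
    """
    if abs(kappa) < 1e-9:
        # Zero curvature: the section is a straight segment along z.
        return (0.0, 0.0, length)
    theta = kappa * length  # total bending angle of the arc
    radial = (1.0 - math.cos(theta)) / kappa  # offset within the bending plane
    return (
        radial * math.cos(phi),
        radial * math.sin(phi),
        math.sin(theta) / kappa,
    )
```

For example, a 1 m section bent through a quarter circle (`kappa = pi / 2`) in the x-z plane ends at x = z = 2/π m. This matches the radius of the corresponding arc.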

Abstract:
Autonomous navigation is needed for several robotics applications. In this paper we present an autonomous Micro Aerial Vehicle (MAV) system which purely relies on cost-effective and light-weight passive visual and inertial sensors to perform large-scale autonomous navigation in outdoor, unstructured and cluttered environments. We leverage visual-inertial simultaneous localization and mapping (VI-SLAM) for accurate MAV state estimates and couple it with a volumetric occupancy submapping system to achieve a scalable mapping framework which can be directly used for path planning. To ensure the safety of the MAV during navigation, we also propose a novel reference trajectory anchoring scheme that deforms the reference trajectory the MAV is tracking upon state updates from the VI-SLAM system in a consistent way, even upon large state updates due to loop-closures. We thoroughly validate our system in both real and simulated forest environments and at peak velocities up to 3 m/s – while not encountering a single collision or system failure. To the best of our knowledge, this is the first system which achieves this level of performance in such an unstructured environment using low-cost passive visual sensors and fully on-board computation, including VI-SLAM. Code available at https://github.com/ethz-mrl/mrl_navigation.

Abstract:
This paper proposes a novel whole-body impedance control method for the Collaborative dUal-arm Robot manIpulator (CURI) in Human-Robot Collaboration (HRC). The method enables CURI to adapt its physical behavior to human motion while following trajectories learned from human-human demonstrations. A whole-body impedance controller coordinates the robot joints to achieve desired Cartesian space impedance. Collaborative tasks are captured from human-human demonstrations and represented using a Task-parameterized Gaussian Mixture Model (TP-GMM). Electromyography (EMG) sensors record muscle activities to estimate human impedance profiles, which are then mimicked by a variable impedance controller. An adaptive parameter is introduced to adjust robot stiffness based on spatial displacement between the robot and human, ensuring safe and efficient interaction. Experimental validation through confrontational Tai Chi pulling/pushing tasks demonstrates the superiority of the proposed adaptive impedance method over the fixed impedance controller.

Abstract:
LiDAR odometry has gained popularity due to the LiDAR sensor’s accurate depth measurement and robustness to varying illumination conditions. The feature-based methods achieve the advantage of efficiency through feature extraction, while the distribution-based ones attain better accuracy by modeling the point cloud as distributions. Combining the strengths of both methods is anticipated to yield superior performance. However, existing combination schemes typically extract and process features and distributions separately, and such a loosely integrated approach cannot fully leverage their complementarity. To address this problem, we propose a novel LiDAR odometry method based on the surface distributed point (SDP) feature with dual feature fusion. Specifically, the SDP feature is introduced to tightly integrate features and distributions, facilitating efficient feature association and map maintenance. On this basis, the associated source and target features are then effectively integrated through dual feature fusion to form the dual-fusion (DF) associated plane. This plane serves as the basis for constructing the point-to-DF associated plane constraint for pose optimization. As a result, the local planar structure is more accurately reflected, thereby enhancing the accuracy of pose estimation. The SDP feature and the resultant constraint are employed in both scan-to-map matching and fixed-lag smoothing, which are hierarchically organized to achieve accurate pose estimation. Experiments on the KITTI dataset and the large-scale KITTI-360 dataset demonstrate the effectiveness of the proposed method.

Abstract:
Magnetic helical microrobots have been widely applied in environmental remediation, sensing, targeted medical applications, and so on. However, for locomotion and manipulation in unstructured liquid environments, the capabilities of distinct motions and on-demand parking/starting across a team of microrobots are essential. Here, we propose a method for achieving on-demand motion conversion of helical microrobots by modulating surface wettability through surface chemical modification and surface microstructural modification. Microrobots after chemical modification exhibit hydrophilicity, whereas microrobots after microstructural modification exhibit hydrophobicity; the latter possess a higher step-out frequency and maximum forward velocity than the former. The step-out frequencies of the three types of microrobots (chemistry-modified, unmodified, and pimples-modified) are 13 Hz, 16 Hz, and 22 Hz, and their maximum velocities are 385 μm/s, 511 μm/s, and 649 μm/s, respectively. Furthermore, we demonstrate that our method can be employed to achieve effective on-demand targeted motions and modal conversion in liquid environments. We anticipate that the method can potentially be employed to achieve precise targeted drug delivery and surgery in biomedical applications.

Abstract:
Recently, visual grounding and multi-sensor settings have been incorporated into perception systems for terrestrial autonomous driving and Unmanned Surface Vessels (USVs), yet the high complexity of modern learning-based multi-sensor visual grounding models prevents them from being deployed on USVs in real life. To this end, we design a low-power multi-task model named NanoMVG for waterway embodied perception, guiding both camera and 4D millimeter-wave radar to locate specific object(s) through natural language. NanoMVG can perform both box-level and mask-level visual grounding tasks simultaneously. Compared to other visual grounding models, NanoMVG achieves highly competitive performance on the WaterVG dataset, particularly in harsh environments. Moreover, real-world experiments with NanoMVG deployed on the embedded edge device of a USV demonstrate its fast inference speed for real-time perception and its ultra-low power consumption for long endurance.

Abstract:
Micromanipulation using magnetic microswarms has garnered significant attention in recent years due to their potential in microscale cargo delivery tasks. While existing studies have demonstrated the capabilities of microswarms in basic manipulation tasks, they often lack the autonomy required to handle more complex specifications, particularly temporal logic tasks. In this paper, we propose a novel formal planning strategy for magnetic microswarms that enables cargo delivery in complex environments while satisfying finite linear temporal logic (LTLf) specifications. Our approach consists of two key components. First, we develop a high-level path planner based on a bidirectional temporal logic rapidly-exploring random tree star (BTL-RRT) algorithm, which facilitates efficient planning while ensuring compliance with the given task specifications. Second, we employ an automaton to manage the manipulation modes of the microswarm, enabling real-time control over the capture and release of cargoes. In addition, we implement the planning strategy on microswarms actuated by a visual feedback magnetic tweezers system. Extensive simulations and experimental results demonstrate the effectiveness of the proposed planning strategy. The results indicate that, using the proposed approach, microswarms can autonomously select and deliver multiple microbeads to designated regions in both static and dynamic environments, adhering to the LTLf specifications.
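To make the notion of an LTLf specification tracked by an automaton concrete, the sketch below hand-codes a three-state automaton for the formula F(A ∧ F B), read as "eventually visit region A, and from then on eventually visit region B", and checks it on finite traces. The formula, the region labels, and the function name `satisfies_a_then_b` are hypothetical. The abstract does not state which specifications or automaton construction the authors use.

```python
def satisfies_a_then_b(trace, first="A", second="B"):
    """Check a finite trace against F(first AND F second).

    trace is a list of sets; each set holds the region labels that are
    true at that time step. States: 0 = waiting for `first`,
    1 = `first` seen, waiting for `second`, 2 = accepting.
    """
    state = 0
    for labels in trace:
        if state == 0 and first in labels:
            state = 1
        # Checked in the same step because "F second" includes the present.
        if state == 1 and second in labels:
            state = 2
    return state == 2


ok = satisfies_a_then_b([set(), {"A"}, set(), {"B"}])
wrong_order = satisfies_a_then_b([{"B"}, {"A"}])
same_step = satisfies_a_then_b([set(), {"A", "B"}])
```

A planner that carries this automaton state alongside the geometric state can reject paths that reach B before A, even when both regions are visited.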

Abstract:
Passive haptic feedback devices are designed to be lightweight, compact, and easier to control than their active counterparts. However, most existing passive force feedback wearable devices rely on mechanical locking or jamming mechanisms, which are bulky and often require large external power sources such as air compressors, fluid pressure systems, or high-voltage supplies for stiffness modulation. This study introduces a new electromagnetically modulated resistance mechanism to achieve compact and efficient passive force feedback in wearable devices. The proposed system employs two electromagnets to dynamically modulate tendon tension, generating resistance forces for the fingers. This approach enables passive force feedback without bulky mechanisms, complex actuation, or intricate control strategies. To validate the effectiveness of the proposed mechanism, we developed a kinesthetic wearable device for the index finger and thumb. Experimental evaluations demonstrated that the device achieved a peak tendon locking force of 5.8 N with a current of 0.1 A at 9 V. A user study with nine participants assessed stiffness discrimination using resistive force feedback in three tasks: index finger only, thumb only, and pinch action. The study yielded accuracy rates of 75%, 66%, and 69%, respectively. Participants found the device comfortable and easy to use, highlighting its potential for realizing compact, lightweight, and effective passive force feedback devices.

Abstract:
The optimization of hardware and control of humanoid robots for multiple tasks is still an open challenge due to the competing objectives of different behaviors and the complexity of considering control architectures at the design level of a humanoid robot. In this work, we propose a unified multi-objective optimization framework that jointly optimizes both hardware and hierarchical control architectures of a humanoid robot to enhance performance in multiple tasks. Our method employs a Non-dominated Sorting Genetic Algorithm II (NSGA-II) to identify optimal robot morphology and control parameters while balancing trade-offs between diverse task requirements. By leveraging genetic algorithms, we enable the integration of discrete search spaces while overcoming the local minima limitations associated with classical nonlinear optimization techniques. Furthermore, the proposed approach directly incorporates the simulation results, ensuring that hardware optimization is performed considering the system dynamics. We validate our approach by optimizing a humanoid robot for two distinct tasks: walking and payload lifting, leveraging MuJoCo to evaluate the task performances. The proposed framework successfully identifies Pareto-optimal tradeoffs, providing a set of design solutions adaptable to different operational requirements.
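The Pareto-optimal trade-offs mentioned above rest on the dominance relation at the core of NSGA-II's non-dominated sorting. The sketch below extracts only the first non-dominated front for a minimization problem. It does not include crowding distance, selection, or variation operators, and the function names and sample objective values are made up for illustration.

```python
def dominates(a, b):
    """True if a is no worse than b in every objective and better in one.

    All objectives are assumed to be minimized.
    """
    no_worse = all(x <= y for x, y in zip(a, b))
    strictly_better = any(x < y for x, y in zip(a, b))
    return no_worse and strictly_better


def pareto_front(points):
    """Return the points that no other point dominates, in input order."""
    return [p for p in points if not any(dominates(q, p) for q in points)]


# Hypothetical (walking cost, lifting cost) pairs for five candidate designs.
designs = [(1, 5), (2, 3), (3, 4), (4, 1), (5, 5)]
front = pareto_front(designs)
```

Here (3, 4) is dominated by (2, 3) and (5, 5) by (1, 5), so three designs remain on the front. Each is a different compromise between the two objectives.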

Abstract:
Lower limb exoskeletons and exosuits have shown promise in augmenting human physical capabilities, with applications ranging from rehabilitation to performance enhancement. Accurate evaluation of their impact on metabolic energy expenditure is crucial for optimizing design and control strategies. While experimental measurement of metabolic cost via indirect calorimetry provides direct assessment, it is often impractical outside laboratory settings. Computational models offer an alternative, but their effectiveness in predicting metabolic cost changes induced by assistive devices remains underexplored. This study investigates the impact of incorporating different levels of complexity and sensory information, as well as various metabolic cost models, on estimating muscle metabolic cost during walking with a passive biarticular thigh exosuit. We compare three modeling approaches: joint-space dynamics, musculoskeletal simulation with effort minimization, and EMG-informed musculoskeletal simulation, each employing several metabolic models. Results show that EMG-informed musculoskeletal simulation, particularly using the Uchida (2016) metabolic model, provides the highest accuracy in predicting metabolic cost changes. Musculoskeletal simulation with effort minimization also shows promise, offering a viable alternative without the need for EMG data. These findings highlight the potential of computational models in evaluating and optimizing assistive devices.

Abstract:
In this letter, we present the design, manufacturing, and performance test of a centimeter-scale two-legged swimming robot, RoboNotonecta, inspired by the efficient and agile locomotion of an aquatic beetle backswimmer (Notonectid). The robot utilizes a crank-slider and slider-rocker paddling mechanism for propulsion and steering. Its swimming legs are made from Smart Composite Microstructures (SCM) and feature an asymmetric anti-bending stiffness design. This design allows the legs to bend during the recovery stroke and expand during the power stroke, enabling efficient thrust generation. RoboNotonecta has a total mass of 10.3 g and a body length (BL) of 6.4 cm. It can swim at speeds up to 16.2 cm/s (2.5 BL/s), with a minimum turning radius of 4.7 cm (0.73 BL). Equipped with an onboard battery, it can move freely on the water surface while avoiding obstacles. These capabilities demonstrate its potential for environmental monitoring and exploration applications.
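The body-length-normalized figures above follow directly from the stated 6.4 cm body length. The short check below reproduces them; the helper name `to_body_lengths` is an assumption for this example.

```python
BODY_LENGTH_CM = 6.4  # body length reported in the abstract


def to_body_lengths(length_cm, body_length_cm=BODY_LENGTH_CM):
    """Express a length (or a per-second distance) in body lengths."""
    return length_cm / body_length_cm


speed_bl_per_s = to_body_lengths(16.2)   # peak speed, cm/s -> BL/s
turning_radius_bl = to_body_lengths(4.7)  # minimum turning radius, cm -> BL
```

The unrounded values are 2.53125 BL/s and 0.734375 BL, consistent with the reported 2.5 BL/s and 0.73 BL.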

Abstract:
Robust and accurate localization and mapping of an environment using laser scanners, so-called LiDAR SLAM, is essential to many robotic applications. Early 3D LiDAR SLAM methods often exploited additional information from IMU or GNSS sensors to enhance localization accuracy and mitigate drift. Later, advanced systems further improved the estimation at the cost of a higher runtime and complexity. This paper explores the limits of what can be achieved with a LiDAR-only SLAM approach while following the "Keep It Small and Simple" (KISS) principle. By leveraging this minimalist design principle, our system, KISS-SLAM, achieves state-of-the-art performance in pose accuracy while requiring little to no parameter tuning for deployment across diverse environments, sensors, and motion profiles. We follow best practices in graph-based SLAM and build upon LiDAR odometry to compute the relative motion between scans and construct local maps of the environment. To correct drift, we match local maps and optimize the trajectory in a pose graph optimization step. The experimental results demonstrate that this design achieves competitive performance while reducing complexity and reliance on additional sensor modalities. By prioritizing simplicity, this work provides a new strong baseline for LiDAR-only SLAM and a high-performing starting point for future research. Furthermore, our pipeline builds consistent maps that can be used directly for downstream tasks like navigation. Our open-source system operates faster than the sensor frame rate in all presented datasets and is designed for real-world scenarios.

Abstract:
Despite the advantages of neurosurgical systems, achieving intuitive and safe collaboration with the robot during the coarse positioning of the instrument insertion end effector (IIEE) remains a critical issue. In this paper, we propose a novel non-contact hand-guided method based on magnetic sensing to address this issue. First, a wearable magnet band and a magnetic sensor are designed, based on which magnetic localization is achieved for detecting the surgeon’s hand location. Second, a quadratic programming-based controller is implemented to guarantee pose-based servo performance, higher rotational manipulability for IIEE fine alignment, and avoidance of joint position and velocity limits. For evaluation, two experiments are designed and conducted. Results show that the magnetic localization algorithm can achieve errors of < 4.7 mm and 2.6° in a dynamic path tracking test, which can provide an accurate magnet location for hand guidance. Moreover, the workflow of the proposed solution in a brain biopsy scenario demonstrates its enhancement of IIEE rotational manipulability (11.6% increase at the final configuration) and its improved safety through collision avoidance when another surgeon approaches for cannula delivery. This research contributes to enhanced intuitiveness and safety for surgeon-robot collaborative coarse positioning of the IIEE in neurosurgery.

Abstract:
Estimating the 3D pose of a drone is important for anti-drone systems, but existing methods struggle with the unique challenges of drone keypoint detection. Drone propellers serve as keypoints but are difficult to detect due to their high visual similarity and diversity of poses. To address these challenges, we propose DroneKey, a framework that combines a 2D keypoint detector and a 3D pose estimator specifically designed for drones. In the keypoint detection stage, we extract two key-representations (intermediate and compact) from each transformer encoder layer and optimally combine them using a gated sum. We also introduce a pose-adaptive Mahalanobis distance in the loss function to ensure stable keypoint predictions across extreme poses. We built new datasets of drone 2D keypoints and 3D pose to train and evaluate our method, which have been publicly released. Experiments show that our method achieves an AP of 99.68% (OKS) in keypoint detection, outperforming existing methods. Ablation studies confirm that the pose-adaptive Mahalanobis loss function improves keypoint prediction stability and accuracy. Additionally, improvements in the encoder design enable real-time processing at 44 FPS. For 3D pose estimation, our method achieved an MAE-angle of 10.62°, an RMSE of 0.221m, and an MAE-absolute of 0.076m, demonstrating high accuracy and reliability. The code and dataset are available at https://github.com/kkanuseobin/DroneKey.
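
For readers unfamiliar with the distance used in the loss, the following is a minimal sketch of a plain 2D Mahalanobis distance between a predicted and a target keypoint. How DroneKey makes the distance pose-adaptive is not described in the abstract, so the covariance here is simply a hypothetical input, and the function name is illustrative.

```python
import math


def mahalanobis_2d(pred, target, cov):
    """Mahalanobis distance between two 2D points under a 2x2 covariance.

    `cov` is ((a, b), (c, d)); it is assumed symmetric positive definite.
    """
    dx, dy = pred[0] - target[0], pred[1] - target[1]
    (a, b), (c, d) = cov
    det = a * d - b * c
    if a <= 0 or det <= 0:
        raise ValueError("covariance must be positive definite")
    # The inverse of [[a, b], [c, d]] is (1 / det) * [[d, -b], [-c, a]].
    q = (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det
    return math.sqrt(q)


# With an identity covariance the result is the Euclidean distance.
print(mahalanobis_2d((3.0, 4.0), (0.0, 0.0), ((1.0, 0.0), (0.0, 1.0))))
```

A larger variance along one axis down-weights errors along that axis, which is the general mechanism by which such a loss can tolerate direction-dependent uncertainty.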

Abstract:
Trajectory prediction is a fundamental technology for advanced autonomous driving systems and represents one of the most challenging problems in the field of cognitive intelligence. Accurately predicting the future trajectories of each traffic participant is a prerequisite for building high safety and high reliability decision-making, planning, and control capabilities in autonomous driving. However, existing methods often focus solely on the motion of other traffic participants without considering the underlying intent behind that motion, which increases the uncertainty in trajectory prediction. Autonomous vehicles operate in real-time environments, meaning that trajectory prediction algorithms must be able to process data and generate predictions in real-time. While many existing methods achieve high accuracy, they often struggle to effectively handle heterogeneous traffic scenarios. In this paper, we propose a Subjective Intent-based Low-latency framework for Multiple traffic participants joint trajectory prediction. Our method explicitly incorporates the subjective intent of traffic participants based on their key points, and predicts the future trajectories jointly without map, which ensures promising performance while significantly reducing the prediction latency. Additionally, we introduce a novel dataset designed specifically for trajectory prediction. Related code and dataset will be available soon.

Abstract:
The 3D trajectory of a shuttlecock required for a badminton rally robot for human-robot competition demands real-time performance with high accuracy. However, the fast flight speed of the shuttlecock, along with various visual effects and its tendency to blend with environmental elements such as court lines and lighting, presents challenges for rapid and accurate 2D detection. In this paper, we first propose the YO-CSA detection network, which optimizes and reconfigures the YOLOv8s model’s backbone, neck, and head by incorporating contextual and spatial attention mechanisms to enhance the model’s ability to extract and integrate both global and local features. Next, we integrate three major sub-tasks (detection, prediction, and compensation) into a real-time 3D shuttlecock trajectory detection system. Specifically, our system maps the 2D coordinate sequence extracted by YO-CSA into 3D space using stereo vision, then predicts the future 3D coordinates based on historical information, and re-projects them onto the left and right views to update the position constraints for 2D detection. Additionally, our system includes a compensation module to fill in missing intermediate frames, ensuring a more complete trajectory. We conduct extensive experiments on our own dataset to evaluate both YO-CSA’s performance and the system’s effectiveness. Experimental results show that YO-CSA achieves a high accuracy of 90.43% mAP@0.75, surpassing both YOLOv8s and YOLO11s. Our system maintains a speed of over 130 fps across 12 test sequences.
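
The 2D-to-3D mapping step can be illustrated with the textbook triangulation for a rectified pinhole stereo pair. This is a sketch under that assumption only; the abstract says no more than "stereo vision", and the function name, parameters, and example numbers below are hypothetical.

```python
def triangulate_rectified(xl, xr, y, focal_px, baseline_m, cx, cy):
    """Recover a 3D point (in the left camera frame) from a rectified stereo match.

    xl, xr: horizontal pixel coordinates in the left and right images.
    y: vertical pixel coordinate (identical in both views after rectification).
    focal_px: focal length in pixels; baseline_m: camera separation in meters.
    cx, cy: principal point in pixels.
    """
    disparity = xl - xr
    if disparity <= 0:
        raise ValueError("disparity must be positive for a point in front of the rig")
    z = focal_px * baseline_m / disparity   # depth from disparity
    x = (xl - cx) * z / focal_px            # back-project horizontally
    y3 = (y - cy) * z / focal_px            # back-project vertically
    return x, y3, z


point = triangulate_rectified(700.0, 600.0, 400.0,
                              focal_px=1000.0, baseline_m=0.5, cx=640.0, cy=360.0)
print(point)
```

Re-projection onto each view, as mentioned in the abstract, is the inverse of the last two lines: a predicted 3D point is mapped back to pixel coordinates to constrain where the detector searches next.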

Abstract:
To improve the generalization of the autonomous driving (AD) perception model, vehicles need to update the model over time based on the continuously collected data. As time progresses, the amount of data fitted by the AD model expands, which helps to improve the AD model generalization substantially. However, such ever-expanding data is a double-edged sword for the AD model. Specifically, as the fitted data volume grows to exceed the AD model’s fitting capacities, the AD model is prone to under-fitting. To address this issue, we propose to use pretrained Large Vision Models (LVMs) as the backbone, coupled with a downstream perception head, to understand AD semantic information. This design can not only surmount the aforementioned under-fitting problem due to LVMs’ powerful fitting capabilities, but also enhance the perception generalization thanks to LVMs’ vast and diverse training data. On the other hand, to mitigate vehicles’ computational burden of training the perception head while running the LVM backbone, we introduce a Posterior Optimization Trajectory (POT)-Guided optimization scheme (POTGui) to accelerate the convergence. Concretely, we propose a POT Generator (POTGen) to generate the posterior (future) optimization direction in advance to guide the current optimization iteration, through which the model can generally converge within 10 epochs. Extensive experiments demonstrate that the proposed method improves performance by over 66.48% and converges over 6 times faster compared to the existing state-of-the-art approaches.

Abstract:
This paper presents Virtual Teach and Repeat (VirT&R): an extension of the Teach and Repeat (T&R) framework that enables GPS-denied, zero-shot autonomous ground vehicle navigation in untraversed environments. VirT&R leverages aerial imagery captured for a target environment to train a Neural Radiance Field (NeRF) model so that dense point clouds and photo-textured meshes can be extracted. The NeRF mesh is used to create a high-fidelity simulation of the environment for piloting an unmanned ground vehicle (UGV) to virtually define a desired path. The mission can then be executed in the actual target environment by using NeRF-generated point cloud submaps associated along the path and an existing LiDAR Teach and Repeat (LT&R) framework. We benchmark the repeatability of VirT&R on over 12 km of autonomous driving data using physical markings that allow a sim-to-real lateral path-tracking error to be obtained and compared with LT&R. VirT&R achieved measured root mean squared errors (RMSE) of 19.5 cm and 18.4 cm in two different environments, which are slightly less than one tire width (24 cm) on the robot used for testing, and respective maximum errors were 39.4 cm and 47.6 cm. This was done using only the NeRF-derived teach map, demonstrating that VirT&R has similar closed-loop path-tracking performance to LT&R but does not require a human to manually teach the path to the UGV in the actual environment.

Abstract:
Repetitive gait training with lower-limb exoskeletons enhances neuroplasticity and reduces muscle atrophy by promoting patient engagement in active rehabilitation training. Importantly, the therapeutic efficacy of such engagement critically depends on providing patients with task difficulty levels matching their real-time walking capacities. To address this, a closed-loop Mobility-Matching Framework is proposed, integrating Hybrid Multi-attractor Dynamic Movement Primitives (Hm-DMP) with Policy Improvement with Path Integral (PI2) optimization, which achieves real-time trajectory adaptation. The Hm-DMP module preserves critical kinematic invariants of normative gait patterns during trajectory deformation through constrained multi-attractor modulation. Simultaneously, the PI2-driven optimizer iteratively adjusts joint trajectory keypoints of Hm-DMP by optimizing a hybrid cost function, enabling dynamic matching between training trajectories and patients’ real-time mobility. Experimental trials on the WEI-EXO platform demonstrate the proposed framework’s robustness in detecting and responding to real-time changes in patients’ ambulatory capacity by optimizing assistance trajectories while preserving the normative gait kinematics. This closed-loop adaptation process facilitates personalized gait rehabilitation with exoskeletons, enhancing training efficacy and maintaining comfort across patients with diverse mobility levels.

Abstract:
Soil is a vital resource for various industries, including agriculture, engineering, and manufacturing, where accurate in-situ classification is essential for a wide range of applications. Electrical Impedance Tomography (EIT) enables real-time soil classification by capturing complex impedance data across varying distances. This study presents a novel approach integrating EIT with actuating probes to dynamically generate rich datasets for distinguishing soil types and moisture levels. By utilizing eight moving electrodes multiplexed across 32 channels, this system overcomes the limitations of traditional laboratory-based methods, such as time constraints and data skew caused by non-homogeneous inclusions. The moving electrode design significantly outperforms the stationary setup by 21%, achieving an average classification accuracy of 93% across varying moisture levels of sand, clay, and silt combinations. Experimental results on a larger dataset demonstrate a classification accuracy of up to 79.7% across 25 different soil-moisture combinations, underscoring the technique’s potential for effective in-field soil analysis. The improved accuracy achieved through actuation, compared to stationary probes, suggests broader applications in precision agriculture, civil engineering, and environmental monitoring.

Abstract:
Fire-induced indoor environments, characterized by smoke, glare, and dimness, critically challenge rescue safety. While LiDAR and cameras suffer from signal attenuation, millimeter-wave (mmWave) radar exhibits robust imaging performance. Radar-based building mapping and object detection in indoor environments are thus required to facilitate situational awareness specified by firefighting standards. Prior radar datasets mostly focus on outdoor object detection, and the few existing indoor datasets remain insufficient in several aspects: (1) lacking adverse scenario analysis; (2) lacking raw analog-to-digital converter (ADC) data for dense point cloud generation; and (3) lacking 3D object annotations for building layout understanding. This work introduces the Indoor FireRescue Radar (IFR) dataset, a novel large-scale multimodal benchmark for indoor situational awareness. It includes 27K frames of 4D radar point clouds, co-calibrated with LiDAR, RGB camera, and IMU streams, alongside 3D object annotations across 10 buildings. This dataset also provides raw ADC data and sensor configuration metadata. We applied voxel-based and pillar-based object detectors to 4D radar-based indoor object detection. We also demonstrated the robustness of radar perception in fire-induced indoor environments through real smoke tests at a firefighter training facility. The dataset is available at: https://huggingface.co/datasets/yysd123/indoor_mmwave

Abstract:
Books, as enduring repositories of cultural heritage as well as knowledge, play a fundamental role in human development. Although advances in embodied AI and robotics have revolutionized automation in domains such as manufacturing and logistics, robotic book manipulation remains an underexplored frontier. Two primary bottlenecks impede progress: (1) scarcity of fine-grained annotated datasets for benchmarking robotic book manipulation, and (2) lack of unified perception-action frameworks capable of dynamically coupling multi-modal sensing and manipulation in real-world scenarios. To address these issues, we present THU-Book, the first open-access benchmark featuring 643 3D scene captures, encompassing 11,298 high-fidelity book instances with rich annotations to support tasks from book recognition and localization to grasping and repositioning. Building upon this foundation, we develop BookBot, a novel voice-interactive book manipulation pipeline to support cross-environmental, multilingual, and multi-categorical book manipulation. First, we utilize Large Language Models (LLMs) to parse and comprehend ambiguity in user instructions. We further propose an instance segmentation module combined with an OCR tool to link language to visual instances. Finally, we introduce a PCA-based manipulation policy to refine the robotic grasp pose, utilizing the principal components of the books’ geometry, improving the precision and efficiency of grasping. Experiments conducted on the THU-Book benchmark validate the effectiveness of our BookBot. The dataset is available at https://github.com/wanghq-public/BookBot.
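
To make the PCA idea concrete, the sketch below computes the dominant axis of a planar point set, such as the footprint of a segmented book, using the closed-form solution for a 2x2 covariance. It is an illustration only: the abstract does not specify how BookBot turns principal components into a grasp pose, and the function name and sample points are hypothetical.

```python
import math


def principal_axis_angle(points):
    """Angle (radians, in (-pi/2, pi/2]) of the first principal axis of 2D points."""
    n = len(points)
    if n == 0:
        raise ValueError("need at least one point")
    mx = sum(p[0] for p in points) / n
    my = sum(p[1] for p in points) / n
    # Entries of the 2x2 covariance matrix.
    sxx = sum((p[0] - mx) ** 2 for p in points) / n
    syy = sum((p[1] - my) ** 2 for p in points) / n
    sxy = sum((p[0] - mx) * (p[1] - my) for p in points) / n
    # Closed-form orientation of the eigenvector with the largest eigenvalue.
    return 0.5 * math.atan2(2.0 * sxy, sxx - syy)


# Points spread along the diagonal y = x give a 45-degree principal axis.
angle = principal_axis_angle([(0.0, 0.0), (1.0, 1.0), (2.0, 2.0)])
print(math.degrees(angle))
```

One plausible use, stated here as an assumption, is to align the gripper's closing direction perpendicular to this axis so the fingers close across the thinner dimension of the book.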

Abstract:
Real-time and accurate perception of dynamic objects is crucial for autonomous driving. To better capture the motion information of objects, some methods now employ 4D Doppler point clouds collected by frequency-modulated continuous-wave (FMCW) LiDAR to enhance the detection and tracking of moving objects. Compared to standard time-of-flight (ToF) LiDAR, FMCW LiDAR can provide the relative radial velocity of each point through the Doppler effect, offering a more detailed understanding of an object’s motion state. However, despite the proven efficacy of these methods, ablation studies reveal that the direct contribution of Doppler velocity to tracking is limited, with performance gains often resulting from improved object recognition and labeling accuracy. Revisiting the role of Doppler velocity, this study proposes DopplerTrack, a simple yet effective learning-free tracking method tailored for FMCW LiDAR. DopplerTrack harnesses Doppler velocity for efficient point cloud preprocessing and object detection with O(N) complexity. Furthermore, by exploring the potential motion directions of objects, it reconstructs the full velocity vector, enabling more direct and precise motion prediction. Extensive experiments on four datasets demonstrate that DopplerTrack outperforms existing learning-free and learning-based methods, achieving state-of-the-art tracking performance with strong generalization across diverse scenarios. Moreover, DopplerTrack runs efficiently at 120 Hz on a mobile CPU, making it highly practical for real-world deployment. The code and datasets have been released at https://github.com/12w2/DopplerTrack.
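
The geometric relationship behind recovering a full velocity from a radial measurement can be sketched as follows. A Doppler sensor observes only the component of velocity along the line of sight, so if a motion direction is hypothesized, the speed follows by dividing the radial speed by the cosine of the angle between the two. This is a 2D illustration under that assumption, not DopplerTrack's actual procedure; the function name, the sign convention (positive means moving away from the sensor), and the threshold are all hypothetical.

```python
import math


def full_velocity(point_xy, radial_speed, heading_rad, min_cos=1e-3):
    """Reconstruct a 2D velocity from a radial speed and a hypothesized heading.

    point_xy: point position relative to the sensor at the origin.
    radial_speed: velocity component along the line of sight (m/s).
    heading_rad: assumed direction of motion in the sensor frame.
    Returns (vx, vy), or None when the heading is nearly perpendicular to the
    line of sight and the speed is therefore unobservable.
    """
    px, py = point_xy
    rng = math.hypot(px, py)
    if rng == 0.0:
        raise ValueError("point coincides with the sensor origin")
    ux, uy = px / rng, py / rng                    # unit line-of-sight vector
    dx, dy = math.cos(heading_rad), math.sin(heading_rad)
    cos_angle = ux * dx + uy * dy                  # projection of heading on line of sight
    if abs(cos_angle) < min_cos:
        return None
    speed = radial_speed / cos_angle
    return speed * dx, speed * dy


print(full_velocity((10.0, 0.0), 5.0, math.pi / 3))
```

Projecting the returned vector back onto the line of sight reproduces the measured radial speed, which is a convenient sanity check for any hypothesized heading.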

Abstract:
On-orbit servicing (OOS) has become an essential aspect of modern space missions, encompassing satellite maintenance, orbital assembly, and debris removal. This paper presents a novel decentralized navigation algorithm designed for a swarm of servicing spacecraft to collaboratively reconstruct the shape of unknown tumbling space objects. The proposed method leverages a dynamic factor graph-based Collaborative Simultaneous Localization and Mapping (C-SLAM) framework, integrating observed and identified point cloud features across the swarm. To address the challenges associated with the target’s tumbling dynamics, the approach also incorporates a dynamic SLAM formulation, utilizing a noisy parametric model to propagate the dynamic map and construct the dynamic factor graph at the front-end. Kinematic factors are introduced to account for loop closures, enabling the swarm to recognize previously observed features as they rotate with the target, thereby enhancing mapping robustness. Additionally, the target kinematic model parameters themselves (such as its center of mass, linear velocity, and angular velocity) are estimated in real time from the reconstructed maps. The angular velocity, notably, is estimated using a singular value decomposition approach that determines the best-fit rotation between two consecutive sets of mapped points. These estimates are fed back into the kinematic factors and loop closure processes to refine map optimization iteratively. Simulation results demonstrate that incorporating kinematic factors, addressing loop closures, and facilitating inter-robot communication significantly enhance the swarm’s ability to track the evolving map without a central leader. This decentralized approach highlights the potential of equipping spacecraft swarms with advanced, robust, and scalable perception capabilities for the collaborative inspection and characterization of unknown tumbling space targets.
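
A common way to obtain a best-fit rotation between two corresponding point sets via SVD is the Kabsch procedure, sketched below with NumPy. The abstract does not give the paper's exact formulation, so this should be read as the standard textbook version; the helper that converts the rotation into an angular speed over a time step `dt` is an added assumption, and both function names are illustrative.

```python
import numpy as np


def best_fit_rotation(p, q):
    """Rotation R minimizing sum ||R p_i - q_i||^2 after centering both sets."""
    p = np.asarray(p, dtype=float)
    q = np.asarray(q, dtype=float)
    pc = p - p.mean(axis=0)          # remove translation
    qc = q - q.mean(axis=0)
    h = pc.T @ qc                    # 3x3 cross-covariance matrix
    u, _, vt = np.linalg.svd(h)
    # Flip the last axis if needed so the result is a proper rotation (det = +1).
    d = 1.0 if np.linalg.det(vt.T @ u.T) >= 0 else -1.0
    return vt.T @ np.diag([1.0, 1.0, d]) @ u.T


def angular_speed(r, dt):
    """Rotation angle of R divided by the time step (rad/s)."""
    cos_angle = np.clip((np.trace(r) - 1.0) / 2.0, -1.0, 1.0)
    return float(np.arccos(cos_angle)) / dt


# Synthetic check: rotate four points by 0.1 rad about z and shift them.
theta = 0.1
r_true = np.array([[np.cos(theta), -np.sin(theta), 0.0],
                   [np.sin(theta), np.cos(theta), 0.0],
                   [0.0, 0.0, 1.0]])
pts = np.array([[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0], [1.0, 1.0, 1.0]])
moved = pts @ r_true.T + np.array([0.5, -0.2, 1.0])
r_est = best_fit_rotation(pts, moved)
print(np.allclose(r_est, r_true), angular_speed(r_est, dt=0.5))
```

Centering both point sets first separates the translation from the rotation, which is why the estimate is unaffected by the target's linear motion between the two maps.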

Abstract:
The Fiber Bragg grating (FBG) three-axial force sensor provides force feedback for an endoscopic surgical robot, reducing operational difficulty and risks. However, the packaging method of the optical fiber sensor demonstrates limited adaptability to both high-temperature sterilization environments and the wet operative areas encountered during surgery. To address this, this paper presents a step-reduced FBG and an enclosed three-axial force sensor. The sensor adopts an integrated design with a maximum outer diameter of 4.5 mm and can be seamlessly integrated into the end of the flexible endoscopic surgical robot. In addition, a hydrofluoric acid corrosion process is introduced to obtain the twin reflection spectrum and realize the decoupling of three-axial forces and temperature. Static calibration demonstrates etched grating sensitivities of 227.78 pm/N (Fx), 242.63 pm/N (Fy), and 233.50 pm/N (Fz) via least-squares fitting. A force-temperature coupling experiment confirms that maximum full-scale force errors remain below 5% under temperature perturbation, verifying the reliability of the sensor. Finally, a high-temperature sterilization experiment at 180°C was conducted, demonstrating the designed sensor’s thermal stability under medical device sterilization protocols.

Abstract:
In this paper, we propose SAGENet, which utilizes only binaural echoes (i.e., for scenarios in which visual perception is seriously degraded) for scene depth estimation. Unlike previous methods that implicitly learn spatial features from echoes, which may cause shape and scale drift, SAGENet explicitly extracts spatial cues, effectively enhancing depth estimation accuracy. First, we leverage signal processing to generate coarse 2D geometric cues, which contain scene scale and shape information, as additional input for the 3D depth estimation network. This approach aids the network in better reconstructing depth information from the scene. Given the substantial noise in the 2D geometric cues, we design a geometric cue consistency denoising loss function to help the network accurately interpret the scale and shape information embedded in the features. Second, we initialize learnable queries with angular spectrum peaks and fuse them with audio features via self-attention to guide the network to focus on the dominant features of the first few echo reflections while effectively suppressing interference from reverberation. Finally, our experimental results on the Replica and real-world BatVision datasets show that the proposed method outperforms the existing binaural echo-based methods (including BatVision) by more than 5% and 10% in absolute relative error, respectively. To benefit the community, we open-source the code at https://github.com/zjuersdsd/SAGENet.git.

Abstract:
Loop closure detection and pose estimation play a significant role in correcting odometry trajectories and generating globally consistent point cloud maps. Geometric feature descriptor methods neglect object-level spatial topology features, resulting in inadequate performance in loop closure detection. Semantic graph-based loop closing methods improve upon this; however, they still follow the paradigm of "first generating descriptors, then comparing similarity, and finally achieving alignment (6D pose)". Specifically, they compare two semantic graphs that are not spatially aligned, which makes direct node correspondences impossible and necessitates extensive descriptor extraction and comparison. This decouples similarity comparison from 6D pose estimation, resulting in a cumbersome process that limits practicality and scalability. This paper proposes SLOOP, a novel descriptor-free semantic graph matching method that "aligns two graphs first, followed by efficient similarity comparison". Specifically, we first design a dedicated neighborhood semantic feature module to extract high-quality matched node pairs. Next, we seek the aligned coordinate systems for candidate loops based on the robust ground normal vectors and two suitable node pairs examined by the two-stage global geometric consistency metrics. Finally, the aligned coordinate systems enable efficient extraction and comparison of node spatial distributions. We conducted extensive outdoor loop detection experiments and compared SLOOP with various loop closure detection approaches, demonstrating its improved performance in loop closure detection and its practicality. The code and related materials are available at https://github.com/bit-tyj/sloop_c.

Abstract:
Tactile sensing is crucial for robots to achieve human-like manipulation capabilities and safe interaction with the environment. Existing piezoresistive tactile sensors often suffer from limited spatial resolution and poor conformability to contact objects due to their unstretchable contact surfaces. In this paper, we present a low-cost and easily fabricated piezoresistive tactile sensor array that utilizes flexible printed circuit (FPC) vias as electrodes to achieve high spatial resolution (64 cm⁻²), while incorporating stretchable materials for surface encapsulation to enable conformal contact with objects. Unified material selection for both the sensing and encapsulation layers ensures robust performance. We characterized the sensing performance, mechanical durability, and uniformity of the sensor array and further demonstrated its practical applications in contact shape reconstruction.

Abstract:
This paper proposes an energy-efficient collision avoidance system for an autonomous mobile manipulator (AMM) in dynamic environments. The system enables safe obstacle avoidance and energy-efficient path generation. A B-Spline-based path optimization algorithm is developed, incorporating an energy cost function for planning collision-free, energy-efficient paths. The Bi-RRT algorithm generates an initial path, segmenting it at curvature maxima for energy-efficient optimization. The local path planning creates a path with seven control points, applying the same optimization for safe, low-energy avoidance of dynamic obstacles. A Model Predictive Control (MPC) system ensures precise path following. Experiments with a TM5M-900 robot validate the method's effectiveness in avoiding obstacles. Compared to Bi-RRT and Hybrid-RRT systems, it improves motion smoothness by 29.35% and reduces energy consumption by 16.84%, providing a more efficient solution for an AMM in dynamic environments.

Abstract:
Hand gesture recognition (HGR) is crucial in developing advanced prosthetics, neurorobotics, and human-robot interaction (HRI). Surface electromyography (sEMG) and high-density sEMG (HD-sEMG) have gained attention for their ability to capture the muscle activity underlying hand gestures. Although many models achieve high performance within the same subjects, generalizing across different subjects remains a significant challenge, limiting the practical application of these systems in real-world settings. Furthermore, most conventional approaches primarily focus on the steady phase of gestures, which slows down real-time prediction. To address these issues, we propose a cross-subject dynamic hand gesture recognition (DHGR) framework based on the Vision Transformer (ViT) architecture, referred to as ViT-DHGR. Our model focuses explicitly on the signal transient phase before gesture stabilization to reduce gesture prediction latency and counteract system control delays. By incorporating subject embeddings and transfer learning strategies, the proposed ViT-DHGR framework for 34 dynamic hand gestures achieved an accuracy of 76.44% for 10 subjects using only 1 repetition of gesture data, which improved to 85.03% with 2 repetitions. In addition, our proposed framework achieves over 16% higher average accuracy across test subjects using 1 repetition of data compared to training subject-specific models from scratch. This work demonstrates the potential of HD-sEMG for capturing dynamic hand gestures and highlights the benefits of cross-user knowledge transfer in reducing data requirements and enhancing practicality for robotic applications.

Abstract:
This paper introduces BEVPointNet3D, an innovative 3D lane detection model that effectively integrates Bird’s Eye View (BEV) and point cloud features. The proposed approach addresses the inherent limitations of conventional methods that predominantly rely on the flat-ground assumption. BEVPointNet3D incorporates a 2D encoder to extract preliminary lane information from BEV images. For 3D feature extraction, the model employs a hierarchical local-to-global processing scheme to capture the geometric characteristics of LiDAR point clouds. A novel cross-attention mechanism is implemented to precisely align and integrate the 2D and 3D feature representations. This architectural design not only improves detection accuracy but also strengthens the adaptability and performance of the model in complex driving scenarios. Comprehensive evaluations on the K-Lane and CampusLane datasets demonstrate the superior performance of BEVPointNet3D. Notably, the model exhibits exceptional capability in accurately estimating lane spatial positions on steep inclines, thereby providing reliable support for autonomous driving systems in challenging terrain conditions.

Abstract:
Neural Radiance Fields (NeRF) and 3D Gaussian Splatting (3DGS) have emerged as powerful tools for 3D reconstruction and SLAM tasks. However, their performance depends heavily on accurate camera pose priors. Existing approaches attempt to address this issue by introducing external constraints but fall short of achieving satisfactory accuracy, particularly when camera trajectories are complex. In this paper, we propose a novel method, RA-NeRF, capable of predicting highly accurate camera poses even with complex camera trajectories. Following the incremental pipeline, RA-NeRF reconstructs the scene using NeRF with photometric consistency and incorporates flow-driven pose regulation to enhance robustness during initialization and localization. Additionally, RA-NeRF employs an implicit pose filter to capture the camera movement pattern and eliminate the noise for pose estimation. To validate our method, we conduct extensive experiments on the Tanks&Temple dataset for standard evaluation, as well as the NeRFBuster dataset, which presents challenging camera pose trajectories. On both datasets, RA-NeRF achieves state-of-the-art results in both camera pose estimation and visual quality, demonstrating its effectiveness and robustness in scene reconstruction under complex pose trajectories.

Abstract:
Plastic bags are challenging objects for robot manipulation due to their transparency and deformability. This paper proposes a learning approach for a robot to insert its hand into a container- or bag-shaped object based on visual and tactile sensing. The basic idea is to utilize a probing action that allows the robot to acquire rich information about the object even with a simple tactile sensor. The structure of the object is estimated by unsupervised learning with contact and reachability information. The result is transferred to visual recognition as self-supervised learning. Based on the unsupervised learning result, the robot can verify whether the hand truly reached the interior of the bag by additional probing actions. The proposed method was evaluated experimentally with a robot hand equipped with a simple tactile sensor.

Abstract:
Autonomous driving faces challenges in navigating complex real-world traffic, requiring safe handling of both common and critical scenarios. Reinforcement learning (RL), a prominent method in end-to-end driving, enables agents to learn through trial and error in simulation. However, RL training often relies on rule-based traffic scenarios, limiting generalization. Additionally, current scenario generation methods focus heavily on critical scenarios, neglecting a balance with routine driving behaviors. Curriculum learning, which progressively trains agents on increasingly complex tasks, is a promising approach to improving the robustness and coverage of RL driving policies. However, existing research mainly emphasizes manually designed curricula, focusing on scenery and actor placement rather than traffic behavior dynamics. This work introduces a novel student-teacher framework for automatic curriculum learning. The teacher, a graph-based multi-agent RL component, adaptively generates traffic behaviors across diverse difficulty levels. An adaptive mechanism adjusts task difficulty based on student performance, ensuring exposure to behaviors ranging from common to critical. The student, though exchangeable, is realized as a deep RL agent with partial observability, reflecting real-world perception constraints. Results demonstrate the teacher’s ability to generate diverse traffic behaviors. The student, trained with automatic curricula, outperformed agents trained on rule-based traffic, achieving higher rewards and exhibiting balanced, assertive driving.

Abstract:
In this paper, we introduce an event-triggered Maps of Dynamics (ETMoD) framework for modeling spatial motion patterns in non-stationary environments. Traditional approaches often rely on fixed grid resolutions and assume gradual temporal changes, limiting their effectiveness in real-world scenarios where motion patterns exhibit abrupt variations. To address these limitations, we propose a novel framework that employs a grid-shifting mechanism to generate context-aware cells based on historical observations. Temporal patterns are modeled using Neural Stochastic Differential Equations, while a diffusion model is integrated to handle abrupt changes in motion patterns through an event-triggered mechanism. Experimental results demonstrate that our framework outperforms state-of-the-art methods, particularly in capturing abrupt changes during peak activity periods, while significantly reducing training time.

Abstract:
Recent advances in 3D Gaussian Splatting have demonstrated impressive performance in novel view synthesis, particularly with dense image sets. However, its performance degrades significantly in sparse-view scenarios, primarily due to the challenge of obtaining accurate camera poses. Also, achieving scale-consistent and detailed depth maps is crucial, while existing depth estimation methods struggle to meet both requirements, further limiting view synthesis quality in sparse settings. To address these challenges, we propose DPR-Splat, an efficient neural reconstruction framework that builds 3D Gaussian models from sparse scenes. DPR-Splat refines the coarse outputs of MASt3R by leveraging dedicated pose and depth refinement modules, resulting in precise camera poses and depth maps. With the refined outputs, it progressively expands the 3D Gaussian set to construct an accurate scene model. Extensive experiments demonstrate that DPR-Splat enhances both novel view synthesis quality and pose estimation accuracy, and significantly accelerates training and rendering. Code and demonstration video are available at https://github.com/h0xg/DPR-Splat.

Abstract:
In computer-assisted orthopedic surgery (CAOS), accurate point set registration is essential for enhancing surgical accuracy. However, the sparse and low-overlap nature of intraoperative point sets presents significant challenges for reliable registration. To deal with these challenges, we propose a novel registration-after-completion framework, where the intraoperative point set is first completed, after which the two full point sets are registered. Our main contributions are as follows. First, we propose a two-stage strategy to progressively complete the sparse and partial intraoperative point set. Second, considering that 1) the intraoperative point set contains noise, 2) the point completion process is not perfect, and 3) the resolution of the preoperative image is limited, we adopt bidirectional hybrid mixture models (HMMs) to represent the point set pairs and formulate a probabilistic registration network. In the proposed correspondence network, a dual-path cross-attention mechanism is adopted for feature fusion and a clustering mechanism is leveraged for calculating point-to-mixture correspondences. Furthermore, the bidirectional registration mechanism is leveraged to compute the transformation based on the estimated correspondences. Third, we have extensively validated the proposed approach on various datasets and bone phantoms. Our experiments on 1399 human femur and 1301 hip models demonstrate that our method achieves state-of-the-art performance across overlap rates from 15% to 35% and at various point counts (i.e., 25, 50, and 100 points) under conditions with less than 50% overlap. Additionally, real phantom experiments on femur and hip models validate the method’s performance in simulated surgical scenarios. Experiments on ModelNet40 further confirm our method’s effectiveness and generalizability.

Abstract:
In warehouse logistics and post-disaster rescue, multi-UAV payload transport must navigate tight spaces, such as 1.2m × 0.8m aisles and collapsed pipelines as narrow as 0.6m. Traditional four-DOF (translation and scaling) trajectory planning struggles under such constraints. To overcome this, we propose an optimization-based framework that introduces rotational degrees of freedom, expanding the solution space to five dimensions. Using the MINCO transformation, we reformulate constrained formation adjustment into an unconstrained optimization problem via smooth mappings and penalty functions, enabling simultaneous obstacle avoidance and formation control. The GeoSafe algorithm further enhances safe passage by integrating iterative region expansion and semi-definite programming to maximize obstacle-free space. Extensive simulations and real-world experiments show our method’s superiority over sampling-based and IF-based approaches in narrow passage traversal, computational efficiency, and formation scalability.

Abstract:
In the field of autonomous driving, sensor simulation is essential for generating rare and diverse scenarios that are difficult to capture in real-world environments. Current solutions fall into two categories: 1) CG-based methods, such as CARLA, which lack diversity and struggle to scale to the vast array of rare cases required for robust perception training; and 2) learning-based approaches, such as NeuSim, which are limited to specific object categories (vehicles) and require extensive multi-sensor data, hindering their applicability to generic objects. To address these limitations, we propose SynthDrive, a scalable "real2sim2real" system that leverages 3D generation to automate asset mining, generation, and rare-case data synthesis. Our framework introduces two key innovations: 1) Automated Rare-Case Mining and Synthesis. Given a text prompt describing specific objects, SynthDrive automatically mines image data from the Internet and then generates corresponding high-fidelity 3D assets, which eliminates the need for costly manual data collection. By integrating these assets into existing street-view data, our pipeline produces photorealistic rare-case data, supporting rapid scaling to diverse assets including irregular obstacles and temporary traffic facilities. 2) High-Fidelity 3D Generation. We propose a hybrid asset generation pipeline that combines a geometry-aware LRM, iterative mesh optimization, and an improved texture fusion algorithm. Our approach achieves 0.0164 Chamfer Distance on the GSO dataset, outperforming InstantMesh by 14.1% in geometry accuracy, and achieves 19.05 PSNR (vs 16.84) for texture quality. This enables fine geometry details and high-resolution texture generation, which is essential for perception model training. Experiments demonstrate that SynthDrive-generated data improves the performance of downstream perception tasks (2D and 3D detection on rare objects) by 2-4% mAP. SynthDrive greatly lowers the data production cost and improves the diversity for corner-case data generation, showcasing extensive potential applications in the field of autonomous driving.

Abstract:
Trajectory forecasting in urban environments is a critical task that needs to be addressed for the safety of autonomous vehicles, particularly in urban road intersection scenarios, where agents exhibit diverse behaviors mainly due to complex interactions between agents and the environment and the diversity of paths available to the agents. Current state-of-the-art methods do not perform well in urban road intersection scenarios. To address this issue, the proposed framework, CandidateGraph-Net (CG-Net), improves trajectory prediction in urban road intersection scenarios by encoding the available candidate centerlines at the current location of the target agent. The proposed interaction encoder in CG-Net is inspired by human behavior. It is modeled using a bipartite graph attention network to predict the trajectory of the target agent. It estimates the trajectory in the same way as humans anticipate the trajectory of other vehicles and pedestrians in dynamic environments. The agent embeddings in the interaction encoder at each time step pay attention to nearby agents and surrounding scene elements simultaneously. This enables the model to learn how to prioritize interactions between nearby agents and the environment map. Further, CG-Net’s performance is evaluated using the Argoverse 2 motion forecasting dataset. The results demonstrate its effectiveness in urban road intersection scenarios, with an overall improvement in key metrics such as minFDE and minADE compared to baseline methods. These improvements highlight CG-Net’s ability to perform better motion forecasting in urban road intersection scenarios.

Abstract:
Miniature robots hold great promise for performing micromanipulation tasks within hard-to-reach confined spaces. However, effectively maneuvering across complex and unstructured terrain, achieving adaptive morphogenesis, and developing adaptive multimodal locomotion strategies remain challenges for these robotic systems. Here, we develop a multi-stimulus-responsive deformable miniature robot integrated with an adaptive multimodal motion control method. Sodium alginate hydrogel and graphene-coated magnetic elastomer are integrated into the sheet-shaped robot to enable responsiveness to temperature, humidity, and magnetic fields. A kinematic gait model is designed to control oscillatory motion in the semi-contracted state and rotational motion in the fully contracted state of the miniature robot. To automatically mitigate angular deviation between the robot's motion direction and the intended path, an adaptive orientation compensation control algorithm based on Support Vector Regression (SVR) is proposed. Experimental results demonstrate that the proposed robot exhibits capabilities for flexible and accurate navigation within unstructured environments (e.g., rock piles and stomach models), and is further shown to be capable of cargo transport. The proposed adaptive morphogenesis robots, enabled by dual-mode motion control, hold significant potential for targeted delivery and other micromanipulation applications in complex, unstructured, and confined environments.

Abstract:
With the increasing complexity of pipeline systems in various industrial and environmental applications, there is a critical need for flexible and efficient robotic solutions that can navigate and inspect confined spaces. This paper introduces a lightweight pipeline crawling robot based on spring-roll dielectric elastomer actuators (DEAs). Inspired by the adaptability of caterpillars, the robot combines anisotropic friction feet with a spring-roll DEA structure to achieve high-speed movement. It operates effectively in pipes with diameters ranging from 16 mm to 20 mm, reaching a maximum speed of 357 mm/s (5.95 BL/s) under a 3.5 kV driving voltage. The optimized design enhances actuator performance and friction distribution, significantly outperforming existing soft crawling robots. This innovation demonstrates great potential for high-speed, lightweight pipeline inspection applications and advances the field of soft robotics for diverse industrial tasks.

Affiliations: Division of Advanced Manufacturing, Shenzhen International Graduate School, Tsinghua University, Shenzhen, China; Department of Mechanical Engineering, National University of Singapore, Singapore; Department of Mechanical and Automation Engineering, The Chinese University of Hong Kong, Hong Kong, China; Department of Biomedical Engineering, The Chinese University of Hong Kong, Hong Kong, China; Shenzhen Key Laboratory of Intelligent Robotics and Flexible Manufacturing Systems, Institute for Robotics, Southern University of Science and Technology, Shenzhen, China; Department of Surgery, Faculty of Medicine, The Chinese University of Hong Kong, Prince of Wales Hospital, Hong Kong, China

Abstract:
Ensuring both mobility and stability during colonoscopy is crucial for enhancing procedural efficiency and reducing the risk of complications. Traditional colonoscopy robots face challenges due to the fixed diameter of the colonoscope and the variability in colon anatomy. To address this, we propose a soft automatic anchoring system (SAAS) to enhance mobility and stability in colonoscopy robots. The SAAS features proximal and distal soft balloon anchors, employing origami principles to achieve a 42.2% higher maximum expansion capability compared to conventional flat surface anchors. With real-time pressure feedback, the SAAS automatically anchors upon contact detection, ensuring precise anchoring without overinflation while adapting to colon diameter variations and robot posture changes. Experimental results show a tenfold reduction in displacement during the stability test, significantly enhancing the robot’s performance under external loads. Performance comparison tests in a phantom further demonstrated notable improvements in the efficiency of colonoscopy procedures using the SAAS. This system has the potential to greatly enhance the safety, precision, and overall efficiency of colonoscopy, offering substantial benefits for both medical practitioners and patient outcomes.

Abstract:
This paper presents the concept of Industry 6.0, which introduces the world’s first fully automated production system that autonomously handles the entire product design and manufacturing process based on user-provided natural language descriptions. By leveraging generative AI, the system automates critical aspects of production, including product blueprint design, component manufacturing, logistics, and assembly. A heterogeneous swarm of robots, each equipped with individual AI through integration with Large Language Models (LLMs), orchestrates the production process. The robotic system includes manipulator arms, delivery drones, and 3D printers capable of generating assembly blueprints. The system was evaluated using commercial and open source LLMs, operating via APIs and local deployment. A user study demonstrated that the system reduced the average production time to 119.10 minutes, significantly outperforming a team of expert human developers, who averaged 528.64 minutes (an improvement factor of 4.4). Furthermore, in the product blueprinting stage, the system outperformed human CAD operators by an unprecedented factor of 47, completing the task in 0.5 minutes compared to 23.5 minutes. This breakthrough represents a major leap towards fully autonomous manufacturing.
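
The improvement factors quoted in the abstract above follow directly from the reported times. A minimal check, with the helper name `improvement_factor` chosen here purely for illustration:

```python
def improvement_factor(baseline_minutes, system_minutes):
    """Ratio of baseline time to system time (higher means the system is faster)."""
    return baseline_minutes / system_minutes

# Times as reported in the abstract, in minutes.
production_factor = improvement_factor(528.64, 119.10)  # full production run
blueprint_factor = improvement_factor(23.5, 0.5)        # blueprinting stage only

print(round(production_factor, 1))
print(round(blueprint_factor, 1))
```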

Abstract:
Efficient coverage of unknown environments requires robots to adapt their paths in real time based on on-board sensor data. In this paper, we introduce CAP, a connectivity-aware hierarchical coverage path planning algorithm for efficient coverage of unknown environments. During online operation, CAP incrementally constructs a coverage guidance graph to capture essential information about the environment. Based on the updated graph, the hierarchical planner determines an efficient path to maximize global coverage efficiency and minimize local coverage time. The performance of CAP is evaluated and compared with five baseline algorithms through high-fidelity simulations as well as robot experiments. Our results show that CAP yields significant improvements in coverage time, path length, and path overlap ratio.

Affiliations: Department of Computer Science, Cognitive Robotics Laboratory, University of Manchester, United Kingdom; Autonomous Systems and Robotics Lab/UIS, ENSTA, Institut Polytechnique de Paris, Palaiseau, France; Computer Science Department, UCL Interaction Centre, University College London, London, United Kingdom; Department of Computer Science, HBUG Lab, University of Exeter, Exeter, United Kingdom; Human-Robot Interfaces and Interaction (HRI) Lab, Istituto Italiano di Tecnologia, Genoa, Italy

Abstract:
Nonverbal communication plays a crucial role in both human-human and human-robot interactions (HRIs), where facial expressions convey emotions, intentions and trust. Enabling humanoid robots to generate human-like facial reactions in response to human speech and facial behaviours remains a significant challenge. In this work, we leverage human-human interaction (HHI) datasets to train a humanoid robot, allowing it to learn and imitate facial reactions to both speech and facial expression inputs. Specifically, we extend a sequence-to-sequence (Seq2Seq)-based framework that enables robots to simulate human-like virtual facial expressions that are appropriate for responding to the perceived human user behaviours. Then, we propose a deep neural network-based motor mapping model to translate these expressions into physical robot movements. Experiments demonstrate that our facial reaction–motor mapping framework successfully enables robotic self-reactions to various human behaviours, where our model can best predict 50 frames (two seconds) of facial reactions in response to the input user behaviour of the same duration, aligning with human cognitive and neuromuscular processes. Our code is provided at https://github.com/mrsgzg/Robot_Face_Reaction.

Abstract:
Large flapping-wing aerial vehicles (FWAVs) face dual challenges in aerodynamic and structural design, with long-standing technical bottlenecks, particularly in roll maneuvers. In this study, by reverse-engineering the biomechanical mechanisms of raptor flight, we propose a bio-inspired wing-shoulder torsional mechanism and successfully developed an eagle-inspired flapping-wing aerial vehicle with a wingspan of 1.87m and a takeoff weight of 1,260g. A nonlinear explicit dynamics-lattice Boltzmann fluid-structure interaction (FSI) numerical model was innovatively established, comprehensively revealing the interaction mechanism between unsteady flapping flow fields and flexible wing deformations. Numerical simulations demonstrate that at a cruising speed of 8 m/s, the proposed mechanism generates a high-purity roll torque of 3.3 N·m (with a residual yaw torque of 0.2 N·m, torque purity ratio 16.5:1), while lift and thrust losses are below 1.5%. Flight experiments validate the exceptional performance of this mechanism in 3D maneuvers: a 360° barrel roll is completed in 2.6 seconds (average roll rate 136°/s). This study provides a theoretical framework and technological prototype for next-generation bio-inspired aerial vehicles that integrate efficient cruising with high maneuverability, marking the first instance where FWAVs surpass traditional aircraft in specific 3D maneuverability metrics.

Abstract:
Imitation learning has emerged as a powerful paradigm for robot skill learning. However, traditional data collection systems for dexterous manipulation face challenges, including a lack of balance between acquisition efficiency, consistency, and accuracy. To address these issues, we introduce Exo-ViHa, an innovative 3D-printed exoskeleton system that enables users to collect data from a first-person perspective while providing real-time haptic feedback. This system combines a 3D-printed modular structure with a SLAM camera, a motion capture glove, and a wrist-mounted camera. Various dexterous hands can be installed at the end, enabling it to simultaneously collect the posture of the end effector, hand movements, and visual data. By leveraging the first-person perspective and direct interaction, the exoskeleton enhances task realism and haptic feedback, improving the consistency between demonstrations and actual robot deployments. In addition, it has cross-platform compatibility with various robotic arms and dexterous hands. Experiments show that the system can significantly improve the success rate and efficiency of data collection for dexterous manipulation tasks. Webpage: https://exo-viha2025.github.io/.

Abstract:
With the accelerated development of Industry 4.0, intelligent manufacturing systems increasingly require efficient task allocation and scheduling in multi-robot systems. However, existing methods rely on domain expertise and face challenges in adapting to dynamic production constraints. Additionally, enterprises have high privacy requirements for production scheduling data, which prevents the use of cloud-based large language models (LLMs) for solution development. To address these challenges, there is an urgent need for an automated modeling solution that meets data privacy requirements. This study proposes a knowledge-augmented mixed integer linear programming (MILP) automated formulation framework, integrating local LLMs with domain-specific knowledge bases to generate executable code from natural language descriptions automatically. The framework employs a knowledge-guided DeepSeek-R1-Distill-Qwen-32B model to extract complex spatiotemporal constraints (82% average accuracy) and leverages a supervised fine-tuned Qwen2.5-Coder-7B-Instruct model for efficient MILP code generation (90% average accuracy). Experimental results demonstrate that the framework successfully achieves automatic modeling in the aircraft skin manufacturing case while ensuring data privacy and computational efficiency. This research provides a low-barrier and highly reliable technical path for modeling in complex industrial scenarios.

Abstract:
As one of the effective closed-loop control methods, visual servoing control is widely applied to continuum robots. However, existing visual servoing control methods mostly focus on accurate control of the robot’s end-effector, with less consideration given to the robot’s shape. In this work, a spatial position-based visual servoing obstacle-avoidable shape control framework for an 11-degree-of-freedom (DOF) hybrid continuum robot is proposed. In the control framework, a set of markers representing the shape of the continuum robot is measured and two spatial arcs are used to fit the shape. When controlling the redundant DOFs of the robot, position-based visual servoing shape control combined with obstacle avoidance is formulated as a quadratic programming problem, yielding the optimal solution at each sample time for the joint velocity vector of the 11-DOF hybrid continuum robot. Several experiments are conducted to validate the proposed control framework, and they indicate that the shape control accuracy reaches 0.88 mm.

Abstract:
Self-localization on a 3D map using an inexpensive monocular camera is required to realize autonomous driving. Camera-based self-localization often uses a convolutional neural network (CNN), which extracts local features calculated from nearby pixels. However, when dynamic obstacles, such as people, are present, CNNs do not work well. This study proposes a new method combining a CNN with a Vision Transformer, which excels at extracting global features that capture the relationships between patches across the whole image. Experimental results showed that, compared to the state-of-the-art method (SOTA), the accuracy improvement rate in a CG dataset with dynamic obstacles is 1.5 times higher than that without dynamic obstacles. Moreover, the self-localization error of our method is 20.1% smaller than that of SOTA on public datasets. Additionally, our robot using our method can localize itself with 7.51 cm error on average, which is more accurate than SOTA.

Abstract:
Gas distribution mapping (GDM) refers to the task of mapping the gas concentrations of an airborne chemical over a region of interest. A mobile robot equipped with a gas sensor can be used, potentially autonomously, to build such a distribution map. However, modern-day robots might not have enough battery power to cover the entire area of interest. Therefore, a group of n such collaborative mobile robots can be used for this purpose. The goal of the robots is to sample concentrations from a fraction of locations and infer the gas intensities in the rest of the area using a supervised machine learning technique, namely the Gaussian Process (GP). To this end, we propose a novel multi-robot gas distribution mapping framework, named GDM-Net++, which works in both 2D and 3D settings. Our proposed framework first divides the environment into n unique regions using Voronoi partitioning. Next, we employ a multi-agent deep Q-learning framework for the robots to learn a joint policy. As GP is a compute-intensive process, during testing, the learned policy is applied without re-training the GP model. The experiments are performed in simulation using Python on six types of Gaussian plumes to validate our proposed technique. Compared to two baselines, greedy and random walk, GDM-Net++ performs 278% and 852% better in terms of earned rewards, while outperforming them by 34% and 155%, respectively, in terms of the precision of gas distribution modeling across unseen 2D test cases. Our approach can also gracefully handle 2D GDM scenarios where the distribution is consistently affected by wind.
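
The Voronoi partitioning step mentioned above can be illustrated on a discrete grid, where each cell is assigned to its nearest robot. This is a minimal sketch under stated assumptions, not the GDM-Net++ code: the grid representation, the function name `voronoi_partition`, and the tie-breaking rule (lowest robot index wins) are all hypothetical.

```python
def voronoi_partition(width, height, robots):
    """Assign every grid cell to its nearest robot.

    robots: list of (x, y) robot positions.
    Returns a dict mapping robot index -> list of (x, y) cells it owns.
    Ties are broken in favour of the lowest robot index (an assumption).
    """
    regions = {i: [] for i in range(len(robots))}
    for x in range(width):
        for y in range(height):
            # Squared Euclidean distance is enough for a nearest-robot test.
            owner = min(
                range(len(robots)),
                key=lambda i: (x - robots[i][0]) ** 2 + (y - robots[i][1]) ** 2,
            )
            regions[owner].append((x, y))
    return regions


regions = voronoi_partition(4, 1, [(0, 0), (3, 0)])
print(regions)
```

Each cell lands in exactly one region, so the n regions cover the environment without overlap, which is the property the partitioning step relies on.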

Abstract:
Along-edge driving, where an autonomous vehicle follows road edges, is increasingly common in urban environments and particularly challenging on curvy roads due to rapidly changing curvature. This paper presents a hierarchical trajectory planning framework that integrates Cartesian and Frenet frames to optimize along-edge motion. Cartesian planners struggle with nonlinear constraints, while Frenet-based approaches simplify edge-relative motion but often neglect trajectory curvature and suffer from non-convexity in obstacle avoidance. To address these limitations, our method employs an optimization-based planner with curvature constraints for precise along-edge motion and a sampling-based planner for stable lane changes when encountering obstacles. This novel approach maintains an along-edge distance within a precision of 0.1m, reducing error by 80% (from 0.7m to 0.1m). It also ensures smooth trajectory transitions and enhances stability in complex environments. Simulations and real-world experiments validate the framework’s efficiency, achieving an average planning time of 1.22ms per frame while effectively balancing accuracy, feasibility, and real-time performance.

Abstract:
Advanced machine learning algorithms require platforms that are extremely robust and equipped with rich sensory feedback to handle extensive trial-and-error learning without relying on overwhelming inductive biases. Traditional robotic designs, while well-suited for their specific use cases, are often fragile when used with these algorithms as they fail to address the intermediate sub-optimal posterior-based behavior these algorithms exhibit. To address this gap—and inspired by the vision of enabling curiosity-driven baby robots—we present a novel robotic limb designed from scratch. Our design features a semi-soft structure, a high degree of redundancy achieved through rich non-contact sensors (exclusively cameras), and strategically designed, easily replaceable failure points. Proof-of-concept experiments using two contemporary reinforcement learning algorithms on a physical prototype demonstrate that our design is able to succeed in a target-finding task even under simulated sensor failures, all with minimal human oversight during extended learning periods. Additional experiments on the robustness of the design show that it is able to withstand relatively large amounts of mechanical stress. We believe this design represents a concrete step toward more tailored robotic designs capable of supporting general-purpose, generally intelligent robots.

Abstract:
Articulated soft robots (ASRs) driven by variable stiffness actuators (VSAs) are challenging to control well due to their highly nonlinear dynamics and difficulties in accurate modeling. The paper proposes a locally weighted learning (LWL)-based robust composite learning control (RCLC) solution for ASRs with agonistic-antagonistic (AA)-VSAs to enable the favorable tracking of both joint position and stiffness without exact robot models. In our solution, two LWL models are adopted online to estimate uncertainties in the link-side and stiffness dynamics, respectively, a nonlinear disturbance observer (DOB) is applied to improve tracking robustness at the link side, and a composite learning law is developed to achieve parameter convergence under a condition of interval excitation strictly weaker than persistent excitation so as to improve online modeling speed and accuracy. A distinctive feature of the proposed LWL-RCLC framework lies in the fact that the estimation of the DOB and the learning of LWL are independent yet work in a synergistic manner, which enables exact robot modeling online while improving tracking robustness. Experiments on a multi-DoF ASR with AA-VSAs have verified the superiority of the proposed method.

Abstract:
The simple and modular structure of cable-driven parallel robots (CDPRs) can enable effective real-time reconfiguration. In this paper, an online reconfiguration strategy is proposed for a 3-DOF point-mass CDPR to adjust the cable anchor positions and enhance its performance in physical human-robot interaction (pHRI). The reconfiguration problem, inclusive of all relevant constraints such as the wrench feasible condition (WFC) and the structural constraint on the cable anchors, is formulated as a non-convex optimization problem to determine the optimal positions of cable anchors. However, this original formulation poses a serious challenge to real-time determination, primarily due to the non-convex constraint imposed by the WFC and the non-convex objective function. To address this issue, the characteristics of the CDPR are considered, and a linear approximation method is employed to simplify the original optimization problem into a linear one, allowing it to be efficiently solved by the dual simplex method. Additionally, an artificial potential field (APF) is designed, considering both the inherent workspace properties and the interaction force, to adjust the solution of the linear optimization problem, which ensures that the optimal solution remains within a safe distance from the boundary of the solution space. Simulations validate the effectiveness of the strategy in improving the interaction metric while satisfying constraints.

Abstract:
Increasing evidence highlights the role of proprioceptive deficits in falls, emphasizing the need for targeted rehabilitation in populations with functional movement disorders. Despite advances in rehabilitation robots, movement constraints still hinder active engagement of the lower limb muscles, thereby limiting the effectiveness of proprioceptive training. In this work, we developed a neuro-rehabilitation robotic platform to address this need by physically and virtually simulating multi-terrain scenarios. The robot introduces common perturbations, such as uneven mountain trails, sandy beaches, and bumpy bus rides, to assess user stability and recovery, thereby assisting in the design of individualized training programs. The platform enhances neuromuscular responses across multiple directions and facilitates targeted muscle contraction through motor tasks that combine proprioceptive and visual feedback. Preliminary studies demonstrated that the robot successfully facilitated a complete range of ankle rotational movements. Electromyographic analysis revealed increased activation of specific muscle groups and changes in muscle loading and contraction patterns, suggesting that the system recruits multiple muscle groups while enhancing proprioceptive input to periarticular soft tissues. The proposed robot and control strategies establish a feasible solution to enhance proprioception rehabilitation.

Abstract:
Hybrid aerial-ground robots combine ground mobility and aerial flight capability and are often designed for executing multi-terrain tasks. However, most existing hybrid aerial-ground robots integrate the ground and aerial functionality into one single system. This leads to functional coupling that prevents the full utilization of multimodal locomotion capabilities. In this paper, we design a hybrid aerial-ground robot called SeparaTrek. SeparaTrek features ground and aerial locomotion parts that can be freely separated and combined, reducing the coupling between ground and aerial functionality. Furthermore, we design a multimodal locomotion controller based on an extended Kalman filter and adaptive sliding mode control, achieving stable locomotion of SeparaTrek on complex and variable terrain. Experiments demonstrate that SeparaTrek’s design is rational and its motion is stable.

Abstract:
Robotic gantry stages are a prevalent class of industrial robots used for precise positioning tasks in various fields, including semiconductor manufacturing, 3D printing, and automated assembly. However, these systems often exhibit time-varying dynamics because the position of the end-effector (i.e., the payload) shifts the mass/inertia properties. Such dynamic variations are not captured by conventional Linear Time-Invariant (LTI) models, leading to modeling inaccuracies and degraded control performance. Linear Parameter-Varying (LPV) system identification is a more suitable alternative, but existing approaches typically employ a single, fixed basis-function order for all parameters, resulting in excessive model complexity and poor efficiency. This paper presents a novel global LPV system identification method for multi-axis robotic gantry systems, enabling independent basis-function order selection for each parameter. By eliminating unnecessary high-order terms, the method reduces computational overhead and enhances modeling accuracy. Experimental validation on an industrial gantry testbed confirms superior precision and robustness compared to conventional LPV approaches with uniform polynomial orders.

Abstract:
Navigating unknown underwater environments is a significant challenge for autonomous underwater vehicles (AUVs), especially those with torpedo-like shapes. Lacking a prior map, these vehicles rely on real-time sensor data for perception. Although online motion planning addresses this challenge, many existing methods are primarily tested on more maneuverable robots, such as multicopters and ground vehicles, and do not account for the unique kinematics of torpedo-shaped AUVs, such as limited lateral movement, or the need for 3D motion planning. In this paper, we propose an online motion planning system specifically designed for torpedo-shaped AUVs to navigate 3D underwater terrain without prior environmental knowledge. The system employs a receding horizon planning framework to ensure safe navigation by replanning the trajectory when collisions are detected or the planning horizon is reached. For trajectory generation, a search-based method is used and utilizes a 3D Dubins curve heuristic to guide the generation of an optimal 3D trajectory that adheres to the AUV’s kinematic constraints. To further enhance safety and smoothness, gradient-based optimization is applied to refine the trajectory. Experiments in simulated environments validate the proposed method, demonstrating its ability to generate safe trajectories for AUVs in complex and unknown environments. We release our code as an open-source package.

Abstract:
This paper presents the design, development, and validation of a fully autonomous dual-arm aerial robot capable of mapping, localizing, planning, and grasping parcels in an intra-logistics scenario. The aerial robot is intended to operate in a scenario comprising several supply points, delivery points, parcels with tags, and obstacles, generating the mission plan from the voice commands given by the user. The paper derives a transferability model of the scenario, the robot, and the task, so that the proposed system design can be generalized to different scenarios (environment transfer) and platforms (embodiment transfer). The proposed transferable system architecture allows the integration of software modules managed by the Aerial Delivery Robot Operations Manager (ADROM) through the Module Interface Instances (MII) that handle the requests and the signals involved during the execution of the operation. The performance of the developed system was evaluated as part of the euROBIN Nancy Competition, conducting more than 50 flight tests. The software modules are open source, and the flight dataset is also publicly available.

Abstract:
Imitation learning (IL) aims to enable robots to perform tasks autonomously by observing a few human demonstrations. Recently, a variant of IL, called In-Context IL, utilized off-the-shelf large language models (LLMs) as instant policies that understand the context from a few given demonstrations to perform a new task, rather than explicitly updating network models with large-scale demonstrations. However, its reliability in the robotics domain is undermined by hallucination: the LLM-based instant policy occasionally generates poor trajectories that deviate from the given demonstrations. To alleviate this problem, we propose a new robust in-context imitation learning algorithm called the robust instant policy (RIP), which utilizes a Student’s t-regression model to be robust against the hallucinated trajectories of instant policies and to allow reliable trajectory generation. Specifically, RIP generates several candidate robot trajectories to complete a given task from an LLM and aggregates them using the Student’s t-distribution, which is beneficial for ignoring outliers (i.e., hallucinations), thereby generating a trajectory that is robust against hallucinations. Our experiments, conducted in both simulated and real-world environments, show that RIP significantly outperforms state-of-the-art IL methods, with at least 26% improvement in task success rates, particularly in low-data scenarios for everyday tasks. Video results are available at https://sites.google.com/view/robustinstantpolicy
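The outlier-ignoring effect of a Student's t model can be sketched with a simplified stand-in: a per-time-step location estimate fitted by EM-style reweighting, rather than the full t-regression model the abstract describes. The function names, the fixed degrees of freedom `nu`, and the candidate trajectories are assumptions for illustration only.

```python
from statistics import median


def t_location(values, nu=3.0, iters=50):
    """Estimate the location of a Student's t distribution with fixed nu by EM reweighting."""
    mu = median(values)
    sigma2 = max(sum((v - mu) ** 2 for v in values) / len(values), 1e-12)
    for _ in range(iters):
        # Samples far from the current estimate receive small weights.
        w = [(nu + 1.0) / (nu + (v - mu) ** 2 / sigma2) for v in values]
        mu = sum(wi * v for wi, v in zip(w, values)) / sum(w)
        sigma2 = max(sum(wi * (v - mu) ** 2 for wi, v in zip(w, values)) / len(values), 1e-12)
    return mu


def aggregate(trajectories, nu=3.0):
    """Aggregate candidate trajectories time step by time step."""
    return [t_location(list(step), nu) for step in zip(*trajectories)]


# Hypothetical 1-D candidates: four consistent trajectories and one hallucinated outlier.
candidates = [
    [0.0, 1.0, 2.0],
    [0.1, 1.1, 2.1],
    [-0.1, 0.9, 1.9],
    [0.0, 1.0, 2.0],
    [5.0, -4.0, 9.0],
]
robust = aggregate(candidates)
naive = [sum(step) / len(step) for step in zip(*candidates)]
```

In this toy case the plain mean is pulled toward the outlier at every step, while the reweighted estimate stays close to the four consistent candidates.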

Abstract:
Data-Driven Inverse Kinematics (DDIK) solvers have emerged as promising Inverse Kinematics (IK) methods for reliably approximating the IK of robotic manipulators. However, these solvers remain heavily robot-dependent: for each robot of interest, a network needs to be trained in a one-solver-one-robot framework. In this paper, we build on our previous work on Learning-By-Example (LBE) for DDIK and introduce a one-solver-many-robots framework, where a single neural network is used to predict the IK of multiple robots, mainly with 6 and 7 Degrees of Freedom (DoF). In our LBE approach, the neural network input includes an example joint-pose tuple (e.g., any previous joint configuration and corresponding pose in the path) along with the queried pose, and the same network outputs the desired robot joint configuration. Here, we investigate five network architectures: a Plain Multilayer Perceptron (MLP), a Residual-Based MLP (RMLP), a Densely Connected MLP (DMLP), and two transformers inspired by the Generative Pre-Trained Transformer (GPT), and test them using 3 diverse datasets with 20 real-world robotic arms with 6 and 7 DoF. Our experimental results demonstrate that a single lightweight, LBE-based DDIK solver can reliably predict the IK for multiple and hitherto unseen robots, within each of the 6 or 7 DoF families as well as across both families, with position errors below 1 mm and orientation errors below 1 deg. Additionally, we compare all proposed LBE-DDIK solvers with three established numerical IK solvers, Selectively Damped Least-Squares (SD), Singular Value Filtering (SVF), and Mixed Inverse (MX), and observe that our LBE-DDIK solvers achieve comparable accuracy, with the advantage of being a one-solver-many-robots framework.

Abstract:
In recent years, local magnetic field actuation technology has garnered significant attention in the field of collaborative control of multiple micro-robots. By optimizing the design of coil array structures, gradient magnetic fields can be generated in target areas, enabling independent control of multi-robot systems. However, existing research is largely limited to step-by-step actuation modes, which often cause noticeable jitter during robot movement. Moreover, they make it difficult to achieve omnidirectional and continuous motion, severely limiting both motion smoothness and positioning accuracy. To address these issues, this study proposes a coil array actuation platform and introduces a differential current actuation strategy, effectively achieving smooth motion control of multi-robot systems. The research first analyzes the spatial magnetic field distribution characteristics through modeling; then, based on the magnetic force model, a differential current actuation strategy for multiple robots is proposed; finally, an experimental platform is constructed and a series of experiments are conducted. The experimental results show that this actuation platform can achieve independent and smooth control of multiple micro-robots, demonstrating promising potential in applications such as automated microscopic manipulation.

Abstract:
Robots that use gestures in conjunction with speech can achieve more effective and natural communication with human teammates; however, not all robots have capable and dexterous arms. Augmented Reality (AR) technology has effectively enabled deictic gestures for morphologically limited robots in prior work, but the design space of AR-facilitated iconic gestures remains under-explored. Moreover, existing work largely focuses on closed-world contexts, where all referents are known a priori. In this work, we present a human-subject study situated in an open-world context and compare the task performance and subjective perception associated with three different iconic gesture designs (anthropomorphic, non-anthropomorphic, deictic-iconic) against a previously studied abstract gesture design. Our quantitative and qualitative results demonstrate that deictic-iconic gestures (in which a robot hand is shown pointing to a visualization of a target referent) outperform all other gestures on all metrics, but that non-anthropomorphic iconic gestures (where a visualization of a target referent appears on its own) are overall most preferred by users. These results represent a significant step toward enabling effective human-robot interactions in realistic large-scale open-world environments.

Abstract:
Achieving flexible human-following in real-world environments remains a critical yet challenging problem in Human-Robot Interaction (HRI). Traditional approaches typically constrain robots to fixed tracking positions—such as following from behind, ahead, or alongside—thereby limiting their adaptability in dynamic and unstructured environments. This study introduces a reinforcement learning-based framework that allows the robot to dynamically adjust its tracking positions in response to workspace constraints. An interaction space is defined to capture the relationship between the human and the robot while considering the environment. This space serves as the basis for state spaces in Deep Reinforcement Learning (DRL), helping the robot adapt to environmental changes. The selected tracking position is then utilized as input for a human-following controller, ensuring smooth and continuous motion. Experimental evaluations in both indoor and outdoor environments demonstrate that the proposed approach enables robots to follow humans flexibly and adaptively, adjusting their positions autonomously and avoiding obstacles without requiring a predefined tracking position.

Abstract:
We propose a globally consistent semantic SLAM system (GCSLAM) and a semantic-fusion localization subsystem (SF-Loc), which achieves accurate semantic mapping and robust localization in complex parking lots. Visual cameras (front-view and surround-view), IMU, and wheel encoder form the input sensor configuration of our system. The first part of our work is GCSLAM. GCSLAM introduces a semantic-constrained factor graph for the optimization of poses and semantic map, which incorporates innovative error terms based on multi-sensor data and BEV (bird’s-eye view) semantic information. Additionally, GCSLAM integrates a Global Slot Management module that stores and manages parking slot observations. SF-Loc is the second part of our work, which leverages the semantic map built by GCSLAM to conduct map-based localization. SF-Loc integrates registration results and odometry poses with a novel factor graph. Our system demonstrates superior performance over existing SLAM on two real-world datasets, showing excellent capabilities in robust global localization and precise semantic mapping.

Abstract:
Real-to-sim-to-real systems have been studied to overcome the challenges of robot policy learning in the real world by creating a virtual environment that mimics the actual workspace. However, previous studies have limitations, requiring human assistance, such as observing the workspace with a hand-held camera or manipulating objects with a hand. To solve these limitations, we propose a novel real-to-sim-to-real framework, ARIC, that performs without human help. First, ARIC observes real objects by repeatedly changing the object poses through the pre-trained robot policy via reinforcement learning. Through iterative interactions between the robot and the environment, ARIC gradually improves the accuracy of 3D object reconstruction. Next, ARIC learns task-specific robot policies in simulation using replicated objects and applies the policies to real-world scenarios without fine-tuning. We confirm that ARIC efficiently learns robotic tasks by achieving a success rate of 83.3% on average for three real-world tasks.

Abstract:
As lunar exploration missions grow increasingly complex, ensuring safe and autonomous rover-based surface exploration has become one of the key challenges in lunar exploration tasks. In this work, we have developed a lunar surface simulation system called the Lunar Exploration Simulator System (LESS) and the LunarSeg dataset, which provides RGB-D data for lunar obstacle segmentation that includes both positive and negative obstacles. Additionally, we propose a novel two-stage segmentation network called LuSeg. Through contrastive learning, it enforces semantic consistency between the RGB encoder from Stage I and the depth encoder from Stage II. Experimental results on our proposed LunarSeg dataset and additional public real-world NPO road obstacle dataset demonstrate that LuSeg achieves state-of-the-art segmentation performance for both positive and negative obstacles while maintaining a high inference speed of approximately 57 Hz. We have released the implementation of our LESS system, LunarSeg dataset, and the code of LuSeg at: https://github.com/nubot-nudt/LuSeg.

Affiliations: Department of Mechanical Engineering, MECO Research Team, KU Leuven, Belgium, and Flanders Make@KU Leuven, Belgium; Computer Science and Artificial Intelligence Lab (CSAIL), Massachusetts Institute of Technology, Cambridge, MA, USA; Department of Mechanical Engineering, Marine Robotics Lab, College of Engineering, University of Wisconsin-Madison, Madison, WI, USA; SENSEable City Laboratory, Massachusetts Institute of Technology, Cambridge, MA, USA

Abstract:
Safe motion planning is essential for autonomous vessel operations, especially in challenging spaces such as narrow inland waterways. However, conventional motion planning approaches are often computationally intensive or overly conservative. This paper proposes a safe motion planning strategy combining Model Predictive Control (MPC) and Control Barrier Functions (CBFs). We introduce a time-varying inflated ellipse obstacle representation, where the inflation radius is adjusted depending on the relative position and attitude between the vessel and the obstacle. The proposed adaptive inflation reduces the conservativeness of the controller compared to traditional fixed-ellipsoid obstacle formulations. The MPC solution provides an approximate motion plan, and high-order CBFs ensure the vessel’s safety using the varying inflation radius. Simulation and real-world experiments demonstrate that the proposed strategy enables the fully-actuated autonomous robot vessel to navigate through narrow spaces in real time and resolve potential deadlocks, all while ensuring safety.
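To make the adaptive inflation concrete, the following sketch evaluates an elliptical barrier function whose inflation radius depends on where the obstacle lies relative to the vessel's heading. The inflation rule (the extent of an elliptical hull toward the obstacle), the function names, and all dimensions are hypothetical; the sketch only evaluates the barrier value and does not implement the MPC or the high-order CBF constraints.

```python
import math


def inflation_radius(vessel_xy, vessel_heading, obstacle_xy, half_length, half_width):
    """Hypothetical rule: extent of an elliptical hull in the direction of the obstacle."""
    bearing = math.atan2(obstacle_xy[1] - vessel_xy[1], obstacle_xy[0] - vessel_xy[0])
    rel = bearing - vessel_heading  # obstacle direction expressed in the vessel frame
    return math.hypot(half_length * math.cos(rel), half_width * math.sin(rel))


def barrier(vessel_xy, obstacle_xy, obstacle_angle, a, b, r):
    """h >= 0 means the vessel reference point is outside the ellipse inflated by r."""
    dx = vessel_xy[0] - obstacle_xy[0]
    dy = vessel_xy[1] - obstacle_xy[1]
    c, s = math.cos(obstacle_angle), math.sin(obstacle_angle)
    u = c * dx + s * dy    # offset along the obstacle's major axis
    v = -s * dx + c * dy   # offset along the obstacle's minor axis
    return (u / (a + r)) ** 2 + (v / (b + r)) ** 2 - 1.0


vessel, obstacle = (5.0, 0.0), (0.0, 0.0)
# Bow pointing at the obstacle versus passing broadside (hull 4 m long, 1 m wide).
r_bow = inflation_radius(vessel, math.pi, obstacle, 2.0, 0.5)
r_side = inflation_radius(vessel, math.pi / 2, obstacle, 2.0, 0.5)
h_bow = barrier(vessel, obstacle, 0.0, 1.0, 1.0, r_bow)
h_side = barrier(vessel, obstacle, 0.0, 1.0, 1.0, r_side)
```

A fixed inflation would have to use the largest radius in every configuration. Under the assumed rule the broadside case gets a smaller radius and therefore a larger safety margin `h_side`, which is one way to read the reduced conservativeness claimed in the abstract.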

Abstract:
With the rapid advancement of intelligent manufacturing and the rise of emerging markets, global automobile exports have surged, placing unprecedented demands on logistics infrastructure. Efficient coordination of multiple robots for autonomous vehicle transfer is essential in high-density storage environments. However, the conventional navigation mode, where autonomous robots navigate the entire space, often leads to inefficiencies, congestion, and increased safety risks. To address these challenges, this paper proposes a dynamic network topology framework to optimize large-scale vehicle transfers in high-density environments. The approach models free space as a network graph with directional, weighted movement costs. Leveraging yard operational characteristics, real-time transfer conditions, and robot-specific capabilities, we introduce an event-triggered mechanism to update the network topology dynamically. This method continuously refines drivable space, effectively integrating yard areas with roadways to enhance routing flexibility in robot scheduling. Scenario-based evaluations demonstrate that the proposed approach reduces traveled distance by up to 12.3% and task completion time by 19.3% compared to traditional operational networks, leading to lower operational costs and improved task efficiency. Notably, these benefits become more pronounced as the number of robots increases and the operational environment grows more complex.
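The idea of a directed, weighted network whose topology changes on events can be sketched as follows. The node names, edge costs, event format, and the use of Dijkstra's algorithm for routing are assumptions for illustration; the abstract does not specify the routing algorithm or the event types.

```python
import heapq


def shortest_path(graph, start, goal):
    """Dijkstra's algorithm on a directed graph given as {node: {neighbor: cost}}."""
    queue = [(0.0, start, [start])]
    settled = {}
    while queue:
        cost, node, path = heapq.heappop(queue)
        if node == goal:
            return cost, path
        if node in settled and settled[node] <= cost:
            continue
        settled[node] = cost
        for nxt, weight in graph.get(node, {}).items():
            heapq.heappush(queue, (cost + weight, nxt, path + [nxt]))
    return float("inf"), []


def on_event(graph, event):
    """Apply one topology update: open or close a directed edge."""
    kind, u, v, weight = event
    if kind == "open":
        graph.setdefault(u, {})[v] = weight
    elif kind == "close":
        graph.get(u, {}).pop(v, None)


# Hypothetical yard: initially only the roadway connects the gate to the berth.
graph = {"gate": {"road1": 4.0}, "road1": {"road2": 4.0}, "road2": {"berth": 4.0}}
before = shortest_path(graph, "gate", "berth")

# Event: a yard block is emptied and becomes drivable in one direction.
on_event(graph, ("open", "gate", "yardA", 3.0))
on_event(graph, ("open", "yardA", "berth", 3.0))
after = shortest_path(graph, "gate", "berth")
```

After the two events the route through the emptied yard block becomes cheaper than the roadway, which illustrates how integrating yard areas with roadways can shorten traveled distance.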

Abstract:
Contact estimation and force sensing are fundamental requirements for sensitive manipulation and safe physical human-robot interaction. The robot controllers that enable these functions rely on accurate and precise sensing. The performance of external force estimation is influenced by the design of the robot’s sensory system. And similar to how humans prefer specific arm configurations for performing precise and delicate tasks, e.g., drawing a thin, straight line, robots also have "sweet spots" that allow for the most accurate performance of tasks based on their sensing capabilities. To fully exploit a robot’s proprioceptive force sensing, it is essential to provide robot integrators, designers, and simulations with knowledge about these optimal settings including factors such as joint configurations, temperatures, and many more. This paper first investigates which of these factors are most relevant and how they can be best measured and based on that introduces force sensing error maps as a tool for structured research on robot force sensing performance and future developments of tactile robot applications. We first investigate the factors influencing the force sensing performance of 7-degree-of-freedom robots on the example of a Kinova Gen3 and then derive 2-dimensional Cartesian force sensing error maps for this robot, an LWR iiwa 14, and a Franka Emika robot. These maps enable comparison of robot sensing capabilities, revealing patterns and weak spots to guide application design toward more tactile areas.

Abstract:
Ensuring safe and effective physical human-robot interaction (pHRI) remains a critical challenge in industrial robotics, particularly in ensuring compliance with ISO/TS 15066 safety standards. This paper proposes a novel framework to achieve safe and robust pHRI. The framework adapts the parameters of a variable admittance controller online in order to guarantee passivity and compliance with ISO/TS 15066. Passivity is guaranteed using an energy tank, while a safety constraint explicitly handles the Power and Force Limiting (PFL) energy limit. Experimental validation on an industrial robot demonstrates the effectiveness of the framework.
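An energy limit of the PFL kind can be turned into a speed bound through the two-body kinetic energy relation E = (1/2) * mu * v^2, where mu is the reduced mass of the human body region and the robot. The sketch below shows only this relation and a simple clamp; the function names and all numeric values are placeholders rather than values from ISO/TS 15066 or from the paper, and the admittance adaptation and energy tank are not modeled.

```python
import math


def reduced_mass(m_human, m_robot):
    """Reduced mass of the two-body contact model."""
    return 1.0 / (1.0 / m_human + 1.0 / m_robot)


def max_relative_speed(e_max, m_human, m_robot):
    """Largest relative speed whose kinetic energy stays within e_max (joules)."""
    return math.sqrt(2.0 * e_max / reduced_mass(m_human, m_robot))


def clamp_velocity(v_cmd, v_max):
    """Saturate a commanded scalar velocity to the admissible range."""
    return max(-v_max, min(v_max, v_cmd))


# Placeholder numbers only: 0.5 J energy limit, 2 kg effective masses.
v_max = max_relative_speed(0.5, 2.0, 2.0)
v_safe = clamp_velocity(1.5, v_max)
```

With the placeholder values the reduced mass is 1 kg, so a 0.5 J limit corresponds to a bound of 1 m/s.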

Abstract:
This work investigates the use of multiple Autonomous Surface Vehicles (ASVs) as Communication/Navigation Aids (CNAs) to enhance the navigation and state estimation of an Autonomous Underwater Vehicle (AUV). Our approach builds on recent advancements in low-cost sensors and platforms, which enable novel AUV applications across fundamental science, commercial industries, and defense. We consider six different combinations of Kalman Filter and Factor Graph localization solutions on three datasets, covering 53 minutes and 3.1 kilometers of operation. We first present the solution using the measurements from all three ASVs, before occluding measurements from two of the ASVs to assess the effect of reduced observability on localization performance.

Abstract:
Graph Neural Networks have emerged as a formidable tool for analyzing point clouds, leveraging their capacity to aggregate local features across multiple spatial scales via layered structures. However, a significant challenge lies in effectively and selectively integrating these multi-scale features to maximize overall performance. To tackle this integration challenge, we design a novel model, MambaGCN, which employs a state space model to dynamically adjust the feature weights across spatial scales during aggregation, enabling more refined feature integration while ensuring computational efficiency. Unlike transformers with their quadratic complexity, MambaGCN achieves linear complexity, substantially reducing GPU memory usage and computational cost. Moreover, we have enhanced the architectural depth by designing a density-based farthest point sampling algorithm, which allows us to selectively downsample the input data to achieve varying levels of point density. This innovation facilitates the seamless concatenation of multiple MambaGCN layers, significantly deepening the structure of the network and enhancing its ability to tackle complex point cloud tasks effectively. Through these strategic developments, MambaGCN has demonstrated outstanding performance in tasks such as point cloud classification and part segmentation, affirming its robustness and efficiency in processing point cloud data.
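For reference, plain farthest point sampling, the baseline that the density-based variant builds on, can be written in a few lines. This is the standard algorithm, not the density-based version proposed in the abstract; the function names and the sample points are illustrative.

```python
def squared_distance(p, q):
    """Squared Euclidean distance between two points of equal dimension."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def farthest_point_sampling(points, k, start=0):
    """Return indices of k points, each chosen farthest from those already selected."""
    selected = [start]
    # dist[i] is the squared distance from point i to its nearest selected point.
    dist = [squared_distance(p, points[start]) for p in points]
    while len(selected) < k:
        idx = max(range(len(points)), key=dist.__getitem__)
        selected.append(idx)
        dist = [min(d, squared_distance(p, points[idx])) for d, p in zip(dist, points)]
    return selected


# Three clustered points and two isolated ones.
points = [(0.0, 0.0), (0.1, 0.0), (0.2, 0.0), (10.0, 0.0), (5.0, 0.0)]
indices = farthest_point_sampling(points, k=3)
```

Plain FPS spreads samples evenly regardless of local density, so the dense cluster near the origin contributes only its starting point here.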

Abstract:
Legged animals still outperform many terrestrial robots due to the complex interplay of various component subsystems. Centralization is a potential integrated design axis to help improve the performance of legged robots in variable terrain environments. Centralization arises from the coupling of multiple limbs and joints through mechanics or feedback control. Strong couplings contribute to a whole-body coordinated response (centralized) and weak couplings result in localized responses (decentralized). Rarely are both mechanical and neural couplings considered together in designing centralization. In this study, we use an empirical information theory-based approach to evaluate the emergent centralization of a hexapod robot. We independently vary the mechanical and neural coupling through adjustable joint stiffness and variable coupling of leg controllers, respectively. We found an increase in centralization as neural coupling increased. Changes in mechanical coupling did not significantly affect centralization during walking, but did change the total information processing of the neuromechanical control architecture. Information-based centralization increased with robotic performance in terms of cost of transport and speed, implying that this may be a useful metric in robotic design.

Abstract:
Recently, data-driven trajectory prediction methods have achieved remarkable results, significantly advancing the development of autonomous driving. However, the instability of single-vehicle perception introduces certain limitations to trajectory prediction. In this paper, a novel lightweight framework for cooperative trajectory prediction, CoPAD, is proposed. This framework incorporates a fusion module based on the Hungarian algorithm and Kalman filtering, along with the Past Time Attention (PTA) module, a mode attention module, and an anchor-oriented decoder (AoD). It effectively performs early fusion on multi-source trajectory data from vehicles and road infrastructure, yielding trajectories with high completeness and accuracy. The PTA module can efficiently capture potential interaction information among historical trajectories, and the mode attention module is proposed to enrich the diversity of predictions. Additionally, the decoder based on sparse anchors is designed to generate the final complete trajectories. Extensive experiments show that CoPAD achieves state-of-the-art performance on the DAIR-V2X-Seq dataset, validating the effectiveness of the model in cooperative trajectory prediction in V2X scenarios.

Abstract:
In this study, we propose a bistable tail based on a reconfigurable laminate mechanism for a Throwbot that transforms between a ball type and a wheel type. Various robots, such as snake robots, drones, and throwing robots, have been studied for life-saving missions on behalf of humans at disaster sites. In particular, the hybrid-type throwing robot can have both the throwing ease of the ball type and the driving stability of the wheel type. However, it requires the tail to be stored inside when being thrown and to be rigidly deployed when driving. To satisfy these requirements, we developed a foldable tail based on a scissor lift structure in our previous study. However, that structure was composed of only rigid parts, which caused interference with other parts when stored and made it difficult to further change the maximum deployed tail length. To overcome these limitations, we develop a bistable tail suitable for the hybrid type that can maintain both a bendable state and a rigid state. Before development, we calculate the minimum tail length for overcoming obstacles through statics analysis. Then, we design a bistable structure utilizing a reconfigurable laminate mechanism. Next, we calculate the design constraints for mounting it on the actual robot. Finally, the developed tail is mounted on the actual Throwbot to perform obstacle-overcoming experiments. We confirm that it secures both ease of throwing and stable obstacle-overcoming ability. Through this, we propose a bistable tail suitable for hybrid-type throwing robots.

Abstract:
When tracking an underwater target, autonomous underwater vehicles (AUVs) need to estimate the target state based on the information detected by sensors and plan their own tracking paths accordingly to achieve active sensing of the target. When the sensor equipped on the AUV is a flank array sonar, the problem becomes significantly more complex due to the limited field of view (FOV) of the sonar and the fact that only bearing information is available for observation. To address this issue, this paper proposes a distributed solution for cooperative tracking and active sensing using dual-AUV systems equipped with flank array sonar for detection. Based on the analysis of underwater acoustic communication modes in dual-AUV systems, this study decomposes the problem into two aspects: cooperative estimation and planning control for active sensing. Corresponding algorithms are proposed and their effectiveness is verified.

Abstract:
Autonomous driving with reinforcement learning (RL) has significant potential. However, applying RL in real-world settings remains challenging due to the need for safe, efficient, and robust learning. Incorporating human expertise into the learning process can help overcome these challenges by reducing risky exploration and improving sample efficiency. In this work, we propose a reward-free, active human-in-the-loop learning method called Human-Guided Distributional Soft Actor-Critic (H-DSAC). Our method combines Proxy Value Propagation (PVP) and Distributional Soft Actor-Critic (DSAC) to enable efficient and safe training in real-world environments. The key innovation is the construction of a distributed proxy value function within the DSAC framework. This function encodes human intent by assigning higher expected returns to expert demonstrations and penalizing actions that require human intervention. By extrapolating these labels to unlabeled states, the policy is effectively guided toward expert-like behavior. With a well-designed state space, our method achieves real-world driving policy learning within practical training times. Results from both simulation and real-world experiments demonstrate that our framework enables safe, robust, and sample-efficient learning for autonomous driving. The videos and code are available at: https://github.com/lzqw/H-DSAC.

Abstract:
Modern robotic systems frequently engage in complex multi-agent interactions, many of which are inherently multi-modal, i.e., they can lead to multiple distinct outcomes. To interact effectively, robots must recognize the possible interaction modes and adapt to the one preferred by other agents. In this work, we propose MultiNash-PF, an efficient algorithm for capturing the multimodality in multi-agent interactions. We model interaction outcomes as equilibria of a game-theoretic planner, where each equilibrium corresponds to a distinct interaction mode. Our framework formulates interactive planning as Constrained Potential Trajectory Games (CPTGs), in which local Generalized Nash Equilibria (GNEs) represent plausible interaction outcomes. We propose to integrate the potential game approach with implicit particle filtering, a sample-efficient method for non-convex trajectory optimization. We utilize implicit particle filtering to identify the coarse estimates of multiple local minimizers of the game’s potential function. MultiNash-PF then refines these estimates with optimization solvers, obtaining different local GNEs. We show through numerical simulations that MultiNash-PF reduces computation time by up to 50% compared to a baseline. We further demonstrate the effectiveness of our algorithm in real-world human-robot interaction scenarios, where it successfully accounts for the multi-modal nature of interactions and resolves potential conflicts in real-time.

Abstract:
Large Vision Language Models (VLMs) have been adopted in robotics for their strong common sense understanding and generalization capabilities. Existing works leverage VLMs for task and motion planning based on language instructions and robot observations. In this work, we explore using a VLM to interpret long-horizon human demonstration videos and generate a sequence of robot task plans in natural language. To achieve this, we propose SeeDo, an agent that integrates a keyframe selection module, a visual prompting module, and a VLM interpreter into a pipeline that enables the VLM to "see" human demonstrations and generate step-by-step plans for robots to "do" them. For evaluation, we curate a benchmark of long-horizon human demonstration videos of pick-and-place tasks in three diverse categories and design comprehensive evaluation metrics. Experiments show that SeeDo outperforms state-of-the-art video VLMs in generating subtask plans in natural language from long-horizon human demonstration videos. By further integrating SeeDo with low-level action primitive functions and language model programs, we validate SeeDo in both simulated and real-world deployments. The code, demos, prompts and data can be found at ai4ce.github.io/SeeDo.

Abstract:
This paper introduces a novel robotic gripper, named the SPD gripper. It features a palm and two mechanically identical and symmetrically arranged fingers, which can be driven independently or by a single motor. The fingertips follow a linear motion trajectory, facilitating the grasping of objects of various sizes on a tabletop without the need to adjust the overall height of the gripper. Traditional industrial grippers with parallel gripping capabilities often exhibit an arcuate motion at the fingertips, requiring the entire robotic arm to adjust its height to avoid collisions with the tabletop. The SPD gripper, with its linear parallel gripping mechanism, effectively addresses this issue. Furthermore, the SPD gripper possesses adaptive capabilities, accommodating objects of different shapes and sizes. This paper presents the design philosophy, fundamental composition principles, and optimization analysis theory of the SPD gripper. Based on the design theory, a robotic gripper prototype was developed and tested. The experimental results demonstrate that the robotic gripper successfully achieves linear parallel gripping functionality and exhibits good adaptability. In the context of the ongoing development of embodied intelligence technologies, this robotic gripper can assist various robots in achieving effective grasping, laying a solid foundation for collecting data to enhance deep learning training.

Abstract:
The rise of lightweight, low-cost underwater vehicle-manipulator systems (UVMS) has made autonomous underwater manipulation increasingly accessible. Yet, most current research remains limited to isolated tasks, such as trajectory tracking or compensation of unknown payloads. Detailed experimental analyses that go beyond a proof-of-concept are particularly rare. We present a comprehensive open-source software framework for fully automated pick-and-place studies. We build upon our previous work on a task-priority control framework and extend it to enable fully autonomous manipulation. This includes a high-level decision-making process to coordinate the pick-and-place sequence and a grasp detection method to verify the successful pick-up of the object. We demonstrate this framework on the widely-used platform of a BlueROV2 and an Alpha 5 manipulator. Extensive quantitative experimental studies (100+ trials) show the picking and placing to be highly accurate, with mean position errors of <5 mm and <10 mm, respectively. We additionally validate our grasp detection approach and analyze trajectory tracking sensitivity to varying payloads and speeds. These results provide a baseline of what accuracy is currently achievable with state-of-the-art lightweight hardware under ideal research conditions. The code is available at https://github.com/HippoCampusRobotics/uvms.

Abstract:
In this work, we present a novel control approach based on partial feedback linearization (PFL) for the stabilization of a suspended aerial platform with an attached load. Such systems are envisioned for various applications in construction sites involving cranes, such as the holding and transportation of heavy objects. Our proposed control approach considers the underactuation of the whole system while utilizing its coupled dynamics for stabilization. We demonstrate using numerical stability analysis that these coupled terms are crucial for the stabilization of the complete system. We also carried out a robustness analysis of the proposed approach in the presence of external wind disturbances, sensor noise, and uncertainties in system dynamics. As our envisioned target application involves cranes in outdoor construction sites, our control approach relies only on onboard sensors, making it suitable for such applications. We carried out extensive simulation studies and experimental tests to validate our proposed control approach.

Abstract:
Deep reinforcement learning (DRL) has recently emerged as a promising tool for tackling pursuit-evasion tasks. However, most existing DRL-based pursuit approaches still rely on individual rewards and struggle with complex scenarios. To address these challenges, we propose a knowledge-enhanced DRL approach for multi-agent pursuit-evasion in complex environments. Specifically, the cooperative pursuit problem is modeled as a decentralized partially observable Markov decision process from each pursuer's perspective, where the team reward function is elaborately designed to encourage collaborative behavior and enhance team coordination. Then, a novel knowledge-enhanced multi-agent twin delayed deep deterministic policy gradient (KE-MATD3) algorithm is presented to efficiently learn the cooperative pursuit policy. By integrating a knowledge enhancement mechanism that extracts effective information from an improved artificial potential field method, the cooperative pursuit policy achieves more robust convergence, mitigating the local optima that typically arise from individual reward-based learning. Finally, extensive numerical simulations and real-world experiments validate the efficiency and superiority of the proposed approach, demonstrating emergent cooperative behaviors among the pursuers.
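
As an illustration of the kind of prior knowledge an artificial potential field (APF) can supply, the following is a minimal sketch of the classic APF force in 2D. It is not the improved APF variant or the knowledge enhancement mechanism of the abstract, whose details are not given here; the function name `apf_force` and the gains `k_att`, `k_rep`, and influence radius `d0` are hypothetical.

```python
import math


def apf_force(pos, goal, obstacles, k_att=1.0, k_rep=1.0, d0=2.0):
    """Classic APF force at `pos` (hypothetical gains, not from the paper).

    The attractive term pulls linearly toward `goal`; each obstacle closer
    than the influence radius `d0` adds a repulsive term pointing away from it.
    """
    # Attractive force: negative gradient of 0.5 * k_att * ||pos - goal||^2.
    fx = k_att * (goal[0] - pos[0])
    fy = k_att * (goal[1] - pos[1])
    for ox, oy in obstacles:
        dx, dy = pos[0] - ox, pos[1] - oy
        d = math.hypot(dx, dy)
        if 0.0 < d < d0:
            # Negative gradient of 0.5 * k_rep * (1/d - 1/d0)^2.
            mag = k_rep * (1.0 / d - 1.0 / d0) / (d * d)
            fx += mag * dx / d
            fy += mag * dy / d
    return fx, fy
```

In a knowledge-enhanced scheme, a force of this kind could serve as a reference action or a reward-shaping signal; which of these the authors use is not stated in the abstract.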

Abstract:
This paper establishes an analytical model for a dual-pressure-actuated pneumatic linear actuator, investigating the relationship between the output force of the linear actuator and both the pressure differential and displacement. Experiments were designed to validate the model. The maximum output force of the linear actuator under negative pressure (-40 kPa) is 100 N, while under hybrid air pressure (negative pressure -40 kPa combined with positive pressure 40 kPa), the maximum output force significantly increases to approximately 210 N, demonstrating that dual pressure driving can substantially enhance output performance. The analytical results exhibit excellent agreement with experimental data under low-pressure conditions, with a maximum relative error of only 5%. Furthermore, comparisons with a flexible bellows of the same dimensions confirm that the linear actuator also exhibits high stiffness. Finally, potential applications of the linear actuator in daily life are discussed.

Abstract:
With the increasing demand for safe and efficient human-robot interaction in industrial applications, robotic manipulators with mixed rigid-elastic joints have gained significant attention, yet their control remains challenging due to inherent parameter uncertainties and complex dynamics. In this paper, an adaptive Cartesian position control for robotic manipulators with mixed rigid-elastic joints is presented. Adaptive controllers are designed to deal with uncertainties in the parameters of the motors, while robust control signals effectively cope with uncertainties of the links and stiffness of the elastic joints. Furthermore, a switching strategy between Cartesian space position control and joint space position control when the end-effector comes into the vicinity of the target point is proposed. This switching strategy helps to keep the pose of the robotic manipulator stable when the end-effector has approached the target point. Simulation results on a 6-DOF robotic manipulator demonstrate that the proposed control scheme can achieve the desired accuracy in position and maintain a stable pose when the end-effector reaches the target.

Abstract:
In this paper, we propose a theoretical framework for designing a multi-robot formation equipped with Ultra-wideband (UWB) sensors to localize a target robot. In the presence of noisy range measurements, the accuracy of the target robot’s pose estimation is highly dependent on the chosen formation geometry. Different from existing works, we account for the heterogeneous standard deviations of range measurements across different UWB transmitter-receiver pairs. We establish new optimality conditions for formation geometries and conduct a sensitivity analysis of optimal formations under robot positioning errors. In a 2D setting, we derive necessary and sufficient conditions for both optimality and robustness to robot positioning uncertainty. Experimental results confirm the heterogeneous standard deviations of UWB range measurements and validate the target robot’s confidence ellipse model. An experimental comparison of formation geometries, optimized with and without considering heterogeneous noise, emphasizes the importance of accounting for the heterogeneous standard deviations of range measurements. In addition, we experimentally demonstrate that robust formation geometries improve the target robot’s confidence ellipse in the presence of positioning errors.
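
To make the role of heterogeneous range noise concrete, the sketch below computes the standard Fisher information matrix (FIM) for 2D range-only position estimation with independent Gaussian noise, where each robot contributes a rank-one term weighted by the inverse variance of its range measurement. This is a textbook construction offered under simplifying assumptions (position only, noise independent of distance), not the optimality conditions derived in the paper; the function names are hypothetical.

```python
import math


def fisher_information(target, robots, sigmas):
    """Return (a, b, c) of the symmetric 2x2 FIM [[a, b], [b, c]].

    Each robot i contributes (1 / sigma_i^2) * u_i u_i^T, where u_i is the
    unit vector from robot i to the target.
    """
    a = b = c = 0.0
    for (x, y), s in zip(robots, sigmas):
        dx, dy = target[0] - x, target[1] - y
        d = math.hypot(dx, dy)
        ux, uy = dx / d, dy / d
        w = 1.0 / (s * s)  # noisier links carry less information
        a += w * ux * ux
        b += w * ux * uy
        c += w * uy * uy
    return a, b, c


def fim_determinant(target, robots, sigmas):
    """Determinant of the FIM; the confidence ellipse area scales as 1/sqrt(det)."""
    a, b, c = fisher_information(target, robots, sigmas)
    return a * c - b * b
```

Under these assumptions, a formation can be scored by `fim_determinant`: collinear robots give a determinant of zero (unbounded uncertainty in one direction), and doubling one link's standard deviation cuts that link's contribution by a factor of four.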

Abstract:
Retinal vein cannulation (RVC) is a minimally invasive microsurgical procedure for treating retinal vein occlusion (RVO), a leading cause of vision impairment. However, the small size and fragility of retinal veins, coupled with the need for high-precision, tremor-free needle manipulation, create significant technical challenges. These limitations highlight the need for robotic assistance to improve accuracy and stability. This study presents an automated robotic system with a top-down microscope and B-scan optical coherence tomography (OCT) imaging for precise depth sensing. Deep learning-based models enable real-time needle navigation, contact detection, and vein puncture recognition, using a chicken embryo model as a surrogate for human retinal veins. The system autonomously detects needle position and puncture events with 85% accuracy. The experiments demonstrate notable reductions in navigation and puncture times compared to manual methods. Our results demonstrate the potential of integrating advanced imaging and deep learning to automate microsurgical tasks, providing a pathway for safer and more reliable RVC procedures with enhanced precision and reproducibility.

Abstract:
The construction industry faces growing challenges in sustainability, labor shortages, and efficiency. Stone masonry, known for its durability and low environmental impact, has declined in modern construction due to high labor costs and slow building processes. This work presents a robotic system for automating the construction of mortar-joint, multi-leaf stone masonry walls. Our approach integrates stone layout optimization, sequence planning, vacuum-based grasping, automated pick-and-place motion, and vision- and sensor-guided trajectory correction. A digital twin system provides real-time feedback to improve accuracy and adaptability. To evaluate our method, we construct a 700 × 700 × 400 mm³ masonry wall and compare it to those built by skilled masons. The final structure demonstrates comparable strength, stone interlocking, and stone filling to manually built walls. As the first robotic approach to mortar-joint multi-leaf masonry construction, this work addresses key challenges such as robotic manipulation and dense packing of irregular stones with wet mortar. Our findings contribute to advancing robotic construction with natural materials and offer a scalable framework for future architectural applications.

Abstract:
The success of 3D Gaussian splatting in 3D reconstruction has recently led to efforts to integrate it with SLAM systems. However, most existing research has focused on indoor tracking and mapping, while outdoor Gaussian SLAM methods still rely heavily on expensive LiDAR sensors. To address these challenges, we propose LOG-SLAM, a novel method for large-scale outdoor tracking and mapping using Gaussian Splatting. Our approach supports tracking through monocular or visual-inertial input, progressively constructing the 3D Gaussian map from depth and pose estimates obtained during the tracking process. Additionally, we introduce a submap-based strategy for managing large-scale maps, enabling the reconstruction of kilometer-scale environments. A loop closure detection module is also incorporated to reduce accumulated errors. Furthermore, we present a novel dynamic object removal method based on rendering loss that mitigates the interference of dynamic objects on the reconstruction. Our experiments on KITTI and KITTI-360 demonstrate that our method achieves localization performance comparable to traditional SLAM systems, while outperforming recent GS/NeRF-based SLAM approaches in terms of mapping and rendering quality.

Abstract:
Modern automated factories increasingly run manufacturing procedures using a matrix of programmable machines, such as 3D printers, interconnected by a programmable transport system, such as a fleet of tabletop robots. To embed a manufacturing procedure into a smart factory, an operator must: (a) assign each of its processes to a machine and (b) specify how agents should transport parts between machines. The problem of embedding a manufacturing process into a smart factory is termed the Smart Factory Embedding (SFE) problem. State-of-the-art SFE solvers can only scale to factories containing a couple dozen machines. Modern smart factories, however, may contain hundreds of machines. We fill this gap by introducing the first highly scalable solution to the SFE problem, TS-ACES, the Traffic System based Anytime Cyclic Embedding Solver. We show that TS-ACES is complete and can scale to SFE instances based on real industrial scenarios with more than a hundred machines.

Abstract:
Soil diseases around drainage pipelines are a major factor in road collapse. Robots designed to detect these diseases face multiple challenges, including harsh internal environments, size limitations, difficulties in achieving full external space coverage, and the impact of pose misalignment on disease localization. To address these challenges, this work presents the design and development of a pipeline robot equipped with Ground-Penetrating Radar (GPR), capable of adapting to a pipe diameter range of 500-1000 millimeters and providing comprehensive detection of external space diseases. A radial offset estimation model is introduced, and by integrating multi-sensor data, the robot achieves full-pose perception, overcoming challenges related to angular and positional misalignment during disease localization. Experimental results demonstrate that the robot can achieve a maximum detection speed of up to 0.5 meters per second and is capable of adapting to various field drainage pipeline scenarios, including full water, rough terrain, pose misalignment, and 90-degree bends. Azimuth errors for external disease localization are controlled within 1 degree, and axial displacement errors are controlled within 2 centimeters.

Abstract:
Class-agnostic 3D instance segmentation is critical for robotic systems operating in unknown environments, enabling perception of previously unseen objects for reliable manipulation and navigation. Existing approaches typically project per-frame 2D instance masks into 3D and merge them, which often breaks object identities across time and yields fragmented 3D instances. We introduce Cross-Dimensional Class-Agnostic 3D Instance Segmentation (CDIS), a zero-shot framework that explicitly tracks 2D instance masks across frames and associates them with 3D superpoints, creating a feedback loop between 2D and 3D. This cross-dimensional reasoning links temporally stable 2D tracks with spatially coherent 3D regions, producing globally consistent 3D instance labels without any 3D-specific training. Experiments on benchmark datasets demonstrate that CDIS achieves higher accuracy and consistency than state-of-the-art zero-shot methods, while remaining efficient and scalable to diverse real-world environments.

Abstract:
Current lidar and camera-based solutions for low-cost indoor mobile robots have limitations such as poor performance in visually obscured environments, high computational overhead for data processing, and high costs for lidars. In contrast, mmWave radar sensors offer a cost-effective and lightweight alternative, providing accurate ranging regardless of visibility. However, existing radar-based localization suffers from sparse point cloud generation, noise, and false detections. Thus, in this work, we introduce RaGNNarok, a real-time, lightweight, and generalizable graph neural network (GNN)-based framework to enhance radar point clouds, even in complex and dynamic environments. With an inference time of only 7.3 ms on the low-cost Raspberry Pi 5, RaGNNarok runs even on such resource-constrained devices, without additional computational resources. We evaluate its performance across key tasks, including localization, SLAM, and autonomous navigation, in three different environments. Our results demonstrate strong reliability and generalizability, making RaGNNarok a robust solution for low-cost indoor mobile robots.

Abstract:
To perform target-following tasks in unknown environments, a robot must identify the target’s position and plan an efficient path to reach it. Traditional LiDAR-based localization systems face challenges in distinguishing the target from objects with similar appearances. Meanwhile, existing target-following approaches often neglect target visibility during path planning, leading to target occlusion by obstacles and ultimately resulting in following failure. In this paper, we propose a sequence matching method for target-localization using LiDAR and Ultra-Wideband (UWB) ranging. We determine the position of the target by analyzing the similarities between UWB ranging sequence and LiDAR cluster trajectories. To achieve visibility-aware target-following, we incorporate a visibility objective function into the Dynamic Window Approach (DWA) to generate a following path that minimizes the risk of target loss. This function evaluates the target loss risk based on the positional relationships between the robot, the target, and the nearest obstacle to the target. Extensive experiments were conducted using both human and robot as targets. The results show that our approach achieves higher completion rates when compared to the target-following using traditional DWA.
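
The visibility objective described above can be sketched as an extra cost term evaluated for each candidate DWA trajectory endpoint. The following is a minimal illustration under stated assumptions, not the function used in the paper: risk is taken to decay exponentially with the distance from the nearest obstacle to the robot-target line of sight, and `visibility_risk` and the `scale` parameter are hypothetical.

```python
import math


def point_segment_distance(p, a, b):
    """Euclidean distance from point p to the segment a-b (all 2D tuples)."""
    ax, ay = a
    bx, by = b
    px, py = p
    abx, aby = bx - ax, by - ay
    denom = abx * abx + aby * aby
    if denom == 0.0:
        # Degenerate segment: robot and target coincide.
        return math.hypot(px - ax, py - ay)
    # Projection parameter of p onto the line, clamped to the segment.
    t = ((px - ax) * abx + (py - ay) * aby) / denom
    t = max(0.0, min(1.0, t))
    cx, cy = ax + t * abx, ay + t * aby
    return math.hypot(px - cx, py - cy)


def visibility_risk(robot, target, obstacle, scale=0.5):
    """Hypothetical target-loss risk in (0, 1]; 1 means the obstacle lies on the line of sight."""
    d = point_segment_distance(obstacle, robot, target)
    return math.exp(-d / scale)
```

A DWA planner would add a weighted `visibility_risk` to its usual heading, clearance, and velocity terms, so that candidate velocities leading toward occluded poses score worse.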

Abstract:
Food waste management plays a vital role in maintaining a sustainable ecosystem; however, the presence of inorganic contaminants within food waste significantly hinders this potential. Robotic automation offers a promising solution to accelerate waste sorting, yet the diverse and unpredictable nature of contaminants poses major challenges to robotic perception and grasping. This benchmark study explores the feasibility and limitations of conventional robotic grasping systems, replicating real-world industrial conditions to highlight the complexities of food waste sorting. A comprehensive automated robotic grasping pipeline is introduced, integrating advanced 6D grasping pose detection, collision-free robotic arm motion planning, and effective grasping with three top-performing robotic end-effectors. Extensive experimental evaluations (up to 1500 robotic grasps) compare the performance of different gripper designs and the corresponding grasping strategies under three high-fidelity environmental scenes, providing valuable insights into the limitations of the current robotic system. Experimental results demonstrate the significant strengths of each gripper when dealing with objects of varying types or in different environments. This is critical for enhancing robotic sorting capabilities, particularly in advancing multimodal gripper technology.

Abstract:
Autonomous driving requires an understanding of the static environment from sensor data. Learned Bird’s-Eye View (BEV) encoders are commonly used to fuse multiple inputs, and a vector decoder predicts a vectorized map representation from the latent BEV grid. However, traditional map construction models provide deterministic point estimates, failing to capture uncertainty and the inherent ambiguities of real-world environments, such as occlusions and missing lane markings. We propose MapDiffusion, a novel generative approach that leverages the diffusion paradigm to learn the full distribution of possible vectorized maps. Instead of predicting a single deterministic output from learned queries, MapDiffusion iteratively refines randomly initialized queries, conditioned on a BEV latent grid, to generate multiple plausible map samples. This allows aggregating samples to improve prediction accuracy and deriving uncertainty estimates that directly correlate with scene ambiguity. Extensive experiments on the nuScenes dataset demonstrate that MapDiffusion achieves state-of-the-art performance in online map construction, surpassing the baseline by 5% in single-sample performance. We further show that aggregating multiple samples consistently improves performance along the ROC curve, validating the benefit of distribution modeling. Additionally, our uncertainty estimates are significantly higher in occluded areas, reinforcing their value in identifying regions with ambiguous sensor input. By modeling the full map distribution, MapDiffusion enhances the robustness and reliability of online vectorized HD map construction, enabling uncertainty-aware decision-making for autonomous vehicles in complex environments.

Abstract:
We present and tackle the problem of Embodied Question Answering (EQA) with Situational Queries (S-EQA) in a household environment. Unlike prior EQA work tackling simple queries that directly reference target objects and properties ("What is the color of the car?"), situational queries (such as "Is the house ready for sleeptime?") are challenging as they require the agent to correctly identify multiple object-states (Doors: Closed, Lights: Off, etc.) and reach a consensus on their states for an answer. Towards this objective, we first introduce a novel Prompt-Generate-Evaluate (PGE) scheme that wraps around an LLM’s output to generate unique situational queries and corresponding consensus object information. PGE is used to generate 2K datapoints in the VirtualHome simulator, which are then annotated for ground truth answers via a large-scale user study conducted on M-Turk. With a high rate of answerability (97.26%) on this study, we establish that LLMs are good at generating situational data. However, in evaluating the data using an LLM, we observe a low correlation of 46.2% with the ground truth human annotations, indicating that while LLMs are good at generating situational data, they struggle to answer them according to consensus. When asked for reasoning, we observe the LLM often goes against commonsense in justifying its answer. Finally, we utilize PGE to generate situational data in a real-world environment, exposing LLM hallucination in generating reliable object-states when a structured scene graph is unavailable. To the best of our knowledge, this is the first work to introduce EQA in the context of situational queries and also the first to present a generative approach for query creation. We aim to foster research on improving the real-world usability of embodied agents through this work.

Abstract:
This paper presents a simple algebraic method to estimate the pose of a camera relative to a planar target from n ≥ 4 reference points with known coordinates in the target frame and their corresponding bearing measurements in the camera frame. The proposed approach follows a hierarchical structure; first, the unit vector normal to the target plane is determined, followed by the camera’s position vector, its distance to the target plane, and finally, the full orientation. To improve the method’s robustness to measurement noise, an averaging methodology is introduced to refine the estimation of the target’s normal direction. The accuracy and robustness of the approach are validated through extensive experiments.

Abstract:
Liquids and granular media (e.g., oats, rice, lentils) are pervasive throughout human environments, yet remain challenging for robots to sense and manipulate precisely. In this work, we present a systematic approach to integrating capacitive sensing within robotic end effectors, enabling robust sensing and precise manipulation of liquids and granular media. We introduce the parallel-jaw RoboCAP Gripper with embedded capacitive sensing arrays that enable a robot to directly sense the materials and dynamics of liquids inside diverse containers. Our system achieves 82.8% classification accuracy across 81 container–substance combinations, and enables a robotic manipulator to perform precision pouring with a mean error of 3.2 g over 200 trials. Code, designs, and build details are available on the project website.

Abstract:
Model predictive control (MPC) relies on an accurate dynamics model to achieve precise and safe robot operation. In complex and dynamic aquatic environments, developing an accurate model that captures hydrodynamic details and accounts for environmental disturbances like waves, currents, and winds is challenging for aquatic robots. In this paper, we propose an online residual model learning framework for MPC, which leverages approximate models to learn complex unmodeled dynamics and environmental disturbances in dynamic aquatic environments. We integrate offline learning from previous simulation experience with online learning from the robot’s real-time interactions with the environment. These three components (residual modeling, offline learning, and online learning) enable a highly sample-efficient learning process, allowing for accurate real-time inference of model dynamics in complex and dynamic conditions. We further integrate this online-learned residual model into a nonlinear model predictive controller, enabling it to actively choose the control actions that optimize control performance. Extensive simulations and real-world experiments with an autonomous surface vehicle demonstrate that our residual model learning MPC significantly outperforms conventional MPCs in dynamic field environments.
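
The residual modeling idea can be illustrated with a toy one-dimensional example: an approximate model predicts the next state, and a learned correction is fitted to the gap between prediction and measurement, one sample at a time. This is a minimal sketch under strong assumptions (scalar state, residual linear in the state, closed-form least squares), not the learning method of the paper; `nominal_step`, `true_step`, and `ResidualModel` are hypothetical names, and `true_step` merely stands in for real-world measurements.

```python
def nominal_step(x, u, dt=0.1):
    """Approximate model: speed changes with thrust only, drag is ignored."""
    return x + dt * u


def true_step(x, u, dt=0.1, drag=0.5):
    """Stand-in for the real system, which also has linear drag (assumed)."""
    return x + dt * (u - drag * x)


class ResidualModel:
    """Online least-squares fit of residual = w * x using running sums."""

    def __init__(self):
        self.sxx = 0.0  # running sum of x * x
        self.sxr = 0.0  # running sum of x * residual

    def update(self, x, u, x_next_measured, dt=0.1):
        """Incorporate one transition observed on the real system."""
        residual = x_next_measured - nominal_step(x, u, dt)
        self.sxx += x * x
        self.sxr += x * residual

    def predict(self, x, u, dt=0.1):
        """Nominal prediction plus the learned residual correction."""
        w = self.sxr / self.sxx if self.sxx > 0.0 else 0.0
        return nominal_step(x, u, dt) + w * x
```

An MPC would then roll out `predict` instead of `nominal_step` over its horizon. Before any data arrives, the corrected model falls back to the nominal one, which mirrors the motivation for combining an approximate model with a learned residual.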

Abstract:
In this paper, we present a novel online global reactive motion planner that synergizes the benefits of reactive control and model predictive control (MPC). By applying circular fields, the planner significantly simplifies the problem of determining control inputs for mobile robots and manipulators, making real-time MPC feasible even in complex and dynamic three-dimensional environments. This approach utilizes the performance advantages of optimal control while maintaining reactivity to environmental changes and computational efficiency. The proposed motion planner is evaluated in various simulated scenarios, including complex dynamic environments with up to 100 moving obstacles, and is compared to different state-of-the-art approaches.

Abstract:
Accurate localization is crucial for the autonomous operation of mobile robots. Specifically for indoor scenarios, localization algorithms typically rely on a previously generated map. However, many real-world sites like warehouses or healthcare environments violate the underlying assumption that the robot’s surroundings are mainly static. In this paper, we introduce a new dataset plus a benchmark that enables evaluating and comparing indoor localization methods in complex and changing real-world scenarios. While several datasets for indoor scenes exist, only a few combine the long-term localization aspect of repeatedly revisiting the same environment under varying conditions with precise ground truth over multiple rooms. Our dataset comprises various sequences recorded with a wheeled robot covering an office environment. We provide data from two 2D LiDARs, multiple consumer-grade RGB-D cameras, and the robot’s wheel odometry. By densely placing fiducial markers on every room ceiling, we can also provide accurate pose information within a single global frame for the whole environment, estimated through an additional upward-facing camera. We evaluate existing localization algorithms on our data and make the dataset together with a server-based benchmark evaluation publicly available. This facilitates an unbiased evaluation of localization approaches and enables further research on their application in challenging indoor scenarios.

Abstract:
Hand exoskeletons can recognize a user’s intent and provide active resistance training to enhance finger strength in stroke patients. However, achieving fine human-robot interaction (HRI) while maintaining system simplicity for lightweight design remains a key challenge. In this work, we present an improved flexible hand exoskeleton with a series elastic actuator (SEA) for hand strength estimation and progressive resistance exercise. The SEA design gives the hand exoskeleton backdrivability, which improves HRI performance. By combining the flexible linkage with the flex sensor, we propose a novel user interface that sensitively acquires hand motion intent. An Extended Kalman Filter (EKF)-based tracking error estimator is designed to evaluate finger strength. The results of the finger strength estimation are used to adjust the parameters of the admittance model to provide small or large damping when the user’s finger strength is low or high, achieving progressive resistance exercise based on active admittance control. The feasibility has been demonstrated by two sets of experiments, and this work has established a hand exoskeleton solution for finger strength estimation and fine human-robot interaction.
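
The strength-dependent damping described above can be sketched with a first-order admittance model, M·dv/dt + D·v = F, in which the damping D grows with the estimated finger strength. This is a minimal illustration under stated assumptions (normalized strength in [0, 1], linear interpolation between two damping values, explicit Euler integration); the function names and all numeric parameters are hypothetical and are not taken from the paper.

```python
def damping_from_strength(strength, d_min=2.0, d_max=20.0):
    """Map normalized finger strength in [0, 1] to a damping coefficient.

    Low strength gives small damping (easy motion); high strength gives
    large damping (more resistance).
    """
    s = max(0.0, min(1.0, strength))
    return d_min + s * (d_max - d_min)


def admittance_step(v, force, strength, mass=1.0, dt=0.01):
    """One explicit-Euler step of mass * dv/dt + damping * v = force."""
    d = damping_from_strength(strength)
    a = (force - d * v) / mass
    return v + dt * a
```

For a constant applied force, the commanded velocity settles at force divided by damping, so a stronger user moves more slowly under the same effort, which is the progressive resistance effect.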

Abstract:
This paper presents opt-in camera, a concept of privacy-preserving camera systems capable of recording only specific individuals in a crowd who explicitly consent to be recorded. Our system utilizes a mobile wireless communication tag attached to personal belongings as proof of opt-in and as a means of localizing tag carriers in video footage. Specifically, the on-ground positions of the wireless tag are first tracked over time using the unscented Kalman filter (UKF). The tag trajectory is then matched against visual tracking results for pedestrians found in videos to identify the tag carrier. Technically, we devise a dedicated trajectory matching technique based on constrained linear optimization, as well as a novel calibration technique that handles wireless tag-camera calibration and hyperparameter tuning for the UKF, which mitigates the non-line-of-sight (NLoS) issue in wireless localization. We implemented the proposed opt-in camera system using ultra-wideband (UWB) devices and an off-the-shelf webcam. Experimental results demonstrate that our system can perform opt-in recording of individuals in real time at 10 fps, with reliable identification accuracy in crowds of 8–23 people in a confined space.

Abstract:
Event-related potentials (ERPs) are essential for the development of brain-computer interface (BCI) systems, particularly for their ability to facilitate communication by detecting specific brain activity patterns. To improve the detection accuracy of these signals, advanced filtering techniques are employed to enhance the signal-to-noise ratio (SNR), enabling more reliable classification of ERPs. This study evaluates the performance of three widely used filtering methods—Common Spatial Pattern (CSP), Common Spatio-Temporal Pattern (CSTP), and Max-SNR—in detecting the P300 component, a prominent ERP used in many BCI applications. Building upon the CSTP method, we propose a novel Max-SNR-based spatio-temporal filter designed to leverage both spatial and temporal features of the signal. The features extracted using these filters were classified with the Stepwise Linear Discriminant Analysis (SWLDA) classifier, a commonly adopted method in the BCI domain. Our results demonstrate that the proposed Max-SNR-based spatio-temporal filter outperformed other approaches, achieving an average classification accuracy of 96.0%. These findings highlight the potential of the proposed method to enhance P300 detection and improve the overall efficiency of BCI systems.

Abstract:
Magnetic microrobots (MMs) have emerged as promising tools for targeted therapies, including non-invasive in vivo treatments and precise drug delivery, owing to their untethered controllability and biocompatibility. Current actuation strategies for MMs primarily rely on two magnetic field (MF) generation approaches: gradient-based and rotational methods. Unlike the gradient method, rotational actuation enables efficient manipulation of MMs under significantly weaker magnetic fields. To fully leverage the potential of rotationally driven MMs, a comprehensive understanding of their fundamental spin motility is essential. Achieving accurate characterization of these MMs necessitates the development of an MF generation system equipped with rapid motion-tracking and broad-range measurement capabilities. This study proposes a high-speed rotating states observation scheme by developing a tracking-based optimal local imaging and estimation scheme, simultaneously meeting the broad-range observation capability and the high imaging speed requirement. Specifically, the CSR-DCF tracking method is adopted to detect the MM’s location, and based on this, the observation system adjusts the imaging region optimally. An estimation scheme based on the asynchronous rectification method is derived to measure the MM rotating states consistently using measured MF data and local optical images of the target. Experimental studies are carried out to validate the effectiveness of the proposed scheme.

Abstract:
Humanoid robots are vital tools for substituting humans in various operational scenarios. A sufficiently large stationary reachability is a key factor in ensuring their operational capability. To address this challenge, this paper proposes a whole-body reachability enhancing approach for humanoid robots based on gradient optimization, referred to as Gradient-based Hierarchical Optimization Whole-Body Control (GHO-WBC). The goal of the proposed approach is to extend the end-effector reachability of the humanoid robot while maintaining its stationary state. The proposed approach first derives the gradient of the robot’s whole-body center of mass (CoM) position, ensuring stationary stability across extreme reachable ranges. Next, the gradient of the key joint segment singularity is derived to achieve the stability of the humanoid robot’s end effector at extreme operational distances. Finally, a multi-level optimization approach is employed to compute a feasible solution for the whole-body joint kinematics, and experimental validation is conducted on the humanoid robot. Compared to the conventional whole-body control optimization approach, the present approach improves the reachable range by more than 89%.

Abstract:
Depth estimation is crucial for intelligent systems, enabling applications from autonomous navigation to augmented reality. While traditional stereo and active depth sensors have limitations in cost, power, and robustness, dual-pixel (DP) technology, ubiquitous in modern cameras, offers a compelling alternative. This paper introduces DiFuse-Net, a novel modality-decoupled network design for disentangled RGB- and DP-based depth estimation. DiFuse-Net features a window bi-directional parallax attention mechanism (WBiPAM) specifically designed to capture the subtle DP disparity cues unique to smartphone cameras with small apertures. A separate encoder extracts contextual information from the RGB image, and these features are fused to enhance depth prediction. We also propose a Cross-modal Transfer Learning (CmTL) mechanism that utilizes large-scale RGB-D datasets from the literature to cope with the difficulty of obtaining large-scale RGB-DP-D datasets. Our evaluation and comparison of the proposed method demonstrate its superiority over the DP and stereo-based baseline methods. Additionally, we contribute a new, high-quality, real-world RGB-DP-D training dataset, named the Dual-Camera Dual-Pixel (DCDP) dataset, created using our novel symmetric stereo camera hardware setup, stereo calibration and rectification protocol, and AI stereo disparity estimation method.

Abstract:
We propose VISO-Grasp, a novel vision-language-informed system designed to systematically address visibility constraints for grasping in severely occluded environments. By leveraging Foundation Models (FMs) for spatial reasoning and active view planning, our framework constructs and updates an instance-centric representation of spatial relationships, enhancing grasp success under challenging occlusions. Furthermore, this representation facilitates active Next-Best-View (NBV) planning and optimizes sequential grasping strategies when direct grasping is infeasible. Additionally, we introduce a multi-view uncertainty-driven grasp fusion mechanism that refines grasp confidence and directional uncertainty in real-time, ensuring robust and stable grasp execution. Extensive real-world experiments demonstrate that VISO-Grasp achieves a success rate of 87.5% in target-oriented grasping with the fewest grasp attempts, outperforming baselines. To the best of our knowledge, VISO-Grasp is the first unified framework integrating FMs into target-aware active view planning and 6-DoF grasping in environments with severe occlusions and entire invisibility constraints. Code is available at: https://github.com/YitianShi/vMF-Contact

Abstract:
Recently, with the development of Large Language Models (LLMs), Embodied AI represented by Vision-Language-Action Models (VLAs) has played a significant role in realizing natural language interaction between humans and robots. Current VLA models can process and understand visual information and language instructions, while guiding robots to complete interactive tasks with the environment based on human language instructions. However, when tackling real-time and dynamic tasks, VLAs show poor robustness and limited real-time planning and adjustment ability against changes in target objects, instructions, and environments. To address these limitations, we propose VLIN-RL, a unified framework that consists of the Vision-Language Interpreter (VLIN), which offers excellent vision-language understanding and advanced task planning abilities, and a reinforcement learning (RL)-based motion planner with enhanced flexibility and broader applicability. If the environmental state changes during task execution, the RL planning module in VLIN-RL directly makes dynamic adjustments at the subtask level based on visual feedback to achieve the task goals, without the need for time-consuming reprocessing by VLIN. Experiments demonstrate that our model can complete multi-robot manipulation tasks more efficiently and stably. Finally, our work is verified on pick-grasp tasks and in experiments with real manipulators. The test video is available at https://github.com/jzwsoulferryman/VLIN-RL.git.

Abstract:
Soft robotic wearables, with lightweight and flexible actuation, have shown promising results in assistive applications. However, it remains unclear whether they can be made fully comfortable and suitable for everyday use. In this study, we introduce a clothing-type soft robotic wearable embedded with shape memory alloy (SMA) based artificial muscle fibers for ankle plantarflexion assistance. We conducted force characterization of SMA wires, analyzing the effects of thickness, strain, applied current, and number of wires. The actuator was designed to achieve an assistive force of 80 N, and a closed-loop PI controller was implemented to enable precise force and displacement control. Finally, we demonstrate that the developed system can deliver controlled and repeatable forces in bench tests and modulate peak force according to target assistive levels on the ankle during walking in a user study.
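
As a rough illustration of the closed-loop PI force control mentioned above, the following minimal sketch tracks an 80 N setpoint. The first-order actuator model, the gains `kp` and `ki`, the time constant `tau`, and the function name are all illustrative assumptions; they are not taken from the study and do not model real SMA dynamics.

```python
def simulate_pi_force_control(target_n=80.0, kp=0.8, ki=4.0,
                              dt=0.01, steps=2000, tau=0.5):
    """Track a force setpoint with a discrete PI controller.

    The actuator is modeled as a first-order lag with time constant `tau`.
    This plant, and the gains, are assumptions made for illustration only.
    """
    force = 0.0      # measured force in newtons
    integral = 0.0   # accumulated error (N*s)
    history = []
    for _ in range(steps):
        error = target_n - force
        integral += error * dt
        command = kp * error + ki * integral  # PI control law
        # Assumed first-order plant: force relaxes toward the command.
        force += (command - force) * dt / tau
        history.append(force)
    return history


if __name__ == "__main__":
    trace = simulate_pi_force_control()
    print(f"final force: {trace[-1]:.2f} N")
```

The integral term is what removes the steady-state error here: with proportional action alone, this assumed plant would settle below the setpoint.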

Abstract:
This work proposes a path generation policy for self-actuating soft grippers by converting external deformation conditions into intrinsic load conditions. This transformation enables anisotropic material orientation control of functional materials (which can deform under stimuli), aligning with the deformation requirements of soft grippers to enhance controllability. Given a desired deformation for an arbitrary geometry, finite element method (FEM) analysis is used to determine the internal stress distribution. The second-order stress tensor is transformed into a traction vector field, guiding the alignment of material anisotropy for optimal deformation. A computational framework is developed to generate smooth, continuous printing paths by integrating along the vector field, ensuring internal morphology control of the target geometry. The proposed method is validated through FEM analysis, demonstrating a positional deviation rate of under 5% relative to the largest geometric feature in each test case across various tested shapes and deformation conditions. The results demonstrate that the algorithm effectively generates 4D printing paths that enable soft grippers to achieve target deformations with a high matching rate.
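
The two steps described above, turning a stress tensor into a traction vector and integrating a path along the resulting field, can be sketched in 2D as follows. The sketch uses Cauchy's relation t = σ·n with a fixed reference normal and fixed-length Euler steps. The choice of normal, the integrator, and the analytic `example_field` are assumptions for illustration, since the abstract does not specify them.

```python
import math


def traction(stress, normal):
    """Cauchy traction t = sigma . n for a 2x2 stress tensor and a unit normal."""
    (sxx, sxy), (syx, syy) = stress
    nx, ny = normal
    return (sxx * nx + sxy * ny, syx * nx + syy * ny)


def trace_path(field, start, step=0.05, n_steps=100):
    """Integrate a path along a 2D vector field using fixed-length Euler steps."""
    x, y = start
    path = [(x, y)]
    for _ in range(n_steps):
        vx, vy = field(x, y)
        norm = math.hypot(vx, vy)
        if norm < 1e-12:   # stop where the field vanishes
            break
        x += step * vx / norm
        y += step * vy / norm
        path.append((x, y))
    return path


def example_field(x, y):
    """Hypothetical stress state: tension along x plus shear that grows with y."""
    stress = ((1.0, 0.2 * y), (0.2 * y, 0.0))
    return traction(stress, (1.0, 0.0))  # assumed fixed reference normal


if __name__ == "__main__":
    path = trace_path(example_field, start=(0.0, 1.0))
    print(len(path), path[-1])
```

In practice the field would come from interpolated FEM stresses rather than an analytic function, and a higher-order integrator would give smoother paths.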

Abstract:
Haptic interaction plays a crucial role in enhancing realism and immersion in virtual environments, particularly in applications such as robotic rehabilitation. In these environments, therapy based on serious games that combine VR and haptic feedback promises to offer engaging, affordable and reliable treatments. Most 6-DoF haptic methods have focused on precision in the kinematic fidelity of the haptic feedback. However, rehabilitation games for the upper limb focus more on the forces exerted by the patient than on the kinematic precision of the contacts with the virtual environment, since these exerted forces are what drive the neuromuscular recovery of lost movement. We believe that a haptic control method focused on providing stable, compliant, and safe collisions is required to further the research into VR rehabilitation gamification. In addition, existing solutions often rely on specifically tailored simulators running at very high frequencies. In contrast, commercial game engines, which bring the most capabilities for gamification, run at much lower frequencies (≃50 Hz). In this work, we present a novel approach that integrates god object and penalty-based methods to achieve a balance between stability and computational efficiency, enabling 6-DoF haptic force generation on a collaborative robot. This is achieved through a relaxed quasi-god object simulation in the game engine and geodesic constraints in the robot’s haptic loop.

Abstract:
This paper presents a dynamic walking corridor generation (DWCG) algorithm designed to enhance navigation safety for visually impaired individuals in crowded pedestrian environments. Current physical human-robot interaction (pHRI) systems struggle with random pedestrian movements and interaction disturbances in such settings. To address these limitations, we propose a safety-critical framework that integrates Safe Flight Corridor concepts with pedestrian dynamics modeling. The method constructs time-varying Safe Walking Corridors (SWCs) through convex polyhedra decomposition, constrained by social force model predictions. Simulation experiments demonstrate a 100% success rate in moderate crowds (50 pedestrians or fewer) with 10.1 ms average computation time, and 86.3% success in high-density environments (100 pedestrians), establishing a foundation for reliable assistive navigation systems in complex urban settings.

Abstract:
The visual perception system of unmanned surface vessels (USVs) is often subjected to various adversarial attacks (e.g., lens stains, sun glare, ship painting, etc.), impacting the safety of autonomous navigation in maritime environments. To enhance the reliability and robustness of situational awareness in complex environments, we propose a defensive model to effectively counteract multiple attacks targeting the perception system. Specifically, we first construct a maritime instance segmentation dataset including various adversarial attack samples, with accurate annotations for the sky, water, land, ships and obstacles. To address the degradation in perception accuracy caused by adversarial attacks, we introduce a Monte Carlo-based random fusion module (MC Fusion) to enhance the adaptability of USVs in various dynamic environments. Additionally, as USVs are typically equipped with an onboard PC with limited computing resources, we incorporate the lightweight universal inverted bottleneck (UIB) module into the backbone to ensure effective feature extraction while reducing model parameters. Finally, we conduct comparative experiments under various adversarial attack scenarios. Our results demonstrate that, even in the presence of multiple adversarial attacks, our method improves ship detection accuracy by 13.9% and increases the mean accuracy of segmentation masks by over 10% compared to state-of-the-art models, enhancing the safety of USVs in navigation. The source code and datasets are available at https://github.com/huangyanh/ADNet.

Abstract:
Multi-agent cooperative SLAM often encounters challenges in similar indoor environments characterized by repetitive structures, such as corridors and rooms. These challenges can lead to significant inaccuracies in shared location identification when employing point cloud-based techniques. To mitigate these issues, we introduce TWC-SLAM, a multi-agent cooperative SLAM framework that integrates text semantics and WiFi signal features to enhance location identification and loop closure detection. TWC-SLAM comprises a single-agent front-end odometry module based on FAST-LIO2, a location identification and loop closure detection module that leverages text semantics and WiFi features, and a global mapping module. The agents are equipped with sensors capable of capturing textual information and detecting WiFi signals. By correlating these data sources, TWC-SLAM establishes a common location, facilitating point cloud alignment across different agents’ maps. Furthermore, the system employs loop closure detection and optimization modules to achieve global optimization and cohesive mapping. We evaluated our approach using an indoor dataset featuring similar corridors, rooms, and text signs. The results demonstrate that TWC-SLAM significantly improves the performance of cooperative SLAM systems in complex environments with repetitive architectural features.

Abstract:
Sensor simulation is pivotal for scalable validation of autonomous driving systems, yet existing Neural Radiance Fields (NeRF) based methods face applicability and efficiency challenges in industrial workflows. This paper introduces a Gaussian Splatting (GS) based system to address these challenges: We first break down sensor simulator components and analyze the possible advantages of GS over NeRF. Then in practice, we refactor three crucial components through GS, to leverage its explicit scene representation and real-time rendering: (1) choosing the 2D neural Gaussian representation for physics-compliant scene and sensor modeling, (2) proposing a scene editing pipeline to leverage Gaussian primitives library for data augmentation, and (3) coupling a controllable diffusion model for scene expansion and harmonization. We implement this framework on a proprietary autonomous driving dataset supporting cameras and LiDAR sensors. We demonstrate through ablation studies that our approach reduces frame-wise simulation latency, achieves better geometric and photometric consistency, and enables interpretable explicit scene editing and expansion. Furthermore, we showcase how integrating such a GS-based sensor simulator with traffic and dynamic simulators enables full-stack testing of end-to-end autonomy algorithms. Our work provides both algorithmic insights and practical validation, establishing GS as a cornerstone for industrial-grade sensor simulation.

Abstract:
We present a novel machine learning framework and synthetic dataset for performing absolute localization on planetary surfaces where satellite navigation systems are unavailable. Current approaches involve manual surface-to-satellite image matching by human rover operators, limiting the rate of planetary exploration and scientific utilization. Our framework leverages deep neural networks to perform image similarity matching between a rover’s onboard cameras and corresponding ground-view images from a digital twin environment created from satellite and elevation maps. The rover views, satellite, and elevation maps are taken from a photorealistic lunar environment simulated in a 3D graphics engine (Unreal Engine 4). The synthetic twin ground-view re-projections are generated using an open-source 3D graphics software (Blender). In total, we generate a dataset of 1.68 million images at 210,000 locations. The images and corresponding metadata are then used to train a DINOv2 vision transformer image similarity model through supervised fine-tuning to determine matching locations between the rover views and candidate re-projections. Through this method, our model is able to determine the ground truth location within 5 m using just 2.5% of the search space, outperforming other deep learning and classical image comparison benchmarks.

Abstract:
LiDAR-based place recognition (LPR) is a key component for autonomous driving, and its resilience to environmental corruption is critical for safety in high-stakes applications. While state-of-the-art (SOTA) LPR methods perform well in clean weather, they still struggle with weather-induced corruption commonly encountered in driving scenarios. To tackle this, we propose ResLPRNet, a novel LiDAR data restoration network that largely enhances LPR performance under adverse weather by restoring corrupted LiDAR scans using a wavelet transform-based network. ResLPRNet is efficient, lightweight and can be integrated plug-and-play with pretrained LPR models without substantial additional computational cost. Given the lack of LPR datasets under adverse weather, we introduce ResLPR, a novel benchmark that examines SOTA LPR methods under a wide range of LiDAR distortions induced by severe snow, fog, and rain conditions. Experiments on our proposed WeatherKITTI and WeatherNCLT datasets demonstrate the resilience and notable gains achieved by using our restoration method with multiple LPR approaches in challenging weather scenarios. Our code and benchmark are publicly available here: https://github.com/nubot-nudt/ResLPR.

Abstract:
A number of recent studies have focused on developing surgical simulation platforms to train machine learning (ML) agents or models with synthetic data for surgical assistance. While existing platforms excel at tasks such as rigid body manipulation and soft body deformation, they struggle to simulate more complex soft body behaviors like cutting and suturing. A key challenge lies in modeling soft body fracture and splitting using the finite-element method (FEM), which is the predominant approach in current platforms. Additionally, the two-way suture needle/thread contact inside a soft body is further complicated when using FEM. In this work, we use the material point method (MPM) for such challenging simulations and propose new rigid geometries and soft-rigid contact methods specifically designed for them. We introduce CRESSim-MPM, a GPU-accelerated MPM library that integrates multiple MPM solvers and incorporates surgical geometries for cutting and suturing, serving as a specialized physics engine for surgical applications. It is further integrated into Unity, requiring minimal modifications to existing projects for soft body simulation. We demonstrate the simulator’s capabilities in real-time simulation of cutting and suturing on soft tissue and provide an initial performance evaluation of different MPM solvers when simulating varying numbers of particles. The source code is available at https://github.com/yafei-ou/CRESSim-MPM.

Abstract:
In minimally invasive robotic thoracic surgery, the unavoidable respiratory motion of the patient causes lung lesions to move and deform, making precise tumor localization a significant challenge for surgeons. To address this, we introduce an RDDM (Recursive Deformable Diffusion Model)-based framework designed for real-time intraoperative tumor tracking, which can be used for registration and navigation in robot-assisted thoracic surgery. The RDDM reduces training complexity and enhances dataset utilization by employing a simplified DDM (Diffusion Deformable Model) iteratively, significantly lowering computational demands while maximizing the extraction of valuable information from limited 4D-CT (four-dimensional computed tomography) datasets. Considering the robustness required for intraoperative registration and navigation, we incorporate an ICP (Iterative Closest Point)-based point cloud registration method into the framework and validate our approach using publicly available datasets and volunteer trials. This innovation has the potential to reduce radiation exposure, trauma, and the risk of complications for patients undergoing minimally invasive thoracic surgery, and enables downstream tasks such as RAPNB (robot-assisted percutaneous needle biopsy) and radiation therapy.

Abstract:
This paper presents HyperGraph ROS, an open-source robot operating system that unifies intra-process, inter-process, and cross-device computation into a computational hypergraph for efficient message passing and parallel execution. In order to optimize communication, HyperGraph ROS dynamically selects the optimal communication mechanism while maintaining a consistent API. For intra-process messages, Intel-TBB Flow Graph is used with C++ pointer passing, which ensures zero memory copying and instant delivery. Meanwhile, inter-process and cross-device communication seamlessly switch to ZeroMQ. When a node receives a message from any source, it is immediately activated and scheduled for parallel execution by Intel-TBB. The computational hypergraph consists of nodes represented by TBB flow graph nodes and edges formed by TBB pointer-based connections for intra-process communication, as well as ZeroMQ links for inter-process and cross-device communication. This structure enables seamless distributed parallelism. Additionally, HyperGraph ROS provides ROS-like utilities such as a parameter server, a coordinate transformation tree, and visualization tools. Evaluation in diverse robotic scenarios demonstrates significantly higher transmission and throughput efficiency compared to ROS 2. Our work is available at https://github.com/wujiazheng2020a/hyper_graph_ros.

Abstract:
Auto-calibration of the rotor thrust coefficient and estimation of ground effect are both challenging aspects of multirotor dynamics control and planning. Conventional approaches address these issues separately and typically rely on experimental rigs for bench testing. In this paper, we propose a low-cost onboard sensor array and a well-designed unified algorithm to enable fast auto-calibration of the rotor thrust coefficient alongside ground effect estimation (TRACER). Our sensor array consists of four force-sensitive resistors compactly placed under the quadrotor landing gear to measure contact force during the lift-off phase, capturing changes in thrust. The joint calibration and estimation problem is formulated to rapidly decouple ground effect influence from free-air rotor thrust based on sensor readings. Furthermore, our approach is adaptable to various flight control input formats (duty cycle, rotor throttle, or rpm), ensuring general applicability across different multirotor operations. Experimental results demonstrate that the proposed method provides reliable joint calibration and estimation of rotor thrust and ground effect in a short lift-off process, achieving less than 10% mean absolute percentage error compared to the ground truth.
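
To make the calibration idea concrete, here is a deliberately simplified sketch that fits a thrust coefficient from contact-force readings during lift-off and scores it with the mean absolute percentage error. It assumes the common quadratic model T = k·u² for an rpm input u and the vertical balance N = W − T while the vehicle is still on the ground. It ignores the ground-effect term that the paper estimates jointly, and every name and number is hypothetical.

```python
def fit_thrust_coefficient(rpm, contact_force, weight):
    """Least-squares fit of k in T = k * u**2 from ground contact forces.

    Assumes thrust equals weight minus the summed contact force (N = W - T).
    Ground effect is NOT modeled here; this is a simplification.
    """
    thrust = [weight - f for f in contact_force]
    numerator = sum((u ** 2) * t for u, t in zip(rpm, thrust))
    denominator = sum(u ** 4 for u in rpm)
    return numerator / denominator


def mape(estimates, truths):
    """Mean absolute percentage error, in percent."""
    total = sum(abs((e - t) / t) for e, t in zip(estimates, truths))
    return 100.0 * total / len(truths)


if __name__ == "__main__":
    weight = 10.0                      # hypothetical vehicle weight in newtons
    rpm = [2000.0, 4000.0, 6000.0, 8000.0]
    k_true = 1e-7                      # hypothetical coefficient, N / rpm^2
    contact = [weight - k_true * u ** 2 for u in rpm]  # synthetic readings
    k_hat = fit_thrust_coefficient(rpm, contact, weight)
    predicted = [k_hat * u ** 2 for u in rpm]
    actual = [k_true * u ** 2 for u in rpm]
    print(k_hat, mape(predicted, actual))
```

With other input formats (duty cycle or throttle), the regressor `u ** 2` would need to be replaced by whatever input-to-thrust relation is assumed for that format.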

Abstract:
Medical tracking is a significant issue in vision-based robotic-assisted surgical navigation, especially for distal locking of intramedullary nails. Existing solutions face limitations such as high manufacturing costs for targets, complex tracking schemes, and low positioning precision. This paper proposes a novel method to estimate the pose of medical targets through pre-calibration and the Perspective-n-Point (PnP) algorithm, which determines the position of the distal intramedullary nail hole and projects this position onto the monitor. The precision of medical target positioning is highly affected by the distortion coefficient of camera internal parameters. To address this, we design and construct a distortion compensation model to reduce its impact on positioning precision. Additionally, to mitigate the effect of illumination variations, automatic exposure of polarized vision is utilized. Through 50 reprojection experiments, the proposed distortion model achieves a positioning precision of 0.284 mm at a working distance of one meter, significantly outperforming the 0.4 mm precision of the division model and the 0.426 mm precision of the polynomial model, with relative improvements of 30% and 34.2%, respectively. This method enhances the accuracy and reliability of robot-assisted surgical navigation, facilitating more precise and efficient surgical procedures.

Abstract:
Datasets for object detection often do not account for enough variety of glasses, due to their transparent and reflective properties. Specifically, open-vocabulary object detectors, widely used in embodied robotic agents, fail to distinguish subclasses of glasses. This scientific gap poses an issue for robotic applications that suffer from accumulating errors between detection, planning, and action execution. This paper introduces a novel method for acquiring real-world data from RGB-D sensors that minimizes human effort. We propose an auto-labeling pipeline that generates labels for all the acquired frames based on the depth measurements. We provide a novel real-world glass object dataset that was collected on the Neuro-Inspired COLlaborator (NICOL), a humanoid robot platform. The dataset consists of 7850 images recorded from five different cameras. We show that our trained baseline model outperforms state-of-the-art open-vocabulary approaches. In addition, we deploy our baseline model in an embodied agent approach to the NICOL platform, on which it achieves a success rate of 81% in a human-robot bartending scenario.

Abstract:
This study addresses the challenge of robust object detection in maritime environments, where dynamic conditions such as fog, brightness variations, and motion blur can degrade accuracy. We propose a novel framework, Joint Semantic Learning (JSL), which combines ocean scene segmentation and object detection to improve both performance and robustness. JSL incorporates the ocean scene segmentation module into the detection network during training and removes it during inference, ensuring no additional computational overhead. Through ocean scene segmentation, the feature extractor learns to understand the overall context of the image and extract detailed information about objects. Extensive experiments show that JSL, applied to various convolutional neural network-based detectors, achieves significant performance improvements on maritime datasets SMD and SeaShips. Notably, the proposed method shows substantial performance gains on the SMD-C and SeaShips-C datasets, which include adverse conditions, demonstrating the robustness of the proposed method. Furthermore, experiments comparing our method with existing state-of-the-art multi-task methods on the Cityscapes dataset validate its effectiveness in generalizing to urban environments. The efficient integration of spatial and semantic information of JSL ensures accurate and reliable object detection across diverse applications. Our code is available at: https://github.com/gistailab/JSL.

Abstract:
Motion forecasting represents a critical challenge in autonomous driving systems, requiring accurate prediction of surrounding agents’ future trajectories. While existing approaches predict future motion states with the extracted scene context feature from historical agent trajectories and road layouts, they suffer from the information degradation during the scene feature encoding. To address the limitation, we propose HAMF, a novel motion forecasting framework that learns future motion representations with the scene context encoding jointly, to coherently combine the scene understanding and future motion state prediction. We first embed the observed agent states and map information into 1D token sequences, together with the target multi-modal future motion features as a set of learnable tokens. Then we design a unified Attention-based encoder, which synergistically combines self-attention and cross-attention mechanisms to model the scene context information and aggregate future motion features jointly. Complementing the encoder, we implement the Mamba module in the decoding stage to further preserve the consistency and correlations among the learned future motion representations, to generate the accurate and diverse final trajectories. Extensive experiments on Argoverse 2 benchmark demonstrate that our hybrid Attention-Mamba model achieves state-of-the-art motion forecasting performance with the simple and lightweight architecture.

Abstract:
Multirotor Uncrewed Aerial Vehicles (UAVs) have recently become an important instrument for the magnetic method for mineral exploration (MMME), enabling more effective and accurate geological investigations. This paper explores the difficulties in mounting high-sensitivity sensors on a UAV platform, including electromagnetic interference, payload dynamics, and maintaining stable sensor performance while in flight. We highlight how the specific solutions provided to deal with these problems have the potential to alter UAV-assisted data collection for the MMME. The work also presents experimental findings that demonstrate the potential of these solutions in UAV-based data collection for the MMME, leading to improvements in effective mineral exploration through careful design, testing, and assessment of these systems. These innovations resulted in a platform that is quickly deployable in remote areas and able to operate more efficiently compared to traditional crewed aircraft or multirotor UAVs while still producing equal or higher quality results. This allows for much higher efficiency and lower operating costs for high production UAV-based data collection for the MMME.

Abstract:
The presence of Non-Line-of-Sight (NLoS) blind spots resulting from roadside parking in urban environments poses a significant challenge to road safety, particularly due to the sudden emergence of pedestrians. mmWave technology leverages diffraction and reflection to observe NLoS regions, and recent studies have demonstrated its potential for detecting obscured objects. However, existing approaches predominantly rely on predefined spatial information or assume simple wall reflections, thereby limiting their generalizability and practical applicability. A particular challenge arises in scenarios where pedestrians suddenly appear from between parked vehicles, as these parked vehicles act as temporary spatial obstructions. Furthermore, since parked vehicles are dynamic and may relocate over time, spatial information obtained from satellite maps or other predefined sources may not accurately reflect real-time road conditions, leading to erroneous sensor interpretations. To address this limitation, we propose an NLoS pedestrian localization framework that integrates monocular camera images with 2D radar point cloud (PCD) data. The proposed method first detects parked vehicles through image segmentation, estimates depth to infer approximate spatial characteristics, and subsequently refines this information using 2D radar PCD to achieve precise spatial inference. Experimental evaluations conducted in real-world urban road environments demonstrate that the proposed approach enhances early pedestrian detection and contributes to improved road safety. Supplementary materials are available at https://hiyeun.github.io/NLoS/.

Abstract:
In this paper, we present a novel curve-to-surface registration method, termed Bi-directional Hybrid Mixture Model Registration based on Dual-constrained Tangent and Normal Vectors (BiHMM-DTN), where two different tangent vectors at the intraoperative point are simultaneously used with the normal vector at the corresponding preoperative point to construct the geometric constraints. While hybrid registration models incorporating tangent and normal vectors (HMM-TN) demonstrate success, their geometric constraints prove inadequate or inappropriate for sparse intraoperative point sets, frequently yielding suboptimal optimization outcomes. By critically revisiting the geometric constraints of HMM-TN, we propose a dual-constraints-based hybrid mixture model registration framework with enhanced intraoperative point set acquisition protocols. To deal with noise and outliers in preoperative and intraoperative point sets—caused by reconstruction inaccuracies and tracking errors, respectively—our approach employs a bi-directional registration mechanism for curve-to-surface registration. We provide rigorous proofs validating the geometric completeness of the dual constraints within this mechanism. The BiHMM-DTN framework is formulated as a maximum likelihood estimation (MLE) problem and optimized using an expectation-maximization (EM) algorithm. Furthermore, to enhance convergence stability and accelerate optimization, the rotation matrix is updated iteratively through successive incremental steps. Extensive experiments on human femur and hip models demonstrate that our method outperforms state-of-the-art approaches, including both traditional optimization and deep learning methods, under various noise and outlier conditions. Furthermore, real-world phantom experiments highlight the potential clinical value of our method for surgical navigation applications. The codes and data are available at https://github.com/sam-zyzhang/BiHMM-DTN.git.

Abstract:
Regular temperature measurement of critical parts of an annealing furnace has always been a difficult task. Due to the harsh environment of high temperature, high noise, and darkness in the annealing furnace operation area, unmanned vehicles equipped with the RGB-T semantic segmentation model are usually adopted in most factories for inspection. However, existing RGB-T semantic segmentation models usually rely on good lighting or thermal conditions, which are generally difficult to fulfill in annealing furnace operation areas. In this paper, we propose a new hierarchical fusion-based semantic enhancement network, HFSENet. We first adopt the two-stream structure and the siamese structure to extract the low-level and high-level features of unimodal modalities, respectively. Then, considering the differences between the features in different hierarchical levels, we introduce a novel low-level feature spatial fusion module and a high-level feature channel fusion module to perform the multi-modal feature hierarchical fusion. On this basis, we also propose the semantic feature complementary enhancement module, which utilizes the appearance information set and object information set extracted from RGB and thermal infrared (TIR) branches to enhance the fused features and give them more semantic information. Finally, segmentation results with refined edges are obtained by an edge refinement decoder that includes a local search extraction module. The unmanned inspection vehicle we built with the proposed HFSENet has successfully passed the test, and the recognition performance of the four targets exceeds the current state-of-the-art (SOTA) method on our homemade annealing furnace operation area dataset.

Abstract:
The multi-agent pickup and delivery problem is central to coordinating multiple agents in real-world applications such as warehouse automation, urban logistics, and robotic delivery networks, where efficient task assignment and pathfinding are vital for maximizing production efficiency. However, existing approaches often struggle to seamlessly integrate task allocation with path planning while also failing to address the demands of continuous pickup and delivery tasks, resulting in suboptimal performance and limited scalability in dynamic environments. To address these problems, we first introduce a novel task allocation approach, which constructs a cost matrix to satisfy pickup and delivery timing constraints for tasks and employs a Mixed-Integer Linear Programming (MILP) model to compute a task assignment matrix queue. Next, the CBS-TAPF framework is proposed, which constructs search forests for tasks and paths to address the joint optimization of task allocation and path planning. This framework is further extended to Continuous Multi-Agent Pickup and Delivery (CMAPD) tasks by dynamically updating the task allocation matrix queue, enhancing robustness and adaptability for real-world, sustained scenarios. Finally, through simulation and real-world experiments, we validated the effectiveness of the proposed methods. The experimental results demonstrate its superiority across diverse environments, ensuring robust performance in various operational scenarios.
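
The cost-matrix view of task allocation can be illustrated with a toy example. The sketch below builds a cost matrix from Manhattan travel distances (agent to pickup, then pickup to delivery) and finds the cheapest one-to-one assignment by exhaustive search. The exhaustive search is a stand-in for the MILP model in the abstract, and the cost definition, the function names, and the grid coordinates are all assumptions for illustration.

```python
from itertools import permutations


def manhattan(a, b):
    """Manhattan distance between two grid cells."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def build_cost_matrix(agents, tasks):
    """cost[i][j] = travel of agent i to task j's pickup, then on to its delivery."""
    return [
        [manhattan(agent, pickup) + manhattan(pickup, delivery)
         for pickup, delivery in tasks]
        for agent in agents
    ]


def best_assignment(cost):
    """Exhaustively search one-to-one assignments (square matrices only).

    This brute-force search replaces the MILP solver for illustration and
    only scales to a handful of agents.
    """
    n = len(cost)
    best = min(
        permutations(range(n)),
        key=lambda perm: sum(cost[i][perm[i]] for i in range(n)),
    )
    return list(best), sum(cost[i][best[i]] for i in range(n))


if __name__ == "__main__":
    agents = [(0, 0), (5, 5)]
    tasks = [((5, 4), (5, 0)), ((0, 1), (2, 1))]  # (pickup, delivery) pairs
    cost = build_cost_matrix(agents, tasks)
    print(cost, best_assignment(cost))
```

A real allocator would also encode pickup and delivery timing constraints and collision-free path costs, which this sketch omits.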

Abstract:
Safety assurance using all perceptual information to predict the motion of dynamic agents is critical in urban environments and remains an open challenge. For Autonomous Vehicles (AV) operating around vulnerable road users, the risk assessment strategy often needs to address stochastic uncertainties in the multiple possible trajectories (or multimodal motion) of the surrounding traffic agents. However, this increases the complexity of the navigation problem using the existing planners. To address this issue, this paper presents a multi-level optimization strategy that combines sampling-based and direct optimization methods for decision-making and control with improved safety and trajectory smoothness. In the primary stage, a sampling-based optimization framework systematically identifies safe candidate trajectories by employing the Fusion of stochastic Predictive Inter-Distance Profile (F-sPIDP). F-sPIDP encapsulates the multimodal dynamics of traffic agents and explicitly computes the uncertainties in their estimated or tracked states. From the set of trajectories, a reference optimal trajectory and its F-sPIDP setpoints are selected, adhering to stringent safety constraints and motion smoothness. Subsequently, a secondary local control optimization refines the optimal trajectory to ensure compliance with the AV's kinematic and dynamic constraints while accounting for the quantified uncertainty within the F-sPIDP framework. The performance of the proposed method was assessed through simulations and statistical analyses, evaluating its robustness to diverse levels of uncertainty.

Abstract:
4D mmWave radar provides point clouds with range, azimuth, elevation, and Doppler velocity, and operates normally in severe weather conditions. However, due to wavelength characteristics, the noisy and sparse point cloud that 4D radar collects poses great challenges for SLAM research. In this paper, we propose VoxEKF-RIO, a 4D radar inertial odometry. VoxEKF-RIO filters out noisy points and estimates ego-velocity through a preprocessing module and maintains an incremental voxel map to represent the probabilistic models of environments. To improve the accuracy, a reliable scan-to-submap matching method is designed based on the voxel map, using a point filter to obtain valid points with reliable matches, and adopting a distribution-to-distribution matching distance. An iterative Kalman filter is used to fuse radar velocity, point cloud registration, and IMU data for estimating the platform’s motion. The experiments on publicly available 4D radar datasets demonstrate the reliability and high accuracy of VoxEKF-RIO. The ablation studies reveal the benefit of the voxel map in describing the environment characteristics and of the reliable matching method.

Abstract:
This paper proposes a novel passive dynamic walker with a body shape similar to an eight-legged rimless wheel that performs a natural swinging motion of the swing leg through storage and release of elastic energy. The generated motion is period-1 and asymptotically stable, but the inherent limit cycle stability is nontrivial because it does not achieve constraint on impact posture. Since it has almost linear dynamics, however, its walkability can be instantaneously determined using a linearized model without numerical integration. With the equations of linearized motion and exact collision, the step period and the state at the next collision can be obtained numerically and instantaneously using a bisection method based on the geometric constraint condition at impact. Then, by updating the state for each collision and repeating the same calculation, it is possible to instantaneously determine whether or not the walking motion continues stably for a long period of time. By comparing the results of this calculation with those of the numerical integration of the nonlinear and linearized models, the effectiveness of the proposed method is confirmed. Furthermore, using the proposed method, we analyze the period-doubling bifurcation phenomenon and the change in the singular values of the Poincaré map that occurs with the change in the elastic modulus.
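The idea of finding the step period by bisection on a geometric impact condition can be sketched with a toy model. The example below is not the walker from the abstract: it assumes a linearized inverted-pendulum stance phase with hypothetical parameters (`ALPHA`, `OMEGA`, `V0`) and solves for the time at which the stance angle reaches the impact angle, using the closed-form linear solution instead of numerical integration.

```python
import math


def bisect(f, lo, hi, tol=1e-10, max_iter=200):
    """Find a root of f in [lo, hi] by bisection; f(lo) and f(hi) must differ in sign."""
    f_lo, f_hi = f(lo), f(hi)
    if f_lo * f_hi > 0:
        raise ValueError("root is not bracketed by [lo, hi]")
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        f_mid = f(mid)
        if f_mid == 0 or (hi - lo) < tol:
            return mid
        if f_lo * f_mid < 0:
            hi = mid
        else:
            lo, f_lo = mid, f_mid
    return 0.5 * (lo + hi)


# Hypothetical toy model: theta'' = OMEGA^2 * theta (linearized inverted pendulum).
ALPHA = 0.2                    # half inter-leg angle [rad], assumed
OMEGA = math.sqrt(9.81 / 1.0)  # sqrt(g / leg length), assumed leg length 1 m
V0 = 1.5                       # angular velocity just after impact [rad/s], assumed


def stance_angle(t):
    """Closed-form solution of the linearized model, starting at theta(0) = -ALPHA."""
    return -ALPHA * math.cosh(OMEGA * t) + (V0 / OMEGA) * math.sinh(OMEGA * t)


def impact_residual(t):
    """Geometric constraint at impact: the stance angle equals +ALPHA."""
    return stance_angle(t) - ALPHA


step_time = bisect(impact_residual, 0.0, 1.0)
print(step_time)
```

Repeating this solve after applying a collision map to the state at `step_time` would give a step-by-step walkability check in the same spirit.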

Affiliations: Department of Mechanical and Industrial Engineering, University of Toronto, Canada; Department of Computer Science and the Medical Computer Vision and Robotics Lab, University of Toronto, Canada; Wilfred and Joyce Posluns Centre for Image Guided Innovation and Therapeutic Intervention, Hospital for Sick Children, Toronto, Canada; Mount Sinai Hospital, Toronto, Canada; Department of Computer Science, Johns Hopkins University, Baltimore, USA

Abstract:
Open Spina Bifida (OSB) is a congenital neural tube defect that affects approximately 1 in 1000 births worldwide. Robotic in-utero OSB repair provides a minimally invasive alternative to open surgery, which places significant strain on both baby and mother. Recent advancements in da Vinci miniature continuum tools reduce port sizes through the uterus for access to the fetus with lower maternal risk. However, idiosyncrasies in continuum tool behaviour further complicate an already difficult procedure. Consequently, a high-fidelity da Vinci OSB repair simulator is presented featuring continuum tools for surgeon skills training. The simulator incorporates a plugin for suture physics handling and soft body physics for deformable tissues, and implements haptic virtual fixtures for improved situational awareness during suturing. Quantitative validation demonstrated virtual tool accuracy, with a mean-squared continuum backbone error of 0.64 mm² and system-level end-effector trajectory errors averaging 3.25 mm for a helix tracing task. During suturing, high-fidelity performance was maintained. Four expert surgeons from relevant specialties provided positive qualitative feedback, reporting that the simulator accurately replicates real tool control and offers a realistic and valuable training experience. Ultimately, the simulator shows promise as a training platform for safer robotic in-utero OSB repair and for facilitating the adoption of novel continuum wristed tools in clinical settings.

Abstract:
Human crowd simulation in virtual reality (VR) is a powerful tool with potential applications including emergency evacuation training and assessment of building layout. While haptic feedback in VR enhances immersive experience, its effect on virtual walking behavior in dense and dynamic pedestrian flows is unknown. Through a user study, we investigated how haptic feedback changes user walking motion in crowded pedestrian flows in VR. The results indicate that haptic feedback changed users’ collision avoidance movements, as measured by increased walking trajectory length and change in pelvis angle. The displacements of users’ lateral position and pelvis angle were also increased in the instantaneous response to a collision with a non-player character (NPC), even when the NPC was inside the field of view. Haptic feedback also enhanced users’ awareness and visual exploration when an NPC approached from the side or back. Furthermore, variation in walking speed was increased by the haptic feedback. These results suggest that the haptic feedback enhances users’ sensitivity to collisions in VR environments.

Abstract:
In this paper, we consider the case of carrying an object with a quadruped-wheeled robot and examine and verify the angle planning method for the robot’s base link (load section) that suppresses the relative tilt of the carried object. It is important to suppress the tilt of the carried object relative to the base link. We formulated the angle of the base link at which the carried object does not begin to tilt relative to the base link. We examined an angle planning method for the base link that would suppress the relative tilting of the carried object. Then, we verified the method using a simulation and a quadruped-wheeled robot: MELEW-3 (Meiji Leg-Wheeled Robot - No.3). As a result, both in the simulation and on the actual robot, we succeeded in carrying the object without turning over, and the base link was tilted based on the planning method.

Abstract:
Unmanned aerial vehicles (UAVs) are critical in the automated inspection of wind turbine blades. Nevertheless, several issues persist in this domain. Firstly, existing inspection platforms encounter challenges in meeting the demands of automated inspection tasks and scenarios. Moreover, current blade stop angle estimation methods are vulnerable to environmental factors, restricting their robustness. Additionally, there is an absence of real-time blade detail prioritized exposure adjustment during capture, where lost details cannot be restored through post-optimization. To address these challenges, we introduce a platform and two approaches. Initially, a UAV inspection platform is presented to meet the automated inspection requirements. Subsequently, a Fermat point based blade stop angle estimation approach is introduced, achieving higher precision and success rates. Finally, we propose a blade detail prioritized exposure adjustment approach to ensure appropriate brightness and preserve details during image capture. Extensive tests, comprising over 120 flights across 10 wind turbine models in 5 operational wind farms, validate the effectiveness of the proposed approaches in enhancing inspection autonomy.

Abstract:
Flapping-wing aerial robots offer significant advantages over conventional multirotors, including lower noise signatures, higher energy efficiency, and enhanced maneuverability. Despite these benefits, their application in surveillance, particularly person detection and tracking, remains largely underexplored. This paper proposes a bioinspired framework for person detection and tracking, specifically designed for flapping-wing aerial robots. Drawing inspiration from the dual pathways in biological vision, our method integrates an event-by-event blob tracker with a more accurate but slower frame-based detector. The event-based tracker leverages the high temporal resolution and robustness to motion blur of event cameras, effectively compensating for the strong vibrations caused by the flapping strokes of these robots. The frame-based detection (implemented using a deep neural network) periodically corrects and enhances the event-based tracking estimates, globally achieving a balanced trade-off between accuracy, responsiveness, and computational cost. Evaluation with both multirotor and flapping-wing aerial robots validates the effectiveness and efficiency of the approach.

Abstract:
Impedance-based control represents a prevalent strategy in powered transfemoral prostheses because of its ability to reproduce natural walking. However, most existing studies have developed impedance-based prosthesis controllers for specific tasks, while creating a task-adaptive controller for variable-task walking continues to be a significant challenge. This article proposes a task-adaptive quasi-stiffness control framework for powered prostheses that generalizes across various walking tasks, enhancing the gait symmetry between the prosthesis and intact leg. A Gaussian Process Regression (GPR) model is introduced to predict the target features of the human joint’s angle and torque in a new task. Subsequently, a Kernelized Movement Primitives (KMP) model is employed to reconstruct the torque-angle relationship of the new task from multiple human reference trajectories and estimated target features. Based on the torque-angle relationship of the new task, a quasi-stiffness control approach is designed for a powered prosthesis. Finally, the proposed framework is validated through practical examples, including walking tasks at varying speeds and inclines. Notably, the proposed framework not only aligns with but frequently surpasses the performance of a benchmark finite state machine impedance controller (FSMIC) without necessitating manual impedance tuning and has the potential to expand to variable walking tasks in daily life for transfemoral amputees.

Abstract:
In recent years, significant progress has been made in the prototype design and control methodologies of modular snake robots. However, there is still relatively little research on the potential enabled by the active morphological transformation of robots. This paper presents a novel modular snake robot capable of morphing into a bipedal configuration. The robot, ZBOT, is composed of some independent and homogeneous unit modules (named ZBot) connected in series. Each ZBot module has a dual-motor-driven 1-DoF rotational joint, which can rotate continuously, provide a large output torque and achieve backlash elimination. There are four connection orientations between adjacent modules. This paper proposes an articulation configuration, which enables the snake robot to achieve the active transformation from a snake form to a bipedal form. Meanwhile, through reinforcement learning (RL), movements including the stand-up gait are trained and verified in the IsaacSim/Lab simulation environment. This research will advance snake robots beyond surface-dependent locomotion, endowing them with more possibilities, unlocking greater potential for versatile applications.

Abstract:
In this work, we propose a novel motion planning algorithm to facilitate safety-critical navigation for autonomous mobile robots. The proposed algorithm integrates a real-time dynamic obstacle tracking and mapping system that categorizes point clouds into dynamic and static components. For dynamic point clouds, the Kalman filter is employed to estimate and predict their motion states. Based on these predictions, we extrapolate the future states of dynamic point clouds, which are subsequently merged with static point clouds to construct the forward-time-domain (FTD) map. By combining control barrier functions (CBFs) with nonlinear model predictive control, the proposed algorithm enables the robot to effectively avoid both static and dynamic obstacles. The CBF constraints are formulated based on risk points identified through collision detection between the predicted future states and the FTD map. Experimental results from both simulated and real-world scenarios demonstrate the efficacy of the proposed algorithm in complex environments. In simulation experiments, the proposed algorithm is compared with two baseline approaches, showing superior performance in terms of safety and robustness in obstacle avoidance. The source code is released for the reference of the robotics community.
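A minimal sketch of the prediction step described above, assuming a one-dimensional constant-velocity motion model per tracked point. The model choice, the diagonal process noise, and the function names `predict` and `extrapolate` are assumptions for illustration and are not taken from the paper.

```python
def predict(state, cov, dt, process_noise):
    """Kalman filter prediction for a 1D constant-velocity model.

    state: (position, velocity)
    cov: 2x2 covariance as nested lists
    process_noise: (q_pos, q_vel), an assumed diagonal process noise
    Implements x' = F x and P' = F P F^T + Q with F = [[1, dt], [0, 1]].
    """
    pos, vel = state
    (p00, p01), (p10, p11) = cov
    q_pos, q_vel = process_noise
    new_state = (pos + vel * dt, vel)
    new_cov = [
        [p00 + dt * (p01 + p10) + dt * dt * p11 + q_pos, p01 + dt * p11],
        [p10 + dt * p11, p11 + q_vel],
    ]
    return new_state, new_cov


def extrapolate(state, dt, steps):
    """Future positions over a horizon, e.g. to populate a time-indexed map."""
    pos, vel = state
    return [pos + vel * dt * k for k in range(1, steps + 1)]


state, cov = predict((2.0, 0.5), [[1.0, 0.0], [0.0, 1.0]], 0.5, (0.0, 0.0))
future_positions = extrapolate((2.0, 0.5), 0.5, 3)
print(state, cov, future_positions)
```

The growing position variance in `cov` indicates how uncertain each extrapolated state is, which a planner could use to inflate safety margins.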

Abstract:
Improving ride comfort can help accelerate the adoption of autonomous vehicles (AVs). Unfortunately, very few studies directly consider comfort in the controller design and among the existing ones, most of them solely focus on instantaneous acceleration and jerk. Such an approach cannot fully capture ride comfort. In fact, the International Organization for Standardization (ISO) emphasizes that ride comfort should be evaluated based on acceleration patterns over time. To bridge this research gap, this study proposes a comfort-centric Model Predictive Control (MPC) framework that optimizes both tangential and lateral acceleration patterns for optimal 2D maneuvers, including longitudinal acceleration and steering rate. The framework is subsequently tested against turning trajectories from the Waymo Open Dataset. Results demonstrate that our approach improves ride comfort compared to the original Waymo trajectories. Here, more comfort improvement can be achieved at higher lateral acceleration, implying that the proposed MPC framework can lead to more gentle turning behaviors. These findings highlight the effectiveness of the proposed MPC framework in enhancing ride comfort.
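To illustrate why an over-time measure differs from an instantaneous one, the sketch below compares two acceleration profiles with the same peak value using a plain root-mean-square (RMS) over the window. This is an assumed, simplified comfort proxy: any frequency weighting that a standard may prescribe is omitted, and the function names and sample values are hypothetical.

```python
import math


def rms(samples):
    """Root-mean-square of a sequence of acceleration samples [m/s^2]."""
    if not samples:
        raise ValueError("samples must not be empty")
    return math.sqrt(sum(a * a for a in samples) / len(samples))


def planar_rms(tangential, lateral):
    """Combine per-axis RMS values into a single planar value (assumed equal weights)."""
    return math.sqrt(rms(tangential) ** 2 + rms(lateral) ** 2)


# Two hypothetical lateral acceleration profiles with the same peak of 2 m/s^2.
brief_spike = [0.0, 0.0, 0.0, 2.0, 0.0, 0.0, 0.0, 0.0]
sustained = [2.0] * 8

print(max(brief_spike), rms(brief_spike))  # same peak, low RMS
print(max(sustained), rms(sustained))      # same peak, high RMS
```

A cost built only on peak acceleration cannot tell these two profiles apart, while the windowed RMS can.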

Abstract:
Terrain classification is crucial for assessing terrain traversability and supporting locomotion control of legged robots. By integrating multi-source sensor information, including exteroceptive sensors and proprioceptive sensors, legged robots can acquire terrain geometric features and surface cover types. However, single-sensor approaches exhibit inherent limitations, where exteroceptive sensors are susceptible to environmental interference while proprioceptive sensors struggle to identify surface cover types. To address these challenges, this paper proposes a robust terrain classification framework that overcomes the limitations of single-modal perception through fusion of exteroceptive and proprioceptive sensors. The framework comprises a Golden Sine Optimization Algorithm-based random forest model using proprioceptive sensors to determine optimal hyperparameter combinations based on classification requirements, and a YOLOv11 network integrated with intersection over union object tracking algorithm to achieve stable image extraction during robot movement. Final terrain classification is accomplished through Kalman filter-based decision fusion. Experimental validation demonstrated classification accuracies of 94.4% for the proprioceptive module and 94.2% for the visual module in offline testing. In online fusion testing, the system achieved 95.9% overall classification accuracy, confirming the effectiveness and engineering practicality of the proposed method.
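The intersection-over-union (IoU) association used by such trackers can be sketched as follows. The greedy matching strategy, the threshold value, and the function names are assumptions for illustration; the paper's tracker may associate boxes differently.

```python
def iou(box_a, box_b):
    """Intersection over union of two axis-aligned boxes given as (x1, y1, x2, y2)."""
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    inter = inter_w * inter_h
    union = (ax2 - ax1) * (ay2 - ay1) + (bx2 - bx1) * (by2 - by1) - inter
    return inter / union if union > 0 else 0.0


def match_tracks(tracks, detections, threshold=0.3):
    """Greedily pair each track with its best unused detection at or above threshold."""
    matches = {}
    used = set()
    for t_idx, track in enumerate(tracks):
        best_score, best_d = threshold, None
        for d_idx, det in enumerate(detections):
            if d_idx in used:
                continue
            score = iou(track, det)
            if score >= best_score:
                best_score, best_d = score, d_idx
        if best_d is not None:
            matches[t_idx] = best_d
            used.add(best_d)
    return matches


# Hypothetical boxes from the previous frame (tracks) and the current frame.
tracks = [(0, 0, 2, 2), (10, 10, 12, 12)]
detections = [(10.5, 10, 12.5, 12), (0, 0.5, 2, 2.5)]
matches = match_tracks(tracks, detections)
print(matches)
```

Keeping the same track identity across frames is what allows a stable image region to be extracted while the robot moves.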

Abstract:
This paper studies the experimental comparison of two different whole-body control formulations for humanoid robots: inverse dynamics whole-body control (ID-WBC) and passivity-based whole-body control (PB-WBC). The two controllers fundamentally differ from each other as the first is formulated in task acceleration space and the latter is in task force space with passivity considerations. Even though both control methods predict stability under ideal conditions in closed-loop dynamics, their robustness against joint friction, sensor noise, unmodeled external disturbances, and non-perfect contact conditions is not evident. Therefore, we analyze and experimentally compare the two controllers on a humanoid robot platform through swing foot position and orientation control, squatting with and without unmodeled additional weights, and jumping. We also relate the observed performance and characteristic differences with the controller formulations and highlight each controller’s advantages and disadvantages.

Abstract:
Wearable lower limb robots are promising technologies to assist human locomotion. Soft robotic exosuits introduce a promising solution for reducing muscle effort and metabolic cost as they are lightweight, transparent and inherently safe. However, it is challenging to effectively control such soft robots and personalize the assistance for individual users. With the difficulty in developing a robust dynamic model of the human-soft robot system, especially the interacting dynamics between the human and the robot, traditional control methods have seen limited success in addressing these challenges. Reinforcement learning (RL), a data-driven optimal control method, provides a naturally promising alternative. In this study, we propose an innovative control design approach to enable human normative walking with reduced physical effort. To achieve this goal, we propose to first learn an exosuit controller offline for typical human normative walking, which is then used in the online phase of control tuning for individual users. Four participants are recruited to test the exosuit controller in treadmill walking. Our results show that online tuning for individual users reaches convergence quickly, typically in one experimental trial, owing to the efficient offline pre-trained policy. Furthermore, the RL control of the exosuit results in an average muscle effort reduction of 8.8% and 2.8% for the vastus lateralis and biceps femoris, respectively, as measured by electromyography (EMG) sensors. These results provide the first evidence of customizing the soft exosuit assistance for individual users.

Abstract:
Training and deploying reinforcement learning (RL) policies for robots is a complex task, requiring careful design of reward functions, sim-to-real transfer, and performance evaluation across various robot configurations. These tasks traditionally demand significant human expertise and effort. To address these challenges, this paper introduces Anybipe, a novel, fully automated, end-to-end framework for training and deploying bipedal robots, leveraging large language models (LLMs) for reward function generation, while supervising model training, evaluation, and deployment. The framework integrates comprehensive quantitative metrics to assess policy performance, deployment effectiveness, and safety. Additionally, it allows users to incorporate prior knowledge and preferences, improving the accuracy and alignment of generated policies with expectations. We demonstrate how Anybipe reduces human labor while maintaining high levels of accuracy and safety, examined on three different bipedal robots, showcasing its potential for autonomous RL training and deployment.

Abstract:
Pouring fluids is a routine task for humans but challenging for high-DoF robots, particularly given fluid simulation’s computational demands while training policies. In this paper, we propose DexPour, a novel reinforcement learning method with hierarchical rewards and Approximated Proxy Abstraction (APA) method. APA efficiently approximates liquid behavior using a small set of spheres, reducing computational overhead. Meanwhile, our hierarchical reward framework breaks down the intricate pouring process into four distinct stages—approach, grasp, transport, and pour—providing fine-grained feedback and fostering stable policy learning. Extensive experiments demonstrate that DexPour achieves a 92% fluid transfer efficiency with a 70% cup fill and a 99% efficiency at 30% fill, highlighting its robust performance across varying liquid volumes. Ablation studies highlight the contribution of each component, confirming the necessity of detailed stage-wise guidance for complex dexterous manipulation. In addition, we compare DexPour with a full fluid simulation baseline, showing comparable pouring efficiency while reducing training time by 81.6%, demonstrating DexPour’s efficiency and practical viability for fluid manipulation tasks.

Abstract:
This paper presents an approach for the localization of an Unmanned Underwater Vehicle (UUV) in a cooperative team with a tethered Unmanned Surface Vehicle (USV). For the localization, the UUV and the USV carry a camera and a sonar respectively to observe each other. The vehicle states are split between Extended Kalman Filter and grid-based estimators based on which sensors provide Gaussian or non-Gaussian observations of each state. Specifically, the horizontal position of the UUV is estimated using a grid-based method because the camera and sonar that observe these states provide non-Gaussian observations when they cannot detect their target. Additionally, the tether to the USV is treated as a non-Gaussian observation that prevents unbounded error growth. Validation of the technique was performed in simulations using sensor models developed based on testing in a lake and pool.
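The two non-Gaussian observations mentioned above, a sensor that fails to detect its target and a tether that bounds the distance, are easy to express on a grid. The sketch below uses a one-dimensional grid with hypothetical sensor range, detection probability, and tether length; it illustrates the general grid-based Bayes update, not the estimator from the paper.

```python
def normalize(belief):
    """Scale a list of non-negative weights so that it sums to one."""
    total = sum(belief)
    if total == 0:
        raise ValueError("belief has no probability mass left")
    return [b / total for b in belief]


def update_no_detection(belief, cells, sensor_pos, sensor_range, p_detect):
    """Bayes update for a missed detection.

    Cells within sensor range would have been seen with probability p_detect,
    so a miss scales them by (1 - p_detect); cells out of range are unchanged.
    """
    weighted = [
        b * ((1.0 - p_detect) if abs(x - sensor_pos) <= sensor_range else 1.0)
        for b, x in zip(belief, cells)
    ]
    return normalize(weighted)


def apply_tether(belief, cells, anchor_pos, tether_length):
    """Remove probability mass from cells the tether cannot reach."""
    constrained = [
        b if abs(x - anchor_pos) <= tether_length else 0.0
        for b, x in zip(belief, cells)
    ]
    return normalize(constrained)


# Hypothetical 1D grid of 10 cells, 1 m apart, with a uniform prior.
cells = [float(i) for i in range(10)]
belief = normalize([1.0] * len(cells))
belief = update_no_detection(belief, cells, sensor_pos=0.0, sensor_range=3.0, p_detect=0.9)
belief = apply_tether(belief, cells, anchor_pos=0.0, tether_length=6.0)
print([round(b, 3) for b in belief])
```

The resulting belief is flat with a dip near the sensor and a hard cutoff at the tether length, a shape that a single Gaussian cannot represent.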

Abstract:
Transperineal prostate puncture requires the physician to place a needle manually, which is challenging and presents a steep learning curve. This paper proposes a novel ultrasound-guided series-parallel hybrid robot with the aim of enhancing transperineal procedures. For maximum prostate coverage with flexibility and accuracy in needle placement, a 5-degree-of-freedom series-parallel hybrid mechanism with two serial manipulators, a linear feeding unit, and an ultrasound (US) probe positioning mechanism is designed. In addition to mechanical design and kinematics modeling, a QPSO algorithm is proposed to optimize the mechanical parameters. Upon comprehensive comparison with alternative algorithms, the optimization outcomes fully align with clinical requirements. The prototype was fabricated and verified through needle insertion experiments in different scenarios to assess its feasibility. The absolute positioning error of the robot is 1.47 mm in water and 1.75 mm in gel phantom.

Abstract:
Knot tying is a fundamental dexterous surgical subtask that is a key step in suturing. One challenge to robot augmentation is limited depth perception due to the small baseline of surgical endoscopic cameras. In this work, we present Surgical D-Knot: an augmented dexterity pipeline combining learned perception with model-based methods to perform surgical double knots using only one monocular RGB camera. This pipeline includes 2D grasp point identification, 3D suture thread grasping using local feature servoing, suture thread wrapping using relative motion and 2D re-grasp point identification. Human dexterity is required for initial thread setup and thread cutting after each double knot. Physical experiments with 120 double knot trials result in a success rate of 80.83% for the initial knot and 55.83% for the second knot. Translation of surgical knot tying to chicken skin results in success rates of 73.75% for the initial knot and 40% for the second knot. Each double knot requires on average 70 seconds. This is the first work to our knowledge that augments human dexterity for double knot tying. https://sites.google.com/view/surgicaldknot/

Abstract:
This paper presents an active control strategy for lower limb exoskeletons by proposing an end-to-end framework employing deep reinforcement learning (DRL) to enable smooth transitions between different movement patterns. The majority of existing methods in exoskeleton literature employ finite state machines (FSM) that have proven successful in predicting the control strategy for the next state on the basis of sensor data such as IMU data, force, etc. However, one drawback of FSM occurs due to their inflexibility regarding sudden changes. Specifically, FSM is based on clear state transitions, which makes it hard to manage smooth continuous movements and increases the chance of sudden changes in control during transitions. These, in turn, raise safety concerns for the user. While learning-based control approaches have been suggested in recent years, the validation was performed in simulation environments. Therefore, the real-world applicability remains an open research question to date. To address this issue, we provide the first contribution in this field that proposes an end-to-end learning framework with a Deep Deterministic Policy Gradient (DDPG) module to enable smooth transitions between movement patterns under real-world conditions. By introducing several evaluation metrics, we demonstrate that our framework outperforms existing methods in terms of the adaptability and smoothness in movement transitions.

Abstract:
To minimize discomfort and injury risk in exoskeleton users, this paper addresses the misalignment between the device and the human knee joint. The knee's spatial motion complexity, characterized by multi-planar rotation axes as flexion angle changes, cannot be accurately replicated by existing single-axis or planar multi-center designs. A novel spherical cross four-bar linkage-based knee joint structure is proposed, leveraging its kinematic properties to mimic the knee's actual spatial motion. This design undergoes optimization calculations to determine the bionic knee's specific structure. A quantitative evaluation method using pneumatic sensor pads measures internal pressure distribution, comparing human-machine misalignment across different joint mechanisms. Experimental results demonstrate that the bionic knee significantly reduces unintended interaction forces, with maximum pressure values only one-third those of single-axis knee joints. This innovation addresses critical limitations in existing exoskeleton knee designs, enhancing comfort and safety during movement.

Abstract:
Modular robotics holds immense potential for space exploration, where reliability, repairability, and reusability are critical for cost-effective missions. Coordination between heterogeneous units is paramount for precision tasks, whether in manipulation, legged locomotion, or multi-robot interaction. Such modular systems introduce challenges far exceeding those in monolithic robot architectures. This study presents a robust method for synchronizing the trajectories of multiple heterogeneous actuators, adapting dynamically to system variations with minimal system knowledge. This design makes it inherently robot-agnostic, thus highly suited for modularity. To ensure smooth trajectory adherence, the multidimensional state is constrained within a hypersphere representing the allowable deviation. The distance metric can be adapted; hence, depending on the task and system under control, the constraint region can be deformed. This approach is compatible with a wide range of robotic platforms and serves as a core interface for Motion Stack, our new open-source universal framework for limb coordination (available at https://github.com/2lian/Motion-Stack). The method is validated by synchronizing the end-effectors of six highly heterogeneous robotic limbs, evaluating both trajectory adherence and recovery from significant external disturbances.

Abstract:
With the rapid deployment of unmanned bolt-tightening robots on transmission towers, traditional control algorithms face challenges in balancing long-term task logic and real-time adaptability, especially in unstructured environments with missing bolts and unexpected obstacles. This paper proposes HTP-TV, a framework for hierarchical task planning with temporal logic and visual servoing, which integrates temporal logic-based planning with a vision-based reactive mechanism. HTP-TV decouples semantic goals such as bolt-tightening sequences from geometric path planning, enabling offline pre-planning via LTL-RRT to generate constraint-compliant trajectories. In the online phase, real-time camera data dynamically updates the environmental model, which triggers adjustments in the incremental Büchi automaton to address missing bolts or obstacles. A semantic ID system encodes the bolt topology, supporting re-planning with axial constraints, while visual servoing techniques correct execution deviations. Through comparisons with two baseline methods (Offline LTL-RRT and Online RRT-based), the simulation results in CoppeliaSim demonstrate the efficiency, high safety compliance, and superiority of HTP-TV.

Abstract:
Event-based visual odometry has recently gained attention for its high accuracy and real-time performance in fast-motion systems. Unlike traditional synchronous estimators that rely on constant-frequency (zero-order) triggers, event-based visual odometry can actively accumulate information to generate temporally high-order estimation triggers. However, existing methods primarily focus on adaptive event representation after estimation triggers, neglecting the decision-making process for efficient temporal triggering itself. This oversight leads to computational redundancy and noise accumulation. In this paper, we introduce a temporally high-order event-based visual odometry with spiking event accumulation networks (THE-SEAN). To the best of our knowledge, it is the first event-based visual odometry capable of dynamically adjusting its estimation trigger decision in response to motion and environmental changes. Inspired by biological systems that regulate hormone secretion to modulate heart rate, a self-supervised spiking neural network is designed to generate estimation triggers. This spiking network extracts temporal features to produce triggers, with rewards based on block matching points and the Fisher information matrix (FIM) trace acquired from the estimator itself. Finally, THE-SEAN is evaluated across several open datasets, demonstrating average improvements of 13% in estimation accuracy, 9% in smoothness, and 38% in triggering efficiency compared to the state-of-the-art methods.

Abstract:
Many optimization problems in robotics involve the optimization of time-expensive black-box functions, such as those involving complex simulations or evaluation of real-world experiments. Furthermore, these functions are often stochastic as repeated experiments are subject to unmeasurable disturbances. Bayesian optimization can be used to optimize such functions in an efficient manner by deploying a probabilistic function estimator to estimate with a given confidence so that regions of the search space can be pruned away. Consequently, the success of the Bayesian optimization depends on the function estimator’s ability to provide informative confidence bounds. Existing function estimators require many function evaluations to infer the underlying confidence or depend on modeling of the disturbances. In this paper, it is shown that the confidence bounds provided by the Wilson Score Kernel Density Estimator (WS-KDE) are applicable as excellent bounds to any stochastic function with an output confined to the closed interval [0, 1] regardless of the distribution of the output. This finding opens up the use of WS-KDE for stable global optimization on a wider range of cost functions. The properties of WS-KDE in the context of Bayesian optimization are demonstrated in simulation and applied to the problem of automated trap design for vibrational part feeders.
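For reference, the classical Wilson score interval for a mean of outcomes in the unit interval can be computed as below. This is only the standard interval for a sample mean and a sample count; the kernel weighting that turns it into WS-KDE is not reproduced here, and the function name and the default `z` value are choices made for this sketch.

```python
import math


def wilson_interval(mean, n, z=1.96):
    """Wilson score interval for a mean of n outcomes, each within the unit interval.

    mean: observed sample mean, expected to lie between 0 and 1
    n: number of samples (may be fractional if it is an effective sample count)
    z: normal quantile, 1.96 for roughly 95% two-sided confidence
    Returns (lower, upper); with no data the interval is the whole unit interval.
    """
    if n <= 0:
        return 0.0, 1.0
    z2 = z * z
    denom = 1.0 + z2 / n
    center = (mean + z2 / (2.0 * n)) / denom
    half = z * math.sqrt(mean * (1.0 - mean) / n + z2 / (4.0 * n * n)) / denom
    return max(0.0, center - half), min(1.0, center + half)


lower, upper = wilson_interval(0.5, 10)
print(round(lower, 4), round(upper, 4))
```

Unlike the naive normal approximation, the bounds stay inside the unit interval and remain informative for small sample counts, which is what makes them attractive for pruning regions of a search space.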

Abstract:
An overground track-walking scheme with a body-weight support system can provide task-oriented and repetitive training. Furthermore, it improves gait stability and endurance more effectively than the conventional treadmill walking method. However, it does not improve asymmetry and reduces gait speed. Accordingly, we developed a 4-DOF mobile manipulator in which the position of the handle is controlled to provide continuous somatosensory information (cutaneous and proprioceptive information), like a fixed rail, to the user’s hand during overground walking for gait enhancement (velocity, symmetry, and balance). The system consists of a mobile robot based on three omni wheels for robust track following and a 1-DOF revolute joint-based manipulator for gait enhancement during track-based gait guidance. To demonstrate the feasibility of the system, we conducted a pilot experiment with one stroke patient on a 15 m track. The experimental results showed that the robot could guide the patient along the track and enhance symmetry and balance, especially in the curved section of the track. Furthermore, the preferred walking speed of the participant on the track improved. Therefore, the system demonstrated promising potential for providing quantitative, repetitive, and safe track-based overground gait rehabilitation training.

Affiliations: National Key Laboratory of General Artificial Intelligence, Key Laboratory of Machine Perception (MoE), School of Intelligence Science and Technology, Peking University, Beijing, China; School of Electrical and Electronic Engineering, Nanyang Technological University, Singapore, Singapore; School of Instrument Science and Engineering, Southeast University, Nanjing, China; China Automotive Innovation Corporation, Nanjing, China; China Sanjiang Space Group Co., Ltd, Chengdu, China

Abstract:
Modern robots must coexist with humans in dense urban environments. A key challenge is the ghost probe problem, where pedestrians or objects unexpectedly rush into traffic paths. This issue affects both autonomous vehicles and human drivers. Existing works propose vehicle-to-everything (V2X) strategies and non-line-of-sight (NLOS) imaging for ghost probe zone detection. However, most require high computational power or specialized hardware, limiting real-world feasibility. Additionally, many methods do not explicitly address this issue. To tackle this, we propose DPGP, a hybrid 2D-3D fusion framework for ghost probe zone prediction using only a monocular camera during training and inference. With unsupervised depth prediction, we observe ghost probe zones align with depth discontinuities, but different depth representations offer varying robustness. To exploit this, we fuse multiple feature embeddings to improve prediction. To validate our approach, we created a 12K-image dataset annotated with ghost probe zones, carefully sourced and cross-checked for accuracy. Experimental results show our framework outperforms existing methods while remaining cost-effective. To our knowledge, this is the first work extending ghost probe zone prediction beyond vehicles, addressing diverse non-vehicle objects. We will open-source our code and dataset for community benefit.

Abstract:
Therapies targeting neurodegenerative diseases via brain ventricles and spinal parenchyma face delivery challenges. Systemic administration is ineffective due to the blood-brain barrier, while direct surgical access, especially for multi-site delivery, is highly invasive. The spinal subarachnoid space offers potential for microcatheter-based delivery, but existing robotic catheter technologies are unsuitable due to spinal anatomy constraints. This paper presents a miniaturised and sensorised steerable eversion-growing robot tailored to navigation of the subarachnoid space of the spine. The property of eversion reduces interaction forces with the anatomy, rendering our approach safer than microcatheters that need to be pushed. Our system is capable of real-time tip force estimation with three degrees of freedom (DoF) using fibre Bragg gratings (FBG). Additionally, it incorporates a micro-endoscope and a steerable tip, all within a 2 mm outer diameter. The system’s navigation, sensing, and imaging capabilities were evaluated using a realistic up-scaled phantom of the subarachnoid space covering the cervical spine, demonstrating interaction forces within the safe range of 2-5 N during phantom navigation. A comparison study of instrument-tissue interactions further supported its clinical relevance, showing a 73.78% decrease in the mean absolute forces compared to traditional insertion without the sheath in global measurements.

Abstract:
This paper presents the SPARK finger, an innovative passive adaptive robotic finger capable of executing both parallel pinching and scooping grasps. The SPARK finger incorporates a multi-link mechanism with Kempe linkages to achieve a vertical linear fingertip trajectory. Furthermore, a parallelogram linkage ensures the fingertip maintains a fixed orientation relative to the base, facilitating precise and stable manipulation. By integrating these mechanisms with elastic elements, the design enables effective interaction with surfaces, such as tabletops, to handle challenging objects. The finger employs a passive switching mechanism that facilitates seamless transitions between pinching and scooping modes, adapting automatically to various object shapes and environmental constraints without additional actuators. To demonstrate its versatility, the SPARK Hand, equipped with two SPARK fingers, has been developed. This system exhibits enhanced grasping performance and stability for objects of diverse sizes and shapes, particularly thin and flat objects that are traditionally challenging for conventional grippers. Experimental results validate the effectiveness of the SPARK design, highlighting its potential for robotic manipulation in constrained and dynamic environments.

Abstract:
Unmanned Surface Vessels (USVs) face significant control challenges due to uncertain environmental disturbances like waves and currents. This paper proposes a trajectory tracking controller based on Active Disturbance Rejection Control (ADRC) implemented on the DUS V2500. A custom simulation incorporating realistic waves and current disturbances is developed to validate the controller’s performance, supported by further validation through field tests in the harbour of Scheveningen, the Netherlands, and at sea. Simulation results demonstrate that ADRC significantly reduces cross-track error across all tested conditions compared to a baseline PID controller but increases control effort and energy consumption. Field trials confirm these findings while revealing a further increase in energy consumption during sea trials compared to the baseline. Videos can be found at https://autonomousrobots.nl/paper_websites/adrc-demcon.

Abstract:
Traditional one-step preview planning algorithms for bipedal locomotion struggle to generate viable gaits when walking across terrains with restricted footholds, such as stepping stones. To overcome such limitations, this paper introduces a novel multi-step preview foot placement planning algorithm based on the step-to-step discrete evolution of the Divergent Component of Motion (DCM) of walking robots. Our proposed approach adaptively changes the step duration and the swing foot trajectory for optimal foot placement under constraints, thereby enhancing the long-term stability of the robot and significantly improving its ability to navigate environments with tight constraints on viable footholds. We demonstrate its effectiveness through various simulation scenarios with complex stepping-stone configurations and external perturbations. These tests underscore its improved performance for navigating foothold-restricted terrains, even with external disturbances.
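
As background for the step-to-step DCM evolution mentioned above, the following sketch uses the standard linear-inverted-pendulum relation with the center of pressure held fixed at the stance foot during a step. It is a one-dimensional textbook form, not the paper's planner. The function name, the CoM height of 0.8 m, and the example numbers are assumptions.

```python
import math


def dcm_after_step(xi, foot, step_duration, com_height=0.8, gravity=9.81):
    """Propagate the DCM over one step of the linear inverted pendulum.

    With the center of pressure held at `foot`, the DCM offset from the
    foothold grows exponentially at the natural frequency omega:
        xi_next = foot + (xi - foot) * exp(omega * T)
    com_height and gravity are assumed illustrative values.
    """
    omega = math.sqrt(gravity / com_height)
    return foot + (xi - foot) * math.exp(omega * step_duration)


# A longer step duration lets the DCM diverge further from the stance foot,
# which is why step timing and foot placement are coupled.
for duration in (0.2, 0.3, 0.4):
    print(duration, dcm_after_step(xi=0.05, foot=0.0, step_duration=duration))
```

Chaining this map over several upcoming footholds gives a multi-step prediction of where the DCM will be, which is the kind of quantity a multi-step preview planner can constrain.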

Abstract:
Recently, research on mobile manipulators has attracted increasing attention. Ensuring that mobile manipulators can meet obstacle avoidance constraints and efficiently accomplish assigned tasks in dynamic environments remains a significant challenge. To address this issue, this paper proposes an integrated framework for environment perception, real-time planning, and control optimization. Firstly, we develop a fusion map that combines a Euclidean signed distance field (ESDF) with clustered point clouds occupying cubes, enabling robots to perceive more precise environmental information in complex and changing conditions. Secondly, we introduce a novel rapid generation strategy for 6-DOF guide point sequences, which directs the mobile manipulator to follow the most efficient path to the target location while making real-time adjustments to avoid dynamic obstacles. Additionally, utilizing optimized nonlinear model predictive control (NMPC), we design a whole-body motion controller for the mobile manipulator to prevent the system from becoming trapped in local optima, thereby allowing the manipulator to adjust its state tracking guide points promptly in complex indoor environments. Finally, the proposed algorithm was implemented on a mobile manipulator with an Ackermann base and tested through both simulations and real-world experiments.

Abstract:
The growing prevalence of stroke necessitates advanced lower-limb exoskeleton control. This paper proposes HybridFusionAtt, a novel model for continuous joint angle estimation using surface electromyography (sEMG). Unlike conventional approaches, our framework uniquely integrates traditional time-domain features with CNN-extracted high-dimensional spatial features through an attention mechanism, where traditional features dynamically guide feature fusion as attention queries. The model was validated using data collected from eight participants performing four activities of daily living (walking, stair climbing, stair descending, and obstacle crossing). The proposed model achieves average R² values for knee and hip joint angle prediction of 0.8682 (walking), 0.8482 (obstacle crossing), 0.9294 (stair climbing), and 0.8676 (stair descending). Experimental results show that the proposed model significantly outperforms traditional LSTM and CNN-LSTM models in terms of accuracy and robustness, particularly in handling non-periodic actions such as obstacle crossing. The model achieves high performance by effectively fusing features and adaptively focusing on key features, enabling it to maintain robustness even under noisy conditions and significant individual differences. This demonstrates the model’s broad application potential, especially in rehabilitation and prosthetic control systems.

Abstract:
Recent years have seen a growing interest in the development of shared control strategies for upper limb prostheses. In this work, we take a critical step towards developing transhumeral devices by proposing a biomimetic control strategy for cooperative hand placement. This is achieved through a novel adaptation of Dynamic Movement Primitives (DMPs), enabling the generation of smooth trajectories from a rest position to arbitrary points within a user’s reach space. Our method revolves around a key observation that DMP forcing-function weights can be modeled (p < 0.05) for ≥90% of values as a simple function of Cartesian position, achieving median R² values of 0.63, 0.43, and 0.02 (horizontal, vertical, and depth). Validation on 519 trajectories via 5-fold cross-validation showed significant improvements (p < 0.01) over Extended DMP and kernel-based methods. Real-time human-in-the-loop experiments revealed a median minimum cumulative-distance deviation of 0.0733 m (8.5% error) for motion with a prosthesis compared with an intact limb. To our knowledge, this is the first study to explore shared control for transhumeral prostheses, and our observations on human motion modeling may inspire future Learning-from-Demonstration studies.

Abstract:
Contact-rich manipulation remains a major challenge in robotics. Optical tactile sensors like GelSight Mini offer a low-cost solution for contact sensing by capturing softbody deformations of the silicone gel. However, accurately inferring shear and normal force distributions from these gel deformations has yet to be fully addressed. In this work, we propose a machine learning approach using a U-net architecture to predict force distributions directly from the sensor’s raw images. Our model, trained on force distributions inferred from Finite Element Analysis (FEA), demonstrates promising accuracy in predicting normal and shear force distributions for the commercially available GelSight Mini sensor. It also shows potential for generalization across indenters, sensors of the same type, and for enabling real-time application. The codebase, dataset and models are open-sourced and available at https://feats-ai.github.io.

Abstract:
Autonomous driving is a high-performance, safety-critical task. Effectively controlling autonomous vehicles to enhance both performance and safety is crucial, especially in complex and dynamic environments. However, in real-time obstacle avoidance (OA) scenarios, the planning layer often fails due to high computational complexity and response delays. This trade-off between computational efficiency and safety performance presents a key challenge: how to achieve an optimal balance between autonomous driving safety and real-time performance. In recent years, addressing OA at the control layer has become a major research focus for improving the safety of autonomous vehicles. Given the advantages of Model Predictive Control (MPC) in prediction and constraint handling, this paper integrates an OA safety distance constraint into MPC to effectively handle OA in Unmanned Ground Vehicles (UGVs). First, a Taylor expansion is used to construct the UGV’s error model. Then, safe distance constraints for obstacle avoidance are formulated, considering both tracking errors and proximity to obstacles. Additionally, a Safe Obstacle Avoidance MPC (SOAMPC) is developed by integrating safety distance constraints and physical limitations. Furthermore, key control-theoretic properties are established, including recursive feasibility, guaranteed collision avoidance, and system stability. Simulations and experiments in a multi-obstacle environment validate SOAMPC’s effectiveness. Results show that SOAMPC not only ensures obstacle avoidance and stability but also outperforms other methods in efficiency and path tracking accuracy.

Abstract:
Localization of a magnetically actuated capsule endoscope (MCE) is essential for accurate actuation. Despite extensive progress in pose estimation using internal magnetic field sensors and external magnetic sources, it remains challenging to achieve localization when a time-varying internal magnetic field (IMF) exists. This study presents a compound sensing method for the magnetic ultrasound capsule endoscope (MUSCE) based on an internal magnetic field sensor array and an external permanent magnet source, achieving simultaneous 6-degree-of-freedom (DOF) localization for magnetic navigation and real-time ultrasound (US) beam scanning angle detection for distortion-free US imaging reconstruction. Firstly, a MUSCE consisting of an internal magnet, a US transducer, and Hall sensors is designed, enabling simultaneous spiral structure-based locomotion and high-quality endoluminal US imaging. Then, a compound sensing strategy is presented, realizing the separation of the time-varying IMF and the external magnetic field (EMF), allowing synchronous 6-DOF MUSCE localization and US beam scanning angle detection. Finally, the effectiveness of the presented method is validated by tests. The demonstrated static localization error is 4.08 ± 1.91 mm in position norm and 2.46 ± 1.31° in orientation, in a workspace shared with the robotic manipulator. Also, the scanning angle detection can rectify distortion in US images, showing potential for clinical applications.

Abstract:
Robot pick and place systems have traditionally decoupled grasp, placement, and motion planning to build sequential optimization pipelines with an assumption that the individual components will be able to work together. However, this separation introduces sub-optimality, as grasp choices may limit, or even prohibit, feasible motions for a robot to reach the target placement pose, particularly in cluttered environments with narrow passages. To this end, we propose a forest-based planning framework to simultaneously find grasp configurations and feasible robot motions that explicitly satisfy downstream placement configurations paired with the selected grasps. Our proposed framework leverages a bidirectional sampling-based approach to build a start forest, rooted at the feasible grasp regions, and a goal forest, rooted at the feasible placement regions, to facilitate the search through randomly explored motions that connect valid pairs of grasp and placement trees. We demonstrate that the framework’s inherent parallelism enables superlinear speedup, making it scalable for applications for redundant robot arms, e.g., 7 DoF, to work efficiently in highly cluttered environments. Extensive experiments in simulation demonstrate the robustness and efficiency of the proposed framework in comparison with multiple baselines under diverse scenarios.

Abstract:
The quantitative evaluation of the improvement of physical function is crucial for patients with impaired motor function, such as stroke patients, when conducting related rehabilitation training activities. Specifically, a practical and easy-to-operate gait feature detection and extraction system for home use is urgently needed. In this study, a home gait feature extraction method based on the inertial measurement unit is proposed. The subjects’ walking distance and speed are calculated using double integration, the number of strides is calculated using the local maximum peak approach, and the stance phase and swing phase are calculated using the local trough approach. The comparison shows that the average walking distance accuracy is about 91.32% and the average stride accuracy is about 96.55%. The proportions of the stance period (59.01%) and swing period (40.99%) estimated by the inertial measurement unit are close to the ratio of the two at normal speed. The experimental results demonstrate that the gait spatio-temporal features are retrieved with high accuracy. The proposed method facilitates gait evaluation in clinics and at home, including the extraction of gait features and real-time evaluation.
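
The two signal-processing steps named above, double integration and local-maximum peak counting, can be sketched in a few lines. This is a minimal illustration on clean synthetic data, not the paper's pipeline. The function names, the threshold, and the sample values are assumptions, and real IMU data would also need gravity removal and drift correction, which are omitted here.

```python
def double_integrate(accel, dt):
    """Trapezoidal double integration: acceleration -> velocity -> distance.

    accel: forward acceleration samples in m/s^2 (assumed gravity-compensated)
    dt:    sampling period in seconds
    Returns the velocity trace and the total distance travelled.
    """
    velocity = [0.0]
    for a_prev, a_next in zip(accel, accel[1:]):
        velocity.append(velocity[-1] + 0.5 * (a_prev + a_next) * dt)
    distance = 0.0
    for v_prev, v_next in zip(velocity, velocity[1:]):
        distance += 0.5 * (v_prev + v_next) * dt
    return velocity, distance


def count_peaks(signal, threshold):
    """Count local maxima above a threshold (one peak is taken as one stride)."""
    return sum(
        1
        for i in range(1, len(signal) - 1)
        if signal[i] > threshold and signal[i - 1] < signal[i] >= signal[i + 1]
    )


# Synthetic check: constant 2 m/s^2 for 1 s should cover 0.5 * a * t^2 = 1 m.
_, travelled = double_integrate([2.0] * 101, dt=0.01)
print(round(travelled, 6))
print(count_peaks([0.0, 1.0, 0.0, 2.0, 0.0, 0.1, 0.0], threshold=0.5))
```

The stance and swing phases could be segmented analogously by searching for local troughs, that is, by applying the same peak test to the negated signal.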

Abstract:
In this paper, we propose a concept of Teleoperated Teaching of Task and Impedance (TTTI) with a novel multi-modal interface that enables online teleoperated teaching of combined low-level impedance-regulation skills and high-level task decision-making skills using a single hand-held haptic device. To this end, we interactively switch the functionality of the haptic device for two modes of operation. To teach impedance-regulation low-level skills, we developed a novel stiffness command interface where the human operator uses the haptic device to manipulate the stiffness ellipsoid of the remote robotic arm endpoint in 3D space. For teaching high-level skills of how and when to employ low-level actions, we developed a GUI that enables a haptic device to remotely modify Behaviour Trees used to encode the robot’s task decision-making process. The interface connects both teaching modes, where a newly demonstrated low-level skill appears in the Behaviour Tree at an operator-specified index. To demonstrate the main features of the proposed interface, we performed several proof-of-concept experiments on a teleoperation setup operating a remote shelf-stocker robot in a simulated supermarket environment. We examined the task of placing a product on a shelf that consists of several sub-tasks, where each involves different stiffness strategies, while the Behaviour Tree has to encode the task sequencing and decision-making process.

Abstract:
Accurate 6D object pose estimation is essential for robotic grasping and manipulation, particularly in agriculture, where fruits and vegetables exhibit high intra-class variability in shape, size, and texture. The vast majority of existing methods rely on instance-specific CAD models or require depth sensors to resolve geometric ambiguities, making them impractical for real-world agricultural applications. In this work, we introduce PLANTPose, a novel framework for category-level 6D pose estimation that operates purely on RGB input. PLANTPose predicts both the 6D pose and deformation parameters relative to a base mesh, allowing a single category-level CAD model to adapt to unseen instances. This enables accurate pose estimation across varying shapes without relying on instance-specific data. To enhance realism and improve generalization, we also leverage Stable Diffusion to refine synthetic training images with realistic texturing, mimicking variations due to ripeness and environmental factors and bridging the domain gap between synthetic data and the real world. Our evaluations on a challenging benchmark that includes bananas of various shapes, sizes, and ripeness status demonstrate the effectiveness of our framework in handling large intra-class variations while maintaining accurate 6D pose predictions, significantly outperforming the state-of-the-art RGB-based approach MegaPose. Our code, data, and models are publicly available at https://github.com/mariosgly/PLANTPose.

Abstract:
This paper presents a fast and accurate model of a deformable linear object (DLO) – e.g., a rope, wire, or cable – integrated into an established robot physics simulator, MuJoCo. Most accurate DLO models with low computational times exist in standalone numerical simulators, which are unable or require tedious work to handle external objects. Based on an existing state-of-the-art DLO model – Discrete Elastic Rods (DER) – our implementation provides an improvement in accuracy over MuJoCo’s own native cable model. To minimize computational load, our model utilizes force-lever analysis to adapt the Cartesian stiffness forces of the DER into its generalized coordinates. As a key contribution, we introduce a novel parameter identification pipeline designed for both simplicity and accuracy, which we utilize to determine the bending and twisting stiffness of three distinct DLOs. We then evaluate the performance of each model by simulating the DLOs and comparing them to their real-world counterparts and against theoretically proven validation tests.

Abstract:
Patients suffering from neurological and musculoskeletal disorders often experience impaired upper limb function, significantly reducing their quality of life. In recent years, wearable robots have emerged as a promising solution to facilitate rehabilitation and assist in daily activities. Among these, tendon-driven actuation has been widely adopted; however, such systems face challenges in achieving precise position control compared to direct motor-driven systems. This is primarily due to the hysteresis and backlash resulting from the high compliance and elasticity of tendons, necessitating effective compensation strategies. In this paper, we implement an embedded compact sensor system for end-effector-level position tracking in a soft wearable robot designed for forearm pronation/supination and grasping motions. By integrating sensors at the end-effector, we enable real-time motion data acquisition and establish a closed-loop feedback mechanism that effectively compensates for the limitations of tendon-driven actuation, thereby enhancing overall control accuracy. Based on the embedded end-effector-level sensing system, we introduce a novel wearable robot kit for motion tracking comprising two parts: a sensor-only exosuit for real-time capture of user hand and forearm movements, and a motor-equipped exosuit that replicates and assists movements based on the sensor feedback. This Leader-Follower Control Mode allows for accurate capture and rapid response to user motion intent, offering a new solution for applications in tele-control, mirror therapy, and motion synchronization.

Abstract:
Humanoid robot soccer presents several challenges, particularly in maintaining system stability during aggressive kicking motions while achieving precise ball trajectory control. Current solutions, whether traditional position-based control methods or reinforcement learning (RL) approaches, exhibit significant limitations. Model predictive control (MPC) is a prevalent approach for ordinary quadruped and biped robots. While MPC has demonstrated advantages in legged robots, existing studies often oversimplify the leg swing process, relying merely on simple trajectory interpolation methods. This severely constrains the foot’s environmental interaction capability, hindering tasks such as ball kicking. This study adapts the spatial-temporal trajectory planning method, which has been successful in drone applications, to bipedal robotic systems. The proposed approach autonomously generates foot trajectories that satisfy constraints on target kicking position, velocity, and acceleration while simultaneously optimizing swing phase duration. Experimental results demonstrate that the optimized trajectories closely mimic human kicking behavior, featuring a backswing motion. Simulation and hardware experiments confirm the algorithm’s efficiency, with trajectory planning times under 1 ms, and its reliability, achieving nearly 100% task completion accuracy when the soccer goal is within the range of -90° to 90°.

Abstract:
Control Barrier Functions (CBFs) have emerged as an effective and non-invasive safety filter for ensuring the safety of autonomous systems in dynamic environments with formal guarantees. However, most existing works on CBF synthesis focus on fully known settings. Synthesizing CBFs online based on perception data in unknown environments poses particular challenges. Specifically, this requires the construction of CBFs from high-dimensional data efficiently in real time. This paper proposes a new approach for online synthesis of CBFs directly from local Occupancy Grid Maps (OGMs). Inspired by steady-state thermal fields, we show that the smoothness requirement of CBFs corresponds to the solution of the steady-state heat conduction equation with suitably chosen boundary conditions. By leveraging the sparsity of the coefficient matrix in Laplace’s equation, our approach allows for efficient computation of safety values for each grid cell in the map. Simulation and real-world experiments demonstrate the effectiveness of our approach. Specifically, the results show that our CBFs can be synthesized in an average of milliseconds on a 200×200 grid map, highlighting its real-time applicability.
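
To make the thermal-field idea concrete, the sketch below relaxes Laplace's equation on a small occupancy grid with Gauss-Seidel sweeps. It is an illustration under stated assumptions, not the paper's solver: the abstract points to a sparse linear solve, whereas plain iteration is used here for brevity. The boundary values (0 on obstacle cells, 1 on the map border) and the function name are likewise assumed.

```python
def harmonic_safety_field(occupancy, sweeps=500):
    """Relax Laplace's equation over the free interior cells of a grid.

    occupancy: 2D list of 0 (free) or 1 (obstacle)
    Assumed Dirichlet conditions: obstacle cells are held at 0.0 and free
    cells on the outer border at 1.0. Every free interior cell converges
    to the average of its four neighbours (a discrete harmonic function).
    """
    rows, cols = len(occupancy), len(occupancy[0])
    field = [[0.0 if occupancy[r][c] else 1.0 for c in range(cols)] for r in range(rows)]
    for _ in range(sweeps):
        for r in range(1, rows - 1):
            for c in range(1, cols - 1):
                if occupancy[r][c]:
                    continue  # obstacle value stays fixed at 0.0
                field[r][c] = 0.25 * (
                    field[r - 1][c] + field[r + 1][c] + field[r][c - 1] + field[r][c + 1]
                )
    return field


# 5x5 map with a single obstacle in the centre.
grid = [[0] * 5 for _ in range(5)]
grid[2][2] = 1
for row in harmonic_safety_field(grid):
    print(" ".join(f"{value:.3f}" for value in row))
```

By the maximum principle the relaxed values stay between the two boundary values and vary smoothly, growing with distance from obstacles, which is what makes a harmonic field a natural candidate for a smooth safety value per grid cell.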

Abstract:
The adaptivity and maneuvering capabilities of Autonomous Underwater Vehicles (AUVs) have drawn significant attention in oceanic research, due to the unpredictable disturbances and strong coupling among the AUV’s degrees of freedom. In this paper, we develop a large language model (LLM)-enhanced, reinforcement learning (RL)-based adaptive S-surface controller for AUVs. Specifically, LLMs are introduced for the joint optimization of controller parameters and reward functions in RL training. Using multi-modal and structured explicit task feedback, LLMs enable joint adjustments, balance multiple objectives, and enhance task-oriented performance and adaptability. In the proposed controller, the RL policy focuses on upper-level tasks, outputting task-oriented high-level commands that the S-surface controller then converts into control signals, ensuring cancellation of nonlinear effects and unpredictable external disturbances in extreme sea conditions. Under extreme sea conditions involving complex terrain, waves, and currents, the proposed controller demonstrates superior performance and adaptability in high-level tasks such as underwater target tracking and data collection, outperforming traditional PID and SMC controllers.

Abstract:
We present DRACo-SLAM2, a distributed SLAM framework for underwater robot teams equipped with multibeam imaging sonar. This framework improves upon the original DRACo-SLAM by introducing a novel representation of sonar maps as object graphs and utilizing object graph matching to achieve time-efficient inter-robot loop closure detection without relying on prior geometric information. To better accommodate the needs and characteristics of underwater scan matching, we propose incremental Group-wise Consistent Measurement Set Maximization (GCM), a modification of Pairwise Consistent Measurement Set Maximization (PCM), which effectively handles scenarios where nearby inter-robot loop closures share similar registration errors. The proposed approach is validated through extensive comparative analyses on simulated and real-world datasets.

Abstract:
The parallel continuum mechanism offers distinct advantages in the design of surgical manipulators, including enhanced stiffness, improved precision, and a simplified structure compared to traditional tendon-driven systems. Conventional kinematic models based on constant-curvature assumptions are often inadequate for accurately capturing the complex bending behaviors of this mechanism. In contrast, the Cosserat rod theory provides a rigorous framework for precise kinematic modeling of flexible structures. However, its computational complexity results in slow solving speeds, particularly when dealing with spatial points that are widely separated. This paper focuses on a miniaturized parallel continuum manipulator and employs the Cosserat rod model for kinematic modeling, combined with a neural network-based inverse kinematics solver to achieve rapid real-time computation. To expedite inverse kinematics, a multilayer perceptron is trained on 5,000 samples generated from the Cosserat rod model, yielding an average absolute error of 0.046 mm and an average relative error of 0.41% in predicting rod lengths. Experimental validation demonstrates that the neural network solver reduces computation time to about 0.16 ms compared to 700–3100 ms for conventional numerical methods, underscoring its potential for enhancing the precision and responsiveness of surgical systems in minimally invasive procedures.

Abstract:
Humanoid robotic hands need to be versatile and capable of providing environmental information in order to serve as a platform for intelligent grasp control. To facilitate the design process of such hands, we present the KIT Robotic Hands. They have been designed to meet diverse application requirements through their scalability in size, actuation, sensorization and computing resources. The hands integrate a multi-modal sensor system, in-hand embedded processing capabilities, an adaptive underactuated mechanism and a continuously controllable thumb rotation to enhance dexterity. The flexibility of the design is demonstrated through two application-specific hand implementations: one is the ARMAR-7 hand, which has human hand dimensions for grasping daily objects in household tasks, the other is the ARMAR-DE hand, a larger hand designed for grasping bigger objects in decontamination tasks. We describe the design and mechatronics of the hands as well as an evaluation of the grasp success and image segmentation based on an in-hand integrated camera and onboard processing of visual data.

Abstract:
Visual-Inertial Odometry has been widely deployed on autonomous robots traveling in open outdoor scenarios. However, the visual measurements are heavily influenced by observation distance, perspective, lighting, and texture conditions, resulting in distinct and time-varying measurement noise distributions. Existing methods for handling time-varying noise in Visual-Inertial Odometry regard all measurement noise as identically distributed and therefore cannot effectively deal with the distinct noise in open outdoor scenarios, which degrades the localization accuracy. In this paper, a Multi-State Constraint Kalman Filter with Adaptive multivariate noise parameters Clustering and estimation for visual-inertial odometry (CAMSCKF) is proposed to address the issue, which can separately track the measurement noise covariance matrix (MNCM) of different measurement clusters and adjust the MNCM in real time. Firstly, the joint distribution of the state and the MNCM coefficients for each cluster is modeled as a Gaussian-Multivariate Generalized Inverse Gaussian distribution. Subsequently, an Expectation Maximization algorithm-based stepwise adaptive measurement clustering method is designed, which clusters measurements according to their corresponding innovations. Finally, an analytical update method for the joint posterior distribution without fixed-point iteration is implemented, achieving adaptive adjustment of the MNCM, thereby enabling accurate and robust Visual-Inertial Odometry localization. The superiority of the proposed method is demonstrated by simulations and dataset experiments, especially under aggressive motion. In experiments on the challenging outdoor dataset UZH-FPV, the proposed method improves the average position and attitude estimation accuracy by 35.69% and 32.88%, respectively, compared with the state-of-the-art ANGIG-KF.

Abstract:
For quadrupeds, a flexible spine allows them to traverse space and make quick turns. From the perspective of mechanical design in quadruped robots, an active spine with 2 degrees of freedom (2-DOF) can achieve dynamic posture adjustment similar to that of biological organisms, allowing for pitch and yaw control. In this work, we present a novel approach to enhance the flexibility of a quadruped robot, Yatsen Lion II, by incorporating a 2-DOF active spine, which is mechanically designed as a linkage-driven parallelogram mechanism. To optimize its motion, we utilize nonlinear model predictive control (NMPC), which combines centroidal dynamics with full kinematics. By incorporating the two extra DOFs of the spinal joint into the generalized coordinates and velocities, we represent the robot as a hybrid dynamic system, capturing the intricate interplay between the legs and spine. Centroidal dynamics act as a crucial bridge between joint movements and the robot’s overall momentum, enabling the controller to synchronize the quadruped’s movements with dynamic spinal adjustments and adaptive gait patterns. We validate our approach through both simulation and real-world experiments. We compare the spined quadruped robot to its rigid-spine counterpart across key locomotion metrics, including in-place turning, straight-line speed, and turning radius. The results indicate that the spined quadrupedal robot outperforms its rigid counterpart by up to 26%, highlighting its flexibility.

Abstract:
Global population aging has led to a sharp increase in patients with upper limb motor dysfunction. Robot-assisted virtual training, as a novel solution, can offer safe and precise assistance for upper limb rehabilitation. However, it remains a critical challenge to compensate for virtual interaction force and realize joint synergy movement. In this paper, we design an upper limb rehabilitation robot for virtual training (ULRVT II), a highly compatible cable-driven exoskeleton controlled by a joint synergy method. Moreover, we establish a rehabilitation platform with a virtual training environment and evaluation system for experimental validation. Tests of joint synergy and virtual training performance are carried out to show the effectiveness of our robot.

Abstract:
Bone drilling is a critical component of many clinical surgeries. In robot-assisted deep bone drilling procedures, the complex structure of bone tissue and individual variations in drilling paths often cause slender tools to skid on personalized bone surfaces, leading to deviations that significantly impact surgical precision and safety. This paper presents the development of an orthopedic surgical robot equipped with skidding sensing capabilities. A novel sensing solution for the bone drilling unit is proposed, which employs rigid body force transmission and decouples thrust and lateral force sensing. This approach addresses the challenge of acquiring force information from the deep tool tip within the body. We also introduce a tool tip skidding estimation method based on the deflection curve model and the Spatial-Beam Constraint Model (SBCM). A specialized simulation device for measuring tool tip offset and force was designed. The experimental results demonstrate that the system achieves average sensing errors of 31.8 mN and 43.5 mN for lateral forces at the tool tip along the X and Y directions, respectively. Additionally, the system's resolution for skidding estimation reaches 0.2 mm. Real bone drilling experiments confirm the system’s ability to effectively provide feedback on skidding during surgery. The proposed method enhances the safety of orthopedic surgical robots and offers crucial sensing information for lateral forces and skidding, paving the way for future autonomous bone drilling procedures.

Abstract:
Multi-legged robots offer enhanced stability to navigate complex terrains with their multiple legs interacting with the environment. However, how to effectively coordinate the multiple legs in a larger action exploration space to generate natural and robust movements is a key issue. In this paper, we introduce a motion prior-based approach, successfully applying deep reinforcement learning algorithms to a real hexapod robot. We generate a dataset of optimized motion priors, and train an adversarial discriminator based on the priors to guide the hexapod robot to learn natural gaits. The learned policy is then successfully transferred to a real hexapod robot and demonstrates natural gait patterns and remarkable robustness without visual information in complex terrains. This is the first time that a reinforcement learning controller has been used to achieve complex terrain walking on a real hexapod robot.

Abstract:
Trust plays a crucial role in user performance during collaborative human-robot interaction. This study examines how varying levels of autonomy and system errors affect user trust and cognitive load in collaborative tasks between robots and humans. Participants performed a collaborative task using a UR5 robotic arm to place four bottles of different shapes into a box within a three-minute time frame under three conditions: (C1) full manual control by the user, (C2) autonomous operation with few errors—where the robot fails to correctly place one out of four bottles and the user can intervene upon detecting failures, and (C3) autonomous operation with frequent errors—where the robot fails to correctly place three out of four bottles, with user intervention allowed upon failure detection. Physiological indicators such as blink rate, galvanic skin response (GSR), and facial temperature, along with task performance metrics such as success rate and completion time were tracked. The results showed that participants experienced the highest cognitive load in Condition 1, as indicated by higher NASA-TLX scores, increased blink rates (average of 65 blinks per minute), elevated facial temperatures, and higher GSR readings. Trust levels were lowest in Condition 3, with 74% of participants reporting low trust, highlighting the significant impact of robot reliability on user’s trust. A strong negative correlation was found between cognitive load and trust in Condition 3 suggesting that increased cognitive load due to frequent robot errors leads to decreased trust. These findings contribute to understanding how system errors and autonomy levels influence cognitive load and trust in collaborative human-robot tasks. The insights gained can inform the design of collaborative robotic systems that balance autonomy and reliability, enhancing user experience and performance.

Abstract:
Various methods have been proposed to achieve high output torque and a wide output range for fast and high-load robotic motions. However, in robots composed of slender frames, such as humanoid robots, the limited space available for actuators and transmission components restricts the application of conventional methods. In this paper, we propose a Variable Chain Motor (VC Motor), an electric actuator that features both shape variability and speed-torque characteristics variability. Shape variability refers to the ability of the actuator to change its form during operation. This property enhances output torque by enabling a dense motor arrangement even under spatial constraints imposed by the frame structure. For example, the actuator can be placed across adjacent frames and deform according to joint rotation. Speed-torque characteristics variability allows switching output characteristics during operation using a dedicated electrical circuit. This enables an expanded range of output speed and torque without significantly increasing size or weight. We evaluated the performance of the developed VC Motor by measuring output torque and efficiency. Furthermore, by applying the VC Motor to the elbow joint of a humanoid robot, we demonstrated its capability for high-speed and high-load operations.

Abstract:
In this paper, we propose a novel design of an electrically actuated robotic leg, called the DecARt (Decoupled Actuation Robot) Leg, aimed at performing agile locomotion. This design incorporates several new features, such as the use of a quasi-telescopic kinematic structure with rotational motors for decoupled actuation, a near-anthropomorphic leg appearance with a forward facing knee, and a novel multi-bar system for ankle torque transmission from motors placed above the knee. To analyze the agile locomotion capabilities of the design numerically, we propose a new descriptive metric, called the "Fastest Achievable Swing Time" (FAST), and perform a quantitative evaluation of the proposed design and compare it with other designs. Then we evaluate the performance of the DecARt Leg-based robot via extensive simulation and preliminary hardware experiments.

Abstract:
This work proposes a new unsupervised infrared and visible image fusion method based on salient object segmentation, which can obtain a fused image with more information on the salient object and realize salient object segmentation under poor illumination. The new method can be divided into four steps: (1) A new superpixel segmentation method based on simple linear iterative clustering (SLIC) with K-means subdivision is used to initially process the infrared and visible images, which has better superpixel segmentation quality. (2) An improved superpixel-based Density Peaks Clustering (DPC) is used to realize the salient object segmentation of the infrared image; it selects the cluster centers automatically with less computation cost. (3) A new GrabCut strategy using the eroded and dilated salient object regions of the infrared image to predetermine the foreground and background respectively is used to achieve the salient object segmentation of the visible image, which can be totally automatic with better salient object segmentation quality. (4) An image fusion strategy is used to realize the final image fusion, which treats the salient object region and the background separately. Experiments were carried out under different poor illumination scenes in the real world. The experimental results show that the new infrared and visible image fusion method is successful with QAB/F greater than 0.69. In addition, the provided superpixel segmentation method, salient object segmentation method and new GrabCut strategy are also effective. The research results provide an effective infrared and visible image fusion approach and three useful methods, which can provide a good reference for researchers. The research work also reveals the application potential of DPC in image fusion and salient object segmentation, and broadens the application fields of DPC.
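
As an illustration of step (2), the following is a minimal sketch of the standard Density Peaks Clustering procedure on a small set of feature points. It is not the improved, superpixel-based variant described above: the Gaussian density kernel, the cutoff `d_c`, and the rule of picking centers by the largest product of density and separation are assumptions made for this example, and the cluster count is supplied by hand rather than selected automatically.

```python
import math


def density_peaks(points, d_c, n_clusters):
    """Standard DPC: returns (labels, centers) for a list of coordinate tuples."""
    n = len(points)
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Local density rho_i with a Gaussian kernel of width d_c (assumed kernel).
    rho = [
        sum(math.exp(-((dist[i][j] / d_c) ** 2)) for j in range(n) if j != i)
        for i in range(n)
    ]

    # Visit points from densest to sparsest.
    order = sorted(range(n), key=lambda i: -rho[i])

    # delta_i: distance to the nearest point that comes earlier in the order
    # (i.e. has higher density); parent_i remembers that point.
    delta = [0.0] * n
    parent = [-1] * n
    for rank, i in enumerate(order):
        if rank == 0:
            delta[i] = max(dist[i])  # convention for the global density peak
            continue
        j = min(order[:rank], key=lambda k: dist[i][k])
        delta[i] = dist[i][j]
        parent[i] = j

    # The global peak is always a center; the remaining centers are the points
    # with the largest rho * delta (a common heuristic, assumed here).
    rest = sorted(order[1:], key=lambda i: -rho[i] * delta[i])
    centers = [order[0]] + rest[: n_clusters - 1]

    # Each non-center point inherits the label of its denser nearest neighbor.
    labels = [-1] * n
    for cluster_id, c in enumerate(centers):
        labels[c] = cluster_id
    for i in order:
        if labels[i] == -1:
            labels[i] = labels[parent[i]]
    return labels, centers


# Two well-separated blobs; indices 4 and 9 are the blob midpoints.
blob_a = [(0.0, 0.0), (0.1, 0.0), (0.0, 0.1), (0.1, 0.1), (0.05, 0.05)]
blob_b = [(5.0, 5.0), (5.1, 5.0), (5.0, 5.1), (5.1, 5.1), (5.05, 5.05)]
labels, centers = density_peaks(blob_a + blob_b, d_c=0.2, n_clusters=2)
```

In a superpixel setting, each point would be a superpixel feature vector (for example mean intensity and position) rather than a raw 2D coordinate, which keeps the pairwise distance matrix small.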

Abstract:
In this work, we developed a peristaltic bioreactor with an enhanced crushing capability, inspired by the structure of the avian gizzard. Existing peristaltic bioreactors have limited ability to crush boluses, which makes the fermentation of substances such as agar gel time-consuming. To improve crushing capacity, we focused on the bird gizzard. Birds utilize pebbles in their gizzard to aid in food crushing. Our approach replicates this mechanism by incorporating both fixed and freely movable spherical solids, which are compressed during operation, inside the bioreactor. An agar gel crushing experiment demonstrated improved crushing efficiency. Furthermore, in a mixed fermentation experiment using milk agar gel and yogurt, the pH value declined compared with that observed using a conventional device, indicating an increase in lactic acid bacteria. These results confirm that the proposed method effectively enhances fermentation.

Abstract:
We address the problem of object arrangement and scheduling for sequential 3D printing. Unlike standard 3D printing, where all objects are printed slice by slice at once, in sequential 3D printing, objects are completed one after the other. In the sequential case, it is necessary to ensure that the moving parts of the printer do not collide with previously printed objects. We look at the sequential printing problem from the perspective of combinatorial optimization. We propose to express the problem as a linear arithmetic formula, which is then solved using a solver for satisfiability modulo theories (SMT). However, we do not solve the formula expressing the problem of object arrangement and scheduling directly; instead, we propose a technique inspired by counterexample guided abstraction refinement (CEGAR), which turned out to be a key innovation for efficiency.

Abstract:
Autonomous drifting is a complex and crucial maneuver for safety-critical scenarios like slippery roads and emergency collision avoidance, requiring precise motion planning and control. Traditional motion planning methods often struggle with the high instability and unpredictability of drifting, particularly when operating at high speeds. Recent learning-based approaches have attempted to tackle this issue but often rely on expert knowledge or have limited exploration capabilities. Additionally, they do not effectively address safety concerns during learning and deployment. To overcome these limitations, we propose a novel Safe Reinforcement Learning (RL)-based motion planner for autonomous drifting. Our approach integrates an RL agent with model-based drift dynamics to determine desired drift motion states, while incorporating a Predictive Safety Filter (PSF) that adjusts the agent’s actions online to prevent unsafe states. This ensures safe and efficient learning, and stable drift operation. We validate the effectiveness of our method through simulations on a Matlab-Carsim platform, demonstrating significant improvements in drift performance, reduced tracking errors, and computational efficiency compared to traditional methods. This strategy promises to extend the capabilities of autonomous vehicles in safety-critical maneuvers.

Abstract:
Autonomous berthing is a typical task for maritime operations of unmanned surface vessels (USVs). However, during berthing operations, USVs are subject to multimodal disturbances, such as external ocean disturbances (EODs) and internal thruster faults (ITFs), as well as various constraints, including underactuated nonlinear dynamic constraints, actuator saturation constraints, and obstacle avoidance constraints. In this paper, a fault-tolerant model predictive control (FTMPC) framework is proposed for safe USV berthing by uniformly considering both disturbances and constraints. Specifically, a control density function is integrated into the FTMPC framework to model the obstacle avoidance constraint. Moreover, leveraging the backstepping method and fuzzy logic system, an auxiliary control law considering the EODs and ITFs is constructed as a constraint. Furthermore, sufficient conditions that ensure recursive feasibility, and thus closed-loop stability, are provided analytically. Within this FTMPC framework, multimodal disturbances and various constraints can be naturally considered simultaneously. Multiple experimental results in autonomous berthing scenarios demonstrate that the proposed method has excellent fault tolerance and safety under multimodal disturbances and various constraints.

Abstract:
This paper proposes a precision autonomous landing system for unmanned aerial vehicles (UAVs) targeting high-speed moving platforms. By integrating gimbal-based precise positioning, smooth trajectory generation, and dynamically robust control, the system addresses key challenges in high-speed landing scenarios, such as significant visual localization deviations and difficulties in dynamic trajectory planning and control. The study introduces the Comprehensive Coordinate System (CCS-3AG) to eliminate dynamic optical-axis misalignment errors in the gimbal, thereby enhancing the gimbal’s ranging accuracy and control precision. We combine an enhanced single-stage minimum control (MINCO) trajectory framework (L-MINCO) with a bidirectional command update strategy to achieve fast and accurate trajectory planning that accounts for dynamic delays, and design an Incremental Nonlinear Dynamic Inversion (INDI) controller for high-dynamic command tracking. Simulation and real-flight experiments demonstrate that, at target speeds between 0 and 7.7 m/s, the system attains an average landing precision of 0.108 meters, with a success rate of 97.78% across 90 actual landing tests, outperforming existing landing methods. This work provides a highly robust solution for UAV logistics delivery and emergency landing scenarios.

Abstract:
The unique optical characteristics of the underwater environment, such as light refraction and loss of salient features, pose a significant challenge to traditional vision sensors, especially in the swarm operation scenario where multiple autonomous underwater vehicles (AUVs) cooperate with the mother ship for positioning. To address these challenges, this study proposes the application of soft constraints on mirror position to improve the robustness of optical target positioning in shallow water. During the mapping phase, we establish and optimize the pose relationships between ArUco markers and their mirrors, thereby expanding the locatable space for AUVs. With the arrangement of ArUco markers unchanged, the number of usable markers doubles. Surface mirror-assisted positioning provides more visual features and additional computed corner points, enhancing the reliability of camera observations and improving positioning accuracy. Experimental results demonstrate that, compared to classical algorithms from the Artificial Vision Applications (A.V.A) laboratory, our method improves position accuracy by 25.8% in single-marker scenarios and by 14.7% in multi-marker scenarios. Therefore, our method provides enhanced mapping and positioning for AUVs in shallow water areas where optical markers can be deployed. This study provides a new mapping paradigm and multi-body localization solution for optical marker-guided underwater swarm operations.

Abstract:
The presence of robots is growing rapidly throughout the world. The robotics education community should follow the trend and modernize the tools used accordingly. The authors introduce here a Robotic Virtual Laboratory that runs in real time and covers four categories of robots, namely serial manipulators, parallel manipulators, wheeled mobile robots, and autonomous off-road vehicles obtained by combining a wheeled platform with serial manipulators. The topics covered start with motion analysis, continue with control design, and end with sensor perception, thus rendering this lab an essential tool for robotics and control engineers as well as computer scientists. The use of (i) a robot simulation platform, (ii) real-time simulation, and (iii) a cosimulation framework forms the backbone of this laboratory. The results of a survey of students who participated in this pilot project are shown.

Abstract:
Microfluidic technology is currently a popular approach in the field of single-cell research, which is used to reveal the heterogeneity among cells. However, most of the existing microfluidic technologies for single-cell research lack the ability to control the microenvironment of single cells after isolating them. In this work, a technology that combines lateral-field optoelectronic tweezers (LOET) with electrowetting-on-dielectric (EWOD) is used to separate cells into single cells and then encapsulate each single cell within an individual droplet, generating single-cell droplets. More importantly, it also enables the control of the microenvironment of the separated single cells. The driving control of the single-cell droplets is achieved through the EWOD, which has good application prospects in the field of single-cell research.

Abstract:
Annotating real-world LiDAR point clouds for use in intelligent autonomous systems is costly. To overcome this limitation, self-training-based Unsupervised Domain Adaptation (UDA) has been widely used to improve point cloud semantic segmentation by leveraging synthetic point cloud data. However, we argue that existing methods do not effectively utilize unlabeled data, as they rely on predefined or fixed confidence thresholds, resulting in suboptimal performance. In this paper, we propose a Dynamic Pseudo-Label Filtering (DPLF) scheme to enhance real data utilization in point cloud UDA semantic segmentation. Additionally, we design a simple and efficient Prior-Guided Data Augmentation Pipeline (PG-DAP) to mitigate domain shift between synthetic and real-world point clouds. Finally, we utilize a data mixing consistency loss to push the model to learn context-free representations. We implement and thoroughly evaluate our approach through extensive comparisons with state-of-the-art methods. Experiments on two challenging synthetic-to-real point cloud semantic segmentation tasks demonstrate that our approach achieves superior performance. Ablation studies confirm the effectiveness of the DPLF and PG-DAP modules. We release the code of our method in this paper.
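
To make the contrast between fixed and dynamic confidence thresholds concrete, here is a generic sketch of per-class thresholds that track the model's recent confidence through an exponential moving average. This is not the DPLF scheme itself, whose details the abstract does not give; the function names, the EMA rule, and the `momentum` and `default` values are all assumptions for illustration.

```python
def update_thresholds(thresholds, confidences, labels, momentum=0.9):
    """Move each class threshold toward the batch's mean confidence for that class.

    thresholds:  dict mapping class id -> current threshold
    confidences: max softmax score per point
    labels:      predicted (pseudo) class id per point
    """
    sums, counts = {}, {}
    for conf, cls in zip(confidences, labels):
        sums[cls] = sums.get(cls, 0.0) + conf
        counts[cls] = counts.get(cls, 0) + 1

    updated = dict(thresholds)
    for cls, total in sums.items():
        mean_conf = total / counts[cls]
        previous = thresholds.get(cls, mean_conf)
        updated[cls] = momentum * previous + (1.0 - momentum) * mean_conf
    return updated


def filter_pseudo_labels(confidences, labels, thresholds, default=0.9):
    """Return indices of points whose confidence reaches their class threshold."""
    return [
        i
        for i, (conf, cls) in enumerate(zip(confidences, labels))
        if conf >= thresholds.get(cls, default)
    ]


# One easy class (0) and one hard class (1), both starting at a fixed 0.9.
conf = [0.95, 0.6, 0.7, 0.5]
pred = [0, 1, 1, 1]
thr = update_thresholds({0: 0.9, 1: 0.9}, conf, pred, momentum=0.5)
kept = filter_pseudo_labels(conf, pred, thr)
```

With a fixed threshold of 0.9, the hard class would never contribute pseudo-labels; with the moving threshold, repeated updates lower its bar toward the confidence the model actually reaches on that class.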

Abstract:
A key challenge in human-robot interaction research lies in developing robotic systems that can effectively perceive and interpret social cues, facilitating natural and adaptive interactions. In this work, we present a novel framework for enhancing the attention of the iCub humanoid robot by integrating advanced perceptual abilities to recognise social cues, understand surroundings through generative models, such as ChatGPT, and respond with contextually appropriate social behaviour. Specifically, we propose an interaction task implementing a narrative protocol (storytelling task) in which the human and the robot create a short imaginary story together, taking turns exchanging cubes with creative images placed on them. To validate the protocol and the framework, experiments were performed to quantify the degree of usability and the quality of experience perceived by participants interacting with the system. Such a system can be beneficial in promoting effective human-robot collaborations, especially in assistance, education and rehabilitation scenarios where social awareness and robot responsiveness play a pivotal role.

Abstract:
The evolution from motion capture and teleoperation to robot skill learning has emerged as a hotspot and critical pathway for advancing embodied intelligence. However, existing systems still face a persistent gap in simultaneously achieving four objectives: accurate tracking of full upper limb movements over extended durations (Accuracy), ergonomic adaptation to human biomechanics (Comfort), versatile data collection (e.g., force data) and compatibility with humanoid robots (Versatility), and lightweight design for outdoor daily use (Convenience). We present NuExo, a wearable exoskeleton system incorporating user-friendly immersive teleoperation and multi-modal sensing collection to bridge this gap. Thanks to a novel shoulder mechanism with synchronized linkage and timing belt transmission, this system can adapt well to compound shoulder movements and cover 100% of natural upper limb motion ranges. Weighing 5.2 kg, NuExo supports backpack-type use and can be conveniently applied in daily outdoor scenarios. Furthermore, we develop a unified intuitive teleoperation framework and a comprehensive data collection system integrating multi-modal sensing for various humanoid robots. Experiments across distinct humanoid platforms and different users validate our exoskeleton’s superiority in motion range and flexibility, while confirming its stability in data collection and teleoperation accuracy in dynamic scenarios. The videos are available on our project website at https://nubot-nuexo.github.io/

Abstract:
In this study, we introduce a simulation-based modeling framework for the optimal design of our recently developed inflatable endoscopic vision-based tactile sensing balloon (E-VTSB). Of note, E-VTSB is designed for providing a safe and high-resolution textural mapping and morphology characterization of colorectal cancer (CRC) polyps to enhance the early diagnosis of cancerous polyps. Leveraging the Simulation Open Framework Architecture (SOFA) software and by performing complementary experimental validation, we thoroughly analyzed and investigated the impact of the elastic modulus of the material constitution of E-VTSB on its deformation behavior under different applied pressures. Our findings revealed a close correlation between the simulated outcomes and experimental data performed on two different E-VTSBs. In particular, with the maximum absolute deformation error of <12%, our results clearly validated the proposed framework’s accuracy in predicting the E-VTSB’s deformation trend and its potential use for optimizing the design parameters.

Abstract:
Monitoring of waterways such as remote and hazardous rivers and streams is important so as to assess the impact of external factors including construction runoff or climate change. Versatile, autonomous robotic boats can offer excellent environmental inspection and monitoring solutions for remote, dangerous, or access protected water bodies but they have several shortcomings in terms of maneuverability. This paper proposes an environmental inspection system consisting of an autonomous data collection buoy which is designed to be deployed to inaccessible river systems using a drone. The system can perform a drop off and pickup of the buoy depending on the requirements of a particular location and monitoring task. Utilising the natural flow of the river the buoy autonomously steers down, using GPS and magnetometers so as to maintain the desired trajectory. The buoy is capable of measuring water temperature but it can also be equipped with a range of sensors such as water oxygen meter, sonar for river bed inspection, or turbidity for water clarity. This paper describes the system design, presents an analysis of the self-righting capabilities of the buoy, and shows a full system demonstration at the Ōrewa River in Auckland, New Zealand.

Abstract:
In this study, we introduce a new concept for reconstruction of Volumetric Muscle Loss (VML) injuries and propose the spatial robotic embedded bioprinting technique. As opposed to the traditional layer-by-layer printing, we leverage the support-free nature of embedded bioprinting to print spatial and complex structures of fascicles in a fusiform muscle. To demonstrate feasibility of this concept, we first propose our robotic bioprinting framework including a robotic arm integrated with a custom-designed bioprinting injector. Complementary motion planning algorithms uniquely designed for this printing task are further proposed. Moreover, the effect of embedded bioprinting parameters, as well as the supporting bath and injecting materials compatibility on the uniformity and quality of the printed constructs has been analyzed. Finally, we perform a case study by printing a fusiform muscle-shape construct using the proposed concept and algorithms, and evaluate the quality of the printed structure.

Abstract:
Mechanical characterization of human oocytes holds great promise for improving the chances of pregnancy in assisted reproduction programs. However, the development of a high-performance device comes up against the numerous biological and normative constraints of medically Assisted Reproduction Technology (ART). The patented EggSense platform overcomes these difficulties and enables the mechanical characterization of human oocytes under clinical conditions. In this article, the focus is on the significant improvements achieved in EggSense by deploying advanced control techniques, based on Virtual Input Rejection COntrol (VIRCO). This approach is used to control the position of the force-sensing element in contact with the oocyte. Its implementation is fully described and experimentally validated. A comparison with a conventional controller is also provided to illustrate some of the benefits of VIRCO.

Abstract:
Virtual Reality (VR) robot embodiment is a popular method for teleoperation and generating data to train AI control systems. A potential flaw with this approach is an incongruence in human-robot haptics. Without tactile feedback, teleoperators depend on visual cues, leading to suboptimal performance (slow movement, faulty grasps, excessive contact force). This problem is exacerbated during robot hand teleoperation. Worse, many haptic hand wearables are cable-driven and only produce unidirectional resistive force, thus failing to cover the gamut of 3D finger interactions. In this paper, a critical gap in teleoperated human-robot hand haptics is addressed by transforming an inexpensive (~500 USD) 3D-printed robotic hand into a wearable exoskeleton—the "Hand-in-Hand" (HiH) system—which provides fingertip 3D force feedback. Methods for force control, null-space optimization, and human finger pose estimation are presented and experimentally validated. Each finger of the device can produce a maximum of 1N of force feedback in any 3D direction. HiH finger tracking is compared to an industry-grade device (MANUS Metagloves, 5000 USD) and realizes an average inferred joint position Normalized Root Mean Square Error of 33.46%. Lastly, the HiH is demonstrated within a VR robot embodiment experiment with force feedback. Operator hand-manipulation performance improved when using force feedback, emphasizing the HiH’s potential for teleoperated robot control and data collection.
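
For readers unfamiliar with the tracking metric, the sketch below computes a Normalized Root Mean Square Error between a reference joint trajectory and an estimated one. The abstract does not state which normalization is used; dividing by the range of the reference signal is an assumption made here (normalizing by the mean or the standard deviation are other common choices), and the function name is hypothetical.

```python
import math


def nrmse_percent(reference, estimate):
    """RMSE between two equal-length sequences, as a percentage of the reference range."""
    if not reference or len(reference) != len(estimate):
        raise ValueError("sequences must be non-empty and of equal length")
    mse = sum((r - e) ** 2 for r, e in zip(reference, estimate)) / len(reference)
    span = max(reference) - min(reference)
    if span == 0:
        raise ValueError("reference has zero range; NRMSE is undefined")
    return 100.0 * math.sqrt(mse) / span


# Joint angle in radians from a reference glove vs. an inferred estimate (made-up values).
reference_angles = [0.0, 1.0, 2.0, 3.0, 4.0]
inferred_angles = [0.0, 1.0, 2.0, 3.0, 5.0]
error = nrmse_percent(reference_angles, inferred_angles)
```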

Abstract:
For trajectory generation of whole-body jumping motions such as humanoid backflips, it is crucial to simultaneously optimize the takeoff, flight, and landing phases while considering full-body dynamics and kinematics. Although such methods have been proposed for standard jumping motions, they have not been applied to more dynamic actions such as frontflips, backflips, and yaw twist jumps, where strong nonlinearity and high sensitivity to certain parameters (e.g., rotor inertia and torque cost weights) pose significant challenges. To address these challenges, we apply a two-stage optimization strategy to an existing full-body dynamics optimization method that simultaneously optimizes the takeoff, flight, and landing phases. In our approach, the same initialization and reference trajectory generation rules are shared across motions, and the solution from the first optimization is used not only as an initial guess but also as a reference in the second optimization. This strategy improves the convergence of the KKT residuals across various jump types and mitigates sensitivity to parameters such as rotor inertia and torque cost weights. As a result, our method achieves unified trajectory generation for frontflips, backflips, yaw twist jumps, and standard jumps using the same initialization, cost weights, and constraints. We also analyze the sensitivity to rotor inertia and show that exceeding a certain threshold can lead to a sharp deterioration in KKT residual convergence.

Abstract:
Collision-free motion planning in complex outdoor environments relies heavily on perceiving the surroundings through exteroceptive sensors. A widely used approach represents the environment as a voxelized Euclidean distance field, where robots are typically approximated by spheres. However, for large-scale manipulators such as forestry cranes, which feature long and slender links, this conventional spherical approximation becomes inefficient and inaccurate. This work presents a novel collision detection algorithm specifically designed to exploit the elongated structure of such manipulators, significantly enhancing the computational efficiency of motion planning algorithms. Unlike traditional sphere decomposition methods, our approach not only improves computational efficiency but also naturally eliminates the need to fine-tune the approximation accuracy as an additional parameter. We validate the algorithm’s effectiveness using real-world LiDAR data from a forestry crane application, as well as simulated environment data.
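
One well-known way to check a slender link against a distance field without placing a fixed set of spheres is to march along the link axis with a step size given by the local clearance. The sketch below illustrates that general idea only; it is not the algorithm proposed in this work. The link is modeled as a capsule (a segment with a radius), and `dist_fn` stands in for a lookup into a voxelized Euclidean distance field; both the function name and the `min_step` safeguard are assumptions for this example.

```python
import math


def capsule_collides(dist_fn, a, b, radius, min_step=1e-3):
    """Check a capsule (segment a-b with the given radius) against a distance field.

    dist_fn(p) must return the distance from point p to the nearest obstacle.
    Because a Euclidean distance field changes by at most 1 per unit of travel,
    a point with clearance c guarantees the next c units of the axis are free,
    so the march can skip ahead by that amount.
    """
    length = math.dist(a, b)
    if length == 0.0:
        return dist_fn(a) <= radius
    direction = [(q - p) / length for p, q in zip(a, b)]

    t = 0.0
    while True:
        point = [p + t * d for p, d in zip(a, direction)]
        clearance = dist_fn(point) - radius
        if clearance <= 0.0:
            return True  # obstacle within the link radius
        if t >= length:
            return False  # end point checked, whole axis is clear
        # min_step guarantees termination near grazing contact; features
        # thinner than min_step could be missed in that regime.
        t = min(length, t + max(clearance, min_step))


# Stand-in distance field: a single spherical obstacle of radius 0.2.
def sphere_field(p, center=(5.0, 0.5, 0.0), obstacle_radius=0.2):
    return math.dist(p, center) - obstacle_radius


start, end = (0.0, 0.0, 0.0), (10.0, 0.0, 0.0)
thin_link_hits = capsule_collides(sphere_field, start, end, radius=0.1)
thick_link_hits = capsule_collides(sphere_field, start, end, radius=0.4)
```

In open space the march covers a long link in a handful of queries, whereas a fixed sphere decomposition needs a number of spheres proportional to the link length divided by its radius.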

Abstract:
Teleoperation with haptic feedback allows users to interact with remote environments while retaining a sense of touch. However, the stability and transparency of these systems are compromised under communication network delay. This paper presents an augmented Model-Mediated Teleoperation with 3D object and dynamic environment tracking (MMT-DET) by a vision-based algorithm, enabling users to receive haptic feedback in structured dynamic environments while maintaining robustness against network delays. A user study comparing the proposed method with teleoperation using the Time Domain Passivity Approach (TDPA) was conducted. The results demonstrate that our MMT-DET exhibits robustness to varying delays in task performance and outperforms TDPA at higher delay levels.

Abstract:
Position Based Dynamics (PBD) has been widely adopted for interactive simulation, particularly in applications such as virtual surgery and elastodynamics. However, many existing frameworks focus exclusively on interactive deformation, often neglecting the comprehensive analysis of stress distribution, which is a critical factor in engineering assessments. In remote disaster rescue scenarios, real-time stress visualization can provide vital insights that enable rescue teams to make informed decisions when interacting with deformable objects. In this work, we introduce a fully GPU-parallel, stress-driven simulation framework for real-time deformable human body dynamics, specifically designed for disaster rescue applications. Our approach computes the von Mises stress for each tetrahedral element using a practical Neo-Hookean material model, projects the stress onto the corresponding mesh vertices, and maps these values onto the surface for intuitive rendering. In particular, we address the convergence limitations of general Neo-Hookean constraints under a Jacobi parallel scheme by developing a robust, improved approach. This improved approach avoids the typical volume loss observed in conventional methods and better replicates the qualitative behavior of the Gauss-Seidel scheme. Quantitative experiments using 100 human body models with diverse shapes, heights, and weights demonstrate that our framework effectively maintains the original pose while delivering enhanced physical realism and informative stress visualization. This capability provides disaster rescue teams with critical insights to optimize decision-making during emergencies.
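
The per-element stress step described above can be illustrated with a minimal sketch. It assumes the Cauchy stress tensor of each tetrahedron is already available (the Neo-Hookean evaluation itself is not shown), and it uses a plain unweighted average to move element values onto vertices; the abstract does not state which projection the authors use, so `per_vertex_stress` and its averaging rule are hypothetical.

```python
import numpy as np

def von_mises(sigma: np.ndarray) -> float:
    """Von Mises equivalent stress of a symmetric 3x3 Cauchy stress tensor."""
    # Deviatoric part: remove the hydrostatic (mean) stress.
    s = sigma - np.trace(sigma) / 3.0 * np.eye(3)
    # sigma_vm = sqrt(3/2 * s_ij s_ij)
    return float(np.sqrt(1.5 * np.sum(s * s)))

def per_vertex_stress(tets, tet_stress, n_vertices):
    """Assumed projection: each vertex gets the mean stress of its adjacent tets."""
    acc = np.zeros(n_vertices)
    cnt = np.zeros(n_vertices)
    for tet, value in zip(tets, tet_stress):
        for v in tet:
            acc[v] += value
            cnt[v] += 1
    # Vertices that belong to no tetrahedron keep a stress of zero.
    return acc / np.maximum(cnt, 1)
```

A uniaxial stress state returns the applied stress itself, and a purely hydrostatic state returns zero, which is a quick sanity check for any implementation of this formula.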

Abstract:
The interest in Physical Human-Robot Interaction (pHRI) has significantly increased over the last two decades, thanks to the availability of collaborative robots that ensure user safety during force exchanges. For this reason, stability concerns remain a key issue in the literature whenever new control schemes for pHRI applications are proposed. Due to the nonlinear nature of robots, stability analyses generally rely on passivity concepts and consider ideal models of robot manipulators. Therefore, the first objective of this paper is to conduct a detailed analysis of the sources of instability in proxy-based constrained admittance controllers for pHRI applications, taking into account parasitic effects such as transmission elasticity, motor velocity saturation, and actuation delay. The second objective of this paper is to perform a sensitivity analysis, supported by experimental results, to identify how the control parameters affect the stability of the overall system. Finally, the third objective is to propose an adaptation technique for the proxy parameters, with the goal of maximizing transparency in pHRI. The proposed adaptation method is validated through simulations and experimental tests.

Abstract:
In spinal surgery, ensuring surgical precision and safety is paramount. Traditionally, surgeons have relied on their experience to determine when to cease milling as the cutter approaches the spinal cord; however, improper technique during this process can lead to complications, such as vertebral plate fractures and spinal cord injuries. This paper investigates the development of a robot capable of high-precision recognition of the milling state. Initially, we identify vibration signals as the basis for state recognition, establishing their feasibility through theoretical analysis, which provides a foundation for the creation of datasets for subsequent milling experiments. We then conducted milling experiments using pig scapulae and designed neural networks for state identification. Vibration signals corresponding to varying milling depths and proportions of cortical and cancellous bone layers were collected. A BiLSTM-based neural network was developed to identify the milling depth and the proportion of the bone layers. The results demonstrate that the proposed system achieves high accuracy in state recognition, with errors falling within an acceptable range. This research highlights the potential of integrating advanced neural networks and vibration analysis into robotic systems to enhance precision and safety in spinal surgery.

Abstract:
This paper presents a novel underactuated adaptive robotic hand, Hockens-A Hand, which integrates the Hoeckens mechanism, a double-parallelogram linkage, and a specialized four-bar linkage to achieve three adaptive grasping modes: parallel pinching, asymmetric scooping, and enveloping grasping. Hockens-A Hand requires only a single linear actuator, leveraging passive mechanical intelligence to ensure adaptability and compliance in unstructured environments. Specifically, the vertical motion of the Hoeckens mechanism introduces compliance, the double-parallelogram linkage ensures line contact at the fingertip, and the four-bar amplification system enables natural transitions between different grasping modes. Additionally, the inclusion of a mesh-textured silicone phalanx further enhances the ability to envelop objects of various shapes and sizes. This study employs detailed kinematic analysis to optimize the push angle and design the linkage lengths for optimal performance. Simulations validated the design by analyzing the fingertip motion and ensuring smooth transitions between grasping modes. Furthermore, the grasping force was analyzed using power equations to enhance the understanding of the system’s performance. Experimental validation using a 3D-printed prototype demonstrates the three grasping modes of the hand in various scenarios under environmental constraints, verifying its grasping stability and broad applicability.

Abstract:
Object detection models often struggle with class imbalance, where rare categories appear significantly less frequently than common ones. Existing sampling-based rebalancing strategies, such as Repeat Factor Sampling (RFS) and Instance-Aware Repeat Factor Sampling (IRFS), mitigate this issue by adjusting sample frequencies based on image and instance counts. However, these methods are based on linear adjustments, which limit their effectiveness in long-tailed distributions. This work introduces Exponentially Weighted Instance-Aware Repeat Factor Sampling (E-IRFS), an extension of IRFS that applies exponential scaling to better differentiate between rare and frequent classes. E-IRFS adjusts sampling probabilities using an exponential function applied to the geometric mean of image and instance frequencies, ensuring a more adaptive rebalancing strategy. We evaluate E-IRFS on a dataset derived from the Fireman-UAV-RGBT Dataset and four additional public datasets, using YOLOv11 object detection models to identify fire, smoke, people and lakes in emergency scenarios. The results show that E-IRFS improves detection performance by 22% over the baseline and outperforms RFS and IRFS, particularly for rare categories. The analysis also highlights that E-IRFS has a stronger effect on lightweight models with limited capacity, as these models rely more on data sampling strategies to address class imbalance. The findings demonstrate that E-IRFS improves rare object detection in resource-constrained environments, making it a suitable solution for real-time applications such as UAV-based emergency monitoring. The code is available at: https://github.com/futurians/E-IRFS.
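
To make the sampling idea concrete, the sketch below computes a per-class frequency as the geometric mean of image and instance frequencies, then compares a square-root repeat factor (the form used by RFS-style methods) with an exponential one. The abstract does not give the exact E-IRFS expression, so the exponential form `exp(alpha * sqrt(t / f))`, the threshold `t`, and the scale `alpha` are illustrative assumptions rather than the authors' definition.

```python
import math
from collections import Counter

def class_frequencies(images):
    """images: list of per-image label lists (one label per annotated instance).

    Returns, per class, the geometric mean of its image frequency and
    its instance frequency.
    """
    n_images = len(images)
    n_instances = sum(len(labels) for labels in images)
    img_count, inst_count = Counter(), Counter()
    for labels in images:
        inst_count.update(labels)        # every instance counts
        img_count.update(set(labels))    # each image counts once per class
    return {
        c: math.sqrt((img_count[c] / n_images) * (inst_count[c] / n_instances))
        for c in inst_count
    }

def repeat_factor(f, t=0.1, mode="linear", alpha=1.0):
    """Class-level repeat factor for frequency f and threshold t."""
    if mode == "linear":
        # Square-root rule of RFS-style sampling.
        return max(1.0, math.sqrt(t / f))
    # ASSUMED exponential form, for illustration only.
    return math.exp(alpha * math.sqrt(t / f))

def image_repeat_factor(labels, freqs, **kwargs):
    """An image is repeated according to its rarest class."""
    return max(repeat_factor(freqs[c], **kwargs) for c in set(labels))
```

Under this assumed form, the ratio between the repeat factors of a rare class and a frequent class is larger than under the square-root rule, which is the qualitative behavior the abstract attributes to exponential scaling.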

Abstract:
Current research on undulatory propulsion robots has predominantly centered on hydrodynamic performance simulations. However, challenges such as limited mobility and difficulties in parameter identification during underwater biomimetic motion remain unresolved. To address these issues, this study proposes a novel undulating fin robot featuring passive rotational joints, aiming to enhance motion capabilities and facilitate more accurate modeling. These joints enhance both the agility and stability of the robot's movements. Initially, the research develops models for the undulatory motion of the undulating fin and the rotational passive degrees of freedom in the fin rays. Based on fluid drag theory, a hydrodynamic model for undulating fin propulsion is constructed to analyze the thrust, lateral force, and lift generated at varying frequencies. Furthermore, a comprehensive dynamics model for the underwater motion of the biomimetic undulating fin robot is developed. Numerical simulations of the robot's non-steady-state motion are conducted to identify the hydrodynamic parameters of the model, thereby enabling the solution of the dynamic model. The experimental results demonstrate that the robot achieves an underwater straight-line motion speed exceeding 0.5 m/s, a turning speed of approximately 45°/s, and an inclined upward motion speed of 0.21 m/s. This study provides a novel approach for the design of underwater undulating fin robots and the resolution of kinematic models for underwater robots. It is hoped that this research can contribute to the further development of undulatory propulsion robot technology.

Abstract:
The authors have proposed a passive rotating locomotion robot that forms a convex heptagonal body by connecting seven identical linear rigid frames via viscoelastic rotational joints. In our previous study, both numerical simulations and physical experiments confirmed that stable, passive rotating motion could be generated on a downhill slope. This paper proposes two new models in which the seven rigid frames are retained unchanged as a robust exoskeleton, but the elastic elements attached to the rotational joints are removed and repositioned on the diagonals of the convex heptagon to reproduce the flexibility of internal tissue. The elastic elements form a star-shaped polygon called a heptagram, which is formed by connecting seven vertices with a single stroke. The seven vertices can be connected in two different ways to form two different heptagram shapes. We report basic numerical results on how the motion characteristics of the two models change with respect to the slope angle and elastic modulus. An overview of the prototypes developed and the results of basic experiments are also reported.

Affiliations: Department of Industrial Design, Eindhoven University of Technology, Eindhoven, The Netherlands; Department of Biomedical Engineering, Eindhoven University of Technology, Eindhoven, The Netherlands; Department of Engineering, Institute of Science Tokyo, Yokohama, Japan; Department of Engineering, Utrecht University, Utrecht, The Netherlands; Department of Industrial Design, Eindhoven University of Technology, Eindhoven, The Netherlands

Abstract:
Socially Assistive Robotics (SAR) has shown promise in supporting emotion regulation for neurodivergent children. Recently, there has been increasing interest in leveraging advanced technologies to assist parents in co-regulating emotions with their children. However, limited research has explored the integration of large language models (LLMs) with SAR to facilitate emotion co-regulation between parents and children with neurodevelopmental disorders. To address this gap, we developed an LLM-powered social robot by deploying a speech communication module on the MiRo-E robotic platform. This supervised autonomous system integrates LLM prompts and robotic behaviors to deliver tailored interventions for both parents and neurodivergent children. Pilot tests were conducted with two parent-child dyads, followed by a qualitative analysis. The findings reveal MiRo-E’s positive impacts on interaction dynamics and its potential to facilitate emotion regulation, along with identified design and technical challenges. Based on these insights, we provide design implications to advance the future development of LLM-powered SAR for mental health applications.

Abstract:
This paper presents a hierarchical decision-making framework for autonomous navigation in four-wheel independent steering and driving (4WISD) systems. The proposed approach integrates deep reinforcement learning (DRL) for high-level navigation with fuzzy logic for low-level control to ensure both task performance and physical feasibility. The DRL agent generates global motion commands, while the fuzzy logic controller enforces kinematic constraints to prevent mechanical strain and wheel slippage. Simulation experiments demonstrate that the proposed framework outperforms traditional navigation methods, offering enhanced training efficiency and stability and mitigating erratic behaviors compared to purely DRL-based solutions. Real-world validations further confirm the framework’s ability to navigate safely and effectively in dynamic industrial settings. Overall, this work provides a scalable and reliable solution for deploying 4WISD mobile robots in complex, real-world scenarios.

Abstract:
This paper presents GARMI’s successful outdoor demonstration during the Kandahar Ski World Cup, where it performed trophy handovers in sub-zero temperatures. The event highlighted challenges in deploying robots in extreme conditions, including fluctuating temperatures and uneven terrain. GARMI successfully completed the trophy handover during the live event, which was streamed to 60 million viewers. This experience raised two key research questions: the feasibility of high-precision robotics in harsh weather and strategies to compensate for environmental effects. To address them, we extended our previous framework to estimate the mass of the lifted trophy in real time, incorporating IMU data and conducting experiments under varying temperatures and orientations. Experimental results showed that even slight variations in the robot’s base orientation had a significant impact on the accuracy of mass estimation. For instance, a 5° tilt in the robot’s base orientation resulted in a more than 100% increase in mass estimation error. Online mass estimation, performed using a quasi-static model, demonstrated improved accuracy when incorporating IMU-based corrections for base orientation. Additionally, temperature variations were found to affect robot control performance, with tracking errors increasing outside the manufacturer’s recommended temperature range. The findings highlight the need for real-time corrections and compensations for base orientation and temperature in robot dynamics to ensure safe human-robot interaction.

Abstract:
Navigating unknown three-dimensional (3D) rugged environments is challenging for multi-robot systems. Traditional discrete systems struggle with rough terrain due to limited individual mobility, while modular systems—where rigid, controllable constraints link robot units—improve traversal but suffer from high control complexity and reduced flexibility. To address these limitations, we propose the Multi-Robot System with Controllable Weak Constraints (MRS-CWC), where robot units are connected by constraints with dynamically adjustable stiffness. This adaptive mechanism softens or stiffens in real time during environmental interactions, ensuring a balance between flexibility and mobility. We formulate the system’s dynamics and control model and evaluate MRS-CWC against six baseline methods and an ablation variant in a benchmark dataset with 100 different simulation terrains. Results show that MRS-CWC achieves the highest navigation completion rate and ranks second in success rate, efficiency, and energy cost in the highly rugged terrain group, outperforming all baseline methods without relying on environmental modeling, path planning, or complex control. Even where MRS-CWC ranks second, its performance is only slightly behind a more complex ablation variant with environmental modeling and path planning. Finally, we develop a physical prototype and validate its feasibility in a constructed rugged environment. For videos, simulation benchmarks, and code, please visit https://wyd0817.github.io/project-mrs-cwc/.

Abstract:
Lizards are capable of climbing stably on various terrains. Their tails are key to this ability. The lizard uses its flexible tail with graded stiffness as a fifth limb and climbing aid. The tail also enables soft landings, preventing injury from falls. Inspired by this, tails have been incorporated into many climbing robots to enhance their mobility, mimicking lizards. These robotic tails are generally classified as either rigid (stiff) or flexible (soft). A rigid tail can provide a large preload for pitch-back prevention but has a limited contact area for surface adhesion to avoid sliding backward on slopes. In contrast, a flexible tail conforms to the terrain’s contours, increasing the contact area and thereby improving surface adhesion. However, it provides limited preload. Therefore, in this study, we propose a novel hybrid rigid-flexible robotic tail (HIFLEX) that achieves a balanced combination of preload and contact area. The tail structure design features double-sided inclined ribs and is divided into three modular segments (base, middle, and tip), with graded stiffness decreasing progressively from the base to the tip. The asymmetric (inclined) ribbed structure allows the tail to generate anisotropic friction, resulting in high adhesion (tail-to-surface attachment) to prevent backward sliding and low friction (tail-to-surface release) to facilitate upward climbing. The proposed tail is attached to a climbing robot via an actuator capable of pressing the tail downward to generate sufficient preload. The experimental results demonstrate that this unique tail enhances the robot’s climbing performance on rough and deformable slopes while preventing damage to the robot during falls.

Abstract:
Magnetically actuated capsule endoscopic robots (MACERs) are becoming increasingly popular because, free of the restrictions of a mechanical transmission medium, they can reach deep diseased regions inside the body that are difficult or impossible to access with traditional endoscopes. However, MACERs are highly nonlinear, so achieving safe, stable, obstacle-avoiding target tracking control of MACERs remains a challenging research topic. Therefore, to satisfy the diagnosis and treatment needs of deep diseased regions inside the body, this paper designs a neural network control method for MACER target tracking with obstacle avoidance and noise-resistant capabilities. First, the kinematics and obstacle avoidance models of the MACER are established, and a moving-target tracking control scheme with joint motion constraints and obstacle avoidance capabilities is designed for the robot. Next, a noise-resistant neural network is designed to quickly solve the MACER’s control scheme, thereby achieving safe, stable, obstacle-avoiding target tracking control of the MACER. Finally, the effectiveness and practicability of the proposed method are verified by simulation analysis and experiments on the MACER and compared with existing methods. The experimental results indicate that the proposed neural network method can effectively control the MACER to track the target motion along the gastric wall curve. Compared with existing methods, the designed method has stronger resistance to noise interference, its convergence accuracy is improved by 1.3 times, and its computational burden is reduced by 26.7 times.

Abstract:
In situ tissue biopsy with an endoluminal catheter is an efficient approach for disease diagnosis, featuring low invasiveness and few complications. However, the endoluminal catheter struggles to adjust the biopsy direction by distal endoscope bending or proximal twisting for tissue sampling within tortuous luminal organs, due to friction-induced hysteresis and narrow spaces. Here, we propose a pneumatically driven robotic catheter that enables adjustment of the sampling direction without twisting the catheter, for accurate in situ omnidirectional biopsy. The distal end of the robotic catheter consists of a pneumatic bending actuator for the catheter’s deployment in tortuous luminal organs and a pneumatic rotatable biopsy mechanism (PRBM). By hierarchical airflow control, the PRBM can adjust the biopsy direction under low airflow and deploy the biopsy needle with higher airflow, allowing for rapid omnidirectional sampling of tissue in situ. This paper describes the design, modeling, and characterization of the proposed robotic catheter, including repeated deployment assessments of the biopsy needle, puncture force measurement, and validation via phantom tests. The PRBM prototype has six sampling directions evenly distributed across 360 degrees when actuated by a positive pressure of 0.3 MPa. The pneumatically driven robotic catheter provides a novel biopsy strategy, potentially facilitating in situ multidirectional biopsies in tortuous luminal organs with minimal invasiveness.

Abstract:
This study develops an autonomous cleaning robot designed to remove accumulated grease in restaurant kitchen ducts, where human access and manual cleaning are not feasible. Prior studies have developed cleaning mechanisms for round ducts employing planetary gear systems, demonstrating their efficiency in grease removal. However, these systems lack propulsion mechanisms, and cleaning experiments have been limited to short-distance, small-diameter pipes (140 mm, 100A). Therefore, no system has been developed for cleaning grease in long-distance, large-diameter ducts in real-world environments. To address this limitation, we developed a self-propelled cleaning robot integrating a planetary gear-based cleaning mechanism and an inchworm-inspired propulsion mechanism. The design of the propulsion mechanism involved modeling brush rotational torque, gripping torque, and gripping force. Based on this model, a duct inspection and cleaning robot equipped with both propulsion and cleaning mechanisms was developed. Subsequently, the developed robot was tested in a 9 m mock-up duct to evaluate its self-propelled cleaning performance. The robot removed an average of over 85% of the grease under all test conditions while operating autonomously. Finally, a cleaning experiment was conducted in a butcher shop duct, where the robot removed most of the adhered grease. These experiments demonstrated that the developed robot can autonomously clean and inspect ducts in real-world environments where human entry is impractical.

Abstract:
Current pelvic fixation techniques rely on rigid drilling tools, which inherently constrain the placement of rigid medical screws in the complex anatomy of the pelvis. These constraints prevent medical screws from following anatomically optimal pathways and force clinicians to fixate screws along linear trajectories. This suboptimal approach, combined with the unnatural placement of excessively long screws, leads to complications such as screw misplacement, extended surgery times, and increased radiation exposure due to the repeated X-ray images taken to ensure the safety of the procedure. To address these challenges, in this paper, we present the design and development of a unique 4-degree-of-freedom (DoF) pelvic concentric tube steerable drilling robot (pelvic CT-SDR). The pelvic CT-SDR is capable of creating long S-shaped drilling trajectories that follow the natural curvatures of the pelvic anatomy. The performance of the pelvic CT-SDR was thoroughly evaluated through several S-shaped drilling experiments in simulated bone phantoms.

Abstract:
Spinal fixation procedures rely on pedicle screws to stabilize the vertebral column, but conventional rigid pedicle screws (RPS) face challenges such as misplacement, pullout, and loosening, particularly in patients with low bone mineral density (BMD). To overcome these limitations, we recently proposed a flexible pedicle screw (FPS) inserted inside a J-shape trajectory drilled by a steerable drilling robot. Towards biomechanical evaluation of our proposed FPS for spinal fixation procedures, in this paper, we introduce the design, integration, calibration, and evaluation of an optical frequency domain reflectometry (OFDR) strain sensor integrated into an FPS. This sensor-integrated FPS (Si-FPS) provides real-time strain and shape-sensing information, facilitating improved implant functionality assessment and optimization. To thoroughly evaluate the Si-FPS, we first additively manufacture a special FPS and integrate an OFDR shape sensing assembly within its structure. We then assess the shape sensing performance of this sensorized FPS using static and dynamic FPS insertion experiments.

Abstract:
This paper presents the SABD hand, a 16-degree-of-freedom (DoF) robotic hand that departs from purely anthropomorphic designs to achieve an expanded grasp envelope, enable manipulation poses beyond human capability, and reduce the required number of actuators. This is achieved by combining the adduction/abduction (Add/Abd) joint of digits four and five into a single joint with a large range of motion. The combined joint increases the workspace of the digits by 400% and reduces the required DoFs while retaining dexterity. Experimental results demonstrate that the combined Add/Abd joint enables the hand to grasp objects with a side distance of up to 200 mm. Reinforcement learning-based investigations show that the design enables grasping policies that are effective not only for handling larger objects but also for achieving enhanced grasp stability. In teleoperated trials, the hand successfully performed 86% of attempted grasps on suitable YCB objects, including challenging non-anthropomorphic configurations. These findings validate the design’s ability to enhance grasp stability, flexibility, and dexterous manipulation without added complexity, making it well-suited for a wide range of applications.

Abstract:
This paper proposes an in-pipe robot capable of creating route maps for narrow pipelines with inner diameters of both 3 in and 4 in. Pipe route-drawing often relies on external sensors, such as cameras and light detection and ranging (LiDAR). However, due to lighting, dirt, and spatial constraints within the pipeline, miniaturization has been challenging. Therefore, our method uses only internal sensors: a tiny inertial measurement unit (IMU) for robot posture acquisition and an encoder for distance measurement. Furthermore, a sensorless joint torque control system was implemented using a specially designed printed circuit board with motor current regulation, allowing the robot to traverse the pipeline while suppressing excessive torque generation and slippage. Experiments to verify the performance of our route-drawing method were conducted on two types of pipelines, with 3-in and 4-in inner diameters. The mean absolute error in the length of the straight sections was within 3% for all pipes, that in the rotational angle of the bent pipes was within 2 deg, and that in the direction of the straight pipes was within 2 deg.
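
Route-drawing from internal sensors alone can be sketched as simple dead reckoning: each encoder distance increment is projected along the heading reported by the IMU and accumulated. The sample format, the yaw/pitch convention (positive pitch pointing upward), and the function name `integrate_route` are assumptions made for illustration; the abstract does not describe how the authors fuse the two signals or handle roll inside the pipe.

```python
import math

def integrate_route(samples):
    """Dead-reckon a 3D route from (delta_distance_m, yaw_rad, pitch_rad) samples.

    delta_distance_m: distance travelled since the previous sample (encoder).
    yaw_rad, pitch_rad: heading of the robot axis at that sample (IMU).
    Returns the list of (x, y, z) positions, starting at the origin.
    """
    x = y = z = 0.0
    route = [(x, y, z)]
    for d, yaw, pitch in samples:
        # Project the travelled distance onto the world axes.
        x += d * math.cos(pitch) * math.cos(yaw)
        y += d * math.cos(pitch) * math.sin(yaw)
        z += d * math.sin(pitch)
        route.append((x, y, z))
    return route

# Example: 1 m straight, then 1 m after a 90-degree bend to the left.
samples = [(0.1, 0.0, 0.0)] * 10 + [(0.1, math.pi / 2, 0.0)] * 10
end_x, end_y, end_z = integrate_route(samples)[-1]
```

Because errors in heading and distance accumulate along the route, the per-section length and angle errors reported in the abstract are the natural quantities to check for such a scheme.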

Abstract:
In this work, a multimodal point cloud registration method using CT and video frames is proposed to optimize the modeling of the bronchial cavity environment. Preoperative CT data improve the quality of point clouds acquired from intraoperative video frames. Initially, preoperative CT scans are used to obtain bronchial point clouds and airway centerlines, while intraoperative bronchial point clouds and endoscope trajectories are captured in real time using SLAM. Given that intraoperative frame-by-frame mapping cannot be directly globally registered, the multimodal point clouds undergo local segmentation. Subsequently, the preoperative bronchial airway centerlines guide iterative scaling and adjustment of the preoperative CT point clouds, achieving precise registration between the CT and the video frame point clouds. Experimental results demonstrate a rapid and accurate enhancement in the quality of the intraoperative bronchial point cloud, providing more precise maps of the cavity environment for surgical robots. The method is validated and evaluated using CT and video frame data collected from ex vivo pig lungs, achieving an intraoperative mapping accuracy of 0.5 mm. These results surpass those of methods relying solely on SLAM for intraoperative mapping.

Abstract:
As autonomous personal mobility vehicles (APMVs) are increasingly integrated into shared spaces, short-distance interactions between pedestrians and APMVs will become more frequent. To facilitate communication in shared spaces, APMVs are equipped with external human-machine interfaces (eHMIs). Although the eHMI is primarily designed to communicate with pedestrians, its communication also affects the APMV passenger due to the short-distance interaction. This paper focuses on the effect of passengers’ personality traits on their user experience when the APMV exhibits different eHMIs. A field experiment was conducted with 24 participants as APMV passengers who experienced three distinct eHMI types: eHMI-T (text-based), eHMI-NV (neutral voice-based), and eHMI-AV (affective voice-based). Through causal discovery analysis, our findings revealed that when the APMV is equipped with eHMI-T, various personality traits of passengers collectively influenced their user experience. In contrast, with the eHMI-NV design, personality traits had no direct influence on user experience. With the eHMI-AV design, agreeableness and extraversion negatively influenced concerns about drawing attention, which subsequently affected other aspects of user experience. Based on the results, this paper recommends designing different eHMIs according to APMV ownership, such as private or publicly shared APMVs.

Abstract:
To address the screw loosening and pullout limitations of rigid pedicle screws in spinal fixation procedures, and to leverage our recently developed Concentric Tube Steerable Drilling Robot (CT-SDR) and Flexible Pedicle Screw (FPS), in this paper, we introduce the concept of Augmented Bridge Spinal Fixation (AB-SF). In this concept, two connecting J-shape tunnels are first drilled through the pedicles of a vertebra using the CT-SDR. Next, two FPSs are passed through these tunnels, and bone cement is then injected through the cannulated region of the FPSs to form an augmented bridge between the two pedicles and reinforce the strength of the fixated spine. To experimentally analyze and study the feasibility of the AB-SF technique, we first used our robotic system (i.e., a CT-SDR integrated with a robotic arm) to create two different fixation scenarios in which two J-shape tunnels, forming a bridge, were drilled at different depths of a vertebral phantom. Next, we implanted two FPSs within the drilled tunnels and then successfully simulated the bone cement augmentation process.