RSS2022

Abstract:
Transporter Net is a recently proposed framework for pick and place that is able to learn good manipulation policies from a very few expert demonstrations. A key reason why Transporter Net is so sample efficient is that the model incorporates rotational equivariance into the pick-conditioned place module, i.e. the model immediately generalizes learned pick-place knowledge to objects presented in different pick orientations. This paper proposes a novel version of Transporter Net that is equivariant to both pick and place orientation. As a result, our model immediately generalizes pick-place knowledge to different place orientations in addition to generalizing the pick orientation as before. Ultimately, our new model is more sample efficient and achieves better pick and place success rates than the baseline Transporter Net model.

Abstract:
The success of deep learning depends heavily on the availability of large datasets, but in robotic manipulation there are many learning problems for which such datasets do not exist. Collecting these datasets is time-consuming and expensive, and therefore learning from small datasets is an important open problem. Within computer vision, a common approach to a lack of data is data augmentation. Data augmentation is the process of creating additional training examples by modifying existing ones. However, because the types of tasks and data differ, the methods used in computer vision cannot be easily adapted to manipulation. Therefore, we propose a data augmentation method for robotic manipulation. We argue that augmentations should be valid, relevant, and diverse. We use these principles to formalize augmentation as an optimization problem, with the objective function derived from physics and knowledge of the manipulation domain. This method applies rigid body transformations to trajectories of geometric state and action data. We test our method in two scenarios: 1) learning the dynamics of planar pushing of rigid cylinders, and 2) learning a constraint checker for rope manipulation. These two scenarios have different data and label types, yet in both scenarios, training on our augmented data significantly improves performance on downstream tasks. We also show how our augmentation method can be used on real-robot data to enable more data-efficient online learning.

Abstract:
Cables are ubiquitous in many settings, but often become tangled. Cables are prone to self-occlusions and knots making them difficult to perceive and manipulate. This challenge is exacerbated as cable length increases: long cables require slack management and new primitives to facilitate observability and reachability. In this paper, we focus on autonomously untangling cables of lengths up to 2.7 meters using a bilateral robot. We develop new motion primitives to efficiently untangle and manage the slack of long cables, as well as specialized grippers for this task. We then propose Sliding and Grasping for Tangle Manipulation (SGTM), an algorithm that composes these primitives to untangle cables from starting configurations consisting of knots and several self-crossings. We demonstrate that SGTM successfully untangles cables with a success rate of 67% on isolated overhand and figure 8 knots and 50% on more complex configurations. Supplementary material, visualizations, and videos can be found at https://sites.google.com/ view/rss-2022-untangling/home.

Abstract:
Differential Dynamic Programming (DDP) is an efficient trajectory optimization algorithm relying on second-order approximations of a system's dynamics and cost function, and has recently been applied to optimize systems with time-invariant parameters. Prior works include system parameter estimation and identifying the optimal switching time between modes of hybrid dynamical systems. This paper generalizes previous work by proposing a general parameterized optimal control objective and deriving a parametric version of DDP, titled Parameterized Differential Dynamic Programming (PDDP). A rigorous convergence analysis of the algorithm is provided, and PDDP is shown to converge to a minimum of the cost regardless of initialization. The effects of varying the optimization to more effectively escape local minima are analyzed. Experiments are presented applying PDDP on multiple robotics systems to solve model predictive control (MPC) and moving horizon estimation (MHE) tasks simultaneously. Finally, PDDP is used to determine the optimal transition point between flight regimes of a complex urban air mobility (UAM) class vehicle exhibiting multiple phases of flight.

Abstract:
A robot's deployment environment often involves perceptual changes that differ from what it has experienced during training. Standard practices such as data augmentation attempt to bridge this gap by augmenting source images in an effort to extend the support of the training distribution to better cover what the agent might experience at test time. In many cases, however, it is impossible to know test-time distribution-shift a priori, making these schemes infeasible. In this paper, we introduce a general approach, called Invariance through Latent Alignment (ILA), that improves the test-time performance of a visuomotor control policy in deployment environments with unknown perceptual variations. ILA performs unsupervised adaptation at deployment-time by matching the distribution of latent features on the target domain to the agent's prior experience, without relying on paired data. Although simple, we show that this idea leads to surprising improvements on a variety of challenging adaptation scenarios, including changes in lighting conditions, the content in the scene, and camera poses. We present results on calibrated control benchmarks in simulation --the distractor control suite-- and a physical robot under a sim-to-real setup.

Abstract:
Mechanical search is a robotic problem where a robot needs to retrieve a target item that is partially or fully-occluded from its camera. State-of-the-art approaches for mechanical search either require an expensive search process to find the target item, or they require the item to be tagged with a radio frequency identification tag (e.g., RFID), making their approach beneficial only to tagged items in the environment. We present FuseBot, the first robotic system for RF-Visual mechanical search that enables efficient retrieval of both RF-tagged and untagged items in a pile. Rather than requiring all target items in a pile to be RF-tagged, FuseBot leverages the mere existence of an RF-tagged item in the pile to benefit both tagged and untagged items. Our design introduces two key innovations. The first is RF-Visual Mapping, a technique that identifies and locates RF-tagged items in a pile and uses this information to construct an RF-Visual occupancy distribution map. The second is RF-Visual Extraction, a policy formulated as an optimization problem that minimizes the number of actions required to extract the target object by accounting for the probabilistic occupancy distribution, the expected grasp quality, and the expected information gain from future actions. We built a real-time end-to-end prototype of our system on a UR5e robotic arm with in-hand vision and RF perception modules. We conducted over 180 real-world experimental trials to evaluate FuseBot and compare its performance to a state-of-the-art vision-based system named X-Ray. Our experimental results demonstrate that FuseBot outperforms X-Ray's efficiency by more than 40% in terms of the number of actions required for successful mechanical search. Furthermore, in comparison to X-Ray's success rate of 84%, FuseBot achieves a success rate of 95% in retrieving untagged items, demonstrating for the first time that the benefits of RF perception extend beyond tagged objects in the mechanical search problem.

Abstract:
Simultaneous localization and mapping (SLAM) in the deformable environment has encountered several barricades. One of them is the lack of a global registration technique. Thus current SLAM systems heavily rely on template based methods. We propose KernelGPA, a novel global registration technique to bridge the gap. We define nonrigid transformations using a kernel method, and show that the principal axes of the map can be solved globally in closed-form, up to a global scale ambiguity along each axis. We propose to solve both the global scale ambiguity and rigid poses in a unified optimization framework, yielding a cost that can be readily incorporated in sensor fusion frameworks. We demonstrate the registration performance of KernelGPA using various datasets, with a special focus on computerized tomography (CT) registration. We release our code and data to foster future research in this direction.

Abstract:
Agile maneuvers such as sprinting and high-speed turning in the wild are challenging for legged robots. We present an end-to-end learned controller that achieves record agility for the MIT Mini Cheetah, sustaining speeds up to 3.9 m/s. This system runs and turns fast on natural terrains like grass, ice, and gravel and responds robustly to disturbances. Our controller is a neural network trained in simulation via reinforcement learning and transferred to the real world. The two key components are (i) an adaptive curriculum on velocity commands and (ii) an online system identification strategy for sim-to-real transfer leveraged from prior work. Videos of the robot’s behaviors are available at https://agility.csail.mit.edu/.

Abstract:
We approach the problem of learning from watching humans in the wild. While traditional approaches in Imitation and Reinforcement Learning are promising for learning in the real world, they are either sample inefficient or are constrained to lab settings. Meanwhile, there has been a lot of success in processing passive, unstructured human data. We propose tackling this problem via an efficient one-shot robot learning algorithm, centered around learning from a third person perspective. We call our method WHIRL: In the Wild Human-Imitated Robot Learning. In WHIRL, we aim to use human videos to gather a prior over the intent of the demonstrator, advances in computer vision, and use this to initialize our agent's policy. We introduce an efficient real-world policy learning scheme, that improves over the human prior using interactions. Our key contributions are a simple sampling-based policy optimization approach, a novel objective function for aligning human and robot videos as well as an exploration method to boost sample efficiency. We show, one-shot, generalization and success in real world settings, including 20 different manipulation tasks in the wild.

Abstract:
Increasing the density of the 3D LiDAR point cloud is appealing for many applications in robotics. However, high-density LiDAR sensors are usually costly and still limited to a level of coverage per scan (e.g., 128 channels). Meanwhile, denser point cloud scans and maps mean larger volumes to store and longer times to transmit. Existing works focus on either improving point cloud density or compressing its size. This paper aims to design a novel 3D point cloud representation that can continuously increase point cloud density while reducing its storage and transmitting size. The pipeline of the proposed Continuous, Ultra-compact Representation of LiDAR (CURL) includes four main steps: meshing, upsampling, encoding, and continuous reconstruction. It is capable of transforming a 3D LiDAR scan or map into a compact spherical harmonics representation which can be used or transmitted in low latency to continuously reconstruct a much denser 3D point cloud. Extensive experiments on four public datasets, covering college gardens, city streets, and indoor rooms, demonstrate that much denser 3D point clouds can be accurately reconstructed using the proposed CURL representation while achieving up to 80% storage space-saving. We open-source the CURL codes for the community.

Abstract:
In this paper, we present a novel algorithm for point cloud registration for range sensors capable of measuring per-return instantaneous radial velocity: Doppler ICP. Existing variants of ICP that solely rely on geometry or other features generally fail to estimate the motion of the sensor correctly in scenarios that have non-distinctive features and/or repetitive geometric structures such as hallways, tunnels, highways, and bridges. We propose a new Doppler velocity objective function that exploits the compatibility of each point's Doppler measurement and the sensor's current motion estimate. We jointly optimize the Doppler velocity objective function and the geometric objective function which sufficiently constrains the point cloud alignment problem even in feature-denied environments. Furthermore, the correspondence matches used for the alignment are improved by pruning away the points from dynamic targets which generally degrade the ICP solution. We evaluate our method on data collected from real sensors and from simulation. Our results show that with the added Doppler velocity residual terms, our method achieves a significant improvement in registration accuracy along with faster convergence, on average, when compared to classical point-to-plane ICP that solely relies on geometric residuals.

Abstract:
Robotic assembly is one of the oldest and most challenging applications of robotics. In other areas of robotics, such as perception and grasping, simulation has rapidly accelerated research progress, particularly when combined with modern deep learning. However, accurately, efficiently, and robustly simulating the range of contact-rich interactions in assembly remains a longstanding challenge. In this work, we present Factory, a set of physics simulation methods and robot learning tools for such applications. We achieve real-time or faster simulation of a wide range of contact-rich scenes, including simultaneous simulation of 1000 nut-and-bolt interactions. We provide 60 carefully-designed part models, 3 robotic assembly environments, and 7 robot controllers for training and testing virtual robots. Finally, we train and evaluate proof-of-concept reinforcement learning policies for nut-and-bolt assembly. We aim for Factory to open the doors to using simulation for robotic assembly, as well as many other contact-rich applications in robotics. Please see our website for supplementary content, including videos.

Abstract:
Collocation methods for numerical optimal control commonly assume that the system dynamics is expressed as a first order ODE of the form xdot = f(x,u,t), where x is the state and u the control vector. However, in many systems in robotics, the dynamics adopts the second order form qddot = g(q,qdot,u,t), where q is the configuration. To preserve the first order form, the usual procedure is to introduce the velocity variable v = qdot and define the state as x=(q,v), where q and v are treated as independent in the collocation method. As a consequence, the resulting trajectories do not fulfill the mandatory relationships v(t) = qdot(t) for all times, and even violate qddot = g(q,qdot,u,t) at the collocation points. This prevents the possibility of reaching a correct solution for the problem, and makes the trajectories less compliant with the system dynamics. In this paper we propose a formulation for the trapezoidal and Hermite-Simpson collocation methods that is able to deal with second order dynamics and grants the mutual consistency of the trajectories for q and v while ensuring qddot = g(q,qdot,u,t) at the collocation points. As a result, we obtain trajectories with a much smaller dynamical error in similar computation times, so the robot will behave closer to what is predicted by the solution. We illustrate these points by way of examples, using well-established benchmark problems from the literature.

Abstract:
Collision detection between two convex shapes is an essential feature of any physics engine or robot motion planner. It has been often tackled as a computational geometry problem, with the Gilbert, Johnson and Keerthi (GJK) algorithm being the most common approach today. In this work we show that collision detection is fundamentally a convex optimization problem. In particular, we establish that the GJK algorithm is a specific sub-case of the well-established Frank-Wolfe (FW) algorithm in convex optimization. We introduce a new collision detection algorithm by adapting recent works linking Nesterov acceleration and Frank-Wolfe methods. We benchmark the proposed accelerated collision detection method on two datasets composed of strictly convex and non-strictly convex shapes. Our results show that our approach significantly reduces the number of iterations to solve collision detection problems compared to the state-of-the-art GJK algorithm, leading to up to two times faster computation times.

Abstract:
Robotic Information Gathering (RIG) relies on the uncertainty of a probabilistic model to identify critical areas for efficient data collection. Gaussian processes (GPs) with stationary kernels have been widely adopted for spatial modeling. However, real-world spatial data typically does not satisfy the assumption of stationarity, where different locations are assumed to have the same degree of variability. As a result, the prediction uncertainty does not accurately capture prediction error, limiting the success of RIG algorithms. We propose a novel family of nonstationary kernels, named the Attentive Kernel (AK), which is simple, robust, and can extend any existing kernel to a nonstationary one. We evaluate the new kernel in elevation mapping tasks, where AK provides better accuracy and uncertainty quantification over the commonly used RBF kernel and other popular nonstationary kernels. The improved uncertainty quantification guides the downstream RIG planner to collect more valuable data around the high-error area, further increasing prediction accuracy. A field experiment demonstrates that the proposed method can guide an Autonomous Surface Vehicle (ASV) to prioritize data collection in locations with high spatial variations, enabling the model to characterize the salient environmental features.

Abstract:
In multi-agent settings, game theory is a natural framework for describing the strategic interactions of agents whose objectives depend upon one another's behavior. Trajectory games capture these complex effects by design. In competitive settings, this makes them a more faithful interaction model than traditional "predict then plan" approaches. However, current game-theoretic planning methods have important limitations. In this work, we propose two main contributions. First, we introduce an offline training phase which reduces the online computational burden of solving trajectory games. Second, we formulate a lifted game which allows players to optimize multiple candidate trajectories in unison and thereby construct more competitive "mixed" strategies. We validate our approach on a number of experiments using the pursuit-evasion game "tag".

Abstract:
This paper introduces DextAIRity, an approach to manipulate deformable objects using active airflow. In contrast to conventional contact-based quasi-static manipulations, DextAIRity allows the system to apply dense forces on out-of-contact surfaces, expands the system's reach range, and provides safe high-speed interactions. These properties are particularly advantageous when manipulating under-actuated deformable objects with large surface areas or volumes. We demonstrate the effectiveness of DextAIRity through two challenging deformable object manipulation tasks: cloth unfolding and bag opening. We present a self-supervised learning framework that learns to effectively perform a target task through a sequence of grasping or air-based blowing actions. By using a closed-loop formulation for blowing, the system continuously adjusts its blowing direction based on visual feedback in a way that is robust to the highly stochastic dynamics. We deploy our algorithm on a real-world three-arm system and present evidence suggesting that DextAIRity can improve system efficiency for challenging deformable manipulation tasks, such as cloth unfolding, and enable new applications that are impractical to solve with quasi-static contact-based manipulations (e.g., bag opening)

Abstract:
Robotic navigation has been approached as a problem of 3D reconstruction and planning, as well as an end-to-end learning problem. However, long-range navigation requires both planning and reasoning about local traversability, as well as being able to utilize general knowledge about global geography, in the form of a roadmap, GPS, or other side information providing important cues. In this work, we propose an approach that integrates learning and planning, and can utilize side information such as schematic roadmaps, satellite maps and GPS coordinates as a planning heuristic, without relying on them being accurate. Our method, ViKiNG, incorporates a local traversability model, which looks at the robot's current camera observation and a potential subgoal to infer how easily that subgoal can be reached, as well as a heuristic model, which looks at overhead maps for hints and attempts to evaluate the appropriateness of these subgoals in order to reach the goal. These models are used by a heuristic planner to identify the best waypoint in order to reach the final destination. Our method performs no explicit geometric reconstruction, utilizing only a topological representation of the environment. Despite having never seen trajectories longer than 80 meters in its training dataset, ViKiNG can leverage its image-based learned controller and goal-directed heuristic to navigate to goals up to 3 kilometers away in previously unseen environments, and exhibit complex behaviors such as probing potential paths and backtracking when they are found to be non-viable. ViKiNG is also robust to unreliable maps and GPS, since the low-level controller ultimately makes decisions based on egocentric image observations, using maps only as planning heuristics. For videos of our experiments, please check out our project page: sites.google.com/view/viking-release

Abstract:
In human-robots collaboration scenarios, a human would give robots an instruction that is intuitive for the human himself to accomplish. However, the instruction given to robots is likely ambiguous for them to understand as some information is implicit in the instruction. Therefore, it is necessary for the robots to jointly reason the operation details and perform the embodied multi-agent task planning given the ambiguous instruction. This problem exhibits significant challenges in both language understanding and dynamic task planning with the perception information. In this work, an embodied multi-agent task planning framework is proposed to utilize external knowledge sources and dynamically perceived visual information to resolve the high-level instructions, and dynamically allocate the decomposed tasks to multiple agents. Furthermore, we utilize the semantic information to perform environment perception and generate sub-goals to achieve the navigation motion. This model effectively bridges the difference between the simulation environment and the physical environment, thus it can be simultaneously applied in both simulation and physical scenarios and avoid the notorious sim2real problem. Finally, we build a benchmark dataset to validate the embodied multi-agent task planning problem, which includes three types of high-level instructions in which some target objects are implicit in instructions. We perform the evaluation experiments on the simulation platform and in physical scenarios, demonstrating that the proposed model can achieve promising results for multi-agent collaborative tasks.

Abstract:
There is a growing need for computational tools to automatically design and verify autonomous systems, especially complex robotic systems involving perception, planning, control, and hardware in the autonomy stack. Differentiable programming has recently emerged as powerful tool for modeling and optimization. However, very few studies have been done to understand how differentiable programming can be used for robust, certifiable end-to-end design optimization. In this paper, we fill this gap by combining differentiable programming for robot design optimization with a novel statistical framework for certifying the robustness of optimized designs. Our framework can conduct end-to-end optimization and robustness certification for robotics systems, enabling simultaneous optimization of navigation, perception, planning, control, and hardware subsystems. Using simulation and hardware experiments, we show how our tool can be used to solve practical problems in robotics. First, we optimize sensor placements for robot navigation (a design with 5 subsystems and 6 tunable parameters) in under 5 minutes to achieve an 8.4x performance improvement compared to the initial design. Second, we solve a multi-agent collaborative manipulation task (3 subsystems and 454 parameters) in under an hour to achieve a 44% performance improvement over the initial design. We find that differentiable programming enables much faster (32% and 20x, respectively for each example) optimization than approximate gradient methods. We certify the robustness of each design and successfully deploy the optimized designs in hardware. An open-source implementation is available at https://github.com/ MIT- REALM/architect.

Abstract:
Mobile robots need to navigate in crowded environments to provide services to humans. Traditional approaches to crowd-aware navigation decouple people motion prediction from robot motion planning, leading to undesired robot behaviours. Recent deep learning-based methods integrate crowd forecasting in the planner, assuming precise tracking of the agents in the scene. To do this they require expensive LiDAR sensors and tracking algorithms that are complex and brittle. In this paper we propose a two-step approach to first learn a robot navigation policy based on privileged information about exact pedestrian locations available in simulation. A second learning step distills the knowledge acquired by the first network into an adaptation network that uses only narrow field-of-view image data from the robot camera. While the navigation policy is trained in simulation without any expert supervision such as trajectories computed by a planner, it exhibits state-of-the-art performance on a broad range of dense crowd simulations and real-world experiments. Video results at https://europe.naverlabs.com/research/dipcan.

Abstract:
In this paper we present a control barrier function-based (CBF) resilience controller that provides resilience in a multi-robot network to adversaries. Previous approaches provide resilience by virtue of specific linear combinations of multiple control constraints. These combinations can be difficult to find and are sensitive to the addition of new constraints. Unlike previous approaches, the proposed CBF provides network resilience and is easily amenable to multiple other control constraints, such as collision and obstacle avoidance. The inclusion of such constraints is essential in order to implement a resilience controller on realistic robot platforms. We demonstrate the viability of the CBF-based resilience controller on real robotic systems through case studies on a multi-robot flocking problem in cluttered environments with the presence of adversarial robots.

Abstract:
When humans design cost or goal specifications for robots, they often produce specifications that are ambiguous, under-specified, or beyond planners’ ability to solve. In these cases, corrections provide a valuable tool for human-in-the-loop robot control. Corrections might take the form of new goal specifications, new constraints (e.g. to avoid specific objects), or hints for planning algorithms (e.g. to visit specific waypoints). Existing correction methods (e.g. using a joystick or direct manipulation of an end effector) require full teleoperation or real-time interaction. In this paper, we explore natural language as an expressive and flexible tool for robot correction. We describe how to map from natural language sentences to transformations of cost functions. We show that these transformations enable users to correct goals, update robot motions to accommodate additional user preferences, and recover from planning errors. These corrections can be leveraged to get 81% and 93% success rates on tasks where the original planner failed, with either one or two language corrections. Our method makes it possible to compose multiple constraints and generalizes to unseen scenes, objects and sentences in simulated and the real world environments. Additional visualizations are available at sites.google.com/view/language-costs

Abstract:
Gradient-based approaches in reinforcement learning (RL) have achieved tremendous success in learning policies for autonomous vehicles. While the performance of these approaches warrants real-world adoption, these policies lack interpretability, limiting deployability in the safety-critical and legally-regulated domain of autonomous driving (AD). AD requires interpretable and verifiable control policies that maintain high performance. We propose Interpretable Continuous Control Trees (ICCTs), a tree-based model that can be optimized via modern, gradient-based, RL approaches to produce high-performing, interpretable policies. The key to our approach is a procedure for allowing direct optimization in a sparse decision-tree-like representation. We validate ICCTs against baselines across six domains, showing that ICCTs are capable of learning interpretable policy representations that parity or outperform baselines by up to 33% in AD scenarios while achieving a 300x-600x reduction in the number of policy parameters against deep learning baselines. Furthermore, we demonstrate the interpretability and utility of our ICCTs through a 14-car physical robot demonstration.

Abstract:
In planar grasp detection, the goal is to learn a function from an image of a scene onto a set of feasible grasp poses in SE(2). In this paper, we recognize that the optimal grasp function is SE(2)-equivariant and can be modeled using an equivariant convolutional neural network. As a result, we are able to significantly improve the sample efficiency of grasp learning, obtaining a good approximation of the grasp function after only 600 grasp attempts. This is few enough that we can learn to grasp completely on a physical robot in about 1.5 hours. Code is available at https://github.com/ZXP-S-works/ SE2-equivariant-grasp-learning.

Abstract:
Self-occlusion is challenging for cloth manipulation, as it makes it difficult to estimate the full state of the cloth. Ideally, a robot trying to unfold a crumpled or folded cloth should be able to reason about the cloth's occluded regions. For example, suppose that a robot is trying to unfold a square towel with a corner folded beneath, the robot can successfully unfold it only if it knows the existence of occluded corner. We leverage recent advances in pose estimation for cloth to build a system that uses explicit occlusion reasoning to unfold a crumpled cloth. Specifically, we first learn a model to reconstruct the mesh of the cloth. However, the model will likely have errors due to the complexities of the cloth configurations and due to ambiguities from occlusions. Our main insight is that we can further refine the predicted reconstruction by performing test-time finetuning with self-supervised losses. The obtained reconstructed mesh allows us to use a mesh-based dynamics model for planning while reasoning about occlusions. We evaluate our system both on cloth flattening as well as on cloth canonicalization, in which the objective is to manipulate the cloth into a canonical pose. Our experiments show that our method significantly outperforms prior methods that do not explicitly account for occlusions or perform test-time optimization. Videos and visualizations can be found on our website: https://sites.google.com/view/occlusion-reason/home.

Abstract:
We present SymForce, a library for fast symbolic computation, code generation, and nonlinear optimization for robotics applications like computer vision, motion planning, and controls. SymForce combines the development speed and flexibility of symbolic math with the performance of autogenerated, highly optimized code in C++ or any target runtime language. SymForce provides geometry and camera types, Lie group operations, and branchless singularity handling for creating and analyzing complex symbolic expressions in Python, built on top of SymPy. Generated functions can be integrated as factors into our tangent-space nonlinear optimizer, which is highly optimized for real-time production use. We introduce novel methods to automatically compute tangent-space Jacobians, eliminating the need for bug-prone handwritten derivatives. This workflow enables faster runtime code, faster development time, and fewer lines of handwritten code versus the state-of-the-art. Our experiments demonstrate that our approach can yield order of magnitude speedups on computational tasks core to robotics. Code is available at https://github.com/symforce-org/symforce.

Abstract:
Robots have the potential to perform search for a variety of applications under different scenarios. Our work is motivated by humanitarian assistant and disaster relief (HADR) where often it is critical to find signs of life in the presence of conflicting criteria, objectives, and information. We believe ergodic search can provide a framework for exploiting available information as well as exploring for new information for applications such as HADR, especially when time is of the essence. Ergodic search algorithms plan trajectories such that the time spent in a region is proportional to the amount of information in that region, and is able to naturally balance exploitation (myopically searching high-information areas) and exploration (visiting all locations in the search space for new information). Existing ergodic search algorithms, as well as other information-based approaches, typically consider search using only a single information map. However, in many scenarios, the use of multiple information maps that encode different types of relevant information is common. Ergodic search methods currently do not possess the ability for simultaneous nor do they have a way to balance which information gets priority. This leads us to formulate a Multi-Objective Ergodic Search (MOES) problem, which aims at finding the so-called Pareto-optimal solutions, for the purpose of providing human decision makers various solutions that trade off between conflicting criteria. To efficiently solve MOES, we develop a framework called Sequential Local Ergodic Search (SLES) that converts a MOES problem into a "weight space coverage" problem. It leverages the recent advances in ergodic search methods as well as the idea of local optimization to efficiently approximate the Pareto-optimal front. Our numerical results show that SLES computes solutions of better quality than the popular multi-objective genetic algorithms and runs distinctly faster than a naive scalarization method on a laptop.

Abstract:
Conventional Multi-Agent Path Finding (MAPF) problems aim to compute an ensemble of collision-free paths for multiple agents from their respective starting locations to pre-allocated destinations. This work considers a generalized version of MAPF called Multi-Agent Combinatorial Path Finding (MCPF) where agents must collectively visit a large number of intermediate target locations along their paths before arriving at destinations. This problem involves not only planning collision-free paths for multiple agents but also assigning targets and specifying the visiting order for each agent (i.e. multi-target sequencing). To solve the problem, we leverage the well-known Conflict-Based Search (CBS) for MAPF and propose a novel framework called Conflict-Based Steiner Search (CBSS). CBSS interleaves (1) the conflict resolving strategy in CBS to bypass the curse of dimensionality in MAPF and (2) multiple traveling salesman algorithms to handle the combinatorics in multi-target sequencing, to compute optimal or bounded sub-optimal paths for agents while visiting all the targets. Our extensive tests verify the advantage of CBSS over baseline approaches in terms of computing shorter paths and improving success rates within a runtime limit for up to 20 agents and 50 targets. We also evaluate CBSS with several MCPF variants, which demonstrates the generality of our problem formulation and the CBSS framework.

Abstract:
We present a modular Bayesian optimization framework that efficiently generates time-optimal trajectories for a cooperative multi-agent system, such as a team of UAVs. Existing methods for multi-agent trajectory generation often rely on overly conservative constraints to reduce the complexity of this high-dimensional planning problem, leading to suboptimal solutions. We propose a novel modular structure for the Bayesian optimization model that consists of multiple Gaussian process surrogate models that represent the dynamic feasibility and collision avoidance constraints. This modular structure alleviates the stark increase in computational cost with problem dimensionality and enables the use of minimal constraints in the joint optimization of the multi-agent trajectories. The efficiency of the algorithm is further improved by introducing a scheme for simultaneous evaluation of the Bayesian optimization acquisition function and random sampling. The modular BayesOpt algorithm was applied to optimize multi-agent trajectories through six unique environments using multi-fidelity evaluations from various data sources. It was found that the resulting trajectories are faster than those obtained from two baseline methods. The optimized trajectories were validated in real-world experiments using four quadcopters that fly within centimeters of each other at speeds up to 7.4 m/s.

Abstract:
Humans perceive the world by interacting with objects, which often happens in a dynamic way. For example, a human would shake a bottle to guess its content. However, it remains a challenge for robots to understand many dynamic signals during contact well. This paper investigates dynamic tactile sensing by tackling the task of estimating liquid properties. We propose a new way of thinking about dynamic tactile sensing: by building a light-weighted data-driven model based on the simplified physical principle. The liquid in a bottle will oscillate after a perturbation. We propose a simple physics-inspired model to explain this oscillation and use a high-resolution tactile sensor GelSight to sense it. Specifically, the viscosity and the height of the liquid determine the decay rate and frequency of the oscillation. We then train a Gaussian Process Regression model on a small amount of the real data to estimate the liquid properties. Experiments show that our model can classify three different liquids with 100% accuracy. The model can estimate volume with high precision and even estimate the concentration of sugar-water solution. It is data-efficient and can easily generalize to other liquids and bottles. Our work posed a physically-inspired understanding of the correlation between dynamic tactile signals and the dynamic performance of the liquid. Our approach creates a good balance between simplicity, accuracy, and generality. It will help robots to better perceive liquids in different environments such as kitchens, food factories, and pharmaceutical factories.

Abstract:
Manipulating volumetric deformable objects in the real world, like plush toys and pizza dough, bring substantial challenges due to infinite shape variations, non-rigid motions, and partial observability. We introduce ACID, an action-conditional visual dynamics model for volumetric deformable objects based on structured implicit neural representations. ACID integrates two new techniques: implicit representations for action-conditional dynamics and geodesics-based contrastive learning. To represent deformable dynamics from partial RGB-D observations, we learn implicit representations of occupancy and flow-based forward dynamics. To accurately identify state change under large non-rigid deformations, we learn a correspondence embedding field through a novel geodesics-based contrastive loss. To evaluate our approach, we develop a simulation framework for manipulating complex deformable shapes in realistic scenes and a benchmark containing over 17,000 action trajectories with six types of plush toys and 78 variants. Our model achieves the best performance in geometry, correspondence, and dynamics predictions over existing approaches. The ACID dynamics models are successfully employed to goal-conditioned deformable manipulation tasks, resulting in a 30% increase in task success rate over the strongest baseline.

Abstract:
In this paper, we propose an optimization based SLAM approach to simultaneously optimize the robot trajectory and the occupancy map using 2D laser scans (and odometry) information. The key novelty is that the robot poses and the occupancy map are optimized together, which is significantly different from existing occupancy mapping strategies where the robot poses need to be obtained first before the map can be estimated. In our formulation, the map is represented as a continuous occupancy map where each 2D point in the environment has a corresponding evidence value. The Occupancy-SLAM problem is formulated as an optimization problem where the variables include all the robot poses and the occupancy values at the selected discrete grid cell nodes. We propose a variation of Gauss-Newton method to solve this new formulated problem, obtaining the optimized occupancy map and robot trajectory together with their uncertainties. Our algorithm is an offline approach since it is based on batch optimization and the number of variables involved is large. Evaluations using simulations and publicly available practical 2D laser datasets demonstrate that the proposed approach can estimate the maps and robot trajectories more accurately than the state-of-the-art techniques, when a relatively accurate initial guess is provided to our algorithm. The video shows the convergence process of the proposed Occupancy-SLAM and comparison of results to Cartographer can be found at https://youtu.be/4oLyVEUC4iY.

Abstract:
Image descriptor based place recognition is an important means for loop-closure detection in SLAM. The currently best performing image descriptors for this task are trained on large training datasets with the goal to be applicable in many different environments. In particular, they are not optimized for a specific environment, e.g. the city of Oxford. However, we argue that for place recognition, there is always a specific environment - not necessarily geographically defined, but specified by the particular set of descriptors in the database. In this paper, we propose SEER, a simple and efficient algorithm that can learn to create better descriptors for a specific environment from such a potentially very small set of database descriptors. The new descriptors are better in the sense that they will be more suited for image retrieval on these database descriptors. SEER stands for Sparse Exemplar Ensemble Representations. Both sparsity and ensemble representations are necessary components of the proposed approach. This is evaluated on a large variety of standard place recognition datasets where SEER considerably outperforms existing methods. It does not require any label information and is applicable in online place recognition scenarios. Open source code is available.

Abstract:
While visual imitation learning offers one of the most effective ways of learning from visual demonstrations, generalizing from them requires either hundreds of diverse demonstrations, task specific priors, or large, hard-to-train parametric models. One reason such complexities arise is because standard visual imitation frameworks try to solve two coupled problems at once: learning a succinct but good representation from the diverse visual data, while simultaneously learning to associate the demonstrated actions with such representations. Such joint learning causes an interdependence between these two problems, which often results in needing large amounts of demonstrations for learning. To address this challenge, we instead propose to decouple representation learning from behavior learning for visual imitation. First, we learn a visual representation encoder from offline data using standard supervised and self-supervised learning methods. Once the representations are trained, we use non-parametric Locally Weighted Regression to predict the actions. We experimentally show that this simple decoupling improves the performance of visual imitation models on both offline demonstration datasets and real-robot door opening compared to prior work in visual imitation.

Abstract:
We present iSDF, a continual learning system for real-time signed distance field (SDF) reconstruction. Given a stream of posed depth images from a moving camera, it trains a randomly initialised neural network to map input 3D coordinate to approximate signed distance. The model is self-supervised by minimising a loss that bounds the predicted signed distance using the distance to the closest sampled point in a batch of query points that are actively sampled. In contrast to prior work based on voxel grids, our neural method is able to provide adaptive levels of detail with plausible filling in of partially observed regions and denoising of observations, all while having a more compact representation. In evaluations against alternative methods on real and synthetic datasets of indoor environments, we find that iSDF produces more accurate reconstructions, and better approximations of collision costs and gradients useful for downstream planners in domains from navigation to manipulation. Code and video results can be found at our project page: https://joeaortiz.github.io/iSDF/.

Abstract:
We explore a novel method to perceive and manipulate 3D articulated objects that generalizes to enable a robot to articulate unseen classes of objects. We propose a vision-based system that learns to predict the potential motions of the parts of a variety of articulated objects to guide downstream motion planning of the system to articulate the objects. To predict the object motions, we train a neural network to output a dense vector field representing the point-wise motion direction of the points in the point cloud under articulation. The system then will deploy an analytical motion planning policy based on this vector field to achieve a policy that yields maximum articulation. We train the vision system entirely in simulation, and then demonstrate the capability of our system to generalize to unseen object instances and novel categories in both simulation and the real world, deploying our policy on a Sawyer robot with no retraining. Results suggest that our system achieves state-of-the-art performance in both simulated and real-world experiments. Code, data, and supplementary materials are available at https://sites.google.com/view/articulated-flowbot-3d/home

Abstract:
Communication is a hallmark of intelligence. In this work, we present MIRROR, an approach to (i) quickly learn human models from human demonstrations, and (ii) use the models for subsequent communication planning in assistive shared-control settings. MIRROR is inspired by social projection theory, which hypothesizes that humans use self-models to understand others. Likewise, MIRROR leverages self-models learned using reinforcement learning to bootstrap human modeling. Experiments with simulated humans show that this approach leads to rapid learning and more robust models compared to existing behavioral cloning and state-of-the-art imitation learning methods. We also present a human-subject study using the CARLA simulator which shows that (i) MIRROR is able to scale to complex domains with high-dimensional observations and complicated world physics and (ii) provides effective assistive communication that enabled participants to drive more safely in adverse weather conditions.

Abstract:
We propose a Model Predictive Control (MPC) method for collision-free navigation that uses amortized variational inference to approximate the distribution of optimal control sequences by training a normalizing flow conditioned on the start, goal and environment. This representation allows us to learn a distribution that accounts for both the dynamics of the robot and complex obstacle geometries. We can then sample from this distribution to produce control sequences which are likely to be both goal-directed and collision-free as part of our proposed FlowMPPI sampling-based MPC method. However, when deploying this method, the robot may encounter an out-of-distribution (OOD) environment, i.e. one which is radically different from those used in training. In such cases, the learned flow cannot be trusted to produce low-cost control sequences. To generalize our method to OOD environments we also present an approach that performs projection on the representation of the environment as part of the MPC process. This projection changes the environment representation to be more in-distribution while also optimizing trajectory quality in the true environment. Our simulation results on a 2D double-integrator and a 3D 12DoF underactuated quadrotor suggest that FlowMPPI with projection outperforms state-of-the-art MPC baselines on both in-distribution and OOD environments, including OOD environments generated from real-world data.

Abstract:
Multi-robot systems are often made of physically small robots, meaning obstacles that could be overcome by larger robots pose a greater challenge to them. This paper considers how a group of such robots could self-assemble into bridges to cross large gaps in their environment. We build on previous work demonstrating construction of cantilevers to show how they can be modified once the other side of the gap is reached. Two distributed algorithms are presented: one to reduce the number of agents in the initial structure once it is supported at both ends, and another to deconstruct this leaner structure when it is no longer required. A force-aware approach is taken to ensure that structures do not collapse under self-weight. The first algorithm is shown to be capable of reducing the number of agents in the structure to close to the optimum amount, whereas the second achieves safe and reliable deconstruction.

Abstract:
We are motivated by the problem of performing failure prediction for safety-critical robotic systems with high-dimensional sensor observations (e.g., vision). Given access to a black-box control policy (e.g., in the form of a neural network) and a dataset of training environments, we present an approach for synthesizing a failure predictor with guaranteed bounds on false-positive and false-negative errors. In order to achieve this, we utilize techniques from Probably Approximately Correct (PAC)-Bayes generalization theory. In addition, we present novel class-conditional bounds that allow us to trade-off the relative rates of false-positive vs. false-negative errors. We propose algorithms that train failure predictors (that take as input the history of sensor observations) by minimizing our theoretical error bounds. We demonstrate the resulting approach using extensive simulation and hardware experiments for vision-based navigation with a drone and grasping objects with a robotic manipulator equipped with a wrist-mounted RGB-D camera. These experiments illustrate the ability of our approach to (1) provide strong bounds on failure prediction error rates (that closely match empirical error rates), and (2) improve safety by predicting failures.

Abstract:
This paper presents a holistic approach to saliency-guided visual attention modeling (SVAM) for use by autonomous underwater robots. Our proposed model, named SVAM-Net, integrates deep visual features at various scales and semantics for effective salient object detection (SOD) in natural underwater images. The SVAM-Net architecture is configured in a unique way to jointly accommodate bottom-up and top-down learning within two separate branches of the network while sharing the same encoding layers. We design dedicated spatial attention modules (SAMs) along these learning pathways to exploit the coarse-level and fine-level semantic features for SOD at four stages of abstractions. The bottom-up branch performs a rough yet reasonably accurate saliency estimation at a fast rate, whereas the deeper top-down branch incorporates a residual refinement module (RRM) that provides fine-grained localization of the salient objects. Extensive performance evaluation of SVAM-Net on benchmark datasets clearly demonstrates its effectiveness for underwater SOD. We also validate its generalization performance by several ocean trials' data that include test images of diverse underwater scenes and waterbodies, and also images with unseen natural objects. Moreover, we analyze its computational feasibility for robotic deployments and demonstrate its utility in several important use cases of visual attention modeling.

Abstract:
This paper considers policy-centric optimal motion planning with limited reaction time. The motion planning queries are determined by their goal regions and cost functionals, and are generated over time from a distribution. Once a new query is requested, the robot needs to quickly generate a motion planner which can steer the robot to the goal region while minimizing a cost functional. We develop a meta-learning-based algorithm to compute a meta value function, which can be fast adapted using a small number of samples of a new query. Simulations on a unicycle are conducted to evaluate the developed algorithm and show the anytime property of the proposed algorithm.

Abstract:
Tactile predictive models can be useful across several robotic manipulation tasks, e.g. robotic pushing, robotic grasping, slip avoidance, and in-hand manipulation. However, available tactile prediction models are mostly studied for image-based tactile sensors and there is no comparison study indicating the best performing models. In this paper, we presented two novel data-driven action-conditioned models for predicting tactile signals during real-world physical robot interaction tasks (1) action condition tactile prediction and (2) action conditioned tactile-video prediction models. We use a magnetic-based tactile sensor that is challenging to analyse and test state-of-the-art predictive models and the only existing bespoke tactile prediction model. We compare the performance of these models with those of our proposed models. We perform the comparison study using our novel tactile enabled dataset containing 51,000 tactile frames of a real-world robotic manipulation task with 11 flat-surfaced household objects. Our experimental results demonstrate the superiority of our proposed tactile prediction models in terms of qualitative, quantitative and slip prediction scores.

Abstract:
This paper tackles the task of goal-conditioned dynamic manipulation of deformable objects. This task is highly challenging due to its complex dynamics (introduced by object deformation and high-speed action) and strict task requirements (defined by a precise goal specification). To address these challenges, we present Iterative Residual Policy (IRP), a general learning framework applicable to repeatable tasks with complex dynamics. IRP learns an implicit policy via delta dynamics -- instead of modeling the entire dynamical system and inferring actions from that model, IRP learns delta dynamics that predict the effects of delta action on the previously-observed trajectory. When combined with adaptive action sampling, the system can quickly optimize its actions online to reach a specified goal. We demonstrate the effectiveness of IRP on two tasks: whipping a rope to hit a target point and swinging a cloth to reach a target pose. Despite being trained only in simulation on a fixed robot setup, IRP is able to efficiently generalize to noisy real-world dynamics, new objects with unseen physical properties, and even different robot hardware embodiments, demonstrating its excellent generalization capability relative to alternative approaches.

Abstract:
A motion-based control interface promises flexible robot operations in dangerous environments by combining user intuitions with the robot's motor capabilities. However, designing a motion interface for non-humanoid robots, such as quadrupeds or hexapods, is not straightforward because different dynamics and control strategies govern their movements. We propose a novel motion control system that allows a human user to operate various motor tasks seamlessly on a quadrupedal robot. We first retarget the captured human motion into the corresponding robot motion with proper semantics using supervised learning and post-processing techniques. Then we apply the motion imitation learning with curriculum learning to develop a control policy that can track the given retargeted reference. We further improve the performance of both motion retargeting and motion imitation by training a set of experts. As we demonstrate, a user can execute various motor tasks using our system, including standing, sitting, tilting, manipulating, walking, and turning, on simulated and real quadrupeds. We also conduct a set of studies to analyze the performance gain induced by each component.

Abstract:
Shared control systems can make complex robot teleoperation tasks easier for users. These systems predict the user's goal, determine the motion required for the robot to reach that goal, and combine that motion with the user's input. Goal prediction is generally based on the user's control input (e.g., the joystick signal). In this paper, we show that this prediction method is especially effective when users follow standard noisily optimal behavior models. In tasks with input constraints like modal control, however, this effectiveness no longer holds, so additional sources for goal prediction can improve assistance. We implement a novel shared control system that combines natural eye gaze with joystick input to predict people's goals online, and we evaluate our system in a real-world, COVID-safe user study. We find that modal control reduces the efficiency of assistance according to our model, and when gaze provides a prediction earlier in the task, the system's performance improves. However, gaze on its own is unreliable and assistance using only gaze performs poorly. We conclude that control input and natural gaze serve different and complementary roles in goal prediction, and using them together leads to improved assistance.

Abstract:
Our goal is to develop theory and algorithms for establishing fundamental limits on performance for a given task imposed by a robot's sensors. In order to achieve this, we define a quantity that captures the amount of task-relevant information provided by a sensor. Using a novel version of the generalized Fano inequality from information theory, we demonstrate that this quantity provides an upper bound on the highest achievable expected reward for one-step decision making tasks. We then extend this bound to multi-step problems via a dynamic programming approach. We present algorithms for numerically computing the resulting bounds, and demonstrate our approach on three examples: (i) the lava problem from the literature on partially observable Markov decision processes, (ii) an example with continuous state and observation spaces corresponding to a robot catching a freely-falling object, and (iii) obstacle avoidance using a depth sensor with non-Gaussian noise. We demonstrate the ability of our approach to establish strong limits on achievable performance for these problems by comparing our upper bounds with achievable lower bounds (computed by synthesizing or learning concrete control policies).

Abstract:
Quadratic programming (QP) has become a core modelling component in the modern engineering toolkit. This is particularly true for simulation, planning and control in robotics. Yet, modern numerical solvers have not reached the level of efficiency and reliability required in practical applications where speed, robustness, and accuracy are all necessary. In this work, we introduce a few variations of the well-established augmented Lagrangian method, specifically for solving QPs, which include heuristics for improving practical numerical performances. Those variants are embedded within an open-source software which includes an efficient C++ implementation, a modular API, as well as best-performing heuristics for our test-bed. Relying on this framework, we present a benchmark studying the practical performances of modern optimization solvers for convex QPs on generic and complex problems of the literature as well as on common robotic scenarios. This benchmark notably highlights that this approach outperforms modern solvers in terms of efficiency, accuracy and robustness for small to medium-sized problems, while remaining competitive for higher dimensions.

Abstract:
Promising results have been achieved recently in category-level manipulation that generalizes across object instances. Nevertheless, it often requires expensive real-world data collection and manual specification of semantic keypoints for each object category and task. Additionally, coarse keypoint predictions and ignoring intermediate action sequences hinder adoption in complex manipulation tasks beyond pick-and-place. This work proposes a novel, category-level manipulation framework that leverages an object-centric, category-level representation and model-free 6 DoF motion tracking. The canonical object representation is learned solely in simulation and then used to parse a category-level, task trajectory from a single demonstration video. The demonstration is reprojected to a target trajectory tailored to a novel object via the canonical representation. During execution, the manipulation horizon is decomposed into longrange, collision-free motion and last-inch manipulation. For the latter part, a category-level behavior cloning (CatBC) method leverages motion tracking to perform closed-loop control. CatBC follows the target trajectory, projected from the demonstration and anchored to a dynamically selected category-level coordinate frame. The frame is automatically selected along the manipulation horizon by a local attention mechanism. This framework allows to teach different manipulation strategies by solely providing a single demonstration, without complicated manual programming. Extensive experiments demonstrate its efficacy in a range of challenging industrial tasks in highprecision assembly, which involve learning complex, long-horizon policies. The process exhibits robustness against uncertainty due to dynamics as well as generalization across object instances and scene configurations. The supplementary video is available at https://www.youtube.com/watch?v=WAr8ZY3mYyw

Abstract:
We present a terrain traversability mapping and navigation system (TNS) for autonomous excavator applications in an unstructured environment. We use an efficient approach to extract terrain features from RGB images and 3D point clouds and incorporate them into a global map for planning and navigation. Our system can adapt to changing environments and update the terrain information in real-time. Moreover, we present a novel dataset, the Complex Worksite Terrain (CWT) dataset, which consists of RGB images from construction sites with seven categories based on navigability. Our novel algorithms improve the mapping accuracy over previous SOTA methods by 4.17-30.48% and reduce MSE on the traversability map by 13.8-71.4%. We have combined our mapping approach with planning and control modules in an autonomous excavator navigation system and observe 49.3% improvement in the overall success rate. Based on TNS, we demonstrate the first autonomous excavator that can navigate through unstructured environments consisting of deep pits, steep hills, rock piles, and other complex terrain features. Dataset, videos, and a full technical report are available at gamma.umd.edu/tns/.

Abstract:
It is well-known that graph-based multi-robot path planning (MRPP) is NP-hard to optimally solve. In this work, we propose the first low polynomial-time algorithm for MRPP achieving 1–1.5 asymptotic optimality guarantees on solution makespan (i.e., the time it takes to complete a reconfiguration of the robots) for random instances under very high robot density, with high probability. The dual guarantee on computational efficiency and solution optimality suggests our proposed general method is promising in significantly scaling up multi-robot applications for logistics, e.g., at large robotic warehouses. Specifically, on an m1×m2 gird, m1 ≥m2, our RTH (Rubik Table with Highways) algorithm computes solutions for routing up to m1m2/3 robots with uniformly randomly distributed start and goal configurations with a makespan of m1 +2m2 +o(m1), with high probability. Because the minimum makespan for such instances is m1 + m2 −o(m1), also with high probability, RTH guarantees m1+2m2 m1+m2 optimality as m1 → ∞ for random instances with up to 1/3 robot density, with high probability (m1+2m2)/(m1+m2) ∈(1,1.5]. Alongside the above-mentioned key result, we also establish: (1) for completely filled grids, i.e., m1m2 robots, any MRPP instance may be solved in polynomial time under a makespan of 7m1 + 14m2, (2) for m1m2/3 robots, RTH solves arbitrary MRPP instances with makespan of 3m1 + 4m2 + o(m1), (3) for m1m2/2 robots, a variation of RTH solves a random MRPP instance with the same 1-1.5 optimality guarantee, and (4) the same (m1+2m2)/(m1+m2) optimality guarantee holds for regularly distributed obstacles at 1/9 density together with 2m1m2/9 randomly distributed robots; such settings directly map to real-world parcel sorting scenarios. Moreover, we have developed effective, principled heuristics that further improve the computed optimality of RTH algorithms. In extensive numerical evaluations, RTH and its variants demonstrate exceptional scalability as compared with methods including ECBS and DDM, scaling to over 450 ×300 grids with 45,000 robots, and consistently achieves makespan around 1.5 optimal or better, as predicted by our theoretical analysis

Abstract:
The number of multi-robot systems deployed in field applications has increased dramatically over the years. Despite the recent advancement of navigation algorithms, autonomous robots often encounter challenging situations where the control policy fails and the human assistance is required to resume robot tasks. Human-robot collaboration can help achieve high-levels of autonomy, but monitoring and managing multiple robots at once by a single human supervisor remains a challenging problem. Our goal is to help a supervisor decide which robots to assist in which order such that the team performance can be maximized. We formulate the one-to-many supervision problem in uncertain environments as a dynamic graph traversal problem. An approximation algorithm based on the profitable tour problem on a static graph is developed to solve the original problem, and the approximation error is bounded and analyzed. Our case study on a simulated autonomous farm demonstrates superior team performance than baseline methods in task completion time and human working time, and that our method can be deployed in real-time for robot fleets with moderate size.

Abstract:
Robot learning holds the promise of learning policies that generalize broadly. However, such generalization requires sufficiently diverse datasets of the task of interest, which can be prohibitively expensive to collect. In other fields, such as computer vision, it is common to utilize shared, reusable datasets, such as ImageNet, to overcome this challenge, but this has proven difficult in robotics. In this paper, we ask: what would it take to enable practical data reuse in robotics for end-to-end skill learning? We hypothesize that the key is to use datasets with multiple tasks and multiple domains, such that a new user that wants to train their robot to perform a new task in a new domain can include this dataset in their training process and benefit from cross-task and cross-domain generalization. To evaluate this hypothesis, we collect a large multi-domain and multi-task dataset, with 7,200 demonstrations constituting 71 tasks across 10 environments, and empirically study how this data can improve the learning of new tasks in new environments. We find that jointly training with the proposed dataset and 50 demonstrations of a never-before-seen task in a new domain on average leads to a 2x improvement in success rate compared to using target domain data alone. We also find that data for only a few tasks in a new domain can bridge the domain gap and make it possible for a robot to perform a variety of prior tasks that were only seen in other domains. These results suggest that reusing diverse multi-task and multi-domain datasets, including our open-source dataset, may pave the way for broader robot generalization, eliminating the need to re-collect data for each new robot learning project.

Abstract:
Maintaining an up-to-date map to reflect recent changes in the scene is very important, particularly in situations involving repeated traversals by a robot operating in an environment over an extended period. Undetected changes may cause a deterioration in map quality, leading to poor localization, inefficient operations, and lost robots. Volumetric methods, such as truncated signed distance functions (TSDFs), have quickly gained traction due to their real-time production of a dense and detailed map, though map updating in scenes that change over time remains a challenge. We propose a framework that introduces a novel probabilistic object state representation to track object pose changes in semi-static scenes. The representation jointly models a stationarity score and a TSDF change measure for each object. A Bayesian update rule that incorporates both geometric and semantic information is derived to achieve consistent online map maintenance. To extensively evaluate our approach alongside the state-of-the-art, we release a novel real-world dataset in a warehouse environment. We also evaluate on the public ToyCar dataset. Our method outperforms state-of-the-art methods on the reconstruction quality of semi-static environments.

Abstract:
Multi-robot exploration of complex, unknown environments benefits from the collaboration and cooperation offered by inter-robot communication. Accurate radio signal strength prediction enables communication-aware exploration. Models which ignore the effect of the environment on signal propagation or rely on a priori maps suffer in unknown, communication-restricted (e.g. subterranean) environments. In this work, we present Propagation Environment Modeling and Learning (PropEM-L), a framework which leverages real-time sensor-derived 3D geometric representations of an environment to extract information about line of sight between radios and attenuating walls/obstacles in order to accurately predict received signal strength (RSS). Our data-driven approach combines the strengths of well-known models of signal propagation phenomena (e.g. shadowing, reflection, diffraction) and machine learning, and can adapt online to new environments. We demonstrate the performance of PropEM-L on a six-robot team in a communication-restricted environment with subway-like, mine-like, and cave-like characteristics, constructed for the 2021 DARPA Subterranean Challenge. Our findings indicate that PropEM-L can improve signal strength prediction accuracy by up to 44% over a log-distance path loss model.

Abstract:
Autonomous underwater vehicles (AUVs) have long lagged behind other types of robots in supporting natural communication modes for human-robot interaction. Due to the limitations of the environment, most AUVs use digital displays or topside human-in-the-loop communications as their primary or only communication vectors. Natural methods for robot-to-human communication such as robot ""gestures"" have been proposed, but never evaluated on non-simulated AUVs. In this paper, we enhance, implement and evaluate a robot-to-human communication system for AUVs called Robot Communication Via Motion (RCVM), which utilizes explicit motion phrases (kinemes) to communicate with a dive partner. We present a small pilot study that shows our implementation to be reasonably effective in person followed by a large-population study, comparing the communication effectiveness of our RCVM implementation to three baseline systems. Our results establish RCVM as an effective method of robot-to-human communication underwater and reveal the differences with more traditional communication vectors in how accurately communication is achieved at different viewpoints and types of information payloads.

Abstract:
Many applications in robotics require computing a robot manipulator's "proximity" to a collision state in a given configuration. This collision proximity is commonly framed as a summation over closest Euclidean distances between many pairs of rigid shapes in a scene. Computing many such pairwise distances is inefficient, while more efficient approximations of this procedure, such as through supervised learning, lack accuracy and robustness. In this work, we present an approach for computing a collision proximity function for robot manipulators that formalizes the trade-off between efficiency and accuracy and provides an algorithm that gives control over it. Our algorithm, called Proxima, works in one of two ways: (1) given a time budget as input, the algorithm returns an as-accurate-as-possible proximity approximation value in this time; or (2) given an accuracy budget, the algorithm returns an as-fast-as-possible proximity approximation value that is within the given accuracy bounds. We show the robustness of our approach through analytical investigation and simulation experiments on a wide set of robot models ranging from 6 to 132 degrees of freedom. We demonstrate that controlling the trade-off between efficiency and accuracy in proximity computations via our approach can enable safe and accurate real-time robot motion-optimization even on high-dimensional robot models.

Abstract:
Novel view synthesis is required in many robotic applications, such as VR teleoperation and scene reconstruction. Existing methods are often too slow for these contexts, cannot handle dynamic scenes, and are limited by their explicit depth estimation stage, where incorrect depth predictions can lead to large projection errors. Our proposed method runs in real time on live streaming data and avoids explicit depth estimation by efficiently warping input images into the target frame for a range of assumed depth planes. The resulting plane sweep volume (PSV) is directly fed into our network, which first estimates soft PSV masks in a self-supervised manner, and then directly produces the novel output view. This improves efficiency and performance on transparent, reflective, thin, and feature-less scene parts. FaDIV-Syn can perform both interpolation and extrapolation tasks at 540p in real-time and outperforms state-of-the-art extrapolation methods on the large-scale RealEstate10k dataset. We thoroughly evaluate ablations, such as removing the Soft-Masking network, training from fewer examples as well as generalization to higher resolutions and stronger depth discretization. Our implementation is available.

Abstract:
In this work, we propose a novel safe and scalable decentralized solution for multi-agent control in the presence of stochastic disturbances. Safety is mathematically encoded using stochastic control barrier functions and safe controls are computed by solving quadratic programs. Decentralization is achieved by augmenting to each agent's optimization variables, copy variables, for its neighbors. This allows us to decouple the centralized multi-agent optimization problem. However, to ensure safety, neighboring agents must agree on what is safe for both of us, creating a need for consensus. To enable safe consensus solutions, we incorporate an ADMM-based approach. Specifically, we propose a Merged Consensus ADMM-OSQP implicit neural network layer, that solves a mini-batch of both, local quadratic programs as well as the overall consensus problem, as a single optimization problem. This layer is embedded within a Deep Forward-Backward Stochastic Differential Equations (FBSDEs) network architecture at every time step, to facilitate end-to-end differentiable, safe and decentralized stochastic optimal control. The efficacy of the proposed approach is demonstrated on several challenging multi-robot tasks in simulation. By imposing collision avoidance constraints, the safe operation of all agents is ensured during the entire training process. We also demonstrate superior scalability in terms of computational and memory savings as compared to a centralized approach.

Abstract:
We present a coordinated autonomy pipeline for multi-sensor exploration of confined environments. We simultaneously address four broad challenges that are typically overlooked in prior work: (a) make effective use of both range and vision sensing modalities, (b) perform this exploration across a wide range of environments, (c) be resilient to adverse events, and (d) execute this onboard a team of physical robots. Our solution centers around a behavior tree architecture, which adaptively switches between various behaviors involving coordinated exploration and responding to adverse events. Our exploration strategy exploits the benefits of both visual and range sensors with a new frontier-based exploration algorithm. The autonomy pipeline is evaluated with an extensive set of field experiments, with teams of up to 3 robots that fly up to 3 m/s and distances exceeding one kilometer. We provide a summary of various field experiments and detail resilient behaviors that arose: maneuvering narrow doorways, adapting to unexpected environment changes, and emergency landing. We provide an extended discussion of lessons learned, release software as open source, and present a video in the supplementary material.

Abstract:
Modeling and manipulating elasto-plastic objects are essential capabilities for robots to perform complex industrial and household interaction tasks (e.g., stuffing dumplings, rolling sushi, and making pottery). However, due to the high degree of freedom of elasto-plastic objects, significant challenges exist in virtually every aspect of the robotic manipulation pipeline, e.g., representing the states, modeling the dynamics, and synthesizing the control signals. We propose to tackle these challenges by employing a particle-based representation for elasto-plastic objects in a model-based planning framework. Our system, RoboCraft, only assumes access to raw RGBD visual observations. It transforms the sensing data into particles and learns a particle-based dynamics model using graph neural networks (GNNs) to capture the structure of the underlying system. The learned model can then be coupled with model-predictive control (MPC) algorithms to plan the robot’s behavior. We show through experiments that with just 10 minutes of real-world robotic interaction data, our robot can learn a dynamics model that can be used to synthesize control signals to deform elasto-plastic objects into various target shapes, including shapes that the robot has never encountered before. We perform systematic evaluations in both simulation and the real world to demonstrate the robot’s manipulation capabilities and ability to generalize to a more complex action space, different tool shapes, and a mixture of motion modes. We also conduct comparisons between RoboCraft and untrained human subjects controlling the gripper to manipulate deformable objects in both simulation and the real world. Our learned model-based planning framework is comparable to and sometimes better than human subjects on the tested tasks.

Abstract:
Humans are capable of completing a range of challenging manipulation tasks that require reasoning jointly over modalities such as vision, touch, and sound. Moreover, many such tasks are partially-observed; for example, taking a notebook out of a backpack will lead to visual occlusion and require reasoning over the history of audio or tactile information. While robust tactile sensing can be costly to capture on robots, microphones near or on a robot's gripper are a cheap and easy way to acquire audio feedback of contact events, which can be a surprisingly valuable data source for perception in the absence of vision. Motivated by the potential for sound to mitigate visual occlusion, we aim to learn a set of challenging partially-observed manipulation tasks from visual and audio inputs. Our proposed system learns these tasks by combining offline imitation learning from a modest number of tele-operated demonstrations and online finetuning using human provided interventions. In a set of simulated tasks, we find that our system benefits from using audio, and that by using online interventions we are able to improve the success rate of offline imitation learning by ~20%. Finally, we find that our system can complete a set of challenging, partially-observed tasks on a Franka Emika Panda robot, like extracting keys from a bag, with a 70% success rate, 50% higher than a policy that does not use audio.

Abstract:
We build a system that enables any human to control a robot hand and arm, simply by demonstrating motions with their own hand. The robot observes the human operator via a single RGB camera and imitates their actions in real-time. Human hands and robot hands differ in shape, size, and joint structure, and performing this translation from a single uncalibrated camera is a highly underconstrained problem. Moreover, the retargeted trajectories must effectively execute tasks on a physical robot, which requires them to be temporally smooth and free of self-collisions. Our key insight is that while paired human-robot correspondence data is expensive to collect, the internet contains a massive corpus of rich and diverse human hand videos. We leverage this data to train a system that understands human hands and retargets a human video stream into a robot hand-arm trajectory that is smooth, swift, safe, and semantically similar to the guiding demonstration. We demonstrate that it enables previously untrained people to teleoperate a robot on various dexterous manipulation tasks. Our low-cost, glove-free, marker-free remote teleoperation system makes robot teaching more accessible and we hope that it can aid robots in learning to act autonomously in the real world. Video demos can be found at: https://robotic-telekinesis.github.io

Abstract:
For autonomous quadruped robot navigation in various complex environments, a typical SOTA system is composed of four main modules -- mapper, global planner, local planner, and command-tracking controller -- in a hierarchical manner. In this paper, we build a robust and safe local planner which is designed to generate a velocity plan to track a coarsely planned path from the global planner. Previous works used waypoint-based methods (e.g. Proportional-Differential control and pure pursuit) which simplify the path tracking problem to local point-goal navigation. However, they suffer from frequent collisions in geometrically complex and narrow environments because of two reasons; the global planner uses a coarse and inaccurate model and the local planner is unable to track the global plan sufficiently well. Currently, deep learning methods are an appealing alternative because they can learn safety and path feasibility from experience more accurately. However, existing deep learning methods are not capable of planning for a long horizon. In this work, we propose a learning-based fully autonomous navigation framework composed of three innovative elements: a learned forward dynamics model (FDM), an online sampling-based model-predictive controller, and an informed trajectory sampler (ITS). Using our framework, a quadruped robot can autonomously navigate in various complex environments without a collision and generate a smoother command plan compared to the baseline method. Furthermore, our method can reactively handle unexpected obstacles on the planned path and avoid them.

Abstract:
3D scene graphs have recently emerged as a powerful high-level representation of 3D environments. A 3D scene graph models the environment as a layered graph where nodes represent spatial concepts at multiple levels of abstraction (from low-level geometry to high-level semantics including objects, places, rooms, buildings, etc.) and edges represent relations between concepts. While 3D scene graphs can serve as an advanced “mental model” for robots, how to build such a rich representation in real-time is still uncharted territory. This paper describes a real-time Spatial Perception System, a suite of algorithms to build a 3D scene graph from sensor data in real-time. Our first contribution is to develop real-time algorithms to incrementally construct the layers of a scene graph as the robot explores the environment; these algorithms build a local ESDF around the current robot trajectory estimate, extract a topological map of places from the ESDF, and then segment the places into rooms using an approach inspired by community-detection techniques. Our second contribution is to investigate loop closure detection and optimization in 3D scene graphs. We show that 3D scene graphs allow defining hierarchical descriptors for place recognition; our descriptors capture statistics across layers in the scene graph, ranging from low-level visual appearance, to summary statistics about objects and places. We then propose the first algorithm to optimize a 3D scene graph in response to loop closures; our approach relies on embedded deformation graphs to simultaneously correct all layers of the scene graph. We implement the proposed system into a highly parallelized architecture, named Hydra, that combines fast early and mid-level perception processes with slower high-level perception. We evaluate Hydra on simulated and real data and show it is able to reconstruct 3D scene graphs with an accuracy comparable with batch offline methods, while running online.

Abstract:
This work provides a complete framework for the simulation, co-optimization, and sim-to-real transfer of the design and control of soft legged robots. The compliance of soft robots provides a form of ``mechanical intelligence''---the ability to passively exhibit behaviors that would otherwise be difficult to program. Exploiting this capacity requires careful consideration of the coupling between mechanical design and control. Co-optimization provides a promising means to generate sophisticated soft robots by reasoning over this coupling. However, the complex nature of soft robot dynamics makes it difficult to achieve a simulation environment that is both sufficiently accurate to allow for sim-to-real transfer and fast enough for contemporary co-optimization algorithms. In this work, we describe a modularized model order reduction algorithm that significantly improves the efficiency of finite element simulation, while preserving the accuracy required to successfully learn effective soft robot design-control pairs that transfer to reality. We propose a reinforcement learning-based framework for co-optimization and demonstrate successful optimization, construction, and zero-shot sim-to-real transfer of several soft crawling robots. Our learned robot outperforms an expert-designed crawling robot, showing that our approach can generate novel, high-performing designs even in well-understood domains.

Abstract:
It is well-known that inverse dynamics models can improve tracking performance in robot control. These models need to precisely capture the robot dynamics, which consist of well-understood components, e.g., rigid body dynamics, and effects that remain challenging to capture, e.g., stick-slip friction and mechanical flexibilities. Such effects exhibit hysteresis and partial observability, rendering them, particularly challenging to model. Hence, hybrid models, which combine a physical prior with data-driven approaches are especially well-suited in this setting. We present a novel hybrid model formulation that enables us to identify fully physically consistent inertial parameters of a rigid body dynamics model which is paired with a recurrent neural network architecture, allowing us to capture unmodeled partially observable effects using the network memory. We compare our approach against state-of-the-art inverse dynamics models on a 7 degree of freedom manipulator. Using data sets obtained through an optimal experiment design approach, we study the accuracy of offline torque prediction and generalization capabilities of joint learning methods. In control experiments on the real system, we evaluate the model as a feed-forward term for impedance control and show the feedback gains can be drastically reduced to achieve a given tracking accuracy.

Abstract:
In this work, we propose a new learning-based iterative control (IC) framework that enables a complex soft-robotic arm to track trajectories accurately. Compared to traditional iterative learning control (ILC), which operates on a single fixed reference trajectory, we use a deep learning approach to generalize across various reference trajectories. The resulting nonlinear mapping computes feedforward actions and is used in a two degrees of freedom control design. Our method incorporates prior knowledge about the system dynamics and by learning only feedforward actions, it mitigates the risk of instability. We demonstrate a low sample complexity and an excellent tracking performance in real-world experiments. The experiments are carried out on a custom-made robot arm with four degrees of freedom that is actuated with pneumatic artificial muscles. The experiments include high acceleration and high velocity motion.

Abstract:
Traversability prediction is a fundamental perception capability for autonomous navigation. The diversity of data in different domains imposes significant gaps to the prediction performance of the perception model. In this work, we make efforts to reduce the gaps by proposing a novel coarse-to-fine unsupervised domain adaptation (UDA) model - CALI. Our aim is to transfer the perception model with high data efficiency, eliminate the prohibitively expensive data labeling, and improve the generalization capability during the adaptation from easy-to-obtain source domains to various challenging target domains. We prove that a combination of a coarse alignment and a fine alignment can be beneficial to each other and further design a first-coarse-then-fine alignment process. This proposed work bridges theoretical analyses and algorithm designs, leading to an efficient UDA model with easy and stable training. We show the advantages of our proposed model over multiple baselines in several challenging domain adaptation setups. To further validate the effectiveness of our model, we then combine our perception model with a visual planner to build a navigation system and show the high reliability of our model in complex natural environments where no labeled data is available. The robot navigation demonstration can be seen in this video: https://www.youtube.com/watch?v=Nqsegaq x-o.

Abstract:
Bridging model-based safety and model-free reinforcement learning (RL) for dynamic robots is appealing since model-based methods are able to provide formal safety guarantees, while RL-based methods are able to exploit the robot agility by learning from the full-order system dynamics. However, current approaches to tackle this problem are mostly restricted to simple systems. In this paper, we propose a new method to combine model-based safety with model-free reinforcement learning by explicitly finding a low-dimensional model of the system controlled by a RL policy and applying stability and safety guarantees on that simple model. We use a complex bipedal robot Cassie, which is a high dimensional nonlinear system with hybrid dynamics and underactuation, and its RL-based walking controller as an example. We show that a low-dimensional dynamical model is sufficient to capture the dynamics of the closed-loop system. We demonstrate that this model is linear, asymptotically stable, and is decoupled across control input in all dimensions. We further exemplify that such linearity exists even when using different RL control policies. Such results point out an interesting direction to understand the relationship between RL and optimal control: whether RL tends to linearize the nonlinear system during training in some cases. Furthermore, we illustrate that the found linear model is able to provide guarantees by safety-critical optimal control framework, e.g., Model Predictive Control with Control Barrier Functions, on an example of autonomous navigation using Cassie while taking advantage of the agility provided by the RL-based controller.

Abstract:
This paper extends recent work in demonstrating magnetic manipulation of conductive, nonmagnetic objects using rotating magnetic dipole fields. The current state of the art demonstrates dexterous manipulation of solid copper spheres with all object parameters known a priori. Our approach expands the previous model that contained three discrete modes to a single, continuous model that covers all possible relative positions of the manipulated object relative to the magnetic field source. We further leverage this new model to examine manipulation of spherical objects with unknown physical parameters, by applying techniques from the online-optimization and adaptive-control literature. Our experimental results validate our new dynamics model, showing that we get comparable or improved performance to the previously proposed model, while solving a simpler optimization problem for control. We further demonstrate the first physical magnetic control of aluminum spheres, as previous controllers were only physically validated on copper spheres. We show that our adaptive control framework can quickly acquire accurate estimates of the true spherical radius when weakly initialized, enabling control of spheres with unknown physical properties.Finally, we demonstrate that the spherical-object model can be used as an approximate model for adaptive control of nonspherical objects by preforming the first magnetic manipulation of nonspherical, non-magnetic objects.

Abstract:
Learning from demonstration (LfD) seeks to democratize robotics by enabling non-experts to intuitively program robots to perform novel skills through human task demonstration. Yet, LfD is challenging under a task and motion planning setting which requires hierarchical abstractions. Prior work has studied mechanisms for eliciting demonstrations that include hierarchical specifications of task and motion, via keyframes [1] or hierarchical task network specifications [2]. However, such prior works have not examined whether non-roboticist end-users are capable of providing such hierarchical demonstrations without explicit training from a roboticist showing how to teach each task [3]. To address the limitations and assumptions of prior work, we conduct two novel human-subjects experiments to answer (1) what are the necessary conditions to teach users through hierarchy and task abstractions and (2) what instructional information or feedback is required to support users to learn to program robots effectively to solve novel tasks. Our first experiment shows that fewer than half (35.71%) of our subjects provide demonstrations with sub-task abstractions when not primed. Our second experiment demonstrates that users fail to teach the robot correctly when not shown a video demonstration of an expert’s teaching strategy for the exact task that the subject is training. Not even showing the video of an analogue task was sufficient. These experiments reveal the need for fundamentally different approaches in LfD which can allow end-users to teach generalizable long-horizon tasks to robots without the need to be coached by experts at every step.

Abstract:
Aerial robots have demonstrated impressive feats of precise control, such as dynamic flight through openings or highly complex choreographies. Despite the accuracy needed for these tasks, there are problems that require levels of precision that are challenging to achieve today. One such problem is aerial interaction. Advances in aerial robot design and control have made such contact-based tasks possible and opened up research into challenging real-world tasks, including contact-based inspection. However, while centimetre accuracy is sufficient and achievable for inspection tasks, the positioning accuracy needed for other problems, such as layouting on construction sites or general push-and-slide tasks, is millimetres. To achieve such a high precision, we propose a new aerial system composed of an aerial vehicle equipped with a novel ""smart"" end-effector leveraging a stability-optimized Gough-Stewart mechanism. We present its design process and features incorporating the principles of compliance, multiple contact points, actuation, and self-containment. In experiments, we verify that the design choices made for our novel end-effector are necessary to obtain the desired positioning precision. Furthermore, we demonstrate that we can reliably mark lines on ceilings with millimetre accuracy without the need for precise modeling or sophisticated control of the aerial robot.