Some of the most exciting recent robot learning research results developed within our projects are shown here. The videos are part of our YouTube channel. For the older videos, check out our video archive.

Combining domain randomization and reinforcement learning is a widely used approach to obtain control policies that can bridge the gap between simulation and reality. However, existing methods make limiting assumptions on the form of the domain parameter distribution, which prevents them from utilizing the full power of domain randomization. Typically, a restricted family of probability distributions (e.g., normal or uniform) is chosen a priori for every parameter. Furthermore, straightforward approaches based on deep learning require differentiable simulators, which are either not available or can only simulate a limited class of systems. Such rigid assumptions diminish the applicability of domain randomization in robotics. Building upon recently proposed neural likelihood-free inference methods, we introduce Neural Posterior Domain Randomization (NPDR), an algorithm that alternates between learning a policy from a randomized simulator and adapting the posterior distribution over the simulator’s parameters in a Bayesian fashion. Our approach only requires a parameterized simulator, coarse prior ranges, a policy (optionally with an optimization routine), and a small set of real-world observations. Most importantly, the domain parameter distribution is not restricted to a specific family, parameters can be correlated, and the simulator does not have to be differentiable. We show that the presented method is able to efficiently adapt the posterior over the domain parameters to more closely match the observed dynamics. Moreover, we demonstrate that NPDR can learn transferable policies using fewer real-world rollouts than comparable algorithms.

**Want to know more? Read:**
Muratore, F.; Gruner, T.; Wiese, F.; Belousov, B.; Gienger, M.; Peters, J. (2021). Neural Posterior Domain Randomization, *Conference on Robot Learning (CoRL)*.

Despite their abundance in robotics and nature, underactuated systems remain a challenge for control engineering. Trajectory optimization provides a generally applicable solution; however, its efficiency strongly depends on the skill of the engineer to frame the problem in an optimizer-friendly way. This paper proposes a procedure that automates such problem reformulation for a class of tasks in which the desired trajectory is specified by a sequence of waypoints. The approach is based on introducing auxiliary optimization variables that represent waypoint activations. To validate the proposed method, a letter drawing task is set up where shapes traced by the tip of a rotary inverted pendulum are visualized using long exposure photography.

**Want to know more? Read:**
Eilers, C.; Eschmann, J.; Menzenbach, R.; Belousov, B.; Muratore, F.; Peters, J. (2020). Underactuated Waypoint Trajectory Optimization for Light Painting Photography, *Proceedings of the IEEE International Conference on Robotics and Automation (ICRA)*.

When learning policies for robot control, the required real-world data is typically prohibitively expensive to acquire, so learning in simulation is a popular strategy. Unfortunately, such policies are often not transferable to the real world due to a mismatch between the simulation and reality, called the 'reality gap'. Domain randomization methods tackle this problem by randomizing the physics simulator (source domain) during training according to a distribution over domain parameters in order to obtain more robust policies that are able to overcome the reality gap. Most domain randomization approaches sample the domain parameters from a fixed distribution. This solution is suboptimal in the context of sim-to-real transferability since it yields policies that have been trained without explicitly optimizing for the reward on the real system (target domain). Additionally, a fixed distribution assumes there is prior knowledge about the uncertainty over the domain parameters. In this paper, we propose Bayesian Domain Randomization (BayRn), a black-box sim-to-real algorithm that solves tasks efficiently by adapting the domain parameter distribution during learning given sparse data from the real-world target domain. BayRn uses Bayesian optimization to search the space of source domain distribution parameters for those that lead to a policy which maximizes the real-world objective, allowing for adaptive distributions during policy optimization. We experimentally validate the proposed approach in sim-to-sim as well as in sim-to-real experiments, comparing against three baseline methods on two robotic tasks. Our results show that BayRn is able to perform sim-to-real transfer, while significantly reducing the required prior knowledge.
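
The fixed-distribution baseline described above can be sketched on a toy problem. Everything in this sketch is an assumption made for illustration: the 1D point mass, the PD-style controller, the candidate gains, and the normal distribution over the mass do not come from the paper, and the sketch does not implement BayRn itself.

```python
import random


def simulate_return(gain, mass, steps=200, dt=0.01):
    """Roll out a PD-style controller on a 1D point mass (toy simulator).

    The return is the negative accumulated squared position error.
    """
    pos, vel = 1.0, 0.0
    ret = 0.0
    for _ in range(steps):
        force = -gain * pos - 2.0 * vel
        vel += force / mass * dt
        pos += vel * dt
        ret -= pos * pos * dt
    return ret


def randomized_return(gain, mass_mean, mass_std, n_domains, rng):
    """Average the return over domains drawn from a fixed normal distribution."""
    total = 0.0
    for _ in range(n_domains):
        # Domain parameter: the mass, clipped to stay physically plausible.
        mass = max(0.1, rng.gauss(mass_mean, mass_std))
        total += simulate_return(gain, mass)
    return total / n_domains


rng = random.Random(0)
candidate_gains = [1.0, 5.0, 20.0]  # hypothetical policy parameters
scores = {g: randomized_return(g, 1.0, 0.3, 50, rng) for g in candidate_gains}
best_gain = max(scores, key=scores.get)
print(best_gain, scores[best_gain])
```

Here `mass_mean` and `mass_std` are chosen once and never change. An adaptive method in the spirit of the text would instead treat them as quantities to be searched, scored by the return observed on the real system.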

**Want to know more? Read:**
Muratore, F.; Eilers, C.; Gienger, M.; Peters, J. (2021). Data-efficient Domain Randomization with Bayesian Optimization, *IEEE Robotics and Automation Letters (RA-L), with Presentation at the IEEE International Conference on Robotics and Automation (ICRA)*, IEEE.

Sim-to-Real transfer of policies learned with Bayesian Domain Randomization (BayRn) on the Barrett WAM (ball-in-a-cup task) and the Quanser Qube (swing-up and balance task). When learning from simulations, the optimizer is free to exploit the simulation. Thus, the resulting policies can perform very well in simulation but transfer poorly to the real-world counterpart. For example, both of the subsequent policies yield a return of 1 and thus look equally good to the learner. BayRn uses a Gaussian process to learn how to adapt the randomized simulator solely from the observed real-world returns. BayRn is agnostic toward the policy optimization subroutine. In this work, we used PPO and PoWER. We also evaluated BayRn on an underactuated swing-up and balance task.

**Want to know more? Read:**
Muratore, F.; Eilers, C.; Gienger, M.; Peters, J. (2021). Data-efficient Domain Randomization with Bayesian Optimization, *IEEE Robotics and Automation Letters (RA-L), with Presentation at the IEEE International Conference on Robotics and Automation (ICRA)*, IEEE.

Robots that can learn in the physical world will be important for enabling robots to escape their stiff, pre-programmed movements. For dynamic high-acceleration tasks, such as juggling, learning in the real world is particularly challenging as one must push the limits of the robot and its actuation without harming the system. Therefore, learning these tasks on the physical robot amplifies the necessity of sample efficiency and safety for robot learning algorithms, making a high-speed task an ideal benchmark to highlight robot learning systems. To achieve learning on the physical system, we propose a learning system that directly incorporates the safety and sample efficiency requirements in the design of the policy representation, initialization and optimization. This approach is in contrast to prior work, which mainly focuses on the details of the learning algorithm but neglects the engineering details. We demonstrate that this system enables the high-speed Barrett WAM to learn juggling of two balls from 56 minutes of experience. The robot learns to juggle consistently solely based on a binary reward signal. The optimal policy is able to juggle for up to 33 minutes or about 4500 repeated catches.
**Learning of the Juggling Task** - For learning on the physical Barrett WAM, 20 episodes were performed. During each episode, 25 randomly sampled parameters were executed and the episodic reward was evaluated. If the robot successfully juggles for 10 s, the roll-out is stopped. Roll-outs that were corrupted due to obvious environment errors were repeated using the same parameters. Roll-outs with minor variations caused by the environment initialization were not repeated. After collecting the samples, the policy was updated using eREPS with a KL constraint of 2. The video shows **all** trials executed on the physical system to learn the optimal policy.
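
As a rough illustration of the update step mentioned above, the following sketch computes episodic-REPS-style sample weights for one batch of 25 roll-outs with binary rewards and a KL bound of 2. It uses the standard episodic REPS dual, g(η) = ηε + η log mean exp(R/η), and minimizes it with a coarse grid search. The grid, the number of successful roll-outs, and the function name `ereps_weights` are assumptions for illustration, not details taken from the paper.

```python
import math


def ereps_weights(returns, epsilon):
    """Return normalized sample weights and the temperature eta.

    Minimizes the episodic REPS dual over a log-spaced grid of eta values
    (an illustrative stand-in for a proper 1D optimizer).
    """
    n = len(returns)
    r_max = max(returns)  # subtracted for numerical stability

    def dual(eta):
        mean_exp = sum(math.exp((r - r_max) / eta) for r in returns) / n
        return eta * epsilon + r_max + eta * math.log(mean_exp)

    etas = [10.0 ** (k / 20.0) for k in range(-60, 61)]
    eta = min(etas, key=dual)
    unnorm = [math.exp((r - r_max) / eta) for r in returns]
    z = sum(unnorm)
    return [u / z for u in unnorm], eta


# Hypothetical episode: 5 successful roll-outs out of 25, binary reward.
returns = [1.0] * 5 + [0.0] * 20
weights, eta = ereps_weights(returns, epsilon=2.0)

# KL divergence between the re-weighted and the uniform sample distribution.
kl = sum(w * math.log(w * len(weights)) for w in weights if w > 0.0)
print(eta, kl)
```

The weights would then be used for a weighted maximum-likelihood fit of the search distribution over policy parameters. With binary rewards, the weighting reduces to concentrating probability mass on the successful roll-outs, as far as the KL bound allows.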

**Want to know more? Read:**

- Ploeger, K.; Lutter, M.; Peters, J. (2020). High Acceleration Reinforcement Learning for Real-World Juggling with Binary Rewards,
*Conference on Robot Learning (CoRL)*.

Highly dynamic robotic tasks require high-speed and reactive robots. These tasks are particularly challenging due to the physical constraints, the hardware limitations, and the high uncertainty of dynamics and sensor measurements. To face these issues, it is crucial to design robotic agents that generate precise and fast trajectories and react immediately to environmental changes. Air hockey is an example of this kind of task. Due to the environment's characteristics, it is possible to formalize the problem and derive clean mathematical solutions. For these reasons, this environment is perfect for pushing the performance of currently available general-purpose robotic manipulators to the limit. Using two Kuka IIWA 14 arms, we show how to design a policy for general-purpose robotic manipulators for the air hockey game. We demonstrate that a real robot arm can perform fast-hitting movements and that the two robots can play against each other on a medium-size air hockey table in simulation.

**Want to know more? Read:**

- Liu, P.; Tateo, D.; Bou-Ammar, H.; Peters, J. (2021). Efficient and Reactive Planning for High Speed Robot Air Hockey,
*Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS)*.

Depending on the task at hand, learning behavior via reinforcement learning can be challenging or impractical - for example, due to the unsolved problem of targeted exploration. In this work, we take a look at so-called curriculum reinforcement learning, in which the goal is to sidestep challenges of RL algorithms by training them on a sequence of tasks that guides their learning towards a target task (or a set of those). More precisely, we show that an instantiation of self-paced learning in the domain of RL **a)** generates curricula that can drastically improve the learning performance of RL agents and **b)** can be seen as a form of tempering applied to the RL objective.

**Want to know more? Read:**

- Klink, P.; Abdulsamad, H.; Belousov, B.; D'Eramo, C.; Peters, J.; Pajarinen, J. (2021). A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning,
*Journal of Machine Learning Research (JMLR)*.
- Klink, P.; D'Eramo, C.; Peters, J.; Pajarinen, J. (2020). Self-Paced Deep Reinforcement Learning,
*Advances in Neural Information Processing Systems (NIPS / NeurIPS)*.
- Klink, P.; Abdulsamad, H.; Belousov, B.; Peters, J. (2019). Self-Paced Contextual Reinforcement Learning,
*Proceedings of the 3rd Conference on Robot Learning (CoRL)*.

Reinforcement learning (RL) algorithms still suffer from high sample complexity despite outstanding recent successes. The need for intensive interactions with the environment is especially observed in many widely popular policy gradient algorithms that perform updates using on-policy samples. The price of such inefficiency becomes evident in real-world scenarios such as interaction-driven robot learning, where the success of RL has been rather limited. We address this issue by building on the general sample efficiency of off-policy algorithms. With nonparametric regression and density estimation methods we construct a nonparametric Bellman equation in a principled manner, which allows us to obtain closed-form estimates of the value function, and to analytically express the full policy gradient. We provide a theoretical analysis of our estimate to show that it is consistent under mild smoothness assumptions and empirically show that our approach has better sample efficiency than state-of-the-art policy gradient methods. **Video:** Quanser CartPole: application of a stochastic policy learned in simulation with NOPG-S using a uniformly randomly sampled dataset.

**Want to know more? Read:**

- Tosatto, S.; Carvalho, J.; Abdulsamad, H.; Peters, J. (2020). A Nonparametric Off-Policy Policy Gradient,
*Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS)*.

**SPOTA -- Sim-to-Real Evaluation**
Learning robot control policies from physics simulations is of great interest to the robotics community as it may render the learning process faster, cheaper, and safer by alleviating the need for expensive real-world experiments. However, the direct transfer of learned behavior from simulation to reality is a major challenge. Optimizing a policy on a slightly faulty simulator can easily lead to the maximization of the 'Simulation Optimization Bias' (SOB). In this case, the optimizer exploits modeling errors of the simulator such that the resulting behavior can potentially damage the robot. We tackle this challenge by applying domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA) which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the over-fitting to the set of domains experienced while training. Our experimental results on two different second-order nonlinear systems show that the new simulation-based policy search algorithm is able to learn a control policy exclusively from a randomized simulator, which can be applied directly to real systems without any additional training.
The video shows the Sim-to-Real transfer of a policy learned by Simulation-based Policy Optimization with Transferability Assessment (SPOTA) on the Ball-Balancer and Cart-Pole platform from Quanser.
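
The idea of quantifying over-fitting to the sampled domains can be illustrated with a toy gap estimate: the best average return achievable on a finite set of sampled domains, minus the average return of a fixed candidate policy on the same domains. This is only a sketch of the general idea under stated assumptions. The quadratic return, the grid of policy parameters, the threshold, and the function names are hypothetical, and the code is not the SOB estimator used by SPOTA.

```python
import random


def avg_return(theta, domains):
    """Average toy return of policy parameter theta over sampled domains."""
    return sum(-(theta - xi) ** 2 for xi in domains) / len(domains)


def gap_estimate(theta_candidate, candidates, domains):
    """In-sample optimum minus the candidate's return on the same domains."""
    best_in_sample = max(avg_return(t, domains) for t in candidates)
    return best_in_sample - avg_return(theta_candidate, domains)


rng = random.Random(1)
candidates = [k / 10.0 for k in range(-20, 21)]  # hypothetical policy grid
theta_c = 0.0  # candidate policy, assumed to come from earlier training

# A small set of randomized domains; few domains make over-fitting more likely.
domains = [rng.gauss(0.0, 1.0) for _ in range(5)]
gap = gap_estimate(theta_c, candidates, domains)

# Hypothetical stopping rule: accept the candidate once the gap is small.
stop_training = gap < 0.05
print(gap, stop_training)
```

Because the candidate is part of the grid, the gap is never negative. A large gap indicates that a policy tuned to these particular domains looks much better on them than the candidate does, which is the kind of optimistic bias a stopping criterion should guard against.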

**SPOTA -- Cross-Evaluation Between Vortex and Bullet Physics Engines**
We apply domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA) which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the over-fitting to the set of domains experienced while training.
Supplementary video to "Domain Randomization for Simulation-Based Policy Optimization with Transferability Assessment" (CoRL 2018).
The comparison against LQR, (vanilla) TRPO, and EPOpt uses synchronized random seeds and 4 different initial positions.
In this setup, we both train and test in Vortex.

**SPOTA -- Evaluation in Different Environments**
We apply domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA) which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the over-fitting to the set of domains experienced while training.
Supplementary video to "Domain Randomization for Simulation-Based Policy Optimization with Transferability Assessment" (CoRL 2018).
The comparison against LQR, (vanilla) TRPO, and EPOpt uses synchronized random seeds and 4 different initial positions.
In this setup, we run in an environment with nominal parameters.

**Want to know more? Read:**

- Muratore, F.; Gienger, M.; Peters, J. (2021). Assessing Transferability from Simulation to Reality for Reinforcement Learning,
*IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI)*, **43**, **4**, pp. 1172-1183, IEEE.

Teaching motor skills to robots through human demonstrations, an approach called “imitation learning”, is an alternative to hand-coding each new robot behavior. Imitation learning is relatively cheap in terms of time and labor and is a promising route to give robots the necessary functionalities for widespread use in households, stores, hospitals, etc. However, current imitation learning techniques struggle with a number of challenges that prevent their wide usability. For instance, robots might not be able to accurately reproduce every human demonstration and it is not always clear how robots should generalize a movement to new contexts. This paper addresses those challenges by presenting a method to incrementally teach context-dependent motor skills to robots. The human demonstrates trajectories for different contexts by moving the links of the robot and partially or fully refines those trajectories by disturbing the movements of the robot while it executes the behavior it has learned so far. A joint probability distribution over trajectories and contexts can then be built based on those demonstrations and refinements. Given a new context, the robot computes the most probable trajectory, which can also be refined by the human. The joint probability distribution is incrementally updated with the refined trajectories. We have evaluated our method with experiments in which an elastically actuated robot arm with four degrees of freedom learns how to reach a ball at different positions.

**Want to know more? Read:**

- Ewerton, M.; Maeda, G.J.; Kollegger, G.; Wiemeyer, J.; Peters, J. (2016). Incremental Imitation Learning of Context-Dependent Motor Skills,
*Proceedings of the International Conference on Humanoid Robots (HUMANOIDS)*, pp. 351-358.

Robots that can learn over time by interacting with non-technical users must be capable of acquiring new motor skills incrementally. The problem then is deciding when to teach the robot a new skill or when to rely on the robot generalizing its actions. This decision can be made by the robot if it is provided with means to quantify the suitability of its own skill given an unseen task. To this end, we present an algorithm that allows a robot to make active requests to incrementally learn movement primitives. A movement primitive is learned on a trajectory output by a Gaussian Process. The latter is used as a library of demonstrations that can be extrapolated with confidence margins. This combination not only allows the robot to generalize using as few as a single demonstration but, more importantly, to indicate when such generalization can be executed with confidence or not. In experiments, a real robot arm indicates to the user which demonstrations should be provided to increase its repertoire of reaching skills. Experiments also show that the robot becomes confident in reaching objects for which demonstrations were never provided, by incrementally learning from the neighboring demonstrations.

**Want to know more? Read:**

- Maeda, G.; Ewerton, M.; Osa, T.; Busch, B.; Peters, J. (2017). Active Incremental Learning of Robot Movement Primitives,
*Proceedings of the Conference on Robot Learning (CoRL)*.

Robot imitation based on observations of the human movement is a challenging problem as the structure of the human demonstrator and the robot learner are usually different. A movement that can be demonstrated well by a human may not be kinematically feasible for robot reproduction. A common approach to solve this kinematic mapping is to retarget pre-defined corresponding parts of the human and the robot kinematic structure. When such a correspondence is not available, manual scaling of the movement amplitude and the positioning of the demonstration in relation to the reference frame of the robot may be required. This paper’s contribution is a method that eliminates the need for human-robot structural associations (and is therefore less sensitive to the type of robot kinematics) and searches for the optimal location and adaptation of the human demonstration, such that the robot can accurately execute the optimized solution. The method defines a cost that quantifies the quality of the kinematic mapping and decreases it in conjunction with task-specific costs such as via-points and obstacles. We demonstrate the method experimentally where a real golf swing recorded via marker tracking is generalized to different speeds on the embodiment of a 7-degree-of-freedom (DoF) arm. In simulation, we compare solutions of robots with different kinematic structures.

**Want to know more? Read:**

- Maeda, G.; Ewerton, M.; Koert, D.; Peters, J. (2016). Acquiring and Generalizing the Embodiment Mapping from Human Observations to Robot Skills,
*IEEE Robotics and Automation Letters (RA-L)*, **1**, **2**, pp. 784-791.

Learning motions from human demonstrations provides an intuitive way for non-expert users to teach tasks to robots. In particular, intelligent robotic co-workers should not only mimic human demonstrations but should also be able to adapt them to varying application scenarios. As such, robots must have the ability to generalize motions to different workspaces, e.g. to avoid obstacles not present during original demonstrations. Towards this goal, our work proposes a unified method to (1) generalize robot motions to different workspaces, using a novel formulation of trajectory optimization that explicitly incorporates human demonstrations, and (2) locally adapt and reuse the optimized solution in the form of a distribution of trajectories. This optimized distribution can be used, online, to quickly satisfy the via-points and goals of a specific task. We validate the method using a 7-degree-of-freedom (DoF) lightweight arm that grasps and places a ball into different boxes while avoiding obstacles that were not present during the original human demonstrations.

**Want to know more? Read:**

- Koert, D.; Maeda, G.J.; Lioutikov, R.; Neumann, G.; Peters, J. (2016). Demonstration Based Trajectory Optimization for Generalizable Robot Motions,
*Proceedings of the International Conference on Humanoid Robots (HUMANOIDS)*.

This paper proposes a method to achieve fast and fluid human-robot interaction by estimating the progress of the movement of the human. The method allows the progress, also referred to as the phase of the movement, to be estimated even when observations of the human are partial and occluded, a problem typically found when using motion capture systems in cluttered environments. By leveraging the framework of Interaction Probabilistic Movement Primitives, phase estimation makes it possible to classify the human action, and to generate a corresponding robot trajectory before the human finishes their movement. The method is therefore suited for semi-autonomous robots acting as assistants and coworkers. Since observations may be sparse, our method is based on computing the probability of different phase candidates to find the phase that best aligns the Interaction Probabilistic Movement Primitives with the current observations. The method is fundamentally different from approaches based on Dynamic Time Warping that must rely on a consistent stream of measurements at runtime. The resulting framework can achieve phase estimation, action recognition and robot trajectory coordination using a single probabilistic representation. We evaluated the method using a seven-degree-of-freedom lightweight robot arm equipped with a five-finger hand in single and multi-task collaborative experiments. We compare the accuracy achieved by phase estimation with our previous method based on dynamic time warping.
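
The step of scoring phase candidates against sparse observations can be sketched as follows. This is a simplified illustration, not the method from the paper: a deterministic sine template stands in for the Interaction Probabilistic Movement Primitives, the phase is assumed to progress linearly at an unknown rate, and the noise level, candidate rates, and function names are hypothetical.

```python
import math


def template(phase):
    """Toy reference movement as a function of the phase in [0, 1]."""
    return math.sin(math.pi * min(max(phase, 0.0), 1.0))


def phase_rate_posterior(observations, candidates, sigma=0.05):
    """Probability of each candidate phase rate given sparse (time, value) pairs.

    Assumes a uniform prior over candidates and Gaussian observation noise.
    """
    log_liks = []
    for alpha in candidates:
        ll = 0.0
        for t, y in observations:
            err = y - template(alpha * t)
            ll += -0.5 * (err / sigma) ** 2
        log_liks.append(ll)
    m = max(log_liks)  # subtracted for numerical stability
    unnorm = [math.exp(ll - m) for ll in log_liks]
    z = sum(unnorm)
    return [u / z for u in unnorm]


# Three sparse, irregularly spaced observations of a movement at rate 1.5.
true_alpha = 1.5
observations = [(t, template(true_alpha * t)) for t in (0.05, 0.2, 0.3)]

candidates = [0.5, 1.0, 1.5, 2.0]
posterior = phase_rate_posterior(observations, candidates)
best = candidates[posterior.index(max(posterior))]
print(best)
```

Since each observation is scored independently at its own time stamp, the estimate does not require a regularly sampled stream of measurements.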

**Want to know more? Read:**

- Maeda, G.; Ewerton, M.; Neumann, G.; Lioutikov, R.; Peters, J. (2017). Phase Estimation for Fast Action Recognition and Trajectory Generation in Human-Robot Collaboration,
*International Journal of Robotics Research (IJRR)*, **36**, **13-14**, pp. 1579-1594.

Grasping is an essential component for robotic manipulation and has been investigated for decades. Prior work on grasping often assumes that a sufficient amount of training data is available for learning and planning robotic grasps. However, constructing such an exhaustive training dataset is very challenging in practice, and it is desirable that a robotic system can autonomously learn and improve its grasping strategy. Although recent work has presented autonomous data collection through trial and error, such methods are often limited to a single grasp type, e.g., vertical pinch grasp. To address these issues, we present a hierarchical policy search approach for learning multiple grasping strategies. To leverage human knowledge, multiple grasping strategies are initialized with human demonstrations. In addition, a database of grasping motions and point clouds of objects is also autonomously built upon a set of grasps given by a user. The problem of selecting the grasp location and grasp policy is formulated as a bandit problem in our framework. We applied our reinforcement learning approach to grasping both rigid and deformable objects. The experimental results show that our framework autonomously learns and improves its performance through trial and error and can grasp previously unseen objects with high accuracy. This work is supported by H2020 RoMaNS (Robotic Manipulation for Nuclear Sort and Segregation) http://www.h2020romans.eu/
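
To make the bandit formulation of grasp selection concrete, here is a minimal sketch that picks among a few grasp strategies with the textbook UCB1 rule. UCB1 is swapped in purely for illustration and is not claimed to be the selection rule used in the paper. The three strategies and their success probabilities are hypothetical.

```python
import math
import random


def ucb1_select(counts, sums, t):
    """Pick an arm: try each arm once, then maximize mean plus exploration bonus."""
    for arm, n in enumerate(counts):
        if n == 0:
            return arm
    scores = [
        sums[a] / counts[a] + math.sqrt(2.0 * math.log(t) / counts[a])
        for a in range(len(counts))
    ]
    return scores.index(max(scores))


rng = random.Random(0)
# Hypothetical success probabilities of three grasp strategies.
success_prob = [0.3, 0.8, 0.5]
counts = [0] * len(success_prob)
sums = [0.0] * len(success_prob)

for t in range(1, 501):
    arm = ucb1_select(counts, sums, t)
    # Binary outcome of the simulated grasp attempt.
    reward = 1.0 if rng.random() < success_prob[arm] else 0.0
    counts[arm] += 1
    sums[arm] += reward

print(counts)
```

The exploration bonus shrinks as a strategy is tried more often, so attempts gradually concentrate on strategies that have succeeded while the others are still revisited occasionally.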

**Want to know more? Read:**

- Osa, T.; Peters, J.; Neumann, G. (2018). Hierarchical Reinforcement Learning of Multiple Grasping Strategies with Human Instructions,
*Advanced Robotics*, **32**, **18**, pp. 955-968.

If you want to watch even more videos on robot learning, check out our older videos in the archive.