Our Robot Videos
Some of the most exciting recent robot learning research results developed within our projects are shown here. The videos are part of our YouTube channel. For the older videos, check out our video archive.
Learning Implicit Priors for Motion Optimization
In this paper, we focus on the problem of integrating Energy-based Models (EBM) as guiding priors for motion optimization. EBMs are a set of neural networks that can represent expressive probability density distributions in terms of a Gibbs distribution parameterized by a suitable energy function. Due to their implicit nature, they can easily be integrated as optimization factors or as initial sampling distributions in the motion optimization problem, making them good candidates to integrate data-driven priors in the motion optimization problem. In this work, we present a set of required modeling and algorithmic choices to adapt EBMs into motion optimization. We investigate the benefit of including additional regularizers in the learning of the EBMs to use them with gradient-based optimizers and we present a set of EBM architectures to learn generalizable distributions for manipulation tasks. We present multiple cases in which the EBM could be integrated for motion optimization and evaluate the performance of learned EBMs as guiding priors for both simulated and real robot experiments.
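To illustrate how an energy function can act as an optimization factor, here is a minimal sketch. Under a Gibbs distribution p(x) ∝ exp(-E(x)), the negative log-density equals E(x) up to a constant, so the energy can simply be added to the trajectory cost. The quadratic energy, the 1-D waypoints, and all names and weights below are assumptions made for illustration; they stand in for a learned network and are not taken from the paper.

```python
def energy_grad(x, goal=1.0):
    """Gradient of a hypothetical quadratic energy E(x) = 0.5 * (x - goal)**2.

    Stand-in for the gradient of a learned energy network.
    """
    return x - goal


def optimize_trajectory(n_points=10, steps=5000, lr=0.2, w_smooth=1.0, w_prior=1.0):
    """Gradient descent on a smoothness cost plus an energy prior on the last waypoint."""
    traj = [0.0] * n_points  # the first waypoint stays fixed at the start state 0.0
    for _ in range(steps):
        grad = [0.0] * n_points
        for t in range(1, n_points):
            diff = traj[t] - traj[t - 1]
            grad[t] += w_smooth * diff      # d/dx_t of 0.5 * diff**2
            grad[t - 1] -= w_smooth * diff  # d/dx_{t-1} of 0.5 * diff**2
        # The prior enters as a cost factor: -log p(x) = E(x) + const.
        grad[-1] += w_prior * energy_grad(traj[-1])
        for t in range(1, n_points):  # skip index 0 to keep the start fixed
            traj[t] -= lr * grad[t]
    return traj


trajectory = optimize_trajectory()
```

In this toy setting the optimizer trades off smoothness against the prior, so the final waypoint is pulled toward the low-energy region without reaching it exactly.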
Want to know more? Read:
 Urain, J.; Le, A. T.; Lambert, A.; Chalvatzaki, G.; Boots, B.; Peters, J. (2022). Learning Implicit Priors for Motion Optimization, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
 Le, A. T.; Urain, J.; Lambert, A.; Chalvatzaki, G.; Boots, B.; Peters, J. (2022). Learning Implicit Priors for Motion Optimization, RSS 2022 Workshop on Implicit Representations for Robotic Manipulation.
Learn2Assemble with Structured Representations and Search for Robotic Architectural Construction
Autonomous robotic assembly requires a well-orchestrated sequence of high-level actions and smooth manipulation executions. The problem of learning to assemble complex 3D structures remains challenging, as it requires drawing connections between target shapes and available building blocks, as well as creating valid assembly sequences with respect to stability and kinematic feasibility in the robot's workspace. We design a hierarchical control framework that learns to sequence the building blocks to construct arbitrary 3D designs and ensures that they are feasible, as we plan the geometric execution with the robot-in-the-loop. Our approach draws its generalization properties from combining graph-based representations with reinforcement learning (RL) and ultimately adding tree-search. Combining structured representations with model-free RL and Monte-Carlo planning allows agents to operate with various target shapes and building block types. We demonstrate the flexibility of the proposed structured representation and our algorithmic solution in a series of simulated 3D assembly tasks with robotic evaluation, which showcases our method's ability to learn to construct stable structures with a large number of building blocks.
Want to know more? Read:
 Funk, N.; Chalvatzaki, G.; Belousov, B.; Peters, J. (2021). Learn2Assemble with Structured Representations and Search for Robotic Architectural Construction, Conference on Robot Learning (CoRL).
Benchmarking Structured Policies and Policy Optimization for Real-World Dexterous Object Manipulation
Dexterous manipulation is a challenging and important problem in robotics. While data-driven methods are a promising approach, current benchmarks require simulation or extensive engineering support due to the sample inefficiency of popular methods. We present benchmarks for the TriFinger system, an open-source robotic platform for dexterous manipulation and the focus of the 2020 Real Robot Challenge. The benchmarked methods, which were successful in the challenge, can be generally described as structured policies, as they combine elements of classical robotics and modern policy optimization. This inclusion of inductive biases facilitates sample efficiency, interpretability, reliability and high performance. The key aspects of this benchmarking are validation of the baselines across both simulation and the real system, a thorough ablation study over the core features of each solution, and a retrospective analysis of the challenge as a manipulation benchmark.
Want to know more? Read:
 Funk, N.; Schaff, C.; Madan, R.; Yoneda, T.; Urain, J.; Watson, J.; Gordon, E.; Widmaier, F.; Bauer, S.; Srinivasa, S.; Bhattacharjee, T.; Walter, M.; Peters, J. (2022). Benchmarking Structured Policies and Policy Optimization for Real-World Dexterous Object Manipulation, IEEE Robotics and Automation Letters (RAL).
NPDR - Combining Likelihood-Free Inference, Normalizing Flows, and Domain Randomization
Combining domain randomization and reinforcement learning is a widely used approach to obtain control policies that can bridge the gap between simulation and reality. However, existing methods make limiting assumptions on the form of the domain parameter distribution, which prevents them from utilizing the full power of domain randomization. Typically, a restricted family of probability distributions (e.g., normal or uniform) is chosen a priori for every parameter. Furthermore, straightforward approaches based on deep learning require differentiable simulators, which are either not available or can only simulate a limited class of systems. Such rigid assumptions diminish the applicability of domain randomization in robotics. Building upon recently proposed neural likelihood-free inference methods, we introduce Neural Posterior Domain Randomization (NPDR), an algorithm that alternates between learning a policy from a randomized simulator and adapting the posterior distribution over the simulator's parameters in a Bayesian fashion. Our approach only requires a parameterized simulator, coarse prior ranges, a policy (optionally with optimization routine), and a small set of real-world observations. Most importantly, the domain parameter distribution is not restricted to a specific family, parameters can be correlated, and the simulator does not have to be differentiable. We show that the presented method is able to efficiently adapt the posterior over the domain parameters to more closely match the observed dynamics. Moreover, we demonstrate that NPDR can learn transferable policies using fewer real-world rollouts than comparable algorithms.
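The idea of inferring simulator parameters without a likelihood or gradients can be sketched with plain rejection sampling. Note that this swaps the neural posterior estimator and normalizing flows described above for the simplest likelihood-free method; it is not the NPDR algorithm itself. The point-mass simulator, the `mass` parameter, the tolerance, and the prior range are all hypothetical.

```python
import random


def simulate(mass, n_steps=20, dt=0.05, force=1.0):
    """Hypothetical black-box simulator: final velocity of a point mass under constant force."""
    velocity = 0.0
    for _ in range(n_steps):
        velocity += force / mass * dt
    return velocity


def rejection_posterior(observation, prior_range, n_samples=20000, tol=0.02, seed=0):
    """Keep prior samples whose simulated outcome lies close to the observation."""
    rng = random.Random(seed)
    low, high = prior_range
    accepted = []
    for _ in range(n_samples):
        mass = rng.uniform(low, high)  # coarse uniform prior over the domain parameter
        if abs(simulate(mass) - observation) < tol:
            accepted.append(mass)
    return accepted


true_mass = 2.0
real_observation = simulate(true_mass)  # stand-in for a real-world measurement
posterior_samples = rejection_posterior(real_observation, prior_range=(0.5, 5.0))
posterior_mean = sum(posterior_samples) / len(posterior_samples)
```

The accepted samples approximate a posterior over the domain parameter; in a domain randomization loop, one would draw simulator instances from these samples before the next round of policy training.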
Want to know more? Read:
 Muratore, F.; Gruner, T.; Wiese, F.; Belousov, B.; Gienger, M.; Peters, J. (2021). Neural Posterior Domain Randomization, Conference on Robot Learning (CoRL).
Underactuated Waypoint Trajectory Optimization for Light Painting Photography
waypoint activations. To validate the proposed method, a letter drawing task is set up where shapes traced by the tip of a rotary inverted pendulum are visualized using long exposure photography.
Want to know more? Read:
 Eilers, C.; Eschmann, J.; Menzenbach, R.; Belousov, B.; Muratore, F.; Peters, J. (2020). Underactuated Waypoint Trajectory Optimization for Light Painting Photography, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Data-efficient Domain Randomization with Bayesian Optimization
Want to know more? Read:
 Muratore, F.; Eilers, C.; Gienger, M.; Peters, J. (2021). Data-efficient Domain Randomization with Bayesian Optimization, IEEE Robotics and Automation Letters (RAL), with Presentation at the IEEE International Conference on Robotics and Automation (ICRA), IEEE.
BayRn - Sim-to-Real Evaluation
When learning from simulations, the optimizer is free to exploit the simulation. Thus, the resulting policies can perform very well in simulation but transfer poorly to the real-world counterpart. For example, both of the subsequent policies yield a return of 1, and thus look equally good to the learner. Bayesian Domain Randomization (BayRn) uses a Gaussian process to learn how to adapt the randomized simulator solely from the observed real-world returns. BayRn is agnostic toward the policy optimization subroutine. In this work, we used PPO and PoWER. We also evaluated BayRn on an underactuated swing-up and balance task.
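The following sketch shows, under strong simplifying assumptions, how a Gaussian process fitted to observed returns can guide the choice of the next simulator setting. The 1-D domain parameter `phi`, the function `real_world_return` (a stand-in for "train a policy in the randomized simulator, then evaluate it on the real system"), the RBF kernel, and the upper-confidence-bound acquisition are all hypothetical choices made for illustration, not details taken from the paper.

```python
import math


def rbf(a, b, length=0.2):
    """Squared-exponential kernel between two scalar inputs."""
    return math.exp(-0.5 * ((a - b) / length) ** 2)


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (aug[i][n] - sum(aug[i][j] * x[j] for j in range(i + 1, n))) / aug[i][i]
    return x


def gp_posterior(xs, ys, query, noise=1e-4):
    """Zero-mean GP posterior mean and variance at a single query point."""
    gram = [[rbf(a, b) + (noise if i == j else 0.0) for j, b in enumerate(xs)]
            for i, a in enumerate(xs)]
    k_star = [rbf(query, a) for a in xs]
    mean = sum(k * w for k, w in zip(k_star, solve(gram, ys)))
    var = rbf(query, query) - sum(k * w for k, w in zip(k_star, solve(gram, k_star)))
    return mean, max(var, 0.0)


def real_world_return(phi):
    """Hypothetical stand-in for the return measured on the real system."""
    return 1.0 - (phi - 0.7) ** 2


def upper_confidence_bound(xs, ys, query, beta=2.0):
    mean, var = gp_posterior(xs, ys, query)
    return mean + beta * math.sqrt(var)


candidates = [i / 50 for i in range(51)]  # grid over the simulator setting phi
xs = [0.0, 1.0]                           # initial settings
ys = [real_world_return(x) for x in xs]
for _ in range(8):
    nxt = max(candidates, key=lambda q: upper_confidence_bound(xs, ys, q))
    xs.append(nxt)
    ys.append(real_world_return(nxt))
best_phi = xs[ys.index(max(ys))]
```

Only the returns observed at the chosen settings enter the model, which mirrors the idea of adapting the simulator solely from real-world returns.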
Want to know more? Read:
 Muratore, F.; Eilers, C.; Gienger, M.; Peters, J. (2021). Data-efficient Domain Randomization with Bayesian Optimization, IEEE Robotics and Automation Letters (RAL), with Presentation at the IEEE International Conference on Robotics and Automation (ICRA), IEEE.
Robot Juggling Learning Procedure
Learning of the Juggling Task - For the learning on the physical Barrett WAM, 20 episodes were performed. During each episode, 25 randomly sampled parameters were executed and the episodic reward was evaluated. If the robot successfully juggles for 10s, the rollout is stopped. Rollouts that were corrupted due to obvious environment errors were repeated using the same parameters. Minor variations caused by the environment initialization were not repeated. After collecting the samples, the policy was updated using eREPS with a KL constraint of 2. The video shows **all** trials executed on the physical system to learn the optimal policy.
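A rough sketch of one episodic REPS-style update with a KL bound of 2 on 25 samples is given below. It follows the standard dual formulation (weights proportional to exp(R/eta), with eta minimizing the dual function); the evenly spaced 1-D parameters, the quadratic reward, and the search range for eta are assumptions for illustration and do not reproduce the juggling experiment.

```python
import math


def dual(eta, rewards, epsilon):
    """REPS dual g(eta) = eta*epsilon + eta*log(mean(exp(R/eta))), stabilized with max(R)."""
    r_max = max(rewards)
    mean_exp = sum(math.exp((r - r_max) / eta) for r in rewards) / len(rewards)
    return eta * epsilon + r_max + eta * math.log(mean_exp)


def solve_eta(rewards, epsilon, lo=1e-3, hi=1e3, iters=200):
    """Minimize the dual by ternary search over log(eta)."""
    a, b = math.log(lo), math.log(hi)
    for _ in range(iters):
        m1 = a + (b - a) / 3
        m2 = b - (b - a) / 3
        if dual(math.exp(m1), rewards, epsilon) < dual(math.exp(m2), rewards, epsilon):
            b = m2
        else:
            a = m1
    return math.exp(0.5 * (a + b))


def reps_weights(rewards, eta):
    """Normalized sample weights proportional to exp(R / eta)."""
    r_max = max(rewards)
    unnorm = [math.exp((r - r_max) / eta) for r in rewards]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical data: 25 policy parameters and their episodic rewards.
thetas = [-1.2 + 0.1 * i for i in range(25)]
rewards = [-(theta - 1.0) ** 2 for theta in thetas]

eta = solve_eta(rewards, epsilon=2.0)
weights = reps_weights(rewards, eta)
# KL divergence between the reweighted and the uniform sample distribution.
kl = sum(w * math.log(w * len(weights)) for w in weights if w > 0.0)
# Weighted maximum-likelihood update of the policy mean.
new_mean = sum(w * theta for w, theta in zip(weights, thetas))
```

At the minimizer of the dual, the KL divergence of the reweighted samples matches the bound, which is why a larger bound yields a greedier update.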
Want to know more? Read:
 Ploeger, K.; Lutter, M.; Peters, J. (2020). High Acceleration Reinforcement Learning for Real-World Juggling with Binary Rewards, Conference on Robot Learning (CoRL).
Robot Air Hockey
Want to know more? Read:
 Liu, P.; Tateo, D.; Bou-Ammar, H.; Peters, J. (2021). Efficient and Reactive Planning for High Speed Robot Air Hockey, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
SPRL Ball in a Cup
Want to know more? Read:
 Klink, P.; Abdulsamad, H.; Belousov, B.; D'Eramo, C.; Peters, J.; Pajarinen, J. (2021). A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning, Journal of Machine Learning Research (JMLR).
 Klink, P.; D'Eramo, C.; Peters, J.; Pajarinen, J. (2020). Self-Paced Deep Reinforcement Learning, Advances in Neural Information Processing Systems (NIPS / NeurIPS).
 Klink, P.; Abdulsamad, H.; Belousov, B.; Peters, J. (2019). Self-Paced Contextual Reinforcement Learning, Proceedings of the 3rd Conference on Robot Learning (CoRL).
A Nonparametric Off-Policy Policy Gradient
Want to know more? Read:
 Tosatto, S.; Carvalho, J.; Abdulsamad, H.; Peters, J. (2020). A Nonparametric Off-Policy Policy Gradient, Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS).
SPOTA - Simulation-based Policy Optimization with Transferability Assessment
Learning robot control policies from physics simulations is of great interest to the robotics community as it may render the learning process faster, cheaper, and safer by alleviating the need for expensive real-world experiments. However, the direct transfer of learned behavior from simulation to reality is a major challenge. Optimizing a policy on a slightly faulty simulator can easily lead to the maximization of the "Simulation Optimization Bias" (SOB). In this case, the optimizer exploits modeling errors of the simulator such that the resulting behavior can potentially damage the robot. We tackle this challenge by applying domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA) which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the overfitting to the set of domains experienced while training. Our experimental results on two different second-order nonlinear systems show that the new simulation-based policy search algorithm is able to learn a control policy exclusively from a randomized simulator, which can be applied directly to real systems without any additional training. The video shows the Sim-to-Real transfer of a policy learned by Simulation-based Policy Optimization with Transferability Assessment (SPOTA) on the Ball-Balancer and Cart-Pole platforms from Quanser.
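The optimistic bias that arises from optimizing over a finite set of randomized domains can be seen in a toy problem with a closed-form optimum. This is only an illustration of the overfitting effect described above, not the SOB estimator from the paper. The return function -(theta - xi)^2, the Gaussian domain distribution, and all names are hypothetical.

```python
import random


def optimistic_gap(n_domains, n_trials=5000, seed=0):
    """Average gap between the best return on sampled domains and the true optimal return.

    Toy setup: return(theta, xi) = -(theta - xi)**2 with domain parameter xi ~ N(1, 0.5**2).
    The true expected return is maximized at theta = 1 with value -0.25.
    """
    rng = random.Random(seed)
    true_optimum = -0.25
    total = 0.0
    for _ in range(n_trials):
        domains = [rng.gauss(1.0, 0.5) for _ in range(n_domains)]
        theta = sum(domains) / n_domains  # maximizer of the sample-average return
        best_sample_return = -sum((theta - xi) ** 2 for xi in domains) / n_domains
        total += best_sample_return - true_optimum
    return total / n_trials


gap_few = optimistic_gap(n_domains=2)
gap_many = optimistic_gap(n_domains=20)
```

In this toy case the gap is positive and shrinks as more domains are sampled, which is the kind of quantity a stopping criterion for training could monitor.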
SPOTA - Cross-Evaluation Between Vortex and Bullet Physics Engines
We apply domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA) which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the overfitting to the set of domains experienced while training. Supplementary video to "Domain Randomization for Simulation-Based Policy Optimization with Transferability Assessment" (CoRL 2018), comparing SPOTA against LQR, (vanilla) TRPO, and EPOpt, with synchronized random seeds and 4 different initial positions. In this setup, we both train and test in Vortex.
SPOTA - Evaluation in Different Environments
We apply domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA) which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the overfitting to the set of domains experienced while training. Supplementary video to "Domain Randomization for Simulation-Based Policy Optimization with Transferability Assessment" (CoRL 2018), comparing SPOTA against LQR, (vanilla) TRPO, and EPOpt, with synchronized random seeds and 4 different initial positions. In this setup, we run in an environment with nominal parameters.
Want to know more? Read:
 Muratore, F.; Gienger, M.; Peters, J. (2021). Assessing Transferability from Simulation to Reality for Reinforcement Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 43, 4, pp.1172-1183, IEEE.
Incremental Imitation Learning of Context-Dependent Motor Skills
Current imitation learning techniques struggle with a number of challenges that prevent their wide usability. For instance, robots might not be able to accurately reproduce every human demonstration and it is not always clear how robots should generalize a movement to new contexts. This paper addresses those challenges by presenting a method to incrementally teach context-dependent motor skills to robots. The human demonstrates trajectories for different contexts by moving the links of the robot and partially or fully refines those trajectories by disturbing the movements of the robot while it executes the behavior it has learned so far. A joint probability distribution over trajectories and contexts can then be built based on those demonstrations and refinements. Given a new context, the robot computes the most probable trajectory, which can also be refined by the human. The joint probability distribution is incrementally updated with the refined trajectories. We have evaluated our method with experiments in which an elastically actuated robot arm with four degrees of freedom learns how to reach a ball at different positions.
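The step from a joint distribution over trajectories and contexts to the most probable trajectory for a new context can be sketched with a joint Gaussian and its conditional mean. The reduction of a trajectory to a single scalar weight, the scalar context, and the demonstration values below are assumptions made for illustration only.

```python
def fit_joint_gaussian(demos):
    """Fit the mean and (co)variance of a joint Gaussian over (context, weight) pairs."""
    n = len(demos)
    mu_c = sum(c for c, _ in demos) / n
    mu_w = sum(w for _, w in demos) / n
    var_c = sum((c - mu_c) ** 2 for c, _ in demos) / n
    cov_wc = sum((c - mu_c) * (w - mu_w) for c, w in demos) / n
    return mu_c, mu_w, var_c, cov_wc


def most_probable_weight(model, context):
    """Conditional mean of the trajectory weight given the context."""
    mu_c, mu_w, var_c, cov_wc = model
    return mu_w + cov_wc / var_c * (context - mu_c)


# Hypothetical demonstrations: (context, trajectory weight).
demos = [(0.0, 0.1), (1.0, 2.1), (2.0, 4.1)]
model = fit_joint_gaussian(demos)
initial_prediction = most_probable_weight(model, 1.5)

# Incremental update: the human refines the executed movement for context 1.5.
demos.append((1.5, 3.4))
model = fit_joint_gaussian(demos)
refined_prediction = most_probable_weight(model, 1.5)
```

After the refinement is added, the prediction for the same context moves toward the corrected value, which is the incremental behavior the paragraph describes.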
Want to know more? Read:
 Ewerton, M.; Maeda, G.J.; Kollegger, G.; Wiemeyer, J.; Peters, J. (2016). Incremental Imitation Learning of Context-Dependent Motor Skills, Proceedings of the International Conference on Humanoid Robots (HUMANOIDS), pp.351-358.
Active Incremental Learning of Robot Movement Primitives
Want to know more? Read:
 Maeda, G.; Ewerton, M.; Osa, T.; Busch, B.; Peters, J. (2017). Active Incremental Learning of Robot Movement Primitives, Proceedings of the Conference on Robot Learning (CoRL).
Robot learning from observation
the human demonstrator and the robot learner are usually different. A movement that can be demonstrated well by a human may not be kinematically feasible for robot reproduction. A common approach to solve this kinematic mapping is to retarget predefined corresponding parts of the human and the robot kinematic structure. When such a correspondence is not available, manual scaling of the movement amplitude and the positioning of the demonstration in relation to the reference frame of the robot may be required. This paper's contribution is a method that eliminates the need for human-robot structural associations, and is therefore less sensitive to the type of robot kinematics, and that searches for the optimal location and adaptation of the human demonstration, such that the robot can accurately execute the optimized solution. The method defines a cost that quantifies the quality of the kinematic mapping and decreases it in conjunction with task-specific costs such as via-points and obstacles. We demonstrate the method experimentally where a real golf swing recorded via marker tracking is generalized to different speeds on the embodiment of a 7 degree-of-freedom (DoF) arm. In simulation, we compare solutions of robots with different kinematic structures.
Want to know more? Read:
 Maeda, G.; Ewerton, M.; Koert, D.; Peters, J. (2016). Acquiring and Generalizing the Embodiment Mapping from Human Observations to Robot Skills, IEEE Robotics and Automation Letters (RAL), 1, 2, pp.784-791.
Combining Human Demonstrations and Motion Planning for Movement Primitive Optimization
Want to know more? Read:
 Koert, D.; Maeda, G.J.; Lioutikov, R.; Neumann, G.; Peters, J. (2016). Demonstration Based Trajectory Optimization for Generalizable Robot Motions, Proceedings of the International Conference on Humanoid Robots (HUMANOIDS).
Phase Estimation for Fast Action Recognition and Trajectory Generation in Human-Robot Collaboration
Want to know more? Read:
 Maeda, G.; Ewerton, M.; Neumann, G.; Lioutikov, R.; Peters, J. (2017). Phase Estimation for Fast Action Recognition and Trajectory Generation in Human-Robot Collaboration, International Journal of Robotics Research (IJRR), 36, 13-14, pp.1579-1594.
Hierarchical Reinforcement Learning of Multiple Grasping Policies
Prior work on grasping often assumes that a sufficient amount of training data is available for learning and planning robotic grasps. However, constructing such an exhaustive training dataset is very challenging in practice, and it is desirable that a robotic system can autonomously learn and improve its grasping strategy. Although recent work has presented autonomous data collection through trial and error, such methods are often limited to a single grasp type, e.g., vertical pinch grasp. To address these issues, we present a hierarchical policy search approach for learning multiple grasping strategies. To leverage human knowledge, multiple grasping strategies are initialized with human demonstrations. In addition, a database of grasping motions and point clouds of objects is also autonomously built upon a set of grasps given by a user. The problem of selecting the grasp location and grasp policy is formulated as a bandit problem in our framework. We applied our reinforcement learning to grasping both rigid and deformable objects. The experimental results show that our framework autonomously learns and improves its performance through trial and error and can grasp previously unseen objects with high accuracy. This work is supported by H2020 RoMaNS (Robotic Manipulation for Nuclear Sort and Segregation) http://www.h2020romans.eu/
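To make the bandit formulation concrete, the sketch below selects among a few grasp types with the generic UCB1 rule. UCB1 is swapped in here as a textbook bandit algorithm and is not claimed to be the selection method of the paper. The three grasp types and their success probabilities are hypothetical.

```python
import math
import random


def ucb1(success_probs, n_trials=2000, seed=0):
    """Run UCB1 on Bernoulli arms and return how often each arm was chosen."""
    rng = random.Random(seed)
    n_arms = len(success_probs)
    counts = [0] * n_arms
    successes = [0] * n_arms
    for t in range(1, n_trials + 1):
        if t <= n_arms:
            arm = t - 1  # try every grasp type once first
        else:
            arm = max(
                range(n_arms),
                key=lambda a: successes[a] / counts[a]
                + math.sqrt(2.0 * math.log(t) / counts[a]),
            )
        # Simulated grasp attempt: 1 on success, 0 on failure.
        reward = 1 if rng.random() < success_probs[arm] else 0
        counts[arm] += 1
        successes[arm] += reward
    return counts


# Hypothetical success rates of three grasp types on one object.
grasp_counts = ucb1([0.3, 0.5, 0.8])
```

The exploration bonus shrinks as a grasp type is tried more often, so attempts concentrate on the grasp that succeeds most while the others are still sampled occasionally.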
Want to know more? Read:
 Osa, T.; Peters, J.; Neumann, G. (2018). Hierarchical Reinforcement Learning of Multiple Grasping Strategies with Human Instructions, Advanced Robotics, 32, 18, pp.955-968.
If you want to watch even more videos on robot learning, please check out our older videos in the archive.