Our Robot Videos
Some of the most exciting recent robot learning research results developed within our projects are shown here. The videos are part of our YouTube channel. For the older videos, check out our video archive.
LocoMuJoCo
LocoMuJoCo is an imitation learning benchmark specifically designed for whole-body control. It features a diverse set of environments, including quadrupeds, humanoids, and (musculo-)skeletal human models, each provided with comprehensive datasets (over 22,000 samples per humanoid). Although primarily focused on imitation learning, LocoMuJoCo also supports custom reward function classes, making it suitable for pure reinforcement learning as well. LocoMuJoCo supports both the classic MuJoCo simulator for single environments and the high-performance MJX/MJWarp backend for massively parallel training. It includes twelve humanoid and four quadruped environments, featuring four different biomechanical human models. The benchmark comes with clean, single-file JAX implementations of common algorithms such as PPO, GAIL, AMP, and DeepMimic, enabling rapid experimentation. By combining the training loop and environment into a single JIT-compiled function, it achieves exceptionally fast training speeds. Furthermore, it provides over 22,000 retargeted motion capture trajectories from established datasets like AMASS and LAFAN1, as well as native LocoMuJoCo motions. Thanks to its robot-to-robot retargeting capabilities, any existing motion dataset can be easily adapted to different robot morphologies. The framework also offers powerful trajectory comparison metrics, including Dynamic Time Warping and discrete Fréchet distance, all implemented efficiently in JAX. Additional features include a standard Gymnasium interface, built-in domain and terrain randomization, and a highly modular design that allows users to flexibly define, swap, and reuse components such as observation types, reward functions, termination handlers, and randomization modules.
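Among the trajectory comparison metrics mentioned above is the discrete Fréchet distance. As an illustration of what that metric computes (LocoMuJoCo's own implementation is in JAX and not shown here), a minimal stdlib sketch of the classic dynamic-programming formulation:

```python
import math

def discrete_frechet(P, Q):
    """Discrete Fréchet distance between two polylines P and Q
    (lists of equal-dimensional points), via memoized recursion."""
    n, m = len(P), len(Q)
    ca = [[-1.0] * m for _ in range(n)]

    def c(i, j):
        if ca[i][j] >= 0.0:
            return ca[i][j]
        d = math.dist(P[i], Q[j])  # Euclidean point distance
        if i == 0 and j == 0:
            ca[i][j] = d
        elif i == 0:
            ca[i][j] = max(c(0, j - 1), d)
        elif j == 0:
            ca[i][j] = max(c(i - 1, 0), d)
        else:
            # Advance on P, on Q, or on both; keep the cheapest coupling.
            ca[i][j] = max(min(c(i - 1, j), c(i - 1, j - 1), c(i, j - 1)), d)
        return ca[i][j]

    return c(n - 1, m - 1)
```

For identical curves the distance is zero; shifting one curve by a constant offset yields exactly that offset.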
Want to know more? Check out: github.com/robfiras/loco-mujoco
Floating-Base Deep Lagrangian Networks
Grey-box methods for system identification combine deep learning with physics-informed constraints, capturing complex dependencies while improving out-of-distribution generalization. Despite the growing importance of floating-base systems such as humanoids and quadrupeds, current grey-box models ignore their specific physical constraints. For instance, the inertia matrix is not only positive definite but also exhibits branch-induced sparsity and input independence. Moreover, the 6x6 composite spatial inertia of the floating base inherits properties of single-rigid-body inertia matrices. As we show, this includes the triangle inequality on the eigenvalues of the composite rotational inertia. To address the lack of physical consistency in deep learning models of floating-base systems, we introduce a parameterization of inertia matrices that satisfies all these constraints. Inspired by Deep Lagrangian Networks (DeLaN), we train neural networks to predict physically plausible inertia matrices that minimize inverse dynamics error under Lagrangian mechanics. For evaluation, we collected and released a dataset on multiple quadrupeds and humanoids. In these experiments, our Floating-Base Deep Lagrangian Networks (FeLaN) achieve better overall performance on both simulated and real robots, while providing greater physical interpretability.
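The core requirement above, positive definiteness of the predicted inertia matrix, is typically met in DeLaN-style models with a Cholesky parameterization: the network outputs the entries of a lower-triangular matrix L with a positive diagonal, and M = L Lᵀ is then symmetric positive definite by construction. A toy stdlib sketch of just this basic trick (the paper's additional floating-base constraints, such as branch-induced sparsity and the eigenvalue triangle inequality, are not modeled here):

```python
import math

def softplus(x):
    """Smooth map to positive values, used on the diagonal of L."""
    return math.log1p(math.exp(x))

def inertia_from_raw(raw, n=3):
    """Map n*(n+1)//2 unconstrained numbers (e.g. network outputs) to a
    symmetric positive-definite n x n matrix via M = L L^T."""
    L = [[0.0] * n for _ in range(n)]
    k = 0
    for i in range(n):
        for j in range(i + 1):
            L[i][j] = softplus(raw[k]) + 1e-6 if i == j else raw[k]
            k += 1
    # M = L L^T is symmetric and positive definite by construction.
    return [[sum(L[i][t] * L[j][t] for t in range(n)) for j in range(n)]
            for i in range(n)]

def det3(M):
    """Determinant of a 3x3 matrix (for a quick positive-definiteness check)."""
    return (M[0][0] * (M[1][1] * M[2][2] - M[1][2] * M[2][1])
            - M[0][1] * (M[1][0] * M[2][2] - M[1][2] * M[2][0])
            + M[0][2] * (M[1][0] * M[2][1] - M[1][1] * M[2][0]))
```

Because M inherits positive definiteness from L, any raw network output yields a physically admissible (if not yet fully constrained) inertia matrix.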
Want to know more? Read:
- Schulze, L.; Negri, J.D.; Barasuol, V.; Medeiros, V.S.; Becker, M.; Peters, J.; Arenz, O. (2026). Floating-Base Deep Lagrangian Networks, IEEE International Conference on Robotics and Automation (ICRA).
Context-Aware Deep Lagrangian Networks for Model Predictive Control
Controlling a robot based on physics-consistent dynamic models, such as Deep Lagrangian Networks (DeLaN), can improve the generalizability and interpretability of the resulting behavior. However, in complex environments, the number of objects to potentially interact with is vast, and their physical properties are often uncertain. This complexity makes it infeasible to employ a single global model. Therefore, we need to resort to online system identification of context-aware models that capture only the currently relevant aspects of the environment. While physical principles such as the conservation of energy may not hold across varying contexts, ensuring physical plausibility for any individual context-aware model can still be highly desirable, particularly when using it for receding horizon control methods such as model predictive control (MPC). Hence, in this work, we extend DeLaN to make it context-aware, combine it with a recurrent network for online system identification, and integrate it with an MPC for adaptive, physics-consistent control. We also combine DeLaN with a residual dynamics model to leverage the fact that a nominal model of the robot is typically available. We evaluate our method on a 7-DOF robot arm for trajectory tracking under varying loads. Our method reduces the end-effector tracking error by 39%, compared to a 21% improvement achieved by a baseline that uses an extended Kalman filter.
Want to know more? Read:
- Schulze, L.; Peters, J.; Arenz, O. (2025). Context-Aware Deep Lagrangian Networks for Model Predictive Control, 2025 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning
The use of target networks in deep reinforcement learning is a widely popular solution to mitigate the brittleness of semi-gradient approaches and stabilize learning. However, target networks notoriously require additional memory and delay the propagation of Bellman updates compared to an ideal target-free approach. In this work, we step out of the binary choice between target-free and target-based algorithms. We introduce a new method that uses a copy of the last linear layer of the online network as a target network, while sharing the remaining parameters with the up-to-date online network. This simple modification lets us keep the low memory footprint of target-free methods while leveraging insights from the target-based literature. We find that combining our approach with the concept of iterated Q-learning, which consists of learning consecutive Bellman updates in parallel, helps improve the sample efficiency of target-free approaches. Our proposed method, iterated Shared Q-Learning (iS-QL), bridges the performance gap between target-free and target-based approaches across various problems while using a single Q-network, thus stepping towards resource-efficient reinforcement learning algorithms.
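The key mechanism described above — freezing only a copy of the last linear layer while sharing all other parameters — can be sketched in a toy linear setting (hypothetical simplification, not the authors' code; the "shared trunk" is a fixed feature map here rather than a deep network):

```python
class SharedTargetQ:
    """Toy sketch: the target is a frozen copy of the last linear layer only;
    the feature trunk is shared between online and target computations."""

    def __init__(self, n_features, n_actions):
        self.w_online = [[0.0] * n_features for _ in range(n_actions)]
        self.w_target = [row[:] for row in self.w_online]  # frozen head copy

    def features(self, s):
        # Shared trunk: a deep network in the paper, a fixed encoding here.
        return [s, s * s, 1.0]

    def q(self, w, s, a):
        phi = self.features(s)
        return sum(w[a][k] * phi[k] for k in range(len(phi)))

    def td_update(self, s, a, r, s2, gamma=0.99, lr=0.1):
        # Bootstrap from the frozen head applied to the *shared* features.
        target = r + gamma * max(self.q(self.w_target, s2, b)
                                 for b in range(len(self.w_target)))
        err = target - self.q(self.w_online, s, a)
        phi = self.features(s)
        for k in range(len(phi)):
            self.w_online[a][k] += lr * err * phi[k]

    def sync(self):
        # Refresh the frozen head; only one extra linear layer is stored.
        self.w_target = [row[:] for row in self.w_online]
```

Note that only the last-layer weights are duplicated, which is where the memory saving over a full target network comes from.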
Want to know more? Read:
- Vincent, T.; Tripathi, Y.; Faust, T.; Akgül, A.; Oren, Y.; Kandemir, M.; Peters, J.; D'Eramo, C. (2026). Bridging the Performance Gap Between Target-Free and Target-Based Reinforcement Learning, International Conference on Learning Representations (ICLR).
Iterated Q-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning
The vast majority of Reinforcement Learning methods are strongly affected by the computational effort and data requirements needed to obtain effective estimates of action-value functions, which in turn determine the quality of the overall performance and the sample-efficiency of the learning procedure. Typically, action-value functions are estimated through an iterative scheme that alternates the application of an empirical approximation of the Bellman operator and a subsequent projection step onto a considered function space. It has been observed that this scheme can be potentially generalized to carry out multiple iterations of the Bellman operator at once, benefiting the underlying learning algorithm. However, until now, it has been challenging to effectively implement this idea, especially in high-dimensional problems. In this paper, we introduce iterated Q-Network (i-QN), a novel principled approach that enables multiple consecutive Bellman updates by learning a tailored sequence of action-value functions where each serves as the target for the next. We show that i-QN is theoretically grounded and that it can be seamlessly used in value-based and actor-critic methods. We empirically demonstrate the advantages of i-QN in Atari 2600 games and MuJoCo continuous control problems.
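The chained structure described above — each Q-function regressing toward the Bellman target built from its predecessor — can be illustrated in a tiny tabular setting (a toy sketch; the paper's method also shifts the chain over time and works with neural networks, which is omitted here):

```python
def bellman_target(Q, s2, r, gamma):
    """Empirical one-step Bellman target from table Q."""
    return r + gamma * max(Q[s2])

def iqn_step(Qs, transitions, gamma=0.9, lr=0.5):
    """One parallel update of a chain of Q-tables: Q_k regresses toward
    the Bellman target built from Q_{k-1} (toy tabular sketch of i-QN).
    Qs[0] acts as the chain's anchor and is left untouched here."""
    for k in range(1, len(Qs)):
        for (s, a, r, s2) in transitions:
            tgt = bellman_target(Qs[k - 1], s2, r, gamma)
            Qs[k][s][a] += lr * (tgt - Qs[k][s][a])
```

After a few parallel steps on a two-state toy MDP, value information has propagated along the whole chain, i.e. multiple Bellman iterations were learned at once.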
Want to know more? Read:
- Vincent, T.; Palenicek, D.; Belousov, B.; Peters, J.; D'Eramo, C. (2025). Iterated Q-Network: Beyond One-Step Bellman Updates in Deep Reinforcement Learning, Transactions on Machine Learning Research (TMLR), J2C Certificate.
One Policy to Run Them All: an End-to-end Learning Approach to Multi-Embodiment Locomotion
Deep Reinforcement Learning techniques are achieving state-of-the-art results in robust legged locomotion. While there exists a wide variety of legged platforms such as quadrupeds, humanoids, and hexapods, the field is still missing a single learning framework that can control all these different embodiments easily and effectively and possibly transfer, zero- or few-shot, to unseen robot embodiments. We introduce URMA, the Unified Robot Morphology Architecture, to close this gap. Our framework brings the end-to-end Multi-Task Reinforcement Learning approach to the realm of legged robots, enabling the learned policy to control any type of robot morphology. The key idea of our method is to allow the network to learn an abstract locomotion controller that can be seamlessly shared between embodiments thanks to our morphology-agnostic encoders and decoders. This flexible architecture can be seen as a potential first step in building a foundation model for legged robot locomotion. Our experiments show that URMA can learn a locomotion policy on multiple embodiments that can be easily transferred to unseen robot platforms in simulation and the real world.
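The morphology-agnostic idea above rests on encoders whose output size does not depend on the number of joints. A minimal way to achieve that is to embed each joint's observation with shared weights and pool over joints; the following is a hypothetical simplification with mean pooling (URMA's actual encoders are more elaborate):

```python
def encode_morphology(joint_obs, w_enc):
    """Embed each joint's observation vector with *shared* weights w_enc
    (one row per latent dimension), then mean-pool across joints so the
    latent has a fixed size for any number of joints."""
    def embed(obs):
        return [sum(wi * oi for wi, oi in zip(row, obs)) for row in w_enc]

    embeddings = [embed(o) for o in joint_obs]
    d = len(w_enc)
    return [sum(e[i] for e in embeddings) / len(embeddings) for i in range(d)]
```

Because the latent size is fixed, the same downstream policy head can consume observations from a quadruped, a humanoid, or a hexapod without architectural changes.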
Want to know more? Read:
- Bohlinger, N.; Czechmanowski, G.; Krupka, M.; Kicki, P.; Walas, K.; Peters, J.; Tateo, D. (2024). One Policy to Run Them All: an End-to-end Learning Approach to Multi-Embodiment Locomotion, Conference on Robot Learning (CoRL).
- Bohlinger, N.; Czechmanowski, G.; Krupka, M.; Kicki, P.; Walas, K.; Peters, J.; Tateo, D. (2024). One Policy to Run Them All: an End-to-end Learning Approach to Multi-Embodiment Locomotion, CoRL 2024 Morphology-Aware Policy and Design Learning Workshop.
Towards Embodiment Scaling Laws in Robot Locomotion
Cross-embodiment generalization underpins the vision of building generalist embodied agents for any robot, yet its enabling factors remain poorly understood. We investigate embodiment scaling laws, the hypothesis that increasing the number of training embodiments improves generalization to unseen ones, using robot locomotion as a test bed. We procedurally generate ∼1,000 embodiments with topological, geometric, and joint-level kinematic variations, and train policies on random subsets. We observe positive scaling trends supporting the hypothesis, and find that embodiment scaling enables substantially broader generalization than data scaling on fixed embodiments. Our best policy, trained on the full dataset, transfers zero-shot to novel embodiments in simulation and the real world, including the Unitree Go2 and H1. These results represent a step toward general embodied intelligence, with relevance to adaptive control for configurable robots, morphology co-design, and beyond.
Want to know more? Read:
- Bohlinger, N.; Ai, B.; Dai, L.; Li, D.; Mu, T.; Wu, Z.; Fay, K.; Christensen, H.I.; Peters, J.; Su, H. (2025). Towards Embodiment Scaling Laws in Robot Locomotion, Conference on Robot Learning (CoRL).
- Bohlinger, N.; Ai, B.; Dai, L.; Li, D.; Mu, T.; Wu, Z.; Fay, K.; Christensen, H.I.; Peters, J.; Su, H. (2025). Towards Embodiment Scaling Laws in Robot Locomotion, RSS 2025 Workshop on Robot Hardware-Aware Intelligence.
Gait in Eight: Efficient On-Robot Learning for Omnidirectional Quadruped Locomotion
On-robot Reinforcement Learning is a promising approach to train embodiment-aware policies for legged robots. However, the computational constraints of real-time learning on robots pose a significant challenge. We present a framework for efficiently learning quadruped locomotion in just 8 minutes of raw real-time training, utilizing the sample efficiency and minimal computational overhead of the new off-policy algorithm CrossQ. We investigate two control architectures: predicting joint target positions for agile, high-speed locomotion, and using Central Pattern Generators for stable, natural gaits. While prior work focused on learning simple forward gaits, our framework extends on-robot learning to omnidirectional locomotion. We demonstrate the robustness of our approach in different indoor and outdoor environments.
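A Central Pattern Generator, in its simplest form, is a set of phase-coupled oscillators producing rhythmic, bounded joint targets. The following toy sketch (illustrative only; the paper's CPG formulation and parameters may differ) generates trot-like hip targets for four legs:

```python
import math

def cpg_joint_targets(t, n_legs=4, freq=2.0, amp=0.3,
                      offsets=(0.0, 0.5, 0.5, 0.0)):
    """Toy Central Pattern Generator: one phase-shifted sinusoid per leg.
    Diagonal leg pairs share a phase offset, giving a trot-like pattern;
    outputs are bounded by `amp`, which keeps exploration safe on hardware."""
    return [amp * math.sin(2.0 * math.pi * (freq * t + offsets[i]))
            for i in range(n_legs)]
```

Bounded, smooth targets like these are one reason CPG-based action spaces tend to produce stable, natural gaits during on-robot learning.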
Want to know more? Read:
- Bohlinger, N.; Kinzel, J.; Palenicek, D.; Antczak, L.; Peters, J. (2025). Gait in Eight: Efficient On-Robot Learning for Omnidirectional Quadruped Locomotion, International Conference on Intelligent Robots and Systems (IROS).
- Bohlinger, N.; Kinzel, J.; Palenicek, D.; Antczak, L.; Peters, J. (2025). Gait in Eight: Efficient On-Robot Learning for Omnidirectional Quadruped Locomotion, European Workshop on Reinforcement Learning (EWRL).
A Safety-Aware Shared Autonomy Framework with BarrierIK Using Control Barrier Functions
Shared autonomy blends operator intent with autonomous assistance. In cluttered environments, linear blending can produce unsafe commands even when each source is individually collision-free. Many existing approaches model obstacle avoidance through potentials or cost terms, which only enforce safety as a soft constraint. In contrast, safety-critical control requires hard guarantees. We investigate the use of control barrier functions (CBFs) at the inverse kinematics (IK) layer of shared autonomy, targeting post-blend safety while preserving task performance. Our approach is evaluated in simulation on representative cluttered environments and in a VR teleoperation study comparing pure teleoperation with shared autonomy. Across conditions, employing CBFs at the IK layer reduces violation time and increases minimum clearance while maintaining task performance. In the user study, participants reported higher perceived safety and trust, lower interference, and an overall preference for shared autonomy with our safety filter.
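The post-blend safety filter above reduces, in its simplest scalar form, to the classic CBF condition ḣ ≥ -αh applied to the blended command. A one-dimensional sketch (the paper operates on a multi-constraint QP at the IK layer; this illustrates only the underlying principle) where h ≥ 0 encodes clearance to an obstacle and ḣ equals the commanded velocity:

```python
def cbf_filter(u_des, h, alpha=1.0):
    """1-D control barrier function filter: h >= 0 is the safety set
    (e.g. clearance to an obstacle) and h_dot = u.  The CBF condition
    h_dot >= -alpha * h lower-bounds the velocity command; the minimally
    invasive correction simply clips the blended command at that bound."""
    u_min = -alpha * h
    return max(u_des, u_min)
```

A blended command driving toward the obstacle faster than the barrier allows is clipped; commands already satisfying the condition pass through unchanged, which is why task performance is preserved.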
Want to know more? Read:
- Guler, B.; Pompetzki, K.; Sun, Y.; Manschitz, S.; Peters, J. (2026). A Safety-Aware Shared Autonomy Framework with BarrierIK Using Control Barrier Functions, IEEE International Conference on Robotics and Automation (ICRA).
GaussTwin: Unified Simulation and Correction with Gaussian Splatting for Robotic Digital Twins
Digital twins promise to enhance robotic manipulation by maintaining a consistent link between real-world perception and simulation. However, most existing systems struggle with the lack of a unified model, complex dynamic interactions, and the real-to-sim gap, which limits downstream applications such as model predictive control. Thus, we propose GaussTwin, a real-time digital twin that combines position-based dynamics with discrete Cosserat rod formulations for physically grounded simulation, and Gaussian splatting for efficient rendering and visual correction. By anchoring Gaussians to physical primitives and enforcing coherent SE(3) updates driven by photometric error and segmentation masks, GaussTwin achieves stable prediction-correction while preserving physical fidelity. Through experiments in both simulation and on a Franka Research 3 platform, we show that GaussTwin consistently improves tracking accuracy and robustness compared to shape-matching and rigid-only baselines, while also enabling downstream tasks such as push-based planning. These results highlight GaussTwin as a step toward unified, physically meaningful digital twins that can support closed-loop robotic interaction and learning.
Want to know more? Read:
- Cai, Y.; Jansonnie, P.; Arenz, O.; de Farias C.; Peters, J. (2026). GaussTwin: Unified Simulation and Correction with Gaussian Splatting for Robotic Digital Twins, IEEE International Conference on Robotics and Automation (ICRA).
Eau De Q-Network: Adaptive Distillation of Neural Networks in Deep Reinforcement Learning
Recent works have successfully demonstrated that sparse deep reinforcement learning agents can be competitive against their dense counterparts. This opens up opportunities for reinforcement learning applications in fields where inference time and memory requirements are cost-sensitive or limited by hardware. Until now, dense-to-sparse methods have relied on hand-designed sparsity schedules that are not synchronized with the agent's learning pace. Crucially, the final sparsity level is chosen as a hyperparameter, which requires careful tuning as setting it too high might lead to poor performance. In this work, we address these shortcomings by crafting a dense-to-sparse algorithm that we name Eau De Q-Network (EauDeQN). To increase sparsity at the agent's learning pace, we consider multiple online networks with different sparsity levels, where each online network is trained from a shared target network. At each target update, the online network with the smallest loss is chosen as the next target network, while the other networks are replaced by a pruned version of the chosen network. We evaluate the proposed approach on the Atari 2600 benchmark and the MuJoCo physics simulator, showing that EauDeQN reaches high sparsity levels while keeping performance high.
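The target-update rule described above can be sketched with flat weight vectors and global magnitude pruning (a toy illustration of the mechanism, not the authors' code; the sparsity levels chosen below are hypothetical):

```python
def magnitude_prune(weights, sparsity):
    """Zero out the smallest-magnitude fraction of a flat weight list."""
    flat = sorted(abs(w) for w in weights)
    k = int(len(flat) * sparsity)
    thresh = flat[k - 1] if k > 0 else -1.0
    return [0.0 if abs(w) <= thresh else w for w in weights]

def eaudeqn_select(nets, losses, sparsities=(0.0, 0.25, 0.5)):
    """Sketch of EauDeQN's target update: the online network with the
    smallest loss becomes the next target; the other slots are refilled
    with pruned copies of it at increasing sparsity levels."""
    best = min(range(len(nets)), key=lambda i: losses[i])
    target = nets[best][:]
    new_nets = [magnitude_prune(target, s) for s in sparsities]
    return target, new_nets
```

Because a sparser candidate only survives if its loss is competitive, sparsity grows exactly as fast as the agent's learning pace permits.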
Want to know more? Read:
- Vincent, T.; Faust, T.; Tripathi, Y.; Peters, J.; D'Eramo, C. (2025). Eau De Q-Network: Adaptive Distillation of Neural Networks in Deep Reinforcement Learning, Conference on Reinforcement Learning and Decision Making (RLDM).
- Vincent, T.; Faust, T.; Tripathi, Y.; Peters, J.; D'Eramo, C. (2025). Eau De Q-Network: Adaptive Distillation of Neural Networks in Deep Reinforcement Learning, Reinforcement Learning Journal (RLJ).
Adaptive Q-Network: On-the-fly Target Selection for Deep Reinforcement Learning
Deep Reinforcement Learning (RL) is well known for being highly sensitive to hyperparameters, requiring substantial effort from practitioners to optimize them for the problem at hand. This also limits the applicability of RL in real-world scenarios. In recent years, the field of automated Reinforcement Learning (AutoRL) has grown in popularity by trying to address this issue. However, these approaches typically hinge on additional samples to select well-performing hyperparameters, hindering sample-efficiency and practicality. Furthermore, most AutoRL methods are heavily based on already existing AutoML methods, which were originally developed without accounting for the additional challenges RL poses due to its non-stationarity. In this work, we propose a new approach for AutoRL, called Adaptive Q-Network (AdaQN), that is tailored to RL to take into account the non-stationarity of the optimization procedure without requiring additional samples. AdaQN learns several Q-functions, each one trained with different hyperparameters, which are updated online using the Q-function with the smallest approximation error as a shared target. Our selection scheme simultaneously handles different hyperparameters while coping with the non-stationarity induced by the RL optimization procedure and being orthogonal to any critic-based RL algorithm. We demonstrate that AdaQN is theoretically sound and empirically validate it in MuJoCo control problems and Atari 2600 games, showing benefits in sample-efficiency, overall performance, robustness to stochasticity and training stability.
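The selection scheme above — train several Q-functions with different hyperparameters, then let the one with the smallest approximation error serve as the shared target — can be sketched on a single scalar transition (a toy illustration with per-candidate learning rates as the varied hyperparameter; not the authors' code):

```python
def adaqn_step(q_values, lrs, target, r, gamma=0.9):
    """One toy AdaQN step on a single (s, a, r, s') sample with scalar
    Q-values: each candidate Q is trained with its own learning rate toward
    a shared Bellman target; the candidate with the smallest squared TD
    error is then selected to provide the next shared target."""
    tgt = r + gamma * target
    errors = [(tgt - q) ** 2 for q in q_values]
    new_q = [q + lr * (tgt - q) for q, lr in zip(q_values, lrs)]
    best = min(range(len(q_values)), key=lambda i: errors[i])
    return new_q, q_values[best]
```

No extra environment samples are needed: all candidates learn from the same data, and selection happens purely from their approximation errors.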
Want to know more? Read:
- Vincent, T.; Wahren, F.; Peters, J.; Belousov, B.; D'Eramo, C. (2025). Adaptive Q-Network: On-the-fly Target Selection for Deep Reinforcement Learning, International Conference on Learning Representations (ICLR).
- Vincent, T.; Wahren, F.; Peters, J.; Belousov, B.; D'Eramo, C.; (2024). Adaptive Q-Network: On-the-fly Target Selection for Deep Reinforcement Learning, European Workshop on Reinforcement Learning (EWRL).
- Vincent, T.; Wahren, F.; Peters, J.; Belousov, B.; D'Eramo, C.; (2024). Adaptive Q-Network: On-the-fly Target Selection for Deep Reinforcement Learning, ICML Workshop on Automated Reinforcement Learning.
APPLE: Toward General Active Perception via Reinforcement Learning
Active perception is a fundamental skill that enables us humans to deal with uncertainty in our inherently partially observable environment. For senses such as touch, where the information is sparse and local, active perception becomes crucial. In recent years, active perception has emerged as an important research domain in robotics. However, current methods are often bound to specific tasks or make strong assumptions, which limit their generality. To address this gap, this work introduces APPLE (Active Perception Policy Learning) - a novel framework that leverages reinforcement learning (RL) to address a range of different active perception problems. APPLE jointly trains a transformer-based perception module and decision-making policy with a unified optimization objective, learning how to actively gather information. By design, APPLE is not limited to a specific task and can, in principle, be applied to a wide range of active perception problems. We evaluate two variants of APPLE across different tasks, including tactile exploration problems from the Tactile MNIST benchmark. Experiments demonstrate the efficacy of APPLE, achieving high accuracies on both regression and classification tasks. These findings underscore the potential of APPLE as a versatile and general framework for advancing active perception in robotics.
Want to know more? Read:
- Schneider, T.; de Farias, C.; Calandra, R.; Chen, L.; Peters, J. (2026). APPLE: Toward General Active Perception via Reinforcement Learning, International Conference on Learning Representations (ICLR).
Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation
Contact-rich manipulation depends on applying the correct grasp forces throughout the manipulation task, especially when handling fragile or deformable objects. Most existing imitation learning approaches treat visuotactile feedback only as an additional observation, leaving applied forces as an uncontrolled consequence of gripper commands. In this work, we present Force-Aware Robotic Manipulation (FARM), an imitation learning framework that integrates high-dimensional tactile data to infer tactile-conditioned force signals, which in turn define a matching force-based action space. We collect human demonstrations using a modified version of the handheld Universal Manipulation Interface (UMI) gripper that integrates a GelSight Mini visual tactile sensor. For deploying the learned policies, we developed an actuated variant of the UMI gripper with geometry matching our handheld version. During policy rollouts, the proposed FARM diffusion policy jointly predicts robot pose, grip width, and grip force. FARM outperforms several baselines across three tasks with distinct force requirements -- high-force, low-force, and dynamic force adaptation -- demonstrating the advantages of its two key components: leveraging force-grounded, high-dimensional tactile observations and a force-based control space.
Want to know more? Read:
- Helmut, E.; Funk, N.; Schneider, T.; de Farias, C.; Peters, J. (2026). Tactile-Conditioned Diffusion Policy for Force-Aware Robotic Manipulation, IEEE International Conference on Robotics and Automation (ICRA).
- Helmut, E.; Funk, N.; Schneider, T.; de Farias, C.; Peters, J. (2026). FARM: Force-Aware Robotic Manipulation with Tactile-Conditioned Diffusion Policies, German Robotics Conference (GRC).
Learning Force Distribution Estimation for the GelSight Mini Optical Tactile Sensor Based on Finite Element Analysis
Contact-rich manipulation remains a major challenge in robotics. Optical tactile sensors like GelSight Mini offer a low-cost solution for contact sensing by capturing soft-body deformations of the silicone gel. However, accurately inferring shear and normal force distributions from these gel deformations has yet to be fully addressed. In this work, we propose a machine learning approach using a U-net architecture to predict force distributions directly from the sensor's raw images. Our model, trained on force distributions inferred from Finite Element Analysis (FEA), demonstrates promising accuracy in predicting normal and shear force distributions for the commercially available GelSight Mini sensor. It also shows potential for generalizing across indenters and sensors of the same type, as well as for real-time application.
Want to know more? Read:
- Helmut, E.; Dziarski, L.; Funk, N.; Belousov, B.; Peters, J. (2025). Learning Force Distribution Estimation for the GelSight Mini Optical Tactile Sensor Based on Finite Element Analysis, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- Helmut, E.; Dziarski, L.; Funk, N.; Belousov, B.; Peters, J. (2024). Learning Force Distribution Estimation for the GelSight Mini Optical Tactile Sensor Based on Finite Element Analysis, 2nd NeurIPS Workshop on Touch Processing: From Data to Knowledge.
On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting
The field of robotic manipulation has advanced significantly in recent years. At the sensing level, several novel tactile sensors have been developed, capable of providing accurate contact information. On a methodological level, learning from demonstrations has proven an efficient paradigm to obtain performant robotic manipulation policies. The combination of both holds the promise to extract crucial contact-related information from the demonstration data and actively exploit it during policy rollouts. However, this integration has so far been underexplored, most notably in dynamic, contact-rich manipulation tasks where precision and reactivity are essential. This work therefore proposes a multimodal, visuotactile imitation learning framework that integrates a modular transformer architecture with a flow-based generative model, enabling efficient learning of fast and dexterous manipulation policies. We evaluate our framework on the dynamic, contact-rich task of robotic match lighting - a task in which tactile feedback influences human manipulation performance. The experimental results highlight the effectiveness of our approach and show that adding tactile information improves policy performance, underlining the combined potential of tactile sensing and imitation learning for learning dynamic manipulation from few demonstrations.
Want to know more? Read:
- Funk, N.; Chen, C.; Schneider, T.; Chalvatzaki, G.; Calandra, R.; Peters, J. (2025). On the Importance of Tactile Sensing for Imitation Learning: A Case Study on Robotic Match Lighting, ICRA 2025 Workshop on “Towards Human Level Intelligence Vision and Tactile Sensing”.
ActionFlow: Equivariant, Accurate, and Efficient Policies with Spatially Symmetric Flow Matching
Spatial understanding is a critical aspect of most robotic tasks, particularly when generalization is important. Despite the impressive results of deep generative models in complex manipulation tasks, the absence of a representation that encodes intricate spatial relationships between observations and actions often limits spatial generalization, necessitating large amounts of demonstrations. To tackle this problem, we introduce a novel policy class, ActionFlow. ActionFlow integrates spatial symmetry inductive biases while generating expressive action sequences. On the representation level, ActionFlow introduces an SE(3) Invariant Transformer architecture, which enables informed spatial reasoning based on the relative SE(3) poses between observations and actions. For action generation, ActionFlow leverages Flow Matching, a state-of-the-art deep generative model known for generating high-quality samples with fast inference - an essential property for feedback control. In combination, ActionFlow policies exhibit strong spatial and locality biases and SE(3)-equivariant action generation. Our experiments demonstrate the effectiveness of ActionFlow and its two main components on several simulated and real-world robotic manipulation tasks and confirm that we can obtain equivariant, accurate, and efficient policies with spatially symmetric flow matching.
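Flow matching, the generative backbone mentioned above, trains a network to regress the velocity of a simple probability path between noise and data, and generates samples by integrating that velocity field. A one-dimensional sketch of the straight-line (conditional) path and few-step Euler inference (illustrative only; ActionFlow's SE(3)-invariant architecture is not shown):

```python
def flow_matching_pair(x0, x1, t):
    """Point on the straight-line path between a noise sample x0 and a data
    sample x1 at time t, plus the target velocity (x1 - x0) that a flow
    matching network would regress at (x_t, t)."""
    x_t = (1.0 - t) * x0 + t * x1
    v_target = x1 - x0
    return x_t, v_target

def integrate(v_fn, x0, steps=10):
    """Euler integration of a learned velocity field from t=0 to t=1.
    Few-step inference like this is what makes flow matching fast enough
    for feedback control."""
    x, dt = x0, 1.0 / steps
    for i in range(steps):
        x += v_fn(x, i * dt) * dt
    return x
```

With a perfectly learned constant field, integration recovers the data sample exactly, which illustrates why inference can be both fast and accurate.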
Want to know more? Read:
- Funk, N.; Urain, J.; Carvalho, J.; Prasad, V.; Chalvatzaki, G.; Peters, J. (submitted). ActionFlow: Equivariant, Accurate, and Efficient Policies with Spatially Symmetric Flow Matching.
Evetac: An Event-based Optical Tactile Sensor for Robotic Manipulation
Optical tactile sensors have recently become popular. They provide high spatial resolution, but struggle to offer fine temporal resolutions. To overcome this shortcoming, we study the idea of replacing the RGB camera with an event-based camera and introduce a new event-based optical tactile sensor called Evetac. Along with hardware design, we develop touch processing algorithms to process its measurements online at 1000 Hz. We devise an efficient algorithm to track the elastomer's deformation through the imprinted markers despite the sensor's sparse output. Benchmarking experiments demonstrate Evetac's capabilities of sensing vibrations up to 498 Hz, reconstructing shear forces, and significantly reducing data rates compared to RGB optical tactile sensors. Moreover, Evetac's output and the marker tracking provide meaningful features for learning data-driven slip detection and prediction models. The learned models form the basis for a robust and adaptive closed-loop grasp controller capable of handling a wide range of objects. We believe that fast and efficient event-based tactile sensors like Evetac will be essential for bringing human-like manipulation capabilities to robotics.
Want to know more? Read:
- Funk, N.; Helmut, E.; Chalvatzaki, G.; Calandra, R.; Peters, J. (2024). Evetac: An Event-based Optical Tactile Sensor for Robotic Manipulation, IEEE Transactions on Robotics (T-RO), 40, pp.3812-3832.
Parameterized Projected Bellman Operator
Approximate value iteration (AVI) is a family of algorithms for reinforcement learning (RL) that aims to obtain an approximation of the optimal value function. Generally, AVI algorithms implement an iterative procedure where each step consists of (i) an application of the Bellman operator and (ii) a projection step into a considered function space. Notoriously, the Bellman operator leverages transition samples, which strongly determine its behavior, as uninformative samples can result in negligible updates or long detours, whose detrimental effects are further exacerbated by the computationally intensive projection step. To address these issues, we propose a novel alternative approach based on learning an approximate version of the Bellman operator rather than estimating it through samples as in AVI approaches. This way, we are able to (i) generalize across transition samples and (ii) avoid the computationally intensive projection step. For this reason, we call our novel operator projected Bellman operator (PBO). We formulate an optimization problem to learn PBO for generic sequential decision-making problems, and we theoretically analyze its properties in two representative classes of RL problems. Furthermore, we theoretically study our approach under the lens of AVI and devise algorithmic implementations to learn PBO in offline and online settings by leveraging neural network parameterizations. Finally, we empirically showcase the benefits of PBO w.r.t. the regular Bellman operator on several RL problems.
Want to know more? Read:
- Vincent, T.; Metelli, A.; Belousov, B.; Peters, J.; Restelli, M.; D'Eramo, C. (2024). Parameterized Projected Bellman Operator, Proceedings of the National Conference on Artificial Intelligence (AAAI).
- Vincent, T.; Metelli, A.; Peters, J.; Restelli, M.; D'Eramo, C. (2023). Parameterized projected Bellman operator, ICML Workshop on New Frontiers in Learning, Control, and Dynamical Systems.
Adaptive Control based Friction Estimation for Tracking Control of Robot Manipulators
Adaptive control is often used for friction compensation in trajectory tracking tasks because it does not require torque sensors. However, it has some drawbacks: first, the most common certainty-equivalence adaptive control design is based on linearized parameterization of the friction model, therefore nonlinear effects, including the stiction and Stribeck effect, are usually omitted. Second, the adaptive control-based estimation can be biased due to non-zero steady-state error. Third, neglecting unknown model mismatch could result in non-robust estimation. This paper proposes a novel linear parameterized friction model capturing the nonlinear static friction phenomenon. Subsequently, an adaptive control-based friction estimator is proposed to reduce the bias during estimation based on backstepping. Finally, we propose an algorithm to generate excitation for robust estimation. Using a KUKA iiwa 14, we conducted trajectory tracking experiments to evaluate the estimated friction model, including random Fourier and drawing trajectories, showing the effectiveness of our methodology in different control schemes.
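The key modeling idea, a friction model that is linear in its parameters yet captures nonlinear static friction, can be sketched as follows. The basis functions and the fixed Stribeck velocity below are illustrative assumptions, not the parameterization from the paper, and the least-squares fit stands in for what an adaptive law would converge to under good excitation.

```python
import numpy as np

# Linear-in-parameters friction model: tau_f = phi(qd)^T theta.
# Fixing the Stribeck velocity V_S (an assumption here) keeps the
# exponential term linear in the unknown coefficients theta.
V_S = 0.1  # hypothetical Stribeck velocity [rad/s]

def friction_basis(qd):
    """Regressor phi(qd) for one joint velocity qd."""
    return np.array([
        np.sign(qd),                              # Coulomb friction
        qd,                                       # viscous friction
        np.sign(qd) * np.exp(-(qd / V_S) ** 2),   # Stribeck (static) effect
    ])

def friction_torque(qd, theta):
    return friction_basis(qd) @ theta

# Identify theta from noiseless synthetic data by least squares.
theta_true = np.array([0.8, 0.3, 0.5])
qd_samples = np.linspace(-2.0, 2.0, 401)
Phi = np.stack([friction_basis(qd) for qd in qd_samples])
tau = Phi @ theta_true
theta_hat, *_ = np.linalg.lstsq(Phi, tau, rcond=None)
```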
Want to know more? Read:
- Huang, J.; Tateo, D.; Liu, P.; Peters, J. (2025). Adaptive Control based Friction Estimation for Tracking Control of Robot Manipulators, IEEE Robotics and Automation Letters, and IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 10, pp.2454-2461.
Safe Reinforcement Learning on the Constraint Manifold: Theory and Applications
Integrating learning-based techniques, especially reinforcement learning, into robotics is promising for solving complex problems in unstructured environments. However, most existing approaches are trained in well-tuned simulators and subsequently deployed on real robots without online fine-tuning. In this setting, extensive engineering is required to mitigate the sim-to-real gap, which can be challenging for complex systems. Instead, learning with real-world interaction data offers a promising alternative: it not only eliminates the need for a fine-tuned simulator but also applies to a broader range of tasks where accurate modeling is infeasible. One major problem for on-robot reinforcement learning is ensuring safety, as uncontrolled exploration can cause catastrophic damage to the robot or the environment. Indeed, safety specifications, often represented as constraints, can be complex and non-linear, making safety challenging to guarantee in learning systems. In this paper, we show how we can impose complex safety constraints on learning-based robotic systems in a principled manner, both from theoretical and practical points of view. Our approach is based on the concept of the Constraint Manifold, representing the set of safe robot configurations. Exploiting differential geometry techniques, i.e., the tangent space, we can construct a safe action space, allowing learning agents to sample arbitrary actions while ensuring safety. We demonstrate the method's effectiveness in a real-world Robot Air Hockey task, showing that our method can handle high-dimensional tasks with complex constraints.
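The tangent-space idea can be illustrated in a few lines: velocities that keep a constraint c(q) = 0 satisfied to first order lie in the null space of its Jacobian, so projecting any sampled action onto that null space keeps the agent on the manifold. The constraint below is a hypothetical example, not one from the Air Hockey task.

```python
import numpy as np

# Safe action space via the tangent space of the constraint manifold c(q) = 0.
# Tangent velocities satisfy J(q) qd = 0, so projecting any sampled action
# onto the null space of J preserves the constraint to first order.
def constraint(q):
    # Hypothetical constraint: keep the squared joint norm of a 3-DoF arm fixed.
    return np.array([q @ q - 1.0])

def constraint_jacobian(q):
    return 2.0 * q[None, :]   # dc/dq, shape (1, 3)

def project_to_tangent(q, action):
    J = constraint_jacobian(q)
    N = np.eye(q.size) - np.linalg.pinv(J) @ J   # null-space projector
    return N @ action

q = np.array([1.0, 0.0, 0.0])
raw_action = np.array([0.5, -0.2, 0.7])          # arbitrary sampled action
safe_action = project_to_tangent(q, raw_action)
# Moving along safe_action does not change c(q) to first order.
```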
Want to know more? Read:
- Liu, P.; Bou-Ammar H.; Peters, J.; Tateo D. (2025). Safe Reinforcement Learning on the Constraint Manifold: Theory and Applications, IEEE Transactions on Robotics (T-RO), 41, pp.3442-3461.
Learning Implicit Priors for Motion Optimization
In this paper, we focus on the problem of integrating Energy-based Models (EBM) as guiding priors for motion optimization. EBMs are a class of neural networks that can represent expressive probability density distributions in terms of a Gibbs distribution parameterized by a suitable energy function. Due to their implicit nature, they can easily be integrated as optimization factors or as initial sampling distributions in the motion optimization problem, making them good candidates for integrating data-driven priors into motion optimization. In this work, we present the modeling and algorithmic choices required to adapt EBMs for motion optimization. We investigate the benefit of including additional regularizers in the learning of the EBMs to use them with gradient-based optimizers, and we present a set of EBM architectures to learn generalizable distributions for manipulation tasks. We present multiple cases in which the EBM could be integrated for motion optimization and evaluate the performance of learned EBMs as guiding priors for both simulated and real robot experiments.
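Since an EBM defines p(x) ∝ exp(-E(x)), its energy can be added directly as a cost factor in gradient-based motion optimization. A minimal sketch under toy assumptions: the "learned" energy below is a hand-written quadratic standing in for a trained network, combined with a finite-difference smoothness objective over a 1-D trajectory.

```python
import numpy as np

# Toy stand-in for using an EBM energy as a guiding prior: total cost is
# smoothness + lam * E(x), minimized by plain gradient descent.
def energy_grad(x):
    # Gradient of a made-up quadratic energy pulling waypoints toward 1.0,
    # standing in for autograd through a learned EBM.
    return x - 1.0

def smoothness_cost_grad(x):
    # Gradient of 0.5 * sum((x[i+1] - x[i])^2): penalize large steps.
    g = np.zeros_like(x)
    g[1:] += x[1:] - x[:-1]
    g[:-1] -= x[1:] - x[:-1]
    return g

x = np.zeros(10)          # 1-D trajectory of 10 waypoints
lam = 0.5                 # weight of the EBM prior factor
for _ in range(500):
    x -= 0.1 * (smoothness_cost_grad(x) + lam * energy_grad(x))
# The trajectory settles in the low-energy (high-probability) region of
# the prior while staying smooth.
```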
Want to know more? Read:
- Urain, J.*; Le, A.T.*; Lambert, A.*; Chalvatzaki, G.; Boots, B.; Peters, J. (2022). Learning Implicit Priors for Motion Optimization, IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
- Le, A. T.; Urain, J.; Lambert, A.; Chalvatzaki, G.; Boots, B.; Peters, J. (2022). Learning Implicit Priors for Motion Optimization, RSS 2022 Workshop on Implicit Representations for Robotic Manipulation.
Learn2Assemble with Structured Representations and Search for Robotic Architectural Construction
Autonomous robotic assembly requires a well-orchestrated sequence of high-level actions and smooth manipulation executions. The problem of learning to assemble complex 3D structures remains challenging, as it requires drawing connections between target shapes and available building blocks, as well as creating valid assembly sequences with respect to stability and kinematic feasibility in the robot's workspace. We design a hierarchical control framework that learns to sequence the building blocks to construct arbitrary 3D designs and ensures that they are feasible, as we plan the geometric execution with the robot-in-the-loop. Our approach draws its generalization properties from combining graph-based representations with reinforcement learning (RL) and ultimately adding tree-search. Combining structured representations with model-free RL and Monte-Carlo planning allows agents to operate with various target shapes and building block types. We demonstrate the flexibility of the proposed structured representation and our algorithmic solution in a series of simulated 3D assembly tasks with robotic evaluation, which showcases our method's ability to learn to construct stable structures with a large number of building blocks.
Want to know more? Read:
- Funk, N.; Chalvatzaki, G.; Belousov, B.; Peters, J. (2021). Learn2Assemble with Structured Representations and Search for Robotic Architectural Construction, Conference on Robot Learning (CoRL).
Benchmarking Structured Policies and Policy Optimization for Real-World Dexterous Object Manipulation
Dexterous manipulation is a challenging and important problem in robotics. While data-driven methods are a promising approach, current benchmarks require simulation or extensive engineering support due to the sample inefficiency of popular methods. We present benchmarks for the TriFinger system, an open-source robotic platform for dexterous manipulation and the focus of the 2020 Real Robot Challenge. The benchmarked methods, which were successful in the challenge, can be generally described as structured policies, as they combine elements of classical robotics and modern policy optimization. This inclusion of inductive biases facilitates sample efficiency, interpretability, reliability, and high performance. The key aspects of this benchmarking are the validation of the baselines across both simulation and the real system, a thorough ablation study over the core features of each solution, and a retrospective analysis of the challenge as a manipulation benchmark.
Want to know more? Read:
- Funk, N.; Schaff, C.; Madan, R.; Yoneda, T.; Urain, J.; Watson, J.; Gordon, E.; Widmaier, F; Bauer, S.; Srinivasa, S.; Bhattacharjee, T.; Walter, M.; Peters, J. (2022). Benchmarking Structured Policies and Policy Optimization for Real-World Dexterous Object Manipulation, IEEE Robotics and Automation Letters (R-AL).
NPDR -- Combining Likelihood-Free Inference, Normalizing Flows, and Domain Randomization
Combining domain randomization and reinforcement learning is a widely used approach to obtain control policies that can bridge the gap between simulation and reality. However, existing methods make limiting assumptions on the form of the domain parameter distribution which prevents them from utilizing the full power of domain randomization. Typically, a restricted family of probability distributions (e.g., normal or uniform) is chosen a priori for every parameter. Furthermore, straightforward approaches based on deep learning require differentiable simulators, which are either not available or can only simulate a limited class of systems. Such rigid assumptions diminish the applicability of domain randomization in robotics. Building upon recently proposed neural likelihood-free inference methods, we introduce Neural Posterior Domain Randomization (NPDR), an algorithm that alternates between learning a policy from a randomized simulator and adapting the posterior distribution over the simulator's parameters in a Bayesian fashion. Our approach only requires a parameterized simulator, coarse prior ranges, a policy (optionally with optimization routine), and a small set of real-world observations. Most importantly, the domain parameter distribution is not restricted to a specific family, parameters can be correlated, and the simulator does not have to be differentiable. We show that the presented method is able to efficiently adapt the posterior over the domain parameters to more closely match the observed dynamics. Moreover, we demonstrate that NPDR can learn transferable policies using fewer real-world rollouts than comparable algorithms.
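The posterior-adaptation step can be illustrated on a toy simulator. To keep the sketch self-contained, the neural likelihood-free inference used by NPDR is replaced here by simple rejection ABC, the policy-learning half of the alternation is omitted, and the simulator is a made-up one-parameter model; the point is only that the posterior over the domain parameter concentrates near the true value without assuming any distribution family.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy simulator: observed oscillation period of a pendulum-like system,
# parameterized by an unknown mass (the single domain parameter).
def simulate(mass, n=20):
    return 2 * np.pi * np.sqrt(1.0 / mass) + 0.05 * rng.standard_normal(n)

true_mass = 2.0
real_obs = simulate(true_mass)            # small set of real-world observations

# Coarse prior range over the domain parameter, as NPDR assumes.
prior = rng.uniform(0.5, 5.0, size=20000)
sim_summaries = np.array([simulate(m).mean() for m in prior])

# Rejection-ABC stand-in for the neural inference step: keep parameters
# whose simulated summary matches the real one.
accepted = prior[np.abs(sim_summaries - real_obs.mean()) < 0.02]
posterior_mean = accepted.mean()
# The accepted set approximates an arbitrary-shaped posterior concentrated
# near the true mass; NPDR would now retrain the policy under it.
```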
Want to know more? Read:
- Muratore, F.; Gruner, T.; Wiese, F.; Belousov, B.; Gienger, M.; Peters, J. (2021). Neural Posterior Domain Randomization, Conference on Robot Learning (CoRL).
Underactuated Waypoint Trajectory Optimization for Light Painting Photography
To validate the proposed method, a letter-drawing task is set up where shapes traced by the tip of a rotary inverted pendulum are visualized using long-exposure photography.
Want to know more? Read:
- Eilers, C.; Eschmann, J.; Menzenbach, R.; Belousov, B.; Muratore, F.; Peters, J. (2020). Underactuated Waypoint Trajectory Optimization for Light Painting Photography, Proceedings of the IEEE International Conference on Robotics and Automation (ICRA).
Data-efficient Domain Randomization with Bayesian Optimization
Want to know more? Read:
- Muratore, F.; Eilers, C.; Gienger, M.; Peters, J. (2021). Data-efficient Domain Randomization with Bayesian Optimization, IEEE Robotics and Automation Letters (RA-L), with Presentation at the IEEE International Conference on Robotics and Automation (ICRA), IEEE.
BayRn -- Sim-to-Real Evaluation
When learning from simulations, the optimizer is free to exploit the simulation, so the resulting policies can perform very well in simulation but transfer poorly to the real-world counterpart. For example, both of the subsequently shown policies yield a return of 1 and thus look equally good to the learner. Bayesian Domain Randomization (BayRn) uses a Gaussian process to learn how to adapt the randomized simulator solely from the observed real-world returns. BayRn is agnostic to the policy optimization subroutine; in this work we used PPO and PoWER. We also evaluated BayRn on an underactuated swing-up and balance task.
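The GP-based adaptation loop can be sketched with a numpy-only Gaussian process and a UCB acquisition. Everything here is a simplified stand-in: the "real return" is a made-up test function of a single domain parameter, the kernel length scale is arbitrary, and a real BayRn iteration would train a policy in the randomized simulator before each real-world evaluation.

```python
import numpy as np

# BayRn-style loop sketch: model real-world return as a function of the
# domain-parameter setting with a GP, pick the next setting by UCB.
def real_return(phi):                      # stand-in for a real-robot rollout
    return -(phi - 0.6) ** 2

def rbf(a, b, ls=0.2):
    return np.exp(-0.5 * (a[:, None] - b[None, :]) ** 2 / ls ** 2)

X = np.array([0.1, 0.9])                   # initial domain-parameter settings
y = real_return(X)
candidates = np.linspace(0.0, 1.0, 101)

for _ in range(10):
    K = rbf(X, X) + 1e-6 * np.eye(X.size)
    Ks = rbf(candidates, X)
    mu = Ks @ np.linalg.solve(K, y)                                # GP mean
    var = 1.0 - np.sum(Ks * np.linalg.solve(K, Ks.T).T, axis=1)    # GP variance
    ucb = mu + 2.0 * np.sqrt(np.maximum(var, 0.0))
    phi_next = candidates[np.argmax(ucb)]  # most promising randomization
    X = np.append(X, phi_next)
    y = np.append(y, real_return(phi_next))

best_phi = X[np.argmax(y)]                 # best domain parameter found so far
```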
Want to know more? Read:
- Muratore, F.; Eilers, C.; Gienger, M.; Peters, J. (2021). Data-efficient Domain Randomization with Bayesian Optimization, IEEE Robotics and Automation Letters (RA-L), with Presentation at the IEEE International Conference on Robotics and Automation (ICRA), IEEE.
Robot Juggling Learning Procedure
Learning of the Juggling Task - For learning on the physical Barrett WAM, 20 episodes were performed. During each episode, 25 randomly sampled parameter sets were executed and the episodic reward was evaluated. A roll-out was stopped once the robot had juggled successfully for 10 s. Roll-outs corrupted by obvious environment errors were repeated with the same parameters; minor variations caused by the environment initialization were not. After collecting the samples, the policy was updated using eREPS with a KL constraint of 2. The video shows all trials executed on the physical system to learn the optimal policy.
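The eREPS update mentioned above reweights the sampled episode parameters by exp(R/eta), with the temperature eta chosen via the convex dual so that the KL divergence to the previous policy respects the bound (epsilon = 2 here, matching the description). This is a sketch of only the reweighting step; the dual is minimized by a crude grid search rather than a proper 1-D optimizer, and the returns are synthetic.

```python
import numpy as np

# eREPS reweighting step: weights w_i ~ exp(R_i / eta), with eta from the
# dual so that KL(w || uniform) stays below epsilon.
def reps_weights(returns, epsilon=2.0):
    R = returns - returns.max()            # shift for numerical stability
    def dual(eta):
        return eta * epsilon + eta * np.log(np.mean(np.exp(R / eta)))
    # Crude minimization of the convex dual over a log-spaced grid.
    etas = np.logspace(-3, 3, 2000)
    eta = etas[np.argmin([dual(e) for e in etas])]
    w = np.exp(R / eta)
    return w / w.sum()

returns = np.random.default_rng(1).normal(size=25)   # 25 synthetic rollouts
w = reps_weights(returns)
kl = np.sum(w * np.log(w * len(w)))       # KL(w || uniform), ~= epsilon
# The weighted maximum-likelihood fit of the policy to these weights would
# complete the eREPS policy update.
```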
Want to know more? Read:
- Ploeger, K.; Lutter, M.; Peters, J. (2020). High Acceleration Reinforcement Learning for Real-World Juggling with Binary Rewards, Conference on Robot Learning (CoRL).
Robot Air Hockey
Want to know more? Read:
- Liu, P.; Tateo, D.; Bou-Ammar, H.; Peters, J. (2021). Efficient and Reactive Planning for High Speed Robot Air Hockey, Proceedings of the IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS).
SPRL Ball in a Cup
Want to know more? Read:
- Klink, P.; Abdulsamad, H.; Belousov, B.; D'Eramo, C.; Peters, J.; Pajarinen, J. (2021). A Probabilistic Interpretation of Self-Paced Learning with Applications to Reinforcement Learning, Journal of Machine Learning Research (JMLR).
- Klink, P.; D'Eramo, C.; Peters, J.; Pajarinen, J. (2020). Self-Paced Deep Reinforcement Learning, Advances in Neural Information Processing Systems (NIPS / NeurIPS).
- Klink, P.; Abdulsamad, H.; Belousov, B.; Peters, J. (2019). Self-Paced Contextual Reinforcement Learning, Proceedings of the 3rd Conference on Robot Learning (CoRL).
A Nonparametric Off-Policy Policy Gradient
Want to know more? Read:
- Tosatto, S.; Carvalho, J.; Abdulsamad, H.; Peters, J. (2020). A Nonparametric Off-Policy Policy Gradient, Proceedings of the 23rd International Conference on Artificial Intelligence and Statistics (AISTATS).
SPOTA -- Simulation-based Policy Optimization with Transferability Assessment
Learning robot control policies from physics simulations is of great interest to the robotics community as it may render the learning process faster, cheaper, and safer by alleviating the need for expensive real-world experiments. However, the direct transfer of learned behavior from simulation to reality is a major challenge. Optimizing a policy on a slightly faulty simulator can easily lead to the maximization of the "Simulation Optimization Bias" (SOB). In this case, the optimizer exploits modeling errors of the simulator such that the resulting behavior can potentially damage the robot. We tackle this challenge by applying domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA) which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the over-fitting to the set of domains experienced while training. Our experimental results on two different second-order nonlinear systems show that the new simulation-based policy search algorithm is able to learn a control policy exclusively from a randomized simulator, which can be applied directly to real systems without any additional training. The video shows the Sim-to-Real transfer of a policy learned by Simulation-based Policy Optimization with Transferability Assessment (SPOTA) on the Ball-Balancer and Cart-Pole platforms from Quanser.
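The SOB estimator can be illustrated with a toy stand-in: the bias is the optimistic gap between a policy's average return on the domains it was optimized for and its return on held-out domains. Here "policies" and "domains" are scalars and the return is a negative quadratic, so the per-domain-set optimizer has a closed form; none of this reflects the paper's actual estimator construction.

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy SOB illustration: return of scalar "policy" in scalar "domain".
def ret(policy, domain):
    return -(policy - domain) ** 2

gaps = []
for _ in range(200):
    train = rng.normal(1.0, 0.5, size=5)    # domains seen during training
    test = rng.normal(1.0, 0.5, size=200)   # held-out domains
    policy = train.mean()                   # optimizer exploits the train set
    gaps.append(ret(policy, train).mean() - ret(policy, test).mean())
sob_estimate = float(np.mean(gaps))
# sob_estimate > 0: training returns overstate real performance. SPOTA uses
# such an estimate as a stopping criterion, training until the bias is small.
```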
SPOTA -- Cross-Evaluation Between Vortex and Bullet Physics Engines
We apply domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA), which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the over-fitting to the set of domains experienced during training. Supplementary video to "Domain Randomization for Simulation-Based Policy Optimization with Transferability Assessment" (CoRL 2018), comparing against LQR, (vanilla) TRPO, and EPOpt with synchronized random seeds and 4 different initial positions. In this setup, we both train and test in Vortex.
SPOTA -- Evaluation in Different Environments
We apply domain randomization, i.e., randomizing the parameters of the physics simulations during learning. We propose an algorithm called Simulation-based Policy Optimization with Transferability Assessment (SPOTA), which uses an estimator of the SOB to formulate a stopping criterion for training. The introduced estimator quantifies the over-fitting to the set of domains experienced during training. Supplementary video to "Domain Randomization for Simulation-Based Policy Optimization with Transferability Assessment" (CoRL 2018), comparing against LQR, (vanilla) TRPO, and EPOpt with synchronized random seeds and 4 different initial positions. In this setup, we run in an environment with nominal parameters.
Want to know more? Read:
- Muratore, F.; Gienger, M.; Peters, J. (2021). Assessing Transferability from Simulation to Reality for Reinforcement Learning, IEEE Transactions on Pattern Analysis and Machine Intelligence (PAMI), 43, 4, pp.1172-1183, IEEE.
Incremental Imitation Learning of Context-Dependent Motor Skills
Current imitation learning techniques struggle with a number of challenges that prevent their wide usability. For instance, robots might not be able to accurately reproduce every human demonstration, and it is not always clear how robots should generalize a movement to new contexts. This paper addresses those challenges by presenting a method to incrementally teach context-dependent motor skills to robots. The human demonstrates trajectories for different contexts by moving the links of the robot and partially or fully refines those trajectories by disturbing the movements of the robot while it executes the behavior it has learned so far. A joint probability distribution over trajectories and contexts can then be built from those demonstrations and refinements. Given a new context, the robot computes the most probable trajectory, which can also be refined by the human. The joint probability distribution is incrementally updated with the refined trajectories. We have evaluated our method in experiments in which an elastically actuated robot arm with four degrees of freedom learns how to reach a ball at different positions.
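The "most probable trajectory for a new context" step has a closed form when the joint distribution over contexts and trajectory parameters is modeled as a Gaussian. The sketch below uses that common modeling choice with made-up demonstration data; the paper's exact distribution family may differ.

```python
import numpy as np

# Condition a joint Gaussian over [context; trajectory params] on a context.
def condition_gaussian(mu, Sigma, n_ctx, context):
    mu_c, mu_t = mu[:n_ctx], mu[n_ctx:]
    S_cc = Sigma[:n_ctx, :n_ctx]
    S_tc = Sigma[n_ctx:, :n_ctx]
    gain = S_tc @ np.linalg.inv(S_cc)
    mu_cond = mu_t + gain @ (context - mu_c)          # most probable trajectory
    S_cond = Sigma[n_ctx:, n_ctx:] - gain @ S_tc.T
    return mu_cond, S_cond

# Hypothetical demos: column 0 = context (ball position), columns 1-2 =
# trajectory parameters. Refinements would simply be appended and refit.
demos = np.array([[0.0, 0.1, 0.2],
                  [0.5, 0.6, 0.7],
                  [1.0, 1.1, 1.2],
                  [1.5, 1.6, 1.7]])
mu = demos.mean(axis=0)
Sigma = np.cov(demos.T) + 1e-6 * np.eye(3)            # jitter for invertibility
traj_mean, traj_cov = condition_gaussian(mu, Sigma, 1, np.array([0.25]))
```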
Want to know more? Read:
- Ewerton, M.; Maeda, G.J.; Kollegger, G.; Wiemeyer, J.; Peters, J. (2016). Incremental Imitation Learning of Context-Dependent Motor Skills, Proceedings of the International Conference on Humanoid Robots (HUMANOIDS), pp.351--358.
Active Incremental Learning of Robot Movement Primitives
Want to know more? Read:
- Maeda, G.; Ewerton, M.; Osa, T.; Busch, B.; Peters, J. (2017). Active Incremental Learning of Robot Movement Primitives, Proceedings of the Conference on Robot Learning (CoRL).
Robot learning from observation
The human demonstrator and the robot learner usually have different embodiments. A movement that can be demonstrated well by a human may not be kinematically feasible for robot reproduction. A common approach to solving this kinematic mapping is to retarget pre-defined corresponding parts of the human and the robot kinematic structure. When such a correspondence is not available, manual scaling of the movement amplitude and positioning of the demonstration relative to the reference frame of the robot may be required. This paper's contribution is a method that eliminates the need for human-robot structural associations, and is therefore less sensitive to the type of robot kinematics, while searching for the optimal location and adaptation of the human demonstration such that the robot can accurately execute the optimized solution. The method defines a cost that quantifies the quality of the kinematic mapping and decreases it in conjunction with task-specific costs such as via-points and obstacles. We demonstrate the method experimentally, generalizing a real golf swing recorded via marker tracking to different speeds on the embodiment of a 7 degree-of-freedom (DoF) arm. In simulation, we compare solutions of robots with different kinematic structures.
Want to know more? Read:
- Maeda, G.; Ewerton, M.; Koert, D; Peters, J. (2016). Acquiring and Generalizing the Embodiment Mapping from Human Observations to Robot Skills, IEEE Robotics and Automation Letters (RA-L), 1, 2, pp.784--791.
Combining Human Demonstrations and Motion Planning for Movement Primitive Optimization
Want to know more? Read:
- Koert, D.; Maeda, G.J.; Lioutikov, R.; Neumann, G.; Peters, J. (2016). Demonstration Based Trajectory Optimization for Generalizable Robot Motions, Proceedings of the International Conference on Humanoid Robots (HUMANOIDS).
Phase Estimation for Fast Action Recognition and Trajectory Generation in Human-Robot Collaboration
Want to know more? Read:
- Maeda, G.; Ewerton, M.; Neumann, G.; Lioutikov, R.; Peters, J. (2017). Phase Estimation for Fast Action Recognition and Trajectory Generation in Human-Robot Collaboration, International Journal of Robotics Research (IJRR), 36, 13-14, pp.1579-1594.
Hierarchical Reinforcement Learning of Multiple Grasping Policies
Prior work on grasping often assumes that a sufficient amount of training data is available for learning and planning robotic grasps. However, constructing such an exhaustive training dataset is very challenging in practice, and it is desirable that a robotic system can autonomously learn and improve its grasping strategy. Although recent work has presented autonomous data collection through trial and error, such methods are often limited to a single grasp type, e.g., a vertical pinch grasp. To address these issues, we present a hierarchical policy search approach for learning multiple grasping strategies. To leverage human knowledge, the grasping strategies are initialized with human demonstrations. In addition, a database of grasping motions and object point clouds is built autonomously from a set of grasps given by a user. The problem of selecting the grasp location and grasp policy is formulated as a bandit problem in our framework. We applied our reinforcement learning framework to grasping both rigid and deformable objects. The experimental results show that our framework autonomously learns and improves its performance through trial and error and can grasp previously unseen objects with high accuracy. This work is supported by H2020 RoMaNS (Robotic Manipulation for Nuclear Sort and Segregation) http://www.h2020romans.eu/
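The bandit formulation can be sketched with a standard UCB1 rule: each arm is one grasp strategy, the reward is binary grasp success, and the rule trades off exploiting the best-looking strategy against exploring uncertain ones. The success probabilities below are hypothetical, and UCB1 is a generic choice, not necessarily the bandit algorithm used in the paper.

```python
import numpy as np

rng = np.random.default_rng(3)

# Grasp-strategy selection as a bandit: arms = grasp types, reward = success.
success_prob = np.array([0.3, 0.8, 0.5])   # hypothetical per-strategy rates
n_arms = success_prob.size
counts = np.ones(n_arms)                   # one forced initial pull per arm
rewards = rng.binomial(1, success_prob).astype(float)

for t in range(n_arms, 2000):
    # UCB1: empirical success rate plus an exploration bonus.
    ucb = rewards / counts + np.sqrt(2 * np.log(t) / counts)
    arm = np.argmax(ucb)
    counts[arm] += 1
    rewards[arm] += rng.binomial(1, success_prob[arm])

best = np.argmax(counts)                   # the most-selected strategy
# Through trial and error, selection concentrates on the best grasp type.
```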
Want to know more? Read:
- Osa, T.; Peters, J.; Neumann, G. (2018). Hierarchical Reinforcement Learning of Multiple Grasping Strategies with Human Instructions, Advanced Robotics, 32, 18, pp.955-968.
If you want to watch even more videos on robot learning, please checkout our older videos in the archive.