We offer these current topics directly to Bachelor's and Master's students at TU Darmstadt. Feel free to DIRECTLY contact the thesis advisor if you are interested in one of these topics. **Excellent external students from other universities may be accepted, but please email Jan Peters first. Note that we cannot provide funding for any of these thesis projects.**

We highly recommend that you take either our robotics and machine learning lectures (Robot Learning, Statistical Machine Learning) or those of our colleagues (Grundlagen der Robotik, Probabilistic Graphical Models and/or Deep Learning). Even more important to us is that you take both Robot Learning: Integrated Project, Part 1 (Literature Review and Simulation Studies) and Part 2 (Evaluation and Submission to a Conference) before doing a thesis with us.

In addition, we are usually happy to devise new topics on request to suit the abilities of excellent students. Please **DIRECTLY** contact the thesis advisor if you are interested in one of these topics. When you contact the advisor, it would be nice if you could mention (1) **WHY** you are interested in the topic (dreams, parts of the problem, etc), and (2) **WHAT** makes you special for the projects (e.g., class work, project experience, special programming or math skills, prior work, etc.). Supplementary materials (CV, grades, etc) are highly appreciated. Of course, such materials are not mandatory but they help the advisor to see whether the topic is too easy, just about right or too hard for you.

**Only contact *ONE* potential advisor at a time! If you contact a second one without first concluding discussions with the first advisor (i.e., deciding for or against the thesis with her or him), we may not consider you at all. Only if you are super excited about at most two topics, send an email to both supervisors, so that the supervisors are aware of the additional interest.**

**Scope:** Master's thesis **Advisor:** Michael Lutter, Joni Pajarinen **Start:** ASAP

**Topic:** Imagine a future where a user teaches a robot how to combine parts into objects or structures. Now imagine that the robot is able to go even further: build objects with desired properties, for example height, stability, or shape, using object parts which it has not seen before. In this thesis, we use reinforcement learning and Monte Carlo tree search to train a robot to build novel objects from novel object parts based on a database of previously demonstrated object part assemblies. Object parts and objects will be modeled as graphs where each graph node specifies which kinds of other graph nodes it can be connected to. Putting two object parts together then results in a bigger graph merged from the two object part graphs. Experiments will be performed mainly in simulation, but, if desired, the approach can also be evaluated on a real robot. Suitable background knowledge for this thesis can be gained, for example, in robot learning or reinforcement learning lectures.
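To make the graph representation concrete, here is a minimal sketch of merging two part graphs. The class `PartGraph`, the fields `kinds` and `accepts`, the functions `can_connect` and `merge`, and the "stud"/"socket" example are all hypothetical names chosen for illustration; the actual thesis representation may differ.

```python
from dataclasses import dataclass, field


@dataclass
class PartGraph:
    """Hypothetical part model: typed nodes plus undirected edges."""
    kinds: dict = field(default_factory=dict)    # node id -> node kind
    accepts: dict = field(default_factory=dict)  # node id -> kinds it may attach to
    edges: set = field(default_factory=set)      # set of frozenset({u, v})


def can_connect(a, u, b, v):
    """Nodes u (in a) and v (in b) fit if each accepts the other's kind."""
    return b.kinds[v] in a.accepts[u] and a.kinds[u] in b.accepts[v]


def merge(a, u, b, v):
    """Join graphs a and b with a new edge between u and v.

    Node ids are prefixed with "a"/"b" so they stay unique in the result.
    """
    if not can_connect(a, u, b, v):
        raise ValueError("incompatible connection")
    merged = PartGraph()
    for tag, g in (("a", a), ("b", b)):
        for n, kind in g.kinds.items():
            merged.kinds[(tag, n)] = kind
            merged.accepts[(tag, n)] = set(g.accepts[n])
        for e in g.edges:
            merged.edges.add(frozenset((tag, n) for n in e))
    merged.edges.add(frozenset({("a", u), ("b", v)}))
    return merged


# A toy brick with one stud (node 0) and one socket (node 1).
brick = PartGraph(
    kinds={0: "stud", 1: "socket"},
    accepts={0: {"socket"}, 1: {"stud"}},
    edges={frozenset({0, 1})},
)
tower = merge(brick, 0, brick, 1)  # stack one brick on another
```

A planner such as Monte Carlo tree search could then treat each valid `merge` call as one action and the merged graph as the next state.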

**Scope:** Master's thesis **Advisor:** Tuan Dam, Joni Pajarinen **Start:** ASAP

**Topic:** Google DeepMind recently showed how Monte Carlo Tree Search (MCTS) combined with neural networks can be used to play Go at a super-human level. However, one disadvantage of MCTS is that the search tree grows exponentially with the planning horizon. In this Master thesis the student will integrate the main advantage of MCTS, namely optimistic decision making, into a policy representation whose size is limited with respect to the planning horizon. The outcome will be an approach that can plan further into the future. The application domain will include partially observable problems where decisions can have far-reaching consequences.
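As a reminder of what optimistic decision making commonly looks like in MCTS, the sketch below implements the UCB1 selection rule used by the UCT variant. This is a standard textbook rule shown for illustration only, not the method to be developed in the thesis; the function name `ucb1_select` and the default exploration constant are assumptions.

```python
import math


def ucb1_select(total_value, visits, c=math.sqrt(2)):
    """Pick the child maximising mean value plus an exploration bonus.

    total_value[i] is the summed return of child i, visits[i] its visit count.
    """
    parent_visits = sum(visits)
    best, best_score = None, -math.inf
    for i, (w, n) in enumerate(zip(total_value, visits)):
        if n == 0:
            return i  # unvisited children are tried first
        score = w / n + c * math.sqrt(math.log(parent_visits) / n)
        if score > best_score:
            best, best_score = i, score
    return best


# A rarely visited child can win over a well-explored one with a similar mean.
choice = ucb1_select(total_value=[5.0, 1.0], visits=[10, 1])
```

The bonus term shrinks as a child is visited more often, so the search is optimistic about poorly explored branches. The tree itself still has on the order of b^d nodes for branching factor b and depth d, which is the growth the thesis aims to avoid.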

**Scope:** Master's or Bachelor's thesis **Advisor:** Tuan Dam, Boris Belousov **Start:** ASAP **Topic:** A recently discovered reformulation of reinforcement learning based on online optimization [1] highlighted an intriguing link between policy search and the famous mirror descent algorithm, which led to the development of a family of efficient model predictive control (MPC) approaches [2]. In parallel, building upon the long tradition of acceleration in optimization, an accelerated version of mirror descent search has been proposed [3], which is closely related to relative entropy policy search (REPS) [4].
The proposed thesis will survey these exciting recent developments and give a unifying picture of acceleration in policy search on the basis of accelerated mirror descent. In particular, the algorithm introduced in [3] will be implemented and evaluated on continuous control problems using Quanser platforms and the Barrett WAM robot. A motivated candidate will identify strengths and weaknesses of the current state-of-the-art approaches and work on developing innovative solutions to push the boundary of research.
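For readers new to mirror descent, the following is a minimal sketch of the plain (non-accelerated) algorithm on the probability simplex with the negative-entropy mirror map, which yields a multiplicative, exponentiated update. It is not the algorithm of [3]; the cost vector, step size, and function name are assumptions made for the example.

```python
import math


def mirror_descent_step(p, grad, step_size):
    """One entropic mirror descent step on the probability simplex.

    Each probability is reweighted by exp(-step_size * gradient) and the
    result is renormalised so that it sums to one.
    """
    weights = [pi * math.exp(-step_size * gi) for pi, gi in zip(p, grad)]
    total = sum(weights)
    return [w / total for w in weights]


# Hypothetical expected cost of three discrete actions; the objective is the
# expected cost <policy, costs>, whose gradient w.r.t. the policy is `costs`.
costs = [1.0, 0.2, 0.7]
policy = [1 / 3, 1 / 3, 1 / 3]
for _ in range(50):
    policy = mirror_descent_step(policy, costs, step_size=0.5)
```

After the loop, almost all probability mass sits on the cheapest action. KL-regularized policy updates have the same exponential reweighting form, which is one way to see the connection to REPS mentioned above.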

[1]: Cheng, C. A., Yan, X., Ratliff, N., & Boots, B. (2019, May). Predictor-Corrector Policy Optimization. In ICML (pp. 1151-1161).

[2]: Wagener, N., Cheng, C. A., Sacks, J., & Boots, B. (2019). An Online Learning Approach to Model Predictive Control. In RSS.

[3]: Miyashita, M., Yano, S., & Kondo, T. (2018). Mirror descent search and its acceleration. Robotics and Autonomous Systems, 106, 107-116.

[4]: Peters, J., Mülling, K., & Altun, Y. (2010, July). Relative entropy policy search. In AAAI.

**Scope:** Master's Thesis **Advisors:** Davide Tateo **Start:** ASAP **Topic:**

Moving in dynamic and challenging environments is one of the key issues of robot locomotion. Most previous work in this area has focused on solving specific problems, such as efficient learning of compact motor primitives, exploitation of bio-inspired motion models, or fall detection. The objective of this thesis is to develop a general framework for learning locomotion skills that are flexible and adapt to current sensory measurements and environment characteristics. To achieve high performance and generalization, we will build our framework on top of previous work on Dynamic Movement Primitives and Synergies, and improve their generalization capabilities with state-of-the-art Deep Reinforcement Learning approaches. The proposed methodology will be evaluated in simulation, using existing Nao and Darwin OP models, the Gazebo simulator, and ROS tools. An optional final step will be to bring the learned behavior to a real Nao robot.

This thesis will exploit a newly proposed RL library, Mushroom, and its ROS interface Mushroom ROS that can be used to work both with simulators and real robots.

**Minimum knowledge**

- Basic knowledge of Reinforcement Learning;
- Good Python programming skills;
- Basic knowledge of ROS (Robot Operating System).

**Preferred knowledge**

- Knowledge of deep learning toolboxes (PyTorch);
- Knowledge of recent Deep RL methodologies;
- Knowledge of/experience with the Gazebo simulator.

**Accepted candidate will**

- Improve the current existing simulator platform to run Reinforcement Learning experiments with the Nao and Darwin robots;
- Perform Deep Reinforcement Learning experiments with the Gazebo simulator;
- Optionally, transfer the learned policy on a real Nao robot.

**Scope:** Master's thesis **Advisor:** Joe Watson **Start:** ASAP **Topic:**

Recent work has presented a control-as-inference formulation that frames optimal control as input estimation. The linear Gaussian assumption can be shown to be equivalent to the LQR solution, while approximate inference through linearization can be viewed as a Gauss–Newton method, similar to popular trajectory optimization methods (e.g. iLQR). However, the linearization approximation limits both the tolerable environment stochasticity and exploration during inference.
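For orientation, the LQR solution that the linear Gaussian case is said to recover can be computed with a backward Riccati recursion. The sketch below does this for a scalar system; it shows only the classical LQR baseline, not the control-as-inference formulation itself, and the function name and the numbers are assumptions.

```python
def scalar_lqr_gains(a, b, q, r, horizon):
    """Backward Riccati recursion for x[t+1] = a*x[t] + b*u[t].

    The stage cost is q*x^2 + r*u^2 and the terminal cost is q*x^2.
    Returns the time-ordered feedback gains (u = -k*x) and the
    cost-to-go weight at the first time step.
    """
    p = q  # terminal cost-to-go weight
    gains = []
    for _ in range(horizon):
        k = (b * p * a) / (r + b * p * b)  # optimal feedback gain
        p = q + a * p * a - a * p * b * k  # Riccati update
        gains.append(k)
    gains.reverse()  # gains[0] belongs to the first time step
    return gains, p


gains, p = scalar_lqr_gains(a=1.0, b=1.0, q=1.0, r=1.0, horizon=50)
```

For this choice of parameters the stationary cost-to-go weight solves P² - P - 1 = 0, so `p` approaches the golden ratio and the first gain approaches its reciprocal. Methods such as iLQR apply the same recursion to a local linearization of nonlinear dynamics, which is where the approximation discussed above enters.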

The aim of this thesis is to use alternative approximate inference methods (e.g. quadrature, Monte Carlo, variational) and to investigate their benefits for stochastic optimal control and trajectory optimization. Ideally, prospective students are interested in optimal control, approximate inference methods and model-based reinforcement learning.

**Scope:** Master's thesis, Bachelor's thesis **Advisor:** Michael Lutter **Start:** Anytime Soon **Topic:** One way to achieve reinforcement learning using few samples is model-based reinforcement learning, but historically these approaches have lacked asymptotic performance comparable to that of model-free approaches. Only very recently, two papers showed comparable asymptotic performance with lower sample complexity using probabilistic models composed of network ensembles.

Within this thesis you should develop a probabilistic version of Deep Lagrangian Networks (Lutter et al., ICLR 2019), a physics-derived architecture that only allows physically plausible models, and use this probabilistic representation for model-based exploration and policy improvement. For the probabilistic version you should use the deterministic and robust Bayesian network approach presented earlier this year (Wu et al., ICLR 2019).

Finally, you should demonstrate your sample-efficient approach on the physical Cartpole & Furuta Pendulum, learn the swing-up using only the physical system, and publish a paper about it :D. So if you are excited to try out Bayesian Deep Learning and want to get your hands dirty with model-based RL, this thesis is perfect for you. If you are interested, just message me (michael@robot-learning.de) and I am happy to discuss more details.

**TL;DR:**

- Extend DeLaN to Bayesian deep learning
- Use this probabilistic model for efficient exploration and policy improvement
- Impress everybody by learning the swing-up only using the physical cartpole
- Publish your thesis at a machine learning conference
- Good knowledge of machine learning & deep learning required
- Good programming skills in Python required

**Scope:** Bachelor's thesis/Master's Thesis/Project **Advisors:** Davide Tateo, Carlo D'Eramo, Tianyu Ren **Start:** ASAP **Topic:**
One of the most important fields of Reinforcement Learning (RL) research is Deep RL. Exploiting the capabilities of deep approximators, RL algorithms are able to solve complex control tasks in different fields, such as autonomous learning of robotic skills or playing Atari games.
However, the main drawback of Deep RL research is that experimental evaluation is difficult for several reasons. First, most implementations of state-of-the-art algorithms are standalone, so it is difficult to reuse them for different benchmarks and environments.
Furthermore, when an implementation is available, it is often difficult to understand and contains minor modifications w.r.t. the original algorithm that improve its performance on specific benchmark tasks.
Finally, the experimental evaluation is often not done properly, as the experiments are often tweaked to show better performance of the newly proposed algorithms w.r.t. the state-of-the-art approaches. The results are also often not presented properly, using plots and experimental setups that are not adequate to measure and compare the performance of different agents.

The scope of the thesis is to work on a newly proposed RL library, Mushroom, whose objective is to provide a common platform for RL research by providing algorithms, common benchmarks, and interfaces to simulators and real robots. The final objective of the thesis is to define a scientifically sound method to design, execute, and present experiments for (Deep) RL algorithms.

**Minimum knowledge**

- Basic Knowledge of Reinforcement Learning;
- Good Python programming skills.

**Preferred knowledge**

- Knowledge of deep learning toolboxes (PyTorch, TensorFlow);
- Knowledge of recent Deep RL methodologies.

**Accepted candidate will**

- Perform Deep Reinforcement Learning experiments in classical benchmarks;
- Run Deep Reinforcement Learning algorithms on real robotic platforms;
- Design a proper scientific methodology to compare RL algorithms;
- Implement novel benchmarks to be solved with existing Deep RL approaches.

**Scope:** Master's Thesis **Advisor:** Julen Urain De Jesus, Hany Abdulsamad **Start:** Anytime **Topic:** Solving the infinite horizon optimal control problem is very hard, and dynamic programming has to deal with the curse of dimensionality. To make the problem tractable, infinite horizon value functions or controllers have been learned with trajectory optimization techniques. The first works in this direction were done by Atkeson et al. two decades ago. One of the most notable recent cases is Guided Policy Search, in which iLQR was applied in order to learn state-feedback controllers.

With the explosion of interest in Neural Ordinary Differential Equations (Neural ODEs) and the strong similarities between Neural ODEs and Pontryagin's Maximum Principle, this project aims to study the possibility of applying indirect methods from trajectory optimization as a basis for learning both infinite horizon value functions and state-feedback controllers. See this write up for more details.

**Scope:** Master's or Bachelor's thesis **Advisor:** Dorothea Koert **Start:** ASAP **Topic:** In the context of the KoBo34 project, which aims to build an assistive robot for elderly people, we offer different thesis topics in the context of learning robot skills for human robot interaction as well as predicting human motions into the future and recognizing human intentions. If you are interested in this research area please contact me directly to discuss more concrete topics.

**Scope:** Master's or Bachelor's thesis **Advisor:** Riad Akrour, Oleg Arenz **Start:** ASAP **Topic:** Correlated exploration is any exploration mechanism that enforces correlation of the action noise with respect to time or states. Correlated exploration is important for robotics in order to reduce or eliminate jerkiness of exploration and maintain the physical integrity of the robot. Correlated exploration was studied on low dimensional policy representations [1, 2], and we demonstrated suitability of such a learning scheme, for specialized policies, directly on a robotics platform [3]. It has also been shown that correlated exploration can be applied to larger, neural network based, policies [4]. However, the exploration scheme of [4], if seen as an episodic contextual policy search algorithm, is rather primitive in its adaption of the exploration noise, and does not offer the necessary guarantees to be applied directly on a robot. In this thesis, we propose to leverage our expertise in entropy regularized policy search algorithms [5, 6] to improve over these shortcomings in order to provide a safe and efficient correlated exploration algorithm for robotics. The successful candidate is expected to investigate the following topics:

- Set up a baseline by integrating the correlated exploration of [4] into recent versions of DDPG such as [7].
- Compare uncorrelated and correlated exploration on simulated tasks and on the Quanser robots.
- Improve over existing correlated exploration formulations by, for example, integrating the gradient update of DDPG to our well founded formulations of entropy regularized episodic policy search algorithms [5, 6].

The successful candidate is expected to conduct their thesis with scientific rigor and a drive for quality such that their work finds its place at a top machine learning or robotics conference.
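To illustrate the difference between uncorrelated and time-correlated action noise, the sketch below compares white Gaussian noise with an Ornstein-Uhlenbeck process, the temporally correlated noise used in the original DDPG paper. It is only an illustration of correlation in time, not the parameter-space scheme of [4]; all function names and parameter values are assumptions.

```python
import random


def white_noise(n, sigma, rng):
    """Independent Gaussian noise: each sample ignores the previous one."""
    return [rng.gauss(0.0, sigma) for _ in range(n)]


def ou_noise(n, sigma, theta, rng, dt=1.0):
    """Ornstein-Uhlenbeck noise: each sample drifts back towards zero."""
    x, out = 0.0, []
    for _ in range(n):
        x += -theta * x * dt + sigma * (dt ** 0.5) * rng.gauss(0.0, 1.0)
        out.append(x)
    return out


def lag1_autocorrelation(xs):
    """Sample correlation between consecutive entries of xs."""
    mean = sum(xs) / len(xs)
    num = sum((a - mean) * (b - mean) for a, b in zip(xs, xs[1:]))
    den = sum((a - mean) ** 2 for a in xs)
    return num / den


rng = random.Random(0)
uncorrelated = white_noise(5000, sigma=0.2, rng=rng)
correlated = ou_noise(5000, sigma=0.2, theta=0.05, rng=rng)
rho_white = lag1_autocorrelation(uncorrelated)
rho_ou = lag1_autocorrelation(correlated)
```

With `theta=0.05` consecutive Ornstein-Uhlenbeck samples are strongly correlated, so the perturbation applied to the robot changes smoothly, while the white-noise sequence jumps independently at every step.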

[1] Rückstieß, T. et al.; State-dependent exploration for policy gradient methods; ECML 2008.

[2] van Hoof, H. et al.; Generalized exploration in policy search; MLJ 2017.

[3] Parisi, S. et al.; Reinforcement learning vs human programming in tetherball robot games; IROS 2015.

[4] Plappert, M. et al.; Parameter space noise for exploration; ICLR 2018.

[5] Akrour, R. et al.; Model-free trajectory-based policy optimization with monotonic improvement; JMLR 2018.

[6] Arenz, O. et al.; Efficient gradient-free variational inference using policy search; ICML 2018.

[7] Fujimoto, S. et al.; Addressing function approximation error in actor-critic methods; ICML 2018.

**Scope:** Master's thesis **Advisor:** Oleg Arenz, Joni Pajarinen **Start:** ASAP **Topic:** When confronted with large piles of entangled or otherwise stuck together objects a robot has to separate the objects before further manipulation is possible. For example, in waste segregation the robot may put different types of objects into different containers. In this Master thesis project, one robot will learn to disentangle objects and another adversarial robot will learn to entangle objects. Learning will be done on real robots shown in the picture right. **Background knowledge:** robot learning

**Scope:** Master's thesis, Bachelor's thesis **Advisor:** Jan Peters, Ruth Stock-Homburg, Katharina Schneider **Start:** ASAP

**Topic:**
Companies in various industries have started introducing anthropomorphic social robots that interact with customers by gesturing and showing facial expressions with their extremities and head. In this way, they have a social presence that, in turn, can create an emotional bond with the human during the interaction. Accordingly, physical and haptic contact between a social robot and a human is an important part of human-robot interaction. Handshaking is a simple human interaction, but it is a complex movement that can be applied in several different social contexts, such as greeting or congratulating. Therefore, an anthropomorphic social robot that interacts with humans should be motor intelligent and able to show human-like, authentic handshake behavior. While first theoretical frameworks of human hand movement during handshaking have been investigated, their implementation on anthropomorphic robots using the handshake Turing test is not yet well understood. The thesis is embedded in the interdisciplinary FIF project "Handshake Turing Test – Androide robot vs. human." The aim of this thesis is first to survey the literature on theories of the human handshake and the handshake Turing test, second to develop a concept of handshaking for an anthropomorphic social robot, and third to test the concept on our real anthropomorphic robot Elenoide (see picture).

**Scope:** Bachelor's thesis/Project **Advisor:** Davide Tateo **Start:** ASAP **Topic:**
Hierarchical Reinforcement Learning (HRL) is the field of Reinforcement Learning (RL) that considers structured agents. In this field, a high-level task is decomposed into simpler subtasks. The resulting control policy is represented as a hierarchy of policies, where each policy solves a subtask. While the original HRL literature focuses on how to exploit domain knowledge and structured exploration to speed up learning, the more recent approaches, based on Deep Learning, focus on using the hierarchical structure to solve tasks that cannot be solved, or that are difficult to learn, using classical Deep RL approaches. While classical HRL approaches are particularly well suited for MDPs with finite state-action spaces, the more recent Deep HRL approaches can work in complex robotic tasks with continuous states and actions.

One major drawback of the recent literature is that Deep HRL approaches share one of the major issues of "flat" Deep RL: the resulting policy is difficult for humans to interpret and thus cannot be trusted in safety-critical applications, as we cannot analyze and predict its global behavior. Another major drawback of Deep HRL algorithms is that it is difficult to insert prior knowledge of the environment into the policy structure, making it even more difficult to apply these kinds of algorithms in real-world scenarios.

To solve these issues, we propose a novel HRL framework, inspired by control theory, where the design of the hierarchical agent is performed using block diagrams. This framework simplifies the design of hierarchical agents and proposes a different paradigm for HRL: we build structured agents that do not execute a policy following the stack principle, i.e., function calls, but instead are composed of a set of different parallel controllers. More details about this framework can be found here.

The objective of this thesis is to simplify the design of hierarchical agents using the above-mentioned framework by implementing graphical tools to define easily the structure of the agent and analyze the behavior of the agent while interacting with the environment. Also, we need to improve the existing codebase by refactoring interfaces and implementing new features.

**Minimum knowledge**

- Good Python programming skills.

**Preferred knowledge**

- Knowledge of Python graphical and graph libraries;
- Basic knowledge of Reinforcement Learning;
- Knowledge of recent Deep RL methodologies.

**Accepted candidate will**

- Learn the basics of the proposed framework by looking at the existing codebase;
- Implement graphical tools to design and analyze Hierarchical Reinforcement Learning Agents;
- Refactor the currently existing framework to design HRL agents;
- Add new functionalities to the Hierarchical Reinforcement Learning Framework;
- Test the developed framework in toy problems or, optionally, on real robots;
- Optionally, implement some standard Hierarchical Reinforcement Learning algorithms.

**Scope:** Master's thesis **Advisor:** Carlo D'Eramo **Start:** ASAP **Topic:**
Curriculum Reinforcement Learning (RL) is an effective way of addressing the sample-efficiency and feature extraction issues in Deep RL. It frames the learning of a complex task as the sequential learning of simpler tasks ordered by increasing complexity. This way, the agent always starts learning the next task from an effective basin resulting from the learning of previous ones, and this process makes it possible to extract more meaningful features and improve sample-efficiency through knowledge transfer.

This thesis starts from the assumption that, as the complexity of tasks sequentially increases, the complexity of the function approximator used to learn them should increase accordingly. In particular, two studies will be conducted in parallel. (1) A deep neural network is progressively enlarged to increase its representational power. The way the increment is performed will be the subject of the study, which will be carried out by finding measures to assess the actual increase of complexity and the minimum network size required to learn effectively. (2) The connections of a deep neural network are sparsified (e.g. via LASSO regularization) at the beginning, and the sparsification is progressively lessened as task complexity increases. (1) and (2) are similar methodologies, but both are worth studying. In particular, (1) has the advantage of building a network with the most efficient size for the considered task, but a rigorous way to grow the network is challenging to derive; on the other hand, (2) has the advantage of using a network with a fixed size, but it has no control over its structure and the number of its parameters. This thesis aims at studying both approaches, deriving theoretically motivated methodologies, and showing their empirical benefit on challenging Deep RL problems.
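One possible reading of approach (2) is sketched below: an L1 (LASSO-style) penalty coefficient is annealed over the task sequence, and a soft-thresholding step, the proximal operator of the L1 penalty, shrinks the weights. The linear schedule, the function names, and the numbers are assumptions made for illustration, not the methodology the thesis is expected to derive.

```python
def l1_coefficient(task_index, num_tasks, lam_start=1e-2, lam_end=0.0):
    """Linearly anneal the L1 penalty from the first task to the last one."""
    if num_tasks < 2:
        return lam_end
    frac = task_index / (num_tasks - 1)
    return lam_start + frac * (lam_end - lam_start)


def soft_threshold(weights, threshold):
    """Shrink each weight towards zero; weights below the threshold become zero."""
    return [max(abs(w) - threshold, 0.0) * (1 if w > 0 else -1) for w in weights]


weights = [0.5, -0.03, 0.002, -0.2]
num_tasks = 3
active_per_task = []
for task_index in range(num_tasks):
    lam = l1_coefficient(task_index, num_tasks, lam_start=0.05)
    shrunk = soft_threshold(weights, lam)
    active_per_task.append(sum(1 for w in shrunk if w != 0))
```

In this toy run the number of non-zero weights grows as the penalty is relaxed, which mirrors the idea of letting the effective capacity of the network increase with task complexity.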

**Minimum knowledge**

- Basic knowledge of Reinforcement Learning and Deep Learning;
- Good Python programming skills.

**Preferred knowledge**

- Knowledge of PyTorch library;
- Knowledge of recent Deep RL methodologies;
- Knowledge of recent Multi-Task, Transfer, Meta, and Curriculum RL methodologies.

**Accepted candidate will**

- Implement the studied methodologies in Python, desirably using the RL library Mushroom (https://github.com/AIRLab-POLIMI/mushroom);
- Perform experiments on the Quanser robots and simulated environments (Atari/MuJoCo);
- Have the opportunity of publishing their work at Machine Learning conferences.

**Scope:** Master's thesis **Advisor:** Vincent Berenz (a collaborator at the Max Planck Institute for Intelligent Systems in Tübingen) **Start:** ASAP **Topic:**

Scripted robotic dance is common. On the other hand, interactive dance, in which the robot uses runtime sensory information to continuously adapt its moves to those of its (human) partner, remains challenging. It requires the integration of various sensors, action modalities and cognitive processes. The selected candidate's objective will be to develop such an interactive dance, based on the software suite for simultaneous perception and motion generation that our department has built over the years. The target robot on which the dance will be applied is the wheeled robot Softbank Robotics Pepper. This master thesis is with the Max Planck Institute for Intelligent Systems and is located in Tübingen. More information: https://am.is.tuebingen.mpg.de/jobs/master-thesis-interactive-dance-performed-by-sofbank-robotics-pepper

**Scope:** Master's Thesis, Bachelor's thesis **Advisor:** Marco Ewerton **Start:** Already taken **Topic:** Recent research has leveraged Learning from Demonstrations and Probabilistic Movement Representations to allow humans and robots to efficiently perform tasks together, such as moving objects from one location to another without hitting obstacles in the way (see "Co-manipulation with a Library of Virtual Guiding Fixtures"). In some situations, however, it might not be trivial to provide good demonstrations to the robot. Moreover, the robot and the human might need to adapt their respective behaviors over time in order to get used to each other and achieve better performance in previously unknown environments. Intelligent prosthetic limbs or exoskeletons could, for instance, adapt to human users as the users themselves get accustomed to those devices and as both agents face new environments.
In this project, the student will explore Learning from Demonstrations, Probabilistic Movement Primitives and Policy Search algorithms in order to enable robots to assist humans in shared control tasks.

**Scope:** Master's thesis, Bachelor's thesis **Advisor:** Jan Peters **Start:** ASAP **Topic:** Inspired by results in neuroscience, especially on the cerebellum, Kawato & Wolpert introduced the idea of the MOSAIC (modular selection and identification for control) learning architecture. In this architecture, local forward models, i.e., models that predict future states and events, are learned directly from observations. Based on the prediction accuracy of these models, corresponding inverse models can be learned. In this thesis, we want to focus on the problem of learning to control a robot system with hysteresis in its friction.

**Scope:** Master's Thesis, Bachelor's thesis **Advisor:** Hany Abdulsamad **Start:** ASAP **Topic:** Model-based Reinforcement Learning is an approach to learning complex tasks given local approximations of the nonlinear dynamics of the environment and cost functions. It has proven to be a sample-efficient approach for learning on real robots. Classical approaches for learning such local models place certain restrictions on the overall structure, for example on the number of local components and the switching dynamics. State-of-the-art research has recently moved to more general settings with nonparametric approaches that require less structure. The aim of this thesis is to review the literature on this subject and to compare existing algorithms on real robots like the BioRob or the Barrett WAM.

**Scope:** Master's Thesis **Advisor:** Hany Abdulsamad **Start:** ASAP **Topic:** Standard learning control techniques focus on learning deterministic controllers. Even advanced policy search methods that rely on stochastic search distributions use stochastic controllers only for the purpose of exploration; the final policy is only applied in its deterministic form. There are, however, cases in which a deterministic controller is always sub-optimal, such as in scenarios with random unstructured disturbances. In this thesis we want to address the problem of learning truly stochastic optimal controllers in an adversarial setting, and investigate whether adversarial learning can be used to generalize standard policy search methods. This topic includes very interesting and deep connections to robust control, game theory and multi-agent learning.

**Scope:** Master's Thesis, Bachelor's thesis **Advisor:** Hany Abdulsamad **Start:** ASAP **Topic:** A great challenge in applying Reinforcement Learning approaches is the need for human intervention to reset the scenario of a learned task, making the process very tedious and time consuming. A clear example is learning table tennis, where we are either limited to using a ball gun with a predictable pattern of initial positions or a human is needed to play against the robot. However, given a second robotic player, we propose a new setup in which the two agents cooperate to develop two different strategies, where one agent learns to support the second in becoming a great table tennis player. It is interesting to see whether in such a scenario the agents would be able to discover what might resemble a defensive and an aggressive strategy in table tennis. The thesis will concentrate on developing the concept of cooperation and testing the results in simulation and on our own real table tennis setup.

**Scope:** Master's Thesis **Advisor:** Hany Abdulsamad **Start:** ASAP **Topic:** Let's stop reinventing the wheel. Most recent approaches to model-based RL revolve around the concept of trajectory optimization (iLQG, GPS, PILCO, all related to DDP). They can all be categorized in terms of direct and indirect shooting methods, two categories of optimal control that have existed for a very long time. It is time to clarify this connection. The aim of this thesis is first to dive into the literature on control and model-based RL, and second to investigate the possibility of applying information-theoretic bounds to standard optimal control techniques.

**Scope:** Master's Thesis; Bachelor's Thesis **Advisor:** Samuele Tosatto **Start:** Fall Semester 2018

**Topic:** One of the major drawbacks of state-of-the-art RL is its sample inefficiency. Very often, the amount of interaction with the system needed to learn a task is not realistic for real-world applications.
We find a fundamental cause of this inefficiency in the Policy Gradient Theorem: the estimation of the state distribution is often done by Monte Carlo sampling, and thus "on-policy". In fact, while it is easy to obtain an off-policy estimate of the value function, there is currently no method for estimating the state distribution offline.
We propose a method which is able to provide an off-policy estimate of the state distribution, and an analytical solution for the policy gradient.
The derived algorithm seems to be very promising for real-world applications, but it has yet to be enhanced.
The thesis consists of improving the proposed method so that it can solve a robotic task defined together with the student. The ideal applicant must have good mathematical skills, the will to study and deeply understand RL theory and to come up with new ideas, and good programming skills (Python, TensorFlow). What the student will gain from this project is the possibility to better understand RL theory and the state of the art, to work with a robot (our target platform is Darias), and hopefully to have a publication.

**Scope:** Master's Thesis **Advisor:** Pascal Klink, Carlo D'Eramo **Start:** ASAP **Topic:**
Curriculum Learning has proven to be a promising tool for improving the performance of Reinforcement Learning agents by exploiting similarities and relations between tasks. Recent work has formulated Curriculum Learning as a trade-off optimization between local reward improvement and progression towards a target task distribution. However, this work is restricted to the episodic RL setting and hence does not leverage the information contained in each individual transition encountered in the environment.

The goal of this thesis is to extend the formulation to a step-based RL algorithm and investigate the benefits of the resulting algorithm by comparison with current state-of-the-art curriculum learning methods for RL.

**Scope:** Bachelor's / Master's Thesis **Advisor:** Joe Watson **Start:** Anytime **Topic:**
For learning in robotic manipulation, there are two cultures: 'end-to-end' vs inductive biases.
While the former is purely data-driven, the latter incorporates ideas from computer vision and visual servoing for more interpretable and sample-efficient performance.
This is an open-ended project aimed at investigating novel frameworks for perception-based manipulation that combine 'structure' (computer vision) with learning.

The ideal candidate:

- has knowledge of both robotics and (geometric) computer vision
- is interested in working on real robotic manipulators
- can write clean, maintainable software

**Scope:** Master's thesis **Advisor:** Joni Pajarinen **Start:** ASAP

**Topic:** Efficient exploration is one of the most prominent challenges in deep reinforcement learning. In reinforcement learning, exploration of the state space is critical for finding high-value states and connecting them to the actions that cause them. Exploration in model-free reinforcement learning has relied on classical techniques, empirical uncertainty estimates of the value function, or random policies. In model-based reinforcement learning, value bounds have been used successfully to direct exploration. In this Master thesis project the student will investigate how lower and upper value bounds can be used to target exploration in model-free reinforcement learning towards the most promising parts of the state space. This thesis topic requires background knowledge in reinforcement learning gained e.g. through machine learning or robot learning courses.

**Scope:** Master's Thesis **Advisor:** Joe Watson, Michael Lutter **Start:** Anytime **Topic:** Model-based Reinforcement Learning for robotics typically requires learning the nonlinear dynamics of complex multibody mechanical systems. The Recursive Newton-Euler Algorithm (RNEA) is an existing means of efficiently modelling such systems. This project looks at using the Lie algebra perspective of RNEA to implement the algorithm on a differentiable computation graph, the basis of deep learning models. This potentially offers a means of learning high-fidelity and interpretable models for robotics from data. See this write up for more details.