Publication Details

Record Number: 11283
Reference Type: Thesis
Author(s): Carvalho, J.A.C.
Title: Nonparametric Off-Policy Policy Gradient
Journal/Conference/Book Title: Master Thesis
Abstract: In the context of Reinforcement Learning, the Policy Gradient Theorem provides a principled way to estimate the gradient of an objective function with respect to the parameters of a differentiable policy. Computing this gradient involves an expectation over the state distribution induced by the current policy, which is hard to obtain because it depends on the generally unknown dynamics of the environment. Therefore, one way to estimate the policy gradient is by direct interaction with the environment. The need for constant interaction is one of the reasons for the high sample complexity of policy gradient algorithms, and it hinders their direct application in robotics. Off-policy Reinforcement Learning promises to solve this problem by providing better exploration, higher sample efficiency, and the ability to learn from demonstrations by other agents or humans. However, current state-of-the-art approaches cannot cope with truly off-policy trajectories. This work proposes a different path to improving the sample efficiency of off-policy algorithms by providing a fully off-policy gradient estimate. To that end, we construct a Nonparametric Bellman Equation with explicit dependence on the policy parameters, using kernel density estimation and kernel regression to model the transition dynamics and the reward function, respectively. From this equation we extract a value function and a gradient estimate computed in closed form, leading to the Nonparametric Off-Policy Policy Gradient (NOPG) algorithm. We provide empirical results showing that NOPG achieves better sample complexity than state-of-the-art techniques.
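The core idea in the abstract (a Bellman equation whose value function is available in closed form once the dynamics are modeled with kernels) can be illustrated with a small sketch. This is not the thesis's NOPG algorithm; the Gaussian kernel, the bandwidth, and the toy one-dimensional chain below are illustrative assumptions. Kernel regression over sampled transitions yields a row-stochastic matrix P, and the Bellman equation becomes the linear system (I - gamma * P) v = r, which is solvable directly:

```python
import numpy as np

def gaussian_kernel(x, centers, bandwidth):
    # Unnormalized Gaussian kernel weights between one point and all sample centers.
    d = x - centers
    return np.exp(-0.5 * np.sum(d * d, axis=-1) / bandwidth**2)

def nonparametric_values(states, next_states, rewards, gamma=0.99, bandwidth=0.5):
    """Closed-form value estimates at the sampled states.

    Each row of P is a kernel-regression weighting of the next state onto the
    sampled states, so P is row-stochastic and (I - gamma * P) is invertible.
    """
    n = len(states)
    P = np.zeros((n, n))
    for i in range(n):
        w = gaussian_kernel(next_states[i], states, bandwidth)
        P[i] = w / w.sum()  # normalize into a "soft transition" row
    # Solve the Bellman equation (I - gamma * P) v = r in closed form.
    v = np.linalg.solve(np.eye(n) - gamma * P, rewards)
    return v

# Toy 1-D chain: states drift right; reward is concentrated near state 1.0.
rng = np.random.default_rng(0)
s = rng.uniform(0, 1, size=(50, 1))
s_next = np.clip(s + 0.1, 0, 1)
r = np.exp(-((s[:, 0] - 1.0) ** 2) / 0.01)
v = nonparametric_values(s, s_next, r)
```

Because every quantity here is an explicit function of the sampled data, the same construction with a parametric policy inserted into the kernel weights is what makes a closed-form gradient with respect to the policy parameters possible.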

