Publication Details

Reference Type: Conference Proceedings
Author(s): Belousov, B.; Peters, J.
Title: Entropic Regularization of Markov Decision Processes
Journal/Conference/Book Title: 38th International Workshop on Bayesian Inference and Maximum Entropy Methods in Science and Engineering
Keywords: reinforcement learning; actor-critic methods; entropic proximal mappings; policy search
Abstract: The problem of synthesizing an optimal feedback controller for a given Markov decision process (MDP) can in principle be solved by value iteration or policy iteration. However, if the system dynamics and the reward function are unknown, the only way for a learning agent to discover an optimal controller is through interaction with the MDP. During data gathering, it is crucial to account for the lack of information, because otherwise ignorance will push the agent towards dangerous areas of the state space. To prevent such behavior and smooth the learning dynamics, prior works proposed to bound the information loss, measured by the Kullback-Leibler (KL) divergence, at every policy improvement step. In this paper, we consider a broader family of f-divergences that preserve the beneficial property of the KL divergence of providing the policy improvement step in closed form, accompanied by a compatible dual objective for policy evaluation. This entropic proximal policy optimization view gives a unified perspective on compatible actor-critic architectures. In particular, common least-squares value function fitting coupled with advantage-weighted maximum likelihood policy estimation is shown to correspond to the Pearson χ2-divergence penalty. Other connections can be established by considering different choices of the penalty generator function f.
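The closed-form policy improvement step under a KL penalty, as referenced in the abstract, can be sketched for a discrete action space as follows. This is a minimal illustration only; the function name, the temperature parameter eta, and the example numbers are assumptions for demonstration, not taken from the paper:

```python
import numpy as np

def kl_regularized_update(pi_old, advantages, eta):
    """Closed-form KL-penalized policy improvement:
    pi_new(a) proportional to pi_old(a) * exp(A(a) / eta).

    Larger eta penalizes deviation from pi_old more strongly,
    keeping the new policy closer to the old one.
    """
    logits = np.log(pi_old) + advantages / eta
    w = np.exp(logits - logits.max())  # subtract max for numerical stability
    return w / w.sum()

# Hypothetical example: uniform old policy over four actions.
pi_old = np.array([0.25, 0.25, 0.25, 0.25])
adv = np.array([1.0, 0.0, -1.0, 0.5])  # estimated advantages
pi_new = kl_regularized_update(pi_old, adv, eta=1.0)
```

The new policy shifts probability mass towards high-advantage actions while remaining a proper distribution; other choices of the generator function f (e.g. Pearson χ2) would replace the exponential reweighting with a different, but still closed-form, transformation of the advantages.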

