Publication Details

Reference Type: Conference Proceedings
Author(s): Belousov, B.; Peters, J.
Title: Mean squared advantage minimization as a consequence of entropic policy improvement regularization
Journal/Conference/Book Title: European Workshops on Reinforcement Learning (EWRL)
Keywords: policy optimization, entropic proximal mappings, actor-critic algorithms
Abstract: Policy improvement regularization with entropy-like f-divergence penalties provides a unifying perspective on actor-critic algorithms, rendering the policy improvement and policy evaluation steps as primal and dual subproblems of the same optimization problem. For small policy improvement steps, we show that all f-divergences with a twice differentiable generator function f yield a mean squared advantage minimization objective for the policy evaluation step and an advantage-weighted maximum log-likelihood objective for the policy improvement step. The mean squared advantage objective sits between the well-known mean squared Bellman error and mean squared temporal difference error objectives: it requires the expectation of the temporal difference error with respect to the next state only and not the policy, in contrast to the Bellman error, which requires both, and the temporal difference error, which requires neither. The advantage-weighted maximum log-likelihood policy improvement rule emerges as a linear approximation to a more general weighting scheme in which the weights are a monotone function of the advantage. Thus, the entropic policy regularization framework provides a rigorous justification for the common practice of least squares value function fitting accompanied by advantage-weighted maximum log-likelihood policy parameter estimation, while at the same time pointing in the direction in which this classical actor-critic approach can be extended.
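The actor-critic recipe the abstract describes — least-squares value fitting against one-step targets, paired with an advantage-weighted log-likelihood policy step — can be sketched in a tabular setting. The two-state MDP below (transition tensor `P`, rewards `R`, discount, and step sizes) is invented purely for illustration and is not from the paper; the advantage here is the TD error averaged over the next state only, matching the distinction the abstract draws from the Bellman and TD errors.

```python
import numpy as np

# Hypothetical 2-state, 2-action MDP; all numbers are illustrative, not from the paper.
P = np.array([[[0.9, 0.1], [0.2, 0.8]],     # P[s, a, s']: transition probabilities
              [[0.7, 0.3], [0.05, 0.95]]])
R = np.array([[1.0, 0.0],                   # R[s, a]: expected immediate reward
              [0.5, 2.0]])
gamma = 0.9

logits = np.zeros((2, 2))                   # tabular softmax policy parameters
V = np.zeros(2)                             # value-function estimate

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

for _ in range(200):
    pi = softmax(logits)
    # Advantage: TD error in expectation over the next state only (not over the
    # policy's actions), in contrast to the Bellman error (both expectations)
    # and the raw TD error (neither).
    A = R + gamma * P @ V - V[:, None]      # A[s, a]
    # Policy evaluation: damped least-squares step moving V toward the policy's
    # expected one-step target (mean squared advantage objective).
    V += 0.5 * (pi * A).sum(axis=1)
    # Policy improvement: gradient of the advantage-weighted log-likelihood
    # sum_a pi_old(a|s) A(s, a) log pi_theta(a|s), evaluated at theta = old.
    logits += pi * A - pi * (pi * A).sum(axis=1, keepdims=True)

pi = softmax(logits)  # the policy concentrates on higher-advantage actions
```

The advantage-weighted step is the linear special case of the monotone weighting scheme mentioned in the abstract; replacing the weight `A` with, e.g., `exp(A / beta)` would recover a nonlinear member of that family.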

