Vectors will always be written in bold font and lower case, i.e., {$\mathbf{a}$}.
A vector always denotes a column vector, i.e., {$\mathbf{a} = \left[\begin{array}{c} a_1 \\ \vdots \\ a_n \end{array}\right]$}. A row vector is written as {$\mathbf{a}^T$}.
Matrices will always be written in bold font and upper case, i.e., {$\mathbf{A}$}.
Gradients are always defined as row vectors, i.e., {$\frac{d f}{d \mathbf{x}} = \left[ \frac{d f}{d x_1}, \dots, \frac{d f}{d x_n} \right]$}.
The gradient of a vector-valued function {$\mathbf{f}(\mathbf{x})$} is a matrix defined as {$\frac{d \mathbf{f}}{d \mathbf{x}} = \left[\begin{array}{ccc} \frac{d f_1}{d x_1} & \dots & \frac{d f_1}{d x_n} \\ \vdots & \ddots & \vdots \\ \frac{d f_m}{d x_1} & \dots & \frac{d f_m}{d x_n}\end{array} \right]$}.
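A quick numeric sanity check of this layout (a sketch in NumPy; the function {$\mathbf{f}$} here is a made-up example, not one from this document): the analytic Jacobian, with one row per output {$f_i$} and one column per input {$x_j$}, should match a finite-difference approximation built column by column.

```python
import numpy as np

# Made-up vector-valued function f: R^2 -> R^2, for illustration only.
def f(x):
    return np.array([x[0] * x[1], np.sin(x[0])])

# Analytic Jacobian df/dx: row i is the gradient (a row vector) of f_i.
def jacobian(x):
    return np.array([[x[1],         x[0]],
                     [np.cos(x[0]), 0.0]])

x = np.array([0.5, 2.0])

# Finite differences: column j approximates df/dx_j.
eps = 1e-6
J_num = np.stack([(f(x + eps * e) - f(x - eps * e)) / (2 * eps)
                  for e in np.eye(2)], axis=1)

print(np.allclose(jacobian(x), J_num, atol=1e-5))  # True
```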
The expectation of a function {$f(\mathbf{x})$} with respect to a distribution {$p(\mathbf{x})$} will be written as {$$\mathbb{E}_{p(\mathbf{x})}[f(\mathbf{x})] = \int p(\mathbf{x}) f(\mathbf{x}) d\mathbf{x} $$}
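In practice this integral is often approximated by Monte Carlo sampling: draw samples from {$p(\mathbf{x})$} and average {$f(\mathbf{x})$} over them. A minimal sketch (the choice of {$p$} and {$f$} is illustrative, not from this document):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo estimate of E_p[f(x)]: draw x ~ p, average f(x).
# Illustrative choice: p = N(0, 1) and f(x) = x^2, so the exact value is 1.
samples = rng.standard_normal(100_000)
estimate = np.mean(samples ** 2)
print(estimate)  # close to 1.0
```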
{$\mathbf{\tau}$} ... 1. torques (a motor command), 2. a trajectory, or 3. the temporal scaling parameter of a movement primitive (then a scalar {$\tau$})
{$\mathbf{a}$} ... action ({$\mathbf{a}$} and {$\mathbf{u}$} are often used interchangeably)
{$\mathbf{s}$} ... state of the agent (used in most RL literature)
{$\mathbf{x}$} ... 1. state of the system (used in the control literature; {$\mathbf{x}$} and {$\mathbf{s}$} are often used interchangeably), 2. task space coordinates (for example end-effector coordinates), 3. input sample for supervised learning methods
{$\mathbf{y}$} ... 1. state of a dynamical movement primitive, 2. output sample for supervised learning methods
{$\mathbf{f}$} ... 1. {$\mathbf{f}(\mathbf{q})$} ... forward kinematics, 2. {$\mathbf{f}(\mathbf{x}, \mathbf{u})$} (or similar notation for state and control) ... forward dynamics
{$\mathbf{J}$} ... Jacobian (of the forward kinematics)
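As an illustration of {$\mathbf{f}(\mathbf{q})$} and {$\mathbf{J}$}, a sketch for a planar two-link arm (the link lengths are made-up values, not from this document); the Jacobian maps joint velocities to end-effector velocities via {$\dot{\mathbf{x}} = \mathbf{J}(\mathbf{q}) \dot{\mathbf{q}}$}:

```python
import numpy as np

L1, L2 = 1.0, 0.5  # made-up link lengths of a planar 2-link arm

# Forward kinematics f(q): joint angles -> end-effector position x.
def fkin(q):
    return np.array([
        L1 * np.cos(q[0]) + L2 * np.cos(q[0] + q[1]),
        L1 * np.sin(q[0]) + L2 * np.sin(q[0] + q[1]),
    ])

# Jacobian J(q) = df/dq: one row per task-space coordinate.
def jac(q):
    s1, c1 = np.sin(q[0]), np.cos(q[0])
    s12, c12 = np.sin(q[0] + q[1]), np.cos(q[0] + q[1])
    return np.array([[-L1 * s1 - L2 * s12, -L2 * s12],
                     [ L1 * c1 + L2 * c12,  L2 * c12]])

q = np.array([0.3, 0.7])
qdot = np.array([0.1, -0.2])
xdot = jac(q) @ qdot  # differential kinematics: xdot = J(q) qdot
```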
{$\mathbf{\mu}^{\pi}(\mathbf{s})$} ... state visitation distribution of policy {$\pi$}
{$\mathbf{\mu}_0(\mathbf{s})$} ... initial state distribution
{$r(\mathbf{s}, \mathbf{a})$} ... reward function
{$J_{\pi}$} ... expected long term reward of policy {$\pi$}
{$V^{\pi}$} ... value function of policy {$\pi$}
{$Q^{\pi}$} ... state-action value function of policy {$\pi$}
{$V^{*}$} ... optimal value function
{$Q^{*}$} ... optimal state-action value function
Policy Gradients
{$\pi(\mathbf{a}|\mathbf{s}; \mathbf{\theta})$} ... lower-level policy for controlling the robot (stochastic)
{$\mathbf{a} = \pi(\mathbf{s}; \mathbf{\theta})$} ... lower-level policy for controlling the robot (deterministic)
{$\mathbf{\theta}$} ... parameter vector of the lower-level policy
{$\pi(\mathbf{\theta}|\mathbf{\omega})$} ... upper-level policy (for choosing the parameters of the lower-level policy)
{$\mathbf{\omega}$} ... parameter vector of the upper-level policy
{$J_{\mathbf{\theta}}$} or {$J_{\mathbf{\omega}}$} ... expected return as a function of the lower-level policy parameters (left) or the upper-level policy parameters (right)
{$\nabla_{\mathbf{\theta}}$} or {$\nabla_{\mathbf{\omega}}$} ... gradient with respect to the lower-level policy parameters (left) or the upper-level policy parameters (right)
{$R^{[i]}$} ... return of the {$i$}-th executed episode
{$Q_t^{[i]}$} ... reward to come from time step {$t$} in the {$i$}-th executed episode
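The two quantities are related by {$R^{[i]} = Q_0^{[i]}$}: the return is the reward to come at the first time step. A minimal discounted sketch (the rewards and discount factor {$\gamma$} are made-up values; the undiscounted case is {$\gamma = 1$}):

```python
import numpy as np

gamma = 0.99  # assumed discount factor

def rewards_to_go(rewards, gamma):
    """Q_t = sum_{k >= t} gamma^(k - t) * r_k, computed backwards in one pass."""
    q = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        q[t] = running
    return q

rewards = [1.0, 0.0, 2.0]         # made-up per-step rewards of one episode
Q = rewards_to_go(rewards, gamma)
R = Q[0]                          # the return is the reward to come at t = 0
print(Q, R)
```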
{$\mathbf{G}(\mathbf{\theta})$} ... Fisher information matrix (FIM)
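The FIM can be written as an expectation of the outer product of the score, {$\mathbf{G}(\mathbf{\theta}) = \mathbb{E}\left[\nabla_{\mathbf{\theta}} \log \pi \; \nabla_{\mathbf{\theta}} \log \pi^T\right]$}, which suggests a sample-based estimate. A sketch for an illustrative one-dimensional Gaussian policy (the policy and its parameter values are made-up; for {$\mathcal{N}(\theta, \sigma^2)$} with fixed {$\sigma$} the exact FIM is {$1/\sigma^2$}):

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative 1-D Gaussian policy pi(a; theta) = N(theta, sigma^2), sigma fixed.
theta, sigma = 0.5, 2.0

# Monte Carlo estimate of G(theta) = E[(d log pi / d theta)^2].
a = rng.normal(theta, sigma, size=200_000)
score = (a - theta) / sigma**2   # d/dtheta log pi(a; theta)
G = np.mean(score ** 2)
print(G, 1.0 / sigma**2)  # estimate vs. exact value 0.25
```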