An approach to solving reinforcement learning (RL) problems by learning the policy directly. For example, a policy over discrete actions can be parametrized by a softmax, $$\pi_\mathbf{\theta}(A_t|S_t) = \frac{1}{Z}e^{\mathbf{\theta}^T\phi(S_t,A_t)}$$ where $$Z$$ is the normalizing constant, or over continuous actions by a Gaussian, $$A_t \sim N(\mathbf{\theta}^T\phi(S_t),\sigma^2)$$.
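The softmax parameterization above can be sketched in a few lines of numpy; the feature map `phi` and the toy one-hot features are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """Discrete softmax policy: pi(a|s) proportional to exp(theta . phi(s, a)).

    phi is an assumed feature function mapping (state, action) to a
    feature vector of the same dimension as theta.
    """
    logits = np.array([theta @ phi(state, a) for a in actions])
    logits -= logits.max()           # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()       # normalize: this division is the 1/Z

# Toy example: two actions with one-hot features indexed by the action.
phi = lambda s, a: np.eye(2)[a]
probs = softmax_policy(np.array([1.0, 0.0]), phi, state=None, actions=[0, 1])
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged (it cancels in the ratio) but avoids overflow for large $$\mathbf{\theta}^T\phi$$.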

The quality of the policy can be measured by the expected cumulative reward, $$R(\theta) = \mathbb{E}[\sum_{t=0}^T \gamma^t R_t]$$. To learn a good policy, we use gradient ascent on $$R(\theta)$$: $$\theta_{m+1} = \theta_m + \alpha_m \nabla_\theta R(\theta)|_{\theta=\theta_m}$$

Here the step-size sequence $$\alpha_m$$ satisfies $$\sum_{\mathbb{N}} \alpha_m = \infty, \sum_{\mathbb{N}} \alpha_m^2 < \infty$$: the steps are large enough in total to reach any optimum, yet shrink fast enough for the iterates to converge.
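A minimal sketch of the ascent loop with one schedule satisfying these conditions, $$\alpha_m = 1/(m+1)$$; the toy concave objective is an illustrative assumption.

```python
# Gradient ascent with the step-size schedule alpha_m = 1/(m+1):
# sum(alpha_m) diverges while sum(alpha_m^2) converges.
def gradient_ascent(grad, theta0, steps=500):
    theta = theta0
    for m in range(steps):
        alpha = 1.0 / (m + 1)
        theta = theta + alpha * grad(theta)   # ascent: move along +gradient
    return theta

# Toy concave objective R(theta) = -(theta - 3)^2, maximized at theta = 3.
theta_star = gradient_ascent(lambda th: -2.0 * (th - 3.0), theta0=0.0)
```

In RL the exact gradient is unavailable and is replaced by a stochastic estimate, which is exactly the setting these step-size conditions are designed for.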

To get the gradient, first denote the trajectory $$\mathbf{\tau} \sim \mathbb{P}(\mathbf{\tau}|\mathbf{\theta}) = \mathbb{P}(S_0)\prod_{t=0}^T\mathbb{P}(S_{t+1}|S_t,A_t)\pi_\mathbf{\theta}(A_t|S_t)$$. The expected cumulative reward can be written as $$R(\mathbf{\theta}) = \mathbb{E}_\mathbf{\tau}[R(\mathbf{\tau})] = \int \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})R(\mathbf{\tau}) \; d\mathbf{\tau}$$. Using the likelihood ratio trick, $$\nabla_{\mathbf{\theta}}R(\mathbf{\theta}) = \int \nabla_\mathbf{\theta} \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})R(\mathbf{\tau})\,d\mathbf{\tau} = \int \mathbb{P}(\mathbf{\tau}|\mathbf{\theta}) \nabla_\mathbf{\theta}\log \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})R(\mathbf{\tau})\,d\mathbf{\tau} = \mathbb{E}_{\mathbf{\tau}}[ \nabla_\mathbf{\theta}\log \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})R(\mathbf{\tau})]$$. Since the dynamics $$\mathbb{P}(S_{t+1}|S_t,A_t)$$ do not depend on $$\mathbf{\theta}$$, the score simplifies to $$\nabla_\mathbf{\theta}\log \mathbb{P}(\mathbf{\tau}|\mathbf{\theta}) = \sum_{t=0}^T \nabla_\mathbf{\theta}\log \pi_\mathbf{\theta}(A_t|S_t)$$, so the gradient can be estimated from sampled trajectories without knowing the transition model.
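The expectation on the right can be estimated by Monte Carlo. A minimal sketch on an assumed one-step "bandit" problem, where the trajectory is a single action $$a \sim N(\theta, 1)$$, so $$\nabla_\theta \log \mathbb{P}(\tau|\theta) = a - \theta$$; the reward function and sample count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_gradient_estimate(theta, reward, n_samples=200_000):
    """Monte Carlo estimate of E[ grad log P(tau|theta) * R(tau) ]."""
    actions = rng.normal(theta, 1.0, size=n_samples)
    scores = actions - theta                  # grad_theta of log N(a; theta, 1)
    return np.mean(scores * reward(actions))  # sample average of score * reward

# Reward peaks at a = 2; for a ~ N(theta, 1) the true gradient at theta = 0
# is d/dtheta E[-(a - 2)^2] = -2(theta - 2) = 4.
grad = policy_gradient_estimate(0.0, lambda a: -(a - 2.0) ** 2)
```

Note the estimator never differentiates the reward or the sampling process itself, only the log-density, which is what makes the trick applicable when the dynamics are unknown.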

We can reduce variance by subtracting a baseline value $$b$$ from the reward. This keeps the estimator unbiased, since $$\mathbb{E}_\mathbf{\tau}[\nabla_\mathbf{\theta}\log \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})\,b] = b\,\nabla_\mathbf{\theta}\int \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})\,d\mathbf{\tau} = 0$$ for any constant $$b$$; the intuition is that the optimal policy does not change when the reward function is shifted by an additive constant. The REINFORCE algorithm can use the optimal baseline, the value of $$b$$ that minimizes the variance of the estimator.
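The variance reduction is easy to see empirically. A sketch on the same assumed one-step Gaussian bandit, using the mean reward as the baseline (a common practical choice, not necessarily the variance-optimal one):

```python
import numpy as np

rng = np.random.default_rng(1)

# One-step bandit: a ~ N(theta, 1), reward -(a - 2)^2, score = a - theta.
theta = 0.0
actions = rng.normal(theta, 1.0, size=100_000)
rewards = -(actions - 2.0) ** 2
scores = actions - theta                      # grad_theta of log N(a; theta, 1)

# Per-sample gradient terms without and with the baseline.
var_no_baseline = np.var(scores * rewards)
b = rewards.mean()                            # baseline: average reward
var_with_baseline = np.var(scores * (rewards - b))
```

Both versions average to the same gradient (about 4 here), but the baselined terms have markedly lower variance, so fewer samples are needed for a usable estimate.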

## Thoughts

• Relate the $$\alpha_m$$ sequence to Robbins-Monro.
• Mention behavioural cloning.