An approach to solving reinforcement learning (RL) problems by learning the policy directly. For example, a policy over discrete actions can be parametrized by a softmax, $$\pi_\mathbf{\theta}(A_t|S_t) = \frac{1}{Z}e^{\mathbf{\theta}^T\phi(S_t,A_t)}$$ where $$Z$$ is the normalizing constant, or over continuous actions by a Gaussian, $$A_t \sim N(\mathbf{\theta}^T\phi(S_t),\sigma^2)$$.
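The softmax parameterization above can be sketched in a few lines of numpy; the feature map `phi` and the toy one-hot features are illustrative assumptions, not part of any particular library.

```python
import numpy as np

def softmax_policy(theta, phi, state, actions):
    """Discrete softmax policy: pi(a|s) proportional to exp(theta . phi(s, a)).

    phi is an assumed feature function mapping (state, action) to a
    feature vector of the same dimension as theta.
    """
    logits = np.array([theta @ phi(state, a) for a in actions])
    logits -= logits.max()           # subtract max for numerical stability
    probs = np.exp(logits)
    return probs / probs.sum()       # normalize: this division is the 1/Z

# Toy example: two actions with one-hot features indexed by the action.
phi = lambda s, a: np.eye(2)[a]
probs = softmax_policy(np.array([1.0, 0.0]), phi, state=None, actions=[0, 1])
```

Subtracting the maximum logit before exponentiating leaves the probabilities unchanged (it cancels in the ratio) but avoids overflow for large $$\mathbf{\theta}^T\phi$$.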

The quality of the policy can be measured by the expected cumulative reward, $$R(\theta) = \mathbb{E}[\sum_{t=0}^T \gamma^t R_t]$$. To learn a good policy, we use gradient ascent on $$R(\theta)$$: $$\theta_{m+1} = \theta_m + \alpha_m \nabla_\theta R(\theta)|_{\theta=\theta_m}$$

Here the step-size sequence $$\alpha_m$$ satisfies $$\sum_{\mathbb{N}} \alpha_m = \infty, \sum_{\mathbb{N}} \alpha_m^2 < \infty$$: the steps are large enough in total to reach any optimum, yet shrink fast enough for the iterates to converge.
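A minimal sketch of the ascent loop with one schedule satisfying these conditions, $$\alpha_m = 1/(m+1)$$; the toy concave objective is an illustrative assumption.

```python
# Gradient ascent with the step-size schedule alpha_m = 1/(m+1):
# sum(alpha_m) diverges while sum(alpha_m^2) converges.
def gradient_ascent(grad, theta0, steps=500):
    theta = theta0
    for m in range(steps):
        alpha = 1.0 / (m + 1)
        theta = theta + alpha * grad(theta)   # ascent: move along +gradient
    return theta

# Toy concave objective R(theta) = -(theta - 3)^2, maximized at theta = 3.
theta_star = gradient_ascent(lambda th: -2.0 * (th - 3.0), theta0=0.0)
```

In RL the exact gradient is unavailable and is replaced by a stochastic estimate, which is exactly the setting these step-size conditions are designed for.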

To get the gradient, first denote the trajectory $$\mathbf{\tau} \sim \mathbb{P}(\mathbf{\tau}|\mathbf{\theta}) = \mathbb{P}(S_0)\prod_{t=0}^T\mathbb{P}(S_{t+1}|S_t,A_t)\pi_\mathbf{\theta}(A_t|S_t)$$. The expected cumulative reward can be written as $$R(\mathbf{\theta}) = \mathbb{E}_\mathbf{\tau}[R(\mathbf{\tau})] = \int \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})R(\mathbf{\tau}) \; d\mathbf{\tau}$$. Using the likelihood ratio trick, $$\nabla_{\mathbf{\theta}}R(\mathbf{\theta}) = \int \nabla_\mathbf{\theta} \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})R(\mathbf{\tau})\,d\mathbf{\tau} = \int \mathbb{P}(\mathbf{\tau}|\mathbf{\theta}) \nabla_\mathbf{\theta}\log \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})R(\mathbf{\tau})\,d\mathbf{\tau} = \mathbb{E}_{\mathbf{\tau}}[ \nabla_\mathbf{\theta}\log \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})R(\mathbf{\tau})]$$. Since the dynamics $$\mathbb{P}(S_{t+1}|S_t,A_t)$$ do not depend on $$\mathbf{\theta}$$, the score simplifies to $$\nabla_\mathbf{\theta}\log \mathbb{P}(\mathbf{\tau}|\mathbf{\theta}) = \sum_{t=0}^T \nabla_\mathbf{\theta}\log \pi_\mathbf{\theta}(A_t|S_t)$$, so the gradient can be estimated from sampled trajectories without knowing the transition model.
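The expectation on the right can be estimated by Monte Carlo. A minimal sketch on an assumed one-step "bandit" problem, where the trajectory is a single action $$a \sim N(\theta, 1)$$, so $$\nabla_\theta \log \mathbb{P}(\tau|\theta) = a - \theta$$; the reward function and sample count are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)

def policy_gradient_estimate(theta, reward, n_samples=200_000):
    """Monte Carlo estimate of E[ grad log P(tau|theta) * R(tau) ]."""
    actions = rng.normal(theta, 1.0, size=n_samples)
    scores = actions - theta                  # grad_theta of log N(a; theta, 1)
    return np.mean(scores * reward(actions))  # sample average of score * reward

# Reward peaks at a = 2; for a ~ N(theta, 1) the true gradient at theta = 0
# is d/dtheta E[-(a - 2)^2] = -2(theta - 2) = 4.
grad = policy_gradient_estimate(0.0, lambda a: -(a - 2.0) ** 2)
```

Note the estimator never differentiates the reward or the sampling process itself, only the log-density, which is what makes the trick applicable when the dynamics are unknown.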

We can reduce variance by subtracting a baseline value $$b$$ from the reward. This keeps the estimator unbiased, since $$\mathbb{E}_\mathbf{\tau}[\nabla_\mathbf{\theta}\log \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})\,b] = b\,\nabla_\mathbf{\theta}\int \mathbb{P}(\mathbf{\tau}|\mathbf{\theta})\,d\mathbf{\tau} = 0$$ for any constant $$b$$; the intuition is that the optimal policy does not change when the reward function is shifted by an additive constant. The REINFORCE algorithm can use the optimal baseline, the value of $$b$$ that minimizes the variance of the estimator.
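The variance reduction is easy to see empirically. A sketch on the same assumed one-step Gaussian bandit, using the mean reward as the baseline (a common practical choice, not necessarily the variance-optimal one):

```python
import numpy as np

rng = np.random.default_rng(1)

# One-step bandit: a ~ N(theta, 1), reward -(a - 2)^2, score = a - theta.
theta = 0.0
actions = rng.normal(theta, 1.0, size=100_000)
rewards = -(actions - 2.0) ** 2
scores = actions - theta                      # grad_theta of log N(a; theta, 1)

# Per-sample gradient terms without and with the baseline.
var_no_baseline = np.var(scores * rewards)
b = rewards.mean()                            # baseline: average reward
var_with_baseline = np.var(scores * (rewards - b))
```

Both versions average to the same gradient (about 4 here), but the baselined terms have markedly lower variance, so fewer samples are needed for a usable estimate.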

## Thoughts

• Relate the $$\alpha_m$$ sequence to Robbins-Monro.
• Mention behavioural cloning.