reinforcement learning (4/4): policy gradient

In parts 1-3 we found that learning the values of different states (or state-action pairs) made it easy to define good polices; we simply selected high valued states and actions. Policy gradient methods use a different approach: learn policies directly by optimizing their parameters to maximize reward. These techniques allow us to tackle more interesting problems consisting of large or continuous action and state spaces. The math is a bit heavier :nerd_face:, but so is the payoff.

policy gradient

Up to this point we have been learning value functions, which can be used to find good policies. Policy gradient methods directly learn the parameters for the policy. To do this we gradually tweak our policy parameters $\theta$ in the direction that increases expected reward. We therefore need to take the gradient of some performance metric $J(\theta)$ with respect to $\theta$ and update in that direction by some amount $\alpha \in (0,1]$:

\[\theta_{t+1} = \theta_t + \alpha \nabla J(\theta)\]

policy gradient theorem

Let $J(\theta)$ be the amount of reward we expect under the policy. This is the thing we want to maximize. Differentiating $J(\theta)$ with respect to $\theta$ presents a challenge. The expected reward depends on the policy $\pi_\theta(a|s)$, which we take to be differentiable, but also on the distribution of states, which depends on potentially complex interactions between the policy and the environment.

The policy gradient theorem solves this problem. Consider the episodic case1 in which we start at state $s_0$. Performance can be defined as the expected return for the episode, $v(s_0)$. The policy gradient theorem gives the following derivative, where $\mu(s)$ is the stationary distribution2 of states under $\pi$:

\[\begin{aligned} \nabla_\theta J(\theta) &= \nabla_\theta v(s_0) \\ &= \nabla_\theta \underbrace{\sum_s \mu(s)}_\text{average over states} \underbrace{\sum_a \pi(a \mid s) q(s,a)}_\text{value of each state} \\ &\propto \sum_s \mu(s) \sum_a q(s,a) \underbrace{\nabla_\theta \pi(a \mid s)}_\text{nice!} \end{aligned}\]

The magic happens in the third line. The policy gradient theorem says we can take the gradient with respect to the policy without worrying about the gradient of the state distribution.

policy gradient theorem proof

We want to take the gradient of $J(\theta) = v(s_0)$. Let’s start by taking the gradient for any state $s$. First we’ll establish a recursive relationship between the value of a state and the subsequent state. For simplicity (but without loss of generality) we will not use discounting. Using the product rule we have:

\[\begin{aligned} \nabla_\theta v(s) &= \nabla_\theta \sum_a \pi(a \mid s) q(s,a) \\ &= \sum_a \left[ \nabla_\theta \pi(a \mid s) q(s,a) + \pi(a \mid s) \nabla_\theta q(s,a) \right] \end{aligned}\]

Let’s expand the final term in this expression. Recall that in stochastic environments the same state-action pair $(s,a)$ can result in several possible subsequent state-reward pairs $(s’,r)$. Therefore, $q(s,a)$ can be expressed as an average over these possibilities, $q(s,a) = \sum_{s’,r}p(s’,r \mid s,a)[r + v(s’)]$, giving us:

\[\nabla_\theta v(s) = \sum_a \left[ \nabla_\theta \pi(a \mid s) q(s,a) + \pi(a \mid s) \nabla_\theta \sum_{s',r}p(s',r\mid s,a)[r + v(s')] \right]\]

When we distribute the rightmost $\nabla_\theta$ the $r$ disappears because it doesn’t depend on $\theta$ ($r$ behaves like a constant inside the summation):

\[\nabla_\theta v(s) = \sum_a \left[ \nabla_\theta \pi(a \mid s) q(s,a) + \pi(a \mid s) \sum_{s'}p(s'\mid s,a)\nabla_\theta v(s') \right]\]

Finally, to reduce the painfulness of the following steps we will set $\phi(s) = \sum_a \nabla \pi(a \mid s) q(s,a)$:

\[\nabla_\theta v(s) = \phi(s) + \sum_a \pi(a \mid s) \sum_{s}p(s'\mid s,a)\nabla_\theta v(s')\]

Okay! Get ready for some funny notation 📝📝📝. The probability of moving from state $s$ to state $x$ in exactly $k$ steps under the current policy will be denoted $\rho(s \rightarrow x, k)$. Notice that to compute the probability of moving from $s_0$ to $s_2$ in 2 steps, we add up the probabilities of getting there via all possible intermediate states $s_1$:

\[\rho(s_0 \rightarrow s_2, 2) = \sum_{s_1} \underbrace{\rho(s_0 \rightarrow s_1, 1)}_{\text{$s_0$ to $s_1$ in $1$ step}} \underbrace{\rho(s_1 \rightarrow s_2, 1)}_{\text{$s_1$ to $s_2$ in $1$ step}}\]

With this seemningly out-of-nowhere notation we are ready to prove the policy gradient theorem. First we shift around the summations a bit:

\[\begin{aligned} \nabla_\theta v(s) &= \phi(s) + \sum_a \pi(a \mid s) \sum_{s'}p(s'\mid s,a)\nabla_\theta v(s') \\ &= \phi(s) + \sum_{s'} \sum_a \pi(a \mid s) p(s'\mid s,a)\nabla_\theta v(s') \\ \end{aligned}\]

The probability of moving from $s$ to $s’$ in one step is the average of the probabilities for each action we could take from $s$. Mathematically, this means that $\rho(s \rightarrow s’, 1) = \sum_a p(a \mid s) p(s’ \mid s,a)$. Performing this substitution and then plugging in our formula recursively we have:

\[\begin{aligned} \nabla_\theta v(s) &= \phi(s) + \sum_{s'} \rho(s \rightarrow s', 1) \nabla_\theta v(s') \\ &= \phi(s) + \sum_{s'}\rho(s \rightarrow s', 1) \left[\phi(s') + \sum_{s''} \rho(s' \rightarrow s'', 1) \nabla_\theta v(s'') \right] & \scriptstyle{\text{recursion, mwahaha}} \\ \end{aligned}\]

Now we distribute $\sum_{s’} \rho(s \rightarrow s’, 1)$ and use the fact that $\rho(s \rightarrow s’’, 2) = \sum_{s’} \rho(s \rightarrow s’, 1) \rho(s’ \rightarrow s’’, 1)$:

\[\nabla_\theta v(s) = \phi(s) + \sum_{s'} \rho(s \rightarrow s', 1) \phi(s') + \sum_{s''} \rho(s \rightarrow s'', 2) \nabla_\theta v(s'')\]

Continuing the pattern forever we get:

\[\begin{aligned} \nabla_\theta v(s) &= \sum_{x \in \mathcal{S}} \sum_{k=0}^\infty \rho(s \rightarrow x, k) \phi(x) \\ &= \sum_{x \in \mathcal{S}} \sum_{k=0}^\infty \rho(s \rightarrow x, k) \sum_a \nabla_\theta \pi(a \mid s) q(s,a) \end{aligned}\]

So 😬. Close 😬. I’m only going to introduce one more annoying thing. $\eta(s)$ will be the average number of steps we spend in state $s$ across episodes. It is the sum of the probabilities of being in state $s$ after zero steps, one steps, two steps…

\[\eta(s) = \sum_{k=0}^\infty \rho(s_0 \rightarrow s, k)\]

Now let’s consider what happens to our gradient with $s = s_0$:

\[\begin{aligned} \nabla_\theta v(s_0) &= \sum_{s} \sum_{k=0}^\infty \rho(s_0 \rightarrow s, k) \phi(x) \\ &= \sum_{s} \eta(s) \phi(s) \\ &= \sum_{s'}\eta(s') \sum_{s} \frac{\eta(s)}{\sum_{s'}\eta(s')} \phi(x) & \scriptstyle{\text{normalize to probability distribution}} \\ &= \sum_{s'} \eta(s') \sum_{s} \mu(s) \phi(x) \\ &\propto \sum_{s} \mu(s) \sum_a \nabla_\theta \pi(a \mid s) q(s,a) \end{aligned}\]

The proof is complete! In the last line we use the fact that the amount of time we spend in a state, $\eta(s)$, is proportional to the probability of being in that state, $\mu(s)$. Specifically, $\mu(s) = \eta(s) / \sum_{s’}\eta(s’)$, a fact we use in the fourth line.



Let’s put the policy gradient theorem to use. Although we don’t know the true distribution of states $\mu(s)$, we can continuously sample states under the current policy to approximate the gradient without bias. This Monte Carlo approach relies on the fact that:

\[\begin{aligned} \nabla_\theta J(\theta) &\propto \sum_{s} \mu(s) \sum_a q(s,a) \nabla_\theta \pi(a \mid s) \\ &= \mathbb{E}_\pi \left[ \sum_a q(S_t,a) \nabla_\theta \pi(a \mid S_t) \right] & \scriptstyle{\text{replacing $s$ with the sample $S_t$}} \end{aligned}\]

There is a subtle shift here. In the first line we loop over specific states $s$, whereas in the second line we sample states $S_t$. Now the state is a random variable. This means that instead of averaging across states according to their (unknown!) probabilities, we can just behave in the world and average across the states we see!

Using similar logic, rather than performing updates by averaging across all actions, we can perform updates by sampling one action at a time because:

\[\begin{aligned} \nabla_\theta J(\theta) &\propto \mathbb{E}_\pi \left[ \sum_a q(S_t,a) \nabla_\theta \pi(a \mid S_t) \right] \\ &= \mathbb{E}_\pi \left[ \sum_a \pi(a \mid S_t) q(S_t,a) \frac{\nabla_\theta \pi(a \mid S_t)}{\pi(a \mid S_t)} \right] \\ &= \mathbb{E}_\pi \left[q(S_t, A_t) \frac{\nabla_\theta \pi(A_t \mid S_t)}{\pi(A_t \mid S_t)} \right] && \scriptstyle{\text{replacing $a$ with the sample $A_t$}} \\ &= \mathbb{E}_\pi \left[G_t \frac{\nabla_\theta \pi(A_t \mid S_t)}{\pi(A_t \mid S_t)} \right] && \scriptstyle{\text{because $\mathbb{E}[G_t \mid S_t, A_t] = q(S_t,A_t)$}} \end{aligned}\]

This leads to the stochastic update rule:

\[\theta_{t+1} = \theta_t + \alpha G_t \frac{\nabla_\theta \pi(A_t \mid S_t)}{\pi(A_t \mid S_t)}\]

Because $\nabla \ln{f(x)} = \frac{\nabla f(x)}{f(x)}$ this is equivalent to:

\[\theta_{t+1} = \theta_t + \alpha G_t \nabla_\theta \ln \pi(A_t \mid S_t)\]

This formula makes both mathematical and common sense. If the return $G_t$ is high, we want to make the actions that led to it more likely. In other words, we want to push $\theta$ in the direction of the gradient of $\pi$ for actions that are associated with large returns.

REINFORCE with baseline

REINFORCE has zero bias because it relies on sample returns, but it can suffer from very high variance due to stochasticity in these returns. We can reduce variance by subtracting a baseline from these returns. A common approach is to subtract an estimate of the state value function (or related functions):

\[\theta_{t+1} = \theta_t + \alpha (G_t - \underbrace{\hat{v}(S_t)}_\text{baseline}) \nabla_\theta \ln \pi(A_t \mid S_t)\]

This strategy does not introduce bias because the value function does not depend on the selected actions.


We can further reduce variance by constructing bootstrapped targets, e.g. by replacing $G_t$ with a one-step TD target:

\[\underbrace{\theta_{t+1}}_\text{"actor" parameters} = \theta + \alpha \left[\underbrace{R_{t+1} + \gamma \underbrace{\hat{v}(S_{t+1})}_\text{"critic"}}_\text{TD target} - \hat{v}(S_t) \right] \nabla \ln \pi (A_t \mid S_t)\]

In such actor-critic approaches the actor - the policy - selects actions whereas the critic - the learned value function - helps evaluate the goodness of the selected actions. Because we are now using a bootstrapped target estimate we have introduced bias into our updates.

Note that REINFORCE with baseline is not an actor-critic approach despite also using a learned value function. This is because using $\hat{v}(S_t)$ introduces bias when used to bootstrap the target, but not when used as a baseline. Even this (super useful) TensorFlow tutorial makes this mistake. See further discussion here.

policy parameterizations

Policy gradient methods present a natural way of dealing with large or continuous action spaces. Rather than learning probability mass functions over many different actions we can directly learn the parameters of probability distributions, for example the mean and standard deviation of Gaussians. We can then draw from these Gaussians when selecting actions in a continuous action space.


Lilian Weng’s excellent policy gradient tutorial was very helpful, as was Sutton and Barto’s presentation of this proof.

  1. The policy gradient theorem also holds in continuing problems if we define $J(\theta)$ to be the average rate of reward: \(J(\theta) = \sum_s \mu(s) \sum_a \pi (a \mid s) \sum_{s',r} p(s',r \mid s, a) r\) 

  2. The stationary distribution describes the likelihood of being in each state after a really long time, given a particular starting state and policy. Mathematically: $\mu(s) = \lim_{t \rightarrow \infty} P(s_t=s \mid s_0, \pi_\theta)$