Policy Gradient methods are a class of reinforcement learning algorithms that directly optimize the policy parameters to maximize expected return. Unlike value-based methods, which learn a value function and derive a policy from it, policy gradient methods compute the gradient of the expected return with respect to the policy parameters θ and then use gradient ascent to improve the policy.
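In standard notation, writing J(θ) for the expected return (a symbol introduced here for brevity) and α for the step size, the gradient-ascent update is:

```latex
\theta_{k+1} \;=\; \theta_k \;+\; \alpha \, \nabla_\theta J(\theta_k)
```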
To derive the policy gradient, we first express the expected return as a sum over all possible trajectories. Under the current policy, each trajectory τ occurs with probability P(τ | θ) and yields a total return R(τ); the expected return is the probability-weighted sum of trajectory returns.
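As a compact statement of this objective, again using J(θ) as shorthand for the expected return:

```latex
J(\theta)
  \;=\; \sum_{\tau} P(\tau \mid \theta)\, R(\tau)
  \;=\; \mathbb{E}_{\tau \sim P(\tau \mid \theta)}\!\left[ R(\tau) \right]
```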
The key insight in the derivation is the log-likelihood (or likelihood-ratio) trick: the gradient of a function equals the function times the gradient of its logarithm. This transforms the gradient of the trajectory probability into a form involving the log probability, which can be estimated from sampled trajectories.
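Applied to the objective above (assuming we may exchange the gradient and the sum), the trick gives:

```latex
\nabla_\theta J(\theta)
  \;=\; \sum_{\tau} \nabla_\theta P(\tau \mid \theta)\, R(\tau)
  \;=\; \sum_{\tau} P(\tau \mid \theta)\, \nabla_\theta \log P(\tau \mid \theta)\, R(\tau)
  \;=\; \mathbb{E}_{\tau \sim P(\tau \mid \theta)}\!\left[ \nabla_\theta \log P(\tau \mid \theta)\, R(\tau) \right]
```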
We decompose the trajectory probability into three components: the initial state distribution, the environment transition probabilities, and the policy probabilities. When we take the gradient of the log probability with respect to θ, only the policy terms survive, because the initial state distribution and the environment dynamics do not depend on the policy parameters.
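Written out for a trajectory τ = (s_0, a_0, s_1, ..., s_T), with ρ_0 denoting the initial state distribution and π_θ the policy (standard symbols, introduced here):

```latex
P(\tau \mid \theta)
  \;=\; \rho_0(s_0) \prod_{t=0}^{T-1} P(s_{t+1} \mid s_t, a_t)\, \pi_\theta(a_t \mid s_t)
\qquad\Longrightarrow\qquad
\nabla_\theta \log P(\tau \mid \theta)
  \;=\; \sum_{t=0}^{T-1} \nabla_\theta \log \pi_\theta(a_t \mid s_t)
```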
The policy gradient theorem gives us the final result: the gradient of the expected return equals the expected sum of log-policy gradients weighted by returns. In practice, we weight each term by the reward-to-go (the return from that time step onward) rather than the full return, and subtract a baseline to reduce variance. This fundamental result enables direct policy optimization and forms the foundation for many modern reinforcement learning algorithms.
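The code below is a minimal REINFORCE-style sketch of this estimator on a toy tabular problem; the random MDP, the softmax parameterization, and all hyperparameters are illustrative assumptions rather than anything specified above.

```python
# Minimal REINFORCE-style sketch (NumPy only). The toy MDP, the tabular softmax
# policy, and all hyperparameters below are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

n_states, n_actions, horizon = 4, 2, 10
# Random but fixed environment dynamics and rewards (unknown to the agent).
P = rng.dirichlet(np.ones(n_states), size=(n_states, n_actions))  # P[s, a] -> next-state dist
R = rng.normal(size=(n_states, n_actions))                        # reward r(s, a)

theta = np.zeros((n_states, n_actions))  # tabular softmax policy parameters

def policy(s):
    """Action probabilities pi_theta(.|s) under a softmax parameterization."""
    logits = theta[s]
    p = np.exp(logits - logits.max())
    return p / p.sum()

def sample_trajectory():
    """Roll out one trajectory; return lists of states, actions, rewards."""
    s = rng.integers(n_states)
    states, actions, rewards = [], [], []
    for _ in range(horizon):
        a = rng.choice(n_actions, p=policy(s))
        states.append(s); actions.append(a); rewards.append(R[s, a])
        s = rng.choice(n_states, p=P[s, a])
    return states, actions, rewards

def reward_to_go(rewards):
    """Return-to-go G_t = sum of rewards from step t onward."""
    return np.cumsum(rewards[::-1])[::-1]

alpha, batch_size = 0.1, 32
for _ in range(200):
    grad = np.zeros_like(theta)
    all_returns = []
    batch = [sample_trajectory() for _ in range(batch_size)]
    # Baseline: mean reward-to-go across the batch (a simple variance reducer).
    baseline = np.mean([g for _, _, r in batch for g in reward_to_go(r)])
    for states, actions, rewards in batch:
        for s, a, g in zip(states, actions, reward_to_go(rewards)):
            # grad log pi_theta(a|s) for a tabular softmax: one-hot(a) - pi(.|s)
            glogpi = -policy(s)
            glogpi[a] += 1.0
            grad[s] += (g - baseline) * glogpi
        all_returns.append(sum(rewards))
    theta += alpha * grad / batch_size  # gradient ascent on expected return

print("average return in last batch:", np.mean(all_returns))
```

Using the mean reward-to-go over the batch as the baseline is the simplest choice; a learned state-value baseline is the more common variant in practice.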