The Bellman equation is a cornerstone of reinforcement learning, providing a recursive way to compute the value of states in a Markov Decision Process (MDP). It breaks down the complex problem of finding optimal policies into smaller, interconnected subproblems. In this grid world example, an agent must navigate from its starting position to the goal state, and each state has an associated value representing the expected cumulative future reward (the return) obtainable from that state.
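To make the setting concrete, here is a minimal Python sketch of such a grid world; the 4x4 size, the goal position, the per-step reward of -1, and the deterministic moves are assumptions chosen for illustration rather than details fixed by the example.

```python
# A minimal sketch of the kind of grid world described above; the layout,
# goal, step reward, and deterministic moves are illustrative assumptions.
GRID_SIZE = 4
GOAL = (3, 3)
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}
STEP_REWARD = -1.0

STATES = [(row, col) for row in range(GRID_SIZE) for col in range(GRID_SIZE)]

def step(state, action):
    """Deterministic transition: moves that would leave the grid keep the agent in place."""
    if state == GOAL:  # the goal is absorbing and costs nothing
        return state, 0.0
    d_row, d_col = ACTIONS[action]
    row = min(max(state[0] + d_row, 0), GRID_SIZE - 1)
    col = min(max(state[1] + d_col, 0), GRID_SIZE - 1)
    return (row, col), STEP_REWARD
```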
The Bellman Expectation Equation is the first fundamental form of the Bellman equation. It defines the value function V^π(s) as the expected return obtained by starting in state s and then following policy π. The equation states that this value equals the expected immediate reward plus the discounted expected value of the next state. The expanded form makes the expectation explicit as a sum over all possible actions and next states, weighted by the policy's action probabilities and the environment's transition probabilities.
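In standard notation, with π(a | s) the policy, p(s', r | s, a) the transition probabilities, and γ the discount factor, one common way to write the compact and expanded forms is

$$V^{\pi}(s) \;=\; \mathbb{E}_{\pi}\!\left[\, R_{t+1} + \gamma\, V^{\pi}(S_{t+1}) \;\middle|\; S_t = s \,\right]$$

$$V^{\pi}(s) \;=\; \sum_{a} \pi(a \mid s) \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{\pi}(s') \,\bigr]$$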
The Bellman Optimality Equation is the second fundamental form, describing the optimal value function V^*. Unlike the expectation equation, which averages over actions according to a given policy, the optimality equation uses maximization to select the best action in each state. The max operator replaces the policy's action probabilities and expresses the principle of optimality: the optimal value of a state is achieved by taking the action that maximizes the expected future return. This equation forms the basis for dynamic programming algorithms such as value iteration.
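In the same notation, the optimality equation replaces the policy-weighted average over actions with a maximization:

$$V^{*}(s) \;=\; \max_{a} \sum_{s',\, r} p(s', r \mid s, a)\,\bigl[\, r + \gamma\, V^{*}(s') \,\bigr]$$

The sketch below applies this backup repeatedly, which is the core of value iteration. It is written against the hypothetical grid world above (the STATES, ACTIONS, and step names come from that snippet), and the discount factor of 0.9 is an arbitrary illustrative choice.

```python
def value_iteration(states, actions, step, gamma=0.9, tol=1e-8):
    """Repeated Bellman optimality backups for a deterministic MDP.

    `step(state, action) -> (next_state, reward)` plays the role of the
    transition model; iteration stops once no state value changes by more
    than `tol` in a full sweep.
    """
    values = {s: 0.0 for s in states}
    while True:
        delta = 0.0
        for s in states:
            backups = []
            for a in actions:
                next_state, reward = step(s, a)
                backups.append(reward + gamma * values[next_state])
            best = max(backups)  # the max operator of the optimality equation
            delta = max(delta, abs(best - values[s]))
            values[s] = best
        if delta < tol:
            return values

# Usage with the grid world sketched earlier:
# V = value_iteration(STATES, list(ACTIONS), step)
# V[GOAL] converges to 0, and values grow more negative with distance from the goal.
```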
The Bellman equation has deep connections to calculus through several key areas. First, the maximization in the optimality equation is analogous to finding extrema in calculus using derivatives. Second, in continuous-time problems with continuous state and action spaces, the Bellman equation becomes the Hamilton-Jacobi-Bellman (HJB) partial differential equation, bringing differential calculus directly into reinforcement learning and optimal control. Third, when the value function is approximated with a neural network, gradient-based optimization relies on calculus to minimize loss functions derived from the Bellman equation. Finally, the recursive structure of dynamic programming shares conceptual similarities with the calculus ideas of accumulation and rates of change.
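For reference, one common form of the HJB equation, for a deterministic continuous-time problem with state dynamics dx/dt = f(x, u), reward rate r(x, u), and discount rate ρ, is

$$\rho\, V^{*}(x) \;=\; \max_{u}\,\bigl[\, r(x, u) + \nabla V^{*}(x)^{\top} f(x, u) \,\bigr]$$

and the loss that gradient-based function approximation typically minimizes is the squared Bellman error, written here for an action-value network Q_θ with (possibly frozen) target parameters θ̄:

$$L(\theta) \;=\; \mathbb{E}\!\left[\, \bigl(\, r + \gamma \max_{a'} Q_{\bar{\theta}}(s', a') - Q_{\theta}(s, a) \,\bigr)^{2} \,\right]$$

Both are standard textbook forms rather than anything specific to the grid world example.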
In summary, the Bellman equation serves as the mathematical foundation for reinforcement learning, providing both the expectation equation for policy evaluation and the optimality equation for finding optimal policies. These equations bridge discrete optimization problems with continuous calculus through the Hamilton-Jacobi-Bellman formulation, gradient-based function approximation, and optimization principles. The Bellman equation enables powerful algorithms like value iteration, policy iteration, Q-learning, and modern deep reinforcement learning methods, making it one of the most important concepts in artificial intelligence and optimal control theory.