Bellman equation derivation

Bellman equation for $v_\pi$.

We want to show

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]$$

for all states $s$.

The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.

The law of total probability states that if $A$ is an event, and $B_1, \ldots, B_n$ are events that partition the sample space, then $\Pr(A) = \sum_{j=1}^n \Pr(A \mid B_j)\Pr(B_j)$.

For a fixed event $C$ with non-zero probability, the mapping $A \mapsto \Pr(A \mid C)$ is another valid probability measure. In other words, define $\Pr_C$ by $\Pr_C(A) = \Pr(A \mid C)$ for all events $A$. Now the law of total probability for $\Pr_C$ states that $\Pr_C(A) = \sum_{j=1}^n \Pr_C(A \mid B_j)\Pr_C(B_j)$. We also have

$$\Pr_C(A \mid B_j) = \frac{\Pr_C(A \cap B_j)}{\Pr_C(B_j)} = \frac{\Pr(A \cap B_j \mid C)}{\Pr(B_j \mid C)} = \frac{\Pr(A \cap B_j \cap C)}{\Pr(B_j \cap C)} = \Pr(A \mid B_j \cap C)$$

and $\Pr_C(B_j) = \Pr(B_j \mid C)$.

So the law of total probability states that $\Pr(A \mid C) = \sum_{j=1}^n \Pr(A \mid B_j \cap C)\Pr(B_j \mid C)$. We will call this the conditional law of total probability.
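
This identity is easy to check numerically. Below is a minimal sketch (not part of the original derivation) that verifies the conditional law of total probability on a made-up example with two fair dice; the events $A$, $C$ and the partition are chosen arbitrarily for illustration.

```python
import itertools
from fractions import Fraction

# Sample space: two fair six-sided dice, every outcome equally likely.
omega = list(itertools.product(range(1, 7), range(1, 7)))
pr_point = {w: Fraction(1, 36) for w in omega}

def pr(event):
    """Probability of an event (a set of outcomes)."""
    return sum(pr_point[w] for w in event)

def pr_given(event, cond):
    """Conditional probability Pr(event | cond)."""
    return pr(event & cond) / pr(cond)

A = {w for w in omega if w[0] + w[1] == 7}   # the dice sum to 7
C = {w for w in omega if w[0] % 2 == 0}      # the first die is even
# Partition of the sample space by the value of the second die.
partition = [{w for w in omega if w[1] == k} for k in range(1, 7)]

lhs = pr_given(A, C)
rhs = sum(pr_given(A, B & C) * pr_given(B, C)
          for B in partition if pr(B & C) > 0)
assert lhs == rhs   # the conditional law of total probability holds exactly
print(lhs, rhs)     # 1/6 1/6
```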

Now we see how the law of total probability interacts with conditional expectation. Let $X$ be a discrete random variable. Then

$$\mathbb{E}[X \mid C] = \sum_x x \Pr(X = x \mid C) = \sum_x x \sum_j \Pr(X = x \mid B_j \cap C)\Pr(B_j \mid C) = \sum_j \Pr(B_j \mid C)\, \mathbb{E}[X \mid B_j \cap C]$$

Here the event $\{X = x\}$ is playing the role of $A$ in the statement of the conditional law of total probability.
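
The same kind of numerical check works for the conditional-expectation form. The sketch below (again illustrative, with a made-up random variable and events) verifies that $\mathbb{E}[X \mid C] = \sum_j \Pr(B_j \mid C)\,\mathbb{E}[X \mid B_j \cap C]$ on the two-dice sample space.

```python
import itertools
from fractions import Fraction

# Sample space: two fair six-sided dice, every outcome equally likely.
omega = list(itertools.product(range(1, 7), range(1, 7)))
pr_point = {w: Fraction(1, 36) for w in omega}

def pr(event):
    return sum(pr_point[w] for w in event)

def expect_given(f, cond):
    """E[f | cond] for a random variable given as a function f on outcomes."""
    return sum(f(w) * pr_point[w] for w in cond) / pr(cond)

def X(w):
    return w[0] + w[1]                       # X = sum of the two dice

C = {w for w in omega if w[0] >= 4}          # condition: the first die is at least 4
partition = [{w for w in omega if w[1] == k} for k in range(1, 7)]

lhs = expect_given(X, C)
rhs = sum(pr(B & C) / pr(C) * expect_given(X, B & C)
          for B in partition if pr(B & C) > 0)
assert lhs == rhs
print(lhs, rhs)   # 17/2 17/2
```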

This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.

By definition, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. Now rewrite $G_t = R_{t+1} + \gamma G_{t+1}$ and use the linearity of expectation to get

$$v_\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s]$$

From here, we can work separately with $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$ and $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$ for a while.
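
The recursion $G_t = R_{t+1} + \gamma G_{t+1}$ itself can be checked directly. The following sketch (not from the original page; the reward sequence is made up) computes the returns of a short episode both by the direct definition $G_t = \sum_k \gamma^k R_{t+k+1}$ and by the backward recursion, and confirms they agree.

```python
gamma = 0.9
rewards = [1.0, 0.0, -2.0, 5.0, 1.5]   # hypothetical rewards R_1, ..., R_5 of a length-5 episode

# Direct definition: G_t = sum_{k >= 0} gamma^k * R_{t+k+1}
def g_direct(t):
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

# Backward recursion: G_T = 0 and G_t = R_{t+1} + gamma * G_{t+1}
g = [0.0] * (len(rewards) + 1)
for t in reversed(range(len(rewards))):
    g[t] = rewards[t] + gamma * g[t + 1]

for t in range(len(rewards)):
    assert abs(g[t] - g_direct(t)) < 1e-12
print(g[:len(rewards)])
```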

Using the law of total probability while conditioning over actions, we have

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_r r \sum_a \Pr(R_{t+1} = r \mid S_t = s, A_t = a)\Pr(A_t = a \mid S_t = s)$$

Using the convention that $\pi(a \mid s) := \Pr(A_t = a \mid S_t = s)$, this becomes

$$\sum_r r \sum_a \pi(a \mid s)\Pr(R_{t+1} = r \mid S_t = s, A_t = a)$$

Now we can reorder the sums to get

$$\sum_a \pi(a \mid s) \sum_r r \Pr(R_{t+1} = r \mid S_t = s, A_t = a)$$

Again, using the law of total probability (in its conjunction form) while conditioning this time over states, we have

$$\sum_a \pi(a \mid s) \sum_r r \sum_{s'} \Pr(R_{t+1} = r, S_{t+1} = s' \mid S_t = s, A_t = a)$$

Reordering the sums, this becomes

$$\sum_a \pi(a \mid s) \sum_{s'} \sum_r r \Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$$

Sutton and Barto abbreviate $\Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$ as $p(s', r \mid s, a)$ (strictly speaking, we should track the timestep parameter $t$, but we will omit this detail here). We can also combine the nested sums over $s'$ and $r$ into a single sum that iterates over pairs $(s', r)$. So we obtain

$$\sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$

This completes the part for $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$. In other words, we have shown that

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$$
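
As an illustrative check of this expression (the two-state MDP, its dynamics $p(s', r \mid s, a)$, and the policy $\pi$ below are all made up), a Monte Carlo estimate of $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$ should roughly agree with $\sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r$:

```python
import random

# Made-up two-state MDP: states 0 and 1, actions 0 and 1.
# p[(s, a)] maps each (next_state, reward) pair to its probability p(s', r | s, a).
p = {
    (0, 0): {(0, 1.0): 0.7, (1, 0.0): 0.3},
    (0, 1): {(1, 2.0): 0.5, (0, -1.0): 0.5},
    (1, 0): {(1, 0.0): 0.9, (0, 3.0): 0.1},
    (1, 1): {(0, 1.0): 1.0},
}
pi = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.8, 1: 0.2}}   # pi(a | s)

def expected_reward_formula(s):
    """sum_a pi(a|s) sum_{s', r} p(s', r | s, a) * r"""
    return sum(pi[s][a] * sum(prob * r for (s2, r), prob in p[(s, a)].items())
               for a in pi[s])

def expected_reward_mc(s, n=200_000, seed=0):
    """Monte Carlo estimate of E[R_{t+1} | S_t = s] under pi."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n):
        a = rng.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        outcomes = list(p[(s, a)])
        s2, r = rng.choices(outcomes, weights=[p[(s, a)][o] for o in outcomes])[0]
        total += r
    return total / n

for s in (0, 1):
    print(s, expected_reward_formula(s), expected_reward_mc(s))   # should roughly agree
```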

Now we do a similar series of steps for $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$. Conditioning over actions,

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_g g \sum_a \Pr(G_{t+1} = g \mid S_t = s, A_t = a)\Pr(A_t = a \mid S_t = s)$$

Rearranging sums, this is

$$\sum_a \pi(a \mid s) \sum_g g \Pr(G_{t+1} = g \mid S_t = s, A_t = a) = \sum_a \pi(a \mid s)\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a]$$

Conditioning over states and rewards, we have

$$\sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$$

Now write $\mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a, S_{t+1} = s', R_{t+1} = r]$ as $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']$; this is where the Markov assumption is invoked, since given the state at time $t+1$, the return from time $t+1$ onward does not depend on the earlier state, action, or reward. But $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s'] = v_\pi(s')$, so we have shown that

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, v_\pi(s')$$

Combining this with the expression for $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$ gives

$$v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]$$

which is the Bellman equation we wanted to show.
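
Finally, here is a sketch (using the same made-up two-state MDP as above) that checks the whole Bellman equation numerically: iterative policy evaluation, which repeatedly applies the right-hand side of the Bellman equation as an update, should converge to roughly the same values as a Monte Carlo estimate of $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$, and the resulting $v$ should satisfy the equation up to numerical error.

```python
import random

gamma = 0.9
# Same made-up two-state MDP as before: p[(s, a)] gives p(s', r | s, a).
p = {
    (0, 0): {(0, 1.0): 0.7, (1, 0.0): 0.3},
    (0, 1): {(1, 2.0): 0.5, (0, -1.0): 0.5},
    (1, 0): {(1, 0.0): 0.9, (0, 3.0): 0.1},
    (1, 1): {(0, 1.0): 1.0},
}
pi = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.8, 1: 0.2}}   # pi(a | s)

def bellman_rhs(v, s):
    """sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v(s')]"""
    return sum(pi[s][a] * sum(prob * (r + gamma * v[s2])
                              for (s2, r), prob in p[(s, a)].items())
               for a in pi[s])

# Iterative policy evaluation: repeatedly apply the Bellman equation as an update.
v = {0: 0.0, 1: 0.0}
for _ in range(1000):
    v = {s: bellman_rhs(v, s) for s in v}

def v_monte_carlo(s, episodes=5_000, horizon=100, seed=0):
    """Monte Carlo estimate of v_pi(s) = E[G_t | S_t = s] (truncated at a finite horizon)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(episodes):
        state, ret, discount = s, 0.0, 1.0
        for _ in range(horizon):   # gamma**100 is tiny, so truncation barely matters
            a = rng.choices(list(pi[state]), weights=list(pi[state].values()))[0]
            outcomes = list(p[(state, a)])
            state, r = rng.choices(outcomes, weights=[p[(state, a)][o] for o in outcomes])[0]
            ret += discount * r
            discount *= gamma
        total += ret
    return total / episodes

for s in (0, 1):
    assert abs(v[s] - bellman_rhs(v, s)) < 1e-9   # v satisfies the Bellman equation
    print(s, v[s], v_monte_carlo(s))              # the two estimates should roughly agree
```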