Bellman equation for $v_\pi$
We want to show

$$v_\pi(s) = \sum_a \pi(a\mid s) \sum_{s',r} p(s',r\mid s,a)\bigl[r + \gamma v_\pi(s')\bigr]$$

for all states $s$.
The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.
Law of total probability
The law of total probability states that if $A$ is an event, and $C_1, \ldots, C_n$ are events that partition the sample space, then $\Pr(A) = \sum_j \Pr(A \mid C_j)\Pr(C_j)$.
For a fixed event $B$ with non-zero probability, the mapping $E \mapsto \Pr(E \mid B)$ is another valid probability measure, so the law of total probability also applies to it. Define $\Pr_B$ by $\Pr_B(E) = \Pr(E \mid B)$ for all events $E$. Now the law of total probability for $\Pr_B$ states that $\Pr_B(A) = \sum_j \Pr_B(A \mid C_j)\Pr_B(C_j)$. We also have

$$\Pr{}_B(A \mid C_j) = \frac{\Pr_B(A \cap C_j)}{\Pr_B(C_j)} = \frac{\Pr(A \cap C_j \mid B)}{\Pr(C_j \mid B)} = \frac{\Pr(A \cap C_j \cap B)}{\Pr(C_j \cap B)} = \Pr(A \mid C_j, B)$$

So the law of total probability states that

$$\Pr(A \mid B) = \sum_j \Pr(A \mid C_j, B)\Pr(C_j \mid B)$$
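Here is a quick sanity check of this conditional form (my addition, not part of the argument): a minimal Python sketch on the toy sample space of two fair dice, where the events $A$, $B$, and the partition are arbitrary choices made just for illustration.

```python
from itertools import product

# Toy sample space: ordered outcomes of two fair six-sided dice, all equally likely.
omega = set(product(range(1, 7), repeat=2))

def pr(event):
    """Probability of an event (a subset of omega) under the uniform measure."""
    return len(event & omega) / len(omega)

def pr_cond(event, given):
    """Conditional probability Pr(event | given)."""
    return pr(event & given) / pr(given)

A = {w for w in omega if w[0] + w[1] >= 9}                    # "sum is at least 9"
B = {w for w in omega if w[0] % 2 == 0}                       # "first die is even"
C = [{w for w in omega if w[1] == j} for j in range(1, 7)]    # partition by second die

lhs = pr_cond(A, B)
rhs = sum(pr_cond(A, Cj & B) * pr_cond(Cj, B) for Cj in C if pr(Cj & B) > 0)
print(lhs, rhs)   # the two numbers should be equal
```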
Now we see how the law of total probability interacts with conditional expectation. Let $X$ be a discrete random variable. Then

$$\mathbb{E}[X \mid A] = \sum_x x \cdot \Pr(X = x \mid A) = \sum_x x \sum_j \Pr(X = x \mid C_j, A)\Pr(C_j \mid A)$$

Here the event $A$ is playing the role of $B$ in the statement of the conditional law of total probability.
This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.
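And here is the same kind of check for the conditional-expectation form (again just an illustrative sketch on the made-up two-dice example, repeated so it runs on its own, with $X$ taken to be the sum of the dice):

```python
from itertools import product

# Same toy setup as above, repeated so this snippet is self-contained.
omega = set(product(range(1, 7), repeat=2))

def pr(event):
    return len(event & omega) / len(omega)

def pr_cond(event, given):
    return pr(event & given) / pr(given)

X = lambda w: w[0] + w[1]                                     # random variable: sum of the dice
A = {w for w in omega if w[0] % 2 == 0}                       # conditioning event: first die even
C = [{w for w in omega if w[1] == j} for j in range(1, 7)]    # partition by second die

# Direct computation of E[X | A] (all outcomes are equally likely, so just average X over A).
lhs = sum(X(w) for w in A) / len(A)

# Expanded form: sum_x x * sum_j Pr(X = x | C_j, A) Pr(C_j | A)
rhs = sum(x * sum(pr_cond({w for w in omega if X(w) == x}, Cj & A) * pr_cond(Cj, A)
                  for Cj in C if pr(Cj & A) > 0)
          for x in {X(w) for w in omega})
print(lhs, rhs)   # the two numbers should be equal
```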
Start of proof
By definition, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$. Now rewrite $G_t = R_{t+1} + \gamma G_{t+1}$ and use the linearity of expectation to get $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\,\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$. From here, we can work separately with $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$ and $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$ for a while.
Simplifying expectation for reward term
Using the law of total probability while conditioning over actions, we have
$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_r r \cdot \Pr(R_{t+1} = r \mid S_t = s) = \sum_r r \sum_a \Pr(R_{t+1} = r \mid A_t = a, S_t = s)\Pr(A_t = a \mid S_t = s)$$
Using the convention that $\pi(a \mid s) = \Pr(A_t = a \mid S_t = s)$, this becomes

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_r r \sum_a \Pr(R_{t+1} = r \mid A_t = a, S_t = s)\,\pi(a \mid s)$$
Now we can reorder the sums to get
$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_r r \cdot \Pr(R_{t+1} = r \mid A_t = a, S_t = s)$$
Again, using the law of total probability (in its conjunction form, $\Pr(A \mid B) = \sum_j \Pr(A \cap C_j \mid B)$) while conditioning this time over the next state, we have

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_r r \sum_{s'} \Pr(S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)$$
Reordering the sums, this becomes
$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s'} \sum_r \Pr(S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)\,r$$
Sutton and Barto abbreviate $\Pr(S_{t+1} = s', R_{t+1} = r \mid S_t = s, A_t = a)$ as $p(s', r \mid s, a)$ (strictly speaking, we should track the timestep parameter $t$, but we will omit this detail here). We can also combine the nested sums $\sum_{s'} \sum_r$ into a single sum that iterates over pairs $(s', r)$. So we obtain

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,r$$
This completes the part for $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$. In other words, we have shown that

$$\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,r$$
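As a numerical sanity check of this identity (my addition, not part of the derivation), here is a minimal Python sketch on a small randomly generated MDP: it compares the double sum above with a Monte Carlo estimate of $\mathbb{E}_\pi[R_{t+1} \mid S_t = s]$ obtained by simulating one step of the process. The state/action/reward sizes, the array layout of `p`, and the policy are all arbitrary choices made for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (made up for this example): 3 states, 2 actions, rewards in {0, 1}.
# p[s, a, s2, i] = Pr(S_{t+1} = s2, R_{t+1} = rewards[i] | S_t = s, A_t = a)
n_states, n_actions = 3, 2
rewards = np.array([0.0, 1.0])
p = rng.random((n_states, n_actions, n_states, len(rewards)))
p /= p.sum(axis=(2, 3), keepdims=True)          # each (s, a) slice is a distribution

pi = rng.random((n_states, n_actions))          # stochastic policy pi(a | s)
pi /= pi.sum(axis=1, keepdims=True)

s = 0

# Closed form: sum_a pi(a|s) sum_{s',r} p(s',r|s,a) r
closed_form = sum(pi[s, a] * p[s, a, s2, i] * rewards[i]
                  for a in range(n_actions)
                  for s2 in range(n_states)
                  for i in range(len(rewards)))

# Monte Carlo: sample A_t ~ pi(.|s), then (S_{t+1}, R_{t+1}) ~ p(., .|s, A_t), average R_{t+1}.
samples = []
for _ in range(50_000):
    a = rng.choice(n_actions, p=pi[s])
    flat = rng.choice(n_states * len(rewards), p=p[s, a].ravel())
    _, i = divmod(flat, len(rewards))
    samples.append(rewards[i])

print(closed_form, np.mean(samples))    # should agree up to Monte Carlo error (~0.005)
```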
Simplifying expression for return term
Now we do a similar series of steps for $\mathbb{E}_\pi[G_{t+1} \mid S_t = s]$. Conditioning over actions,

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_g g \cdot \Pr(G_{t+1} = g \mid S_t = s) = \sum_g g \sum_a \Pr(G_{t+1} = g \mid A_t = a, S_t = s)\,\pi(a \mid s)$$
Rearranging sums, this is
$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_g g \cdot \Pr(G_{t+1} = g \mid A_t = a, S_t = s)$$
Conditioning over states and rewards, we have
$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_g g \sum_{s',r} \Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)\Pr(S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)$$
Now write $\Pr(S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)$ as $p(s', r \mid s, a)$. Since $\mathbb{E}[X \mid A] = \sum_x x \cdot \Pr(X = x \mid A)$ for any discrete random variable $X$ and event $A$, we have $\sum_g g \cdot \Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \mathbb{E}[G_{t+1} \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s]$. Thus, substituting this expression and rearranging sums, we have

$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,\mathbb{E}[G_{t+1} \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s]$$
This is basically what we want, except we want to say that $\mathbb{E}[G_{t+1} \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s] = \mathbb{E}[G_{t+1} \mid S_{t+1} = s']$, since the latter expression equals $v_\pi(s')$. Thankfully, we can say this, because of the Markov assumption. Actually proving this is a little complicated; see the section at the end of this page.
So we end up with
$$\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,v_\pi(s')$$
Finishing the proof
So now we have
$$\begin{aligned}v_\pi(s) &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,r + \gamma \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\,v_\pi(s') \\ &= \sum_a \pi(a \mid s) \sum_{s',r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]\end{aligned}$$
This completes the proof. I'm pretty confused why Sutton and Barto seem to think abbreviating all these steps without comment is a good idea.
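As a final sanity check (my addition): on a small randomly generated MDP we can solve the Bellman equation above as a linear system for $v_\pi$, and separately estimate $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ straight from the definition by averaging truncated sampled returns. The toy MDP, the discount factor $\gamma = 0.9$, and all the names below are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy MDP (made up for this example): 3 states, 2 actions, rewards in {0, 1}.
# p[s, a, s2, i] = Pr(S_{t+1} = s2, R_{t+1} = rewards[i] | S_t = s, A_t = a)
n_states, n_actions = 3, 2
rewards = np.array([0.0, 1.0])
p = rng.random((n_states, n_actions, n_states, len(rewards)))
p /= p.sum(axis=(2, 3), keepdims=True)
pi = rng.random((n_states, n_actions))          # stochastic policy pi(a | s)
pi /= pi.sum(axis=1, keepdims=True)
gamma = 0.9

# Bellman equation as a linear system: v = r_pi + gamma * P_pi v, where
#   r_pi[s]     = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) r   (expected one-step reward)
#   P_pi[s, s'] = sum_a pi(a|s) sum_r p(s',r|s,a)          (state transition matrix under pi)
r_pi = np.einsum('sa,sakr,r->s', pi, p, rewards)
P_pi = np.einsum('sa,sakr->sk', pi, p)
v_bellman = np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

# Estimate v_pi(s) = E_pi[G_t | S_t = s] directly, by averaging truncated sampled returns.
def sampled_return(s, horizon=80):
    g, discount = 0.0, 1.0
    for _ in range(horizon):                    # gamma^80 is tiny, so truncation error is negligible
        a = rng.choice(n_actions, p=pi[s])
        flat = rng.choice(n_states * len(rewards), p=p[s, a].ravel())
        s, i = divmod(flat, len(rewards))
        g += discount * rewards[i]
        discount *= gamma
    return g

v_mc = np.array([np.mean([sampled_return(s) for _ in range(5000)]) for s in range(n_states)])

print(v_bellman)
print(v_mc)    # should match v_bellman up to Monte Carlo error (roughly 0.05)
```

The linear solve uses only the Bellman equation, while the Monte Carlo side uses only the definition of $G_t$, so the agreement of the two vectors is a numerical stand-in for exactly what the proof establishes.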
Using the Markov assumption
I think the following works: It suffices to show

$$\Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \Pr(G_{t+1} = g \mid S_{t+1} = s')$$

Conditioning over actions in the $(t+1)$th timestep, we have

$$\Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)\Pr(A_{t+1} = a' \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)$$

But the action is chosen by the policy, which depends only on the state, so

$$\Pr(A_{t+1} = a' \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \Pr(A_{t+1} = a' \mid S_{t+1} = s') = \pi(a' \mid s')$$

As for $\Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)$, recall that $G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots$, so it involves only rewards starting on timestep $t+2$. Since we are working with an MDP, the environment's dynamics are completely determined by the state and action on timestep $t+1$, i.e. on $S_{t+1}$ and $A_{t+1}$ (NOTE: Sutton and Barto's text seems to have a small error here, where they describe the Markovian property as "That is, the probability of each possible value for $S_t$ and $R_t$ depends only on the immediately preceding state and action, $S_{t-1}$ and $A_{t-1}$, and, given them, not at all on earlier states and actions." -- the last part should probably say "not at all on earlier states, actions, and rewards". Otherwise I'm not sure how to prove that we can ignore the conditioning on $R_{t+1}$). So this is equal to $\Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s')$. Thus we have

$$\Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s')\,\pi(a' \mid s')$$

Now we can reverse the conditionalization, i.e. we marginalize to obtain

$$\Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \Pr(G_{t+1} = g \mid S_{t+1} = s')$$
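To spell out that last marginalization step (this explicit version is my addition): it is just the conditional law of total probability from the beginning of the page, applied with $B = \{S_{t+1} = s'\}$ and the partition given by the events $C_{a'} = \{A_{t+1} = a'\}$:

$$\begin{aligned}\sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s')\,\pi(a' \mid s') &= \sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s')\Pr(A_{t+1} = a' \mid S_{t+1} = s') \\ &= \Pr(G_{t+1} = g \mid S_{t+1} = s')\end{aligned}$$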
See this question for a similar manipulation.