This is a proof of the Bellman equation for <math>v_\pi</math>. We want to show

:<math>v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s'} \sum_r \Pr(S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s)\left[r + \gamma v_\pi(s')\right]</math>

for all states <math>s</math>.
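Before the proof, here is a minimal numerical sketch (not part of the original argument): on a small, made-up two-state MDP with a fixed policy, a Monte Carlo estimate of the value function <math>v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t=s]</math> (defined below) agrees with the solution of the Bellman equation above, up to sampling noise. The MDP, the policy, the discount factor, and all variable names are hypothetical choices for illustration only.

<syntaxhighlight lang="python">
import numpy as np

rng = np.random.default_rng(0)
gamma = 0.9

# p[s][a] is a list of (probability, next_state, reward) triples for a toy MDP.
p = {
    0: {0: [(0.8, 0, 1.0), (0.2, 1, 0.0)],
        1: [(1.0, 1, 0.5)]},
    1: {0: [(1.0, 0, 2.0)],
        1: [(0.6, 1, 0.0), (0.4, 0, 1.0)]},
}
pi = {0: [0.5, 0.5], 1: [0.3, 0.7]}   # pi[s][a] = probability of action a in state s

# v_pi as the solution of the Bellman equation:
# v(s) = sum_a pi(a|s) sum_{s',r} Pr(s',r | s,a) [r + gamma v(s')].
n = len(p)
P_pi = np.zeros((n, n))
r_pi = np.zeros(n)
for s in p:
    for a, pa in enumerate(pi[s]):
        for prob, s2, r in p[s][a]:
            P_pi[s, s2] += pa * prob
            r_pi[s] += pa * prob * r
v_bellman = np.linalg.solve(np.eye(n) - gamma * P_pi, r_pi)

# Monte Carlo estimate of v_pi(s) = E[G_t | S_t = s] directly from its definition.
def rollout(s, horizon=100):
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = rng.choice(len(pi[s]), p=pi[s])
        triples = p[s][a]
        i = rng.choice(len(triples), p=[t[0] for t in triples])
        _, s_next, r = triples[i]
        g += discount * r
        discount *= gamma
        s = s_next
    return g

v_mc = np.array([np.mean([rollout(s) for _ in range(5000)]) for s in sorted(p)])
print("Bellman fixed point :", v_bellman)
print("Monte Carlo estimate:", v_mc)   # agrees with v_bellman up to sampling noise
</syntaxhighlight>

The linear solve uses the Bellman equation in matrix form, <math>v_\pi = r_\pi + \gamma P_\pi v_\pi</math>, which is what the elementwise equation above becomes after collecting terms over actions, next states, and rewards.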
The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.
The law of total probability states that if <math>A</math> is an event, and <math>C_1, C_2, \ldots</math> are events that partition the sample space, then

:<math>\Pr(A) = \sum_j \Pr(A \mid C_j)\Pr(C_j).</math>

For a fixed event <math>B</math> with non-zero probability, the mapping <math>E \mapsto \Pr(E \mid B)</math> is another valid probability measure. In other words, define <math>\Pr_B</math> by <math>\Pr_B(E) = \Pr(E \mid B)</math> for all events <math>E</math>. Now the law of total probability for <math>\Pr_B</math> states that <math>\Pr_B(A) = \sum_j \Pr_B(A \mid C_j)\Pr_B(C_j)</math>. We also have

:<math>\Pr_B(A \mid C_j) = \frac{\Pr_B(A \cap C_j)}{\Pr_B(C_j)} = \frac{\Pr(A \cap C_j \mid B)}{\Pr(C_j \mid B)} = \frac{\Pr(A \cap C_j \cap B)}{\Pr(C_j \cap B)} = \Pr(A \mid C_j, B).</math>

So the law of total probability states that

:<math>\Pr(A \mid B) = \sum_j \Pr(A \mid C_j, B)\Pr(C_j \mid B).</math>
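As a quick sanity check (not in the original argument), take a fair six-sided die, let <math>B</math> be the event that the roll is even, let <math>A</math> be the event that the roll is at least 4, and let <math>C_j = \{j\}</math> for <math>j = 1, \ldots, 6</math>. Then

:<math>\Pr(A \mid B) = \tfrac{2}{3} \qquad \text{and} \qquad \sum_j \Pr(A \mid C_j, B)\Pr(C_j \mid B) = 0 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{3} + 1 \cdot \tfrac{1}{3} = \tfrac{2}{3},</math>

where only <math>j = 2, 4, 6</math> contribute, since <math>\Pr(C_j \mid B) = 0</math> for odd <math>j</math>.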
Now we see how the law of total probability interacts with conditional expectation. Let <math>X</math> be a random variable. Then

:<math>\mathbb{E}[X \mid A] = \sum_x x \cdot \Pr(X=x \mid A) = \sum_x x \sum_j \Pr(X=x \mid C_j, A)\Pr(C_j \mid A).</math>

Here the event <math>A</math> is playing the role of <math>B</math> in the statement of the conditional law of total probability.
This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.
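To make the trick concrete, here is a short numerical check (again, not part of the original proof) of the conditional-expectation identity above on a toy finite probability space; the sample space, events, and names below are invented for this example.

<syntaxhighlight lang="python">
from itertools import product

# Toy finite probability space: two independent fair four-sided dice,
# so each of the 16 outcomes has probability 1/16.
outcomes = list(product(range(1, 5), range(1, 5)))
prob = {w: 1 / 16 for w in outcomes}

def X(w):                                  # random variable: sum of the two dice
    return w[0] + w[1]

A = {w for w in outcomes if w[0] >= 2}     # conditioning event
C = [{w for w in outcomes if w[1] == j} for j in range(1, 5)]  # partition by 2nd die

def pr(event):
    return sum(prob[w] for w in event)

def pr_cond(event, given):                 # Pr(event | given)
    return pr(event & given) / pr(given)

# E[X | A] computed directly from the definition.
direct = sum(X(w) * prob[w] for w in A) / pr(A)

# E[X | A] computed via the conditional law of total probability:
# sum_x x * sum_j Pr(X = x | C_j, A) Pr(C_j | A).
values = sorted({X(w) for w in outcomes})
via_partition = sum(
    x * sum(pr_cond({w for w in outcomes if X(w) == x}, Cj & A) * pr_cond(Cj, A)
            for Cj in C)
    for x in values
)

print(direct, via_partition)               # both print the same number
</syntaxhighlight>

In the proof, <math>X</math> is a reward or a return, the conditioning event is of the form <math>\{S_t = s\}</math>, and the partitions range over actions and successor states.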
By definition, <math>v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]</math>. Now rewrite <math>G_t = R_{t+1} + \gamma G_{t+1}</math> and use the linearity of expectation to get

:<math>v_\pi(s) = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s].</math>

From here, we can work separately with <math>\mathbb{E}_\pi[R_{t+1} \mid S_t = s]</math> and <math>\mathbb{E}_\pi[G_{t+1} \mid S_t = s]</math> for a while.
Using the law of total probability while conditioning over actions, we have
:<math>\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_r r \cdot \Pr(R_{t+1}=r \mid S_t = s) = \sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a, S_t=s)\Pr(A_t=a \mid S_t=s)</math>
Using the convention that <math>\pi(a \mid s) = \Pr(A_t=a \mid S_t=s)</math>, this becomes

:<math>\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a, S_t=s)\,\pi(a \mid s)</math>
Now we can reorder the sums to get

:<math>\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_r r \Pr(R_{t+1}=r \mid A_t=a, S_t=s)</math>
Again, using the law of total probability while conditioning this time over states, we have

:<math>\sum_a \pi(a \mid s) \sum_r r \Pr(R_{t+1}=r \mid A_t=a, S_t=s) = \sum_a \pi(a \mid s) \sum_r r \sum_{s'} \Pr(R_{t+1}=r \mid S_{t+1}=s', A_t=a, S_t=s)\Pr(S_{t+1}=s' \mid A_t=a, S_t=s)</math>