Bellman equation derivation: Difference between revisions

From Machinelearning
No edit summary
No edit summary
Line 43: Line 43:
:<math>\sum_a \pi(a\mid s) \sum_{s'} \sum_r \Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s) r</math>
:<math>\sum_a \pi(a\mid s) \sum_{s'} \sum_r \Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s) r</math>


Sutton and Barto write <math>\Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s)</math> as <math>p(s',r \mid s,a)</math> (strictly speaking, we should track the timestep parameter <math>t</math>) so we have
Sutton and Barto write <math>\Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s)</math> as <math>p(s',r \mid s,a)</math> (strictly speaking, we should track the timestep parameter <math>t</math>, but we will omit this detail here) so we have


:<math>\sum_a \pi(a\mid s) \sum_{s'} \sum_r p(s',r \mid s,a) r</math>
:<math>\sum_a \pi(a\mid s) \sum_{s'} \sum_r p(s',r \mid s,a) r</math>

Revision as of 01:42, 1 September 2019

Bellman equation for .

We want to show for all states .

The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.

The law of total probability states that if is an event, and are events that partition the sample space, then .

For fixed event with non-zero probability, the mapping is another valid probability measure. In other words, define by for all events . Now the law of total probability for states that . We also have

So the law of total probability states that .

Now we see how the law of total probability interacts with conditional expectation. Let be a random variable. Then

Here the event is playing the role of in the statement of the conditional law of total probability.

This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.

By definition, . Now rewrite and use the linearity of expectation to get . From here, we can work separately with and for a while.

Using the law of total probability while conditioning over actions, we have

Using the convention that , this becomes

Now we can reorder the sums to get

Again, using the law of total probability (in its conjunction form) while conditioning this time over states, we have

Reordering the sums, this becomes

Sutton and Barto write as (strictly speaking, we should track the timestep parameter , but we will omit this detail here) so we have