Bellman equation derivation

Bellman equation for <math>v_\pi</math>.

We want to show

:<math>v_\pi(s) = \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) [r + \gamma v_\pi(s')]</math>

for all states <math>s</math>.

The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.

The law of total probability states that if <math>B</math> is an event, and <math>C_1, \ldots, C_n</math> are events that partition the sample space, then

:<math>\Pr(B) = \sum_{j=1}^n \Pr(B \cap C_j) = \sum_{j=1}^n \Pr(B \mid C_j) \Pr(C_j)</math>
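
As a quick sanity check, here is a small numerical illustration of this identity (our own toy example, not part of the derivation): a fair six-sided die, with <math>B</math> the event that the roll is at least 5, and the partition given by parity.

<syntaxhighlight lang="python">
# Toy check of Pr(B) = sum_j Pr(B & C_j) = sum_j Pr(B | C_j) Pr(C_j)
# on a fair six-sided die. All events here are made up for illustration.
from fractions import Fraction

omega = range(1, 7)
pr = {w: Fraction(1, 6) for w in omega}  # uniform probability measure

def prob(event):
    return sum(pr[w] for w in omega if event(w))

def cond(event, given):
    return prob(lambda w: event(w) and given(w)) / prob(given)

B = lambda w: w >= 5                              # "roll is at least 5"
C = [lambda w: w % 2 == 0, lambda w: w % 2 == 1]  # partition by parity

lhs = prob(B)
mid = sum(prob(lambda w, Cj=Cj: B(w) and Cj(w)) for Cj in C)
rhs = sum(cond(Cj := Cj, Cj) * 0 if False else cond(B, Cj) * prob(Cj) for Cj in C)
assert lhs == mid == rhs == Fraction(1, 3)
</syntaxhighlight>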

For a fixed event <math>A</math> with non-zero probability, the mapping <math>B \mapsto \Pr(B \mid A)</math> is another valid probability measure. In other words, define <math>\Pr_A</math> by <math>\Pr_A(B) := \Pr(B \mid A)</math> for all events <math>B</math>. Now the law of total probability for <math>\Pr_A</math> states that <math>\Pr_A(B) = \sum_{j=1}^n \Pr_A(B \mid C_j) \Pr_A(C_j)</math>. We also have

:<math>\Pr_A(B \mid C_j) = \frac{\Pr_A(B \cap C_j)}{\Pr_A(C_j)} = \frac{\Pr(B \cap C_j \mid A)}{\Pr(C_j \mid A)} = \frac{\Pr(B \cap C_j \cap A)/\Pr(A)}{\Pr(C_j \cap A)/\Pr(A)} = \frac{\Pr(B \cap (C_j \cap A))}{\Pr(C_j \cap A)} = \Pr(B \mid C_j, A)</math>

So the law of total probability states that

:<math>\Pr(B \mid A) = \sum_{j=1}^n \Pr(B \mid C_j, A) \Pr(C_j \mid A)</math>

Now we see how the law of total probability interacts with conditional expectation. Let <math>X</math> be a random variable. Then

:<math>\mathbb E[X \mid A] = \sum_x x \Pr(X=x \mid A) = \sum_x x \sum_j \Pr(X=x \mid C_j, A) \Pr(C_j \mid A)</math>

Here the event <math>X=x</math> is playing the role of <math>B</math> in the statement of the conditional law of total probability.
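
To make this concrete, here is a small numerical sketch (again our own toy example): two fair coin flips, with <math>X</math> the number of heads, <math>A</math> the event that the first flip is heads, and the partition given by the outcome of the second flip.

<syntaxhighlight lang="python">
# Toy check of E[X | A] = sum_x x * sum_j Pr(X=x | C_j, A) Pr(C_j | A)
# on two fair coin flips. The events are made up for illustration.
from fractions import Fraction
from itertools import product

omega = list(product("HT", repeat=2))
pr = {w: Fraction(1, 4) for w in omega}  # uniform measure on 4 outcomes

def prob(event):
    return sum(pr[w] for w in omega if event(w))

def cond(event, given):
    return prob(lambda w: event(w) and given(w)) / prob(given)

X = lambda w: w.count("H")   # number of heads
A = lambda w: w[0] == "H"    # first flip is heads
C = [lambda w: w[1] == "H", lambda w: w[1] == "T"]  # partition on second flip

lhs = sum(x * cond(lambda w, x=x: X(w) == x, A) for x in (0, 1, 2))
rhs = sum(x * cond(lambda w, x=x: X(w) == x, lambda w, Cj=Cj: Cj(w) and A(w))
            * cond(Cj, A)
          for x in (0, 1, 2) for Cj in C)
assert lhs == rhs == Fraction(3, 2)
</syntaxhighlight>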

This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.

By definition, <math>v_\pi(s) = \mathbb E_\pi[G_t \mid S_t=s]</math>, where <math>G_t = \sum_{k=0}^\infty \gamma^k R_{t+k+1}</math> is the discounted return. Now rewrite <math>G_t = R_{t+1} + \gamma G_{t+1}</math> and use the linearity of expectation to get <math>\mathbb E_\pi[R_{t+1} \mid S_t=s] + \gamma \mathbb E_\pi[G_{t+1} \mid S_t=s]</math>. From here, we can work separately with <math>\mathbb E_\pi[R_{t+1} \mid S_t=s]</math> and <math>\mathbb E_\pi[G_{t+1} \mid S_t=s]</math> for a while.

Using the law of total probability while conditioning over actions, we have

:<math>\mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_r r \Pr(R_{t+1}=r \mid S_t=s) = \sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a, S_t=s) \Pr(A_t=a \mid S_t=s)</math>

Using the convention that <math>\Pr(A_t=a \mid S_t=s) = \pi(a\mid s)</math>, this becomes

:<math>\mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a, S_t=s) \pi(a\mid s)</math>

Now we can reorder the sums to get

:<math>\sum_a \pi(a\mid s) \sum_r r \Pr(R_{t+1}=r \mid A_t=a, S_t=s)</math>

Again, using the law of total probability (in its conjunction form, <math>\Pr(B \mid A) = \sum_j \Pr(B \cap C_j \mid A)</math>), this time partitioning over next states <math>s'</math>, we have

:<math>\sum_a \pi(a\mid s) \sum_r r \sum_{s'} \Pr(R_{t+1}=r, S_{t+1}=s' \mid A_t=a, S_t=s)</math>

Reordering the sums, this becomes

:<math>\sum_a \pi(a\mid s) \sum_{s'} \sum_r \Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s) r</math>

Sutton and Barto abbreviate <math>\Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s)</math> as <math>p(s',r \mid s,a)</math> (strictly speaking, we should track the timestep parameter <math>t</math>, but we will omit this detail here). We can also combine the nested sums <math display="inline">\sum_{s'}\sum_r</math> into a single sum that iterates over pairs <math>(s',r)</math>. So we obtain

:<math>\sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) r</math>

This completes the part for <math>\mathbb E_\pi[R_{t+1} \mid S_t=s]</math>. In other words, we have shown that

:<math>\mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) r</math>
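
Finally, as a sanity check on this identity, here is a sketch that evaluates both sides on a small made-up MDP (two states, two actions; the policy and transition probabilities below are arbitrary choices of ours, not anything from Sutton and Barto).

<syntaxhighlight lang="python">
# Toy check of E_pi[R_{t+1} | S_t=s] = sum_a pi(a|s) sum_{s',r} p(s',r|s,a) r
# on a made-up two-state, two-action MDP; every number below is arbitrary.
from fractions import Fraction as F

states, actions, rewards = ["s0", "s1"], ["a0", "a1"], [F(0), F(1)]

pi = {"s0": {"a0": F(1, 2), "a1": F(1, 2)},   # pi[s][a] = pi(a|s)
      "s1": {"a0": F(1, 4), "a1": F(3, 4)}}

# p[(s, a)][(s2, r)] = p(s', r | s, a); each inner dict sums to 1
p = {("s0", "a0"): {("s0", F(0)): F(1, 2), ("s1", F(1)): F(1, 2)},
     ("s0", "a1"): {("s1", F(1)): F(1)},
     ("s1", "a0"): {("s0", F(0)): F(1)},
     ("s1", "a1"): {("s0", F(1)): F(1, 3), ("s1", F(0)): F(2, 3)}}

s = "s0"

# Left-hand side: recover Pr(R_{t+1}=r | S_t=s) by summing out the action
# and the next state, then take the expectation of R_{t+1} directly.
pr_r = {r: sum(pi[s][a] * p[(s, a)].get((s2, r), F(0))
               for a in actions for s2 in states)
        for r in rewards}
lhs = sum(r * pr_r[r] for r in rewards)

# Right-hand side: the expression we just derived.
rhs = sum(pi[s][a] * sum(prob * r for (s2, r), prob in p[(s, a)].items())
          for a in actions)

assert lhs == rhs == F(3, 4)
</syntaxhighlight>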