Bellman equation derivation

Bellman equation for <math>v_\pi</math>

We want to show <math display="inline">v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]</math> for all states <math>s</math>.

The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.

Law of total probability

The law of total probability states that if <math>B</math> is an event, and <math>C_1, \ldots, C_n</math> are events that partition the sample space, then <math display="inline">\Pr(B) = \sum_{j=1}^n \Pr(B \cap C_j) = \sum_{j=1}^n \Pr(B \mid C_j)\Pr(C_j)</math>.

For fixed event <math>A</math> with non-zero probability, the mapping <math>B \mapsto \Pr(B \mid A)</math> is another valid probability measure, so the law of total probability also applies to it. Define <math display="inline">\Pr_A</math> by <math display="inline">\Pr_A(B) := \Pr(B \mid A)</math> for all events <math>B</math>. Now the law of total probability for <math display="inline">\Pr_A</math> states that <math display="inline">\Pr_A(B) = \sum_{j=1}^n \Pr_A(B \mid C_j)\Pr_A(C_j)</math>. We also have

:<math>\Pr_A(B \mid C_j) = \frac{\Pr_A(B \cap C_j)}{\Pr_A(C_j)} = \frac{\Pr(B\cap C_j \mid A)}{\Pr(C_j\mid A)} = \frac{\Pr(B \cap C_j \cap A)/\Pr(A)}{\Pr(C_j \cap A)/\Pr(A)} = \frac{\Pr(B\cap (C_j\cap A))}{\Pr(C_j \cap A)} = \Pr(B \mid C_j,A)</math>

So the law of total probability states that <math display="inline">\Pr(B \mid A) = \sum_{j=1}^n \Pr(B \mid C_j, A)\Pr(C_j \mid A)</math>.

Now we see how the law of total probability interacts with conditional expectation. Let <math>X</math> be a random variable. Then

:<math>\mathbb{E}[X \mid A] = \sum_x x \Pr(X = x \mid A) = \sum_x x \sum_j \Pr(X = x \mid C_j, A)\Pr(C_j \mid A)</math>

Here the event <math>X = x</math> is playing the role of <math>B</math> in the statement of the conditional law of total probability.

This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.
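As a quick sanity check, here is a minimal Python sketch that numerically verifies the conditional law of total probability and its conditional-expectation form. The toy sample space (a single fair die) and the events <math>A</math>, <math>C_1</math>, <math>C_2</math> and variable <math>X</math> below are invented purely for illustration.

<syntaxhighlight lang="python">
# Numerical check of the conditional law of total probability and of
#   E[X | A] = sum_x x * sum_j Pr(X = x | C_j, A) Pr(C_j | A)
# on a toy sample space (one fair die). All events below are made up.
from fractions import Fraction

omega = set(range(1, 7))                  # sample space: one fair die
prob = {w: Fraction(1, 6) for w in omega}

def pr(event):
    """Probability of a set of outcomes."""
    return sum(prob[w] for w in event)

def pr_cond(event, given):
    """Pr(event | given)."""
    return pr(event & given) / pr(given)

A = {2, 4, 6}               # conditioning event: the roll is even
C1 = {1, 2, 3}              # C1, C2 partition the sample space
C2 = {4, 5, 6}
X = {w: w for w in omega}   # random variable: the face value

# Conditional law of total probability: Pr(B|A) = sum_j Pr(B|C_j,A) Pr(C_j|A)
B = {4, 5}
lhs = pr_cond(B, A)
rhs = sum(pr_cond(B, C & A) * pr_cond(C, A) for C in (C1, C2))
assert lhs == rhs

# Conditional-expectation form of the same identity
e_lhs = sum(X[w] * prob[w] for w in A) / pr(A)
e_rhs = sum(
    x * sum(pr_cond({w for w in omega if X[w] == x}, C & A) * pr_cond(C, A)
            for C in (C1, C2))
    for x in set(X.values())
)
assert e_lhs == e_rhs
print(lhs, e_lhs)  # 1/3 and 4
</syntaxhighlight>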

Start of proof

By definition, <math display="inline">v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]</math>. Now rewrite <math display="inline">G_t = R_{t+1} + \gamma G_{t+1}</math> and use the linearity of expectation to get <math display="inline">\mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}_\pi[G_{t+1} \mid S_t = s]</math>. From here, we can work separately with <math display="inline">\mathbb{E}_\pi[R_{t+1} \mid S_t = s]</math> and <math display="inline">\mathbb{E}_\pi[G_{t+1} \mid S_t = s]</math> for a while.
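Written out as a single chain of equalities, this first step is

:<math>v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} \mid S_t = s] + \gamma \mathbb{E}_\pi[G_{t+1} \mid S_t = s]</math>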

Simplifying expectation for reward term

Using the law of total probability while conditioning over actions, we have

:<math>\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_r r \Pr(R_{t+1} = r \mid S_t = s) = \sum_r r \sum_a \Pr(R_{t+1} = r \mid A_t = a, S_t = s)\Pr(A_t = a \mid S_t = s)</math>

Using the convention that <math display="inline">\Pr(A_t = a \mid S_t = s) = \pi(a \mid s)</math>, this becomes

:<math>\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_r r \sum_a \Pr(R_{t+1} = r \mid A_t = a, S_t = s)\,\pi(a \mid s)</math>

Now we can reorder the sums to get

:<math>\sum_a \pi(a \mid s) \sum_r r \Pr(R_{t+1} = r \mid A_t = a, S_t = s)</math>

Again, using the law of total probability (in its conjunction form) while conditioning this time over states, we have

:<math>\sum_a \pi(a \mid s) \sum_r r \sum_{s'} \Pr(R_{t+1} = r, S_{t+1} = s' \mid A_t = a, S_t = s)</math>

Reordering the sums, this becomes

:<math>\sum_a \pi(a \mid s) \sum_{s'} \sum_r \Pr(R_{t+1} = r, S_{t+1} = s' \mid A_t = a, S_t = s)\, r</math>

Sutton and Barto abbreviate <math display="inline">\Pr(R_{t+1} = r, S_{t+1} = s' \mid A_t = a, S_t = s)</math> as <math display="inline">p(s', r \mid s, a)</math> (strictly speaking, we should track the timestep parameter <math>t</math>, but we will omit this detail here). We can also combine the nested sums <math display="inline">\sum_{s'} \sum_r</math> into a single sum that iterates over pairs <math>(s', r)</math>. So we obtain

:<math>\sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r</math>

This completes the part for <math display="inline">\mathbb{E}_\pi[R_{t+1} \mid S_t = s]</math>. In other words, we have shown that

:<math>\mathbb{E}_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r</math>
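To make this identity concrete, here is a rough Monte Carlo check on a tiny two-state MDP. The states, actions, transition probabilities, and policy below are invented purely for illustration; only the identity being checked comes from the derivation above.

<syntaxhighlight lang="python">
# Monte Carlo check of
#   E_pi[R_{t+1} | S_t = s] = sum_a pi(a|s) sum_{s', r} p(s', r | s, a) * r
# on a made-up two-state MDP.
import random

states = ["s0", "s1"]
actions = ["left", "right"]

# p[(s, a)] is a list of (probability, next_state, reward) triples,
# i.e. a tabular p(s', r | s, a).
p = {
    ("s0", "left"):  [(0.7, "s0", 0.0), (0.3, "s1", 1.0)],
    ("s0", "right"): [(0.4, "s0", 2.0), (0.6, "s1", -1.0)],
    ("s1", "left"):  [(1.0, "s0", 0.5)],
    ("s1", "right"): [(0.5, "s0", 0.0), (0.5, "s1", 3.0)],
}

# pi[s][a] = pi(a | s)
pi = {
    "s0": {"left": 0.25, "right": 0.75},
    "s1": {"left": 0.9, "right": 0.1},
}

def expected_reward(s):
    """Right-hand side: sum_a pi(a|s) sum_{s', r} p(s', r | s, a) * r."""
    return sum(pi[s][a] * sum(q * r for q, _, r in p[(s, a)]) for a in actions)

def sample_reward(s, rng):
    """One sample of R_{t+1}: draw A_t ~ pi(.|s), then (S_{t+1}, R_{t+1})."""
    a = rng.choices(actions, weights=[pi[s][a] for a in actions])[0]
    probs, _, rewards = zip(*p[(s, a)])
    return rng.choices(rewards, weights=probs)[0]

rng = random.Random(0)
for s in states:
    estimate = sum(sample_reward(s, rng) for _ in range(200_000)) / 200_000
    print(s, round(estimate, 3), round(expected_reward(s), 3))  # should be close
</syntaxhighlight>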

Simplifying expression for return term

Now we do a similar series of steps for <math display="inline">\mathbb{E}_\pi[G_{t+1} \mid S_t = s]</math>. Conditioning over actions,

:<math>\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_g g \Pr(G_{t+1} = g \mid S_t = s) = \sum_g g \sum_a \Pr(G_{t+1} = g \mid A_t = a, S_t = s)\,\pi(a \mid s)</math>

Rearranging sums, this is

:<math>\sum_a \pi(a \mid s) \sum_g g \Pr(G_{t+1} = g \mid A_t = a, S_t = s)</math>

Conditioning over states and rewards, we have

:<math>\sum_a \pi(a \mid s) \sum_g g \sum_{s', r} \Pr(G_{t+1} = g, S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)</math>

Now write <math display="inline">\Pr(G_{t+1} = g, S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)</math> as <math display="inline">\Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)\Pr(S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)</math>. Since <math display="inline">\Pr(S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s) = p(s', r \mid s, a)</math>, we have <math display="inline">\Pr(G_{t+1} = g, S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s) = \Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)\, p(s', r \mid s, a)</math>. Thus, substituting this expression and rearranging sums, we have

:<math>\sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a) \sum_g g \Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)</math>

This is basically what we want, except we want to say that <math display="inline">\sum_g g \Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s']</math>, since the latter expression equals <math display="inline">v_\pi(s')</math>. Thankfully, we can say this because of the Markov assumption. Actually proving this is a little complicated; see the section at the end of this page.

So we end up with

:<math>\sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, v_\pi(s')</math>
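In other words, paralleling the result for the reward term, we have shown that

:<math>\mathbb{E}_\pi[G_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, v_\pi(s')</math>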

Finishing the proof

So now we have

:<math>v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, r + \gamma \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, v_\pi(s') = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\bigl[r + \gamma v_\pi(s')\bigr]</math>

This completes the proof. I'm not sure why Sutton and Barto think abbreviating all of these steps without comment is a good idea.
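For a numerical sanity check of the final identity, the following sketch solves for <math display="inline">v_\pi</math> exactly on a small invented MDP by solving the linear system <math display="inline">v = r_\pi + \gamma P_\pi v</math> (where <math>r_\pi</math> is the vector of expected one-step rewards and <math>P_\pi</math> is the state-transition matrix under <math>\pi</math>; both are shorthand introduced here, not notation from the text above), and then confirms that plugging <math display="inline">v_\pi</math> into the right-hand side of the Bellman equation reproduces it. The MDP, policy, and <math>\gamma</math> below are arbitrary choices for illustration.

<syntaxhighlight lang="python">
# Sanity check of
#   v_pi(s) = sum_a pi(a|s) sum_{s', r} p(s', r | s, a) [r + gamma * v_pi(s')]
# on a made-up two-state MDP.
import numpy as np

states = ["s0", "s1"]
actions = ["left", "right"]
gamma = 0.9

# Tabular p(s', r | s, a) as lists of (probability, next_state, reward).
p = {
    ("s0", "left"):  [(0.7, "s0", 0.0), (0.3, "s1", 1.0)],
    ("s0", "right"): [(0.4, "s0", 2.0), (0.6, "s1", -1.0)],
    ("s1", "left"):  [(1.0, "s0", 0.5)],
    ("s1", "right"): [(0.5, "s0", 0.0), (0.5, "s1", 3.0)],
}
pi = {"s0": {"left": 0.25, "right": 0.75}, "s1": {"left": 0.9, "right": 0.1}}
idx = {s: i for i, s in enumerate(states)}

# Build r_pi(s) = E_pi[R_{t+1} | S_t = s] and P_pi(s, s') = Pr(S_{t+1} = s' | S_t = s).
r_pi = np.zeros(len(states))
P_pi = np.zeros((len(states), len(states)))
for s in states:
    for a in actions:
        for q, s_next, r in p[(s, a)]:
            r_pi[idx[s]] += pi[s][a] * q * r
            P_pi[idx[s], idx[s_next]] += pi[s][a] * q

# Exact state values: v = (I - gamma * P_pi)^{-1} r_pi.
v = np.linalg.solve(np.eye(len(states)) - gamma * P_pi, r_pi)

# Right-hand side of the Bellman equation, evaluated at v.
rhs = np.array([
    sum(pi[s][a] * sum(q * (r + gamma * v[idx[s_next]]) for q, s_next, r in p[(s, a)])
        for a in actions)
    for s in states
])
assert np.allclose(v, rhs)
print(dict(zip(states, np.round(v, 4))))
</syntaxhighlight>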

Using the Markov assumption

I think the following works: It suffices to show <math display="inline">\Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \Pr(G_{t+1} = g \mid S_{t+1} = s')</math>. Conditioning over actions in the <math>t+1</math> timestep, we have <math display="inline">\sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)\Pr(A_{t+1} = a' \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)</math>. But the action is chosen by the policy, which depends only on the current state, so <math display="inline">\Pr(A_{t+1} = a' \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s) = \pi(a' \mid s')</math>. As for <math display="inline">\Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)</math>, recall that <math display="inline">G_{t+1} = R_{t+2} + \gamma R_{t+3} + \cdots</math>, so it involves only rewards from timestep <math>t+2</math> onward. Since we are working with an MDP, the environment's dynamics are completely determined by the state and action at timestep <math>t+1</math>, i.e. by <math>s'</math> and <math>a'</math> (NOTE: Sutton and Barto's text seems to have a small error here, where they describe the Markov property as "That is, the probability of each possible value for <math>S_t</math> and <math>R_t</math> depends only on the immediately preceding state and action, <math>S_{t-1}</math> and <math>A_{t-1}</math>, and, given them, not at all on earlier states and actions." -- the last part should probably say "not at all on earlier states, actions, and rewards". Otherwise I'm not sure how to prove that we can ignore the conditioning on <math display="inline">R_{t+1} = r</math>). So this probability is equal to <math display="inline">\Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s')</math>. Thus we have <math display="inline">\sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s')\,\pi(a' \mid s')</math>. Now we can reverse the conditioning, i.e. we marginalize over <math>a'</math> to obtain <math display="inline">\Pr(G_{t+1} = g \mid S_{t+1} = s')</math>.
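In summary, the chain of equalities in this argument is (with <math>a'</math> ranging over actions at timestep <math>t+1</math>):

:<math>\begin{align}
\Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)
&= \sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)\,\pi(a' \mid s') \\
&= \sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1} = a', S_{t+1} = s')\,\pi(a' \mid s') \\
&= \Pr(G_{t+1} = g \mid S_{t+1} = s')
\end{align}</math>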

See this question for a similar manipulation.