Bellman equation derivation

From Machinelearning
Revision as of 00:56, 1 September 2019 by IssaRice (talk | contribs)

Bellman equation for $v_\pi$.

We want to show $v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\,[r + \gamma v_\pi(s')]$ for all states $s$.
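As a numerical illustration (not part of the proof), the right-hand side of the Bellman equation for $v_\pi$ can be written as a function and iterated to a fixed point on a tiny MDP. Everything in this sketch is an assumption made for illustration: the two states, the two actions, the policy, the dynamics, and the discount factor are hypothetical and do not come from the article.

```python
# Hypothetical two-state MDP; all numbers are illustrative assumptions.
GAMMA = 0.9
STATES = ["s0", "s1"]
ACTIONS = ["stay", "move"]

# POLICY[s][a] = pi(a | s)
POLICY = {
    "s0": {"stay": 0.5, "move": 0.5},
    "s1": {"stay": 0.2, "move": 0.8},
}

# DYNAMICS[(s, a)] = list of (next_state, reward, probability),
# i.e. the joint distribution p(s', r | s, a).
DYNAMICS = {
    ("s0", "stay"): [("s0", 0.0, 1.0)],
    ("s0", "move"): [("s1", 1.0, 0.9), ("s0", 0.0, 0.1)],
    ("s1", "stay"): [("s1", 2.0, 1.0)],
    ("s1", "move"): [("s0", 0.0, 1.0)],
}


def bellman_rhs(v, s):
    """Right-hand side of the Bellman equation for v_pi, evaluated at state s."""
    return sum(
        POLICY[s][a]
        * sum(p * (r + GAMMA * v[s2]) for s2, r, p in DYNAMICS[(s, a)])
        for a in ACTIONS
    )


def evaluate_policy(tol=1e-12, max_iters=100_000):
    """Apply the Bellman right-hand side repeatedly until the values stop changing."""
    v = {s: 0.0 for s in STATES}
    for _ in range(max_iters):
        new_v = {s: bellman_rhs(v, s) for s in STATES}
        if max(abs(new_v[s] - v[s]) for s in STATES) < tol:
            return new_v
        v = new_v
    return v


v_pi = evaluate_policy()

# At the fixed point, v_pi(s) should equal the right-hand side for every state s.
residual = max(abs(v_pi[s] - bellman_rhs(v_pi, s)) for s in STATES)
print(v_pi, residual)
```

The iteration converges because the discount factor is below 1. The check at the end only confirms that the computed values satisfy the equation; it says nothing about why the equation holds, which is what the derivation addresses.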

The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.

The law of total probability states that if $B$ is an event, and $C_1, \ldots, C_n$ are events that partition the sample space, then $\Pr(B) = \sum_{j=1}^n \Pr(B \mid C_j)\Pr(C_j)$.

For a fixed event $A$ with non-zero probability, the mapping $B \mapsto \Pr(B \mid A)$ is another valid probability measure. In other words, define $\Pr_A$ by $\Pr_A(B) := \Pr(B \mid A)$ for all events $B$. Now the law of total probability for $\Pr_A$ states that $\Pr_A(B) = \sum_{j=1}^n \Pr_A(B \mid C_j)\Pr_A(C_j)$. We also have

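$$\Pr_A(B \mid C_j) = \frac{\Pr_A(B \cap C_j)}{\Pr_A(C_j)} = \frac{\Pr(B \cap C_j \mid A)}{\Pr(C_j \mid A)} = \frac{\Pr(B \cap C_j \cap A)/\Pr(A)}{\Pr(C_j \cap A)/\Pr(A)} = \frac{\Pr(B \cap (C_j \cap A))}{\Pr(C_j \cap A)} = \Pr(B \mid C_j, A)$$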

So the law of total probability states that $\Pr(B \mid A) = \sum_{j=1}^n \Pr(B \mid C_j, A)\Pr(C_j \mid A)$.
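The conditional form of the law of total probability can be sanity-checked on a small finite sample space. The sketch below uses two fair six-sided dice. The events chosen for A, B, and the partition C_1, ..., C_6 are assumptions made for illustration and do not come from the article. Exact rational arithmetic is used so the two sides can be compared with equality.

```python
from fractions import Fraction
from itertools import product

# Sample space: ordered outcomes of two fair six-sided dice (illustrative assumption).
omega = list(product(range(1, 7), repeat=2))


def pr(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event), len(omega))


def pr_given(event, given):
    """Conditional probability Pr(event | given); requires Pr(given) > 0."""
    return pr(event & given) / pr(given)


A = {w for w in omega if w[0] + w[1] >= 7}  # the sum is at least 7
B = {w for w in omega if w[0] == w[1]}      # both dice show the same value
# C_1, ..., C_6 partition the sample space by the value of the first die.
C = [{w for w in omega if w[0] == j} for j in range(1, 7)]

# Left-hand side: Pr(B | A).
lhs = pr_given(B, A)

# Right-hand side: sum over j of Pr(B | C_j, A) * Pr(C_j | A).
# Terms with Pr(C_j and A) = 0 are skipped, since the conditional is undefined there
# and the term would be weighted by Pr(C_j | A) = 0 anyway.
rhs = sum(pr_given(B, Cj & A) * pr_given(Cj, A) for Cj in C if pr(Cj & A) > 0)

print(lhs, rhs)
```

Here conditioning on both C_j and A is implemented as conditioning on the intersection of the two events, which matches the last step of the chain of equalities in the derivation.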