Bellman equation derivation

Bellman equation for <math>v_\pi</math>.

We want to show <math display="inline">v_\pi(s) = \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) [r + \gamma v_\pi(s')]</math> for all states <math>s</math>.

The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.


==Law of total probability==


The law of total probability states that if <math>B</math> is an event, and <math>C_1, \ldots, C_n</math> are events that partition the sample space, then <math display="inline">\Pr(B) = \sum_{j=1}^n \Pr(B \cap C_j) = \sum_{j=1}^n \Pr(B \mid C_j)\Pr(C_j)</math>.
 
For fixed event <math>A</math> with non-zero probability, the mapping <math>B \mapsto \Pr(B \mid A)</math> is another valid probability measure, so the law of total probability also applies to it. Define <math display="inline">\Pr_A</math> by <math display="inline">\Pr_A(B) := \Pr(B \mid A)</math> for all events <math>B</math>. Now the law of total probability for <math display="inline">\Pr_A</math> states that <math display="inline">\Pr_A(B) = \sum_{j=1}^n \Pr_A(B \mid C_j)\Pr_A(C_j)</math>. We also have


:<math>\Pr_A(B \mid C_j) = \frac{\Pr_A(B \cap C_j)}{\Pr_A(C_j)} = \frac{\Pr(B\cap C_j \mid A)}{\Pr(C_j\mid A)} = \frac{\Pr(B \cap C_j \cap A)/\Pr(A)}{\Pr(C_j \cap A)/\Pr(A)} = \frac{\Pr(B\cap (C_j\cap A))}{\Pr(C_j \cap A)} = \Pr(B \mid C_j,A)</math>

So the law of total probability states that <math display="inline">\Pr(B \mid A) = \sum_{j=1}^n \Pr(B \mid C_j, A)\Pr(C_j \mid A)</math>.

Now we see how the law of total probability interacts with conditional expectation. Let <math>X</math> be a random variable. Then

:<math>\mathbb E[X \mid A] = \sum_x x \Pr(X = x \mid A) = \sum_x x \sum_{j=1}^n \Pr(X = x \mid C_j, A)\Pr(C_j \mid A)</math>

Here the event <math>\{X = x\}</math> is playing the role of <math>B</math> in the statement of the conditional law of total probability.


This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.
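
To convince myself of these identities, here is a small Python sketch (mine, not from any source) that checks the conditional law of total probability and the conditional-expectation version above on a made-up finite probability space; every name in it is invented for illustration.

<syntaxhighlight lang="python">
import random

# Made-up finite probability space: outcomes are (coin, die) pairs with random weights.
random.seed(0)
outcomes = [(c, d) for c in ("H", "T") for d in range(1, 7)]
weights = [random.random() for _ in outcomes]
total = sum(weights)
prob = {w: wt / total for w, wt in zip(outcomes, weights)}

def pr(event):
    """Probability of a set of outcomes."""
    return sum(prob[w] for w in event)

def pr_given(event, given):
    """Conditional probability Pr(event | given)."""
    return pr(event & given) / pr(given)

omega = set(outcomes)
A = {w for w in omega if w[1] >= 3}                                  # conditioning event
B = {w for w in omega if w[0] == "H"}                                # target event
partition = [{w for w in omega if w[1] == d} for d in range(1, 7)]   # C_1, ..., C_6

# Conditional law of total probability: Pr(B|A) = sum_j Pr(B|C_j,A) Pr(C_j|A).
lhs = pr_given(B, A)
rhs = sum(pr_given(B, C & A) * pr_given(C, A) for C in partition if pr(C & A) > 0)
assert abs(lhs - rhs) < 1e-12

def die(w):
    """Value of the die component of an outcome; plays the role of X."""
    return w[1]

# Conditional-expectation version:
# E[X|A] = sum_x x Pr(X=x|A) = sum_x x sum_j Pr(X=x|C_j,A) Pr(C_j|A).
e_direct = sum(die(w) * prob[w] for w in A) / pr(A)
e_via_lotp = sum(
    x * sum(pr_given({w for w in omega if die(w) == x}, C & A) * pr_given(C, A)
            for C in partition if pr(C & A) > 0)
    for x in set(die(w) for w in omega)
)
assert abs(e_direct - e_via_lotp) < 1e-12
print("both identities check out:", lhs, e_direct)
</syntaxhighlight>
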
==Start of proof==


By definition, <math>v_\pi(s) = \mathbb E_\pi[G_t \mid S_t = s]</math>. Now rewrite <math>G_t = R_{t+1} + \gamma G_{t+1}</math> and use the linearity of expectation to get <math>\mathbb E_\pi[R_{t+1} \mid S_t=s] + \gamma\mathbb E_\pi[G_{t+1}\mid S_t = s]</math>. From here, we can work separately with <math>\mathbb E_\pi[R_{t+1} \mid S_t=s]</math> and <math>\mathbb E_\pi[G_{t+1}\mid S_t = s]</math> for a while.
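
As a tiny sanity check of this decomposition (my own sketch, not from the text), the following verifies <math>G_t = R_{t+1} + \gamma G_{t+1}</math> on a made-up finite reward sequence, using the convention that the return after the final reward is zero.

<syntaxhighlight lang="python">
# Check the recursion G_t = R_{t+1} + gamma * G_{t+1} on a made-up finite reward
# sequence, with the convention that the return after the final reward is zero.
gamma = 0.9
rewards = [1.0, 0.0, 2.0, -1.0, 3.0]   # rewards[t] plays the role of R_{t+1}

def G(t):
    """Discounted return G_t computed directly from the definition."""
    return sum(gamma ** k * r for k, r in enumerate(rewards[t:]))

for t in range(len(rewards)):
    g_next = G(t + 1) if t + 1 < len(rewards) else 0.0
    assert abs(G(t) - (rewards[t] + gamma * g_next)) < 1e-12
print("G_t = R_{t+1} + gamma * G_{t+1} holds at every t")
</syntaxhighlight>
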
==Simplifying expectation for reward term==


Using the law of total probability while conditioning over actions, we have

:<math>\mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_r r \Pr(R_{t+1}=r \mid S_t=s) = \sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a,S_t=s)\Pr(A_t=a \mid S_t=s)</math>

Using the convention that <math display="inline">\pi(a\mid s) := \Pr(A_t=a \mid S_t=s)</math>, this becomes

:<math>\sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a,S_t=s)\pi(a\mid s)</math>

Now we can reorder the sums to get


:<math>\sum_a \pi(a\mid s) \sum_r r \Pr(R_{t+1}=r \mid A_t=a,S_t=s)</math>
 
Again, using the law of total probability (in its conjunction form) while conditioning this time over states, we have
 
:<math>\sum_a \pi(a\mid s) \sum_r r \sum_{s'} \Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s)</math>
 
Reordering the sums, this becomes
 
:<math>\sum_a \pi(a\mid s) \sum_{s'} \sum_r \Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s) r</math>
 
Sutton and Barto abbreviate <math>\Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s)</math> as <math>p(s',r \mid s,a)</math> (strictly speaking, we should track the timestep parameter <math>t</math>, but we will omit this detail here). We can also combine the nested sums <math display="inline">\sum_{s'}\sum_r</math> into a single sum that iterates over pairs <math>(s',r)</math>. So we obtain
 
:<math>\sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) r</math>
 
This completes the part for <math>\mathbb E_\pi[R_{t+1} \mid S_t=s]</math>. In other words, we have shown that
 
:<math>\mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) r</math>
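
Here is a Python sketch (mine; the two-state MDP, its transition table p, and the policy pi are all made up for illustration) that checks this identity by comparing the right-hand side against a Monte Carlo estimate of <math>\mathbb E_\pi[R_{t+1} \mid S_t=s]</math>.

<syntaxhighlight lang="python">
import random

random.seed(0)
states = ["s0", "s1"]
actions = ["left", "right"]

# Made-up two-state MDP.  p[(s, a)] lists (s_next, reward, probability) triples,
# encoding the four-argument p(s', r | s, a).
p = {
    ("s0", "left"):  [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "right"): [("s1", 2.0, 0.8), ("s0", -1.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 1.0, 0.3), ("s0", 3.0, 0.7)],
}
pi = {"s0": {"left": 0.4, "right": 0.6}, "s1": {"left": 0.9, "right": 0.1}}

def expected_reward_formula(s):
    """sum_a pi(a|s) sum_{s',r} p(s',r|s,a) * r"""
    return sum(pi[s][a] * sum(prob * r for (_, r, prob) in p[(s, a)]) for a in actions)

def expected_reward_monte_carlo(s, n=100_000):
    """Sample A_t from pi(.|s), then (S_{t+1}, R_{t+1}) from p(.,.|s,a); average the rewards."""
    total = 0.0
    for _ in range(n):
        a = random.choices(actions, weights=[pi[s][act] for act in actions])[0]
        triples = p[(s, a)]
        _, r, _ = random.choices(triples, weights=[w for (_, _, w) in triples])[0]
        total += r
    return total / n

for s in states:
    print(f"{s}: formula = {expected_reward_formula(s):.3f}, "
          f"Monte Carlo = {expected_reward_monte_carlo(s):.3f}")
</syntaxhighlight>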
 
==Simplifying expression for return term==
 
Now we do a similar series of steps for <math>\mathbb E_\pi[G_{t+1}\mid S_t = s]</math>. Conditioning over actions,
 
:<math>\mathbb E_\pi[G_{t+1}\mid S_t = s] = \sum_g g \cdot \Pr(G_{t+1} = g \mid S_t=s) = \sum_g g \sum_a \Pr(G_{t+1} = g \mid A_t=a, S_t=s)\pi(a \mid s)</math>
 
Rearranging sums, this is
 
:<math>\sum_a \pi(a \mid s) \sum_g g  \Pr(G_{t+1} = g \mid A_t=a, S_t=s)</math>
 
Conditioning over states and rewards, we have
 
:<math>\sum_a \pi(a \mid s) \sum_g g  \sum_{s',r} \Pr(G_{t+1} = g, S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s)</math>
 
Now write <math>\Pr(G_{t+1} = g, S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s)</math> as <math>\Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)\Pr(S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s)</math>. Since <math>\Pr(S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s) = p(s',r\mid s,a)</math> we have <math>\Pr(G_{t+1} = g, S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s) = \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)p(s',r\mid s,a)</math>. Thus, substituting this expression and rearranging sums, we have
 
:<math>\sum_a \pi(a \mid s) \sum_{s',r} p(s',r\mid s,a) \sum_g g \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)</math>
 
This is basically what we want, except we want to say that <math>\sum_g g \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \mathbb E_\pi[G_{t+1} \mid S_{t+1}=s']</math>, since the latter expression equals <math>v_\pi(s')</math>. Thankfully, we can say this, because of the Markov assumption. Actually proving this is a little complicated; see the section at the end of this page.
 
So we end up with
 
:<math>\sum_a \pi(a \mid s) \sum_{s',r} p(s',r\mid s,a) v_\pi(s')</math>
 
==Finishing the proof==
 
So now we have
 
:<math>\begin{align}v_\pi(s) &= \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) r + \gamma \sum_a \pi(a \mid s) \sum_{s',r} p(s',r\mid s,a) v_\pi(s') \\ &= \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) [r + \gamma v_\pi(s')]\end{align}</math>
 
This completes the proof. I'm pretty confused why Sutton and Barto seem to think abbreviating all these steps without comment is a good idea.
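
To double-check the result numerically (again my own sketch, using the same made-up two-state MDP as in the earlier snippet), the following estimates <math>v_\pi</math> by Monte Carlo rollouts and confirms that the estimates approximately satisfy the Bellman equation just derived.

<syntaxhighlight lang="python">
import random

random.seed(1)
gamma = 0.9
states = ["s0", "s1"]
actions = ["left", "right"]

# Same made-up two-state MDP as before: p[(s, a)] lists (s_next, reward, probability).
p = {
    ("s0", "left"):  [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "right"): [("s1", 2.0, 0.8), ("s0", -1.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 1.0, 0.3), ("s0", 3.0, 0.7)],
}
pi = {"s0": {"left": 0.4, "right": 0.6}, "s1": {"left": 0.9, "right": 0.1}}

def rollout_return(s, horizon=100):
    """One sampled (truncated) return G_t starting from S_t = s, following pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(actions, weights=[pi[s][act] for act in actions])[0]
        triples = p[(s, a)]
        s, r, _ = random.choices(triples, weights=[w for (_, _, w) in triples])[0]
        g += discount * r
        discount *= gamma
    return g

def v_monte_carlo(s, n=10_000):
    """Monte Carlo estimate of v_pi(s)."""
    return sum(rollout_return(s) for _ in range(n)) / n

v = {s: v_monte_carlo(s) for s in states}

def bellman_rhs(s):
    """sum_a pi(a|s) sum_{s',r} p(s',r|s,a) [r + gamma * v_pi(s')]."""
    return sum(
        pi[s][a] * sum(prob * (r + gamma * v[s2]) for (s2, r, prob) in p[(s, a)])
        for a in actions
    )

for s in states:
    # The two numbers should agree up to Monte Carlo and truncation error.
    print(f"{s}: v_pi estimate = {v[s]:.2f}, Bellman right-hand side = {bellman_rhs(s):.2f}")
</syntaxhighlight>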
 
==Using the Markov assumption==


I think the following works. It suffices to show <math>\Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \Pr(G_{t+1} = g \mid S_{t+1}=s')</math>. Conditioning over actions in the <math>t+1</math> timestep, the left-hand side equals

:<math>\sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1}=a', S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)\Pr(A_{t+1}=a' \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)</math>

But the action is chosen by the policy, which depends only on the current state, so <math>\Pr(A_{t+1}=a' \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \pi(a' \mid s')</math>. As for <math>\Pr(G_{t+1} = g \mid A_{t+1}=a', S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)</math>, recall that <math>G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots</math>, so it involves only rewards from timestep <math>t+2</math> onward. Since we are working with an MDP, the environment's dynamics from that point on are completely determined by the state and action on timestep <math>t+1</math>, i.e. by <math>s'</math> and <math>a'</math>. (NOTE: Sutton and Barto's text seems to have a small error here, where they describe the Markov property as "That is, the probability of each possible value for <math>S_t</math> and <math>R_t</math> depends only on the immediately preceding state and action, <math>S_{t-1}</math> and <math>A_{t-1}</math>, and, given them, not at all on earlier states and actions." The last part should probably say "not at all on earlier states, actions, and rewards"; otherwise I'm not sure how to prove that we can drop the conditioning on <math>R_{t+1}=r</math>.) So the conditional probability equals <math>\Pr(G_{t+1} = g \mid A_{t+1}=a', S_{t+1}=s')</math>, and the sum above becomes

:<math>\sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1}=a', S_{t+1}=s')\pi(a'\mid s')</math>

Now we can reverse the conditioning, i.e. marginalize over <math>a'</math>, to obtain <math>\Pr(G_{t+1} = g \mid S_{t+1}=s')</math>.


See [https://math.stackexchange.com/questions/3143290/writing-action-value-function-in-terms-of-state-value-function-for-a-markov-deci this question] for a similar manipulation.
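
Finally, here is an empirical sketch (mine) of the claim that, given <math>S_{t+1}</math>, the distribution of <math>G_{t+1}</math> does not depend on <math>R_{t+1}</math>. It uses a variant of the made-up MDP from the earlier sketches in which the same next state can occur with two different rewards; grouping sampled returns by <math>(S_{t+1}, R_{t+1})</math>, the group means for a fixed next state should agree up to Monte Carlo noise.

<syntaxhighlight lang="python">
import random
from collections import defaultdict

random.seed(2)
gamma = 0.9
actions = ["left", "right"]

# Variant of the earlier made-up MDP: from ("s0", "right") the next state "s1"
# can now be reached with two different rewards, so we can test whether the
# reward matters once the next state is known.
p = {
    ("s0", "left"):  [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "right"): [("s1", 2.0, 0.5), ("s1", -1.0, 0.3), ("s0", 0.0, 0.2)],
    ("s1", "left"):  [("s0", 0.0, 1.0)],
    ("s1", "right"): [("s1", 1.0, 0.3), ("s0", 3.0, 0.7)],
}
pi = {"s0": {"left": 0.4, "right": 0.6}, "s1": {"left": 0.9, "right": 0.1}}

def step(s, a):
    """Sample (S_{t+1}, R_{t+1}) from p(., . | s, a)."""
    triples = p[(s, a)]
    s2, r, _ = random.choices(triples, weights=[w for (_, _, w) in triples])[0]
    return s2, r

def rollout_return(s, horizon=100):
    """One sampled (truncated) return starting from state s, following pi."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(actions, weights=[pi[s][act] for act in actions])[0]
        s, r = step(s, a)
        g += discount * r
        discount *= gamma
    return g

# Fix (S_t, A_t) = ("s0", "right"), sample (S_{t+1}, R_{t+1}), then sample G_{t+1}
# starting from S_{t+1}.  Group the returns by the observed (S_{t+1}, R_{t+1}).
groups = defaultdict(list)
for _ in range(15_000):
    s2, r = step("s0", "right")
    groups[(s2, r)].append(rollout_return(s2))

# The two groups with S_{t+1} = "s1" should have (approximately) equal means,
# even though the observed rewards differ.
for (s2, r), gs in sorted(groups.items()):
    print(f"S_t+1={s2}, R_t+1={r:+.1f}: mean of G_t+1 = {sum(gs)/len(gs):.2f}  ({len(gs)} samples)")
</syntaxhighlight>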
