Bellman equation derivation

Bellman equation for <math>v_\pi</math>.

We want to show <math>v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\left[r + \gamma v_\pi(s')\right]</math> for all states <math>s</math>.

The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.

The law of total probability states that if <math>B</math> is an event, and <math>C_1, \ldots, C_n</math> are events that partition the sample space, then <math>\Pr(B) = \sum_{j=1}^n \Pr(B \cap C_j) = \sum_{j=1}^n \Pr(B \mid C_j)\Pr(C_j)</math>.

For a fixed event <math>A</math> with non-zero probability, the mapping <math>B \mapsto \Pr(B \mid A)</math> is another valid probability measure. In other words, define <math>\Pr_A</math> by <math>\Pr_A(B) := \Pr(B \mid A)</math> for all events <math>B</math>. Now the law of total probability for <math>\Pr_A</math> states that <math>\Pr_A(B) = \sum_{j=1}^n \Pr_A(B \mid C_j)\Pr_A(C_j)</math>. We also have

:<math>\Pr_A(B \mid C_j) = \frac{\Pr_A(B \cap C_j)}{\Pr_A(C_j)} = \frac{\Pr(B \cap C_j \mid A)}{\Pr(C_j \mid A)} = \frac{\Pr(B \cap C_j \cap A)/\Pr(A)}{\Pr(C_j \cap A)/\Pr(A)} = \frac{\Pr(B \cap (C_j \cap A))}{\Pr(C_j \cap A)} = \Pr(B \mid C_j, A)</math>

So the law of total probability states that <math>\Pr(B \mid A) = \sum_{j=1}^n \Pr(B \mid C_j, A)\Pr(C_j \mid A)</math>.

Now we see how the law of total probability interacts with conditional expectation. Let <math>X</math> be a random variable. Then

:<math>\mathbb E[X \mid A] = \sum_x x \Pr(X = x \mid A) = \sum_x x \sum_j \Pr(X = x \mid C_j, A)\Pr(C_j \mid A)</math>

Here the event <math>X = x</math> is playing the role of <math>B</math> in the statement of the conditional law of total probability.
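
As a quick numerical sanity check of the two identities above, the following Python sketch verifies them on a small, completely made-up finite probability space (the outcomes, probabilities, and the choices of <math>A</math>, <math>B</math>, and the partition cells <math>C_j</math> are all invented for illustration):

<syntaxhighlight lang="python">
# Toy finite probability space: an outcome is a pair (x, c), where x will be
# the value of the random variable X and c indexes the partition cell C_j.
P = {
    (0, 0): 0.10, (0, 1): 0.05, (0, 2): 0.15,
    (1, 0): 0.20, (1, 1): 0.10, (1, 2): 0.05,
    (2, 0): 0.05, (2, 1): 0.25, (2, 2): 0.05,
}
Omega = set(P)

def pr(event):
    """Pr(event), where an event is a set of outcomes."""
    return sum(P[o] for o in event)

def cond(event, given):
    """Pr(event | given)."""
    return pr(event & given) / pr(given)

A = {o for o in Omega if o[0] >= 1}                      # conditioning event, Pr(A) > 0
B = {o for o in Omega if o[0] == 2}                      # event of interest
C = [{o for o in Omega if o[1] == j} for j in range(3)]  # partition cells C_1, ..., C_n

# Conditional law of total probability: Pr(B|A) = sum_j Pr(B|C_j, A) Pr(C_j|A).
print(cond(B, A), sum(cond(B, Cj & A) * cond(Cj, A) for Cj in C))

# Conditional-expectation form, with X((x, c)) = x:
# E[X|A] = sum_x x Pr(X=x|A) = sum_x x sum_j Pr(X=x|C_j, A) Pr(C_j|A).
xs = sorted({o[0] for o in Omega})
X_eq = {x: {o for o in Omega if o[0] == x} for x in xs}
print(sum(x * cond(X_eq[x], A) for x in xs),
      sum(x * cond(X_eq[x], Cj & A) * cond(Cj, A) for x in xs for Cj in C))
</syntaxhighlight>

Both printed pairs should agree up to floating-point rounding, and the same holds for any distribution and events as long as the conditioning events have non-zero probability.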

This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.

By definition, <math>v_\pi(s) = \mathbb E_\pi[G_t \mid S_t = s]</math>. Now rewrite <math>G_t = R_{t+1} + \gamma G_{t+1}</math> and use the linearity of expectation to get <math>\mathbb E_\pi[R_{t+1} \mid S_t = s] + \gamma\, \mathbb E_\pi[G_{t+1} \mid S_t = s]</math>. From here, we can work separately with <math>\mathbb E_\pi[R_{t+1} \mid S_t = s]</math> and <math>\mathbb E_\pi[G_{t+1} \mid S_t = s]</math> for a while.

Using the law of total probability while conditioning over actions, we have

:<math>\mathbb E_\pi[R_{t+1} \mid S_t = s] = \sum_r r \Pr(R_{t+1} = r \mid S_t = s) = \sum_r r \sum_a \Pr(R_{t+1} = r \mid A_t = a, S_t = s)\Pr(A_t = a \mid S_t = s)</math>

Using the convention that <math>\Pr(A_t = a \mid S_t = s) = \pi(a \mid s)</math>, this becomes

:<math>\mathbb E_\pi[R_{t+1} \mid S_t = s] = \sum_r r \sum_a \Pr(R_{t+1} = r \mid A_t = a, S_t = s)\pi(a \mid s)</math>

Now we can reorder the sums to get

:<math>\sum_a \pi(a \mid s) \sum_r r \Pr(R_{t+1} = r \mid A_t = a, S_t = s)</math>

Again using the law of total probability (this time in its conjunction form, <math>\Pr(B \mid A) = \sum_j \Pr(B \cap C_j \mid A)</math>), conditioning over next states, we have

:<math>\sum_a \pi(a \mid s) \sum_r r \sum_{s'} \Pr(R_{t+1} = r, S_{t+1} = s' \mid A_t = a, S_t = s)</math>

Reordering the sums, this becomes

:<math>\sum_a \pi(a \mid s) \sum_{s'} \sum_r \Pr(R_{t+1} = r, S_{t+1} = s' \mid A_t = a, S_t = s)\, r</math>

Sutton and Barto abbreviate <math>\Pr(R_{t+1} = r, S_{t+1} = s' \mid A_t = a, S_t = s)</math> as <math>p(s',r \mid s,a)</math> (strictly speaking, we should track the timestep parameter <math>t</math>, but we will omit this detail here). We can also combine the nested sums <math>\sum_{s'} \sum_r</math> into a single sum that iterates over pairs <math>(s', r)</math>. So we obtain

:<math>\sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\, r</math>

This completes the part for <math>\mathbb E_\pi[R_{t+1} \mid S_t = s]</math>. In other words, we have shown that

:<math>\mathbb E_\pi[R_{t+1} \mid S_t = s] = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\, r</math>
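
As a sanity check on this identity, here is a small Python sketch; the two-state MDP, its <math>p(s',r \mid s,a)</math> table, and the policy <math>\pi</math> are invented purely for illustration. It compares the right-hand side above against a plain Monte Carlo estimate of <math>\mathbb E_\pi[R_{t+1} \mid S_t = s]</math>:

<syntaxhighlight lang="python">
import random

# Invented two-state, two-action MDP. p[(s, a)] lists (s', r, probability).
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 2.0, 0.8), ("s0", -1.0, 0.2)],
    ("s1", "a0"): [("s0", 0.0, 0.9), ("s1", 5.0, 0.1)],
    ("s1", "a1"): [("s1", 1.0, 1.0)],
}
pi = {"s0": {"a0": 0.3, "a1": 0.7}, "s1": {"a0": 0.6, "a1": 0.4}}

def step(s, a):
    """Sample (s', r) from p(s', r | s, a)."""
    u, acc = random.random(), 0.0
    for s2, r, q in p[(s, a)]:
        acc += q
        if u <= acc:
            return s2, r
    return p[(s, a)][-1][:2]

s = "s0"

# Exact right-hand side: sum_a pi(a|s) sum_{s',r} p(s',r|s,a) r.
exact = sum(pi[s][a] * sum(q * r for _, r, q in p[(s, a)]) for a in pi[s])

# Monte Carlo estimate of E_pi[R_{t+1} | S_t = s]: sample an action from pi,
# then a (next state, reward) pair from p, and average the rewards.
n = 100_000
total = 0.0
for _ in range(n):
    a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
    _, r = step(s, a)
    total += r
print(exact, total / n)  # the two numbers should be close
</syntaxhighlight>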

Now we do a similar series of steps for <math>\mathbb E_\pi[G_{t+1} \mid S_t = s]</math>. Conditioning over actions,

:<math>\mathbb E_\pi[G_{t+1} \mid S_t = s] = \sum_g g \Pr(G_{t+1} = g \mid S_t = s) = \sum_g g \sum_a \Pr(G_{t+1} = g \mid A_t = a, S_t = s)\pi(a \mid s)</math>

Rearranging sums, this is

:<math>\sum_a \pi(a \mid s) \sum_g g \Pr(G_{t+1} = g \mid A_t = a, S_t = s)</math>

Conditioning over next states and rewards, we have

:<math>\sum_a \pi(a \mid s) \sum_g g \sum_{s',r} \Pr(G_{t+1} = g, S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)</math>

Now write <math>\Pr(G_{t+1} = g, S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)</math> as <math>\Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)\Pr(S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s)</math>. Since <math>\Pr(S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s) = p(s',r \mid s,a)</math>, we have <math>\Pr(G_{t+1} = g, S_{t+1} = s', R_{t+1} = r \mid A_t = a, S_t = s) = \Pr(G_{t+1} = g \mid S_{t+1} = s', R_{t+1} = r, A_t = a, S_t = s)\, p(s',r \mid s,a)</math>. Thus, substituting this expression and rearranging sums, we have

:<math>\sum_a \pi(a \mid s) \sum_{s',r} p(s',r\mid s,a) \sum_g g \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)</math>

This is basically what we want, except we want to say that <math>\sum_g g \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \mathbb E_\pi[G_{t+1} \mid S_{t+1}=s']</math>, since the latter expression equals <math>v_\pi(s')</math>. Thankfully, we can say this, because of the Markov assumption. Actually proving this is a little complicated; see [https://math.stackexchange.com/questions/3143290/writing-action-value-function-in-terms-of-state-value-function-for-a-markov-deci this question] for some details. I think the following works: It suffices to show <math>\Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \Pr(G_{t+1} = g \mid S_{t+1}=s')</math>. Conditioning over actions in the <math>t+1</math> timestep, we have <math>\Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1}=a', S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)\Pr(A_{t+1}=a' \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)</math>. But the action is chosen by the policy, which depends only on the current state, so <math>\Pr(A_{t+1}=a' \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \pi(a' \mid s')</math>. As for <math>\Pr(G_{t+1} = g \mid A_{t+1}=a', S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)</math>, recall that <math>G_{t+1} = R_{t+2} + \gamma R_{t+3} + \gamma^2 R_{t+4} + \cdots</math>, so it involves only rewards from timestep <math>t+2</math> onward. Since we are working with an MDP, the distribution of everything from timestep <math>t+2</math> onward is completely determined by the state and action at timestep <math>t+1</math>, i.e. by <math>s'</math> and <math>a'</math>. So this probability is equal to <math>\Pr(G_{t+1} = g \mid A_{t+1}=a', S_{t+1}=s')</math>. Thus we have <math>\Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \sum_{a'} \Pr(G_{t+1} = g \mid A_{t+1}=a', S_{t+1}=s')\pi(a'\mid s')</math>. Now we can reverse the conditioning: by the law of total probability, this last sum is exactly <math>\Pr(G_{t+1} = g \mid S_{t+1}=s')</math>.
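
A Monte Carlo sketch can make this Markov argument concrete. On a made-up two-state MDP (the states, rewards, policy, and discount below are all invented for illustration), the average return collected after a transition, conditioned on the full history <math>(S_t = s, A_t = a, R_{t+1} = r, S_{t+1} = s')</math>, should match the average return collected when conditioning only on <math>S_{t+1} = s'</math>:

<syntaxhighlight lang="python">
import random

# Invented two-state MDP and policy. p[(s, a)] lists (s', r, probability).
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 2.0, 0.8), ("s0", -1.0, 0.2)],
    ("s1", "a0"): [("s0", 0.0, 0.9), ("s1", 5.0, 0.1)],
    ("s1", "a1"): [("s1", 1.0, 1.0)],
}
pi = {"s0": {"a0": 0.3, "a1": 0.7}, "s1": {"a0": 0.6, "a1": 0.4}}
gamma = 0.9

def sample_action(s):
    return random.choices(list(pi[s]), weights=list(pi[s].values()))[0]

def step(s, a):
    """Sample (s', r) from p(s', r | s, a)."""
    u, acc = random.random(), 0.0
    for s2, r, q in p[(s, a)]:
        acc += q
        if u <= acc:
            return s2, r
    return p[(s, a)][-1][:2]

def rollout_return(s, horizon=100):
    """Discounted return accumulated by following pi from state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = sample_action(s)
        s, r = step(s, a)
        g += discount * r
        discount *= gamma
    return g

# Condition on the full history S_t = s, A_t = a, R_{t+1} = r, S_{t+1} = s':
# simulate one transition from (s, a), keep only the runs that hit (s', r),
# and average the discounted return collected from time t+1 onward.
s, a, s_next, r_next = "s0", "a1", "s1", 2.0
full_history, n_kept = 0.0, 0
for _ in range(10_000):
    s2, r = step(s, a)
    if s2 == s_next and r == r_next:
        full_history += rollout_return(s2)
        n_kept += 1

# Condition only on S_{t+1} = s': just roll out from s'.
only_next_state = sum(rollout_return(s_next) for _ in range(10_000)) / 10_000

print(full_history / n_kept, only_next_state)  # should be close
</syntaxhighlight>

Since the filtered and unfiltered averages estimate the same quantity <math>v_\pi(s')</math>, the printed numbers should be close for any choice of <math>(s, a, r, s')</math> that has non-zero probability under the dynamics.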

So we end up with

:<math>\sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\, v_\pi(s')</math>

Putting the two pieces together, we now have

:<math>v_\pi(s) = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\, r + \gamma \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\, v_\pi(s') = \sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)\left[r + \gamma v_\pi(s')\right]</math>

This completes the proof. I'm pretty confused why Sutton and Barto seem to think abbreviating all these steps without comment is a good idea.
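
As a final end-to-end sanity check, the sketch below (again on an invented two-state MDP) estimates <math>v_\pi(s)</math> purely from the definition <math>\mathbb E_\pi[G_t \mid S_t = s]</math> via Monte Carlo rollouts, and then plugs those estimates into the right-hand side <math>\sum_a \pi(a \mid s) \sum_{s',r} p(s',r \mid s,a)[r + \gamma v_\pi(s')]</math>; the two should agree up to sampling and truncation error.

<syntaxhighlight lang="python">
import random

# Invented two-state MDP and policy. p[(s, a)] lists (s', r, probability).
p = {
    ("s0", "a0"): [("s0", 0.0, 0.5), ("s1", 1.0, 0.5)],
    ("s0", "a1"): [("s1", 2.0, 0.8), ("s0", -1.0, 0.2)],
    ("s1", "a0"): [("s0", 0.0, 0.9), ("s1", 5.0, 0.1)],
    ("s1", "a1"): [("s1", 1.0, 1.0)],
}
pi = {"s0": {"a0": 0.3, "a1": 0.7}, "s1": {"a0": 0.6, "a1": 0.4}}
gamma = 0.9
states = ["s0", "s1"]

def step(s, a):
    """Sample (s', r) from p(s', r | s, a)."""
    u, acc = random.random(), 0.0
    for s2, r, q in p[(s, a)]:
        acc += q
        if u <= acc:
            return s2, r
    return p[(s, a)][-1][:2]

def rollout_return(s, horizon=100):
    """Discounted return from following pi, starting in state s."""
    g, discount = 0.0, 1.0
    for _ in range(horizon):
        a = random.choices(list(pi[s]), weights=list(pi[s].values()))[0]
        s, r = step(s, a)
        g += discount * r
        discount *= gamma
    return g

# v_pi(s) estimated directly from the definition E_pi[G_t | S_t = s].
n = 20_000
v = {s: sum(rollout_return(s) for _ in range(n)) / n for s in states}

# Right-hand side of the Bellman equation, using the estimated values.
for s in states:
    rhs = sum(pi[s][a] * sum(q * (r + gamma * v[s2]) for s2, r, q in p[(s, a)])
              for a in pi[s])
    print(s, v[s], rhs)  # left and right sides should roughly agree
</syntaxhighlight>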