Bellman equation derivation: Difference between revisions

Revision as of 01:09, 1 September 2019

Bellman equation for $v_{π}$.

We want to show $v_{π} (s) = \sum_{a} π (a ∣ s) \sum_{s', r} p (s', r ∣ s, a) [r + γ v_{π} (s')]$ for all states $s$.
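Before the proof, it can help to see the identity hold numerically. The following is a minimal sketch, not part of the derivation: the two-state MDP, its policy, and every name in the code (`POLICY`, `DYNAMICS`, `value_from_definition`, `bellman_rhs`) are made up for illustration. It approximates $v_{π}(s)$ directly from the definition as an expected discounted sum of rewards (truncated at a finite horizon) and then checks that the result satisfies the equation above up to floating-point error.

```python
# Hypothetical two-state, two-action MDP; all numbers are invented for illustration.
GAMMA = 0.5
STATES = [0, 1]
ACTIONS = [0, 1]

# POLICY[s][a] = pi(a | s)
POLICY = {0: {0: 0.25, 1: 0.75}, 1: {0: 0.5, 1: 0.5}}

# DYNAMICS[(s, a)] = list of (next_state, reward, probability), i.e. p(s', r | s, a)
DYNAMICS = {
    (0, 0): [(0, 1.0, 0.5), (1, 0.0, 0.5)],
    (0, 1): [(1, 2.0, 1.0)],
    (1, 0): [(0, -1.0, 0.25), (1, 3.0, 0.75)],
    (1, 1): [(0, 0.0, 1.0)],
}


def step_distribution(dist):
    """Push a distribution over S_{t+k} one step forward under the policy.

    Returns the distribution over S_{t+k+1} and the expected reward E[R_{t+k+1}].
    """
    next_dist = {s: 0.0 for s in STATES}
    expected_reward = 0.0
    for s, p_s in dist.items():
        for a in ACTIONS:
            for s_next, r, p in DYNAMICS[(s, a)]:
                weight = p_s * POLICY[s][a] * p
                next_dist[s_next] += weight
                expected_reward += weight * r
    return next_dist, expected_reward


def value_from_definition(s, horizon=200):
    """Approximate v_pi(s) = E[sum_k gamma^k R_{t+k+1} | S_t = s] by truncating the sum."""
    dist = {state: (1.0 if state == s else 0.0) for state in STATES}
    total, discount = 0.0, 1.0
    for _ in range(horizon):
        dist, expected_reward = step_distribution(dist)
        total += discount * expected_reward
        discount *= GAMMA
    return total


def bellman_rhs(s, v):
    """Right-hand side of the Bellman equation at state s for the value table v."""
    return sum(
        POLICY[s][a] * p * (r + GAMMA * v[s_next])
        for a in ACTIONS
        for s_next, r, p in DYNAMICS[(s, a)]
    )


v = {s: value_from_definition(s) for s in STATES}
# Gap between the two sides of the Bellman equation at each state.
residuals = {s: abs(v[s] - bellman_rhs(s, v)) for s in STATES}
```

With a discount of 0.5 and a horizon of 200 the truncation error is far below floating-point precision, so the entries of `residuals` should be zero up to rounding. This is only a sanity check on one example, not a substitute for the proof.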

The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.

The law of total probability states that if $B$ is an event, and $C_{1}, \dots, C_{n}$ are events that partition the sample space, then $Pr (B) = \sum_{j = 1}^{n} Pr (B ∣ C_{j}) Pr (C_{j})$.

For a fixed event $A$ with non-zero probability, the mapping $B \mapsto Pr (B ∣ A)$ is another valid probability measure. In other words, define ${Pr}_{A}$ by ${Pr}_{A} (B) := Pr (B ∣ A)$ for all events $B$. Now the law of total probability for ${Pr}_{A}$ states that ${Pr}_{A} (B) = \sum_{j = 1}^{n} {Pr}_{A} (B ∣ C_{j}) {Pr}_{A} (C_{j})$. We also have

$${Pr}_{A} (B ∣ C_{j}) = \frac{{Pr}_{A} (B \cap C_{j})}{{Pr}_{A} (C_{j})} = \frac{Pr (B \cap C_{j} ∣ A)}{Pr (C_{j} ∣ A)} = \frac{Pr (B \cap C_{j} \cap A) / Pr (A)}{Pr (C_{j} \cap A) / Pr (A)} = \frac{Pr (B \cap (C_{j} \cap A))}{Pr (C_{j} \cap A)} = Pr (B ∣ C_{j}, A)$$

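So the law of total probability for ${Pr}_{A}$, rewritten in terms of $Pr$, states that $Pr (B ∣ A) = \sum_{j = 1}^{n} Pr (B ∣ C_{j}, A) Pr (C_{j} ∣ A)$. We call this the conditional law of total probability; it requires $Pr (C_{j} \cap A) > 0$ for each term that appears in the sum.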
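The conditional form can be checked exactly on a small finite sample space. The sketch below is illustrative only: the uniform twelve-outcome space, the particular events `A` and `B`, and the helper names `pr` and `pr_given` are all assumptions chosen for the example. It uses exact rational arithmetic so both sides can be compared with `==`.

```python
from fractions import Fraction

# Hypothetical sample space: outcomes 0..11 with the uniform measure.
OMEGA = frozenset(range(12))


def pr(event):
    """Probability of an event (a set of outcomes) under the uniform measure."""
    return Fraction(len(event & OMEGA), len(OMEGA))


def pr_given(event, given):
    """Conditional probability Pr(event | given); requires Pr(given) > 0."""
    return pr(event & given) / pr(given)


A = frozenset({0, 1, 2, 3, 4, 5, 6, 7})  # conditioning event
B = frozenset({1, 2, 3, 5, 8, 11})       # event of interest
# C_1, C_2, C_3 partition the sample space by remainder modulo 3.
PARTITION = [frozenset(w for w in OMEGA if w % 3 == k) for k in range(3)]

# Left-hand side: Pr(B | A)
lhs = pr_given(B, A)

# Right-hand side: sum_j Pr(B | C_j, A) * Pr(C_j | A),
# where conditioning on "C_j, A" means conditioning on the intersection.
rhs = sum(
    pr_given(B, C & A) * pr_given(C, A)
    for C in PARTITION
    if pr(C & A) > 0  # skip terms where the conditional is undefined
)
```

Here both `lhs` and `rhs` evaluate to `Fraction(1, 2)`; the three terms of the sum are 1/8, 1/8, and 1/4.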

Now we see how the law of total probability interacts with conditional expectation. Let $X$ be a random variable. Then

$$E [X ∣ A] = \sum_{x} x \cdot Pr (X = x ∣ A) = \sum_{x} x \sum_{j} Pr (X = x ∣ C_{j}, A) Pr (C_{j} ∣ A)$$

Here the event $X = x$ is playing the role of $B$ in the statement of the conditional law of total probability.

This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.

By definition, $v_{π} (s) = E_{π} [G_{t} ∣ S_{t} = s]$. Now rewrite $G_{t} = R_{t + 1} + γ G_{t + 1}$ and use the linearity of expectation to get $v_{π} (s) = E_{π} [R_{t + 1} ∣ S_{t} = s] + γ E_{π} [G_{t + 1} ∣ S_{t} = s]$. From here, we can work separately with $E_{π} [R_{t + 1} ∣ S_{t} = s]$ and $E_{π} [G_{t + 1} ∣ S_{t} = s]$ for a while.

Using the law of total probability while conditioning over actions, we have

$$E_{π} [R_{t + 1} ∣ S_{t} = s] = \sum_{r} r \cdot Pr (R_{t + 1} = r ∣ S_{t} = s) = \sum_{r} r \sum_{a} Pr (R_{t + 1} = r ∣ A_{t} = a, S_{t} = s) Pr (A_{t} = a ∣ S_{t} = s)$$

Using the convention that $Pr (A_{t} = a ∣ S_{t} = s) = π (a ∣ s)$, this becomes

$$E_{π} [R_{t + 1} ∣ S_{t} = s] = \sum_{r} r \sum_{a} Pr (R_{t + 1} = r ∣ a, S_{t} = s) π (a ∣ s)$$
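The conditioning-over-actions step can also be checked with exact arithmetic for a single fixed state $s$. This is a sketch under made-up numbers: the action names, the policy `PI`, and the reward distributions in `REWARD_DIST` are hypothetical. The left-hand side is computed by building the joint distribution of the action and reward and marginalising out the action; the right-hand side follows the nested sum as written in the text.

```python
from fractions import Fraction

# Hypothetical numbers for one fixed state s.
# PI[a] = pi(a | s)
PI = {"left": Fraction(1, 4), "right": Fraction(3, 4)}

# REWARD_DIST[a][r] = Pr(R_{t+1} = r | A_t = a, S_t = s)
REWARD_DIST = {
    "left": {0: Fraction(1, 2), 1: Fraction(1, 2)},
    "right": {1: Fraction(1, 3), 4: Fraction(2, 3)},
}
REWARDS = sorted({r for dist in REWARD_DIST.values() for r in dist})

# Joint distribution of (A_t, R_{t+1}) given S_t = s, then marginalise out the action
# to obtain Pr(R_{t+1} = r | S_t = s).
joint = {(a, r): PI[a] * p for a, dist in REWARD_DIST.items() for r, p in dist.items()}
marginal = {r: sum(p for (a, r2), p in joint.items() if r2 == r) for r in REWARDS}

# Left-hand side: sum_r r * Pr(R_{t+1} = r | S_t = s)
lhs = sum(r * marginal[r] for r in REWARDS)

# Right-hand side, in the order the text writes it:
# sum_r r * sum_a Pr(R_{t+1} = r | a, S_t = s) * pi(a | s)
rhs = sum(
    r * sum(REWARD_DIST[a].get(r, Fraction(0)) * PI[a] for a in PI)
    for r in REWARDS
)
```

For these numbers both sides equal `Fraction(19, 8)`: the marginal reward distribution is 1/8, 3/8, 1/2 on the rewards 0, 1, 4, giving 0 + 3/8 + 2.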

@@ Line 15: / Line 15: @@
 Now we see how the law of total probability interacts with conditional expectation. Let <math>X</math> be a random variable. Then
-:<math>\mathbb E[X \mid A] = \sum_x x \cdot \Pr(X = x \mid A) = \sum_x x \cdot \sum_j \Pr(X =x \mid C_j,A)\Pr(C_j\mid A)</math>
+:<math>\mathbb E[X \mid A] = \sum_x x \cdot \Pr(X = x \mid A) = \sum_x x \sum_j \Pr(X =x \mid C_j,A)\Pr(C_j\mid A)</math>
 Here the event <math>X=x</math> is playing the role of <math>B</math> in the statement of the conditional law of total probability.
@@ Line 23: / Line 23: @@
 By definition, <math>v_\pi(s) = \mathbb E_\pi[G_t \mid S_t = s]</math>. Now rewrite <math>G_t = R_{t+1} + \gamma G_{t+1}</math> and use the linearity of expectation to get <math>\mathbb E_\pi[R_{t+1} \mid S_t=s] + \gamma\mathbb E_\pi[G_{t+1}\mid S_t = s]</math>. From here, we can work separately with <math>\mathbb E_\pi[R_{t+1} \mid S_t=s]</math> and <math>\mathbb E_\pi[G_{t+1}\mid S_t = s]</math> for a while.
-Using the law of total probability, we have
+Using the law of total probability while conditioning over actions, we have
-:<math>\mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_r r \cdot \Pr(R_{t+1}=r \mid S_t=s) = \sum_r r \cdot \sum_a \Pr(R_{t+1}=r \mid a,S_t=s)\pi(a\mid s)</math>
+:<math>\mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_r r \cdot \Pr(R_{t+1}=r \mid S_t=s) = \sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a,S_t=s)\Pr(A_t=a \mid S_t=s)</math>
+Using the convention that <math>\Pr(A_t=a \mid S_t=s) = \pi(a\mid s)</math>, this becomes
+:<math>\mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_r r \sum_a \Pr(R_{t+1}=r \mid a,S_t=s)\pi(a\mid s)</math>