Bellman equation derivation
Bellman equation for .
We want to show Failed to parse (Conversion error. Server ("https://wikimedia.org/api/rest_") reported: "Cannot get mml. Server problem."): {\displaystyle v_{\pi }(s)=\sum _{a}\pi (a\mid s)\sum _{s',r}p(s',r\mid s,a)[r+\gamma v_{\pi }(s')]} for all states .
The core idea of the proof is to use the law of total probability to go from marginal to conditional probabilities, and then invoke the Markov assumption.
The law of total probability states that if is an event, and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle C_1, \ldots, C_n} are events that partition the sample space, then Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\textstyle \Pr(B) = \sum_{j=1}^n \Pr(B \cap C_j) = \sum_{j=1}^n \Pr(B \mid C_j)\Pr(C_j)} .
For fixed event Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle A} with non-zero probability, the mapping Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle B \mapsto \Pr(B \mid A)} is another valid probability measure. In other words, define Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr_A} by Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr_A(B) := \Pr(B \mid A)} for all events Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle B} . Now the law of total probability for Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle P_A} states that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\textstyle \Pr_A(B) = \sum_{j=1}^n \Pr_A(B \mid C_j)\Pr_A(C_j)} . We also have
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr_A(B \mid C_j) = \frac{\Pr_A(B \cap C_j)}{\Pr_A(C_j)} = \frac{\Pr(B\cap C_j \mid A)}{\Pr(C_j\mid A)} = \frac{\Pr(B \cap C_j \cap A)/\Pr(A)}{\Pr(C_j \cap A)/\Pr(A)} = \frac{\Pr(B\cap (C_j\cap A))}{\Pr(C_j \cap A)} = \Pr(B \mid C_j,A)}
So the law of total probability states that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\textstyle \Pr(B \mid A) = \sum_{j=1}^n \Pr(B \mid C_j,A)\Pr(C_j\mid A)} .
Now we see how the law of total probability interacts with conditional expectation. Let Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X} be a random variable. Then
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E[X \mid A] = \sum_x x \cdot \Pr(X = x \mid A) = \sum_x x \sum_j \Pr(X =x \mid C_j,A)\Pr(C_j\mid A)}
Here the event Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle X=x} is playing the role of Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle B} in the statement of the conditional law of total probability.
This is the basic trick of the proof; we keep conditioning on different things (actions, next states, rewards) and using the law of total probability.
By definition, Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle v_\pi(s) = \mathbb E_\pi[G_t \mid S_t = s]} . Now rewrite Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle G_t = R_{t+1} + \gamma G_{t+1}} and use the linearity of expectation to get Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[R_{t+1} \mid S_t=s] + \gamma\mathbb E_\pi[G_{t+1}\mid S_t = s]} . From here, we can work separately with Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[R_{t+1} \mid S_t=s]} and Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[G_{t+1}\mid S_t = s]} for a while.
Using the law of total probability while conditioning over actions, we have
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_r r \cdot \Pr(R_{t+1}=r \mid S_t=s) = \sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a,S_t=s)\Pr(A_t=a \mid S_t=s)}
Using the convention that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr(A_t=a \mid S_t=s) = \pi(a\mid s)} , this becomes
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_r r \sum_a \Pr(R_{t+1}=r \mid A_t=a,S_t=s)\pi(a\mid s)}
Now we can reorder the sums to get
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_a \pi(a\mid s) \sum_r r \Pr(R_{t+1}=r \mid A_t=a,S_t=s)}
Again, using the law of total probability (in its conjunction form) while conditioning this time over states, we have
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_a \pi(a\mid s) \sum_r r \sum_{s'} \Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s)}
Reordering the sums, this becomes
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_a \pi(a\mid s) \sum_{s'} \sum_r \Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s) r}
Sutton and Barto abbreviate Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr(R_{t+1}=r,S_{t+1}=s' \mid A_t=a,S_t=s)} as Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle p(s',r \mid s,a)} (strictly speaking, we should track the timestep parameter Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle t} , but we will omit this detail here). We can also combine the nested sums Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\textstyle \sum_{s'}\sum_r} into a single sum that iterates over pairs Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle (s',r)} . So we obtain
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) r}
This completes the part for Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[R_{t+1} \mid S_t=s]} . In other words, we have shown that
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[R_{t+1} \mid S_t=s] = \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) r}
Now we do a similar series of steps for Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[G_{t+1}\mid S_t = s]} . Conditioning over actions,
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \mathbb E_\pi[G_{t+1}\mid S_t = s] = \sum_g g \cdot \Pr(G_{t+1} = g \mid S_t=s) = \sum_g g \sum_a \Pr(G_{t+1} = g \mid A_t=a, S_t=s)\pi(a \mid s)}
Rearranging sums, this is
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_a \pi(a \mid s) \sum_g g \Pr(G_{t+1} = g \mid A_t=a, S_t=s)}
Conditioning over states and rewards, we have
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_a \pi(a \mid s) \sum_g g \sum_{s',r} \Pr(G_{t+1} = g, S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s)}
Now write Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr(G_{t+1} = g, S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s)} as Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)\Pr(S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s)} . Since Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr(S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s) = p(s',r\mid s,a)} we have Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \Pr(G_{t+1} = g, S_{t+1}=s', R_{t+1}=r \mid A_t=a, S_t=s) = \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)p(s',r\mid s,a)} . Thus, substituting this expression and rearranging sums, we have
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_a \pi(a \mid s) \sum_{s',r} p(s',r\mid s,a) \sum_g g \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s)}
This is basically what we want, except we want to say that Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_g g \Pr(G_{t+1} = g \mid S_{t+1}=s', R_{t+1}=r, A_t=a, S_t=s) = \mathbb E_\pi[G_{t+1} \mid S_{t+1}=s']} , since the latter expression equals Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle v_\pi(s')} . Thankfully, we can say this, because of the Markov assumption. Actually proving this is a little complicated; see this question for some details.
So we end up with
- Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle \sum_a \pi(a \mid s) \sum_{s',r} p(s',r\mid s,a) v_\pi(s')}
So now we have
Failed to parse (SVG (MathML can be enabled via browser plugin): Invalid response ("Math extension cannot connect to Restbase.") from server "https://wikimedia.org/api/rest_v1/":): {\displaystyle v_pi(s) = \sum_a \pi(a\mid s) \sum_{s',r} p(s',r \mid s,a) r + \gamma \sum_a \pi(a \mid s) \sum_{s',r} p(s',r\mid s,a) v_\pi(s')}