User:IssaRice/Random variable switching trick


i'm not sure what to call this, so i've temporarily given it a name, "Random variable switching trick".

in reinforcement learning, the value of a state, $v_\pi(s)$, is defined as the expected value of the return $G_t$: $v_\pi(s) := \mathbb{E}_\pi[G_t \mid S_t = s]$, where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$.

but it turns out that we can do a weird sort of thing where we swap around which random variable we use.

for example, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ can be written as $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ -- what we've done is to tuck the expected value stuff into the definition of $v_\pi$, so that the random variable changes from the return rv $G_t$ to the state rv $S_{t+1}$.
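written out, the move is:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \overset{?}{=} \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s].$$

the middle step is just the return recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the step marked with a question mark, where the return rv $G_{t+1}$ gets replaced by $v_\pi(S_{t+1})$ inside the expectation, is the switch that needs justifying.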

teaching this trick actually seems to be the purpose of exercises 3.18 and 3.19 in sutton and barto (besides the more pedestrian purpose of gaining familiarity with backup diagrams). but exercise 3.18 is oddly stated, because it says "in terms of the value at the expected leaf node, $q_\pi(s,a)$, given $S_t = s$". I think they meant to write $q_\pi(s, A_t)$ instead, i.e. the point was to swap in the action rv $A_t$ for the return rv $G_t$.

so it would look like $v_\pi(s) = \mathbb{E}_\pi[q_\pi(s, A_t) \mid S_t = s] = \sum_a \pi(a \mid s)\, q_\pi(s, a)$.

similarly exercise 3.19 is $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a)\, [r + \gamma v_\pi(s')]$.
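as a sanity check, here's a little monte carlo sketch in python on a made-up two-state MDP (every number in it -- dynamics, rewards, policy, discount -- is invented for illustration). it estimates $v_\pi$ and $q_\pi$ straight from their definitions as expected returns, by sampling truncated returns, and then checks the 3.18 and 3.19 identities against those estimates. agreement is only up to sampling error and horizon truncation, so expect the printed pairs to match to a couple of decimal places at best.

```python
import random

# A made-up two-state, two-action MDP; every number below (transition
# probabilities, rewards, policy, discount) is invented for illustration.
# p[s][a] is a list of (probability, next_state, reward) triples,
# i.e. the dynamics function p(s', r | s, a).
p = {
    0: {0: [(0.7, 0, 1.0), (0.3, 1, 0.0)],
        1: [(0.2, 0, 0.0), (0.8, 1, 2.0)]},
    1: {0: [(0.5, 0, 0.5), (0.5, 1, 0.5)],
        1: [(1.0, 1, -1.0)]},
}
pi = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.9, 1: 0.1}}  # pi(a | s)
gamma = 0.9

def step(s, a):
    """Sample (next_state, reward) from p(., . | s, a)."""
    u, acc = random.random(), 0.0
    for prob, s2, r in p[s][a]:
        acc += prob
        if u < acc:
            return s2, r
    return s2, r  # guard against floating-point rounding

def sample_return(s, first_action=None, horizon=100):
    """Sample a (truncated) return G_t starting from state s under pi,
    optionally forcing the first action (used to estimate q_pi)."""
    g, discount = 0.0, 1.0
    for t in range(horizon):
        if t == 0 and first_action is not None:
            a = first_action
        else:
            acts, probs = zip(*pi[s].items())
            a = random.choices(acts, probs)[0]
        s, r = step(s, a)
        g += discount * r
        discount *= gamma
    return g

# Estimate v_pi and q_pi straight from their definitions as expected returns.
n = 20_000
v = {s: sum(sample_return(s) for _ in range(n)) / n for s in p}
q = {(s, a): sum(sample_return(s, a) for _ in range(n)) / n
     for s in p for a in p[s]}

for s in p:
    # exercise 3.18: v_pi(s) = sum_a pi(a|s) q_pi(s, a)
    mix = sum(pi[s][a] * q[s, a] for a in p[s])
    print(f"v({s}) = {v[s]:+.3f}   sum_a pi(a|s) q(s,a) = {mix:+.3f}")
    for a in p[s]:
        # exercise 3.19: q_pi(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma v_pi(s')]
        backup = sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[s][a])
        print(f"q({s},{a}) = {q[s, a]:+.3f}   one-step backup = {backup:+.3f}")
```

the point of estimating $q_\pi$ by forcing the first action (rather than computing it from $v_\pi$) is to avoid assuming the very identities we're checking.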

this kind of switch intuitively makes sense, but how can we prove it?

let's take the exercise 3.19 form, $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$, as an example.

But $\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$ is the same as $\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$ because the policy won't even be used at time $t$, since we are already conditioning on the action that is taken.

So now we've reduced our problem to showing that $\mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$.

By definition of $v_\pi$, we have $v_\pi(S_{t+1}) = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1}]$, where the right-hand side is a conditional expectation viewed as a random variable (a function of $S_{t+1}$).

here are some chicken scratches for how this could work from here:

$$
\begin{aligned}
\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]
&= \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi\big[\mathbb{E}_\pi[G_{t+1} \mid S_{t+1}] \;\big|\; S_t = s, A_t = a\big] \\
&= \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi\big[\mathbb{E}_\pi[G_{t+1} \mid S_{t+1}, S_t, A_t] \;\big|\; S_t = s, A_t = a\big] \\
&= \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a].
\end{aligned}
$$

the first line uses linearity of expectation and the definition of $v_\pi(S_{t+1})$ from above; the second line swaps in the extra conditioning variables; the third collapses the iterated expectation by the tower property (law of total expectation); the last uses the return recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

(we used the markov property here, in the second line! so we needed to know that $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1}, S_t, A_t] = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1}]$. that's exactly what markov property tells us, after a laborious calculation...)
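to sketch that laborious calculation (in the finite case, so everything is a sum): $G_{t+1} = \sum_{k \ge 0} \gamma^k R_{t+2+k}$ is a function of the trajectory from time $t+1$ onward, and the conditional distribution of that trajectory factorizes as

$$\Pr{}_\pi(A_{t+1} = a_1, S_{t+2} = s_2, R_{t+2} = r_2, \ldots \mid S_{t+1} = s_1, S_t = s, A_t = a) = \pi(a_1 \mid s_1)\, p(s_2, r_2 \mid s_1, a_1)\, \pi(a_2 \mid s_2) \cdots$$

by repeatedly applying the markov property of the dynamics $p$ and the fact that $\pi$ only looks at the current state. the right-hand side doesn't mention $s$ or $a$ at all, so conditioning on $(S_t, A_t)$ on top of $S_{t+1}$ changes nothing, and taking the expectation of $G_{t+1}$ under this distribution gives $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s_1, S_t = s, A_t = a] = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s_1]$.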

i'm MOSTLY convinced, but i'm a little nervous about some of the steps.

So this random variable switching trick, once proven, is a way to modularize the derivation of the bellman equation even further. exercises 3.18 and 3.19 give trivial solutions to exercises 3.12 and 3.13, and these in turn give a trivial derivation of the bellman equation.

so the three ways to do the bellman equation are:

  1. do it straight. see this answer (though i think the markov property step is a little under-justified).
  2. do exercises 3.12 and 3.13, then do bellman. that's my answer.
  3. prove the random variable switching trick. this justifies summing over different random variables like $s'$, $r$, $a$ instead of $g$ in the expected value definitions of $v_\pi$ and $q_\pi$. once you have this in hand, you can do exercises 3.12/3.13, which then give you bellman (the composition is spelled out below).
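to spell out the composition in route 3: plug the exercise 3.19 expression for $q_\pi$ into the exercise 3.18 expression for $v_\pi$, and the bellman equation for $v_\pi$ pops out:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \big[r + \gamma v_\pi(s')\big].$$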