User:IssaRice/Random variable switching trick


i'm not sure what to call this, so i've temporarily given it a name, "Random variable switching trick".

in reinforcement learning, the value of a state, $v_\pi(s)$, is defined as the expected value of the return $G_t$: $v_\pi(s) := \mathbb{E}_\pi[G_t \mid S_t = s]$, where $G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots$.

but it turns out that we can do a weird sort of thing where we swap around which random variable we use.

for example, $v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s]$ can be written as $v_\pi(s) = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s]$ -- what we've done is to tuck the expected value stuff into the definition of $v_\pi$, so that the random variable changes from the return rv $G_t$ to the state rv $S_{t+1}$.
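written out, the move is:

$$v_\pi(s) = \mathbb{E}_\pi[G_t \mid S_t = s] = \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s] \overset{?}{=} \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s].$$

the middle step is just the return recursion $G_t = R_{t+1} + \gamma G_{t+1}$; the step marked with a question mark, where the return rv $G_{t+1}$ gets replaced by $v_\pi(S_{t+1})$ inside the expectation, is the switch that needs justifying.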

teaching this trick actually seems to be the purpose of exercises 3.18 and 3.19 in sutton and barto (besides the more pedestrian purpose of gaining familiarity with backup diagrams). but exercise 3.18 is oddly stated, because it says "in terms of the value at the expected leaf node, $q_\pi(s,a)$, given $S_t = s$". I think they meant to write $q_\pi(s, A_t)$ instead, i.e. the point was to swap in the action rv $A_t$ for the return rv $G_t$.

so it would look like $v_\pi(s) = \mathbb{E}_\pi[q_\pi(s, A_t) \mid S_t = s] = \sum_a \pi(a \mid s)\, q_\pi(s, a)$.

similarly exercise 3.19 is $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a] = \sum_{s', r} p(s', r \mid s, a)\, [r + \gamma v_\pi(s')]$.
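as a sanity check, here's a little monte carlo sketch in python on a made-up two-state MDP (every number in it -- dynamics, rewards, policy, discount -- is invented for illustration). it estimates $v_\pi$ and $q_\pi$ straight from their definitions as expected returns, by sampling truncated returns, and then checks the 3.18 and 3.19 identities against those estimates. agreement is only up to sampling error and horizon truncation, so expect the printed pairs to match to a couple of decimal places at best.

```python
import random

# A made-up two-state, two-action MDP; every number below (transition
# probabilities, rewards, policy, discount) is invented for illustration.
# p[s][a] is a list of (probability, next_state, reward) triples,
# i.e. the dynamics function p(s', r | s, a).
p = {
    0: {0: [(0.7, 0, 1.0), (0.3, 1, 0.0)],
        1: [(0.2, 0, 0.0), (0.8, 1, 2.0)]},
    1: {0: [(0.5, 0, 0.5), (0.5, 1, 0.5)],
        1: [(1.0, 1, -1.0)]},
}
pi = {0: {0: 0.4, 1: 0.6}, 1: {0: 0.9, 1: 0.1}}  # pi(a | s)
gamma = 0.9

def step(s, a):
    """Sample (next_state, reward) from p(., . | s, a)."""
    u, acc = random.random(), 0.0
    for prob, s2, r in p[s][a]:
        acc += prob
        if u < acc:
            return s2, r
    return s2, r  # guard against floating-point rounding

def sample_return(s, first_action=None, horizon=100):
    """Sample a (truncated) return G_t starting from state s under pi,
    optionally forcing the first action (used to estimate q_pi)."""
    g, discount = 0.0, 1.0
    for t in range(horizon):
        if t == 0 and first_action is not None:
            a = first_action
        else:
            acts, probs = zip(*pi[s].items())
            a = random.choices(acts, probs)[0]
        s, r = step(s, a)
        g += discount * r
        discount *= gamma
    return g

# Estimate v_pi and q_pi straight from their definitions as expected returns.
n = 20_000
v = {s: sum(sample_return(s) for _ in range(n)) / n for s in p}
q = {(s, a): sum(sample_return(s, a) for _ in range(n)) / n
     for s in p for a in p[s]}

for s in p:
    # exercise 3.18: v_pi(s) = sum_a pi(a|s) q_pi(s, a)
    mix = sum(pi[s][a] * q[s, a] for a in p[s])
    print(f"v({s}) = {v[s]:+.3f}   sum_a pi(a|s) q(s,a) = {mix:+.3f}")
    for a in p[s]:
        # exercise 3.19: q_pi(s,a) = sum_{s',r} p(s',r|s,a) [r + gamma v_pi(s')]
        backup = sum(prob * (r + gamma * v[s2]) for prob, s2, r in p[s][a])
        print(f"q({s},{a}) = {q[s, a]:+.3f}   one-step backup = {backup:+.3f}")
```

the point of estimating $q_\pi$ by forcing the first action (rather than computing it from $v_\pi$) is to avoid assuming the very identities we're checking.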

this kind of switch intuitively makes sense, but how can we prove it?

let's take the exercise 3.19 form, $q_\pi(s, a) = \mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$, as an example.

But $\mathbb{E}[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$ is the same as $\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$ because the policy won't even be used at time $t$, since we are already conditioning on the action that is taken.

So now we've reduced our problem to showing that $\mathbb{E}_\pi[G_t \mid S_t = s, A_t = a] = \mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]$.

By definition of $v_\pi$, we have $v_\pi(S_{t+1}) = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1}]$, where the right-hand side is a conditional expectation viewed as a random variable (a function of $S_{t+1}$).

here are some chicken scratches for how this could work from here:

$$
\begin{aligned}
\mathbb{E}_\pi[R_{t+1} + \gamma v_\pi(S_{t+1}) \mid S_t = s, A_t = a]
&= \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi\big[\mathbb{E}_\pi[G_{t+1} \mid S_{t+1}] \;\big|\; S_t = s, A_t = a\big] \\
&= \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi\big[\mathbb{E}_\pi[G_{t+1} \mid S_{t+1}, S_t, A_t] \;\big|\; S_t = s, A_t = a\big] \\
&= \mathbb{E}_\pi[R_{t+1} \mid S_t = s, A_t = a] + \gamma\, \mathbb{E}_\pi[G_{t+1} \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[R_{t+1} + \gamma G_{t+1} \mid S_t = s, A_t = a] \\
&= \mathbb{E}_\pi[G_t \mid S_t = s, A_t = a].
\end{aligned}
$$

the first line uses linearity of expectation and the definition of $v_\pi(S_{t+1})$ from above; the second line swaps in the extra conditioning variables; the third collapses the iterated expectation by the tower property (law of total expectation); the last uses the return recursion $G_t = R_{t+1} + \gamma G_{t+1}$.

(we used the markov property here, in the second line! so we needed to know that $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1}, S_t, A_t] = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1}]$. that's exactly what markov property tells us, after a laborious calculation...)
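to sketch that laborious calculation (in the finite case, so everything is a sum): $G_{t+1} = \sum_{k \ge 0} \gamma^k R_{t+2+k}$ is a function of the trajectory from time $t+1$ onward, and the conditional distribution of that trajectory factorizes as

$$\Pr{}_\pi(A_{t+1} = a_1, S_{t+2} = s_2, R_{t+2} = r_2, \ldots \mid S_{t+1} = s_1, S_t = s, A_t = a) = \pi(a_1 \mid s_1)\, p(s_2, r_2 \mid s_1, a_1)\, \pi(a_2 \mid s_2) \cdots$$

by repeatedly applying the markov property of the dynamics $p$ and the fact that $\pi$ only looks at the current state. the right-hand side doesn't mention $s$ or $a$ at all, so conditioning on $(S_t, A_t)$ on top of $S_{t+1}$ changes nothing, and taking the expectation of $G_{t+1}$ under this distribution gives $\mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s_1, S_t = s, A_t = a] = \mathbb{E}_\pi[G_{t+1} \mid S_{t+1} = s_1]$.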

i'm MOSTLY convinced, but i'm a little nervous about some of the steps.

So this random variable switching trick, once proven, is a way to modularize the derivation of the bellman equation even further. exercises 3.18 and 3.19 give trivial solutions to exercises 3.12 and 3.13, and these in turn give a trivial derivation of the bellman equation.

so the three ways to do the bellman equation are:

  1. do it straight. see this answer (though i think the markov property step is a little under-justified).
  2. do exercises 3.12 and 3.13, then do bellman. that's my answer.
  3. prove the random variable switching trick. this justifies summing over different random variables like $s'$, $r$, $a$ instead of $g$ in the expected value definitions of $v_\pi$ and $q_\pi$. once you have this in hand, you can do exercises 3.12/3.13, which then give you bellman (the composition is spelled out below).
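to spell out the composition in route 3: plug the exercise 3.19 expression for $q_\pi$ into the exercise 3.18 expression for $v_\pi$, and the bellman equation for $v_\pi$ pops out:

$$v_\pi(s) = \sum_a \pi(a \mid s)\, q_\pi(s, a) = \sum_a \pi(a \mid s) \sum_{s', r} p(s', r \mid s, a)\, \big[r + \gamma v_\pi(s')\big].$$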