User:IssaRice/Chain rule proofs

From Machinelearning
Jump to: navigation, search

a differentiable function looks locally like a linear transformation. if you compose two differentiable functions, then it seems pretty obvious that locally, that composed map also looks like a linear transformation. but which linear transformation? well, you first apply the linear transformation that locally approximates the inner function. then you land in some target space, and you find what the outer function locally looks like at the place you land. and that's all the chain rule says.

So Folland actually says this on page 109: "Since the product of two matrices gives the composition of the linear transformations defined by those matrices, the chain rule just says that the linear approximation of a composition is the composition of the linear approximations." But... putting this short paragraph away after a proof, and only after you've already discussed two versions of the chain rule in previous chapters, is in my opinion not trying hard enough to communicate the central message. This should be in bold flashing letters at the TOP of the FIRST discussion of the chain rule.

Using Newton's approximation

Main idea

The main idea of using Newton's approximation to prove the chain rule is that since f is differentiable at x_0 we have the approximation f(x) \approx f(x_0) + f'(x_0)(x-x_0) when x is near x_0. Similarly since g is differentiable at f(x_0) we have the approximation g(y) \approx g(f(x_0)) + g'(f(x_0))(y - f(x_0)) when y is near f(x_0). Since f is differentiable at x_0, it is continuous there also, so we know that f(x) is near f(x_0) whenever x is near x_0. This allows us to substitute f(x) into y whenever x is near x_0. So we get

{\displaystyle \begin{align}g(f(x)) &\approx g(f(x_0)) + g'(f(x_0))(f(x) - f(x_0)) \\ &\approx g(f(x_0)) + g'(f(x_0))(f'(x_0)(x-x_0))\end{align}}

Thus we get g\circ f(x) \approx g\circ f(x_0) + g'(f(x_0))f'(x_0)(x-x_0), which is what the chain rule says.

Slightly more formally:

f(x) = f(x_0) + f'(x_0)(x-x_0) + o(x-x_0) when x is near x_0

g(y) = g(f(x_0)) + g'(f(x_0))(y - f(x_0)) + o(y-f(x_0)) when y is near f(x_0)

f is continuous at x_0 so f(x) is near f(x_0) whenever x is near x_0.

Thus if x is near x_0

{\displaystyle \begin{align}g(f(x)) &= g(f(x_0)) + g'(f(x_0))(f'(x_0)(x-x_0) + o(x-x_0)) + o(f(x)-f(x_0)) \\ &= g(f(x_0)) + g'(f(x_0))f'(x_0)(x-x_0) + g'(f(x_0))o(x-x_0) + o(f(x)-f(x_0))\end{align}}

to complete the proof, we just need to show that the error g'(f(x_0))o(x-x_0) + o(f(x)-f(x_0)) is o(x-x_0). The former term clearly is. The latter term is o(f'(x_0)(x-x_0) + o(x-x_0)) so it is as well.


We want to show g\circ f is differentiable at x_0 with derivative L:=g'(f(x_0))f'(x_0). By Newton's approximation, this is equivalent to showing that for every \epsilon > 0 there exists \delta > 0 such that

{\displaystyle |g\circ f(x) - (g\circ f(x_0) + L(x-x_0))| \leq \epsilon|x-x_0|}

whenever |x-x_0|\leq \delta. So let \epsilon > 0.

Now we do some algebraic manipulation. Write

{\displaystyle g(y) = g(y_0) + g'(y_0)(y - y_0) + E_g(y,y_0)}

where E_g(y,y_0) := g(y) - (g(y_0) + g'(y_0)(y - y_0)). This holds for every y \in Y. Since f(x) \in Y we thus have

{\displaystyle g(f(x)) = g(f(x_0)) + g'(f(x_0))(f(x) - f(x_0)) + E_g(f(x),f(x_0))}

Similarly write

{\displaystyle f(x) = f(x_0) + f'(x_0)(x - x_0) + E_f(x,x_0)}

where E_f(x,x_0) := f(x) - (f(x_0) + f'(x_0)(x - x_0)).

Substituting the expression for f(x) in the expression for g(f(x)) we get

{\displaystyle \begin{align}g(f(x)) &= g(f(x_0)) + g'(f(x_0))(f'(x_0)(x - x_0) + E_f(x,x_0)) + E_g(f(x),f(x_0)) \\ &= g(f(x_0)) + g'(f(x_0))f'(x_0)(x - x_0) + g'(f(x_0))E_f(x,x_0) + E_g(f(x),f(x_0))\end{align}}

we can rewrite this as g\circ f(x) - (g\circ f(x_0) + L(x-x_0)) = g'(f(x_0))E_f(x,x_0) + E_g(f(x),f(x_0))

Thus our goal now is to show |g'(f(x_0))E_f(x,x_0) + E_g(f(x),f(x_0))| \leq \epsilon|x-x_0|.

But by the triangle inequality it suffices to show |g'(f(x_0))E_f(x,x_0)| + |E_g(f(x),f(x_0))| \leq \epsilon|x-x_0|.

|g'(f(x_0))E_f(x,x_0)| \leq |g'(f(x_0))|\epsilon_1|x-x_0| where we are free to choose \epsilon_1.

To get the bound for |E_g(f(x),f(x_0))| (using Newton's approximation), we need to make sure |f(x)-f(x_0)| is small. But by continuity of f at x_0 we can do this.

{\displaystyle \begin{align}|E_g(f(x),f(x_0))| &\leq \epsilon_2 |f(x) - f(x_0)| \\ &= \epsilon_2 |f'(x_0)(x - x_0) + E_f(x,x_0)| \\ &\leq \epsilon_2(|f'(x_0)||x-x_0| + |x-x_0|) \\ &= \epsilon_2(|f'(x_0)| + 1)|x-x_0|\end{align}}

where again we are free to choose \epsilon_2.

TODO: can we do this same proof but without using the error term notation?

TODO: somehow Folland does this without explicitly using continuity of f; i need to understand if he's using it implicitly somehow or he's actually proving it when bounding |\mathbf h| using |u|

old proof

Since g is differentiable at y_0, we know g'(y_0) is a real number, and we can write

{\displaystyle g(y) = g(y_0) + g'(y_0)(y - y_0) + [g(y) - (g(y_0) + g'(y_0)(y-y_0))]}

(there is no magic: the terms just cancel out)

If we define E_g(y,y_0) := g(y) - (g(y_0) + g'(y_0)(y-y_0)) we can write

{\displaystyle g(y) = g(y_0) + g'(f(x_0))(y - y_0) + E_g(y,y_0)}

Newton's approximation says that |E_g(y,y_0)| \leq \epsilon|y-y_0| as long as |y-y_0|\leq \delta.

Since f is differentiable at x_0, we know that it must be continuous at x_0. This means we can keep |f(x)-y_0|\leq \delta as long as we keep |x-x_0|\leq \delta'.

Since f(x) \in Y and |f(x)-y_0|\leq \delta, this means we can substitute y = f(x) and get

{\displaystyle g(f(x)) = g(y_0) + g'(f(x_0))(f(x) - y_0) + E_g(f(x),y_0)}

Now we use the differentiability of f. We can write

{\displaystyle f(x) = f(x_0) + f'(x_0)(x - x_0) + [f(x) - (f(x_0) + f'(x_0)(x-x_0))]}

Again, we can define E_f(x,x_0) := f(x) - (f(x_0) + f'(x_0)(x-x_0)) and write this as

{\displaystyle f(x) = f(x_0) + f'(x_0)(x - x_0) + E_f(x,x_0)}

Now we can substitute this into the expression for g(f(x)) to get

{\displaystyle g(f(x)) = g(y_0) + g'(f(x_0))(f'(x_0)(x - x_0) + E_f(x,x_0)) + E_g(f(x),f(x_0))}

where we have canceled out two terms using f(x_0) = y_0.

Thus we have

{\displaystyle g(f(x)) = g(y_0) + g'(f(x_0))f'(x_0)(x - x_0) + [g'(f(x_0))E_f(x,x_0) + E_g(f(x), f(x_0))]}

We can write this as

{\displaystyle (g\circ f)(x) - ((g\circ f)(x_0) + L(x - x_0)) = [g'(f(x_0))E_f(x,x_0) + E_g(f(x), f(x_0))]}

where L := g'(f(x_0))f'(x_0). Now the left hand side looks like the expression in Newton's approximation. This means to show g\circ f is differentiable at x_0, we just need to show that |g'(f(x_0))E_f(x,x_0) + E_g(f(x), f(x_0))| \leq \epsilon|x - x_0|.

The stuff in square brackets is our "error term" for g\circ f. Now we just need to make sure it is small, even after dividing by |x-x_0|.

But f is differentiable at x_0, so by Newton's approximation,

{\displaystyle |g'(f(x_0))E_f(x,x_0)| \leq |g'(f(x_0))| \epsilon_1 |x-x_0|}

we also have

{\displaystyle |E_g(f(x), f(x_0))| \leq \epsilon_2 |f(x)-f(x_0)| = \epsilon_2 |f'(x_0)(x-x_0) + E_f(x,x_0)|}

We can bound this from above using the triangle inequality:

{\displaystyle \begin{align}|E_g(f(x), f(x_0))| &\leq \epsilon_2 |f'(x_0)(x-x_0)| + \epsilon_2 |E_f(x,x_0)| \\ &\leq \epsilon_2 |f'(x_0)| |x-x_0| + \epsilon_2 \epsilon_1 |x-x_0|\end{align}}

Now we can just choose \epsilon_1, \epsilon_2 small enough.

Limits of sequences

Main idea

Let (x_n)_{n=1}^\infty be a sequence taking values in X \setminus \{x_0\} that converges to x_0. Then we want to write

{\displaystyle \frac{g(f(x_n)) - g(f(x_0))}{x_n - x_0} = \frac{g(f(x_n)) - g(f(x_0))}{f(x_n) - f(x_0)} \cdot \frac{f(x_n) - f(x_0)}{x_n - x_0}}

Now use the limit laws to conclude that the limit is g'(f(x_0))\cdot f'(x_0). The problem is that f(x_n) - f(x_0) can be zero even when x_n \ne x_0.


Let (x_n)_{n=1}^\infty be a sequence taking values in X \setminus \{x_0\} that converges to x_0.

Define a function \phi : Y \to \mathbf R by

{\displaystyle \phi(y) := \begin{cases}\frac{g(y) - g(f(x_0))}{y - f(x_0)} & \text{if }y \ne f(x_0) \\ g'(f(x_0)) & \text{if } y = f(x_0)\end{cases}}

The idea is that we want to say \frac{g(f(x_n)) - g(f(x_0))}{f(x_n) - f(x_0)} is going to g'(f(x_0)), so we just define it at the undefined points to already be at that limit.

Now we have

{\displaystyle \frac{g(f(x_n)) - g(f(x_0))}{x_n - x_0} = \phi(f(x_n)) \cdot \frac{f(x_n) - f(x_0)}{x_n - x_0}}

for all x_n. (Why? Consider the cases f(x_n) = f(x_0) and f(x_n) \ne f(x_0) separately.)

Differentiability of g at f(x_0) says that if (y_n)_{n=1}^\infty is a sequence taking values in Y \setminus \{y_0\} that converges to f(x_0), then \frac{g(y_n) - g(f(x_0))}{y_n - f(x_0)} \to g'(f(x_0)) as n \to \infty. What if (y_n)_{n=1}^\infty is instead a sequence taking values in Y? Then we can say \phi(y_n) \to g'(f(x_0)) as n\to\infty. To show this, let \epsilon > 0.

Now we can find N \geq 1 such that for all n \geq N, if y_n \ne f(x_0), then |\phi(y_n) - g'(f(x_0))| = \left\vert\frac{g(y_n) - g(f(x_0))}{y_n - f(x_0)} - g'(f(x_0))\right\vert \leq \epsilon. (TODO: I think here we need to break off into two cases: one where there's an infinite number of y_n such that y_n \ne f(x_0), and one where there's only a finite number so that eventually the sequence is all just y_n = f(x_0). Only in the former case can we find N, by considering the subsequence that isn't equal to f(x_0), but this is not a problem because in the latter case the sequence's tail is already at the place where we need it to be, so we don't even need to find N. The question is, is there some more elegant way to do this that doesn't break off into cases?)

But this means if n \geq N, then we have two cases: either y_n \in Y and y_n \ne f(x_0), in which case |\phi(y_n) - g'(f(x_0))| \leq \epsilon as above, or else y_n = f(x_0), in which case \phi(y_n) = g'(f(x_0)) so |\phi(y) - g'(f(x_0))| = 0 \leq \epsilon.

Differentiability of f at x_0 implies continuity of f at x_0, so this means that f(x_n) \to f(x_0) as n \to \infty. Since f(x_n) \in Y for each n \geq 1, we can use (f(x_n))_{n=1}^\infty as our sequence in Y to conclude that as n \to \infty we have \phi(f(x_n)) \to g'(f(x_0)).

Now by the limit laws

{\displaystyle \begin{align}\lim_{n \to \infty} \frac{g(f(x_n)) - g(f(x_0))}{x_n - x_0} &= \left(\lim_{n \to \infty}\phi(f(x_n))\right) \left(\lim_{n \to \infty} \frac{f(x_n) - f(x_0)}{x_n - x_0}\right) \\ &= g'(f(x_0)) f'(x_0)\end{align}}

Since the sequence (x_n)_{n=1}^\infty was arbitrary, we can conclude that \lim_{x\to x_0;\, x \in X\setminus\{x_0\}} \frac{g(f(x)) - g(f(x_0))}{x - x_0} = g'(f(x_0)) f'(x_0).

\frac{g(f(x_n)) - g(f(x_0))}{f(x_n) - f(x_0)} \to g'(f(x_0))

TODO: Tao says that division by zero occurs when f'(x_0) = 0, which seems strange to me.