A differentiable function looks locally like a linear transformation. If you compose two differentiable functions, then it seems pretty obvious that locally, the composed map also looks like a linear transformation. But which linear transformation? Well, you first apply the linear transformation that locally approximates the inner function. Then you land in some target space, and you find what the outer function locally looks like at the place where you land. And that's all the chain rule says.

So Folland actually says this on page 109: "Since the product of two matrices gives the composition of the linear transformations defined by those matrices, the chain rule just says that ''the linear approximation of a composition is the composition of the linear approximations''." But tucking this short paragraph away after a proof, and only after two versions of the chain rule have already been discussed in previous chapters, is in my opinion not trying hard enough to communicate the central message. This should be in bold flashing letters at the TOP of the FIRST discussion of the chain rule.
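For a concrete one-dimensional illustration of "compose the linear approximations" (just a toy example): take <math>f(x) = x^2</math> and <math>g(y) = \sin y</math>. Near <math>x_0</math>, the inner function looks like the linear map <math>h \mapsto 2x_0 h</math>; near <math>y_0 = x_0^2</math>, the outer function looks like the linear map <math>k \mapsto (\cos x_0^2)\, k</math>. Composing the two linear maps gives <math>h \mapsto 2x_0 \cos(x_0^2)\, h</math>, which is exactly the chain rule's answer:

<math display="block">(g\circ f)'(x_0) = g'(f(x_0))\,f'(x_0) = \cos(x_0^2)\cdot 2x_0.</math>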
==Using Newton's approximation==
===Main idea===
The main idea of using Newton's approximation to prove the chain rule is that since f is differentiable at <math>x_0</math> we have the approximation <math>f(x) \approx f(x_0) + f'(x_0)(x-x_0)</math> when <math>x</math> is near <math>x_0</math>. Similarly since g is differentiable at <math>f(x_0)</math> we have the approximation <math>g(y) \approx g(f(x_0)) + g'(f(x_0))(y - f(x_0))</math> when <math>y</math> is near <math>f(x_0)</math>. Since f is differentiable at <math>x_0</math>, it is continuous there also, so we know that <math>f(x)</math> is near <math>f(x_0)</math> whenever <math>x</math> is near <math>x_0</math>. This allows us to substitute <math>y = f(x)</math> into <math>g(y) \approx g(f(x_0)) + g'(f(x_0))(y - f(x_0))</math> whenever <math>x</math> is near <math>x_0</math>. So we get

<math display="block">g(f(x)) \approx g(f(x_0)) + g'(f(x_0))(f(x) - f(x_0)) \approx g(f(x_0)) + g'(f(x_0))f'(x_0)(x-x_0)</math>

Thus we get <math>g\circ f(x) \approx g\circ f(x_0) + g'(f(x_0))f'(x_0)(x-x_0)</math>, which is what the chain rule says.
Slightly more formally:

<math>f(x) = f(x_0) + f'(x_0)(x-x_0) + o(x-x_0)</math> when <math>x</math> is near <math>x_0</math>

<math>g(y) = g(f(x_0)) + g'(f(x_0))(y - f(x_0)) + o(y-f(x_0))</math> when <math>y</math> is near <math>f(x_0)</math>

<math>f</math> is continuous at <math>x_0</math>, so <math>f(x)</math> is near <math>f(x_0)</math> whenever <math>x</math> is near <math>x_0</math>.

Thus if <math>x</math> is near <math>x_0</math>

<math display="block">\begin{align}g(f(x)) &= g(f(x_0)) + g'(f(x_0))(f'(x_0)(x-x_0) + o(x-x_0)) + o(f(x)-f(x_0)) \\ &= g(f(x_0)) + g'(f(x_0))f'(x_0)(x-x_0) + g'(f(x_0))o(x-x_0) + o(f(x)-f(x_0))\end{align}</math>

To complete the proof, we just need to show that the error <math>g'(f(x_0))o(x-x_0) + o(f(x)-f(x_0))</math> is <math>o(x-x_0)</math>. The former term clearly is. The latter term is <math>o(f'(x_0)(x-x_0) + o(x-x_0))</math>, so it is as well.
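To see why the latter term is also <math>o(x-x_0)</math>, here is one way to spell it out: given <math>\epsilon > 0</math>, for <math>x</math> close enough to <math>x_0</math> we have

<math display="block">|o(f(x)-f(x_0))| \leq \frac{\epsilon}{|f'(x_0)|+1}|f(x)-f(x_0)| \leq \frac{\epsilon}{|f'(x_0)|+1}\big(|f'(x_0)|+1\big)|x-x_0| = \epsilon|x-x_0|,</math>

where the first inequality uses the definition of <math>o(f(x)-f(x_0))</math> together with continuity of <math>f</math> at <math>x_0</math>, and the second uses <math>|f(x)-f(x_0)| = |f'(x_0)(x-x_0) + o(x-x_0)| \leq (|f'(x_0)|+1)|x-x_0|</math> for <math>x</math> near <math>x_0</math>.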
===Proof===
We want to show <math>g \circ f</math> is differentiable at <math>x_0</math> with derivative <math>g'(f(x_0))f'(x_0)</math>. By Newton's approximation, this is equivalent to showing that for every <math>\epsilon > 0</math> there exists <math>\delta > 0</math> such that

<math display="block">|g\circ f(x) - (g\circ f(x_0) + g'(f(x_0))f'(x_0)(x-x_0))| \leq \epsilon|x-x_0|</math>

whenever <math>|x-x_0| \leq \delta</math>. So let <math>\epsilon > 0</math>.

Now we do some algebraic manipulation. Write

<math display="block">g(y) = g(f(x_0)) + g'(f(x_0))(y - f(x_0)) + E_g(y, f(x_0))</math>

where <math>E_g(y, f(x_0)) := g(y) - (g(f(x_0)) + g'(f(x_0))(y - f(x_0)))</math>. This holds for every <math>y \in Y</math>. Since <math>f(x) \in Y</math> we thus have

<math display="block">g(f(x)) = g(f(x_0)) + g'(f(x_0))(f(x) - f(x_0)) + E_g(f(x), f(x_0))</math>

Similarly write

<math display="block">f(x) = f(x_0) + f'(x_0)(x-x_0) + E_f(x, x_0)</math>

where <math>E_f(x, x_0) := f(x) - (f(x_0) + f'(x_0)(x-x_0))</math>.

Substituting the expression for <math>f(x)</math> in the expression for <math>g(f(x))</math> we get

<math display="block">g(f(x)) = g(f(x_0)) + g'(f(x_0))(f'(x_0)(x-x_0) + E_f(x,x_0)) + E_g(f(x), f(x_0))</math>

we can rewrite this as

<math display="block">g\circ f(x) - (g\circ f(x_0) + g'(f(x_0))f'(x_0)(x-x_0)) = g'(f(x_0))E_f(x,x_0) + E_g(f(x), f(x_0))</math>

Thus our goal now is to show <math>|g'(f(x_0))E_f(x,x_0) + E_g(f(x), f(x_0))| \leq \epsilon|x-x_0|</math>.

But by the triangle inequality it suffices to show <math>|g'(f(x_0))||E_f(x,x_0)| + |E_g(f(x), f(x_0))| \leq \epsilon|x-x_0|</math>. By Newton's approximation applied to <math>f</math>,

<math display="block">|E_f(x,x_0)| \leq \epsilon_1|x-x_0|</math>

where we are free to choose <math>\epsilon_1</math>; in particular we may take <math>\epsilon_1 \leq 1</math>, so that also <math>|E_f(x,x_0)| \leq |x-x_0|</math>.
To get the bound for <math>|E_g(f(x),f(x_0))|</math> (using Newton's approximation), we need to make sure <math>|f(x)-f(x_0)|</math> is small. But by continuity of <math>f</math> at <math>x_0</math> we can do this.

<math display="block">\begin{align}|E_g(f(x),f(x_0))| &\leq \epsilon_2 |f(x) - f(x_0)| \\ &= \epsilon_2 |f'(x_0)(x - x_0) + E_f(x,x_0)| \\ &\leq \epsilon_2(|f'(x_0)||x-x_0| + |x-x_0|) \\ &= \epsilon_2(|f'(x_0)| + 1)|x-x_0|\end{align}</math>

where again we are free to choose <math>\epsilon_2</math>.
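To finish, one way to do the bookkeeping is to combine the two bounds and then pick <math>\epsilon_1, \epsilon_2</math> explicitly:

<math display="block">|g'(f(x_0))||E_f(x,x_0)| + |E_g(f(x),f(x_0))| \leq \big(\epsilon_1|g'(f(x_0))| + \epsilon_2(|f'(x_0)|+1)\big)|x-x_0|</math>

so it suffices to choose, say, <math>\epsilon_1 \leq \min\!\left(1, \tfrac{\epsilon}{2(|g'(f(x_0))|+1)}\right)</math> and <math>\epsilon_2 \leq \tfrac{\epsilon}{2(|f'(x_0)|+1)}</math>, and then take <math>\delta</math> small enough that all of the bounds above are in force.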
TODO: can we do this same proof but without using the error term notation?
TODO: somehow Folland does this without explicitly using continuity of f; I need to understand if he's using it implicitly somehow or he's actually proving it when bounding <math>|f(x)-f(x_0)|</math> using <math>E_f(x,x_0)</math>.
===Old proof===
Since <math>g</math> is differentiable at <math>y_0 := f(x_0)</math>, we know <math>g'(y_0)</math> is a real number, and we can write

<math display="block">g(y)=g(y_{0})+g'(y_{0})(y-y_{0})+[g(y)-(g(y_{0})+g'(y_{0})(y-y_{0}))]</math>

(there is no magic: the terms just cancel out)

If we define <math>E_g(y,y_0) := g(y)-(g(y_{0})+g'(y_{0})(y-y_{0}))</math> we can write

<math display="block">g(y)=g(y_{0})+g'(y_{0})(y-y_{0})+E_g(y,y_0)</math>

Newton's approximation says that <math>|E_g(y,y_0)| \leq \epsilon_2|y-y_0|</math> as long as <math>|y-y_0| \leq \delta_2</math>.

Since <math>f</math> is differentiable at <math>x_0</math>, we know that it must be continuous at <math>x_0</math>. This means we can keep <math>|f(x)-f(x_0)| \leq \delta_2</math> as long as we keep <math>|x-x_0| \leq \delta_1</math>.

Since <math>y_0 = f(x_0)</math> and <math>f(x) \in Y</math>, this means we can substitute <math>y = f(x)</math> and get

<math display="block">g(f(x))=g(f(x_{0}))+g'(f(x_{0}))(f(x)-f(x_{0}))+E_g(f(x),f(x_0))</math>

Now we use the differentiability of <math>f</math>. We can write

<math display="block">f(x)=f(x_{0})+f'(x_{0})(x-x_{0})+[f(x)-(f(x_{0})+f'(x_{0})(x-x_{0}))]</math>

Again, we can define <math>E_f(x,x_0) := f(x)-(f(x_{0})+f'(x_{0})(x-x_{0}))</math> and write this as

<math display="block">f(x)=f(x_{0})+f'(x_{0})(x-x_{0})+E_f(x,x_0)</math>

Now we can substitute this into the expression for <math>g(f(x))</math> to get

<math display="block">g(f(x))=g(y_{0})+g'(f(x_{0}))(f'(x_{0})(x-x_{0})+E_f(x,x_0))+E_g(f(x),f(x_{0}))</math>

where we have canceled out two terms using <math>y_0 = f(x_0)</math>.

Thus we have

<math display="block">g(f(x))=g(y_{0})+g'(f(x_{0}))f'(x_{0})(x-x_{0})+[g'(f(x_{0}))E_{f}(x,x_{0})+E_{g}(f(x),f(x_{0}))]</math>

We can write this as

<math display="block">(g\circ f)(x)-((g\circ f)(x_{0})+L(x-x_{0}))=[g'(f(x_{0}))E_{f}(x,x_{0})+E_{g}(f(x),f(x_{0}))]</math>

where <math>L = g'(f(x_0))f'(x_0)</math>. Now the left hand side looks like the expression in Newton's approximation. This means to show <math>g \circ f</math> is differentiable at <math>x_0</math>, we just need to show that <math>|g'(f(x_{0}))E_{f}(x,x_{0})+E_{g}(f(x),f(x_{0}))| \leq \epsilon|x-x_0|</math>.

The stuff in square brackets is our "error term" for <math>g \circ f</math>. Now we just need to make sure it is small, even after dividing by <math>|x-x_0|</math>.

But f is differentiable at <math>x_0</math>, so by Newton's approximation,

<math display="block">|E_f(x,x_0)| \leq \epsilon_1|x-x_0|</math>

we also have

<math display="block">|E_g(f(x),f(x_0))| \leq \epsilon_2|f(x)-f(x_0)|</math>

We can bound this from above using the triangle inequality:

<math display="block">|f(x)-f(x_0)| = |f'(x_0)(x-x_0)+E_f(x,x_0)| \leq |f'(x_0)||x-x_0|+\epsilon_1|x-x_0|</math>

Now we can just choose <math>\epsilon_1, \epsilon_2</math> small enough.
==Limits of sequences==
===Main idea===
Let <math>(x_n)_{n=1}^\infty</math> be a sequence taking values in <math>X \setminus \{x_0\}</math> that converges to <math>x_0</math>. Then we want to write

<math display="block">\frac{g(f(x_n)) - g(f(x_0))}{x_n - x_0} = \frac{g(f(x_n)) - g(f(x_0))}{f(x_n) - f(x_0)} \cdot \frac{f(x_n) - f(x_0)}{x_n - x_0}</math>

Now use the limit laws to conclude that the limit is <math>g'(f(x_0))f'(x_0)</math>. The problem is that <math>f(x_n) - f(x_0)</math> can be zero even when <math>x_n \ne x_0</math>.
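For a concrete instance of this problem: let <math>f(x) = x^2\sin(1/x)</math> for <math>x \ne 0</math> and <math>f(0) = 0</math>, let <math>x_0 = 0</math>, and consider the sequence <math>x_n = \frac{1}{n\pi}</math>. Then <math>x_n \ne x_0</math> for every <math>n</math>, but

<math display="block">f(x_n) - f(x_0) = \frac{1}{n^2\pi^2}\sin(n\pi) - 0 = 0,</math>

so the first factor above is undefined along this sequence, even though <math>f</math> is differentiable at <math>0</math> (with <math>f'(0) = 0</math>).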
===Proof===
Let <math>(x_n)_{n=1}^\infty</math> be a sequence taking values in <math>X \setminus \{x_0\}</math> that converges to <math>x_0</math>.
Define a function <math>\phi : Y \to \mathbf R</math> by

<math display="block">\phi(y) = \begin{cases} \dfrac{g(y) - g(f(x_0))}{y - f(x_0)} & \text{if } y \ne f(x_0) \\ g'(f(x_0)) & \text{if } y = f(x_0) \end{cases}</math>

The idea is that we want to say <math>\frac{g(y) - g(f(x_0))}{y - f(x_0)}</math> is going to <math>g'(f(x_0))</math>, so we just define it at the undefined points to already be at that limit.

Now we have

<math display="block">g(f(x_n)) - g(f(x_0)) = \phi(f(x_n))(f(x_n) - f(x_0))</math>

for all <math>x_n</math>. (Why? Consider the cases <math>f(x_n) = f(x_0)</math> and <math>f(x_n) \ne f(x_0)</math> separately.)
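One way to check this identity, case by case:

<math display="block">\phi(f(x_n))\,(f(x_n) - f(x_0)) = \begin{cases} \dfrac{g(f(x_n)) - g(f(x_0))}{f(x_n) - f(x_0)}\,(f(x_n) - f(x_0)) = g(f(x_n)) - g(f(x_0)) & \text{if } f(x_n) \ne f(x_0) \\ g'(f(x_0))\cdot 0 = 0 = g(f(x_n)) - g(f(x_0)) & \text{if } f(x_n) = f(x_0). \end{cases}</math>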
Differentiability of <math>g</math> at <math>f(x_0)</math> says that if <math>(y_n)_{n=1}^\infty</math> is a sequence taking values in <math>Y \setminus \{f(x_0)\}</math> that converges to <math>f(x_0)</math>, then <math>\frac{g(y_n) - g(f(x_0))}{y_n - f(x_0)} \to g'(f(x_0))</math> as <math>n \to \infty</math>. What if <math>(y_n)_{n=1}^\infty</math> is instead a sequence taking values in <math>Y</math>? Then we can say <math>\phi(y_n) \to g'(f(x_0))</math> as <math>n\to\infty</math>. To show this, let <math>\epsilon > 0</math>.

Now we can find <math>N \geq 1</math> such that for all <math>n \geq N</math>, if <math>y_n \ne f(x_0)</math>, then <math>|\phi(y_n) - g'(f(x_0))| = \left\vert\frac{g(y_n) - g(f(x_0))}{y_n - f(x_0)} - g'(f(x_0))\right\vert \leq \epsilon</math>. (TODO: I think here we need to break off into two cases: one where there's an infinite number of <math>y_n</math> such that <math>y_n \ne f(x_0)</math>, and one where there's only a finite number so that eventually the sequence is all just <math>y_n = f(x_0)</math>. Only in the former case can we find <math>N</math>, by considering the subsequence that isn't equal to <math>f(x_0)</math>, but this is not a problem because in the latter case the sequence's tail is already at the place where we need it to be, so we don't even need to find <math>N</math>. The question is, is there some more elegant way to do this that doesn't break off into cases?)

But this means if <math>n \geq N</math>, then we have two cases: either <math>y_n \in Y</math> and <math>y_n \ne f(x_0)</math>, in which case <math>|\phi(y_n) - g'(f(x_0))| \leq \epsilon</math> as above, or else <math>y_n = f(x_0)</math>, in which case <math>\phi(y_n) = g'(f(x_0))</math> so <math>|\phi(y_n) - g'(f(x_0))| = 0 \leq \epsilon</math>.
Differentiability of <math>f</math> at <math>x_0</math> implies continuity of <math>f</math> at <math>x_0</math>, so this means that <math>f(x_n) \to f(x_0)</math> as <math>n \to \infty</math>. Since <math>f(x_n) \in Y</math> for each <math>n \geq 1</math>, we can use <math>(f(x_n))_{n=1}^\infty</math> as our sequence in <math>Y</math> to conclude that as <math>n \to \infty</math> we have <math>\phi(f(x_n)) \to g'(f(x_0))</math>.
Now by the limit laws

<math display="block">\frac{g(f(x_n)) - g(f(x_0))}{x_n - x_0} = \phi(f(x_n)) \cdot \frac{f(x_n) - f(x_0)}{x_n - x_0} \to g'(f(x_0))f'(x_0) \quad \text{as } n \to \infty</math>

Since the sequence <math>(x_n)_{n=1}^\infty</math> was arbitrary, we can conclude that <math>(g\circ f)'(x_0) = g'(f(x_0))f'(x_0)</math>.
TODO: Tao says that division by zero occurs when <math>f'(x_0) = 0</math>, which seems strange to me.