User:IssaRice/Chain rule proofs

Revision as of 23:38, 28 November 2018

Using Newton's approximation

Main idea

The main idea of using Newton's approximation to prove the chain rule is that since <math>f</math> is differentiable at <math>x_0</math>, we have the approximation <math>f(x) \approx f(x_0) + f'(x_0)(x - x_0)</math> when <math>x</math> is near <math>x_0</math>. Similarly, since <math>g</math> is differentiable at <math>f(x_0)</math>, we have the approximation <math>g(y) \approx g(f(x_0)) + g'(f(x_0))(y - f(x_0))</math> when <math>y</math> is near <math>f(x_0)</math>. Since <math>f</math> is differentiable at <math>x_0</math>, it is also continuous there, so we know that <math>f(x)</math> is near <math>f(x_0)</math> whenever <math>x</math> is near <math>x_0</math>. This allows us to substitute <math>f(x)</math> for <math>y</math> whenever <math>x</math> is near <math>x_0</math>. So we get

<math>g(f(x)) \approx g(f(x_0)) + g'(f(x_0))(f(x) - f(x_0)) \approx g(f(x_0)) + g'(f(x_0))(f'(x_0)(x - x_0))</math>

Thus we get <math>g \circ f(x) \approx g \circ f(x_0) + g'(f(x_0))f'(x_0)(x - x_0)</math>, which is what the chain rule says.
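This approximation can be sanity-checked numerically. In the sketch below, the concrete choices <math>f(x) = \sin x</math>, <math>g(y) = y^2</math>, and <math>x_0 = 0.7</math> are my own illustrations, not from the text:

```python
import math

# Illustrative choices (not from the text): f(x) = sin(x), g(y) = y**2,
# so the chain rule predicts (g o f)'(x0) = g'(f(x0)) * f'(x0).
def f(x): return math.sin(x)
def fprime(x): return math.cos(x)
def g(y): return y * y
def gprime(y): return 2 * y

x0 = 0.7
L = gprime(f(x0)) * fprime(x0)  # predicted derivative of g o f at x0

# Error of the linearization g(f(x)) ~= g(f(x0)) + L*(x - x0),
# divided by |x - x0|; it should shrink as x -> x0.
def ratio(h):
    x = x0 + h
    return abs(g(f(x)) - (g(f(x0)) + L * (x - x0))) / abs(x - x0)

for h in [1e-1, 1e-2, 1e-3]:
    print(h, ratio(h))
```

Each printed ratio is roughly a tenth of the previous one, matching the fact that for smooth functions the linearization error is of order <math>|x - x_0|^2</math>.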

Proof

We want to show <math>g \circ f</math> is differentiable at <math>x_0</math> with derivative <math>L := g'(f(x_0))f'(x_0)</math>. By Newton's approximation, this is equivalent to showing that for every <math>\epsilon > 0</math> there exists <math>\delta > 0</math> such that

<math>|g \circ f(x) - (g \circ f(x_0) + L(x - x_0))| \leq \epsilon|x - x_0|</math>

whenever <math>|x - x_0| \leq \delta</math>. So let <math>\epsilon > 0</math>.

Now we do some algebraic manipulation. Write

<math>g(y) = g(y_0) + g'(y_0)(y - y_0) + E_g(y, y_0)</math>

where <math>E_g(y, y_0) := g(y) - (g(y_0) + g'(y_0)(y - y_0))</math>. This holds for every <math>y \in Y</math>. Since <math>f(x) \in Y</math>, we thus have

<math>g(f(x)) = g(f(x_0)) + g'(f(x_0))(f(x) - f(x_0)) + E_g(f(x), f(x_0))</math>

Similarly, write

<math>f(x) = f(x_0) + f'(x_0)(x - x_0) + E_f(x, x_0)</math>

where <math>E_f(x, x_0) := f(x) - (f(x_0) + f'(x_0)(x - x_0))</math>.

Substituting the expression for <math>f(x)</math> into the expression for <math>g(f(x))</math>, we get

<math>\begin{align} g(f(x)) &= g(f(x_0)) + g'(f(x_0))(f'(x_0)(x - x_0) + E_f(x, x_0)) + E_g(f(x), f(x_0)) \\ &= g(f(x_0)) + g'(f(x_0))f'(x_0)(x - x_0) + g'(f(x_0))E_f(x, x_0) + E_g(f(x), f(x_0)) \end{align}</math>

We can rewrite this as <math>g \circ f(x) - (g \circ f(x_0) + L(x - x_0)) = g'(f(x_0))E_f(x, x_0) + E_g(f(x), f(x_0))</math>.

Thus our goal now is to show <math>|g'(f(x_0))E_f(x, x_0) + E_g(f(x), f(x_0))| \leq \epsilon|x - x_0|</math>.

But by the triangle inequality, it suffices to show <math>|g'(f(x_0))E_f(x, x_0)| + |E_g(f(x), f(x_0))| \leq \epsilon|x - x_0|</math>.

By Newton's approximation applied to <math>f</math>, we have <math>|E_f(x, x_0)| \leq \epsilon_1|x - x_0|</math> whenever <math>|x - x_0|</math> is small enough, hence <math>|g'(f(x_0))E_f(x, x_0)| \leq |g'(f(x_0))|\epsilon_1|x - x_0|</math>, where we are free to choose <math>\epsilon_1</math>.

To get the bound for <math>|E_g(f(x), f(x_0))|</math> (using Newton's approximation), we need to make sure <math>|f(x) - f(x_0)|</math> is small. By continuity of <math>f</math> at <math>x_0</math>, we can do this. Then we have

<math>\begin{align} |E_g(f(x), f(x_0))| &\leq \epsilon_2|f(x) - f(x_0)| \\ &= \epsilon_2|f'(x_0)(x - x_0) + E_f(x, x_0)| \\ &\leq \epsilon_2|f'(x_0)||x - x_0| + \epsilon_2\epsilon_3|x - x_0| \\ &= (\epsilon_2|f'(x_0)| + \epsilon_2\epsilon_3)|x - x_0| \end{align}</math>

where again we are free to choose <math>\epsilon_2, \epsilon_3</math>.
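The key identity behind the proof, and the smallness of the combined error term, can be checked numerically. In the sketch below, <math>f(x) = e^x</math>, <math>g(y) = y^3</math>, and <math>x_0 = 0.3</math> are my own illustrative choices:

```python
import math

# Illustrative choices (mine, not the text's): f(x) = exp(x), g(y) = y**3.
def f(x): return math.exp(x)
def fprime(x): return math.exp(x)
def g(y): return y ** 3
def gprime(y): return 3 * y * y

def E_f(x, x0):
    # E_f(x, x0) := f(x) - (f(x0) + f'(x0)(x - x0))
    return f(x) - (f(x0) + fprime(x0) * (x - x0))

def E_g(y, y0):
    # E_g(y, y0) := g(y) - (g(y0) + g'(y0)(y - y0))
    return g(y) - (g(y0) + gprime(y0) * (y - y0))

x0 = 0.3
L = gprime(f(x0)) * fprime(x0)
for h in [1e-1, 1e-2, 1e-3]:
    x = x0 + h
    # Combined error term from the proof:
    total = gprime(f(x0)) * E_f(x, x0) + E_g(f(x), f(x0))
    # It equals the linearization error of g o f exactly (up to rounding) ...
    direct = g(f(x)) - (g(f(x0)) + L * (x - x0))
    # ... and is small relative to |x - x0|, as the epsilon bounds demand.
    print(abs(total - direct), abs(total) / abs(x - x0))
```

The first printed number is at rounding level (the two expressions are algebraically identical), and the second shrinks proportionally to <math>|x - x_0|</math>.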

TODO: can we do this same proof but without using the error term notation?

TODO: somehow Folland does this without explicitly using continuity of <math>f</math>; I need to understand whether he is using it implicitly somehow or actually proving it when bounding <math>|h|</math> using <math>|u|</math>.

Old proof

Since <math>g</math> is differentiable at <math>y_0 := f(x_0)</math>, we know <math>g'(y_0)</math> is a real number, and we can write

<math>g(y) = g(y_0) + g'(y_0)(y - y_0) + [g(y) - (g(y_0) + g'(y_0)(y - y_0))]</math>

(there is no magic: the terms just cancel out)

If we define <math>E_g(y, y_0) := g(y) - (g(y_0) + g'(y_0)(y - y_0))</math>, we can write

<math>g(y) = g(y_0) + g'(f(x_0))(y - y_0) + E_g(y, y_0)</math>

Newton's approximation says that, given <math>\epsilon > 0</math>, there is <math>\delta > 0</math> such that <math>|E_g(y, y_0)| \leq \epsilon|y - y_0|</math> as long as <math>|y - y_0| \leq \delta</math>.
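A concrete instance of this statement can be checked numerically. In the sketch below, <math>g(y) = \cos y</math> and <math>y_0 = 0.5</math> are my own illustrative choices; for this <math>g</math>, Taylor's theorem gives <math>|E_g(y, y_0)| \leq \tfrac{1}{2}|y - y_0|^2</math>, so <math>\delta = 2\epsilon</math> works:

```python
import math

# Illustrative choices (mine): g(y) = cos(y), y0 = 0.5.
def g(y): return math.cos(y)
def gprime(y): return -math.sin(y)

y0 = 0.5

def E_g(y):
    # E_g(y, y0) := g(y) - (g(y0) + g'(y0)(y - y0))
    return g(y) - (g(y0) + gprime(y0) * (y - y0))

epsilon = 1e-3
# Since |cos''| <= 1, Taylor gives |E_g(y)| <= |y - y0|**2 / 2,
# so any delta <= 2*epsilon guarantees |E_g(y)| <= epsilon*|y - y0|.
delta = 2 * epsilon
for y in [y0 + delta, y0 - delta, y0 + delta / 7]:
    assert abs(E_g(y)) <= epsilon * abs(y - y0)
```

Note that <math>\delta</math> depends on <math>\epsilon</math> (and on <math>g</math>), which is exactly the quantifier structure of Newton's approximation.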

Since <math>f</math> is differentiable at <math>x_0</math>, we know that it must be continuous at <math>x_0</math>. This means we can keep <math>|f(x) - y_0| \leq \delta</math> as long as we keep <math>|x - x_0| \leq \delta'</math> for some suitably small <math>\delta' > 0</math>.

Since <math>f(x) \in Y</math> and <math>|f(x) - y_0| \leq \delta</math>, this means we can substitute <math>y = f(x)</math> and get

<math>g(f(x)) = g(y_0) + g'(f(x_0))(f(x) - y_0) + E_g(f(x), y_0)</math>

Now we use the differentiability of <math>f</math>. We can write

<math>f(x) = f(x_0) + f'(x_0)(x - x_0) + [f(x) - (f(x_0) + f'(x_0)(x - x_0))]</math>

Again, we can define <math>E_f(x, x_0) := f(x) - (f(x_0) + f'(x_0)(x - x_0))</math> and write this as

<math>f(x) = f(x_0) + f'(x_0)(x - x_0) + E_f(x, x_0)</math>

Now we can substitute this into the expression for <math>g(f(x))</math> to get

<math>g(f(x)) = g(y_0) + g'(f(x_0))(f'(x_0)(x - x_0) + E_f(x, x_0)) + E_g(f(x), f(x_0))</math>

where we have canceled out two terms using <math>f(x_0) = y_0</math>.

Thus we have

<math>g(f(x)) = g(y_0) + g'(f(x_0))f'(x_0)(x - x_0) + [g'(f(x_0))E_f(x, x_0) + E_g(f(x), f(x_0))]</math>

We can write this as

<math>(g \circ f)(x) - ((g \circ f)(x_0) + L(x - x_0)) = [g'(f(x_0))E_f(x, x_0) + E_g(f(x), f(x_0))]</math>

where <math>L := g'(f(x_0))f'(x_0)</math>. Now the left-hand side looks like the expression in Newton's approximation. This means that to show <math>g \circ f</math> is differentiable at <math>x_0</math>, we just need to show that <math>|g'(f(x_0))E_f(x, x_0) + E_g(f(x), f(x_0))| \leq \epsilon|x - x_0|</math>.

The stuff in square brackets is our "error term" for <math>g \circ f</math>. Now we just need to make sure it is small, even after dividing by <math>|x - x_0|</math>.

But <math>f</math> is differentiable at <math>x_0</math>, so by Newton's approximation,

<math>|g'(f(x_0))E_f(x, x_0)| \leq |g'(f(x_0))|\epsilon_1|x - x_0|</math>

We also have

<math>|E_g(f(x), f(x_0))| \leq \epsilon_2|f(x) - f(x_0)| = \epsilon_2|f'(x_0)(x - x_0) + E_f(x, x_0)|</math>

We can bound this from above using the triangle inequality:

<math>|E_g(f(x), f(x_0))| \leq \epsilon_2|f'(x_0)(x - x_0)| + \epsilon_2|E_f(x, x_0)| \leq \epsilon_2|f'(x_0)||x - x_0| + \epsilon_2\epsilon_1|x - x_0|</math>

Now we can just choose <math>\epsilon_1, \epsilon_2</math> small enough.

Limits of sequences

Main idea

Let <math>(x_n)_{n=1}^\infty</math> be a sequence in <math>X \setminus \{x_0\}</math> that converges to <math>x_0</math>.

<math>\frac{g(f(x_n)) - g(f(x_0))}{x_n - x_0} = \frac{g(f(x_n)) - g(f(x_0))}{f(x_n) - f(x_0)} \cdot \frac{f(x_n) - f(x_0)}{x_n - x_0}</math>

Now use the limit laws to conclude that the limit is <math>g'(f(x_0))f'(x_0)</math>.

The problem is that <math>f(x_n) - f(x_0)</math> can be zero even when <math>x_n \neq x_0</math>.
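A standard example of this failure (my addition, not from the text): take <math>f(x) = x^2 \sin(1/x)</math> with <math>f(0) = 0</math> and <math>x_0 = 0</math>. Then <math>f</math> is differentiable at <math>0</math> with <math>f'(0) = 0</math>, yet <math>f(x_n) = f(0)</math> for the sequence <math>x_n = 1/(n\pi)</math>:

```python
import math

# f(x) = x**2 * sin(1/x), f(0) = 0: differentiable at 0 with f'(0) = 0.
def f(x):
    return x * x * math.sin(1.0 / x) if x != 0 else 0.0

for n in range(1, 5):
    x_n = 1.0 / (n * math.pi)   # x_n != 0 but sin(1/x_n) = sin(n*pi) = 0
    # Mathematically f(x_n) - f(0) = 0, so the factor
    # (g(f(x_n)) - g(f(0))) / (f(x_n) - f(0)) is undefined.
    # (In floating point the value is only ~1e-17 rather than exactly 0,
    # because math.sin(n*math.pi) is not exactly zero.)
    print(x_n, f(x_n) - f(0.0))
```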

Proof

Let <math>(x_n)_{n=1}^\infty</math> be a sequence in <math>X \setminus \{x_0\}</math> that converges to <math>x_0</math>.

Define a function <math>\phi \colon Y \to \mathbf{R}</math> by

<math>\phi(y) := \begin{cases} \dfrac{g(y) - g(f(x_0))}{y - f(x_0)} & \text{if } y \neq f(x_0) \\ g'(f(x_0)) & \text{if } y = f(x_0) \end{cases}</math>

The idea is that we want to say <math>\frac{g(f(x_n)) - g(f(x_0))}{f(x_n) - f(x_0)}</math> is going to <math>g'(f(x_0))</math>, so we just define it at the undefined point to already be at that limit.

Now we have

<math>\frac{g(f(x_n)) - g(f(x_0))}{x_n - x_0} = \phi(f(x_n)) \cdot \frac{f(x_n) - f(x_0)}{x_n - x_0}</math>

for all <math>n \geq 1</math>. (Why? Consider the cases <math>f(x_n) = f(x_0)</math> and <math>f(x_n) \neq f(x_0)</math> separately.)
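This case analysis can be sanity-checked in code. In the sketch below, <math>f(x) = x(x-1)</math>, <math>g(y) = y^2</math>, and <math>x_0 = 0</math> are my own illustrative choices; note that <math>f(1) = f(0)</math> exactly, so the loop exercises both branches of <math>\phi</math>:

```python
# Illustrative choices (mine): f(x) = x*(x - 1), g(y) = y**2, x0 = 0.
def f(x): return x * (x - 1.0)
def g(y): return y * y
def gprime(y): return 2.0 * y

x0 = 0.0
fx0 = f(x0)

def phi(y):
    # phi(y) = (g(y) - g(f(x0))) / (y - f(x0)) for y != f(x0),
    # and phi(f(x0)) = g'(f(x0)).
    if y != fx0:
        return (g(y) - g(fx0)) / (y - fx0)
    return gprime(fx0)

# Check the identity
#   (g(f(x)) - g(f(x0))) / (x - x0) = phi(f(x)) * (f(x) - f(x0)) / (x - x0)
# for x != x0, including x = 1.0, where f(x) = f(x0) exactly.
for x in [0.5, -0.3, 1.0]:
    lhs = (g(f(x)) - g(fx0)) / (x - x0)
    rhs = phi(f(x)) * (f(x) - fx0) / (x - x0)
    assert abs(lhs - rhs) < 1e-12
```

At <math>x = 1.0</math> both sides are zero: the left side because <math>g(f(x)) = g(f(x_0))</math>, the right side because <math>f(x) - f(x_0) = 0</math>, which is exactly why the patched <math>\phi</math> avoids the division-by-zero problem.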

Differentiability of <math>g</math> at <math>f(x_0)</math> says that if <math>(y_n)_{n=1}^\infty</math> is a sequence in <math>Y \setminus \{f(x_0)\}</math> that converges to <math>f(x_0)</math>, then <math>\frac{g(y_n) - g(f(x_0))}{y_n - f(x_0)} \to g'(f(x_0))</math> as <math>n \to \infty</math>. What if <math>(y_n)_{n=1}^\infty</math> is instead a sequence in all of <math>Y</math>? Then we can still say <math>\phi(y_n) \to g'(f(x_0))</math> as <math>n \to \infty</math>, since <math>\phi</math> is defined at <math>f(x_0)</math> to equal this limit.

Differentiability of <math>f</math> at <math>x_0</math> implies continuity of <math>f</math> at <math>x_0</math>, so <math>f(x_n) \to f(x_0)</math> as <math>n \to \infty</math>. Since <math>f(x_n) \in Y</math> for each <math>n \geq 1</math>, we can use <math>(f(x_n))_{n=1}^\infty</math> as our sequence in <math>Y</math> to conclude that <math>\phi(f(x_n)) \to g'(f(x_0))</math> as <math>n \to \infty</math>.

Now by the limit laws

<math>\lim_{n \to \infty} \frac{g(f(x_n)) - g(f(x_0))}{x_n - x_0} = \left(\lim_{n \to \infty} \phi(f(x_n))\right) \left(\lim_{n \to \infty} \frac{f(x_n) - f(x_0)}{x_n - x_0}\right) = g'(f(x_0))f'(x_0)</math>

Since the sequence <math>(x_n)_{n=1}^\infty</math> was arbitrary, we can conclude that <math>\lim_{x \to x_0;\, x \in X \setminus \{x_0\}} \frac{g(f(x)) - g(f(x_0))}{x - x_0} = g'(f(x_0))f'(x_0)</math>.


TODO: Tao says that division by zero occurs when <math>f'(x_0) = 0</math>, which seems strange to me.