# Chain rule proof

1. Aug 9, 2004

### Corneo

Can someone give me hints on how to prove the chain rule? I want to be able to learn this.

Let $h(x)=f(g(x))$, then

$$h'(x)=\lim\limits_{h \to 0}\frac {f(g(x+h))-f(g(x))}{h}$$

But what now?

2. Aug 9, 2004

### arildno

What level of rigour do you want?

3. Aug 9, 2004

### Galileo

There is a characterization of differentiability of a function at a point $x_0$ that is equivalent to saying that the limit

$$\lim_{x \rightarrow x_0}\frac{f(x)-f(x_0)}{x-x_0}$$
exists.
A function is differentiable at a point $x_0$ if there exist
a number $A$ and an $\epsilon>0$ such that for $|h|<\epsilon$:
$$f(x_0+h)=f(x_0)+hA+R(h)$$
with $$\lim_{h\rightarrow 0}\frac{R(h)}{h}=0$$
Note: By comparison we see that $A=f'(x_0)$
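A quick numerical illustration of the second definition may help. This is only a sketch: the sample function f(x) = x^3, the point x0 = 2 and the helper name `remainder` are choices made here for illustration. With A = f'(x0) = 12, the remainder R(h) = f(x0+h) - f(x0) - hA should shrink faster than h, so R(h)/h should tend to zero.

```python
def f(x):
    return x ** 3

X0 = 2.0
A = 12.0  # f'(x0) = 3 * x0**2 for this sample f

def remainder(h):
    # R(h) as in the definition: what is left after the linear part h*A
    return f(X0 + h) - f(X0) - h * A

for h in (1e-1, 1e-2, 1e-3):
    # Exact value for this f is R(h)/h = 6h + h^2, which goes to 0 with h
    print(h, remainder(h) / h)
```

Replacing A by any other number leaves a term proportional to h in R(h), so R(h)/h no longer tends to zero; that is why A has to be f'(x0).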

You can check for yourself that these definitions are equivalent. The second one
is sometimes more convenient for giving proofs like the chain rule.
Let $H(x)=f(g(x))$ (g differentiable at $h$, f differentiable at $g(h)$ and defined on the range of g, etc.).
Write:
$$g(x)-g(h)=(x-h)(g'(h)+u(x))$$
$$f(s)-f(y)=(s-y)(f'(y)+v(s))$$
with u going to zero as x->h and v going to zero as s->y.

Express H'(h) this way and stick in y=g(h) and s=g(x).

4. Aug 9, 2004

### Corneo

Err, can you translate what Galileo just did? That went right over my head.

5. Aug 9, 2004

### HallsofIvy

y= f(g(x))

dy/dx= lim(h->0) (f(g(x+h))- f(g(x)))/h

Let k= g(x+ h)- g(x) so that g(x+h)= g(x)+ k

Then (f(g(x+h))- f(g(x)))/h= (f(g(x)+ k)- f(g(x)))/h
= ((f(g(x)+ k)- f(g(x)))/k) (k/h)
= ((f(g(x)+ k)- f(g(x)))/k) ((g(x+h)- g(x))/h)
Now let u= g(x) so that we can write this as
(f(g(x+h))- f(g(x)))/h= ((f(u+k)- f(u))/k) ((g(x+h)- g(x))/h)

Since g is differentiable at x, it is continuous there: as h-> 0, g(x+h)-> g(x) and we must have k-> 0.

That is,
lim(h->0) (f(g(x+h))-f(g(x)))/h= (lim(k->0) (f(u+k)-f(u))/k) (lim(h->0) (g(x+h)-g(x))/h)

Each of those limits is a derivative, so
df(g(x))/dx= df(u)/du dg(x)/dx which, since u is defined as g(x), is

df(g(x))/dx= (df/dg)(dg/dx), the chain rule.
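The factorization above can be checked numerically. This is a rough sketch under assumptions chosen here for illustration: f = sin, g(x) = x^2, the point x = 1, and the helper name `factored_quotient`. It also assumes k is not zero, since the first factor divides by k.

```python
import math

def f(u):
    return math.sin(u)

def g(x):
    return x * x

def factored_quotient(x, h):
    # Split the difference quotient of f(g(x)) into the two factors used above
    u = g(x)
    k = g(x + h) - g(x)
    if k == 0:
        raise ValueError("k = 0: this factorization divides by zero")
    outer = (f(u + k) - f(u)) / k   # tends to df/du as k -> 0
    inner = k / h                   # tends to dg/dx as h -> 0
    return outer, inner

x = 1.0
for h in (1e-2, 1e-4, 1e-6):
    outer, inner = factored_quotient(x, h)
    print(h, outer * inner)

# Value predicted by the chain rule for this f and g: cos(x^2) * 2x
print("f'(g(x)) * g'(x) =", math.cos(g(x)) * 2 * x)
```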

6. Aug 9, 2004

### sal

If you want help with the proof, just see the other posts in the thread -- I've nothing to add there.

But perhaps you're actually asking something else. Are you trying to understand why the chain rule is true?

The truth of it should be obvious, and if it's not, then it hasn't been explained well enough, or you haven't meditated on it long enough.

First, in plain English: If Harry runs four times as fast as Sally, and Sally is running at three miles per hour, how fast is Harry running? Answer: twelve miles per hour. Picture it, doodle it on paper, try to act it out, think about it until it's obvious -- this scenario should be clear; it is not deep.

And then write it out with the speeds shown as derivatives. The rate of change in location as a function of time is speed. Let's say H(t) = Harry's location as a function of time, S(t) = Sally's location as a function of time. Harry's speed is then the derivative of his location with respect to time, which is

$$\frac{dH}{dt}$$ ;

Sally's speed is the derivative of her position with respect to time, which is

$$\frac{dS}{dt}$$ .

The chain rule just says in an equation what you can figure out for yourself in the above word problem:

$$\frac {dH} {dt} = \frac {dH} {dS} \times \frac {dS} {dt}$$

It's common in elementary calculus classes to say "You can't just cancel the terms in the chain rule!" But really, if you keep in mind that all the differences represented by the d's are functions of each other (they're not independent numbers), then, for the purposes of understanding it, you can just cancel them.

The formal proofs, given elsewhere in this thread, are just needed to confirm that intuition is correct in this case.

(Hope this wasn't totally off the beam...)

7. Aug 9, 2004

### mathwonk

The chain rule tells how to compute the derivative of a composite function. The formula says that the derivative of a composite is the product of the derivatives of its components. Why is it true? How does one prove it?

1) philosophical: The derivative of a function, is a linear function which approximates the original function. Now it seems plausible that the best linear approximation to a composite function, is obtained by composing the best approximations to the component functions. On the other hand for a linear function, composing means simply multiplying. (Please see my quote from Dieudonne in another recent thread started by Singleton.)

QED.

2) an example: let g(x) = ax + bx^2, and f(x) = cx + dx^3. The derivative of g at 0 is a, (the coefficient of the linear term), and the derivative of f at 0 is c, (also the coefficient of the linear term). Now compose f and g and notice that the coefficient of the linear term is the product of the coefficients of the linear terms of f and g. QED.
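For anyone who wants to see the coefficients come out, here is a small sketch that composes the two polynomials. The coefficient values a=2, b=5, c=3, d=7 and the helper names `poly_mul` and `poly_compose` are made up for illustration; polynomials are stored as coefficient lists with index equal to the power of x.

```python
def poly_mul(p, q):
    # Multiply two polynomials given as coefficient lists (index = power)
    out = [0] * (len(p) + len(q) - 1)
    for i, pi in enumerate(p):
        for j, qj in enumerate(q):
            out[i + j] += pi * qj
    return out

def poly_compose(outer, inner):
    # Horner's scheme: build outer(inner(x)) one coefficient at a time
    result = [0]
    for coeff in reversed(outer):
        result = poly_mul(result, inner)
        result[0] += coeff
    return result

a, b, c, d = 2, 5, 3, 7
g = [0, a, b]        # g(x) = a x + b x^2
f = [0, c, 0, d]     # f(x) = c x + d x^3
fg = poly_compose(f, g)

# The linear coefficient of f(g(x)) should equal a * c
print(fg[1], a * c)
```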

3) a proof: To show the derivative dy/dx = (dy/dz) (dz/dx) means to show the limit of (Δy/Δx) is the same as the product of the limits of (Δy/Δz) and (Δz/Δx), as Δx goes to zero but never equals zero.

Of course by the product rule for limits, this follows immediately from the expression

(Δy/Δx) = (Δy/Δz) (Δz/Δx), (i.e. by cancelling),

except in the special case that for infinitely many small non-zero values of Δx, one may have that Δz = 0. Then the product rule does not apply.

To deal with this special case many tricks are employed. One is to go to the higher dimensional point of view that a derivative is a linear approximation, and reason as some respondents do above.

Another easier argument in my opinion is simply to observe that this special case is actually easier than the general case. I.e. suppose that there exist arbitrarily small values of Δx such that the corresponding difference Δz is zero. It is then clear that dz/dx = 0, since it is a limit of zeroes.

Thus to prove the formula that dy/dx = (dy/dz) (dz/dx), now that we know the right hand side is zero, one need only show that dy/dx = 0 in this case, i.e. that the left hand side is also zero.

Now a function whose VALUES are zero, obviously has limit zero, so the only difficulty is to deal with those values which are non zero.

I.e. to show that h(x) = Δy/Δx approaches zero as Δx approaches 0, it suffices to show that those values of h(x) which are not actually EQUAL to zero, approach zero.

But in this situation, when Δz = 0 then by definition Δy = 0 also, hence also in this case h(x) = Δy/Δx = 0.

Thus to show h(x) = Δy/Δx approaches zero, we can ignore those points where it has the VALUE zero, i.e. those points where Δy = 0, hence we can also ignore the points where Δz = 0. But at the other points, where Δz is NOT zero, the product rule for limits does apply.

History: This argument was once standard in very old books such as that of Pierpont, around 1897?, but in about 1910 GH Hardy wrote his great book, Pure Mathematics, and seems to have either missed this point, or perhaps merely felt it important to present the linear approximation point of view.

His remark however that the classical argument is not fully rigorous, although incorrect, has been reproduced for many decades in many books. I.e. most authors give the false impression that the classical product argument is actually wrong, rather than that it only needs a few extra words of comment.

There did exist one modern book, now out of print where the argument I have sketched above is given as an exercise.

Remark: To Halls of Ivy, are you sure your argument does not fall prey to the fallacy of letting k = g(x+h) - g(x) = 0 in the special case? I say this with the greatest diffidence since your posts have impressed me very much. No doubt you have a way around this, but it seems obscure in your presentation.

8. Aug 9, 2004

### HallsofIvy

First, the "philosophical" statement, that "it is reasonable that", is clearly not a proof in the mathematical sense.
Second, an example is NOT a proof! If your teacher asks for a proof and you give an example, expect to get a 0 (unless he/she is in a particularly good mood, in which case you might get 1 out of 10 points).

As for your point about k= g(x+h)-g(x)= 0: good point. Of course, that can only happen for all h in the vicinity of 0 (which is all that matters) if g is a constant.
In that case, f(g(x)) is also a constant and has derivative 0. It is still the case that df(g(x))/dx= (df/dg)(dg/dx) since that becomes 0= (df/dg)*0 but you are right- it has to be given as a special case.

9. Aug 9, 2004

### mathwonk

Thank you Halls of Ivy, for clearing up the confusion I may have left as to what is a proof and what is not. Indeed Corneo explicitly asked for a proof, and yet I offered first a plausibility argument, and second an example, and only third a proof.

I confess that to me sometimes a plausibility argument or an example is even more convincing psychologically than a proof. Although I did present it as an example, the method used in my example 2 is actually a completely general proof, at least in the case of "analytic functions", i.e. all functions that can be expressed as power series. I am sure you see what I mean (just substitute one series into the other, expand, and pick off the linear term). So it works in particular for all polynomials, and for all trig functions as well as all logarithm and exponential functions.

Please forgive me Halls of Ivy if I am wrong, but I think you may be missing something as to the special case we were discussing in your proof. That is, should one not consider also the case of a function like g(x) = (x^2)sin(1/x) for x not zero, and g(0) = 0?

I.e. here the function is not identically zero on any nbhd of x=0 but is zero infinitely many times on every nbhd of zero. This function is differentiable at x=0 with derivative zero.

Such functions, which are not constant, yet have the same value infinitely many times in each nbhd of a point, are rare in elementary texts but they nonetheless do exist in the family of differentiable functions. Of course they are also not analytic functions, so my example 2. type argument would also fail for them.
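A short numerical sketch of this example, for illustration only (the sample points are chosen here, and floating point gives values that are zero only up to rounding): at h = 1/(n pi) the difference k = g(h) - g(0) vanishes, while the difference quotient g(h)/h is bounded by |h| and so tends to zero.

```python
import math

def g(x):
    # g(x) = x^2 sin(1/x) for x != 0, and g(0) = 0
    return x * x * math.sin(1.0 / x) if x != 0 else 0.0

# Zeros of g accumulate at 0: k = g(h) - g(0) is zero (up to rounding)
for n in (1, 10, 100, 1000):
    h = 1.0 / (n * math.pi)
    print(n, h, g(h) - g(0.0))

# The difference quotient at 0 is h*sin(1/h), bounded by |h|, so g'(0) = 0
for h in (1e-1, 1e-3, 1e-5):
    print(h, g(h) / h)
```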

In fact such exotic functions are the only counter examples (other than constants) to your argument. So perhaps it would be wisest simply to ignore them. But mathematicians are notoriously compulsive about special cases.

What do you think of this?

To Corneo: The common, mathematically correct argument by Galileo, is the one in the book of GH Hardy, and most books since, that uses the idea that a derivative is a linear approximation to a given function. It seems strange to us because we have not learned the derivative as a linear function in elementary calculus but as a number.

Let me try to give the general, higher dimensional version of this argument, a proof for the chain rule that actually works in all dimensions, and even, properly stated, in infinite dimensions. You may find it even easier to follow.

Let us make the simplifying assumption that our functions all have the value zero at zero, and we are finding the derivative at zero.

The key idea is that a function is differentiable if it can be approximated very well by a linear function, i.e. can be written as a sum of a linear function plus a small function (plus a constant, but our simplifying assumption removes this constant).

Of course we must define "small". Before doing that, let me just say that a small function has the property that when composed with any differentiable function the result is also small, and the sum of two small functions is again small.

Recall that a linear function L has the property that L(x+y) = L(x) + L(y) for any x,y.

Then we can prove the chain rule in complete generality and any dimension as follows:

(We are assuming that f(0) = 0 and g(0) = 0). If f and g are differentiable, then by definition,

f = L+small, and g = M+small, where L and M are linear.

Now we want to show that the derivative of f(g) is L(M), which means we must show that f(g) can be written as L(M) plus a small function.

But since f = L+small and g = M + small, we get

f(g) = f(M+small) = L(M+small) + small(M+small). But L is linear,

so we get further that f(g) = L(M) + L(small) + small(M+small).

Now we said that a small function composed with any differentiable function is small, and the sum of two small functions is small, so we can write this as

f(g) = L(M) + small + small = L(M) + small.

But that proves what we want. I.e. since the composition f(g) is written as L(M) plus a small function, and since the composition L(M) is linear, it follows that L(M) is the derivative of f(g).

Now for h to be small means that as x approaches zero, then not only does h(x) approach zero, but also h(x)/|x| approaches zero, where |x| is some appropriate notion of length. It is then an exercise to prove the two properties of small functions that we used here. It is helpful to know that (in finite dimensions) a linear function L has the property that there exists a positive number C such that |L(x)| <= C|x| for all x.
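To make "linear + small" concrete in one dimension, here is a rough numerical sketch. The functions f(y) = 2y + y^2 and g(x) = 3x + x^3 and the helper name `small_ratio` are examples made up here; both vanish at zero, with linear parts L(y) = 2y and M(x) = 3x, so L(M(x)) = 6x. The leftover f(g(x)) - 6x divided by |x| should tend to zero.

```python
def f(y):
    return 2 * y + y ** 2     # linear part L(y) = 2y, plus a small term

def g(x):
    return 3 * x + x ** 3     # linear part M(x) = 3x, plus a small term

def small_ratio(x):
    # (f(g(x)) - L(M(x))) / |x|, which should go to 0 if the leftover is small
    leftover = f(g(x)) - 6 * x
    return leftover / abs(x)

for x in (1e-1, 1e-2, 1e-3, 1e-4):
    print(x, small_ratio(x))
```

For these particular polynomials the leftover is 9x^2 + 2x^3 + 6x^4 + x^6, so the ratio behaves like 9|x| near zero.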

If you look back at the argument of Galileo, you will see the expression

f(x+h) = f(x) + Ah + R(h), where R(h)/h approaches zero as h does.

Thus f(x) here is the constant I ignored by assuming f(0) = 0, and Ah is the linear function approximating to the difference f(x+h)-f(x), and R(h) is the small function.

I.e. if we rewrite Galileo's equation as f(x+h) -f(x) = Ah + R(h).

then with my simplifications, both x=0 and f(x) = 0, so this becomes

f(h) = Ah + R(h) = linear + small.

does this help?

(I learned this argument from Lynn Loomis, and it occurs in his advanced calculus book, as well as in Dieudonne.)

Last edited: Aug 9, 2004

10. Aug 10, 2004

### HallsofIvy

"Galileo"?? Galileo died when Newton was 2 years old so he certainly didn't write a proof of the "chain rule". He may well have done something about tangent lines- many scientists of that time were concerned with the tangent line problem. If you have more information on that, I would like to see it.

11. Aug 10, 2004

### Muzza

I think mathwonk is referring to the person on these boards with the nickname "Galileo". See post #3 in this thread.

12. Aug 10, 2004

### Galileo

I thought it was a bad idea to use this nick when I signed up for the forums,
but I couldn't think of another.
I don't have some internet identity I use everywhere and nobody uses their real name anyway, so I didn't either.

Might be wise for me to change my nick, eh? :uhh:

13. Aug 10, 2004

### sal

Actually lots of people just use their initials, or even a real name, when posting to the Usenet groups. This forum is a great deal friendlier than Usenet -- it's not apparent to me that total anonymity is all that important here.

Strangely, I seem to see more anonymous handles here than in Usenet, though. Odd.

14. Aug 10, 2004

### mathwonk

Galileo, the fault was mine for not citing the previous post I was referencing. In fact I was trying to answer the question in Corneo's post 4:

"Err can you translate what Galileo just did? The went right over my head."

As to Galileo, the historical one, I have only read his "Two new sciences", and noticed nothing about the chain rule as such, although the second science which he treats, that of motion, certainly involves the chain rule conceptually as pointed out in Sal's post 6.

Somewhat off topic, I have wondered why Galileo was only then making the discovery that a falling body, whose speed increased linearly with time, falls a distance which is quadratic in the time.

I.e. this amounts in our formulation to integrating a linear function to obtain a quadratic one. On the other hand Archimedes, over a thousand years earlier, had already solved the problem of the volume of a sphere and the area under a parabola, which are equivalent to the significantly harder problem of integrating a quadratic and obtaining a cubic.

Of course they apparently did not have efficient ways even to manipulate real numbers still in Galileo's time, since he represents a real number as the ancients did, by displaying two finite length segments, and considering their ratio. His subsequent reasoning is all geometry.

Perhaps Galileo's contribution was in realizing one could render the problem of motion into one of geometry, even though Archimedes had solved more difficult geometry problems before. Or perhaps Archimedes' work was lost at the time of Galileo? Does anyone know?

15. Feb 6, 2010

### ronray2

why on earth are you folks replying to someone who just wants you to do their homework for them? especially the "rigour = over my head" post. translation: "i am cool. cool people don't use words beyond 5th grade vocabulary. you are all nerds. you are good for nothing other than doing my homework for me. do it now."

personally i've more self-respect than that. suit yourselves

16. Feb 6, 2010

### arildno

Why on earth do people like you dig up 6 year old threads?

Suit yourself..