Here is the traditional proof of the chain rule (Stewart's first proof, made correct):
We have a composite function y(u(x)), and assume both component functions y(u) and u(x) are differentiable, and we claim that also y(u(x)) is differentiable, and that its derivative equals y’(u).u’(x) = y’(u(x)).u’(x).
All we have to do is show the limit of (∆y/∆x) as ∆x -->0, equals y’(u).u’(x).
We are assuming that (∆u/∆x)-->u’(x) as ∆x -->0, and also that (∆y/∆u)-->y’(u) as ∆u -->0.
Of course one needs to know the meaning of a limit. I.e. (∆y/∆u)-->y’(u) as ∆u -->0, means that the fraction (∆y/∆u) gets really close to the number y’(u) as long as ∆u is really small but not zero, and the same for the other limit.
Unfortunately since the function u(x) is not assumed to be “one to one”, it can happen that two different values of x give the same value of u, and then we would have ∆u = 0 even though ∆x ≠ 0.
So if we try to break up the fraction ∆y/∆x into a product (∆y/∆u)(∆u/∆x) and use the product rule for limits, we have a problem, since this product may not really equal ∆y/∆x for every ∆x that is really small but not zero. I.e. if there is a small non zero ∆x such that ∆u = 0, then ∆y/∆x does not equal (∆y/∆u)(∆u/∆x), since the fraction (∆y/∆u) does not make sense.
Now we get to start from as small a ∆x as we want in this limit, so if there is ever a ∆x so small that ∆u ≠ 0 for that ∆x and also for all smaller ∆x, there is no problem. So the only case where we have not proved the chain rule is when there is a sequence of ∆x’s approaching zero, and for all of them we still have ∆u = 0.
Now in that case, it follows that the fraction ∆u/∆x equals zero for all those ∆x’s, and since this fraction has a limit, the only possible limit is zero. I.e. in the only case where the proof does not work, we know that u’(x) = 0. Thus for the theorem to hold in that case, we only need to prove that the derivative of y(u(x)) with respect to x equals y’(u).u’(x) = y’(u).0 = 0. I.e. all we have to do is prove that in this case the fraction ∆y/∆x still approaches zero even though we cannot always factor it into a product of fractions.
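To see that this bad case really occurs, here is a small numerical sketch. The function u below is a hypothetical example chosen only for illustration: u(x) = x^2 · tri(1/x) with u(0) = 0, where tri(t) is the distance from t to the nearest integer. The names tri, u and du_over_dx are made up here. At the base point x = 0, every ∆x of the form 2^(-k) gives ∆u = 0 exactly, and the quotient ∆u/∆x is small for the other ∆x’s as well, consistent with u’(0) = 0.

```python
import math

def tri(t):
    # Distance from t to the nearest integer; exactly 0.0 when t is an integer.
    return abs(t - math.floor(t + 0.5))

def u(x):
    # Hypothetical inner function with u(0) = 0 and |u(x)| <= x**2 / 2.
    return 0.0 if x == 0 else x * x * tri(1.0 / x)

def du_over_dx(dx):
    # Difference quotient of u at the base point x = 0.
    return (u(dx) - u(0.0)) / dx

# Along dx = 2**-k the reciprocal 1/dx is an exact integer, so du is exactly 0.
bad_steps = [2.0 ** -k for k in range(1, 30)]
bad_quotients = [du_over_dx(dx) for dx in bad_steps]

# Along these other steps du is nonzero, yet |du/dx| is still at most dx.
other_steps = [1.0 / (2 ** k + 0.5) for k in range(1, 30)]
other_quotients = [du_over_dx(dx) for dx in other_steps]

for dx, q in zip(bad_steps[:5], bad_quotients[:5]):
    print(f"dx = {dx:.5f}   du/dx = {q}")
```

Since the steps in bad_steps shrink to zero and ∆u vanishes on every one of them, no choice of "small enough" ∆x avoids the division by zero in ∆y/∆u.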
The secret is to notice that we can still factor it as ∆y/∆x = (∆y/∆u)(∆u/∆x), as long as ∆u ≠ 0. I.e. there are two kinds of ∆x’s, those for which ∆u = 0, and those for which ∆u ≠ 0. But when ∆u = 0 we do not need to factor it, i.e. it is trivial then that the fraction ∆y/∆x = 0, since the top is the difference of the values of y at the same two values of u, so of course it equals zero. I.e. ∆u = 0 means the two values of u are the same, so y has the same value at both of them, so ∆y = 0, hence also ∆y/∆x = 0.
And in the case where ∆u ≠ 0, we can still factor the fraction as ∆y/∆x = (∆y/∆u)(∆u/∆x), and use the product argument. I.e. as long as ∆x is really small, if ∆u ≠ 0, then the fraction ∆y/∆x = (∆y/∆u)(∆u/∆x). And since u’(x) = 0 in this case, (∆u/∆x) is a small number, and (∆y/∆u) is close to the finite number y’(u) (because u is continuous, so ∆u is small whenever ∆x is), so the product (∆y/∆u)(∆u/∆x) is a small number.
And in the case where ∆u = 0, things are actually even better. I.e. although we cannot factor the fraction, it does not matter because then ∆u= 0 implies also ∆y = 0, so the fraction ∆y/∆x is as close to zero as it can get, since it equals zero.
Thus in the “bad” case where ∆x is small and non zero, but ∆u = 0, the chain rule still holds because both sides of the equation equal zero.
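The two-case argument can be checked numerically. What follows is only a sketch under stated assumptions: the inner function u(x) = x^2 · tri(1/x) with u(0) = 0 (tri(t) being the distance from t to the nearest integer) and the outer function y(u) = sin(u) are hypothetical choices made for illustration, and the names tri, u, y and dy_over_dx are invented here. The base point is x = 0, where u’(0) = 0 and y’(0) = 1, so the chain rule predicts the derivative 0.

```python
import math

def tri(t):
    # Distance from t to the nearest integer; exactly 0.0 when t is an integer.
    return abs(t - math.floor(t + 0.5))

def u(x):
    # Hypothetical inner function: u(0) = 0, |u(x)| <= x**2 / 2, so u'(0) = 0.
    return 0.0 if x == 0 else x * x * tri(1.0 / x)

def y(v):
    # Hypothetical outer function, differentiable at u(0) = 0 with y'(0) = 1.
    return math.sin(v)

def dy_over_dx(dx):
    # Difference quotient of y(u(x)) at x = 0, computed by the two cases.
    du = u(dx) - u(0.0)
    dy = y(u(dx)) - y(u(0.0))
    if du == 0.0:
        # Cannot factor, but dy is 0 as well, so the quotient is 0 outright.
        return dy / dx
    # Here du != 0, so factoring as (dy/du) * (du/dx) is legitimate.
    return (dy / du) * (du / dx)

bad_steps = [2.0 ** -k for k in range(1, 25)]                # du == 0
other_steps = [1.0 / (2 ** k + 0.5) for k in range(1, 25)]   # du != 0

for dx in (bad_steps[9], other_steps[9]):
    print(f"dx = {dx:.3e}   dy/dx = {dy_over_dx(dx):.3e}")
```

In both kinds of step the quotient is bounded in size by ∆x, so it tends to 0 as ∆x does, which is what the chain rule requires here.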
Thus Stewart’s second proof is unnecessary. It works because he has managed to take the denominators out of the argument. But he has also managed to make the argument less understandable.
This result was traditionally proved correctly in turn-of-the-century English-language books, such as Pierpont's Theory of functions of a real variable, and in 19th century European books such as that of Tannery [see the article by Carslaw, in vol XXIX of B.A.M.S.], but unfortunately not in the first three editions of the influential book Pure Mathematics, by G. H. Hardy. Although Hardy reinstated the classical proof in later editions, modern books usually deal with the problem by giving the slightly more sophisticated linear approximation proof, or by making what to me are somewhat artificial constructions.
Summary:
The point is simply that in proving a function has limit L, one only needs
to prove it at points where the function does not already have value L.
Thus to someone who says that the usual argument for the chain rule for
y(u(x)), does not work for x's where ∆u = 0, one can simply reply that
these points are irrelevant.
Assume f is differentiable at g(a), g is differentiable at a, and in every
neighborhood of a there are points x ≠ a where g(x) = g(a). We claim the
derivative of f(g(x)) at a equals f'(g(a))(g'(a)).
Proof:
1) Clearly under these hypotheses, g'(a) = 0.
Consequently,
2) the chain rule holds at a if and only if lim ∆f/∆x = 0 as x approaches a.
3) Note that ∆f = ∆f/∆x = 0 at all x ≠ a such that g(x) = g(a).
4) In general, to prove that lim h(x) = L, as x approaches a, it suffices
to prove it for the restriction of h to those x such that h(x) ≠ L, since
at the remaining x we have |h(x) - L| = 0, which is smaller than any
positive tolerance.
5) Thus in arguing that ∆f/∆x approaches 0, we may restrict to x such that
g(x) ≠ g(a), where the usual argument applies.