Thank you, Halls of Ivy, for clearing up the confusion I may have left as to what is a proof and what is not. Indeed, Corneo explicitly asked for a proof, and yet I offered first a plausibility argument, second an example, and only third a proof.
I confess that to me a plausibility argument or an example is sometimes even more convincing psychologically than a proof. Although I did present it as an example, the method used in my example 2 is actually a completely general proof, at least in the case of "analytic functions", i.e. all functions that can be expressed as power series. I am sure you see what I mean (just substitute one series into the other, expand, and pick off the linear term). So it works in particular for all polynomials, and for all trig functions as well as all logarithm and exponential functions.
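If you want to see this substitution carried out mechanically, here is a small check using Python's sympy library; the particular f and g are my own illustrative choices, not anything from the original discussion.

    from sympy import symbols, sin, exp

    x = symbols('x')
    f = exp(x) - 1        # f(0) = 0, f'(0) = 1
    g = sin(2*x)          # g(0) = 0, g'(0) = 2
    composite = f.subs(x, g)          # substitute one series into the other
    print(composite.series(x, 0, 2))  # linear term is 2*x, i.e. f'(0)*g'(0)*x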
Please forgive me, Halls of Ivy, if I am wrong, but I think your proof may be missing something in the special case we were discussing. That is, should one not also consider the case of a function like g(x) = (x^2)sin(1/x) for x not zero, and g(0) = 0?
I.e. here the function is not identically zero on any nbhd of x=0, but it is zero infinitely many times on every nbhd of zero (namely at x = 1/(k pi) for every integer k). This function is differentiable at x=0 with derivative zero, since the difference quotient g(h)/h = h sin(1/h) is squeezed to zero by |h|.
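A quick numerical illustration in Python, just tabulating that difference quotient (nothing here is specific to the original discussion):

    import math

    # difference quotient of g at 0 is g(h)/h = h*sin(1/h), squeezed to 0 by |h|
    for n in range(1, 6):
        h = 10**(-n)
        print(h, (h**2 * math.sin(1/h)) / h)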
Such functions, which are not constant, yet take the same value infinitely many times in each nbhd of a point, are rare in elementary texts, but they nonetheless do exist in the family of differentiable functions. Of course they are also not analytic, so my example 2 type argument would fail for them as well.
In fact such exotic functions are the only counterexamples (other than constants) to your argument. So perhaps it would be wisest simply to ignore them. But mathematicians are notoriously compulsive about special cases.
What do you think of this?
To Corneo: The common, mathematically correct argument given by Galileo is the one in the book of GH Hardy, and most books since, that uses the idea that a derivative is a linear approximation to a given function. It seems strange to us because in elementary calculus we learn the derivative not as a linear function but as a number.
Let me try to give the general, higher dimensional version of this argument, a proof of the chain rule that actually works in all dimensions, and even, properly stated, in infinite dimensions. You may find it even easier to follow.
Let us make the simplifying assumption that our functions all have the value zero at zero, and we are finding the derivative at zero.
The key idea is that a function is differentiable if it can be approximated very well by a linear function, i.e. can be written as the sum of a linear function and a small function (plus a constant, but our simplifying assumption removes this constant).
Of course we must define "small". Before doing that, let me just say that a small function has the property that when composed with any differentiable function the result is also small, and the sum of two small functions is again small.
Recall that a linear function L has the property that L(x+y) = L(x) + L(y) for any x,y.
Then we can prove the chain rule in complete generality and any dimension as follows:
(We are assuming that f(0) = 0 and g(0) = 0.) If f and g are differentiable, then by definition,
f = L+small, and g = M+small, where L and M are linear.
Now we want to show that the derivative of f(g) is L(M), which means we must show that f(g) can be written as L(M) plus a small function.
But since f = L+small and g = M + small, we get
f(g) = f(M+small) = L(M+small) + small(M+small). But L is linear,
so we get further that f(g) = L(M) + L(small) + small(M+small).
Now we said that a small function composed with any differentiable function is small, and the sum of two small functions is small, so we can write this as
f(g) = L(M) + small + small = L(M) + small.
But that proves what we want. I.e. since the composition f(g) is written as L(M) plus a small function, and since the composition L(M) is linear, it follows that L(M) is the derivative of f(g).
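To make this concrete, here is a small numerical sketch in Python; the particular f and g below are my own illustrative choices, with linear parts L(u) = u and M(x) = 2x, so L(M(x)) = 2x.

    def f(u): return u + u**2      # f = L + small, with L(u) = u
    def g(x): return 2*x + x**3    # g = M + small, with M(x) = 2*x

    # f(g) should be L(M) + small, i.e. (f(g(h)) - 2*h)/h should approach 0
    for n in range(1, 6):
        h = 10**(-n)
        print(h, (f(g(h)) - 2*h) / h)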
Now for h to be small means that as x approaches zero, then not only does h(x) approach zero, but also |h(x)|/|x| approaches zero, where |x| is some appropriate notion of length. It is then an exercise to prove the two properties of small functions that we used here. It is helpful to know that (in finite dimensions) a linear function L has the property that there exists a positive number C such that |L(x)| <= C|x| for all x.
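As a sanity check on those two properties (not a proof, just one concrete instance of my own choosing), one can tabulate:

    import math

    def small(x): return x**2            # small: |small(x)|/|x| = |x| -> 0
    def d(x): return math.sin(3*x)       # differentiable, with d(0) = 0

    for n in range(1, 6):
        x = 10**(-n)
        # composing with a differentiable function stays small, and so does a sum
        print(x, small(d(x)) / x, (small(x) + small(x)) / x)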
If you look back at the argument of Galileo, you will see the expression
f(x+h) = f(x) + Ah + R(h), where R(h)/h approaches zero as h does.
Thus f(x) here is the constant I ignored by assuming f(0) = 0, and Ah is the linear function approximating to the difference f(x+h)-f(x), and R(h) is the small function.
I.e. if we rewrite Galileo's equation as f(x+h) - f(x) = Ah + R(h),
then with my simplifications, both x = 0 and f(x) = 0, so this becomes
f(h) = Ah + R(h) = linear + small.
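Again a quick numerical illustration in Python, with f = sin at x = 0, so A = cos(0) = 1 (my choice of example, for concreteness):

    import math

    # R(h) = f(h) - A*h = sin(h) - h; the table shows R(h)/h approaching 0
    for n in range(1, 6):
        h = 10**(-n)
        print(h, (math.sin(h) - h) / h)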
Does this help?
(I learned this argument from Lynn Loomis, and it occurs in his advanced calculus book, as well as in Dieudonne.)