# Chain Rule

1. Aug 2, 2011

### dalcde

Is there an elegant and simple proof of the Chain Rule? Every proof I've found is complex and mind-boggling

3. Aug 3, 2011

### I like Serena

Hi dalcde!

I like this one.

Say y is a function of x, and x is a function of t.
Then y=y(x(t)) and:
$$\frac {dy}{dt} = \frac {dy}{dx} \frac {dx}{dt}$$

That's it. We're done!
The equality follows algebraically.

This is not a proof in the delta-epsilon school of thinking, but with the definition of infinitesimals it amounts to the same thing.
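For what it's worth, the identity is easy to check symbolically for concrete functions. A minimal SymPy sketch; the choices y(x) = sin(x) and x(t) = t² are purely illustrative:

```python
import sympy as sp

# Illustrative choices: y(x) = sin(x), x(t) = t**2.
t, u = sp.symbols('t u')
x = t**2                 # x as a function of t
y = sp.sin(x)            # y = y(x(t))

# Direct derivative dy/dt of the composite.
direct = sp.diff(y, t)

# Chain rule: dy/dx evaluated at x(t), times dx/dt.
dy_dx = sp.diff(sp.sin(u), u).subs(u, x)   # cos(x(t))
dx_dt = sp.diff(x, t)                      # 2*t
chained = dy_dx * dx_dt

# The two expressions agree identically.
assert sp.simplify(direct - chained) == 0
```

Of course this only verifies the identity for one pair of functions; it's a sanity check, not a proof.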

4. Aug 3, 2011

### HallsofIvy

Right. You swept all the dirt under the "definition of infinitesimals" carpet!

5. Aug 3, 2011

We can also sweep it under another carpet, but I think that will make a lot of (perhaps all) mathematicians cringe.

We know that

$$\frac{dy}{dt} = \frac{dy}{dt}$$

Now

$$\frac{dy}{dt} = \frac{dy}{dx} \frac{dx}{dt}$$

Because we can cancel the two dx's against each other...

$$\frac{dy}{dt} = \frac{dy}{ \rlap{///}(dx) } \frac{ \rlap{///}(dx) }{dt} = \frac{dy}{dt}$$

EDIT: Why doesn't PF have the cancel LaTeX package?

6. Aug 3, 2011

### I like Serena

I like to think of it as intuitive shorthand notation.

Seriously, do you know of an example where a proof based on infinitesimals may be wrong?

7. Aug 3, 2011

### lugita15

In single variable calculus things usually work out if you just assume infinitesimals work like ordinary numbers. But you have to be more careful in multivariable calculus. For instance let's say you wanted to find the infinitesimal area element in polar coordinates, which is
$dA=dx\,dy=d(r\cos\theta)\,d(r\sin\theta)=(\cos\theta\,dr-r\sin\theta\,d\theta)(\sin\theta\,dr+r\cos\theta\,d\theta)$. If you treat $dr$ and $d\theta$ like ordinary numbers, you will get $dA=\frac{1}{2}\sin 2\theta\,(dr^{2}-(r\,d\theta)^{2})+\cos 2\theta\,r\,dr\,d\theta$, which is completely wrong. It's only when you remember the fact that $dr\,d\theta=-d\theta\,dr$ for differential forms (which I've always found really strange) that you get the right answer $dA=r\,dr\,d\theta$.
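The correct answer can also be checked mechanically: the ratio between $dx\,dy$ and $dr\,d\theta$ is the Jacobian determinant of the coordinate transformation. A quick SymPy sketch (nothing here beyond the standard polar transformation):

```python
import sympy as sp

r, theta = sp.symbols('r theta', positive=True)
x = r * sp.cos(theta)
y = r * sp.sin(theta)

# Jacobian matrix of (x, y) with respect to (r, theta).
J = sp.Matrix([x, y]).jacobian([r, theta])
detJ = sp.simplify(J.det())

# The determinant is r, so dA = r dr dtheta.
assert sp.simplify(detJ - r) == 0
```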

8. Aug 3, 2011

### HallsofIvy

No, I have no problem with a proof based on infinitesimals- it is simply that to use infinitesimals you have to first rigorously define "infinitesimals"- and that requires some very deep logical steps.

9. Aug 3, 2011

### I like Serena

I'm afraid you're defining two different versions of dA here.
The first expression defines dxdy exactly, but it is not very useful, because your actual coordinates are still x and y.

Your second expression for dA is the surface element as it is in polar coordinates, but it has a different surface area.
I think the ratio between the two is the Jacobian determinant.
Basically your second expression shows how the Jacobian determinant can be calculated in a very intuitive and simple manner (another score for infinitesimals!).
You should be able to find your minus sign somewhere in the Jacobian determinant.

10. Aug 3, 2011

### dalcde

Thanks! You have helped me a lot. The canceling stuff can be used in my case because I'm doing it with nonstandard calculus!

But is the canceling proof technically correct even in nonstandard calculus?

Last edited: Aug 3, 2011
11. Aug 3, 2011

### Fredrik

Staff Emeritus
I haven't looked at the details of this argument and your counterargument, but if you need an example of when cancellation gives you the wrong results, how about this version of the chain rule? $$\frac{\partial f}{\partial u}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial u}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial u}$$
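This version of the chain rule can be checked symbolically for concrete example functions. The choices of f, x, y below are mine, purely for illustration:

```python
import sympy as sp

u, v, a, b = sp.symbols('u v a b')

# Illustrative choices: x(u, v) = u + v, y(u, v) = u*v, f(a, b) = a**2 * sin(b).
x = u + v
y = u * v
f = a**2 * sp.sin(b)

# Left side: partial derivative of the composite with respect to u.
lhs = sp.diff(f.subs({a: x, b: y}), u)

# Right side: f_x * x_u + f_y * y_u, with the partials of f evaluated at (x, y).
rhs = (sp.diff(f, a).subs({a: x, b: y}) * sp.diff(x, u)
       + sp.diff(f, b).subs({a: x, b: y}) * sp.diff(y, u))

assert sp.simplify(lhs - rhs) == 0
```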

12. Aug 3, 2011

### Fredrik

Staff Emeritus
If we're going to suggest non-rigorous arguments instead of proofs, I suggest the following instead of that stuff about cancellations. Let's use the notation $O(h)$ to mean "anything that goes to zero at least as fast as h" and $O(h^2)$ to mean "anything that goes to zero at least as fast as $h^2$".* A rigorous argument would have to explain exactly what that means, and prove at every step that the definition is satisfied. This argument is non-rigorous because those details are ignored.
$$f(g(x+h)) =f\big(g(x)+hg'(x)+O(h^2)\big) =f(g(x))+\big(hg'(x)+O(h^2)\big)f'(g(x))+O(h^2)$$ $$\frac{f(g(x+h))-f(g(x))}{h} = \frac{\big(hg'(x)+O(h^2)\big)f'(g(x)) +O(h^2)}{h} =f'(g(x))g'(x)+f'(g(x))O(h)+O(h),$$ which goes to $f'(g(x))g'(x)$ as $h\to 0$. Weird. Some of the primes look really small. You will have to look closely or zoom to see them.

*) Note that $O(h)$ doesn't have to represent the same thing in every place it's used. For example, we have $O(h)+h+h^2=O(h)$, even though $O(h)$ obviously doesn't represent the same thing on both sides. Similar comments apply to $O(h^2)$.
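The O(h) argument also predicts how fast the difference quotient of the composite converges: the error should shrink roughly linearly in h. A small numerical sketch; f = exp and g = sin are arbitrary example choices:

```python
import math

f, fp = math.exp, math.exp      # f and its derivative (exp is its own derivative)
g, gp = math.sin, math.cos      # g and its derivative
x = 0.7

exact = fp(g(x)) * gp(x)        # chain-rule value f'(g(x)) g'(x)

for h in (1e-2, 1e-4, 1e-6):
    quotient = (f(g(x + h)) - f(g(x))) / h
    # The error is O(h): it shrinks by about 100x for each 100x decrease in h.
    print(h, abs(quotient - exact))
```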

By the way, one of the problems with the dx, dy arguments is that even when you get the right result, it doesn't tell you at what point in the domain to evaluate the function. For example, $$\frac{dy}{dx}=\frac{1}{\frac{dx}{dy}}$$ isn't the wrong result, but it's certainly less accurate than $$(f^{-1})'(x)=\frac{1}{f'(f^{-1}(x))}.$$
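For a concrete instance of that last formula, take $f=\exp$, so $f^{-1}=\log$ and $(f^{-1})'(x)=1/x$. A trivial numerical check with this example choice:

```python
import math

f_prime = math.exp     # f = exp, so f' = exp
f_inv = math.log       # f^{-1} = log

x = 3.0
direct = 1.0 / x                      # known derivative of log at x
formula = 1.0 / f_prime(f_inv(x))     # 1 / f'(f^{-1}(x))

assert abs(direct - formula) < 1e-12
```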

Last edited: Aug 4, 2011
13. Aug 4, 2011

### I like Serena

You're pulling in partial derivatives here, which are not quite infinitesimals.
Btw, I know the formula as
$$\frac{df}{du}=\frac{\partial f}{\partial x}\frac{dx}{du}+\frac{\partial f}{\partial y}\frac{dy}{du}$$
which shows the difference between partials and infinitesimals.
Basically this shows a more intuitive notation for multivariate derivatives.

Yes, but that is because it is a shorthand notation.
Note that the infinitesimal notation is also the proof, which is not the case with the functional notation.
There's nothing wrong in also using the functional notation if that clarifies something, which it does in this case.

14. Aug 4, 2011

### Fredrik

Staff Emeritus
Your notation is appropriate when the left-hand side is the derivative of the function $u\mapsto f(x(u),y(u))$. Mine is appropriate when the left-hand side is the partial derivative with respect to the first variable of the function $(u,v)\mapsto f(x(u,v),y(u,v))$.

Why are partial derivatives "not quite infinitesimals"? Note for example that $\partial f(x,y)/\partial x$, the partial derivative of f with respect to the first variable, evaluated at (x,y), is equal to the ordinary derivative of the function $x\mapsto f(x,y)$, evaluated at x. Hm, I suppose you could say that even though we can write z=f(x,y) in both cases, dz and $\partial z$ would refer to two different functions. But we're still dealing with a small change in z divided by a small change in x, in both cases.

By the way, the notation I like the best (by far) is $$(f\circ g)_{,i}(x) =f_{,j}(g(x))g_{j,i}(x).$$ (I'm using Einstein's summation convention, so there's a sum over j).

Is it? Maybe it is, but I don't think that can follow from the definition of "infinitesimal". I don't know that definition, but obviously dx and dy need to depend on each other in some way for these calculations to be valid, and I don't think that's going to be a part of the definition. You're going to need some pretty fancy definitions of dx and dy to justify interpreting dy/dx=1/(dx/dy) as a proof of the formula I posted for the derivative of an inverse function.

Last edited: Aug 4, 2011
15. Aug 4, 2011

### dalcde

I'm not familiar with Einstein's summation convention, but I think one j should be a superscript and one subscript.

16. Aug 4, 2011

### Fredrik

Staff Emeritus
Einstein's summation convention is supposed to be used in the context of differential geometry, where the vertical position of the index informs us what type of tensor we're dealing with. In this context (no tensors involved), there's no harm in putting all the indices downstairs. The convention I'm using here is really just to not write any summation sigmas, since we can remember to always sum over those indices that appear twice. So yes, you could argue that it's not exactly Einstein's summation convention, but it's a convention that isn't different enough to deserve its own name.

17. Aug 6, 2011

### I like Serena

Good point. Didn't think of that.

You're right of course, I only meant that with partials things become more complicated.
The formula doesn't simply pan out algebraically any more.

I don't know this notation (yet).
The wiki page on derivative shows a number of notations, but not this one.
What does it say?
Why is it your preferred notation?

Let's give it a try.
I'm keeping it a bit informal, referring to x and y as scalar values as well as functions.
If necessary I can make it more formal and introduce more symbols, but I only want to know if the reasoning, possibly after some extensions, is valid as a proof.

Let y be an invertible function of x, given by y(x), and let x(y) be its inverse function.
For any point x where the function y is differentiable, where the inverse function x is differentiable, and where both derivatives are non-zero, the following holds.

For any $\epsilon$ with $0 < |\epsilon|$, we can define $dy=\epsilon$ such that there is a $dx$ with $0 < |dx|$, such that the ratio $dy/dx$ is equal to $y'(x)$.

In this case the inverse ratio $dx/dy$ is equal to $x'(y)$.

Qed.

Shoot!

18. Aug 6, 2011

### Fredrik

Staff Emeritus
The LaTeX looks pretty ugly in my Firefox. I recommend that you right-click and "scale all math" to 110% to view this comfortably.

It's just a notation, so it can't be an enormous improvement over any notation that works. All I can tell you is what it means and what I like about it. If f is a function, then $f_{,i}$ denotes its partial derivative with respect to the ith variable. This is an alternative to $D_if$. I don't like the notation $\partial f/\partial x_i$ because it gives the impression that the variable symbols we're using are somehow relevant, which they're not of course. Note that $f_{,i}$ is a function and $f_{,i}(x)$ its value at x.

I write the chain rule for functions from ℝ into ℝ as $$(f\circ g)'(x)=f'(g(x))g'(x).$$ The corresponding rule for the situation when $f:\mathbb R^m\rightarrow\mathbb R$ and $g:\mathbb R^n\rightarrow\mathbb R^m$ can e.g. be written as $$\frac{\partial (f\circ g)(x)}{\partial x_i}=\sum_{j=1}^m\frac{\partial f(g(x))}{\partial g_j}\frac{\partial g_j(x)}{\partial x_i},$$ where the $g_j$ are defined by $g(x)=(g_1(x),\dots,g_m(x))$. I really don't like this notation. For example, why is the partial derivative of f with respect to the jth variable denoted by $\partial f/\partial g_j$ all of a sudden? The only answer I can think of is extremely ugly to me: Because we intend to evaluate that function at g(x).

To avoid that ugliness, we can write this version of the chain rule as $$D_i(f\circ g)(x)=\sum_{j=1}^m D_j f(g(x)) D_i g_j(x)$$ instead. There's nothing wrong with this, but it doesn't look a lot like the single-variable version in the form I like to see it. So let's use the comma notation instead, and while we're at it, let's just drop the summation sigma. This is harmless as long as we can remember that there's always a sum over each index that appears twice. $$(f\circ g)_{,i}(x)=f_{,j}(g(x))g_{j,i}(x)$$ If we interpret the indices as labeling the rows and columns of matrices, then this is the ith component of a matrix equation, so with the appropriate definitions, we can rewrite it as $$(f\circ g)'(x)=f'(g(x))g'(x),$$ which is exactly the same as the single-variable version. The reasons why I don't consider this an improvement is that we have to remember those definitions to know what the formula says, and that when we actually use the chain rule, we are going to be working with the components anyway. See e.g. post #5 here. In that post, I'm writing the index that labels the component functions upstairs (i.e. I write $g^i$ instead of $g_i$). I think that improves the readability a bit, so I should have said that that's my favorite notation.
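The implicit sum over the repeated index j is exactly what `numpy.einsum` computes, so the comma-notation formula can be sanity-checked numerically. The functions below are arbitrary examples of mine with n = m = 2:

```python
import numpy as np

# Example g: R^2 -> R^2 and scalar f: R^2 -> R, with their exact derivatives.
def g(x):
    return np.array([x[0] * x[1], x[0] + x[1]])

def Jg(x):  # g_{j,i}: row j is the component, column i the variable
    return np.array([[x[1], x[0]],
                     [1.0,  1.0]])

def grad_f(y):  # f_{,j} for f(y) = y_0**2 + sin(y_1)
    return np.array([2 * y[0], np.cos(y[1])])

x = np.array([0.3, 1.2])

# (f o g)_{,i}(x) = f_{,j}(g(x)) g_{j,i}(x), summing over the repeated j.
chain = np.einsum('j,ji->i', grad_f(g(x)), Jg(x))

# Central-difference gradient of the composite, for comparison.
def comp(x):
    y = g(x)
    return y[0]**2 + np.sin(y[1])

h = 1e-6
numeric = np.array([(comp(x + h * e) - comp(x - h * e)) / (2 * h)
                    for e in np.eye(2)])
assert np.max(np.abs(chain - numeric)) < 1e-6
```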

I don't understand what you're saying. If you meant that for each positive infinitesimal dy, there's a positive infinitesimal dx such that dy/dx=f'(x), then my questions are "what's an infinitesimal?" and "how do you know this?". What you said doesn't answer either of those questions. It also doesn't explain why dx/dy should have anything to do with the derivative of $f^{-1}$.

Last edited: Aug 6, 2011
19. Aug 6, 2011

### I like Serena

Thanks for the explanation! I understand it now.

Wouldn't g usually denote a coordinate transformation?

If that's the case I would prefer to use different symbols.
Say f(x(u)), where x and u denote vectors.

$$\frac{\partial f(x(u))}{\partial u_i}=\sum_{j=1}^n \frac{\partial f(x(u))}{\partial x_j}\frac{\partial x_j(u)}{\partial u_i}$$
or simply:
$$\frac{\partial f}{\partial u_i}=\frac{\partial f}{\partial x_j}\frac{\partial x_j}{\partial u_i}$$

I like this notation, because it shows that you take partial derivatives of f, which must be corrected by multiplying with the appropriate ratio between coordinates.
The use of the symbols x and u instead of g and x is also more intuitive, because using g suggests that g is a function like f, instead of just another set of coordinates.

20. Aug 6, 2011

### Fredrik

Staff Emeritus
But g is another function. It's not "like f" in the sense that it's not real valued, except in the special case m=1. But it's certainly a function. And in the most general case, f isn't real valued either. Suppose e.g. that $f:\mathbb R^m\rightarrow\mathbb R^k$ and $g:\mathbb R^n\rightarrow\mathbb R^m$. The chain rule satisfied by these functions is a trivial consequence of the one discussed in my previous post. We have $$(f\circ g)^i{}_{,\,j}(x)=(f^i\circ g)_{,\,j}(x) =f^i{}_{,\,k}(g(x)) g^k{}_{,\,j}(x).$$ This is by the way another reason why I like to write the chain rule in that form. It makes it trivial to derive this even more general version, and of course we can recover the one for real-valued functions of one real variable simply by setting n=m=1.
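In matrix form this general version says that the Jacobian of a composite is the product of the Jacobians, $J_{f\circ g}(x)=J_f(g(x))\,J_g(x)$. A NumPy sketch with arbitrary example functions (k = m = n = 2):

```python
import numpy as np

def g(x):                       # g: R^2 -> R^2
    return np.array([x[0]**2, x[0] + 3 * x[1]])

def Jg(x):                      # its exact Jacobian
    return np.array([[2 * x[0], 0.0],
                     [1.0,      3.0]])

def f(y):                       # f: R^2 -> R^2
    return np.array([y[0] * y[1], np.exp(y[1])])

def Jf(y):                      # its exact Jacobian
    return np.array([[y[1], y[0]],
                     [0.0,  np.exp(y[1])]])

x = np.array([0.5, -0.2])

# Chain rule as a product of Jacobian matrices.
chain = Jf(g(x)) @ Jg(x)

# Central-difference Jacobian of the composite, column by column.
h = 1e-6
numeric = np.column_stack([(f(g(x + h * e)) - f(g(x - h * e))) / (2 * h)
                           for e in np.eye(2)])
assert np.allclose(chain, numeric, atol=1e-6)
```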

I also don't like the notation $$\frac{\partial f(x(u))}{\partial u_i}$$ that you put on the left. I'm sure lots of people use it, but it looks very ugly to me. It looks like a partial derivative of f evaluated at x(u), even though it's actually a partial derivative of $f\circ x$ evaluated at u. For some reason, I have less of a problem with it when it's written in the form $$\frac{\partial}{\partial u_i}f(x(u)),$$ because when I see this expression, I find it easier to tell myself that the $u_i$ in the denominator is there to tell us both that the function we're taking a derivative of is $u_i\mapsto f(x(u))$ rather than any of the other possibilities, and that the derivative is to be evaluated at $u_i$. For example, I would interpret $$\frac{\partial}{\partial y}ax^2y^3$$ as $$(t\mapsto ax^2t^3)'(y).$$

By the way, the sum should go from 1 to m, not 1 to n. I got that wrong in my previous post, and corrected it after you had replied.

Last edited: Aug 6, 2011