# Chain Rule

Is there an elegant and simple proof of the Chain Rule? Every proof I've found is complex and mind-boggling.

I like Serena
Homework Helper
Hi dalcde! I like this one.

Say y is a function of x, and x is a function of t.
Then y=y(x(t)) and:
$$\frac {dy}{dt} = \frac {dy}{dx} \frac {dx}{dt}$$

That's it. We're done! The equality follows algebraically.

This is not a proof in the delta-epsilon school of thinking, but with the definition of infinitesimals it amounts to the same thing.
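For readers who want to see the identity in action, here is a quick numeric sanity check of $\frac{dy}{dt}=\frac{dy}{dx}\frac{dx}{dt}$. The example functions $y(x)=\sin x$ and $x(t)=t^2$ are my own choice for illustration, not from the thread.

```python
# Numeric sanity check of dy/dt = (dy/dx)(dx/dt).
# Example functions (chosen for illustration): x(t) = t^2, y(x) = sin(x).
import math

def x_of_t(t):
    return t * t

def y_of_x(x):
    return math.sin(x)

def deriv(f, a, h=1e-6):
    # symmetric difference quotient
    return (f(a + h) - f(a - h)) / (2 * h)

t0 = 0.7
lhs = deriv(lambda t: y_of_x(x_of_t(t)), t0)         # dy/dt directly
rhs = deriv(y_of_x, x_of_t(t0)) * deriv(x_of_t, t0)  # (dy/dx)(dx/dt)
print(lhs, rhs)  # the two agree to about six decimal places
```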

HallsofIvy
Homework Helper
Right. You swept all the dirt under the "definition of infinitesimals" carpet!

We can also sweep it under another carpet, but I think that will make a lot of (perhaps all) mathematicians cringe.

We know that

$$\frac{dy}{dt} = \frac{dy}{dt}$$

Now

$$\frac{dy}{dt} = \frac{dy}{dx} \frac{dx}{dt}$$

Because we can cancel the two dx's against each other...

$$\frac{dy}{dt} = \frac{dy}{ \rlap{///}(dx) } \frac{ \rlap{///}(dx) }{dt} = \frac{dy}{dt}$$

EDIT: Why doesn't PF have the \cancel LaTeX package?

I like Serena
Homework Helper
Right. You swept all the dirt under the "definition of infinitesimals" carpet!
I like to think of it as intuitive shorthand notation.

Seriously, do you know of an example where a proof based on infinitesimals may be wrong?

I like to think of it as intuitive shorthand notation.

Seriously, do you know of an example where a proof based on infinitesimals may be wrong?
In single variable calculus things usually work out if you just assume infinitesimals work like ordinary numbers. But you have to be more careful in multivariable calculus. For instance, let's say you wanted to find the infinitesimal area element in polar coordinates, which is
$dA=dx\,dy=d(r\cos\theta)\,d(r\sin\theta)=(\cos\theta\,dr-r\sin\theta\,d\theta)(\sin\theta\,dr+r\cos\theta\,d\theta)$. If you treat $dr$ and $d\theta$ like ordinary numbers, you will get $dA=\frac{1}{2}\sin 2\theta\,(dr^{2}-(r\,d\theta)^{2})+r\cos 2\theta\,dr\,d\theta$, which is completely wrong. It's only when you remember the fact that $dr\,d\theta=-d\theta\,dr$ for differential forms (which I've always found really strange) that you get the right answer $dA=r\,dr\,d\theta$.
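The sign flip mentioned here is exactly what the Jacobian determinant encodes. As a small numeric sketch (not part of the original post), one can check that the determinant of the Jacobian of $(x,y)=(r\cos\theta,\,r\sin\theta)$ equals $r$, which is where $dA=r\,dr\,d\theta$ comes from:

```python
# Numeric check that det J = r for the polar-coordinate map
#   x = r cos(theta), y = r sin(theta),
# using symmetric difference quotients for the partial derivatives.
import math

def jacobian_det(r, th, h=1e-6):
    def x(r, th):
        return r * math.cos(th)
    def y(r, th):
        return r * math.sin(th)
    dx_dr = (x(r + h, th) - x(r - h, th)) / (2 * h)
    dx_dth = (x(r, th + h) - x(r, th - h)) / (2 * h)
    dy_dr = (y(r + h, th) - y(r - h, th)) / (2 * h)
    dy_dth = (y(r, th + h) - y(r, th - h)) / (2 * h)
    return dx_dr * dy_dth - dx_dth * dy_dr

print(jacobian_det(2.0, 0.9))  # close to 2.0, i.e. equal to r
```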

HallsofIvy
Homework Helper
I like to think of it as intuitive shorthand notation.

Seriously, do you know of an example where a proof based on infinitesimals may be wrong?
No, I have no problem with a proof based on infinitesimals; it is simply that to use infinitesimals you have to first rigorously define "infinitesimals", and that requires some very deep logical steps.

I like Serena
Homework Helper
In single variable calculus things usually work out if you just assume infinitesimals work like ordinary numbers. But you have to be more careful in multivariable calculus. For instance, let's say you wanted to find the infinitesimal area element in polar coordinates, which is
$dA=dx\,dy=d(r\cos\theta)\,d(r\sin\theta)=(\cos\theta\,dr-r\sin\theta\,d\theta)(\sin\theta\,dr+r\cos\theta\,d\theta)$. If you treat $dr$ and $d\theta$ like ordinary numbers, you will get $dA=\frac{1}{2}\sin 2\theta\,(dr^{2}-(r\,d\theta)^{2})+r\cos 2\theta\,dr\,d\theta$, which is completely wrong. It's only when you remember the fact that $dr\,d\theta=-d\theta\,dr$ for differential forms (which I've always found really strange) that you get the right answer $dA=r\,dr\,d\theta$.
I'm afraid you're defining two different versions of dA here.
The first expression defines dxdy exactly, but it is not very useful, because your actual coordinates are still x and y.

Your second expression for dA is the surface element as it is in polar coordinates, but it has a different surface area.
I think the ratio between the two is the Jacobian determinant.
Basically your second expression shows how the Jacobian determinant can be calculated in a very intuitive and simple manner (another score for infinitesimals!).
You should be able to find your minus sign somewhere in the Jacobian determinant.

Thanks! You have helped me a lot. The canceling stuff can be used in my case because I'm doing it with nonstandard calculus!

But is the canceling proof technically correct even in nonstandard calculus?

Last edited:
Fredrik
Staff Emeritus
Gold Member
I'm afraid you're defining two different versions of dA here.
The first expression defines dxdy exactly, but it is not very useful, because your actual coordinates are still x and y.

Your second expression for dA is the surface element as it is in polar coordinates, but it has a different surface area.
I think the ratio between the two is the Jacobian determinant.
Basically your second expression shows how the Jacobian determinant can be calculated in a very intuitive and simple manner (another score for infinitesimals!).
You should be able to find your minus sign somewhere in the Jacobian determinant.
I haven't looked at the details of this argument and your counterargument, but if you need an example of when cancellation gives you the wrong results, how about this version of the chain rule? $$\frac{\partial f}{\partial u}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial u}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial u}$$
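To make the point concrete, here is a small numeric check of the two-variable chain rule quoted above; the functions $f(x,y)=xy$, $x(u,v)=u+v$, $y(u,v)=uv$ are examples of my own choosing, not from the thread.

```python
# Numeric check of  ∂f/∂u = (∂f/∂x)(∂x/∂u) + (∂f/∂y)(∂y/∂u)
# for the example functions f(x, y) = x*y, x(u, v) = u + v, y(u, v) = u*v.

def f(x, y):
    return x * y

def x_of(u, v):
    return u + v

def y_of(u, v):
    return u * v

def deriv(g, a, h=1e-6):
    # symmetric difference quotient
    return (g(a + h) - g(a - h)) / (2 * h)

u0, v0 = 1.3, 0.4
x0, y0 = x_of(u0, v0), y_of(u0, v0)
lhs = deriv(lambda u: f(x_of(u, v0), y_of(u, v0)), u0)
rhs = (deriv(lambda x: f(x, y0), x0) * deriv(lambda u: x_of(u, v0), u0)
       + deriv(lambda y: f(x0, y), y0) * deriv(lambda u: y_of(u, v0), u0))
print(lhs, rhs)  # both close to 2*u0*v0 + v0**2 = 1.2
```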

Fredrik
Staff Emeritus
Gold Member
If we're going to suggest non-rigorous arguments instead of proofs, I suggest the following instead of that stuff about cancellations. Let's use the notation $O(h)$ to mean "anything that goes to zero at least as fast as h" and $O(h^2)$ to mean "anything that goes to zero at least as fast as $h^2$".* A rigorous argument would have to explain exactly what that means, and prove at every step that the definition is satisfied. This argument is non-rigorous because those details are ignored.
$$f(g(x+h)) =f\big(g(x)+hg'(x)+O(h^2)\big) =f(g(x))+\big(hg'(x)+O(h^2)\big)f'(g(x))+O(h^2)$$ $$\frac{f(g(x+h))-f(g(x))}{h} = \frac{\big(hg'(x)+O(h^2)\big)f'(g(x)) +O(h^2)}{h} =f'(g(x))g'(x)+f'(g(x))O(h)+O(h)$$

*) Note that $O(h)$ doesn't have to represent the same thing in every place it's used. For example, we have $O(h)+h+h^2=O(h)$, even though $O(h)$ obviously doesn't represent the same thing on both sides. Similar comments apply to $O(h^2)$.
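The O(h) bookkeeping can also be probed numerically: for a one-sided difference quotient, the deviation from $f'(g(x))g'(x)$ should shrink roughly linearly in h. The functions $f=\sin$ and $g(x)=x^3$ are illustrative choices of mine, not from the post.

```python
# Check that the error of the one-sided difference quotient for f∘g
# is O(h): the ratio error/h stays roughly constant as h shrinks.
# Example: f = sin, g(x) = x**3.
import math

def g(x):
    return x ** 3

x0 = 0.8
exact = math.cos(g(x0)) * 3 * x0 ** 2  # f'(g(x0)) * g'(x0)

ratios = []
for h in (1e-1, 1e-2, 1e-3):
    dq = (math.sin(g(x0 + h)) - math.sin(g(x0))) / h
    ratios.append(abs(dq - exact) / h)
print(ratios)  # roughly constant, so the error really is O(h)
```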

By the way, one of the problems with the dx, dy arguments is that even when you get the right result, it doesn't tell you at what point in the domain to evaluate the function. For example, $$\frac{dy}{dx}=\frac{1}{\frac{dx}{dy}}$$ isn't the wrong result, but it's certainly less accurate than $$(f^{-1})'(x)=\frac{1}{f'(f^{-1}(x))}.$$
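A numeric illustration of the inverse-function formula, using $f=\exp$ (so $f^{-1}=\log$) as my own example, not one from the post:

```python
# Numeric check of (f^{-1})'(x) = 1 / f'(f^{-1}(x)), using f = exp,
# so f^{-1} = log and (f^{-1})'(x) = 1/x.
import math

def deriv(g, a, h=1e-6):
    # symmetric difference quotient
    return (g(a + h) - g(a - h)) / (2 * h)

x0 = 2.0
lhs = deriv(math.log, x0)                  # (f^{-1})'(x0), should be 0.5
rhs = 1.0 / deriv(math.exp, math.log(x0))  # 1 / f'(f^{-1}(x0))
print(lhs, rhs)  # both close to 0.5
```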

Last edited:
I like Serena
Homework Helper
I haven't looked at the details of this argument and your counterargument, but if you need an example of when cancellation gives you the wrong results, how about this version of the chain rule? $$\frac{\partial f}{\partial u}=\frac{\partial f}{\partial x}\frac{\partial x}{\partial u}+\frac{\partial f}{\partial y}\frac{\partial y}{\partial u}$$
You're pulling in partial derivatives here, which are not quite infinitesimals.
Btw, I know the formula as
$$\frac{df}{du}=\frac{\partial f}{\partial x}\frac{dx}{du}+\frac{\partial f}{\partial y}\frac{dy}{du}$$
which shows the difference between partials and infinitesimals.
Basically this shows a more intuitive notation for multivariate derivatives.

By the way, one of the problems with the dx, dy arguments is that even when you get the right result, it doesn't tell you at what point in the domain to evaluate the function. For example, $$\frac{dy}{dx}=\frac{1}{\frac{dx}{dy}}$$ isn't the wrong result, but it's certainly less accurate than $$(f^{-1})'(x)=\frac{1}{f'(f^{-1}(x))}.$$
Yes, but that is because it is a shorthand notation.
Note that the infinitesimal notation is also the proof, which is not the case with the functional notation. There's nothing wrong in also using the functional notation if that clarifies something, which it does in this case.

Fredrik
Staff Emeritus
Gold Member
You're pulling in partial derivatives here, which are not quite infinitesimals.
Btw, I know the formula as
$$\frac{df}{du}=\frac{\partial f}{\partial x}\frac{dx}{du}+\frac{\partial f}{\partial y}\frac{dy}{du}$$
which shows the difference between partials and infinitesimals.
Basically this shows a more intuitive notation for multivariate derivatives.
Your notation is appropriate when the left-hand side is the derivative of the function $u\mapsto f(x(u),y(u))$. Mine is appropriate when the left-hand side is the partial derivative with respect to the first variable of the function $(u,v)\mapsto f(x(u,v),y(u,v))$.

Why are partial derivatives "not quite infinitesimals"? Note for example that $\partial f(x,y)/\partial x$, the partial derivative of f with respect to the first variable, evaluated at (x,y), is equal to the ordinary derivative of the function $x\mapsto f(x,y)$, evaluated at x. Hm, I suppose you could say that even though we can write z=f(x,y) in both cases, dz and $\partial z$ would refer to two different functions. But we're still dealing with a small change in z divided by a small change in x, in both cases.

By the way, the notation I like the best (by far) is $$(f\circ g)_{,i}(x) =f_{,j}(g(x))g_{j,i}(x).$$ (I'm using Einstein's summation convention, so there's a sum over j).

Note that the infinitesimal notation is also the proof, which is not the case with the functional notation.

Is it? Maybe it is, but I don't think that can follow from the definition of "infinitesimal". I don't know that definition, but obviously dx and dy need to depend on each other in some way for these calculations to be valid, and I don't think that's going to be a part of the definition. You're going to need some pretty fancy definitions of dx and dy to justify interpreting dy/dx=1/(dx/dy) as a proof of the formula I posted for the derivative of an inverse function.

Last edited:
By the way, the notation I like the best (by far) is $$(f\circ g)_{,i}(x) =f_{,j}(g(x))g_{j,i}(x).$$ (I'm using Einstein's summation convention, so there's a sum over j).

I'm not familiar with Einstein's summation convention, but I think one j should be a superscript and one subscript.

Fredrik
Staff Emeritus
Gold Member
I'm not familiar with Einstein's summation convention, but I think one j should be a superscript and one subscript.
Einstein's summation convention is supposed to be used in the context of differential geometry, where the vertical position of the index informs us what type of tensor we're dealing with. In this context (no tensors involved), there's no harm in putting all the indices downstairs. The convention I'm using here is really just to not write any summation sigmas, since we can remember to always sum over those indices that appear twice. So yes, you could argue that it's not exactly Einstein's summation convention, but it's a convention that isn't different enough to deserve its own name.

I like Serena
Homework Helper
Your notation is appropriate when the left-hand side is the derivative of the function $u\mapsto f(x(u),y(u))$. Mine is appropriate when the left-hand side is the partial derivative with respect to the first variable of the function $(u,v)\mapsto f(x(u,v),y(u,v))$.
Good point. Didn't think of that.

Why are partial derivatives "not quite infinitesimals"? Note for example that $\partial f(x,y)/\partial x$, the partial derivative of f with respect to the first variable, evaluated at (x,y), is equal to the ordinary derivative of the function $x\mapsto f(x,y)$, evaluated at x. Hm, I suppose you could say that even though we can write z=f(x,y) in both cases, dz and $\partial z$ would refer to two different functions. But we're still dealing with a small change in z divided by a small change in x, in both cases.
You're right of course, I only meant that with partials things become more complicated.
The formula doesn't simply pan out algebraically any more.

By the way, the notation I like the best (by far) is $$(f\circ g)_{,i}(x) =f_{,j}(g(x))g_{j,i}(x).$$ (I'm using Einstein's summation convention, so there's a sum over j).
I don't know this notation (yet).
The wiki page on derivative shows a number of notations, but not this one.
What does it say?
Why is it your preferred notation?

Is it? Maybe it is, but I don't think that can follow from the definition of "infinitesimal". I don't know that definition, but obviously dx and dy need to depend on each other in some way for these calculations to be valid, and I don't think that's going to be a part of the definition. You're going to need some pretty fancy definitions of dx and dy to justify interpreting dy/dx=1/(dx/dy) as a proof of the formula I posted for the derivative of an inverse function.
Let's give it a try.
I'm keeping it a bit informal, referring to x and y as scalar values as well as functions.
If necessary I can make it more formal and introduce more symbols, but I only want to know if the reasoning, possibly after some extensions, is valid as a proof.

Let y be an invertible function of x, given by y(x), and let x(y) be its inverse function.
For any point x where the function y is differentiable, where the inverse function x is differentiable, and where both derivatives are non-zero, the following holds.

For any $\epsilon$ with $0 < |\epsilon|$, we can define $dy = \epsilon$ such that there is a $dx$ with $0 < |dx|$, such that the ratio $dy/dx$ is equal to $y'(x)$.

In this case the inverse ratio given by $dx/dy$ is equal to $x'(y)$.

Qed.

Shoot!

Fredrik
Staff Emeritus
Gold Member

I don't know this notation (yet).
The wiki page on derivative shows a number of notations, but not this one.
What does it say?
Why is it your preferred notation?
It's just a notation, so it can't be an enormous improvement over any notation that works. All I can tell you is what it means and what I like about it. If f is a function, then $f_{,i}$ denotes its partial derivative with respect to the ith variable. This is an alternative to $D_if$. I don't like the notation $\partial f/\partial x_i$ because it gives the impression that the variable symbols we're using are somehow relevant, which they're not of course. Note that $f_{,i}$ is a function and $f_{,i}(x)$ its value at x.

I write the chain rule for functions from ℝ into ℝ as $$(f\circ g)'(x)=f'(g(x))g'(x).$$ The corresponding rule for the situation when $f:\mathbb R^m\rightarrow\mathbb R$ and $g:\mathbb R^n\rightarrow\mathbb R^m$ can e.g. be written as $$\frac{\partial (f\circ g)(x)}{\partial x_i}=\sum_{j=1}^m\frac{\partial f(g(x))}{\partial g_j}\frac{\partial g_j(x)}{\partial x_i},$$ where the $g_j$ are defined by $g(x)=(g_1(x),\dots,g_m(x))$. I really don't like this notation. For example, why is the partial derivative of f with respect to the jth variable denoted by $\partial f/\partial g_j$ all of a sudden? The only answer I can think of is extremely ugly to me: Because we intend to evaluate that function at g(x).

To avoid that ugliness, we can write this version of the chain rule as $$D_i(f\circ g)(x)=\sum_{j=1}^m D_j f(g(x)) D_i g_j(x)$$ instead. There's nothing wrong with this, but it doesn't look a lot like the single-variable version in the form I like to see it. So let's use the comma notation instead, and while we're at it, let's just drop the summation sigma. This is harmless as long as we can remember that there's always a sum over each index that appears twice. $$(f\circ g)_{,i}(x)=f_{,j}(g(x))g_{j,i}(x)$$ If we interpret the indices as labeling the rows and columns of matrices, then this is the ith component of a matrix equation, so with the appropriate definitions, we can rewrite it as $$(f\circ g)'(x)=f'(g(x))g'(x),$$ which is exactly the same as the single-variable version. The reasons why I don't consider this an improvement is that we have to remember those definitions to know what the formula says, and that when we actually use the chain rule, we are going to be working with the components anyway. See e.g. post #5 here. In that post, I'm writing the index that labels the component functions upstairs (i.e. I write $g^i$ instead of $g_i$). I think that improves the readability a bit, so I should have said that that's my favorite notation.
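The matrix form of the chain rule described above can be checked numerically: the Jacobian of $f\circ g$ equals the matrix product $f'(g(x))\,g'(x)$ of Jacobians. The maps $g:\mathbb R^2\rightarrow\mathbb R^2$ and $f:\mathbb R^2\rightarrow\mathbb R$ below are examples of my own choosing, not from the post.

```python
# Numeric check of the matrix chain rule J_{f∘g}(x) = J_f(g(x)) · J_g(x)
# for example maps g(x) = (x0*x1, sin(x0)) and f(y) = y0 + y1**2.
import math

def g(x):
    return [x[0] * x[1], math.sin(x[0])]

def f(y):
    return [y[0] + y[1] ** 2]  # returned as a 1-vector so jac() is uniform

def jac(func, x, h=1e-6):
    # numeric Jacobian: rows = output components, columns = input variables
    m, n = len(func(x)), len(x)
    J = [[0.0] * n for _ in range(m)]
    for j in range(n):
        xp = list(x)
        xm = list(x)
        xp[j] += h
        xm[j] -= h
        fp, fm = func(xp), func(xm)
        for i in range(m):
            J[i][j] = (fp[i] - fm[i]) / (2 * h)
    return J

def matmul(A, B):
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

x0 = [0.5, 1.2]
lhs = jac(lambda x: f(g(x)), x0)
rhs = matmul(jac(f, g(x0)), jac(g, x0))
print(lhs)
print(rhs)  # the two 1x2 matrices agree to about six decimals
```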

Let's give it a try.
I'm keeping it a bit informal, referring to x and y as scalar values as well as functions.
If necessary I can make it more formal and introduce more symbols, but I only want to know if the reasoning, possibly after some extensions, is valid as a proof.

Let y be an invertible function of x, given by y(x), and let x(y) be its inverse function.
For any point x where the function y is differentiable, where the inverse function x is differentiable, and where both derivatives are non-zero, the following holds.

For any $\epsilon$ with $0 < |\epsilon|$, we can define $dy = \epsilon$ such that there is a $dx$ with $0 < |dx|$, such that the ratio $dy/dx$ is equal to $y'(x)$.

In this case the inverse ratio given by $dx/dy$ is equal to $x'(y)$.

Qed.

Shoot!
I don't understand what you're saying. If you meant that for each positive infinitesimal dy, there's a positive infinitesimal dx such that dy/dx=f'(x), then my questions are "what's an infinitesimal?" and "how do you know this?". What you said doesn't answer either of those questions. It also doesn't explain why dx/dy should have anything to do with the derivative of $f^{-1}$.

Last edited:
I like Serena
Homework Helper
Thanks for the explanation! I understand it now.

$$\frac{\partial (f\circ g)(x)}{\partial x_i}=\sum_{j=1}^n\frac{\partial f(g(x))}{\partial g_j}\frac{\partial g_j(x)}{\partial x_i},$$ where the $g_j$ are defined by $g(x)=(g_1(x),\dots,g_n(x))$. I really don't like this notation. For example, why is the partial derivative of f with respect to the jth variable denoted by $\partial f/\partial g_j$ all of a sudden? The only answer I can think of is extremely ugly to me: Because we intend to evaluate that function at g(x).
Wouldn't g usually denote a coordinate transformation?

If that's the case I would prefer to use different symbols.
Say f(x(u)), where x and u denote vectors.

$$\frac{\partial f(x(u))}{\partial u_i}=\sum_{j=1}^n \frac{\partial f(x(u))}{\partial x_j}\frac{\partial x_j(u)}{\partial u_i}$$
or simply:
$$\frac{\partial f}{\partial u_i}=\frac{\partial f}{\partial x_j}\frac{\partial x_j}{\partial u_i}$$

I like this notation, because it shows that you take partial derivatives of f, which must be corrected by multiplying with the appropriate ratio between coordinates.
The use of the symbols x and u instead of g and x is also more intuitive, because using g suggests that g is a function like f, instead of just another set of coordinates.

Fredrik
Staff Emeritus
Gold Member
If that's the case I would prefer to use different symbols.
Say f(x(u)), where x and u denote vectors.

$$\frac{\partial f(x(u))}{\partial u_i}=\sum_{j=1}^n \frac{\partial f(x(u))}{\partial x_j}\frac{\partial x_j(u)}{\partial u_i}$$
or simply:
$$\frac{\partial f}{\partial u_i}=\frac{\partial f}{\partial x_j}\frac{\partial x_j}{\partial u_i}$$

I like this notation, because it shows that you take partial derivatives of f, which must be corrected by multiplying with the appropriate ratio between coordinates.
The use of the symbols x and u instead of g and x is also more intuitive, because using g suggests that g is a function like f, instead of just another set of coordinates.
But g is another function. It's not "like f" in the sense that it's not real valued, except in the special case m=1. But it's certainly a function. And in the most general case, f isn't real valued either. Suppose e.g. that $f:\mathbb R^m\rightarrow\mathbb R^k$ and $g:\mathbb R^n\rightarrow\mathbb R^m$. The chain rule satisfied by these functions is a trivial consequence of the one discussed in my previous post. We have $$(f\circ g)^i{}_{,\,j}(x)=(f^i\circ g)_{,\,j}(x) =f^i{}_{,\,k}(g(x)) g^k{}_{,\,j}(x).$$ This is by the way another reason why I like to write the chain rule in that form. It makes it trivial to derive this even more general version, and of course we can recover the one for real-valued functions of one real variable simply by setting n=m=1.

I also don't like the notation $$\frac{\partial f(x(u))}{\partial u_i}$$ that you put on the left. I'm sure lots of people use it, but it looks very ugly to me. It looks like a partial derivative of f evaluated at x(u), even though it's actually a partial derivative of $f\circ x$ evaluated at u. For some reason, I have less of a problem with it when it's written in the form $$\frac{\partial}{\partial u_i}f(x(u)),$$ because when I see this expression, I find it easier to tell myself that the $u_i$ in the denominator is there to tell us both that the function we're taking a derivative of is $u_i\mapsto f(x(u))$ rather than any of the other possibilities, and that the derivative is to be evaluated at $u_i$. For example, I would interpret $$\frac{\partial}{\partial y}ax^2y^3$$ as $$(t\mapsto ax^2t^3)'(y).$$

By the way, the sum should go from 1 to m, not 1 to n. I got that wrong in my previous post, and corrected it after you had replied.

Last edited:
Fredrik
Staff Emeritus
Gold Member
If we're going to suggest non-rigorous arguments instead of proofs,...
Actually, if we're not going to do it rigorously, then we might as well use an even simpler argument. It follows immediately from the definition of the derivative that when h is small, $$f(x+h)\approx f(x)+hf'(x).$$ Let's just use this formula twice, once on g and then once on f. $$f(g(x+h))\approx f\big(g(x)+hg'(x)\big)\approx f(g(x))+hg'(x)f'(g(x))$$ This implies that \begin{align}(f\circ g)'(x) &\approx \frac{f(g(x+h))-f(g(x))}{h}\approx \frac{f(g(x))+hg'(x)f'(g(x))-f(g(x))}{h}\\ &\approx f'(g(x))g'(x).\end{align} What's missing here is of course a proof that the error in this approximation really goes to zero when h goes to zero. But this is still a good way to see that the chain rule is "likely" to be true.

Last edited:
I like Serena
Homework Helper
I hear your arguments, so I'm giving it another try, which I think includes your arguments.

Btw, right now I'm sticking with a function x instead of g, since when I write down something like this I usually mean a coordinate transformation.
When the function is not a coordinate transformation, your notation may well be the preferred one, so as to be more general and to not suggest they are coordinates.

How about:
$$\frac{\partial (f \circ x)}{\partial u_i}(u)=\sum_{j=1}^m \frac{\partial f}{\partial x_j}(x(u))\frac{\partial x_j}{\partial u_i}(u)$$
or:
$$\frac{\partial (f \circ x)}{\partial u_i}=\frac{\partial f}{\partial x_j}\frac{\partial x_j}{\partial u_i}$$

What I dislike about your form is that it is one long string of symbols with no real visual cues.
What I like about Leibniz's notation is that it shows the ratios of change, which link directly to any drawing that I might make.
(Yes, I like drawings to clarify and understand what's going on.)

I like Serena
Homework Helper
Actually, if we're not going to do it rigorously, then we might as well use an even simpler argument. It follows immediately from the definition of the derivative that when h is small, $$f(x+h)\approx f(x)+hf'(x).$$ Let's just use this formula twice, once on g and then once on f. $$f(g(x+h))\approx f\big(g(x)+hg'(x)\big)\approx f(g(x))+hg'(x)f'(g(x))$$ This implies that \begin{align}(f\circ g)'(x) &\approx \frac{f(g(x+h))-f(g(x))}{h}\approx \frac{f(g(x))+hg'(x)f'(g(x))-f(g(x))}{h}\\ &\approx f'(g(x))g'(x).\end{align} What's missing here is of course a proof that the error in this approximation really goes to zero when h goes to zero. But this is still a good way to see that the chain rule is "likely" to be true.
Yes, this works too, although I dislike the approximately-symbols.
The use of those symbols makes it specifically non-rigorous.
My "proof" is based on the graphical interpretation of ratios, from which it is immediately evident that the inverse has the ratio inversed.
There is no "approximately" involved, although it is a jump of the mind.
I'm getting the impression that I am more graphically minded, thinking in pictures, and wanting to relate the symbols I write to what I see in my head.

Fredrik
Staff Emeritus
Gold Member
Yes, this works too, although I dislike the approximately-symbols.
The use of those symbols makes it specifically non-rigorous.
That's what I like the most about it. If you make a non-rigorous argument, you need to make sure that no one will mistake it for an actual proof.

My "proof" is based on the graphical interpretation of ratios, from which it is immediately evident that the inverse has the ratio inversed.
Ah, yes, this is almost an actual proof of the formula for the derivative of an inverse function. But I don't see an equally convincing argument of that sort for the chain rule.

Actually, if we're not going to do it rigorously, then we might as well use an even simpler argument. It follows immediately from the definition of the derivative that when h is small, $$f(x+h)\approx f(x)+hf'(x).$$
That doesn't seem to work for all functions. Let $f(x)=\tan x$. Then using that formula, we will get the tangent of $\pi/2$ to be something like 3.8264459099620716.
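For what it's worth, the quoted number is consistent with taking $x=\pi/3$ and $h=\pi/6$ in $f(x+h)\approx f(x)+hf'(x)$; that reading is my inference from the digits, not stated in the post. The approximation then predicts a finite value for $\tan(\pi/2)$, which actually diverges; the formula is only trustworthy when h is small.

```python
# Reconstructing the failure mode: tan(x + h) ≈ tan(x) + h*sec(x)**2
# with x = pi/3 and h = pi/6 (values inferred from the quoted number,
# an assumption on my part) predicts a finite value for tan(pi/2),
# which in reality blows up.
import math

x, h = math.pi / 3, math.pi / 6
approx = math.tan(x) + h / math.cos(x) ** 2
print(approx)  # about 3.8264459..., nowhere near the divergence at pi/2
```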