# Understand the delta rule increment in gradient descent

• I
Summary:
Understand the delta rule increment in gradient descent
Hello Everyone,

I have a question about the gradient descent algorithm. Given a multivariable function ##f(x,y)##, we can find its minima (local or global) by either setting its gradient ##\nabla f = 0## or by using the gradient descent iterative approach. The first approach (setting the gradient equal to zero) is not always feasible for functions that have too many independent variables and are complicated.

Let's focus on the gradient descent and consider a 1D function ##f(x)## for simplicity. The gradient descent approach is a numerical method that involves the repetitive calculation of gradient ## - \nabla f ## to find the values of x where the function has a minimum.

If we are at a location ##x_0##, we calculate the gradient ##\Delta f(x_0)## and then move in the direction of the negative gradient by a step ##\Delta x##: $$x_{new} = x_{old} + \Delta x$$ $$x_{new} = x_{old} - \eta \nabla f$$
Here my dilemma: why is the the increment ##\Delta x##, which can be positive or negative, equal to the the product of a small and arbitrary constant ##\eta## and the negative of magnitude of the gradient ##\nabla f##?

I don't see how the increment to apply to ##x_{old}## results to be equal to $$\Delta x = \eta \nabla f$$

Is there an in depth derivation to prove that last result?

Thanks!

## Answers and Replies

hutchphd
Science Advisor
Homework Helper
You understand that the gradient descent is a method of incrementally "rolling down" the multivariate hill which is the function? I think the algorithm then makes you go back partway and do it again. When iterated this will put you at the bottom and for proper choice of parameter avoid being stuck in an endless oscillation.

• fog37
You understand that the gradient descent is a method of incrementally "rolling down" the multivariate hill which is the function? I think the algorithm then makes you go back partway and do it again. When iterated this will put you at the bottom and for proper choice of parameter avoid being stuck in an endless oscillation.
Yes, I follow that. I just don't see how the increment ##\Delta x## is equal to the scaled gradient ## - \eta \nabla f##....

hutchphd
Science Advisor
Homework Helper
I think they are saying to choose that increment (for an arbitrary small ##\eta## and then do the gradient again? Obviously it can only be true for arbitrary ##\eta## if ##\nabla f=0##
I don't know the exact algo. they are using.

pasmith
Homework Helper
Yes, I follow that. I just don't see how the increment ##\Delta x## is equal to the scaled gradient ## - \eta \nabla f##....

You are essentially trying to solve this ODE: $$\frac{dx}{dt} = - \nabla f,$$ which has fixed points iff $\nabla f = 0$ (and the fixed points are stable if $f$ has a minimum). If you approximate that using the forward Euler method you get $$\frac{x_{n+1} - x_n}{h} = - \nabla f$$ which is essentially your expreesion for $x_{new}$. A more sophisticated method would be to set $$\frac{dx}{dt} = - \alpha(t) \nabla f$$ where $\alpha(t) > 0$ which leads you to a similar result.

(I don't know if this is how the method is derived, but it does explain it.)

• fog37
You are essentially trying to solve this ODE: $$\frac{dx}{dt} = - \nabla f,$$ which has fixed points iff $\nabla f = 0$ (and the fixed points are stable if $f$ has a minimum). If you approximate that using the forward Euler method you get $$\frac{x_{n+1} - x_n}{h} = - \nabla f$$ which is essentially your expreesion for $x_{new}$. A more sophisticated method would be to set $$\frac{dx}{dt} = - \alpha(t) \nabla f$$ where $\alpha(t) > 0$ which leads you to a similar result.

(I don't know if this is how the method is derived, but it does explain it.)
Thanks passmith.

But the function is ##f(x)## and the independent variable is ##x##. The derivative of ##f(x)## is essentially the gradient: ##\nabla f(x) = \frac {df} {dx}##.

The ODE would be: $$\frac {df} {dx} = \nabla f(x)$$

So $$f(x_{new}) -f(x(_{old}) = h \nabla f(x)$$ and not $$x_{new} -x_{old} = h \nabla f(x)$$

pasmith
Homework Helper
No; you are finding a path $x(t)$ which will lead you to a minimum of $f$. Thus you have $$\frac{dx}{dt} = - \alpha\nabla f$$ where the gradient is evaluated at $x(t)$.

• fog37
No; you are finding a path $x(t)$ which will lead you to a minimum of $f$. Thus you have $$\frac{dx}{dt} = - \alpha\nabla f$$ where the gradient is evaluated at $x(t)$.
Is that different (or the same) from the concept of directional derivative
$$\nabla f \cdot u$$ where ##u## is a unit vector in a particular direction?

Office_Shredder
Staff Emeritus
Science Advisor
Gold Member
I'm a bit lost by the end of the first post. I think the definition of gradient descent is you pick ##\Delta x = -\eta \Delta f##. So there's nothing to prove with that equation, it's just a description of how ##\Delta x ## is picked. Is your question why that is a good choice?

• fog37
pasmith
Homework Helper
Another way to look at it is that you are making a series of guesses $x_0, x_1, \dots$ as yto where the miniimum is. If you guess wrong, then the best way to improve your guess is to head downhill. Which way is downhill? In the direction of $-\nabla f$. Of course if you go too far downhill you may find yourself headding up the other side of te valley, so you need to control how far you head downhill. That's where $\eta > 0$ comes from.

• fog37
I'm a bit lost by the end of the first post. I think the definition of gradient descent is you pick ##\Delta x = -\eta \Delta f##. So there's nothing to prove with that equation, it's just a description of how ##\Delta x ## is picked. Is your question why that is a good choice?
Hi Office_shredder.

Yes, I am wondering where that equation comes from and why it was set up that way, i.e. the increment ##\Delta x## being equal to the derivative (gradient) of the function ##f## multiplied by a constant.

Dimensionally, the left hand side (independent variable ##x##) does not seem to agree with the right hand side of the equation....OR does it?

Thanks

Another way to look at it is that you are making a series of guesses $x_0, x_1, \dots$ as yto where the miniimum is. If you guess wrong, then the best way to improve your guess is to head downhill. Which way is downhill? In the direction of $-\nabla f$. Of course if you go too far downhill you may find yourself headding up the other side of te valley, so you need to control how far you head downhill. That's where $\eta > 0$ comes from.
Conceptually, I with you and understand your point. I am still confused on how we set the increment ##\Delta x## equal to the gradient which is ##\frac {df}{dx}##. The learning parameter must have particular units.

hutchphd
Science Advisor
Homework Helper
It makes more sense in higher dimensions where the gradient specifies the direction you wish to go. It is like following the fall line when snow skiing.....and the ##\eta## just determines the granularity of your approximation.

• pbuk
pbuk
Science Advisor
Gold Member
Yes, I am wondering where that equation comes from and why it was set up that way, i.e. the increment ##\Delta x## being equal to the derivative (gradient) of the function ##f## multiplied by a constant.
If ## \nabla f \ne 0 ## you know you are not at a minimum, so in order to find a minimum you have to travel in some direction. Which is the best direction? Directly downhill i.e. ## -\nabla f ##. How far? Well if you go to far you could easily overshoot, but if you don't go far enough then it will take hours to find a solution, so you need to use a value that is tuned to the particular problem by some algorithm or heuristic - and we call that ## \eta ##.

Dimensionally, the left hand side (independent variable ##x##) does not seem to agree with the right hand side of the equation....OR does it?
The learning parameter must have particular units.

Dimensional analysis is only going to confuse you here, but since you brought it up then for a problem with a single independent variable ## \nabla f ## is a scalar and so ## \eta ## must have the same dimensions as ## x ##. Generally in numerical algorithms we don't think about dimensional analysis, we trust the maths to take care of this (if the dimensions of the problem are correct and our algorithm is correctly derived then it is all good): to the computer they are all just floats.

Last edited:
pbuk
Science Advisor
Gold Member
for a problem with a single independent variable ## \nabla x ## is a scalar and so ## \eta ## must have the same dimensions as ## x ##
Hmm on second thoughts I'm not sure this is correct, which perhaps just strengthens my assertion that dimensional analysis will only confuse things!

pbuk
Science Advisor
Gold Member
Let ## f(x) ## have dimensions ## D_f ## and x have dimensions ## D_x ## then ## \nabla f ## has dimensions ## D_f D^{-1}_x ## and we have ## D_{\Delta x} = D_x = D_\eta D_f D^{-1}_x \implies D_\eta = D^{-1}_fD^2_x##.

I said it wouldn't help.

• fog37
Let ## f(x) ## have dimensions ## D_f ## and x have dimensions ## D_x ## then ## \nabla f ## has dimensions ## D_f D^{-1}_x ## and we have ## D_{\Delta x} = D_x = D_\eta D_f D^{-1}_x \implies D_\eta = D^{-1}_fD^2_x##.

I said it wouldn't help.
Thanks pbuk.

I see your point.

Also, when I mentioned dimensions, I truly intended units :)

So $$x_{new}= x_{old} +\eta \nabla f(x)$$.
The term ##\eta \nabla f(x)## needs to have the same units as the variable ##x##. The equation $$x_{new}- x_{old} = \eta \nabla f(x)$$ does not originate from other mathematical steps/manipulation but it is set up to be this way in the sense that, as a path to find the ##x## for the minimum, we choose to the strategy to set ##\Delta x = \eta \nabla f(x)##...

pbuk
Science Advisor
Gold Member
The equation $$x_{new}- x_{old} = \eta \nabla f(x)$$ does not originate from other mathematical steps/manipulation but it is set up to be this way in the sense that, as a path to find the ##x## for the minimum, we choose to the strategy to set ##\Delta x = \eta \nabla f(x)##...
Yes, you now have the understanding, but be careful about sign conventions, particularly when you come to implement it in code. Using your symbols I would write ## x_{new} = x_{old} - \eta \nabla f(x) ##, and note that this translates simply to a multivariate problem with x as a vector implemented in whatever way fits the language (e.g. an array in C++, a numpy.array in Python...).

hutchphd
Science Advisor
Homework Helper
Summary:: Understand the delta rule increment in gradient descent

Is there an in depth derivation to prove that last result?
I believe you are "overthinking" this badly.
For the 1D problem this technique makes no sense!!! So do not consider it
For 2D (and higher) the gradient gives you the preferred direction to try (i.e. downhill). In order to be sure it is not a local minimum it must work at any choice of scale. QED