Understand the delta rule increment in gradient descent

In summary: gradient descent follows a path ##x(t)## that leads to a minimum of ##f## by stepping along the negative gradient; the update ##\Delta x = -\eta \nabla f## is not derived from deeper principles but is the definition of the step, readable as a forward-Euler discretization of that path.
  • #1
fog37
TL;DR Summary
Understand the delta rule increment in gradient descent
Hello Everyone,

I have a question about the gradient descent algorithm. Given a multivariable function ##f(x,y)##, we can find its minima (local or global) either by setting its gradient ##\nabla f = 0## or by using the iterative gradient descent approach. The first approach (setting the gradient equal to zero) is often infeasible for complicated functions with many independent variables.

Let's focus on gradient descent and consider a 1D function ##f(x)## for simplicity. The gradient descent approach is a numerical method that involves repeatedly calculating the negative gradient ##-\nabla f## to find the values of ##x## where the function has a minimum.

If we are at a location ##x_0##, we calculate the gradient ##\nabla f(x_0)## and then move in the direction of the negative gradient by a step ##\Delta x##: $$x_{new} = x_{old} + \Delta x$$ $$x_{new} = x_{old} - \eta \nabla f$$
Here is my dilemma: why is the increment ##\Delta x##, which can be positive or negative, equal to the product of a small, arbitrary constant ##\eta## and the negative of the gradient ##\nabla f##?

I don't see how the increment applied to ##x_{old}## turns out to be $$\Delta x = -\eta \nabla f$$

Is there an in-depth derivation that proves this last result?

Thanks!
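To make the recipe above concrete, here is a minimal sketch in Python. The quadratic ##f(x) = (x-2)^2##, the starting point, ##\eta##, and the stopping tolerance are all arbitrary illustrative choices, not anything prescribed by the method:

[code]
def grad_f(x):
    # derivative of the illustrative function f(x) = (x - 2)^2
    return 2.0 * (x - 2.0)

x = 0.0      # starting guess x_0
eta = 0.1    # learning rate: small, positive, chosen by hand

for _ in range(100):
    step = -eta * grad_f(x)   # Delta x = -eta * grad f(x_old)
    x = x + step              # x_new = x_old + Delta x
    if abs(step) < 1e-8:      # stop once the steps become negligible
        break

print(x)  # approaches the minimiser x = 2
[/code]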
 
  • #2
You understand that gradient descent is a method of incrementally "rolling down" the multivariate hill that is the function? I think the algorithm then makes you go back partway and do it again. When iterated, this will put you at the bottom and, for a proper choice of parameter, avoid getting stuck in an endless oscillation.
 
  • #3
hutchphd said:
You understand that gradient descent is a method of incrementally "rolling down" the multivariate hill that is the function? I think the algorithm then makes you go back partway and do it again. When iterated, this will put you at the bottom and, for a proper choice of parameter, avoid getting stuck in an endless oscillation.
Yes, I follow that. I just don't see how the increment ##\Delta x## is equal to the scaled gradient ## - \eta \nabla f##...
 
  • #4
I think they are saying to choose that increment (for an arbitrarily small ##\eta##) and then compute the gradient again? Obviously it can only be true for arbitrary ##\eta## if ##\nabla f = 0##.
I don't know the exact algorithm they are using.
 
  • #5
fog37 said:
Yes, I follow that. I just don't see how the increment ##\Delta x## is equal to the scaled gradient ## - \eta \nabla f##...

You are essentially trying to solve this ODE: $$\frac{dx}{dt} = - \nabla f,$$ which has fixed points iff ##\nabla f = 0## (and the fixed points are stable if ##f## has a minimum). If you approximate that using the forward Euler method you get $$\frac{x_{n+1} - x_n}{h} = - \nabla f,$$ which is essentially your expression for ##x_{new}##. A more sophisticated method would be to set $$\frac{dx}{dt} = - \alpha(t) \nabla f$$ where ##\alpha(t) > 0##, which leads you to a similar result.

(I don't know if this is how the method is derived, but it does explain it.)
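To see the forward-Euler connection numerically, here is a short sketch (assuming, for illustration, ##f(x) = x^2## so that ##\nabla f = 2x##); one Euler step with step size ##h## is exactly one gradient-descent update with ##\eta = h##:

[code]
def grad_f(x):
    # gradient of the illustrative function f(x) = x^2
    return 2.0 * x

h = 0.05   # Euler step size, playing the role of eta
x = 3.0    # initial condition x(0)

for _ in range(200):
    # forward Euler on dx/dt = -grad f(x):
    # (x_{n+1} - x_n) / h = -grad f(x_n)  =>  x_{n+1} = x_n - h * grad f(x_n)
    x = x - h * grad_f(x)

print(x)  # decays toward the fixed point x = 0, where grad f = 0
[/code]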
 
  • #6
pasmith said:
You are essentially trying to solve this ODE: $$\frac{dx}{dt} = - \nabla f,$$ which has fixed points iff ##\nabla f = 0## (and the fixed points are stable if ##f## has a minimum). If you approximate that using the forward Euler method you get $$\frac{x_{n+1} - x_n}{h} = - \nabla f,$$ which is essentially your expression for ##x_{new}##. A more sophisticated method would be to set $$\frac{dx}{dt} = - \alpha(t) \nabla f$$ where ##\alpha(t) > 0##, which leads you to a similar result.

(I don't know if this is how the method is derived, but it does explain it.)
Thanks, pasmith.

But the function is ##f(x)## and the independent variable is ##x##. The derivative of ##f(x)## is essentially the gradient: ##\nabla f(x) = \frac {df} {dx}##.

The ODE would be: $$\frac{df}{dx} = \nabla f(x),$$ so $$f(x_{new}) - f(x_{old}) = h \nabla f(x)$$ and not $$x_{new} - x_{old} = h \nabla f(x)$$
 
  • #7
No; you are finding a path ##x(t)## which will lead you to a minimum of ##f##. Thus you have $$\frac{dx}{dt} = - \alpha\nabla f$$ where the gradient is evaluated at ##x(t)##.
 
  • #8
pasmith said:
No; you are finding a path ##x(t)## which will lead you to a minimum of ##f##. Thus you have $$\frac{dx}{dt} = - \alpha\nabla f$$ where the gradient is evaluated at ##x(t)##.
Is that different from (or the same as) the concept of the directional derivative $$\nabla f \cdot u,$$ where ##u## is a unit vector in a particular direction?
 
  • #9
I'm a bit lost by the end of the first post. I think the definition of gradient descent is that you pick ##\Delta x = -\eta \nabla f##. So there's nothing to prove with that equation; it's just a description of how ##\Delta x## is picked. Is your question why that is a good choice?
 
  • #10
Another way to look at it is that you are making a series of guesses ##x_0, x_1, \dots## as to where the minimum is. If you guess wrong, then the best way to improve your guess is to head downhill. Which way is downhill? In the direction of ##-\nabla f##. Of course, if you go too far downhill you may find yourself heading up the other side of the valley, so you need to control how far you head downhill. That's where ##\eta > 0## comes from.
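A small numerical illustration of that overshoot, assuming for concreteness ##f(x) = x^2## (so the update is ##x \leftarrow (1 - 2\eta)x##): the iteration converges only when ##0 < \eta < 1##, oscillates forever at ##\eta = 1##, and diverges beyond it:

[code]
def descend(eta, x=1.0, steps=10):
    # repeated update x <- x - eta * grad f(x) with f(x) = x^2, grad f = 2x,
    # i.e. x <- (1 - 2*eta) * x at each step
    for _ in range(steps):
        x = x - eta * 2.0 * x
    return x

print(descend(0.1))   # 0 < eta < 1: shrinks steadily toward the minimum at 0
print(descend(1.0))   # eta = 1: jumps to the same height on the other side forever
print(descend(1.1))   # eta > 1: overshoots more each step and diverges
[/code]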
 
  • #11
Office_Shredder said:
I'm a bit lost by the end of the first post. I think the definition of gradient descent is that you pick ##\Delta x = -\eta \nabla f##. So there's nothing to prove with that equation; it's just a description of how ##\Delta x## is picked. Is your question why that is a good choice?
Hi Office_Shredder.

Yes, I am wondering where that equation comes from and why it was set up that way, i.e. with the increment ##\Delta x## being equal to the derivative (gradient) of the function ##f## multiplied by a constant.

Dimensionally, the left-hand side (the independent variable ##x##) does not seem to agree with the right-hand side of the equation... or does it?

Thanks
 
  • #12
pasmith said:
Another way to look at it is that you are making a series of guesses ##x_0, x_1, \dots## as to where the minimum is. If you guess wrong, then the best way to improve your guess is to head downhill. Which way is downhill? In the direction of ##-\nabla f##. Of course, if you go too far downhill you may find yourself heading up the other side of the valley, so you need to control how far you head downhill. That's where ##\eta > 0## comes from.
Conceptually, I am with you and understand your point. I am still confused about how we set the increment ##\Delta x## equal to the gradient, which is ##\frac{df}{dx}##. The learning parameter must have particular units.
 
  • #13
It makes more sense in higher dimensions, where the gradient specifies the direction you wish to go. It is like following the fall line when snow skiing... and ##\eta## just determines the granularity of your approximation.
 
  • #14
fog37 said:
Yes, I am wondering where that equation comes from and why it was set up that way, i.e. with the increment ##\Delta x## being equal to the derivative (gradient) of the function ##f## multiplied by a constant.
If ##\nabla f \ne 0## you know you are not at a minimum, so in order to find one you have to travel in some direction. Which is the best direction? Directly downhill, i.e. ##-\nabla f##. How far? Well, if you go too far you could easily overshoot, but if you don't go far enough then it will take hours to find a solution, so you need to use a value that is tuned to the particular problem by some algorithm or heuristic - and we call that ##\eta##.

fog37 said:
Dimensionally, the left-hand side (the independent variable ##x##) does not seem to agree with the right-hand side of the equation... or does it?
fog37 said:
The learning parameter must have particular units.

Dimensional analysis is only going to confuse you here, but since you brought it up: for a problem with a single independent variable ##\nabla f## is a scalar, and so ##\eta## must have the same dimensions as ##x##. Generally in numerical algorithms we don't think about dimensional analysis; we trust the maths to take care of it (if the dimensions of the problem are correct and our algorithm is correctly derived then it is all good): to the computer they are all just floats.
 
  • #15
pbuk said:
for a problem with a single independent variable ##\nabla f## is a scalar, and so ##\eta## must have the same dimensions as ##x##
Hmm, on second thoughts I'm not sure this is correct, which perhaps just strengthens my assertion that dimensional analysis will only confuse things!
 
  • #16
Let ##f(x)## have dimensions ##D_f## and ##x## have dimensions ##D_x##; then ##\nabla f## has dimensions ##D_f D_x^{-1}##, and we have ##D_{\Delta x} = D_x = D_\eta D_f D_x^{-1} \implies D_\eta = D_f^{-1} D_x^2##.

I said it wouldn't help.
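For a concrete (hypothetical) instance of that formula: if ##f## were an energy measured in joules and ##x## a length in metres, then ##\nabla f## would carry units of ##\mathrm{J\,m^{-1}}##, and ##\eta## would need units of ##\mathrm{m^2\,J^{-1}}## so that ##\eta \nabla f## comes out in metres.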
 
  • #17
pbuk said:
Let ##f(x)## have dimensions ##D_f## and ##x## have dimensions ##D_x##; then ##\nabla f## has dimensions ##D_f D_x^{-1}##, and we have ##D_{\Delta x} = D_x = D_\eta D_f D_x^{-1} \implies D_\eta = D_f^{-1} D_x^2##.

I said it wouldn't help.
Thanks pbuk.

I see your point.

Also, when I mentioned dimensions, I really meant units :)

So $$x_{new} = x_{old} + \eta \nabla f(x).$$
The term ##\eta \nabla f(x)## needs to have the same units as the variable ##x##. The equation $$x_{new} - x_{old} = \eta \nabla f(x)$$ does not originate from other mathematical steps/manipulations; it is set up this way in the sense that, as a path to find the ##x## of the minimum, we choose the strategy of setting ##\Delta x = \eta \nabla f(x)##...
 
  • #18
fog37 said:
The equation $$x_{new} - x_{old} = \eta \nabla f(x)$$ does not originate from other mathematical steps/manipulations; it is set up this way in the sense that, as a path to find the ##x## of the minimum, we choose the strategy of setting ##\Delta x = \eta \nabla f(x)##...
Yes, you now have the understanding, but be careful about sign conventions, particularly when you come to implement it in code. Using your symbols I would write ##x_{new} = x_{old} - \eta \nabla f(x)##, and note that this translates simply to a multivariate problem, with ##x## as a vector implemented in whatever way fits the language (e.g. an array in C++, a numpy.array in Python...).
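A sketch of that multivariate form, holding ##x## as a numpy array; the bowl ##f(x, y) = x^2 + 3y^2## and the parameter values below are illustrative choices only:

[code]
import numpy as np

def grad_f(x):
    # gradient of the illustrative bowl f(x, y) = x^2 + 3y^2
    return np.array([2.0 * x[0], 6.0 * x[1]])

x = np.array([4.0, -2.0])   # starting guess, now a vector
eta = 0.1                   # learning rate

for _ in range(200):
    x = x - eta * grad_f(x)   # x_new = x_old - eta * grad f(x_old), componentwise

print(x)  # approaches the minimum at the origin
[/code]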
 
  • #19
fog37 said:
Summary: Understand the delta rule increment in gradient descent

Is there an in-depth derivation that proves this last result?
I believe you are "overthinking" this badly.
For the 1D problem this technique makes no sense, so do not consider it.
For 2D (and higher), the gradient gives you the preferred direction to try (i.e. downhill). In order to be sure it is not a local minimum, it must work at any choice of scale. QED
 

1. What is the delta rule increment in gradient descent?

The delta rule increment in gradient descent is a mathematical formula used to update the weights of a neural network during training. It calculates the change in weight from the error between the network's actual output and the desired (target) output.

2. How does the delta rule increment affect gradient descent?

The delta rule increment plays a crucial role in gradient descent by adjusting the weights of the network in the direction that minimizes the error. This helps the network to learn and improve its predictions over time.

3. What is the formula for calculating the delta rule increment?

The formula for calculating the delta rule increment is ##\Delta w = \alpha \, (target - output) \cdot input##, where ##\Delta w## is the change in weight, ##\alpha## is the learning rate, target is the desired output, output is the actual output, and input is the input to the neuron.
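As a sketch, a single delta-rule update for one linear neuron (taking the output to be the weighted sum ##w \cdot x##; the data, weights, and learning rate below are made-up illustrative values) might look like:

[code]
# One delta-rule update for a single linear neuron with output = w . x.
inputs = [0.5, -1.0]   # x: inputs to the neuron
w = [0.2, 0.4]         # current weights
target = 1.0           # desired output
alpha = 0.1            # learning rate

output = sum(wi * xi for wi, xi in zip(w, inputs))   # actual output w . x
w = [wi + alpha * (target - output) * xi             # Delta w_i = alpha * (target - output) * x_i
     for wi, xi in zip(w, inputs)]

print(w)  # weights nudged so the output moves toward the target
[/code]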

4. How is the learning rate related to the delta rule increment?

The learning rate, denoted by α, determines the size of the weight update in each iteration. A higher learning rate means a larger weight update, which can lead to faster learning but may also result in overshooting the optimal weights. On the other hand, a lower learning rate may take longer to converge but is less likely to overshoot.

5. Can the delta rule increment be used for any type of neural network?

Yes, the delta rule increment can be used for any type of neural network, including feedforward, recurrent, and convolutional neural networks. It is a fundamental concept in gradient descent, which is the most commonly used optimization algorithm for training neural networks.
