Gradient descent algorithm and optimizers
AI Thread Summary
The discussion centers on the gradient descent algorithm as a method for finding the minimum of a function by following the negative of its gradient. While gradient descent is effective, it has limitations, such as potentially getting stuck in local minima and requiring differentiability of the function. Alternative methods like Newton's method and Maximum Likelihood Estimation are mentioned, with a note that they may offer advantages in certain scenarios. The conversation also touches on gradient descent optimizers, which adaptively change the step size during the process. Overall, understanding these concepts is crucial for effective function optimization in various applications.
fog37
TL;DR Summary
understand what optimizers are in relation to the gradient descent algorithm
Hello,

To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for ##x##. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and if it does not have too many parameters (how many?)
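For a one-variable case, this analytic recipe can be carried out symbolically. A minimal sketch in Python (the quadratic objective is just an illustrative choice, not from the discussion):

```python
import sympy as sp

x = sp.symbols("x")
f = (x - 3)**2 + 1                       # illustrative objective, minimum at x = 3

critical = sp.solve(sp.diff(f, x), x)    # set f'(x) = 0 and solve analytically
print(critical)                          # [3]
```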

An alternative to setting the gradient equal to zero is to still use the gradient of the function, but with the gradient descent method: to minimize the function, we follow the negative of the gradient and go in the direction of steepest descent. We start at a random ##x## value, set a step size, and calculate the gradient at that ##x## value to get a new ##x## value. We repeat the process (finding a new ##x## value and calculating the gradient there) until the gradient becomes equal to zero (or almost). I think that is correct... So far so good.
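Spelled out as code, that loop might look like the following minimal Python sketch (the test function, step size, and stopping tolerance are arbitrary illustrative choices):

```python
import numpy as np

def f(x):
    return (x - 3.0) ** 2        # illustrative objective, minimum at x = 3

def grad_f(x):
    return 2.0 * (x - 3.0)       # its analytic gradient

rng = np.random.default_rng()
x = rng.uniform(-10, 10)         # start at a random x value
step = 0.1                       # fixed step size

for _ in range(10_000):
    g = grad_f(x)
    if abs(g) < 1e-8:            # stop when the gradient is (almost) zero
        break
    x -= step * g                # follow the negative of the gradient

print(x)                         # should be close to 3
```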

Q1: There are other iterative methods to find the minimum of a function, for example Newton's method or maximum likelihood estimation. Why choose gradient descent? What virtues does it have?

Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not? The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"...What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?

Thank you for any input!
 
Hi,

For me this is in the category 'impossible to answer'. Function optimization can fill a lifetime and more; there are bookcases full of literature on the subject. The way you casually write
fog37 said:
TL;DR Summary: understand what optimizers are in relation to the gradient descent algorithm

We start at a random ##\bf x ## value, set a step size
makes me suspect you have no experience yet. So keep reading!

Perhaps you can narrow things down a bit? And
fog37 said:
I have been reading
is good. But with a reference it gets a little bit better.

fog37 said:
if it does not have too many parameters (how many?)
Depends on the function. In process simulation (chemical engineering), ##10^5## variables isn't outrageous (Speedup, gPROMS). Oh, and: ##\LaTeX## in MathJax is to be enclosed in double # for in-line math and in double $ for displayed math.
 
These are all good questions. The general subject is nonlinear programming, either constrained or unconstrained.
fog37 said:
To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for ##x##. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and if it does not have too many parameters (how many?)
If the objective function has only one variable but is of high order, then even one variable can be difficult to handle analytically.
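For instance (a made-up degree-6 polynomial, sketched in Python): setting its derivative to zero gives a quintic, which in general has no closed-form roots, so even the one-variable "analytic" approach ends up numerical.

```python
import numpy as np

# Illustrative degree-6 objective: f(x) = x^6 - 4x^4 + x^2 + x
f = np.poly1d([1, 0, -4, 0, 1, 1, 0])

# f'(x) is a quintic, so there is no general closed-form solution
# of f'(x) = 0; locate the critical points numerically instead.
crit = f.deriv().roots
real_crit = crit[np.isreal(crit)].real

# Keep the real critical point with the smallest objective value
print(real_crit[np.argmin(f(real_crit))])
```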
fog37 said:
Q1: There are other iterative methods to find the minimum of a function. For example, Newton's method or the Maximum likelihood estimation. Why choose the gradient descent? What virtues does it have?
There are tons of methods. Some use an analytically supplied gradient; some estimate the gradient based on the function values at certain points. You cannot always assume that there is an analytical gradient. There are "pattern search" methods that will still work.
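When no analytic gradient is available, a central-difference estimate is one common fallback. A small Python sketch (the test function and step h are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                                 # perturb one coordinate
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# f(x, y) = x^2 + 3y^2 has exact gradient (2, 12) at (1, 2)
print(numerical_gradient(lambda v: v[0]**2 + 3 * v[1]**2, [1.0, 2.0]))
```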
fog37 said:
Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Yes. It is advisable to perform several optimizations beginning at different starting points. You have not mentioned constraints. Constraints can make corners in the feasible region that are dead ends and not nearly optimal.
You have also not mentioned any randomness in the function evaluation. There is a lot of randomness in the world. One approach is to start with a general, simple, overall shape of the objective function and use that to concentrate on certain areas in more detail (see response surface methodology).
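A minimal sketch of the multi-start idea above, using scipy.optimize.minimize (the Rastrigin test function, search box, and restart count are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize

def rastrigin(x):
    # Classic multimodal test function; global minimum 0 at the origin
    x = np.asarray(x)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(0)
best = None
for _ in range(20):                               # independent restarts
    x0 = rng.uniform(-5, 5, size=2)               # random starting point
    res = minimize(rastrigin, x0, method="BFGS")  # local, gradient-based
    if best is None or res.fun < best.fun:
        best = res

print(best.x, best.fun)   # with enough restarts, near the global minimum
```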
fog37 said:
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not?
That probably depends on the subject area.
fog37 said:
The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"...What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?
Most iterative algorithms adjust the step size as they run.
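To make Q3 concrete: one widely used optimizer is Adam, which keeps running averages of the gradient and its square to scale each coordinate's effective step. A minimal sketch (hyperparameter defaults are the commonly used values; the toy objective is made up):

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: adapts the effective step size per coordinate."""
    m = b1 * m + (1 - b1) * g          # running mean of gradients
    v = b2 * v + (1 - b2) * g**2       # running mean of squared gradients
    m_hat = m / (1 - b1**t)            # bias corrections for early steps
    v_hat = v / (1 - b2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Minimize f(x) = x^2 starting from x = 5
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2 * x                          # gradient of the toy objective
    x, m, v = adam_step(x, g, m, v, t)
print(x)                               # close to 0
```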

IMO, one quick way to get an overview of the subject would be to look through the optimization sections of a good R-language or MATLAB reference and see what they are talking about.
 
Thing is, it's not just the number of variables that matters; it's that it is often difficult to find the zeros of gradient functions, and often of functions in general. That's why you use methods such as gradient descent.
 