# Gradient descent algorithm and optimizers

fog37
TL;DR Summary
understand what optimizers are in relation to the gradient descent algorithm
Hello,

To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for $x$. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and if it has not too many parameters (how many?)

An alternative to setting the gradient equal to zero is to still use the gradient of the function, via the gradient descent method: to minimize the function, we follow the negative of the gradient and go in the direction of steepest descent. We start at a random $x$ value, set a step size, and calculate the gradient at that $x$ value to get a new $x$ value. We repeat the process (finding a new $x$ value and calculating the gradient there) until the gradient becomes equal to zero (or almost). I think that is correct... So far so good.
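In Python, the loop I have in mind looks roughly like this (a toy sketch of my own, not any library's implementation):

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Follow the negative gradient until it (almost) vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # gradient ~ 0: stationary point reached
            break
        x = x - step * g              # move in the direction of steepest descent
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```

Is this the right picture?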

Q1: There are other iterative methods to find the minimum of a function, for example Newton's method or maximum likelihood estimation. Why choose gradient descent? What virtues does it have?

Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not differentiable? The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"... What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?

Thank you for any input!

BvU
Hi,

For me this falls in the category 'impossible to answer'. Function optimization can fill a lifetime and more; there are bookcases full of literature on the subject. The way you casually write
fog37 said:
TL;DR Summary: understand what optimizers are in relation to the gradient descent algorithm

We start at a random ##\bf x ## value, set a step size
lets me suspect you have no experience yet. So keep reading!

Perhaps you can narrow things down a bit? And
fog37 said:
is good. But with a reference it gets a little bit better.

fog37 said:
if it has not too many parameters (how many?)
Depends on the function. In process simulation (chemical engineering) ##10^5## variables isn't outrageous (Speedup, gPROMS). Oh, and: ##\LaTeX## in MathJax is to be enclosed in double # for in-line math and in double $ for displayed math ##\ ##

FactChecker
These are all good questions. The general subject is nonlinear programming, either constrained or unconstrained.
fog37 said:
To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for $x$. This is the classic, analytic approach to find the minimum and it works if the function is differentiable and if it has not too many parameters (how many?)
If the objective function has only one variable but is of high order, then even that one variable can be difficult to handle analytically.
fog37 said:
Q1: There are other iterative methods to find the minimum of a function. For example, Newton's method or the Maximum likelihood estimation. Why choose the gradient descent? What virtues does it have?
There are tons of methods. Some use an analytically supplied gradient; some estimate the gradient from the function values at certain points. You cannot always assume that there is an analytical gradient. There are "pattern search" methods that still work in that case.
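For example, when only function values are available, the gradient can be estimated by central differences (a minimal sketch, not any particular package's routine):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Estimate the gradient of f at x by central differences."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                                 # perturb one coordinate at a time
        g[i] = (f(x + e) - f(x - e)) / (2 * h)   # symmetric difference quotient
    return g

# f(x, y) = x^2 + 3y has exact gradient (2x, 3).
g = numerical_gradient(lambda v: v[0]**2 + 3 * v[1], [1.0, 2.0])
```

Such estimates cost two function evaluations per variable, which is one reason methods that skip the gradient entirely can be attractive.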
fog37 said:
Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Yes. It is advisable to perform several optimizations beginning at different starting points. You have not mentioned constraints. Constraints can make corners in the feasible region that are dead ends and not nearly optimal.
You have also not mentioned any randomness in the function evaluation. There is a lot of randomness in the world. One approach is to start with a general, simple, overall shape of the objective function and use that to concentrate on certain areas in more detail (see response surface methodology).
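A minimal multi-start sketch (the objective here is a made-up one-variable example with one local and one global minimum):

```python
import numpy as np

def descend(grad, x, step=0.01, n_steps=5000):
    """Plain fixed-step gradient descent from a single starting point."""
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x

# f(x) = x^4 - 3x^2 + x: local minimum near x ≈ 1.13, global minimum near x ≈ -1.30.
f = lambda x: x**4 - 3 * x**2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1

# Run the descent from several starting points and keep the best end point.
starts = np.linspace(-2.0, 2.0, 9)
best = min((descend(grad, x0) for x0 in starts), key=f)
```

Starts on the right of the hump converge only to the local minimum; the others find the global one, and taking the best end point over all runs recovers it.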
fog37 said:
Also, the gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not differentiable?
That probably depends on the subject area.
fog37 said:
The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"... What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?
Most iterative algorithms adjust the step size as they run.
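For instance, plain gradient descent is often combined with a backtracking (Armijo) line search: start with a big step and halve it until it produces a sufficient decrease. A rough sketch (toy code, not any library's routine):

```python
import numpy as np

def gd_backtracking(f, grad, x0, tol=1e-8, max_iter=20_000):
    """Gradient descent whose step size is chosen anew by backtracking each iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # gradient (almost) zero: stop
            break
        t = 1.0
        # Halve the step until the Armijo sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - 0.5 * t * np.dot(g, g):
            t *= 0.5
        x = x - t * g
    return x

# Badly scaled quadratic f(x, y) = x^2 + 100 y^2 with its minimum at (0, 0):
# a single fixed step size would either diverge in y or crawl in x.
f = lambda z: z[0]**2 + 100 * z[1]**2
x_min = gd_backtracking(f, lambda z: np.array([2 * z[0], 200 * z[1]]), [1.0, 1.0])
```

The deep-learning "optimizers" you read about (momentum, RMSProp, Adam, ...) take this idea further: they adapt an effective per-coordinate step from running statistics of past gradients rather than from a line search.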

IMO, one quick way to get an overview of the subject would be to look through the optimization sections of a good R-language or MATLAB reference and see what they are talking about.

pbuk
Thing is, it's not just the number of variables that matters; it's that it is often difficult to find the zeros of gradient functions, and of functions in general. That's why you use methods such as gradient descent.
