# Gradient descent algorithm and optimizers

fog37
TL;DR Summary
understand what optimizers are in relation to the gradient descent algorithm
Hello,

To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for $x$. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and if it has not too many parameters (how many?)

An alternative to setting the gradient equal to zero is to still use the gradient of the function, via the gradient descent method: to minimize the function, we follow the negative of the gradient and go in the direction of steepest descent. We start at a random $x$ value, set a step size, and calculate the gradient at that $x$ value to get a new $x$ value. We repeat the process (finding a new $x$ value and calculating the gradient there) until the gradient becomes equal to zero (or almost). I think that is correct... So far so good.
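In Python, the loop I have in mind looks roughly like this (a toy sketch of my own, not any library's implementation):

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Follow the negative gradient until it (almost) vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # gradient ~ 0: stationary point reached
            break
        x = x - step * g              # move in the direction of steepest descent
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```

Is this the right picture?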

Q1: There are other iterative methods to find the minimum of a function, for example Newton's method or maximum likelihood estimation. Why choose gradient descent? What virtues does it have?

Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not differentiable? The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"... What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?

Thank you for any input!

BvU
Hi,

For me this falls in the category 'impossible to answer'. Function optimization can fill a lifetime and more; there are bookcases full of literature on the subject. The way you casually write
fog37 said:
TL;DR Summary: understand what optimizers are in relation to the gradient descent algorithm

We start at a random ##\bf x ## value, set a step size
lets me suspect you have no experience yet. So keep reading!

Perhaps you can narrow things down a bit? And
fog37 said:
is good. But with a reference it gets a little bit better.

fog37 said:
if it has not too many parameters (how many?)
Depends on the function. In process simulation (chemical engineering) ##10^5## variables isn't outrageous (Speedup, gPROMS). Oh, and: ##\LaTeX## in MathJax is to be enclosed in double # for in-line math and in double $ for displayed math ##\ ##

FactChecker
These are all good questions. The general subject is nonlinear programming, either constrained or unconstrained.
fog37 said:
To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for $x$. This is the classic, analytic approach to find the minimum and it works if the function is differentiable and if it has not too many parameters (how many?)
If the objective function has only one variable but is of high order, then even that one variable can be difficult to handle analytically.
fog37 said:
Q1: There are other iterative methods to find the minimum of a function. For example, Newton's method or the Maximum likelihood estimation. Why choose the gradient descent? What virtues does it have?
There are tons of methods. Some use an analytically supplied gradient; some estimate the gradient from the function values at certain points. You cannot always assume that there is an analytical gradient. There are "pattern search" methods that still work in that case.
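For example, when only function values are available, the gradient can be estimated by central differences (a minimal sketch, not any particular package's routine):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Estimate the gradient of f at x by central differences."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                                 # perturb one coordinate at a time
        g[i] = (f(x + e) - f(x - e)) / (2 * h)   # symmetric difference quotient
    return g

# f(x, y) = x^2 + 3y has exact gradient (2x, 3).
g = numerical_gradient(lambda v: v[0]**2 + 3 * v[1], [1.0, 2.0])
```

Such estimates cost two function evaluations per variable, which is one reason methods that skip the gradient entirely can be attractive.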
fog37 said:
Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Yes. It is advisable to perform several optimizations beginning at different starting points. You have not mentioned constraints. Constraints can make corners in the feasible region that are dead ends and not nearly optimal.
You have also not mentioned any randomness in the function evaluation. There is a lot of randomness in the world. One approach is to start with a general, simple, overall shape of the objective function and use that to concentrate on certain areas in more detail (see response surface methodology).
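A minimal multi-start sketch (the objective here is a made-up one-variable example with one local and one global minimum):

```python
import numpy as np

def descend(grad, x, step=0.01, n_steps=5000):
    """Plain fixed-step gradient descent from a single starting point."""
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x

# f(x) = x^4 - 3x^2 + x: local minimum near x ≈ 1.13, global minimum near x ≈ -1.30.
f = lambda x: x**4 - 3 * x**2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1

# Run the descent from several starting points and keep the best end point.
starts = np.linspace(-2.0, 2.0, 9)
best = min((descend(grad, x0) for x0 in starts), key=f)
```

Starts on the right of the hump converge only to the local minimum; the others find the global one, and taking the best end point over all runs recovers it.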
fog37 said:
Also, the gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not differentiable?
That probably depends on the subject area.
fog37 said:
The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"... What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?
Most iterative algorithms adjust the step size as they run.
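For instance, plain gradient descent is often combined with a backtracking (Armijo) line search: start with a big step and halve it until it produces a sufficient decrease. A rough sketch (toy code, not any library's routine):

```python
import numpy as np

def gd_backtracking(f, grad, x0, tol=1e-8, max_iter=20_000):
    """Gradient descent whose step size is chosen anew by backtracking each iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # gradient (almost) zero: stop
            break
        t = 1.0
        # Halve the step until the Armijo sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - 0.5 * t * np.dot(g, g):
            t *= 0.5
        x = x - t * g
    return x

# Badly scaled quadratic f(x, y) = x^2 + 100 y^2 with its minimum at (0, 0):
# a single fixed step size would either diverge in y or crawl in x.
f = lambda z: z[0]**2 + 100 * z[1]**2
x_min = gd_backtracking(f, lambda z: np.array([2 * z[0], 200 * z[1]]), [1.0, 1.0])
```

The deep-learning "optimizers" you read about (momentum, RMSProp, Adam, ...) take this idea further: they adapt an effective per-coordinate step from running statistics of past gradients rather than from a line search.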

IMO, one quick way to get an overview of the subject would be to look through the optimization sections of a good R-language or MATLAB reference and see what they are talking about.

pbuk
Thing is, it's not just the number of variables that matters; it's that it is often difficult to find the zeros of gradient functions, and of functions in general. That's why you use methods such as gradient descent.
