Gradient descent algorithm and optimizers
AI Thread Summary
The discussion centers on the gradient descent algorithm as a method for finding the minimum of a function by following the negative of its gradient. While gradient descent is effective, it has limitations, such as potentially getting stuck in local minima and requiring differentiability of the function. Alternative methods like Newton's method and Maximum Likelihood Estimation are mentioned, with a note that they may offer advantages in certain scenarios. The conversation also touches on gradient descent optimizers, which adaptively change the step size during the process. Overall, understanding these concepts is crucial for effective function optimization in various applications.
fog37
TL;DR Summary
understand what optimizers are in relation to the gradient descent algorithm
Hello,

To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for ##x##. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and if it does not have too many parameters (how many?)
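For a one-variable case, this analytic recipe can be carried out symbolically. A minimal sketch in Python (the quadratic objective is just an illustrative choice, not from the discussion):

```python
import sympy as sp

x = sp.symbols("x")
f = (x - 3)**2 + 1                       # illustrative objective, minimum at x = 3

critical = sp.solve(sp.diff(f, x), x)    # set f'(x) = 0 and solve analytically
print(critical)                          # [3]
```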

An alternative to setting the gradient equal to zero is to still use the gradient of the function, but with the gradient descent method: to minimize the function, we follow the negative of the gradient and go in the direction of steepest descent. We start at a random ##x## value, set a step size, and calculate the gradient at that ##x## value to get a new ##x## value. We repeat the process (finding a new ##x## value and calculating the gradient there) until the gradient becomes equal to zero (or almost). I think that is correct... So far so good.
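Spelled out as code, that loop might look like the following minimal Python sketch (the test function, step size, and stopping tolerance are arbitrary illustrative choices):

```python
import numpy as np

def f(x):
    return (x - 3.0) ** 2        # illustrative objective, minimum at x = 3

def grad_f(x):
    return 2.0 * (x - 3.0)       # its analytic gradient

rng = np.random.default_rng()
x = rng.uniform(-10, 10)         # start at a random x value
step = 0.1                       # fixed step size

for _ in range(10_000):
    g = grad_f(x)
    if abs(g) < 1e-8:            # stop when the gradient is (almost) zero
        break
    x -= step * g                # follow the negative of the gradient

print(x)                         # should be close to 3
```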

Q1: There are other iterative methods to find the minimum of a function, for example Newton's method or maximum likelihood estimation. Why choose gradient descent? What virtues does it have?

Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not? The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"...What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?

Thank you for any input!
 
Hi,

For me this is in the category 'impossible to answer'. Function optimization can fill a lifetime and more; there are bookcases full of literature on the subject. The way you casually write
fog37 said:
TL;DR Summary: understand what optimizers are in relation to the gradient descent algorithm

We start at a random ##\bf x ## value, set a step size
makes me suspect you have no experience yet. So keep reading!

Perhaps you can narrow things down a bit? And
fog37 said:
I have been reading
is good. But with a reference it gets a little bit better.

fog37 said:
if it does not have too many parameters (how many?)
Depends on the function. In process simulation (chemical engineering), ##10^5## variables isn't outrageous (Speedup, gPROMS). Oh, and: ##\LaTeX## in MathJax is to be enclosed in double # for in-line math and in double $ for displayed math.
 
These are all good questions. The general subject is nonlinear programming, either constrained or unconstrained.
fog37 said:
To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for ##x##. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and if it does not have too many parameters (how many?)
If the objective function has only one variable but is of high order, then even one variable can be difficult to handle analytically.
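For instance (a made-up degree-6 polynomial, sketched in Python): setting its derivative to zero gives a quintic, which in general has no closed-form roots, so even the one-variable "analytic" approach ends up numerical.

```python
import numpy as np

# Illustrative degree-6 objective: f(x) = x^6 - 4x^4 + x^2 + x
f = np.poly1d([1, 0, -4, 0, 1, 1, 0])

# f'(x) is a quintic, so there is no general closed-form solution
# of f'(x) = 0; locate the critical points numerically instead.
crit = f.deriv().roots
real_crit = crit[np.isreal(crit)].real

# Keep the real critical point with the smallest objective value
print(real_crit[np.argmin(f(real_crit))])
```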
fog37 said:
Q1: There are other iterative methods to find the minimum of a function. For example, Newton's method or the Maximum likelihood estimation. Why choose the gradient descent? What virtues does it have?
There are tons of methods. Some use an analytically supplied gradient; some estimate the gradient based on the function values at certain points. You cannot always assume that there is an analytical gradient. There are "pattern search" methods that will still work.
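When no analytic gradient is available, a central-difference estimate is one common fallback. A small Python sketch (the test function and step h are illustrative):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Central-difference estimate of the gradient of f at x."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                                 # perturb one coordinate
        g[i] = (f(x + e) - f(x - e)) / (2 * h)
    return g

# f(x, y) = x^2 + 3y^2 has exact gradient (2, 12) at (1, 2)
print(numerical_gradient(lambda v: v[0]**2 + 3 * v[1]**2, [1.0, 2.0]))
```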
fog37 said:
Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Yes. It is advisable to perform several optimizations beginning at different starting points. You have not mentioned constraints. Constraints can make corners in the feasible region that are dead ends and not nearly optimal.
You have also not mentioned any randomness in the function evaluation. There is a lot of randomness in the world. One approach is to start with a general, simple, overall shape of the objective function and use that to concentrate on certain areas in more detail (see response surface methodology).
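A minimal sketch of the multi-start idea above, using scipy.optimize.minimize (the Rastrigin test function, search box, and restart count are arbitrary choices):

```python
import numpy as np
from scipy.optimize import minimize

def rastrigin(x):
    # Classic multimodal test function; global minimum 0 at the origin
    x = np.asarray(x)
    return 10 * x.size + np.sum(x**2 - 10 * np.cos(2 * np.pi * x))

rng = np.random.default_rng(0)
best = None
for _ in range(20):                               # independent restarts
    x0 = rng.uniform(-5, 5, size=2)               # random starting point
    res = minimize(rastrigin, x0, method="BFGS")  # local, gradient-based
    if best is None or res.fun < best.fun:
        best = res

print(best.x, best.fun)   # with enough restarts, near the global minimum
```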
fog37 said:
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not?
That probably depends on the subject area.
fog37 said:
The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"...What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?
Most iterative algorithms adjust the step size as they run.
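To make Q3 concrete: one widely used optimizer is Adam, which keeps running averages of the gradient and its square to scale each coordinate's effective step. A minimal sketch (hyperparameter defaults are the commonly used values; the toy objective is made up):

```python
import numpy as np

def adam_step(x, g, m, v, t, lr=0.05, b1=0.9, b2=0.999, eps=1e-8):
    """One Adam update: adapts the effective step size per coordinate."""
    m = b1 * m + (1 - b1) * g          # running mean of gradients
    v = b2 * v + (1 - b2) * g**2       # running mean of squared gradients
    m_hat = m / (1 - b1**t)            # bias corrections for early steps
    v_hat = v / (1 - b2**t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
    return x, m, v

# Minimize f(x) = x^2 starting from x = 5
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    g = 2 * x                          # gradient of the toy objective
    x, m, v = adam_step(x, g, m, v, t)
print(x)                               # close to 0
```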

IMO, one quick way to get an overview of the subject would be to look through the optimization sections of a good R-language or MATLAB reference and see what they are talking about.
 
Thing is, it's not just the number of variables that matters; it's that it is often difficult to find the zeros of gradient functions, and often of functions in general. That's why you use methods such as gradient descent.
 