Gradient descent algorithm and optimizers

  • #1
fog37
TL;DR Summary
Understand what optimizers are in relation to the gradient descent algorithm.
Hello,

To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for $x$. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and does not have too many parameters (how many is too many?).

An alternative to setting the gradient equal to zero is to still use the gradient of the function, via the gradient descent method: to minimize the function, we follow the negative of the gradient, i.e. the direction of steepest descent. We start at a random $x$ value, set a step size, and use the gradient at that $x$ value to calculate a new $x$ value. We repeat the process (find the new $x$ value, calculate the gradient there) until the gradient becomes equal to zero (or almost). I think that is correct... so far so good.
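The loop described above can be sketched in a few lines of Python (a toy one-dimensional example; the quadratic test function, step size, and tolerance are just illustrative choices, not part of the question):

```python
def grad_descent(grad, x0, lr=0.1, tol=1e-8, max_iter=10_000):
    """Repeat x <- x - lr * grad(x) until the gradient is (almost) zero."""
    x = x0
    for _ in range(max_iter):
        g = grad(x)
        if abs(g) < tol:       # stop when the gradient is nearly zero
            break
        x = x - lr * g         # step in the direction of steepest descent
    return x

# f(x) = (x - 3)^2 has its minimum at x = 3, and f'(x) = 2(x - 3)
x_min = grad_descent(lambda x: 2 * (x - 3), x0=0.0)
```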

Q1: There are other iterative methods to find the minimum of a function, for example Newton's method or maximum likelihood estimation. Why choose gradient descent? What virtues does it have?

Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds a point that is a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not? The loss function of a neural network may have millions of variables but is generally differentiable...

Q3: I have been reading about gradient descent "optimizers". What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?

Thank you for any input!
 
  • #2
Hi,

For me this is in the category 'impossible to answer'. Function optimization can fill a lifetime and more; there are bookcases full of literature on the subject. The way you casually write
fog37 said:
TL;DR Summary: understand what optimizers are in relation to the gradient descent algorithm

We start at a random ##\bf x## value, set a step size
lets me suspect you have no experience yet. So keep reading !

Perhaps you can narrow things down a bit ? And
fog37 said:
I have been reading
is good. But with a reference it gets a little bit better.

fog37 said:
if it does not have too many parameters (how many is too many?)
Depends on the function. In process simulation (chemical engineering), ##10^5## variables isn't outrageous (Speedup, gPROMS). Oh, and: ##\LaTeX## in MathJax is to be enclosed in double # for in-line math and in double $ for displayed math ##\ ##
 
  • #3
These are all good questions. The general subject is nonlinear programming, either constrained or unconstrained.
fog37 said:
To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for $x$. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and does not have too many parameters (how many is too many?).
If the objective function has only one variable but is of high order, then even that one variable can be difficult to handle analytically.
fog37 said:
Q1: There are other iterative methods to find the minimum of a function, for example Newton's method or maximum likelihood estimation. Why choose gradient descent? What virtues does it have?
There are tons of methods. Some use an analytically supplied gradient; some estimate the gradient from the function values at certain points. You cannot always assume that there is an analytical gradient. There are "pattern search" methods that will still work.
fog37 said:
Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds a point that is a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Yes. It is advisable to perform several optimizations beginning at different starting points. You have not mentioned constraints. Constraints can make corners in the feasible region that are dead ends and not nearly optimal.
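The multi-start advice above can be sketched in Python; a hypothetical illustration, with a made-up quartic test function, learning rate, and start interval (the quartic has one local and one global minimum, so a single run can get stuck in the wrong one):

```python
import random

def multi_start(f, grad, n_starts=20, lr=0.01, steps=500, lo=-2.0, hi=2.0):
    """Run plain gradient descent from several random starting points
    and keep the result with the lowest function value."""
    best_x, best_f = None, float("inf")
    for _ in range(n_starts):
        x = random.uniform(lo, hi)        # a fresh random starting point
        for _ in range(steps):
            x -= lr * grad(x)             # plain gradient descent
        if f(x) < best_f:
            best_x, best_f = x, f(x)
    return best_x

# f(x) = x^4 - 3x^2 + x has a local minimum near x ≈ 1.13 and its
# global minimum near x ≈ -1.30.
f = lambda x: x**4 - 3 * x**2 + x
df = lambda x: 4 * x**3 - 6 * x + 1
random.seed(0)
x_best = multi_start(f, df)
```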
You have also not mentioned any randomness in the function evaluation. There is a lot of randomness in the world. One approach is to start with a general, simple, overall shape of the objective function and use that to concentrate on certain areas in more detail (see response surface methodology).
fog37 said:
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not?
That probably depends on the subject area.
fog37 said:
The loss function of a neural network may have millions of variables but is generally differentiable...

Q3: I have been reading about gradient descent "optimizers". What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?
Most iterative algorithms adjust the step size as they run.
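One classic way an algorithm adjusts its step size is backtracking line search with the Armijo sufficient-decrease condition. A rough Python sketch (the test function and the constants alpha and beta are arbitrary illustrative choices):

```python
def backtracking_step(f, grad, x, t0=1.0, alpha=0.3, beta=0.5):
    """One gradient step with backtracking line search: shrink the step
    size t until f(x - t*g) <= f(x) - alpha * t * g**2 (Armijo condition)."""
    g = grad(x)
    t = t0
    while f(x - t * g) > f(x) - alpha * t * g * g:
        t *= beta                 # step too big: cut it in half
    return x - t * g

# Minimize f(x) = x^4 starting from x = 2; no hand-tuned step size needed.
f = lambda x: x**4
df = lambda x: 4 * x**3
x = 2.0
for _ in range(100):
    x = backtracking_step(f, df, x)
```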

IMO, one quick way to get an overview of the subject would be to look through the optimization sections of a good R-language or MATLAB reference and see what they are talking about.
 
  • #4
Thing is, it's not just the number of variables that matters, but that it's difficult to find the zeros of gradient functions, and often of functions in general. That's why you use methods such as gradient descent.
 
  • #6
BvU said:
So keep reading !
 

What is the Gradient Descent algorithm?

The Gradient Descent algorithm is an optimization technique used to minimize a function by iteratively moving in the direction of steepest descent of the function. It is commonly used in machine learning to update the parameters of a model in order to minimize the loss function.

How does the Gradient Descent algorithm work?

The Gradient Descent algorithm works by taking the gradient of the loss function with respect to the parameters of the model. It then updates the parameters in the opposite direction of the gradient, moving towards the minimum of the function. This process is repeated until a stopping criterion is met.
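As a toy illustration of that update rule, here is a hand-rolled Python sketch that fits a one-feature linear model by gradient descent on the mean squared error. The data, learning rate, and iteration count are made up for the example:

```python
# Fit y = w*x + b to three points by gradient descent on the
# mean squared error, mirroring how model parameters are updated.
data = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0)]   # generated from y = 2x + 1
n = len(data)

w, b, lr = 0.0, 0.0, 0.1
for _ in range(2000):
    # Gradient of L(w, b) = mean((w*x + b - y)^2) w.r.t. w and b
    gw = sum(2 * (w * x + b - y) * x for x, y in data) / n
    gb = sum(2 * (w * x + b - y) for x, y in data) / n
    # Update in the opposite direction of the gradient
    w -= lr * gw
    b -= lr * gb
```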

What are the different types of Gradient Descent optimizers?

There are several types of Gradient Descent optimizers, including Stochastic Gradient Descent (SGD), Mini-batch Gradient Descent, and Adam optimizer. Each optimizer has its own advantages and disadvantages, and the choice of optimizer can have a significant impact on the training process.
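For a flavor of what an optimizer changes, here is a minimal Python sketch of the heavy-ball momentum update (the test function and the hyperparameters lr and beta are illustrative choices, not any particular library's API):

```python
def momentum_step(grad, x, v, lr=0.1, beta=0.9):
    """One heavy-ball (momentum) update: the velocity v accumulates
    past gradients, smoothing the trajectory compared with plain SGD."""
    v = beta * v - lr * grad(x)   # decay old velocity, add new gradient
    return x + v, v

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3).
x, v = 10.0, 0.0
for _ in range(300):
    x, v = momentum_step(lambda z: 2 * (z - 3), x, v)
```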

What are the challenges of using Gradient Descent algorithm?

Some of the challenges of using the Gradient Descent algorithm include choosing an appropriate learning rate, dealing with local minima, and handling large datasets. In addition, the algorithm can be computationally expensive, especially for complex models with many parameters.

How can the performance of Gradient Descent algorithm be improved?

The performance of the Gradient Descent algorithm can be improved by using techniques such as learning rate scheduling, momentum, and adaptive learning rates. Additionally, using batch normalization and regularization techniques can help prevent overfitting and improve the generalization of the model.
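A minimal Python sketch of one such technique, learning rate scheduling (inverse-time decay here; the decay constants and the quadratic test function are arbitrary):

```python
def lr_schedule(k, lr0=0.2, decay=0.01):
    """Inverse-time decay: the learning rate shrinks as iterations proceed."""
    return lr0 / (1 + decay * k)

# Gradient descent on f(x) = x^2 (gradient 2x) with a decaying step size.
x = 10.0
for k in range(400):
    x -= lr_schedule(k) * 2 * x
```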
