# Gradient descent algorithm and optimizers

fog37
TL;DR Summary
understand what optimizers are in relation to the gradient descent algorithm
Hello,

To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for $x$. This is the classic, analytic approach to finding the minimum, and it works if the function is differentiable and if it has not too many parameters (how many?)

An alternative to setting the gradient equal to zero is to still use the gradient of the function, via the gradient descent method: to minimize the function, we follow the negative of the gradient and go in the direction of steepest descent. We start at a random $x$ value, set a step size, and calculate the gradient at that $x$ value to get a new $x$ value. We repeat the process (finding a new $x$ value and calculating the gradient there) until the gradient becomes equal to zero (or almost). I think that is correct... So far so good.
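In Python, the loop I have in mind looks roughly like this (a toy sketch of my own, not any library's implementation):

```python
import numpy as np

def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Follow the negative gradient until it (almost) vanishes."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:   # gradient ~ 0: stationary point reached
            break
        x = x - step * g              # move in the direction of steepest descent
    return x

# Minimize f(x) = (x - 3)^2, whose gradient is 2(x - 3); the minimum is at x = 3.
x_min = gradient_descent(lambda x: 2 * (x - 3), x0=[0.0])
```

Is this the right picture?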

Q1: There are other iterative methods to find the minimum of a function, for example Newton's method or maximum likelihood estimation. Why choose gradient descent? What virtues does it have?

Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Also, gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not differentiable? The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"... What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?

Thank you for any input!

BvU
Hi,

For me this falls in the category 'impossible to answer'. Function optimization can fill a lifetime and more; there are bookcases full of literature on the subject. The way you casually write
fog37 said:
TL;DR Summary: understand what optimizers are in relation to the gradient descent algorithm

We start at a random ##\bf x ## value, set a step size
lets me suspect you have no experience yet. So keep reading!

Perhaps you can narrow things down a bit? And
fog37 said:
is good. But with a reference it gets a little bit better.

fog37 said:
if it has not too many parameters (how many?)
Depends on the function. In process simulation (chemical engineering) ##10^5## variables isn't outrageous (Speedup, gPROMS). Oh, and: ##\LaTeX## in MathJax is to be enclosed in double # for in-line math and in double $ for displayed math ##\ ##

FactChecker
These are all good questions. The general subject is nonlinear programming, either constrained or unconstrained.
fog37 said:
To find the minimum (global or local) of a function, we can set the function's gradient equal to zero and solve for $x$. This is the classic, analytic approach to find the minimum and it works if the function is differentiable and if it has not too many parameters (how many?)
If the objective function has only one variable but is of high order, then even that one variable can be difficult to handle analytically.
fog37 said:
Q1: There are other iterative methods to find the minimum of a function. For example, Newton's method or the Maximum likelihood estimation. Why choose the gradient descent? What virtues does it have?
There are tons of methods. Some use an analytically supplied gradient; some estimate the gradient from the function values at certain points. You cannot always assume that there is an analytical gradient. There are "pattern search" methods that still work in that case.
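For example, when only function values are available, the gradient can be estimated by central differences (a minimal sketch, not any particular package's routine):

```python
import numpy as np

def numerical_gradient(f, x, h=1e-6):
    """Estimate the gradient of f at x by central differences."""
    x = np.asarray(x, dtype=float)
    g = np.zeros_like(x)
    for i in range(x.size):
        e = np.zeros_like(x)
        e[i] = h                                 # perturb one coordinate at a time
        g[i] = (f(x + e) - f(x - e)) / (2 * h)   # symmetric difference quotient
    return g

# f(x, y) = x^2 + 3y has exact gradient (2x, 3).
g = numerical_gradient(lambda v: v[0]**2 + 3 * v[1], [1.0, 2.0])
```

Such estimates cost two function evaluations per variable, which is one reason methods that skip the gradient entirely can be attractive.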
fog37 said:
Q2: The gradient descent method has limitations. We really want to find the "global" minimum, but as soon as the gradient descent algorithm finds some point that is at a local minimum, it never escapes it. Are other iterative methods able to find the "global" minimum instead?
Yes. It is advisable to perform several optimizations beginning at different starting points. You have not mentioned constraints. Constraints can make corners in the feasible region that are dead ends and not nearly optimal.
You have also not mentioned any randomness in the function evaluation. There is a lot of randomness in the world. One approach is to start with a general, simple, overall shape of the objective function and use that to concentrate on certain areas in more detail (see response surface methodology).
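A minimal multi-start sketch (the objective here is a made-up one-variable example with one local and one global minimum):

```python
import numpy as np

def descend(grad, x, step=0.01, n_steps=5000):
    """Plain fixed-step gradient descent from a single starting point."""
    for _ in range(n_steps):
        x = x - step * grad(x)
    return x

# f(x) = x^4 - 3x^2 + x: local minimum near x ≈ 1.13, global minimum near x ≈ -1.30.
f = lambda x: x**4 - 3 * x**2 + x
grad = lambda x: 4 * x**3 - 6 * x + 1

# Run the descent from several starting points and keep the best end point.
starts = np.linspace(-2.0, 2.0, 9)
best = min((descend(grad, x0) for x0 in starts), key=f)
```

Starts on the right of the hump converge only to the local minimum; the others find the global one, and taking the best end point over all runs recovers it.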
fog37 said:
Also, the gradient descent only works if the function is differentiable everywhere. Are most real-life functions differentiable or not differentiable?
That probably depends on the subject area.
fog37 said:
The loss functions of neural networks may have millions of variables but are generally differentiable...

Q3: I have been reading about gradient descent "optimizers"... What are these optimizers? Are they variants of the gradient descent method where the step size is adaptive, i.e. it changes throughout the iterative process?
Most iterative algorithms adjust the step size as they run.
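For instance, plain gradient descent is often combined with a backtracking (Armijo) line search: start with a big step and halve it until it produces a sufficient decrease. A rough sketch (toy code, not any library's routine):

```python
import numpy as np

def gd_backtracking(f, grad, x0, tol=1e-8, max_iter=20_000):
    """Gradient descent whose step size is chosen anew by backtracking each iteration."""
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:      # gradient (almost) zero: stop
            break
        t = 1.0
        # Halve the step until the Armijo sufficient-decrease condition holds.
        while f(x - t * g) > f(x) - 0.5 * t * np.dot(g, g):
            t *= 0.5
        x = x - t * g
    return x

# Badly scaled quadratic f(x, y) = x^2 + 100 y^2 with its minimum at (0, 0):
# a single fixed step size would either diverge in y or crawl in x.
f = lambda z: z[0]**2 + 100 * z[1]**2
x_min = gd_backtracking(f, lambda z: np.array([2 * z[0], 200 * z[1]]), [1.0, 1.0])
```

The deep-learning "optimizers" you read about (momentum, RMSProp, Adam, ...) take this idea further: they adapt an effective per-coordinate step from running statistics of past gradients rather than from a line search.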

IMO, one quick way to get an overview of the subject would be to look through the optimization sections of a good R-language or MATLAB reference and see what they are talking about.

pbuk
Thing is, it's not just the number of variables that matters; it's that it is often difficult to find the zeros of gradient functions, and of functions in general. That's why you use methods such as gradient descent.
