# Newton's method vs gradient descent

## Main Question or Discussion Point

I'm working on a problem where I need to find minimum of a 2D surface. I initially coded up a gradient descent algorithm, and though it works, I had to carefully select a step size (which could be problematic), plus I want it to converge quickly. So, I went through immense pain to derive the Hessian and implemented Newton's method. What I'm finding is that while Newton's method can converge impressively fast (a handful of iterations), it has a much smaller range of convergence. The initial guess must be closer to the correct answer.

Is this a known thing, or might I be doing something wrong? I can make sense of what's happening by looking at the slope of my error surface (the thing I'm trying to minimize). What happens is that at a certain distance from the minimum, the slope rate "goes negative", thus the inclusion of the second derivative drives me further from the correct answer.

As an example of what I mean, consider a 1D case where I'm trying to find the minimum of f=cos(x) between 0 and 2pi (looks like a bowl, kind of). If I just look at the slope I can see that a gradient based approach would converge anywhere within [0,2*pi] - the slope is negative below pi (the answer) and positive above.

If I look at f'' though, I see that it goes negative below pi/2 and above 1.5*pi. Thus if I start outside of that region, f'/f'' drives me the wrong way.

I guess I'm just disappointed since I thought Newton's method was hands down better than gradient. I want to make sure I'm not missing something. Maybe this is just the price you pay for the faster convergence?

Off the top of my head, I can't think of a case where it would be beneficial to have the 1/f'' reverse the direction of movement. Would it make sense to watch for this case and revert to gradient descent when it happens? Why would you want to go uphill?

hotvette
Homework Helper
Newton's Method is famous for not working well if the starting point is 'far' from the solution. A pratical thing you can do is use the Levenberg-Marquardt method (i.e. add a small fixed value to the diagonal of the Hessian matrix). The larger the value the more it behaves like the gradient descent method. After a few iterations you should be able to reduce the adder (to zero) and get the great convergence expected with Newton's Method.

Thanks for the reply. In my case loading the diagonal really won't cut it since the diagonal terms can get quite negative at times. I found an article on Newton's method that stated flatly that the Hessian should be positive definite, so I'm putting that check into my code. That avoids flat out divergence, but barely positive definite Hessians are still allowing highly suboptimal updates to the estimate.

I'm working a hybrid gradient/Newton approach, just trying to determine when to switch to Newton. Switching early takes longer than just letting gradient descent coast all the way in, but if I wait until just the right time to switch - and use all or most of the Newton step - I can see significant improvement.

Some years back I had good success with the conjugate gradient method for a system of non linear differential equations for gas pipe networks.

The programs then were all in Fortran.

hotvette
Homework Helper
I'm working a hybrid gradient/Newton approach, just trying to determine when to switch to Newton. Switching early takes longer than just letting gradient descent coast all the way in, but if I wait until just the right time to switch - and use all or most of the Newton step - I can see significant improvement.
Here's a thought. Use gradient descent until Hessian is barely positive, then load the diagonals for a few iterations, then pure Newton. Loading the diagonal is a solution method that is in between gradient descent and Newton's method. The trick is knowing how much to add to the diagonal.

I think my hybrid is working pretty well. I'm reading in a textbook about an iterative least squares technique. You basically linearize around your current estimate and then do a full least squares solution to get a new update. It looks to me pretty much like a Kalman filter with all the dynamic state stuff stripped out (no new information is coming in and the solution is known to be fixed). Can anyone comment on how this compares to gradient/Newton type techniques? I'm going to code it up to check it out, but I can't seem to find much about it on the web.

hotvette
Homework Helper
Are you saying that the surface you are trying to minimize is really a nonlinear least squares problem?

Are you saying that the surface you are trying to minimize is really a nonlinear least squares problem?
Yes, I guess I am. I coded it up as an iterative non-linear least squares thing and found the requirements for accuracy of the initial estimate to be quite strict. Modifying the algorithm to use a step size allowed it to converge in some cases, but then it just looked like a bad version of gradient descent.

hotvette
Homework Helper
Hmmm, let me see if I have this straight:

1. You are trying to solve a nonlinear least squares problem

2. Steepest Descent method works but convergence is slow

3. Treating the problem as a general optimization problem (i.e. using true Hessian and solving Hx = -J) converges rapidly but has severe restrictions on starting point. By the way, I've never seen any literature suggesting this method for solving least squares problems, though I remember trying it once.

4. Hybrid version (2 & 3) works well

Your last post suggests that you've also tried a 4th approach (i.e. classic nonlinear least squares approach iteratively solving JTJx = -JT(f-y). The beauty of this approach is that you can avoid the actual computation of JTJ and JT(f-y) by using QR factorization of J and solving Rx = QT(f-y) instead. By the way, J is different than the J in Hx = -J)

So you've tried all 4 methods?

Last edited:
Let's make sure my terminology is right regarding nonlinear least squares.

I have two non-linear equations: f1(u,v)=a, f2(u,v)=b and I want to solve for (u,v). I'm setting it up as a weighted MMSE problem; minimize w1*(f1-a)^2 + w2*(f2-b)^2 over (u,v).

Note that I can't generally rely on having an initial estimate close to the right answer.

I have tried:

1) Gradient descent. Converges over a wide range, but selecting step size is kludgy. I get my direction and search for the furthest I can go while still seeing improvement.

2) Newton's method with full computed Hessian. Converges rapidly and elegantly when it converges. Has a much smaller region of convergence than gradients.

3) Hybrid. Use gradient descent until Hessian is positive definite then switch over to Newton. Works.

4) Iterative nonlinear least squares (as you describe). Very narrow region of convergence. Practical implementation would require setting down a grid of initial estimates and seeing which ones converged (to the same place or to someplace reasonable).

Incidentally, I have seen people smarter than me using Newton's method for just this type of thing, so I know I'm not totally in left field!

hotvette
Homework Helper
I have two non-linear equations: f1(u,v)=a, f2(u,v)=b and I want to solve for (u,v)
I'm not familiar with the weighted approach. The standard way I've seen for solving simultaneous nonlinear equations (f1=0, f2=0, ..., fn=0) is to iteratively solve Jx = -F. I remember solving a set of 7 simultaneous nonlinear equations this way in a uni class.

I'd love to play with it. Can you tell me the function details either here or PM?

Last edited:
What computer environment (eg Matlab) are you using for this?

I'm prototyping in Matlab. It would eventually end up in C.

hotvette, I'm sorry but I don't think I should provide details on the project. It's not really my project after all. It would be pretty complicated to explain in any case.

Have you looked on the Mathworld website for Matlab resources?

http://www.mathworks.com/access/helpdesk/help/toolbox/optim/ug/brn4nq6.html [Broken]

Last edited by a moderator:
hotvette
Homework Helper
I'm sorry but I don't think I should provide details on the project
No worries. I just like challenges . Unless there is very good reason to take the weighted MMSE approach, you might consider the more straightforward method I mentioned for solving simultaneous nonlinear equations.

No worries. I just like challenges . Unless there is very good reason to take the weighted MMSE approach, you might consider the more straightforward method I mentioned for solving simultaneous nonlinear equations.
Solving it with weighting is not really much tougher. I just pulled the equations out of a textbook. Mostly the same except with some W's floating around. The main issue was just the requirement for a good initial guess compared to the other methods.

Thanks for the help.