I'm working on a problem where I need to find the minimum of a 2D surface. I initially coded up a gradient descent algorithm, and though it works, I had to carefully select a step size (which could be problematic), and I'd also like faster convergence. So I went through immense pain to derive the Hessian and implemented Newton's method. What I'm finding is that while Newton's method can converge impressively fast (a handful of iterations), it has a much smaller region of convergence: the initial guess must be closer to the correct answer. Is this a known thing, or might I be doing something wrong?

I can make sense of what's happening by looking at the slope of my error surface (the thing I'm trying to minimize). At a certain distance from the minimum, the second derivative goes negative, so including it in the update drives me further from the correct answer.

As an example of what I mean, consider a 1D case where I'm trying to find the minimum of f(x) = cos(x) between 0 and 2\*pi (which looks, roughly, like a bowl). If I just look at the slope, I can see that a gradient-based approach would converge from anywhere within [0, 2\*pi]: the slope f'(x) = -sin(x) is negative below pi (the answer) and positive above it. If I look at f''(x) = -cos(x), though, I see that it goes negative below pi/2 and above 1.5\*pi. So if I start outside that region, the Newton step f'/f'' drives me the wrong way.

I guess I'm just disappointed, since I thought Newton's method was hands-down better than gradient descent. I want to make sure I'm not missing something. Maybe this is just the price you pay for the faster convergence? Off the top of my head, I can't think of a case where it would be beneficial to have the 1/f'' factor reverse the direction of movement. Would it make sense to watch for this case and revert to gradient descent when it happens? Why would you ever want to go uphill?
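For concreteness, here's a minimal sketch of the 1D example above (the function names are mine, not from any library). It contrasts a Newton iteration started inside the positive-curvature region (pi/2, 3\*pi/2) with a single step taken from outside it:

```python
import math

# Minimal sketch of the 1D example: minimize f(x) = cos(x) on [0, 2*pi].
# f'(x) = -sin(x), f''(x) = -cos(x); the minimum is at x = pi.

def grad(x):
    return -math.sin(x)

def hess(x):
    return -math.cos(x)

def newton_step(x):
    # Newton's update x - f'(x)/f''(x); only a descent step where f''(x) > 0.
    return x - grad(x) / hess(x)

# Start inside (pi/2, 3*pi/2), where f'' > 0: converges in a few iterations.
x = 2.5
for _ in range(10):
    x = newton_step(x)
print(x)  # ~3.141592653589793 (pi)

# Start where f'' < 0: the step points AWAY from pi, toward the maximum at 0.
print(newton_step(0.4))  # ~ -0.0228, i.e. past the local maximum
```

This matches the description above: from x = 2.5 the iteration homes in on pi very quickly, but from x = 0.4 (below pi/2, where f'' < 0) the very first step moves away from the minimum.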
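And here is one common safeguard along the lines suggested at the end of the question: take the Newton step only when f'' > 0, fall back to a gradient step otherwise, and backtrack so that no step is ever allowed to increase f. This is a sketch of the general idea (production optimizers use proper line searches or trust regions), with names and the `lr` parameter chosen by me for illustration:

```python
import math

def f(x):   return math.cos(x)
def df(x):  return -math.sin(x)
def d2f(x): return -math.cos(x)

def safeguarded_newton(x, iters=100, lr=0.5):
    for _ in range(iters):
        g, h = df(x), d2f(x)
        if h > 0:
            step = -g / h    # Newton direction: curvature is positive here
        else:
            step = -lr * g   # f'' <= 0: fall back to a plain gradient step
        # Backtracking: halve the step until f actually decreases,
        # so we never accept an uphill move.
        t = 1.0
        while f(x + t * step) > f(x) and t > 1e-12:
            t *= 0.5
        x += t * step
    return x

# Pure Newton from x = 0.4 heads toward the maximum at 0; the safeguarded
# version still reaches the minimum at pi.
print(safeguarded_newton(0.4))  # ~3.14159...
```

So yes, watching for negative curvature and reverting to a gradient (descent) direction is a standard and sensible thing to do; the backtracking step also tames Newton steps that are nominally downhill but far too long.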