Failure of Optimisation for Nonlinear Equation Systems


Discussion Overview

The discussion revolves around the challenges of solving systems of nonlinear equations, particularly focusing on the limitations of optimization methods like Newton-Raphson and gradient descent. Participants explore the theoretical underpinnings and practical implications of these methods, seeking to understand why certain approaches may fail or succeed in specific contexts.

Discussion Character

  • Exploratory
  • Technical explanation
  • Debate/contested
  • Mathematical reasoning

Main Points Raised

  • One participant expresses difficulty in finding literature that explicitly states why minimizing a scalarized function ##f(\vec{X})## may not suffice for solving nonlinear equations, suggesting that the structure of ##\vec{F}(\vec{X})## is lost in the process.
  • Another participant mentions that the subject falls under "quadratic programming" and highlights the ongoing issue of local minima in optimization, referencing convexification as a current research approach.
  • Some participants discuss the use of existing tools like MATLAB and R for optimization, emphasizing the importance of understanding the algorithms used and their documentation.
  • A participant contrasts the Newton-Raphson method with gradient descent, noting that while N-R can be faster in certain quadratic regions, gradient descent may struggle in ill-conditioned scenarios, leading to zig-zagging behavior.
  • There is a discussion about the differences in the directions of updates produced by N-R and gradient descent, with one participant seeking further intuitive understanding of these differences.
  • Concerns are raised about the convergence speed of gradient descent in distorted objective functions, which may mislead the algorithm and slow down the optimization process.

Areas of Agreement / Disagreement

Participants generally agree on the complexities involved in solving nonlinear equations and the limitations of various optimization methods. However, there are competing views on the effectiveness of these methods in different scenarios, and the discussion remains unresolved regarding the best approach to take.

Contextual Notes

Participants acknowledge that the performance of optimization methods can depend heavily on the specific characteristics of the objective functions involved, such as their shape and dimensionality. There is also mention of the need for further exploration of literature and practical examples to clarify these issues.

aphirst
I wasn't sure into which category I should post this, so feel free to move it into a more appropriate place.

As part of my work I'm solving a system of nonlinear equations, of a usual form:
$$\vec{F}(\vec{X})=\begin{pmatrix}F_1(X_1, X_2, \cdots X_N) \\ F_2(\cdots) \\ \vdots \\ F_N(\cdots)\end{pmatrix}=\vec{0}$$

There's lots of literature about solving systems of nonlinear equations, and lots of discussion about "how Newton-Raphson methods can fail" and "how to counter it", the usual remedy being to combine the N-R step with a gradient-descent step on a scalarised version of the function, typically something like

$$f(\vec{X}) = ||\vec{F}(\vec{X})||^2$$

However, I'm struggling to find any literature which makes it explicit (either with proofs or worked examples; 2D would probably be the most useful) that you can't just minimise ##f(\vec{X})## in the first place, and that it becomes necessary to move to methods like N-R which explicitly handle the dimensionality and simultaneity. I know that it doesn't tend to work, and I would usually hand-wave it away by making reference to the extra structure of ##\vec{F}(\vec{X})## which is lost by "flattening" things down.
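As a minimal 1D sketch of the kind of failure being described (the test function, step size and starting point below are arbitrary illustrative choices, not taken from any reference): gradient descent on ##f(x) = \tfrac{1}{2}F(x)^2## can converge to a point where ##F'(x) = 0## but ##F(x) \neq 0##, i.e. a local minimum of ##f## that is not a root of ##F##.

```python
# Illustrative 1D "system": F(x) = x**3 - 2*x + 2, whose only real root is near x = -1.77.
# The scalarised f(x) = 0.5*F(x)**2 has a local minimum near x = +0.816 (where F'(x) = 0)
# but F(0.816) = 0.91, so pure descent on f can stall at a point that is not a root.
F  = lambda x: x**3 - 2*x + 2
dF = lambda x: 3*x**2 - 2

x   = 1.0      # arbitrary starting point to the right of the spurious minimum
lam = 0.05     # arbitrary fixed step size
for _ in range(200):
    x -= lam * F(x) * dF(x)    # gradient descent on f, since f'(x) = F(x) * F'(x)

print(x, F(x))  # approx. 0.816 and 0.91: a local minimum of f, but not a root of F
```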

If I have some free time this week, I intend to construct a pair of explicit functions of 2 variables which definitely have regions where they cross F=0, plot contours of both functions along with their L2-norm combination, and then graph some N-R and Grad-Desc directions at some sample x,y values. I hope that doing so will make it more explicit that in some cases the ##f(\vec{X})## approach leads you astray in regions where methods working directly with ##\vec{F}(\vec{X})## behave well.
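In that spirit, here is a rough sketch of the direction comparison (the example system and the sample points are arbitrary illustrative choices); the same arrays could be fed into a contour/quiver plot:

```python
import numpy as np

# Illustrative 2-by-2 system: F1 = x^2 + y^2 - 4, F2 = x*y - 1 (circle and hyperbola).
def F(v):
    x, y = v
    return np.array([x**2 + y**2 - 4.0, x*y - 1.0])

def J(v):
    x, y = v
    return np.array([[2.0*x, 2.0*y],
                     [    y,     x]])

# Compare the two step directions at sample points away from |x| = |y| (where J is singular).
for v in [np.array([1.5, 0.5]), np.array([0.3, 2.5]), np.array([-1.0, -0.5])]:
    Fv, Jv  = F(v), J(v)
    step_nr = np.linalg.solve(Jv, -Fv)   # Newton-Raphson step: -J^{-1} F
    dir_gd  = -Jv.T @ Fv                 # gradient-descent direction: -grad f = -J^T F
    cos_t   = step_nr @ dir_gd / (np.linalg.norm(step_nr) * np.linalg.norm(dir_gd))
    print(v, step_nr, dir_gd, cos_t)
```

The angle between the two directions changes from point to point, which is exactly the kind of behaviour the planned contour plots should make visible.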

But until then, if anyone here happens to know of any relevant literature, I'd really appreciate some citations or links.
 
This subject is called "quadratic programming" within the general subject of "nonlinear programming". My references are old (Hadley, Nonlinear and Dynamic Programming) but I'm sure that the basics have not changed. Google quadratic programming to find references. You should be aware that there is no fool-proof solution to the problem of local minima. A current research attempt to address the local-minimum problem is called convexification. I have had mixed success with it.
My advice is to use existing quadratic programming tools like those available in MATLAB or R and run several optimizations from different starting points. That may give you several local minima. Hopefully one of those is the global minimum.
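A hedged sketch of that multistart strategy, using scipy.optimize.minimize as a stand-in for the MATLAB/R tools (the scalarised objective below is just a placeholder, and BFGS is only one possible choice of local minimiser):

```python
import numpy as np
from scipy.optimize import minimize

# Placeholder scalarised objective f(X) = 0.5*||F(X)||^2 for some nonlinear system F.
def f(X):
    F1 = X[0]**2 + X[1]**2 - 4.0
    F2 = X[0]*X[1] - 1.0
    return 0.5 * (F1**2 + F2**2)

rng    = np.random.default_rng(0)
starts = rng.uniform(-3.0, 3.0, size=(10, 2))    # several random starting points

results = [minimize(f, x0, method='BFGS') for x0 in starts]
best    = min(results, key=lambda r: r.fun)      # keep the lowest local minimum found
print(best.x, best.fun)  # hopefully the global minimum; f = 0 there would mean a root of F
```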
 
I am of course using existing tools for solving my actual problems, my question was more to ensure that I have a thorough understanding of what I'm doing and why I'm using the tools I am, and why they do what they do, along with the ability to provide comprehensive arguments and citations.

Thanks, I'll look in that direction, and report back with what I find.
 
aphirst said:
I am of course using existing tools for solving my actual problems,
Good. I just mentioned it to make sure.
my question was more to ensure that I have a thorough understanding of what I'm doing and why I'm using the tools I am, and why they do what they do, along with the ability to provide comprehensive arguments and citations.
I think you will find that the tools in major packages like MATLAB or R are reasonably well documented and respected. I would start there since they will address the particular algorithm you are using.
 
FactChecker said:
I think you will find that the tools in major packages like MATLAB or R are reasonably well documented and respected. I would start there.

I think your comment on N-R being a case of quadratic programming has already set me in the right direction. From what I'm seeing so far, it's never so much that you "need N-R to solve the higher dimensional problem" but rather that "if you know you're in a region of essentially quadratic behaviour, and you have an N-by-N system, N-R is another option available to you which is going to get you to the root really, really fast"; i.e. there's a balance to be struck between robustness and speed.

In terms of algorithms I already have a huge body of literature at my disposal (I'd be here all day if I started listing, but an excellent general text so far has been "Numerical Optimization" by Nocedal and Wright), and for the nonlinear equation system solution methods I have a few things including "Numerical Methods for Nonlinear Algebraic Equations" by P. Rabinowitz, which also covers the hybrid Powell method in great detail. It and the other algorithms in MINPACK (i.e. HYBRJ, LMDER, and their Jacobian-less variants) seem to be the ones most referenced by documentation from MATLAB, SciPy, GSL, et cetera, though for "raw" optimisation I have found the NLopt wiki to also be very interesting.
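For what it's worth, those MINPACK routines are what SciPy exposes through scipy.optimize.root: method='hybr' calls the Powell hybrid code (HYBRD/HYBRJ) and method='lm' the MINPACK Levenberg-Marquardt code (LMDER/LMDIF). A minimal sketch with a placeholder system and starting point:

```python
import numpy as np
from scipy.optimize import root

# Placeholder system; substitute the real F and, if available, its analytic Jacobian.
def F(X):
    return np.array([X[0]**2 + X[1]**2 - 4.0, X[0]*X[1] - 1.0])

def jac(X):
    return np.array([[2.0*X[0], 2.0*X[1]],
                     [    X[1],     X[0]]])

sol_hybr = root(F, [1.5, 0.5], jac=jac, method='hybr')  # Powell hybrid (MINPACK HYBRJ)
sol_lm   = root(F, [1.5, 0.5], jac=jac, method='lm')    # Levenberg-Marquardt (MINPACK LMDER)
print(sol_hybr.x, sol_hybr.success)
print(sol_lm.x, sol_lm.success)
```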
 
I would like to come back to this topic, as this is becoming a subsection of my thesis work, so it's of utmost importance that I can address the topic concisely and correctly.

Let us continue to suppose that we have a function ##\vec{F}(\vec{X})## where both are of length (dimension) ##N##, and also a scalarised version ##f(\vec{X}) = \frac{1}{2}\vec{F}(\vec{X}) \cdot \vec{F}(\vec{X})## (factor of 1/2 for later convenience).

Newton-Raphson
For the first step (subsequent steps use ##n## and ##n+1## but the expression is the same):
$$\Delta \vec{X}_{NR} \equiv \vec{X}_1 - \vec{X}_0 = - \mathbf{J}^{-1}(\vec{X_0}) \vec{F}(\vec{X_0})~,$$
where the matrix ##\mathbf{J}(\vec{X}) \equiv \left( \frac{\partial F_i(\vec{X})}{\partial X_j} \right)## (shorthand).

Gradient-Descent
$$\Delta \vec{X}_{GD} \equiv \vec{X}_1 - \vec{X}_0 = - \lambda \nabla f(\vec{X}_0)~,$$
where ##\lambda## is the step-size (must be heuristically determined, let's ignore this for now and focus just on the direction), and where the gradient vector ##\nabla f(\vec{X}) \equiv \left(\frac{\partial f(\vec{X})}{\partial X_i}\right)## (shorthand).

If I apply this explicitly to a 2-by-2 system of equations, I get the following (after some messy working):

$$\Delta \vec{X}_{NR} = \frac{-1}{\frac{\partial F_1}{\partial X_1}\frac{\partial F_2}{\partial X_2} - \frac{\partial F_1}{\partial X_2}\frac{\partial F_2}{\partial X_1}} \begin{pmatrix} F_1 \frac{\partial F_2}{\partial X_2} - F_2 \frac{\partial F_1}{\partial X_2} \\ F_2 \frac{\partial F_1}{\partial X_1} - F_1 \frac{\partial F_2}{\partial X_1} \end{pmatrix}$$

$$\Delta \vec{X}_{GD} = - \lambda \begin{pmatrix} F_1 \frac{\partial F_1}{\partial X_1} + F_2 \frac{\partial F_2}{\partial X_1} \\ F_1 \frac{\partial F_1}{\partial X_2} + F_2 \frac{\partial F_2}{\partial X_2} \end{pmatrix}$$

Ignoring the scaling factors, focusing just on the directions, it becomes very obvious that they are different (and that it will be even messier with more dimensions), but I'd welcome any further comments which might help make this more "intuitive". Meanwhile I'll also of course carry on thinking about it, and will post any thoughts as they come to me.
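As a quick cross-check of that messy working, here is a sketch that compares the hand-derived 2-by-2 component expressions with the generic matrix forms at an arbitrary point (the test functions and the evaluation point are arbitrary choices, and ##\lambda## is taken as 1):

```python
import numpy as np

# Arbitrary smooth test functions evaluated at an arbitrary point (X1, X2) = (0.7, -1.3):
#   F1 = sin(X1) + X2^2,   F2 = X1*X2 - cos(X2)
X1, X2 = 0.7, -1.3
F1, F2 = np.sin(X1) + X2**2, X1*X2 - np.cos(X2)
dF1 = np.array([np.cos(X1), 2.0*X2])          # [dF1/dX1, dF1/dX2]
dF2 = np.array([X2, X1 + np.sin(X2)])         # [dF2/dX1, dF2/dX2]

Fv = np.array([F1, F2])
Jv = np.vstack([dF1, dF2])

# Generic matrix forms
step_nr = -np.linalg.solve(Jv, Fv)            # -J^{-1} F
dir_gd  = -Jv.T @ Fv                          # -J^T F   (lambda = 1)

# Hand-derived 2-by-2 component expressions from above
det = dF1[0]*dF2[1] - dF1[1]*dF2[0]
step_nr_manual = (-1.0/det) * np.array([F1*dF2[1] - F2*dF1[1],
                                        F2*dF1[0] - F1*dF2[0]])
dir_gd_manual  = -np.array([F1*dF1[0] + F2*dF2[0],
                            F1*dF1[1] + F2*dF2[1]])

print(np.allclose(step_nr, step_nr_manual), np.allclose(dir_gd, dir_gd_manual))  # True True
```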
 
The gradient-descent approach has a speed problem converging to a minimum in an ill-conditioned situation. If the objective function is a long, narrow valley, gradient descent tends to zig-zag across the narrow dimension like a sewing machine, only slowly traveling along the long dimension toward the minimum. More advanced algorithms first rescale the long, narrow valley into something closer to a circle, so that the direction of each iterative step points more directly at the minimum.
[Attached figures: contour plots illustrating the zig-zag weakness of steepest descent on an elongated objective.]


PS. If the objective function really is a simple ellipse, the steepest descent would take a large second step toward the minimum and be fast. But any distortion of the objective function can fool the steepest descent algorithm and make it very slow.
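A small sketch of that zig-zag behaviour on an ill-conditioned quadratic valley (the 100:1 conditioning and the starting point are arbitrary illustrative choices), using exact line searches so the slowness comes from the geometry rather than from a badly tuned step size:

```python
import numpy as np

# Narrow quadratic valley: f(x) = 0.5 * x^T A x with condition number 100.
A = np.diag([1.0, 100.0])
x = np.array([100.0, 1.0])       # a near-worst-case starting direction for this A

for _ in range(100):
    g = A @ x                               # gradient of f
    alpha = (g @ g) / (g @ (A @ g))         # exact line search along -g
    x = x - alpha * g                       # steepest-descent step: zig-zags across the valley

print(x)  # still of order 10 in the long direction after 100 steps
# By contrast, the Newton step -A^{-1} g lands on the exact minimum x = 0 in one iteration.
```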
 
In light of @FactChecker 's post about the properties of objective functions, I feel the need to re-emphasise that my ultimate goal is not merely to minimise a merit function, but actually to solve a system of nonlinear equations, ##\mathbf{F}(\mathbf{X})=\mathbf{0}##. This is important to state, because the fields of "minimisation" and "equation solving / root-finding" share terminology without sharing meaning: "Newton's Method" for equation systems uses first derivatives, while for minimisation it uses second derivatives; and when talking about a "merit function" for an equation system, one expects/aims for a minimum of exactly zero, corresponding to the root.

With that out of the way:

Instead of talking in terms of an explicit 2-by-2 system as I did in my previous post, I decided to try to improve my intuition on the concepts of these step directions by using vector calculus to compute the (cosine of the) angle between them, i.e. between the Newton-Raphson step direction ##\Delta \mathbf{X}_{NR} \equiv -\mathbf{J}^{-1} \mathbf{F}## and the Gradient-Descent direction ##\Delta \mathbf{X}_{GD} \equiv -\nabla f##.

Recalling that ##\nabla f = \nabla \left( \tfrac{1}{2} \mathbf{F}^T \mathbf{F} \right) = \mathbf{J}^T \mathbf{F}##:

$$\begin{align}\cos(\theta) &= \frac{(\Delta \mathbf{X}_{NR})^T (\Delta \mathbf{X}_{GD})}{||\Delta \mathbf{X}_{NR}|| ~ ||\Delta \mathbf{X}_{GD}|| } \\
&= \frac{(\mathbf{J}^{-1} \mathbf{F})^T (\nabla f)}{||\mathbf{J}^{-1} \mathbf{F}||~ ||\nabla f|| } = \frac{(\mathbf{J}^{-1} \mathbf{F})^T(\mathbf{J}^T \mathbf{F})}{||\mathbf{J}^{-1} \mathbf{F}||~||\mathbf{J}^T \mathbf{F}|| } \\
&= \frac{\mathbf{F}^T \mathbf{F}}{||\mathbf{J}^{-1} \mathbf{F}||~||\mathbf{J}^T \mathbf{F}|| } \\
\Rightarrow \cos(\theta) &> 0 \Leftrightarrow \frac{-\pi}{2} < \theta < \frac{\pi}{2}\end{align}$$

i.e. the Newton-Raphson step is a "descent direction" for ##f##, to use the term from Nocedal & Wright (see Sections 10.3 and 11.2).

I would appreciate it if someone could verify my logic, or cite any necessary properties of the terms in the denominator when moving from the 3rd line to the 4th, as I'm not confident I can fully justify it if pressed. With this, I should be able to finally complete my understanding of the relationship between these different step directions.

EDIT: I think I have it, namely that I was forgetting the Cauchy-Schwarz inequality ##|\langle \mathbf{u},\mathbf{v}\rangle | \leq ||\mathbf{u}||~||\mathbf{v}||##. This gives:
$$\begin{align}\cos(\theta) &\leq \frac{\mathbf{F}^T \mathbf{F}}{(\mathbf{J}^{-1}\mathbf{F})^T (\mathbf{J}^T \mathbf{F})} = \frac{\mathbf{F}^T \mathbf{F}}{\mathbf{F}^T \mathbf{F}} = 1 \\
\Rightarrow 0 < \cos(\theta) &\leq 1\end{align}$$
with ##\cos(\theta)>0## following trivially from the form of line (3), except when ##\mathbf{F} = 0##. That isn't a huge problem, since it would mean we're already at the root, where notions of descent or of a Newton-Raphson step are no longer especially meaningful.
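A quick randomised spot check of that conclusion (this is numerical evidence, not a proof; random matrices and vectors stand in for an actual Jacobian and residual):

```python
import numpy as np

rng = np.random.default_rng(1)
lo, hi = np.inf, -np.inf

for _ in range(10000):
    n = int(rng.integers(2, 6))
    J = rng.standard_normal((n, n))      # random Jacobian (nonsingular with probability 1)
    F = rng.standard_normal(n)           # random residual vector, F != 0
    d_nr = -np.linalg.solve(J, F)        # Newton-Raphson step
    d_gd = -J.T @ F                      # gradient-descent direction for f = 0.5*||F||^2
    c = d_nr @ d_gd / (np.linalg.norm(d_nr) * np.linalg.norm(d_gd))
    lo, hi = min(lo, c), max(hi, c)

print(lo, hi)  # both values should fall inside (0, 1], as derived above
```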
 
Yet further analysis raises the following points about the possible extreme cases of ##\cos(\theta)##:

1. ##\Delta \mathbf{X}_{GD} \perp \Delta \mathbf{X}_{NR} \Leftrightarrow \cos(\theta) = 0~,## which can only occur either for ##\mathbf{F}=0##, or when ##||\mathbf{J}^{-1}\mathbf{F}||\rightarrow \infty## or ##||\mathbf{J}^{T}\mathbf{F}||\rightarrow \infty##. When can either of these occur, and what would be the most "useful" interpretation of this situation in the context of an equation system and its functional behaviour around the current ##\mathbf{X}##?

2. ##\Delta \mathbf{X}_{GD} \parallel \Delta \mathbf{X}_{NR} \Leftrightarrow \cos(\theta) = 1~,## which seems to occur only when ##\mathbf{J}^{-1} = \lambda \mathbf{J}^T##. I'm aware that this is equivalent to the Jacobian being a scalar multiple of an orthogonal matrix, but under what circumstances can this arise for an equation system, and again, how could this situation be qualitatively described/interpreted in that context?

Reading further into this, a matrix is orthogonal if its rows and columns respectively form orthonormal bases. In the context of equation systems, would this be equivalent to saying that the equation system at ##\mathbf{X}## behaves essentially linearly there (i.e. no nonlinearity whatsoever, locally), up to a coordinate-system rotation (and, given the factor ##\lambda##, a uniform scaling)?
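As a small illustration of that interpretation (the rotation angle, offset and evaluation point are arbitrary choices): for an affine system ##\mathbf{F}(\mathbf{X}) = \mathbf{Q}\mathbf{X} + \mathbf{c}## with ##\mathbf{Q}## a rotation, the Jacobian is orthogonal everywhere, the two directions coincide, and the N-R step reaches the root in one iteration.

```python
import numpy as np

theta = 0.3                                       # arbitrary rotation angle
Q = np.array([[np.cos(theta), -np.sin(theta)],    # rotation matrix, so Q^{-1} = Q^T
              [np.sin(theta),  np.cos(theta)]])
c = np.array([1.0, -2.0])

F = lambda X: Q @ X + c                           # affine system; its Jacobian is Q everywhere
X = np.array([0.5, 0.7])                          # arbitrary evaluation point

d_nr = -np.linalg.solve(Q, F(X))                  # Newton-Raphson direction: -Q^{-1} F
d_gd = -Q.T @ F(X)                                # gradient-descent direction: -Q^T F
cos_t = d_nr @ d_gd / (np.linalg.norm(d_nr) * np.linalg.norm(d_gd))

print(cos_t)                                      # 1.0 (up to rounding): the directions are parallel
print(np.allclose(F(X + d_nr), 0.0))              # True: the N-R step lands exactly on the root
```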
 