Failure of Optimisation for Nonlinear Equation Systems

In summary, you can attempt to solve a system of nonlinear equations by minimising the scalar merit function ##f(\vec{X})##, but you can also attack the system ##\vec{F}(\vec{X})=\vec{0}## directly with methods like N-R. Which approach works better may depend on the region of the function in which you are working.
  • #1
aphirst
I wasn't sure into which category I should post this, so feel free to move it into a more appropriate place.

As part of my work I'm solving a system of nonlinear equations of the usual form:
$$\vec{F}(\vec{X})=\begin{pmatrix}F_1(X_1, X_2, \cdots X_N) \\ F_2(\cdots) \\ \vdots \\ F_N(\cdots)\end{pmatrix}=\vec{0}$$

There's lots of literature about solving systems of nonlinear equations, and lots of discussion about "how Newton-Raphson methods can fail" and "how to counter it", the usual remedy being to combine the N-R step with a gradient-descent step, obtained by scalarising the system into something like

$$f(\vec{X}) = ||\vec{F}(\vec{X})||^2$$

However, I'm struggling to find any literature which makes it explicit (either with proofs or worked examples; 2D would probably be the most useful) that you can't simply minimise ##f(\vec{X})## in the first place, and that it becomes necessary to move to methods like N-R which explicitly handle the dimensionality and simultaneity. I know from experience that the naive approach doesn't tend to work, and I would usually hand-wave it away by referring to the extra structure of ##\vec{F}(\vec{X})## which is lost by "flattening" things down.

If I have some free time this week, I intend to construct a pair of explicit functions of 2 variables which definitely have regions where they cross ##F=0##, produce some contour plots of both functions along with their L2-norm combination, and then graph some N-R and gradient-descent directions at some sample ##(x, y)## values. I hope that doing so will make it more explicit that in some cases the ##f(\vec{X})## approach will lead you astray where the ##\vec{F}(\vec{X})## approach behaves well.
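A minimal sketch of this kind of experiment, in Python with NumPy and Matplotlib (the particular test system, sample points, and plotting ranges below are arbitrary illustrative choices, not anything fixed by the discussion):

```python
# Sketch of the proposed 2-variable experiment; the test system is an
# arbitrary choice for illustration only.
import numpy as np
import matplotlib.pyplot as plt

def F(X):
    """Example nonlinear system F(X) = 0 with X = (x, y)."""
    x, y = X
    return np.array([x**2 + y**2 - 4.0,    # circle of radius 2
                     np.exp(x) + y - 1.0])  # exponential curve

def J(X):
    """Analytic Jacobian dF_i/dX_j of the example system."""
    x, y = X
    return np.array([[2.0 * x,   2.0 * y],
                     [np.exp(x), 1.0]])

def f(X):
    """Scalarised merit function f = 1/2 ||F||^2."""
    FX = F(X)
    return 0.5 * FX @ FX

def newton_direction(X):
    return -np.linalg.solve(J(X), F(X))

def gradient_descent_direction(X):
    return -J(X).T @ F(X)   # -grad f, since grad f = J^T F

# Contour plot of the merit function, with both directions drawn at sample points.
xs, ys = np.meshgrid(np.linspace(-3, 3, 200), np.linspace(-3, 3, 200))
fs = np.array([[f((x, y)) for x, y in zip(rx, ry)] for rx, ry in zip(xs, ys)])
plt.contour(xs, ys, fs, levels=30)
for X0 in [(-2.5, 1.0), (1.5, 2.0), (0.5, -2.5)]:
    plt.quiver(*X0, *newton_direction(np.array(X0)), color="red", angles="xy")
    plt.quiver(*X0, *gradient_descent_direction(np.array(X0)), color="blue", angles="xy")
plt.title("Merit-function contours; N-R (red) vs gradient-descent (blue) directions")
plt.show()
```

Plotting both families of arrows on the same contour plot makes any difference in direction visible at a glance.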

But until then, I'd really appreciate it if anyone here who happens to know of any relevant literature could provide some citations or links.
 
  • #2
This subject is called "quadratic programming" in the general subject of "nonlinear programming". My references are old (Hadley, Nonlinear and Dynamic Programming) but I'm sure that the basics have not changed. Google "quadratic programming" to find references. You should be aware that there is no foolproof solution to the problem of local minima. A current research approach to the local-minimum problem is called convexification; I have had mixed success with it.
My advice is to use existing quadratic programming tools like those available in MATLAB or R and run several optimizations from different starting points. That may give you several local minima; hopefully one of them is the global minimum.
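A rough sketch of this multi-start idea, using SciPy's general-purpose minimiser rather than MATLAB or R (the non-convex merit function below is a made-up stand-in for the reader's own):

```python
# Multi-start local minimisation: run the optimiser from several random
# starting points and keep the best local minimum found.
import numpy as np
from scipy.optimize import minimize

def merit(X):
    """Example non-convex merit function with several local minima."""
    x, y = X
    return np.sin(3.0 * x)**2 + (x**2 + y**2 - 1.0)**2

rng = np.random.default_rng(0)
best = None
for _ in range(20):                         # 20 random starting points
    x0 = rng.uniform(-2.0, 2.0, size=2)
    res = minimize(merit, x0, method="BFGS")
    if best is None or res.fun < best.fun:
        best = res

print("best local minimum found:", best.x, "value:", best.fun)
```

Whichever local minimum turns out lowest is kept, with no guarantee that it is the global one.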
 
  • #3
I am of course using existing tools for solving my actual problems; my question was more to ensure that I have a thorough understanding of what I'm doing, why I'm using the tools I am, and why they do what they do, along with the ability to provide comprehensive arguments and citations.

Thanks, I'll look in that direction, and report back with what I find.
 
  • #4
aphirst said:
I am of course using existing tools for solving my actual problems,

Good. I just mentioned it to make sure.

aphirst said:
my question was more to ensure that I have a thorough understanding of what I'm doing and why I'm using the tools I am, and why they do what they do, along with the ability to provide comprehensive arguments and citations.

I think you will find that the tools in major packages like MATLAB or R are reasonably well documented and respected. I would start there, since they will address the particular algorithm you are using.
 
  • #5
FactChecker said:
I think you will find that the tools in major packages like MATLAB or R are reasonably well documented and respected. I would start there.

I think your comment on N-R being a case of quadratic programming has already set me in the right direction. From what I'm seeing so far, it's not so much that you "need N-R to solve the higher-dimensional problem", but rather that "if you know you're in a region of essentially quadratic behaviour, and you have an N-by-N system, N-R is another option available to you which is going to get you to the root really, really fast"; i.e. there's a balance to be struck between robustness and speed.

In terms of algorithms I already have a huge body of literature at my disposal (I'd be here all day if I started listing, but an excellent general text so far has been "Numerical Optimization" by Nocedal and Wright), and for nonlinear equation-system solution methods I have a few things, including "Numerical Methods for Nonlinear Algebraic Equations" edited by P. Rabinowitz, which covers the hybrid Powell method in great detail. It and the other algorithms in MINPACK (i.e. HYBRJ, LMDER, and their Jacobian-free variants) seem to be the ones most referenced by documentation from MATLAB, SciPy, GSL, et cetera, though for "raw" optimisation I have found the NLopt wiki to also be very interesting.
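For reference, the MINPACK routines mentioned here can be reached from SciPy through `scipy.optimize.root`; a short example on an arbitrary 2-by-2 test system (the system itself is just an illustration):

```python
# Calling the MINPACK solvers (hybrid Powell and Levenberg-Marquardt) via SciPy.
import numpy as np
from scipy.optimize import root

def F(X):
    x, y = X
    return [x**2 + y**2 - 4.0, np.exp(x) + y - 1.0]

def J(X):
    x, y = X
    return [[2.0 * x,   2.0 * y],
            [np.exp(x), 1.0]]

sol_hybr = root(F, x0=[1.0, -1.0], jac=J, method="hybr")  # hybrid Powell
sol_lm   = root(F, x0=[1.0, -1.0], jac=J, method="lm")    # Levenberg-Marquardt
print(sol_hybr.x, sol_hybr.success)
print(sol_lm.x, sol_lm.success)
```

With an analytic Jacobian supplied, `method="hybr"` dispatches to HYBRJ and `method="lm"` to LMDER; omitting `jac` falls back to the finite-difference variants (HYBRD, LMDIF).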
 
  • #6
I would like to come back to this topic, as this is becoming a subsection of my thesis work, so it's of utmost importance that I can address the topic concisely and correctly.

Let us continue to suppose that we have a function ##\vec{F}(\vec{X})## where both ##\vec{F}## and ##\vec{X}## are of dimension ##N##, and also a scalarised version ##f(\vec{X}) = \frac{1}{2}\vec{F}(\vec{X}) \cdot \vec{F}(\vec{X})## (the factor of 1/2 is for later convenience).

Newton-Raphson
For the first step (subsequent steps use ##n## and ##n+1## but the expression is the same):
$$\Delta \vec{X}_{NR} \equiv \vec{X}_1 - \vec{X}_0 = - \mathbf{J}^{-1}(\vec{X}_0) \vec{F}(\vec{X}_0)~,$$
where the matrix ##\mathbf{J}(\vec{X}) \equiv \left( \frac{\partial F_i(\vec{X})}{\partial X_j} \right)## (shorthand).

Gradient-Descent
$$\Delta \vec{X}_{GD} \equiv \vec{X}_1 - \vec{X}_0 = - \lambda \nabla f(\vec{X}_0)~,$$
where ##\lambda## is the step-size (must be heuristically determined, let's ignore this for now and focus just on the direction), and where the gradient vector ##\nabla f(\vec{X}) \equiv \left(\frac{\partial f(\vec{X})}{\partial X_i}\right)## (shorthand).
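In code, the two step directions for a general ##N##-by-##N## system might be sketched as follows (the function names are mine, not any standard API; note that one solves the linear system rather than forming ##\mathbf{J}^{-1}## explicitly):

```python
# Generic N-dimensional step directions, given callables F(X) and Jacobian J(X).
import numpy as np

def newton_raphson_step(F, J, X):
    """Solve J(X) dX = -F(X) instead of explicitly inverting the Jacobian."""
    return np.linalg.solve(J(X), -F(X))

def gradient_descent_step(F, J, X, lam):
    """-lam * grad f, with f = 1/2 F.F, using grad f = J^T F."""
    return -lam * J(X).T @ F(X)
```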

If I apply this explicitly to a 2-by-2 system of equations, I get the following (after some messy working):

$$\Delta \vec{X}_{NR} = \frac{-1}{\frac{\partial F_1}{\partial X_1}\frac{\partial F_2}{\partial X_2} - \frac{\partial F_1}{\partial X_2}\frac{\partial F_2}{\partial X_1}} \begin{pmatrix} F_1 \frac{\partial F_2}{\partial X_2} - F_2 \frac{\partial F_1}{\partial X_2} \\ F_2 \frac{\partial F_1}{\partial X_1} - F_1 \frac{\partial F_2}{\partial X_1} \end{pmatrix}$$

$$\Delta \vec{X}_{GD} = - \lambda \begin{pmatrix} F_1 \frac{\partial F_1}{\partial X_1} + F_2 \frac{\partial F_2}{\partial X_1} \\ F_1 \frac{\partial F_1}{\partial X_2} + F_2 \frac{\partial F_2}{\partial X_2} \end{pmatrix}$$

Ignoring the scaling factors and focusing just on the directions, it becomes very obvious that they are in general different (and that the expressions get even messier in more dimensions), but I'd welcome any further comments which might help make this more "intuitive". Meanwhile I'll of course carry on thinking about it, and will post any thoughts as they come to me.
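One way to double-check the 2-by-2 expressions above is to let SymPy grind through the algebra symbolically; a small verification sketch (not part of the derivation itself):

```python
# Symbolic check of the explicit 2-by-2 Newton-Raphson and gradient-descent steps.
import sympy as sp

X1, X2, lam = sp.symbols("X1 X2 lambda")
F1 = sp.Function("F1")(X1, X2)
F2 = sp.Function("F2")(X1, X2)

Fvec = sp.Matrix([F1, F2])
Jmat = Fvec.jacobian([X1, X2])

step_nr = (-Jmat.inv() * Fvec).applyfunc(sp.simplify)    # Newton-Raphson direction
step_gd = (-lam * Jmat.T * Fvec).applyfunc(sp.simplify)  # -lambda * grad f

sp.pprint(step_nr)
sp.pprint(step_gd)
```

The printed entries should match the two displayed vectors, up to how SymPy arranges the common determinant factor.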
 
  • #7
The gradient-descent approach has a speed problem converging to a minimum in an ill-conditioned situation. If the objective function is a long, narrow valley, gradient descent tends to zig-zag across the narrow dimension like a sewing machine while only slowly traveling along the long dimension toward the minimum. More advanced algorithms first normalize the long, narrow valley to a circle so that the direction of each iterative step points more directly to the minimum.
[Attached figures: contour plots illustrating the steepest-descent zig-zag weakness]

PS. If the objective function really is a simple ellipse, the steepest descent would take a large second step toward the minimum and be fast. But any distortion of the objective function can fool the steepest descent algorithm and make it very slow.
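The zig-zag behaviour is easy to reproduce on a toy quadratic valley; a small sketch using steepest descent with an exact line search (the factor-100 conditioning and the starting point are arbitrary choices that make the effect pronounced):

```python
# Steepest descent with exact line search on the ill-conditioned quadratic
# f(x, y) = x^2 + 100 y^2; the start point lies on the valley wall so that
# the iterates zig-zag with close to the worst-case rate.
import numpy as np

A = np.diag([1.0, 100.0])           # f(X) = X^T A X, condition number 100

def grad(X):
    return 2.0 * A @ X

X = np.array([100.0, 1.0])
for k in range(10):
    g = grad(X)
    alpha = (g @ g) / (g @ (2.0 * A @ g))   # exact line-search step length
    X = X - alpha * g
    print(k, X)
```

The printed iterates creep along the long axis while the short-axis component keeps flipping sign, which is exactly the zig-zag pattern described above.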
 
  • #8
In light of @FactChecker 's post talking about the properties of objective functions, I feel the need to re-emphasise that my ultimate goal is not merely to minimise a merit function, but actually to solve a system of nonlinear equations, ##\mathbf{F}(\mathbf{X})=\mathbf{0}##. This is important to state because the fields of "minimisation" and "equation solving / root-finding" share terminology without sharing meaning: "Newton's Method" for equation systems uses first derivatives, while for minimisation it uses second derivatives; and when talking about a "merit function" for an equation system, one expects/aims for a minimum of exactly zero, corresponding to the root.

With that out of the way:

Instead of talking in terms of an explicit 2-by-2 system as I did in my previous post, I decided to try to improve my intuition on the concepts of these step directions by using vector calculus to compute the (cosine of the) angle between them, i.e. between the Newton-Raphson step direction ##\Delta \mathbf{X}_{NR} \equiv -\mathbf{J}^{-1} \mathbf{F}## and the Gradient-Descent direction ##\Delta \mathbf{X}_{GD} \equiv -\nabla f##.

Recalling that ##\nabla f = \frac{1}{2} \nabla (\mathbf{F}^T \mathbf{F}) = \mathbf{J}^T \mathbf{F}##:

$$\begin{align}\cos(\theta) &= \frac{(\Delta \mathbf{X}_{NR})^T (\Delta \mathbf{X}_{GD})}{||\Delta \mathbf{X}_{NR}|| ~ ||\Delta \mathbf{X}_{GD}|| } \\
&= \frac{(\mathbf{J}^{-1} \mathbf{F})^T (\nabla f)}{||\mathbf{J}^{-1} \mathbf{F}||~ ||\nabla f|| } = \frac{(\mathbf{J}^{-1} \mathbf{F})^T(\mathbf{J}^T \mathbf{F})}{||\mathbf{J}^{-1} \mathbf{F}||~||\mathbf{J}^T \mathbf{F}|| } \\
&= \frac{\mathbf{F}^T \mathbf{F}}{||\mathbf{J}^{-1} \mathbf{F}||~||\mathbf{J}^T \mathbf{F}|| } \\
\Rightarrow \cos(\theta) &> 0 \Leftrightarrow \frac{-\pi}{2} < \theta < \frac{\pi}{2}\end{align}$$

i.e. that the Newton-Raphson step is a "descent direction", using the term from Nocedal & Wright (see Sections 10.3 and 11.2).

I would appreciate it if someone could verify my logic, or cite any necessary properties of those terms in the denominator when moving from the 3rd line to the 4th, as I'm not confident I can fully justify it if pressed. With this, I should be able to finally complete my understanding of the relationship between these different step directions.

EDIT: I think I have it, namely that I was forgetting the Cauchy-Schwarz inequality ##|\langle \mathbf{u},\mathbf{v}\rangle | \leq ||\mathbf{u}||~||\mathbf{v}||##. This gives:
$$\begin{align}\cos(\theta) &\leq \frac{\mathbf{F}^T \mathbf{F}}{(\mathbf{J}^{-1}\mathbf{F})^T (\mathbf{J}^T \mathbf{F})} = \frac{\mathbf{F}^T \mathbf{F}}{\mathbf{F}^T \mathbf{F}} = 1 \\
\Rightarrow 0 < \cos(\theta) &\leq 1\end{align}$$
with ##\cos(\theta)>0## following trivially from the form of line (3), except at ##\mathbf{F}= 0##. This isn't a huge problem, as that would already mean we're at the root, where we don't expect notions of descent or of a Newton-Raphson step to be especially meaningful.
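A quick numerical spot-check of that bound on random nonsingular Jacobians (a sanity check only, not a proof):

```python
# Spot-check that cos(theta) between the N-R and gradient-descent directions
# lies in (0, 1] for randomly generated F and J.
import numpy as np

rng = np.random.default_rng(1)
for _ in range(5):
    n = 4
    F = rng.standard_normal(n)
    J = rng.standard_normal((n, n))       # almost surely nonsingular
    d_nr = -np.linalg.solve(J, F)         # Newton-Raphson direction
    d_gd = -J.T @ F                       # -grad f
    cos_theta = (d_nr @ d_gd) / (np.linalg.norm(d_nr) * np.linalg.norm(d_gd))
    print(cos_theta)                      # should always lie in (0, 1]
```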
 
  • #9
Yet further analysis raises the following points about the possible extreme cases of ##\cos(\theta)##:

1. ##\Delta \mathbf{X}_{GD} \perp \Delta \mathbf{X}_{NR} \Leftrightarrow \cos(\theta) = 0~,##
which can only occur either for ##\mathbf{F}=0##, or when either ##||\mathbf{J}^{-1}\mathbf{F}||\rightarrow \infty## or ##||\mathbf{J}^{T}\mathbf{F}||\rightarrow \infty##. When can either of these occur, and what would be the most "useful" interpretation of this situation in the context of an equation system and its functional behaviour around the current ##\mathbf{X}##?

2. ##\Delta \mathbf{X}_{GD} \parallel \Delta \mathbf{X}_{NR} \Leftrightarrow \cos(\theta) = 1~,##
which seems to occur only when ##\mathbf{J}^{-1} = \lambda \mathbf{J}^T##. I'm aware that this is equivalent to the Jacobian being (a scalar multiple of) an "orthogonal matrix", but under what circumstances can this arise for an equation system, and again, how could this situation be qualitatively described/interpreted in that context?

Reading further into this, a matrix is orthogonal if its rows (and equivalently its columns) form an orthonormal basis. In the context of equation systems, would this be equivalent to saying that, local to ##\mathbf{X}##, the equation system behaves essentially linearly (i.e. no nonlinearity to speak of), barring a coordinate-system rotation?
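For what it's worth, the parallel case is easy to exhibit numerically by taking ##\mathbf{F}(\mathbf{X}) = \mathbf{Q}\mathbf{X} - \mathbf{b}## with ##\mathbf{Q}## a rotation matrix, so that the Jacobian is orthogonal everywhere (a toy construction, not from the thread):

```python
# Linear system with an orthogonal (rotation) Jacobian: the Newton-Raphson and
# gradient-descent directions coincide, since Q^{-1} = Q^T.
import numpy as np

angle = 0.7
Q = np.array([[np.cos(angle), -np.sin(angle)],
              [np.sin(angle),  np.cos(angle)]])   # orthogonal Jacobian
b = np.array([1.0, 2.0])

def F(X):
    return Q @ X - b

X0 = np.array([3.0, -1.5])
d_nr = -np.linalg.solve(Q, F(X0))   # Newton direction (lands on the root in one step)
d_gd = -Q.T @ F(X0)                 # -grad f
print(d_nr, d_gd)                   # identical up to rounding
print(d_nr @ d_gd / (np.linalg.norm(d_nr) * np.linalg.norm(d_gd)))  # cos(theta) = 1
```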
 

1. What is the difference between linear and nonlinear equation systems?

Linear equation systems are those in which each equation can be expressed as a linear combination of the variables. In other words, the variables appear only to the first power and do not multiply or divide each other. Nonlinear equation systems, on the other hand, involve variables that can appear in higher powers or can be multiplied or divided by each other.

2. Why does optimisation fail for nonlinear equation systems?

Optimisation techniques are designed to find the best solution to a given problem, but they are most effective on well-behaved (ideally convex) problems. The merit function built from a nonlinear equation system can have multiple local minima, not all of which correspond to roots, making it difficult for optimisation methods alone to find a true solution.

3. How can I determine if my problem involves a nonlinear equation system?

If your problem has variables that are multiplied or divided by each other, or if the equations involve exponents or other nonlinear functions, then it is likely a nonlinear equation system. You can also check the form of the equations to see if they can be expressed as linear combinations of the variables.

4. What are some common methods for solving nonlinear equation systems?

Some common methods for solving nonlinear equation systems include iterative methods like the Newton-Raphson method, which repeatedly updates a guess to get closer to the solution, and gradient-descent methods, which follow the direction of steepest descent to find the minimum of a merit function. For single equations, other techniques include the bisection method and the secant method.
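A minimal one-dimensional illustration of the "repeatedly updating a guess" idea (the example equation ##x^3 - 2 = 0## is arbitrary):

```python
# 1-D Newton-Raphson: x_{n+1} = x_n - g(x_n) / g'(x_n).
def newton_1d(g, dg, x0, tol=1e-12, max_iter=50):
    x = x0
    for _ in range(max_iter):
        step = g(x) / dg(x)
        x -= step
        if abs(step) < tol:
            break
    return x

root = newton_1d(lambda x: x**3 - 2.0, lambda x: 3.0 * x**2, x0=1.0)
print(root)   # cube root of 2, about 1.2599
```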

5. Are there any specific strategies for dealing with failure of optimisation for nonlinear equation systems?

Yes, there are several strategies that can help with dealing with failure of optimisation for nonlinear equation systems. One approach is to use a hybrid method that combines different techniques to improve the chances of finding a good solution. Another strategy is to reformulate the problem to make it more amenable to optimisation methods, such as transforming it into a linear problem. Additionally, adjusting the initial guess or using different starting points can also improve the performance of optimisation methods for nonlinear systems.
