# Failure of Optimisation for Nonlinear Equation Systems

Tags:
1. Mar 8, 2017

### aphirst

I wasn't sure into which category I should post this, so feel free to move it into a more appropriate place.

As part of my work I'm solving a system of nonlinear equations, of a usual form:
$$\vec{F}(\vec{X})=\begin{pmatrix}F_1(X_1, X_2, \cdots X_N) \\ F_2(\cdots) \\ \vdots \\ F_N(\cdots)\end{pmatrix}=\vec{0}$$

There's lots of literature about solving systems of nonlinear equations, and lots of discussion about "how Newton-Raphson methods can fail" and "how to counter it", usual methods being to combine the N-R step with a gradient-descent step, obtained by scalarising the function usually something like

$$f(\vec{X}) = ||\vec{F}(\vec{X})||^2$$

However, I'm struggling to find any literature which makes it explicit (either with proofs, or worked examples, in 2D would probably be the most useful) that you can't just minimise $f(\vec{X})$ in the first place, and that it becomes necessary to move to methods like N-R which explicitly handle the dimensionality and simultaneity. I mean, I know that that doesn't tend to work, and I would usually hand-wave it away by making reference to the extra structure of $\vec{F}(\vec{X})$ which is lost by "flattening" things down.

If I have some free time this week, I intend to construct a pair of explicit functions of 2 variables, which definitely have regions where they cross the F=0 axis, and then some contour plots of both functions along with the L2-norm combination, and then graph some N-R and Grad-Desc directions at some sample x,y values. I hope that doing so might make things more explicit that in some cases the $f(\vec{X})$ approach will lead you astray where the $\vec{F}(\vec{X})$ behaves.

But until then, I'd really appreciate if anyone here happens to know of any relevant literature, that they provide me some citations or links.

Last edited: Mar 8, 2017
2. Mar 8, 2017

### FactChecker

This subject is called "quadratic programming" in the general subject of "nonlinear programming". My references are old (Hadley Nonlinear and Dynamic Programming) but I'm sure that the basics have not changed. Google quadratic programming to find references. You should be aware that there is no fool-proof solution to the problem of local minimum. A current research attempt to address the local minimum problem is called convexification. I have had mixed success with it.
My advice is to use existing quadratic programming tools like those available in MATLAB or R and do several optimizations from different starting points. That may give you several local minimums. Hopefully one of those is the global minimum.

3. Mar 8, 2017

### aphirst

I am of course using existing tools for solving my actual problems, my question was more to ensure that I have a thorough understanding of what I'm doing and why I'm using the tools I am, and why they do what they do, along with the ability to provide comprehensive arguments and citations.

Thanks, I'll look in that direction, and report back with what I find.

4. Mar 8, 2017

### FactChecker

Good. I just mentioned it to make sure.
I think you will find that the tools in major packages like MATLAB or R are reasonably well documented and respected. I would start there since they will address the particular algorithm you are using.

5. Mar 8, 2017

### aphirst

I think your comment on N-R being a case of quadratic programming has already set me in the right direction. From what I'm seeing so far, it's never so much that you "need N-R to solve the higher dimensional problem" but rather that "if you know you're in a region of essentially quadratic behaviour, and you have an N-by-N system, N-R is another option available to you which is going to get you to the root really, really fast"; i.e. there's a balance to be struck between robustness and speed.

In terms of algorithms I already have a huge body of literature at my disposal (I'd be here all day if I started listing, but an excellent general text so far has been "Numerical Optimization" by Nodecal and Wright), and for the nonlinear equation system solution methods I have a few things including "Numerical Methods for Nonlinear Algebraic Equations" by P. Rabinowitz which also covers the hybrid Powell method in great detail. It and the other algorithms in MINPACK (i.e. HYBRJ, LMDER, and their Jacobian-less variants) seem to be the ones most referenced by documentation from MATLAB, SciPy, GSL, et cetera, though for "raw" optimisation I have found the NLopt wiki to also be very interesting.

6. May 13, 2017

### aphirst

I would like to come back to this topic, as this is becoming a subsection of my thesis work, so it's of utmost importance that I can address the topic concisely and correctly.

Let us continue to suppose that we have a function $\vec{F}(\vec{X})$ where both are of length (dimension) $N$, and also a scalarised version $f(\vec{X}) = \frac{1}{2}\vec{F}(\vec{X}) \cdot \vec{F}(\vec{X})$ (factor of 1/2 for later convenience).

Newton-Raphson
For the first step (subsequent steps use $n$ and $n+1$ but the expression is the same):
$$\Delta \vec{X}_{NR} \equiv \vec{X}_1 - \vec{X}_0 = - \mathbf{J}^{-1}(\vec{X_0}) \vec{F}(\vec{X_0})~,$$
where the matrix $\mathbf{J}(\vec{X}) \equiv \left( \frac{\partial F_i(\vec{X})}{\partial X_j} \right)$ (shorthand).

$$\Delta \vec{X}_{GD} \equiv \vec{X}_1 - \vec{X}_0 = - \frac{\lambda}{2} \nabla f(\vec{X}_0)~,$$
where $\lambda$ is the step-size (must be heuristically determined, let's ignore this for now and focus just on the direction), and where the gradient vector $\nabla f(\vec{X}) \equiv \left(\frac{\partial f(\vec{X})}{\partial X_i}\right)$ (shorthand).

If I apply this explicitly to a 2-by-2 system of equations, I get the following (after some messy working):

$$\Delta \vec{X}_{NR} = \frac{-1}{\frac{\partial F_1}{\partial X_1}\frac{\partial F_2}{\partial X_2} - \frac{\partial F_1}{\partial X_2}\frac{\partial F_2}{\partial X_1}} \begin{pmatrix} F_1 \frac{\partial F_2}{\partial X_2} - F_2 \frac{\partial F_1}{\partial X_2} \\ F_2 \frac{\partial F_1}{\partial X_1} - F_1 \frac{\partial F_2}{\partial X_1} \end{pmatrix}$$

$$\Delta \vec{X}_{GD} = - \lambda \begin{pmatrix} F_1 \frac{\partial F_1}{\partial X_1} + F_2 \frac{\partial F_2}{\partial X_1} \\ F_1 \frac{\partial F_1}{\partial X_2} + F_2 \frac{\partial F_2}{\partial X_2} \end{pmatrix}$$

Ignoring the scaling factors, focusing just on the directions, it becomes very obvious that they are different (and that it will be even messier with more dimensions), but I'd welcome any further comments which might help make this more "intuitive". Meanwhile I'll also of course carry on thinking about it, and will post any thoughts as they come to me.

Last edited: May 13, 2017
7. May 13, 2017

### FactChecker

The gradient descent approach has a speed problem converging to a minimum for an ill-conditioned situation. If the objective function is a long, narrow valley, the gradient descent tends to zig-zag across the narrow dimension like a sowing machine slowly traveling in the long dimension toward the minimum. More advanced algorithms first normalize the long-narrow valley to a circle so that the direction of each iterative step points more directly to the minimum.

PS. If the objective function really is a simple ellipse, the steepest descent would take a large second step toward the minimum and be fast. But any distortion of the objective function can fool the steepest descent algorithm and make it very slow.

Last edited: May 13, 2017
8. May 19, 2017

### aphirst

In light of @FactChecker 's post talking about the properties of objective functions, I feel the need to re-emphasise that my ultimate goal is not merely to minimise a merit function, but indeed actually to solve $\mathbf{F}(\mathbf{X})=\mathbf{0}$ a system of nonlinear equations. This is important to state, as there is shared terminology between the fields of "minimisation" and "equation solving / rootfinding" which do not in fact share their meaning - "Newton's Method" for equation systems uses first derivatives, while for minimisation uses 2nd derivatives and whenever talking about a "merit function" for equation systems, one expects/aims for a "minimum" of exactly zero, corresponding to the root.

With that out of the way:

Instead of talking in terms of an explicit 2-by-2 system as I did in my previous post, I decided to try to improve my intuition on the concepts of these step directions by using vector calculus to compute the (cosine of the) angle between them, i.e. between the Newton-Raphson step direction $\Delta \mathbf{X}_{NR} \equiv -\mathbf{J}^{-1} \mathbf{F}$ and the Gradient-Descent direction $\Delta \mathbf{X}_{GD} \equiv -\nabla f$.

Recalling that $\nabla f = \frac{1}{2} \nabla \mathbf{F}^T \mathbf{F} = \mathbf{J}^T \mathbf{F}$:

\begin{align}\cos(\theta) &= \frac{(\Delta \mathbf{X}_{NR})^T (\Delta \mathbf{X}_{GD})}{||\Delta \mathbf{X}_{NR}|| ~ ||\Delta \mathbf{X}_{GD}|| } \\ &= \frac{(\mathbf{J}^{-1} \mathbf{F})^T (\nabla f)}{||\mathbf{J}^{-1} \mathbf{F}||~ ||\nabla f|| } = \frac{(\mathbf{J}^{-1} \mathbf{F})^T(\mathbf{J}^T \mathbf{F})}{||\mathbf{J}^{-1} \mathbf{F}||~||\mathbf{J}^T \mathbf{F}|| } \\ &= \frac{\mathbf{F}^T \mathbf{F}}{||\mathbf{J}^{-1} \mathbf{F}||~||\mathbf{J}^T \mathbf{F}|| } \\ \Rightarrow \cos(\theta) &> 0 \Leftrightarrow \frac{-\pi}{2} < \theta < \frac{\pi}{2}\end{align}

i.e. that the Newton Raphson step is a "descent direction", using the term from Nocedal & Wright (see Sections 10.3 and 11.2).

I would appreciate if someone could verify my logic, or cite any necessary properties of those terms in the denominator on the 3rd line moving into the 4th, as I'm not confident I can fully justify it if pressed. With this, I should be able to finally complete my understanding of the relationship between these different step directions.

EDIT: I think I have it, namely that I was forgetting about the Cauchy-Schwarz Identity $|\langle \mathbf{u},\mathbf{v}\rangle | \leq ||\mathbf{u}||~||\mathbf{v}||$. This gives:
\begin{align}\cos(\theta) &\leq \frac{\mathbf{F}^T \mathbf{F}}{(\mathbf{J}^{-1}\mathbf{F})^T (\mathbf{J}^T \mathbf{F})} = \frac{\mathbf{F}^T \mathbf{F}}{\mathbf{F}^T \mathbf{F}} = 1 \\ \Rightarrow 0 < \cos(\theta) &\leq 1\end{align}
with the $\cos(\theta)>0$ following trivially from the form of statement 3, except at $\mathbf{F}\equiv 0$. This isn't a huge problem as this would already mean we're at the root, and we don't expect notions of descent or Newton-Raphson to be especially meaningful here.

Last edited: May 19, 2017
9. May 24, 2017

### aphirst

Yet further analysis raises the following points about the possible extreme cases of $\cos(\theta)$:

1. $\Delta \mathbf{X}_{GD} \perp \Delta \mathbf{X}_{NR} \Leftrightarrow \cos(\theta) = 0~,$
and this can only occur either for $\mathbf{F}=0$, or when either $||\mathbf{J}^{-1}\mathbf{F}||\rightarrow \infty$ or $||\mathbf{J}^{T}\mathbf{F}||\rightarrow \infty$. When can either of these occur, and what would be the most "useful" interpretation of this situation in the context of an equation system and its functional behaviour around the current $\mathbf{X}$?

2. $\Delta \mathbf{X}_{GD} \parallel \Delta \mathbf{X}_{NR} \Leftrightarrow \cos(\theta) = 1~,$
which seems to only occur when $\mathbf{J}^{-1} = \lambda \mathbf{J}^T$. I'm aware that this is equivalent to the Jacobian being an "orthogonal matrix", but under what circumstances can this arise for an equation system, and again, how could this situation be qualitatively described/interpreted in that context?

Reading further into this, a matrix is orthogonal if its rows and columns both respectively form orthonormal bases. In the context of equation systems, would this be equivalent to saying that the equation system at $\mathbf{X}$ behaves essentially entirely linearly (i.e. no nonlinearity whatsoever), barring a coordinate-system rotation?

Last edited: May 24, 2017