# Iterative Minimization lemma proof

• rayge
In summary, we are attempting to minimize a real-valued function f(x) with no other conditions on it. We use the relation x^k = x^{k-1} + \alpha_{k}d^k to choose the next x^k and assume d^k is a descent direction. We also assume that for small positive \alpha, f(x^{k-1} + \alpha d^k) is decreasing. The lemma we want to prove is that when x^k is constructed using the optimal \alpha, we have \nabla f(x^k) \cdot d^k = 0. To prove this, we differentiate the function f(x^{k-1} + \alpha d^

## Homework Statement

$f(x)$ is the function we want to minimize. Beyond being real-valued, there are no other conditions on it. (I'm surprised it's not at least continuous, but the book doesn't say that's a condition.) We choose the next $x^k$ through the relation $x^k = x^{k-1} + \alpha_{k}d^k$. We assume $d^k$ is a descent direction. That is, for small positive $\alpha$, $f(x^{k-1} + \alpha d^k)$ is decreasing.

Here's the lemma we want to prove:
When $x^k$ is constructed using the optimal $\alpha$, we have $\nabla f(x^k) \cdot d^k = 0$

## The Attempt at a Solution

It's suggested in the book that we should differentiate the function $f(x^{k-1} + \alpha d^k)$ with respect to $\alpha$. My problem is that I don't know how to differentiate a function that isn't defined. My first guess went something like this, but I don't see how I'm any closer to a solution.

$$\frac{\partial}{\partial \alpha} f(x^{k-1} + \alpha d^k) = \frac{\partial}{\partial \alpha} f(x^k)$$
$$f'(x^{k-1} + \alpha d^k)d^k = f'(x^k)(0)$$
$$f'(x^{k-1} + \alpha d^k)d^k = 0$$

I'm really not looking for an answer, but if someone could point me to where I could learn about taking differentials of undefined functions that would be helpful. I'm guessing that somehow I can extract a gradient out of this, and a dot product, but I'm feeling pretty confused.

rayge said:

## Homework Statement

$f(x)$ is the function we want to minimize. Beyond being real-valued, there are no other conditions on it. (I'm surprised it's not at least continuous, but the book doesn't say that's a condition.)
If you're expected to differentiate $f$ it would have to be differentiable, hence continuous.

We choose the next $x^k$ through the relation $x^k = x^{k-1} + \alpha_{k}d^k$.

You didn't say what $x^k$ is. I assume it is a vector. For example, $x^k = ( {x^k}_1, {x^k}_2 )$

We assume $d^k$ is a descent direction. That is, for small positive $\alpha$, $f(x^{k-1} + \alpha d^k)$ is decreasing.

Here's the lemma we want to prove:
When $x^k$ is constructed using the optimal $\alpha$, we have $\nabla f(x^k) \cdot d^k = 0$

## The Attempt at a Solution

It's suggested in the book that we should differentiate the function $f(x^{k-1} + \alpha d^k)$ with respect to $\alpha$. My problem is that I don't know how to differentiate a function that isn't defined.

Visualize all the arguments of $f$ . For example $f( {x^{k-1}}_1 + \alpha {d^k}_1, {x^{k-1}}_2 + \alpha {d^k}_2)$

Things might look more familiar In different notation. If we have a real valued function $f(x,y)$ of two real variables $x,y$ and $X(\alpha), \ Y(\alpha)$ are real valued functions of $\alpha$ then
$\frac{df}{d\alpha} = \frac{\partial df}{dx}\frac{dX}{d\alpha} + \frac{\partial df}{dy} \frac{dY}{d\alpha}$
$= \nabla f \cdot (\frac{dX}{d\alpha},\frac{dY}{d\alpha})$

rayge said:

## Homework Statement

$f(x)$ is the function we want to minimize. Beyond being real-valued, there are no other conditions on it. (I'm surprised it's not at least continuous, but the book doesn't say that's a condition.) We choose the next $x^k$ through the relation $x^k = x^{k-1} + \alpha_{k}d^k$. We assume $d^k$ is a descent direction. That is, for small positive $\alpha$, $f(x^{k-1} + \alpha d^k)$ is decreasing.

Here's the lemma we want to prove:
When $x^k$ is constructed using the optimal $\alpha$, we have $\nabla f(x^k) \cdot d^k = 0$

## The Attempt at a Solution

It's suggested in the book that we should differentiate the function $f(x^{k-1} + \alpha d^k)$ with respect to $\alpha$. My problem is that I don't know how to differentiate a function that isn't defined. My first guess went something like this, but I don't see how I'm any closer to a solution.

$$\frac{\partial}{\partial \alpha} f(x^{k-1} + \alpha d^k) = \frac{\partial}{\partial \alpha} f(x^k)$$
$$f'(x^{k-1} + \alpha d^k)d^k = f'(x^k)(0)$$
$$f'(x^{k-1} + \alpha d^k)d^k = 0$$

I'm really not looking for an answer, but if someone could point me to where I could learn about taking differentials of undefined functions that would be helpful. I'm guessing that somehow I can extract a gradient out of this, and a dot product, but I'm feeling pretty confused.

Taking derivatives does not make sense unless ##f## is differentiable, so automatically continuous---in fact, even smoother than continuous. Maybe the book does not say that, but it is part of the reader's assumed background. Furthermore (just to be picky), expressions like ##\nabla f(x^k) \cdot d^k## depend on even more smoothness: ##f## must be continuously differentiable (not just plain differentiable). If the book does not make this clear, that is a flaw; it should definitely cover these points somewhere.

BTW: the stated result is false if ##f## is not continuously differentiable; there are some easy counterexamples.

Last edited:
Actually it is made clear in the beginning on the chapter that we're dealing with quadratic functions of the form $f(x) = \frac{1}{2}\|Ax - b\|^2_2$. My fault for not reading closely enough.

rayge said:
Actually it is made clear in the beginning on the chapter that we're dealing with quadratic functions of the form $f(x) = \frac{1}{2}\|Ax - b\|^2_2$. My fault for not reading closely enough.

OK, but the result is actually true for any continuously-differentiable ##f##.

Good to know, thanks. I applied the chain rule as suggested, and got:
$$\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = \nabla f \cdot d^k$$
From here, it seems reasonable that $\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = 0$ is our optimal value but I don't know how to prove that from our assumptions. For small $\alpha$, $f(x^{k-1} + \alpha d^k)$ is decreasing. For large enough $\alpha$, I think $f(x^{k-1} + \alpha d^k)$ is increasing (i.e. we pass by the optimal point). I just don't have a solid grasp on how the function works, and assuming it's increasing for large $\alpha$, why that is the case.

Thanks again.

rayge said:
Good to know, thanks. I applied the chain rule as suggested, and got:
$$\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = \nabla f \cdot d^k$$
From here, it seems reasonable that $\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = 0$ is our optimal value but I don't know how to prove that from our assumptions. For small $\alpha$, $f(x^{k-1} + \alpha d^k)$ is decreasing. For large enough $\alpha$, I think $f(x^{k-1} + \alpha d^k)$ is increasing (i.e. we pass by the optimal point). I just don't have a solid grasp on how the function works, and assuming it's increasing for large $\alpha$, why that is the case.

Thanks again.

You are searching for a minimum along a line, so you are trying to minimize a univariate function ##\phi(\alpha)##, which just happens to be of the form
$$\phi(\alpha) = f(x^{k-1} + \alpha d^k)$$
for some vector constants ##x^{k-1}, d^k##. How do you minimize functions of one variable?

rayge said:
Good to know, thanks. I applied the chain rule as suggested, and got:
$$\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = \nabla f \cdot d^k$$
From here, it seems reasonable that $\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = 0$ is our optimal value but I don't know how to prove that from our assumptions. For small $\alpha$, $f(x^{k-1} + \alpha d^k)$ is decreasing. For large enough $\alpha$, I think $f(x^{k-1} + \alpha d^k)$ is increasing (i.e. we pass by the optimal point). I just don't have a solid grasp on how the function works, and assuming it's increasing for large $\alpha$, why that is the case.

Thanks again.
I see that my previous response did not deal with all of your question. I cannot supply too many details without violating PF helping policies, but I can give a hint. In your specific case you have ##f(\vec{x}) = || A \vec{x} - \vec{b}||^2##. So letting ##\vec{y} = A \vec{x}^k - \vec{b}## and ##\vec{p} = A \vec{d}^k##, we have that ##\phi(\alpha) = f(\vec{x}^k + \vec{d}^k \alpha)## has the form ##\phi(\alpha) = ||\vec{y} + \alpha \vec{p}||^2 = a + b \alpha + c \alpha^2##. How do you find ##a, b, c##? Given that ##\vec{d}^k## is a descent direction, what does this say about the values or signs of ##a,b,c##? What do these signs say about the behavior of ##\phi(\alpha)##?