Iterative Minimization lemma proof

rayge · Dec 14, 2014

Homework Statement

f(x) is the function we want to minimize. Beyond being real-valued, there are no other conditions on it. (I'm surprised it's not at least continuous, but the book doesn't say that's a condition.) We choose the next x^k through the relation x^k = x^{k-1} + \alpha_{k}d^k. We assume d^k is a descent direction. That is, for small positive \alpha, f(x^{k-1} + \alpha d^k) is decreasing.

Here's the lemma we want to prove:
When x^k is constructed using the optimal \alpha, we have \nabla f(x^k) \cdot d^k = 0

The Attempt at a Solution

It's suggested in the book that we should differentiate the function f(x^{k-1} + \alpha d^k) with respect to \alpha. My problem is that I don't know how to differentiate a function that isn't defined. My first guess went something like this, but I don't see how I'm any closer to a solution.

\frac{\partial}{\partial \alpha} f(x^{k-1} + \alpha d^k) = \frac{\partial}{\partial \alpha} f(x^k)
f'(x^{k-1} + \alpha d^k)d^k = f'(x^k)(0)
f'(x^{k-1} + \alpha d^k)d^k = 0

I'm really not looking for an answer, but if someone could point me to where I could learn about taking differentials of undefined functions that would be helpful. I'm guessing that somehow I can extract a gradient out of this, and a dot product, but I'm feeling pretty confused.

Stephen Tashi · Dec 14, 2014

rayge said:

Homework Statement

f(x) is the function we want to minimize. Beyond being real-valued, there are no other conditions on it. (I'm surprised it's not at least continuous, but the book doesn't say that's a condition.)

If you're expected to differentiate f it would have to be differentiable, hence continuous.

We choose the next x^k through the relation x^k = x^{k-1} + \alpha_{k}d^k.

You didn't say what x^k is. I assume it is a vector. For example, x^k = ( {x^k}_1, {x^k}_2 )

We assume d^k is a descent direction. That is, for small positive \alpha, f(x^{k-1} + \alpha d^k) is decreasing.

Here's the lemma we want to prove:
When x^k is constructed using the optimal \alpha, we have \nabla f(x^k) \cdot d^k = 0

The Attempt at a Solution

It's suggested in the book that we should differentiate the function f(x^{k-1} + \alpha d^k) with respect to \alpha. My problem is that I don't know how to differentiate a function that isn't defined.

Visualize all the arguments of f . For example f( {x^{k-1}}_1 + \alpha {d^k}_1, {x^{k-1}}_2 + \alpha {d^k}_2)

Things might look more familiar In different notation. If we have a real valued function f(x,y) of two real variables x,y and X(\alpha), \ Y(\alpha) are real valued functions of \alpha then
\frac{df}{d\alpha} = \frac{\partial df}{dx}\frac{dX}{d\alpha} + \frac{\partial df}{dy} \frac{dY}{d\alpha}
= \nabla f \cdot (\frac{dX}{d\alpha},\frac{dY}{d\alpha})

Ray Vickson · Dec 14, 2014

rayge said:

Homework Statement

f(x) is the function we want to minimize. Beyond being real-valued, there are no other conditions on it. (I'm surprised it's not at least continuous, but the book doesn't say that's a condition.) We choose the next x^k through the relation x^k = x^{k-1} + \alpha_{k}d^k. We assume d^k is a descent direction. That is, for small positive \alpha, f(x^{k-1} + \alpha d^k) is decreasing.

Here's the lemma we want to prove:
When x^k is constructed using the optimal \alpha, we have \nabla f(x^k) \cdot d^k = 0

The Attempt at a Solution

It's suggested in the book that we should differentiate the function f(x^{k-1} + \alpha d^k) with respect to \alpha. My problem is that I don't know how to differentiate a function that isn't defined. My first guess went something like this, but I don't see how I'm any closer to a solution.

\frac{\partial}{\partial \alpha} f(x^{k-1} + \alpha d^k) = \frac{\partial}{\partial \alpha} f(x^k)
f'(x^{k-1} + \alpha d^k)d^k = f'(x^k)(0)
f'(x^{k-1} + \alpha d^k)d^k = 0

I'm really not looking for an answer, but if someone could point me to where I could learn about taking differentials of undefined functions that would be helpful. I'm guessing that somehow I can extract a gradient out of this, and a dot product, but I'm feeling pretty confused.

Taking derivatives does not make sense unless ##f## is differentiable, so automatically continuous---in fact, even smoother than continuous. Maybe the book does not say that, but it is part of the reader's assumed background. Furthermore (just to be picky), expressions like ##\nabla f(x^k) \cdot d^k## depend on even more smoothness: ##f## must be continuously differentiable (not just plain differentiable). If the book does not make this clear, that is a flaw; it should definitely cover these points somewhere.

BTW: the stated result is false if ##f## is not continuously differentiable; there are some easy counterexamples.

rayge · Dec 14, 2014

Actually it is made clear in the beginning on the chapter that we're dealing with quadratic functions of the form f(x) = \frac{1}{2}\|Ax - b\|^2_2. My fault for not reading closely enough.

Ray Vickson · Dec 14, 2014

rayge said:

Actually it is made clear in the beginning on the chapter that we're dealing with quadratic functions of the form f(x) = \frac{1}{2}\|Ax - b\|^2_2. My fault for not reading closely enough.

OK, but the result is actually true for any continuously-differentiable ##f##.

rayge · Dec 14, 2014

Good to know, thanks. I applied the chain rule as suggested, and got:
\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = \nabla f \cdot d^k
From here, it seems reasonable that \frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = 0 is our optimal value but I don't know how to prove that from our assumptions. For small \alpha, f(x^{k-1} + \alpha d^k) is decreasing. For large enough \alpha, I think f(x^{k-1} + \alpha d^k) is increasing (i.e. we pass by the optimal point). I just don't have a solid grasp on how the function works, and assuming it's increasing for large \alpha, why that is the case.

Thanks again.

Ray Vickson · Dec 14, 2014

rayge said:

Good to know, thanks. I applied the chain rule as suggested, and got:
\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = \nabla f \cdot d^k
From here, it seems reasonable that \frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = 0 is our optimal value but I don't know how to prove that from our assumptions. For small \alpha, f(x^{k-1} + \alpha d^k) is decreasing. For large enough \alpha, I think f(x^{k-1} + \alpha d^k) is increasing (i.e. we pass by the optimal point). I just don't have a solid grasp on how the function works, and assuming it's increasing for large \alpha, why that is the case.

Thanks again.

You are searching for a minimum along a line, so you are trying to minimize a univariate function ##\phi(\alpha)##, which just happens to be of the form
\phi(\alpha) = f(x^{k-1} + \alpha d^k)
for some vector constants ##x^{k-1}, d^k##. How do you minimize functions of one variable?

Ray Vickson · Dec 15, 2014

rayge said:

Good to know, thanks. I applied the chain rule as suggested, and got:
\frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = \nabla f \cdot d^k
From here, it seems reasonable that \frac{d f(x^{k-1} + \alpha d^k)}{d\alpha} = 0 is our optimal value but I don't know how to prove that from our assumptions. For small \alpha, f(x^{k-1} + \alpha d^k) is decreasing. For large enough \alpha, I think f(x^{k-1} + \alpha d^k) is increasing (i.e. we pass by the optimal point). I just don't have a solid grasp on how the function works, and assuming it's increasing for large \alpha, why that is the case.

Thanks again.

I see that my previous response did not deal with all of your question. I cannot supply too many details without violating PF helping policies, but I can give a hint. In your specific case you have ##f(\vec{x}) = || A \vec{x} - \vec{b}||^2##. So letting ##\vec{y} = A \vec{x}^k - \vec{b}## and ##\vec{p} = A \vec{d}^k##, we have that ##\phi(\alpha) = f(\vec{x}^k + \vec{d}^k \alpha)## has the form ##\phi(\alpha) = ||\vec{y} + \alpha \vec{p}||^2 = a + b \alpha + c \alpha^2##. How do you find ##a, b, c##? Given that ##\vec{d}^k## is a descent direction, what does this say about the values or signs of ##a,b,c##? What do these signs say about the behavior of ##\phi(\alpha)##?

Iterative Minimization lemma proof

Homework Statement

The Attempt at a Solution

Homework Statement

The Attempt at a Solution

Homework Statement

The Attempt at a Solution

Similar threads

Distance between a Clock's hands when the distance is increasing most rapidly

Volume with spherical coordinates

Does this series converge uniformly?

Use greedy vertex coloring algorithm to prove the upper bound of χ

Conflicting definitions of linear independence

Insights Thinking Outside The Box Versus Knowing What’s In The Box

Insights Why Entangled Photon-Polarization Qubits Violate Bell’s Inequality

Insights Quantum Entanglement is a Kinematic Fact, not a Dynamical Effect

Insights What Exactly is Dirac’s Delta Function? - Insight

Insights Relativator (Circular Slide-Rule): Simulated with Desmos - Insight

Insights Fixing Things Which Can Go Wrong With Complex Numbers