How to estimate simplex gradient using Taylor series?

In summary: the Lipschitz constants are written as \(\displaystyle 2K_f\) and \(\displaystyle 3K_C\) precisely so that the Taylor remainder bounds come out as \(\displaystyle K_f \lVert v_j \rVert^2\) and \(\displaystyle \tfrac{K_C}{2} \lVert v_j \rVert^3\); the constant \(\displaystyle K\) in the lemmas is then not \(\displaystyle K_f\) (or \(\displaystyle K_C\)) itself, but picks up a dimension factor such as \(\displaystyle \sqrt{N}\) when the per-edge bounds are stacked into a bound on the full gradient error.
  • #1
TackyBoar
I read Iterative methods for optimization by C. Kelley (PDF) and I'm struggling to understand proof of
Lemma 6.2.1. Let \(\displaystyle S\) be a simplex. Let \(\displaystyle \nabla f \) be Lipschitz continuous in a neighborhood of \(\displaystyle S\) with Lipschitz constant \(\displaystyle 2K_f\). Then there is \(\displaystyle K>0\), depending only on \(\displaystyle K_f\) such that
\(\displaystyle \left\lVert \nabla f(x_1) - D(f : S) \right\rVert \leq K \kappa(V) \sigma_+(S).\)

Notes on notation: \(\displaystyle S \) is a simplex with vertices \(\displaystyle x_1 \) to \(\displaystyle x_{N+1} \) (order matters), edges \(\displaystyle v_j = x_j - x_1 \) for \(\displaystyle 2 \leq j \leq N+1 \), which form the matrix \(\displaystyle V = (v_2, \dots, v_{N+1}) \), and \(\displaystyle \sigma_+(S) = \max_j \lVert x_j - x_1 \rVert_2 \). The simplex "gradient" is \(\displaystyle D(f : S) = V^{-T} \left( f(x_2) - f(x_1), \dots, f(x_{N+1}) - f(x_1) \right)^T. \)
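For concreteness, here is how I compute \(\displaystyle D(f : S) \) numerically (a small NumPy sketch; the quadratic test function and the particular simplex are my own choices, not from the book):

```python
import numpy as np

# Test function and its exact gradient (my own choice, not from Kelley).
def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1] + 2.0 * x[1]**2

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 4.0 * x[1]])

# Simplex in R^2: vertices x_1, x_2, x_3 (N = 2, so N + 1 = 3 vertices).
x1 = np.array([1.0, 1.0])
h = 1e-3                                    # edge scale
X = [x1, x1 + h * np.array([1.0, 0.0]), x1 + h * np.array([0.3, 1.0])]

# V has the edges v_j = x_j - x_1 as columns; delta stacks f(x_j) - f(x_1).
V = np.column_stack([x - x1 for x in X[1:]])
delta = np.array([f(x) - f(x1) for x in X[1:]])

# Simplex gradient D(f : S) = V^{-T} delta, i.e. solve V^T d = delta.
D = np.linalg.solve(V.T, delta)

print("D(f:S)      =", D)
print("grad f(x_1) =", grad_f(x1))
print("error       =", np.linalg.norm(D - grad_f(x1)))  # O(sigma_+(S)), as the lemma predicts
```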

Proof starts with the following statement:
Our smoothness assumptions on \(\displaystyle f\) and Taylor’s theorem imply that for all \(\displaystyle 2 \leq j \leq N+1 \)
\(\displaystyle \left|f\left(x_{1}\right)-f\left(x_{j}\right)+v_{j}^{T} \nabla f\left(x_{1}\right)\right| \leq K_{f}\left\|v_{j}\right\|^{2}\)
Why is it true?

Using Taylor expansion about \(\displaystyle x_1 \) gives
\(\displaystyle f(x_j) = f(x_1) + \nabla f(x_1)^T (x_j - x_1) + R(x_j - x_1) = f(x_1) + \nabla f(x_1)^T v_j + R(v_j) \)
and using the Mean Value Theorem on \(\displaystyle R(v_j) \) leads to
\(\displaystyle
|R(v_j)|
= |R(v_j)| - |R(0)|
\leq \lVert v_j \rVert \cdot \sup_{\theta \in [0, 1]} \lVert \nabla R(\theta v_j) \rVert
\leq \lVert v_j \rVert \cdot K_f \lVert v_j \rVert.
\)
It seems that the constant could be just \(\displaystyle 1 \cdot K_f \). Is there a reasoning error in what I wrote? Later in the text I spotted a similar Lemma with a different constant:
Lemma 6.2.5. Let \(\displaystyle S\) be a nonsingular simplex and let \(\displaystyle \nabla^2 f \) be Lipschitz continuous in a neighborhood of \(\displaystyle S \cup R(S) \) with Lipschitz constant \(\displaystyle 3K_C \). Then there is \(\displaystyle K>0 \) such that
\(\displaystyle \left\lVert \nabla f(x_1) - D_C(f : S) \right\rVert \leq K \kappa(V) \sigma_+(S)^2. \)

It uses similar constructions, so I suspect the particular factor in the Lipschitz constant (here \(\displaystyle 3K_C \), there \(\displaystyle 2K_f \)) is chosen deliberately.
 
  • #2


Your Taylor-series argument is essentially the right idea; the quoted step is just the standard first-order Taylor bound. Two remarks on where the constants come from.

First, the quoted inequality. Writing the remainder in integral form,
\(\displaystyle R(v_j) = f(x_j) - f(x_1) - v_j^T \nabla f(x_1) = \int_0^1 \left( \nabla f(x_1 + t v_j) - \nabla f(x_1) \right)^T v_j \, dt, \)
the Lipschitz assumption gives \(\displaystyle \lVert \nabla f(x_1 + t v_j) - \nabla f(x_1) \rVert \leq 2 K_f t \lVert v_j \rVert \), so
\(\displaystyle |R(v_j)| \leq \int_0^1 2 K_f t \lVert v_j \rVert^2 \, dt = K_f \lVert v_j \rVert^2 . \)
That is exactly the statement in the book, and it is why the Lipschitz constant is written as \(\displaystyle 2K_f \): the factor 2 is absorbed by the integration. The one small slip in your derivation is the last step: with Lipschitz constant \(\displaystyle 2K_f \), the crude bound \(\displaystyle \sup_{\theta} \lVert \nabla R(\theta v_j) \rVert \leq 2 K_f \lVert v_j \rVert \) only gives \(\displaystyle 2 K_f \lVert v_j \rVert^2 \); the sharper constant \(\displaystyle K_f \) needs the integral form above. Also note that a Lipschitz constant bounds how fast \(\displaystyle \nabla f \) can change, not the size of \(\displaystyle \nabla f \) itself.

Second, why the lemma's \(\displaystyle K \) is not simply \(\displaystyle K_f \). The per-edge bounds still have to be assembled into a bound on the vector \(\displaystyle \nabla f(x_1) - D(f:S) \). With \(\displaystyle \delta(f:S) = (f(x_2) - f(x_1), \dots, f(x_{N+1}) - f(x_1))^T \), the \(\displaystyle N \) scalar inequalities stack to
\(\displaystyle \lVert V^T \nabla f(x_1) - \delta(f:S) \rVert \leq K_f \sqrt{N} \, \sigma_+(S)^2 , \)
and since \(\displaystyle \nabla f(x_1) - D(f:S) = V^{-T} \left( V^T \nabla f(x_1) - \delta(f:S) \right) \) and \(\displaystyle \sigma_+(S) \leq \lVert V \rVert \),
\(\displaystyle \lVert \nabla f(x_1) - D(f:S) \rVert \leq K_f \sqrt{N} \, \lVert V^{-1} \rVert \, \sigma_+(S)^2 \leq K_f \sqrt{N} \, \kappa(V) \, \sigma_+(S) . \)
So \(\displaystyle K \) is a dimension-dependent multiple of \(\displaystyle K_f \) (here \(\displaystyle \sqrt{N} K_f \)), which is all the lemma claims when it says \(\displaystyle K \) depends only on \(\displaystyle K_f \) (the dimension \(\displaystyle N \) being regarded as fixed).
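As a sanity check (my own numerical experiment, not from the book; the test function, the simplex shape V0, and the helper simplex_gradient are my own choices), the following sketch shrinks a fixed-shape simplex and watches the forward simplex-gradient error decay linearly in \(\displaystyle \sigma_+(S) \), consistent with Lemma 6.2.1:

```python
import numpy as np

# Smooth test function with a non-constant Hessian (my own choice).
def f(x):
    return np.sin(x[0]) + x[0] * x[1]**2

def grad_f(x):
    return np.array([np.cos(x[0]) + x[1]**2, 2.0 * x[0] * x[1]])

def simplex_gradient(f, x1, V):
    """Forward simplex gradient D(f:S) = V^{-T} delta(f:S)."""
    delta = np.array([f(x1 + V[:, j]) - f(x1) for j in range(V.shape[1])])
    return np.linalg.solve(V.T, delta)

x1 = np.array([0.7, -0.4])
V0 = np.array([[1.0, 0.2],
               [0.0, 1.0]])          # fixed shape, so kappa(V) stays constant

for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    V = h * V0                        # sigma_+(S) scales like h
    err = np.linalg.norm(grad_f(x1) - simplex_gradient(f, x1, V))
    print(f"h = {h:.0e}   error = {err:.3e}   error/h = {err/h:.3e}")
# error/h should stay roughly constant, i.e. the error is O(sigma_+(S)).
```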

The same bookkeeping explains Lemma 6.2.5. There the relevant Taylor expansion is second order: if \(\displaystyle \nabla^2 f \) is Lipschitz with constant \(\displaystyle 3K_C \), the third-order remainder is bounded by \(\displaystyle \tfrac{3K_C}{6} \lVert v_j \rVert^3 = \tfrac{K_C}{2} \lVert v_j \rVert^3 \), again a clean constant by design. The central simplex gradient \(\displaystyle D_C(f:S) \) combines \(\displaystyle S \) with its reflection \(\displaystyle R(S) \) (vertices \(\displaystyle x_1 \) and \(\displaystyle x_1 - v_j \)), so the second-order Taylor terms cancel and only the third-order remainders survive; stacking the per-edge bounds and multiplying by \(\displaystyle V^{-T} \) as before yields the \(\displaystyle K \kappa(V) \sigma_+(S)^2 \) bound, with \(\displaystyle K \) again a dimension-dependent multiple of \(\displaystyle K_C \). So the extra factors (2, 3, \(\displaystyle \sqrt{N} \)) are deliberate normalizations, not a deeper statement about the Lipschitz constants themselves.
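For intuition, here is a small numerical sketch of a two-sided ("central") simplex gradient built from \(\displaystyle S \) and the reflected simplex with edges \(\displaystyle -v_j \). Writing it as the average of the two one-sided simplex gradients is my reconstruction of \(\displaystyle D_C(f:S) \), so check the book for the exact definition; the point here is only that the error now decays like \(\displaystyle \sigma_+(S)^2 \):

```python
import numpy as np

def simplex_gradient(f, x1, V):
    """Forward simplex gradient D(f:S) = V^{-T} delta(f:S)."""
    delta = np.array([f(x1 + V[:, j]) - f(x1) for j in range(V.shape[1])])
    return np.linalg.solve(V.T, delta)

def central_simplex_gradient(f, x1, V):
    """Average of the simplex gradients of S and of the reflected simplex R(S)
    (edges -v_j). Assumed form of D_C(f:S); see Kelley for the exact definition."""
    return 0.5 * (simplex_gradient(f, x1, V) + simplex_gradient(f, x1, -V))

def f(x):
    return np.sin(x[0]) + x[0] * x[1]**2

def grad_f(x):
    return np.array([np.cos(x[0]) + x[1]**2, 2.0 * x[0] * x[1]])

x1 = np.array([0.7, -0.4])
V0 = np.array([[1.0, 0.2],
               [0.0, 1.0]])

for h in [1e-1, 1e-2, 1e-3]:
    V = h * V0
    err = np.linalg.norm(grad_f(x1) - central_simplex_gradient(f, x1, V))
    print(f"h = {h:.0e}   error = {err:.3e}   error/h^2 = {err/h**2:.3e}")
# error/h^2 should stay roughly constant, i.e. the error is O(sigma_+(S)^2).
```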
 

1. What is the purpose of estimating the simplex gradient using a Taylor series?

The purpose is to approximate the gradient of a function at a point using only function values at the vertices of a simplex. This is useful in derivative-free optimization, where an explicit gradient is unavailable but a direction of steepest ascent or descent is still needed.

2. How is a Taylor series used to estimate the simplex gradient?

Expanding \(\displaystyle f \) to first order about the base vertex \(\displaystyle x_1 \) along each edge of the simplex gives \(\displaystyle f(x_j) \approx f(x_1) + v_j^T \nabla f(x_1) \). Collecting these \(\displaystyle N \) approximate equations and solving the resulting linear system yields the simplex gradient \(\displaystyle D(f : S) \); the Taylor remainder is exactly what produces the error bound in Lemma 6.2.1.
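In the notation of the thread above, the stacked form of these first-order expansions is
\(\displaystyle V^T \nabla f(x_1) \approx \delta(f:S) = \begin{pmatrix} f(x_2) - f(x_1) \\ \vdots \\ f(x_{N+1}) - f(x_1) \end{pmatrix}, \qquad D(f:S) = V^{-T} \delta(f:S). \)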

3. What is a simplex in the context of estimating the gradient?

A simplex in \(\displaystyle \mathbb{R}^N \) is the convex hull of \(\displaystyle N+1 \) affinely independent points (its vertices): a triangle in two dimensions, a tetrahedron in three. In derivative-free optimization, the function values at its vertices are used to build the gradient estimate discussed above.

4. What are the advantages of using a Taylor series to estimate the simplex gradient?

It gives a simple and cheap approximation of the gradient from \(\displaystyle N+1 \) function values, with an explicit error bound (Lemma 6.2.1) in terms of the simplex size \(\displaystyle \sigma_+(S) \) and the conditioning \(\displaystyle \kappa(V) \), and it requires no analytic or automatic derivatives of \(\displaystyle f \).

5. Are there any limitations to using a Taylor series for estimating the simplex gradient?

It is only an approximation: the error grows with the simplex size and with the condition number \(\displaystyle \kappa(V) \), so a badly shaped (nearly degenerate) simplex can give a poor estimate. Higher accuracy, as in Lemma 6.2.5, requires extra function evaluations (for example on the reflected simplex) and stronger smoothness assumptions on \(\displaystyle f \).
