How to estimate simplex gradient using Taylor series?

In summary: the Lipschitz constants are written as \(\displaystyle 2K_f\) and \(\displaystyle 3K_C\) precisely so that the Taylor remainder bounds come out as \(\displaystyle K_f \lVert v_j \rVert^2\) and \(\displaystyle \tfrac{K_C}{2} \lVert v_j \rVert^3\); the constant \(\displaystyle K\) in the lemmas is then not \(\displaystyle K_f\) (or \(\displaystyle K_C\)) itself, but picks up a dimension factor such as \(\displaystyle \sqrt{N}\) when the per-edge bounds are stacked into a bound on the full gradient error.
  • #1
TackyBoar
I read Iterative methods for optimization by C. Kelley (PDF) and I'm struggling to understand proof of
Lemma 6.2.1. Let \(\displaystyle S\) be a simplex. Let \(\displaystyle \nabla f \) be Lipschitz continuous in a neighborhood of \(\displaystyle S\) with Lipschitz constant \(\displaystyle 2K_f\). Then there is \(\displaystyle K>0\), depending only on \(\displaystyle K_f\) such that
\(\displaystyle \left\lVert \nabla f(x_1) - D(f : S) \right\rVert \leq K \kappa(V) \sigma_+(S).\)

Notes on notation: \(\displaystyle S \) is a simplex with vertices \(\displaystyle x_1 \) to \(\displaystyle x_{N+1} \) (order matters), edges \(\displaystyle v_j = x_j - x_1 \) for \(\displaystyle 2 \leq j \leq N+1 \), which form the matrix \(\displaystyle V = (v_2, \dots, v_{N+1}) \), and \(\displaystyle \sigma_+(S) = \max_j \lVert x_j - x_1 \rVert_2 \). The simplex "gradient" is \(\displaystyle D(f : S) = V^{-T} \left( f(x_2) - f(x_1), \dots, f(x_{N+1}) - f(x_1) \right)^T. \)
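For concreteness, here is how I compute \(\displaystyle D(f : S) \) numerically (a small NumPy sketch; the quadratic test function and the particular simplex are my own choices, not from the book):

```python
import numpy as np

# Test function and its exact gradient (my own choice, not from Kelley).
def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1] + 2.0 * x[1]**2

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1], 3.0 * x[0] + 4.0 * x[1]])

# Simplex in R^2: vertices x_1, x_2, x_3 (N = 2, so N + 1 = 3 vertices).
x1 = np.array([1.0, 1.0])
h = 1e-3                                    # edge scale
X = [x1, x1 + h * np.array([1.0, 0.0]), x1 + h * np.array([0.3, 1.0])]

# V has the edges v_j = x_j - x_1 as columns; delta stacks f(x_j) - f(x_1).
V = np.column_stack([x - x1 for x in X[1:]])
delta = np.array([f(x) - f(x1) for x in X[1:]])

# Simplex gradient D(f : S) = V^{-T} delta, i.e. solve V^T d = delta.
D = np.linalg.solve(V.T, delta)

print("D(f:S)      =", D)
print("grad f(x_1) =", grad_f(x1))
print("error       =", np.linalg.norm(D - grad_f(x1)))  # O(sigma_+(S)), as the lemma predicts
```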

Proof starts with the following statement:
Our smoothness assumptions on \(\displaystyle f\) and Taylor’s theorem imply that for all \(\displaystyle 2 \leq j \leq N+1 \)
\(\displaystyle \left|f\left(x_{1}\right)-f\left(x_{j}\right)+v_{j}^{T} \nabla f\left(x_{1}\right)\right| \leq K_{f}\left\|v_{j}\right\|^{2}\)
Why is it true?

Using Taylor expansion about \(\displaystyle x_1 \) gives
\(\displaystyle f(x_j) = f(x_1) + \nabla f(x_1)^T (x_j - x_1) + R(x_j - x_1) = f(x_1) + \nabla f(x_1)^T v_j + R(v_j) \)
and using the Mean Value Theorem on \(\displaystyle R(v_j) \) leads to
\(\displaystyle
|R(v_j)|
= |R(v_j)| - |R(0)|
\leq \lVert v_j \rVert \cdot \sup_{\theta \in [0, 1]} \lVert \nabla R(\theta v_j) \rVert
\leq \lVert v_j \rVert \cdot K_f \lVert v_j \rVert.
\)
It seems that the constant could be just \(\displaystyle 1 \cdot K_f \). Is there a reasoning error in what I wrote? Later in the text I spotted a similar Lemma with a different constant:
Lemma 6.2.5. Let \(\displaystyle S\) be a nonsingular simplex and let \(\displaystyle \nabla^2 f \) be Lipschitz continuous in a neighborhood of \(\displaystyle S \cup R(S) \) with Lipschitz constant \(\displaystyle 3K_C \). Then there is \(\displaystyle K>0 \) such that
\(\displaystyle \left\lVert \nabla f(x_1) - D_C(f : S) \right\rVert \leq K \kappa(V) \sigma_+(S)^2. \)

It uses similar constructions, so I suspect the particular factor in the Lipschitz constant (here \(\displaystyle 3K_C \), there \(\displaystyle 2K_f \)) is chosen deliberately.
 
  • #2


Your Taylor-series argument is essentially the right idea; the quoted step is just the standard first-order Taylor bound. Two remarks on where the constants come from.

First, the quoted inequality. Writing the remainder in integral form,
\(\displaystyle R(v_j) = f(x_j) - f(x_1) - v_j^T \nabla f(x_1) = \int_0^1 \left( \nabla f(x_1 + t v_j) - \nabla f(x_1) \right)^T v_j \, dt, \)
the Lipschitz assumption gives \(\displaystyle \lVert \nabla f(x_1 + t v_j) - \nabla f(x_1) \rVert \leq 2 K_f t \lVert v_j \rVert \), so
\(\displaystyle |R(v_j)| \leq \int_0^1 2 K_f t \lVert v_j \rVert^2 \, dt = K_f \lVert v_j \rVert^2 . \)
That is exactly the statement in the book, and it is why the Lipschitz constant is written as \(\displaystyle 2K_f \): the factor 2 is absorbed by the integration. The one small slip in your derivation is the last step: with Lipschitz constant \(\displaystyle 2K_f \), the crude bound \(\displaystyle \sup_{\theta} \lVert \nabla R(\theta v_j) \rVert \leq 2 K_f \lVert v_j \rVert \) only gives \(\displaystyle 2 K_f \lVert v_j \rVert^2 \); the sharper constant \(\displaystyle K_f \) needs the integral form above. Also note that a Lipschitz constant bounds how fast \(\displaystyle \nabla f \) can change, not the size of \(\displaystyle \nabla f \) itself.

Second, why the lemma's \(\displaystyle K \) is not simply \(\displaystyle K_f \). The per-edge bounds still have to be assembled into a bound on the vector \(\displaystyle \nabla f(x_1) - D(f:S) \). With \(\displaystyle \delta(f:S) = (f(x_2) - f(x_1), \dots, f(x_{N+1}) - f(x_1))^T \), the \(\displaystyle N \) scalar inequalities stack to
\(\displaystyle \lVert V^T \nabla f(x_1) - \delta(f:S) \rVert \leq K_f \sqrt{N} \, \sigma_+(S)^2 , \)
and since \(\displaystyle \nabla f(x_1) - D(f:S) = V^{-T} \left( V^T \nabla f(x_1) - \delta(f:S) \right) \) and \(\displaystyle \sigma_+(S) \leq \lVert V \rVert \),
\(\displaystyle \lVert \nabla f(x_1) - D(f:S) \rVert \leq K_f \sqrt{N} \, \lVert V^{-1} \rVert \, \sigma_+(S)^2 \leq K_f \sqrt{N} \, \kappa(V) \, \sigma_+(S) . \)
So \(\displaystyle K \) is a dimension-dependent multiple of \(\displaystyle K_f \) (here \(\displaystyle \sqrt{N} K_f \)), which is all the lemma claims when it says \(\displaystyle K \) depends only on \(\displaystyle K_f \) (the dimension \(\displaystyle N \) being regarded as fixed).
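As a sanity check (my own numerical experiment, not from the book; the test function, the simplex shape V0, and the helper simplex_gradient are my own choices), the following sketch shrinks a fixed-shape simplex and watches the forward simplex-gradient error decay linearly in \(\displaystyle \sigma_+(S) \), consistent with Lemma 6.2.1:

```python
import numpy as np

# Smooth test function with a non-constant Hessian (my own choice).
def f(x):
    return np.sin(x[0]) + x[0] * x[1]**2

def grad_f(x):
    return np.array([np.cos(x[0]) + x[1]**2, 2.0 * x[0] * x[1]])

def simplex_gradient(f, x1, V):
    """Forward simplex gradient D(f:S) = V^{-T} delta(f:S)."""
    delta = np.array([f(x1 + V[:, j]) - f(x1) for j in range(V.shape[1])])
    return np.linalg.solve(V.T, delta)

x1 = np.array([0.7, -0.4])
V0 = np.array([[1.0, 0.2],
               [0.0, 1.0]])          # fixed shape, so kappa(V) stays constant

for h in [1e-1, 1e-2, 1e-3, 1e-4]:
    V = h * V0                        # sigma_+(S) scales like h
    err = np.linalg.norm(grad_f(x1) - simplex_gradient(f, x1, V))
    print(f"h = {h:.0e}   error = {err:.3e}   error/h = {err/h:.3e}")
# error/h should stay roughly constant, i.e. the error is O(sigma_+(S)).
```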

The same bookkeeping explains Lemma 6.2.5. There the relevant Taylor expansion is second order: if \(\displaystyle \nabla^2 f \) is Lipschitz with constant \(\displaystyle 3K_C \), the third-order remainder is bounded by \(\displaystyle \tfrac{3K_C}{6} \lVert v_j \rVert^3 = \tfrac{K_C}{2} \lVert v_j \rVert^3 \), again a clean constant by design. The central simplex gradient \(\displaystyle D_C(f:S) \) combines \(\displaystyle S \) with its reflection \(\displaystyle R(S) \) (vertices \(\displaystyle x_1 \) and \(\displaystyle x_1 - v_j \)), so the second-order Taylor terms cancel and only the third-order remainders survive; stacking the per-edge bounds and multiplying by \(\displaystyle V^{-T} \) as before yields the \(\displaystyle K \kappa(V) \sigma_+(S)^2 \) bound, with \(\displaystyle K \) again a dimension-dependent multiple of \(\displaystyle K_C \). So the extra factors (2, 3, \(\displaystyle \sqrt{N} \)) are deliberate normalizations, not a deeper statement about the Lipschitz constants themselves.
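For intuition, here is a small numerical sketch of a two-sided ("central") simplex gradient built from \(\displaystyle S \) and the reflected simplex with edges \(\displaystyle -v_j \). Writing it as the average of the two one-sided simplex gradients is my reconstruction of \(\displaystyle D_C(f:S) \), so check the book for the exact definition; the point here is only that the error now decays like \(\displaystyle \sigma_+(S)^2 \):

```python
import numpy as np

def simplex_gradient(f, x1, V):
    """Forward simplex gradient D(f:S) = V^{-T} delta(f:S)."""
    delta = np.array([f(x1 + V[:, j]) - f(x1) for j in range(V.shape[1])])
    return np.linalg.solve(V.T, delta)

def central_simplex_gradient(f, x1, V):
    """Average of the simplex gradients of S and of the reflected simplex R(S)
    (edges -v_j). Assumed form of D_C(f:S); see Kelley for the exact definition."""
    return 0.5 * (simplex_gradient(f, x1, V) + simplex_gradient(f, x1, -V))

def f(x):
    return np.sin(x[0]) + x[0] * x[1]**2

def grad_f(x):
    return np.array([np.cos(x[0]) + x[1]**2, 2.0 * x[0] * x[1]])

x1 = np.array([0.7, -0.4])
V0 = np.array([[1.0, 0.2],
               [0.0, 1.0]])

for h in [1e-1, 1e-2, 1e-3]:
    V = h * V0
    err = np.linalg.norm(grad_f(x1) - central_simplex_gradient(f, x1, V))
    print(f"h = {h:.0e}   error = {err:.3e}   error/h^2 = {err/h**2:.3e}")
# error/h^2 should stay roughly constant, i.e. the error is O(sigma_+(S)^2).
```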
 

1. What is the purpose of estimating the simplex gradient using a Taylor series?

The purpose is to approximate the gradient of a function at a point using only function values at the vertices of a simplex. This is useful in derivative-free optimization, where an explicit gradient is unavailable but a direction of steepest ascent or descent is still needed.

2. How is a Taylor series used to estimate the simplex gradient?

Expanding \(\displaystyle f \) to first order about the base vertex \(\displaystyle x_1 \) along each edge of the simplex gives \(\displaystyle f(x_j) \approx f(x_1) + v_j^T \nabla f(x_1) \). Collecting these \(\displaystyle N \) approximate equations and solving the resulting linear system yields the simplex gradient \(\displaystyle D(f : S) \); the Taylor remainder is exactly what produces the error bound in Lemma 6.2.1.
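In the notation of the thread above, the stacked form of these first-order expansions is
\(\displaystyle V^T \nabla f(x_1) \approx \delta(f:S) = \begin{pmatrix} f(x_2) - f(x_1) \\ \vdots \\ f(x_{N+1}) - f(x_1) \end{pmatrix}, \qquad D(f:S) = V^{-T} \delta(f:S). \)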

3. What is a simplex in the context of estimating the gradient?

A simplex in \(\displaystyle \mathbb{R}^N \) is the convex hull of \(\displaystyle N+1 \) affinely independent points (its vertices): a triangle in two dimensions, a tetrahedron in three. In derivative-free optimization, the function values at its vertices are used to build the gradient estimate discussed above.

4. What are the advantages of using a Taylor series to estimate the simplex gradient?

It gives a simple and cheap approximation of the gradient from \(\displaystyle N+1 \) function values, with an explicit error bound (Lemma 6.2.1) in terms of the simplex size \(\displaystyle \sigma_+(S) \) and the conditioning \(\displaystyle \kappa(V) \), and it requires no analytic or automatic derivatives of \(\displaystyle f \).

5. Are there any limitations to using a Taylor series for estimating the simplex gradient?

It is only an approximation: the error grows with the simplex size and with the condition number \(\displaystyle \kappa(V) \), so a badly shaped (nearly degenerate) simplex can give a poor estimate. Higher accuracy, as in Lemma 6.2.5, requires extra function evaluations (for example on the reflected simplex) and stronger smoothness assumptions on \(\displaystyle f \).
