Constrained Optimization with the KKT Approach

SilverSoldier
I'm reading the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, and am currently on the chapter on numerical computation, specifically the section on constrained optimization.

The book states the following.

Suppose we wish to minimize a function ##f\left(\textbf{x}\right)## over some set $$S=\left\{\textbf{x}\,|\,\forall\,i,g^{(i)}\left(\textbf{x}\right)=0,\,\forall\,j,h^{(j)}\left(\textbf{x}\right)\leq0\right\},$$ where ##g^{(i)}## and ##h^{(j)}## are functions called the equality constraints and inequality constraints respectively; i.e., we wish to find ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##.

To do this, we start by defining a function ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## called the generalized Lagrangian as follows: $$L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)+\sum_i\lambda_i\cdot g^{(i)}\left(\textbf{x}\right)+\sum_j\alpha_j\cdot h^{(j)}\left(\textbf{x}\right).$$ Then, $$\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right).$$ The way I understand this is that we start by picking some random values for ##\textbf{x}## and ##\boldsymbol{\lambda}##, and then find the best non-negative ##\boldsymbol{\alpha}## for which ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## is maximized.

Say the ##\textbf{x}## we picked randomly happened to lie in ##S##. Then we should end up with the result ##\boldsymbol{\alpha}=\textbf{0}##.
It remains now to maximize ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## over ##\boldsymbol{\lambda}##, and then minimize the resulting function over ##\textbf{x}##. Again, because the ##\textbf{x}## we have picked randomly happens to lie in ##S##, ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## is already at its maximum for any ##\boldsymbol{\lambda}##, because ##\forall\,i,\,g^{(i)}\left(\textbf{x}\right)=0##.

Therefore, in such a case where the ##\textbf{x}## we pick randomly just so happens to lie in ##S##, we see that $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right).$$ What I do not understand is how minimizing ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## over ##\textbf{x}## will now keep ##\textbf{x}## within ##S## and find a minimum in this constrained region.

If $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right),$$ mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?
 
SilverSoldier said:
It remains now to ..... minimize the resulting function over ##\textbf{x}##. .....

SilverSoldier said:
mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?
Yes. You already said that you need to minimize over ##\textbf{x}##. If ##\textbf{x}## was picked randomly, then you would have to repeat this process enough to feel confident that you have found a minimum over ##\textbf{x}##, or do a methodical search for the minimum.
 
FactChecker said:
Yes. You already said that you need to minimize over ##\textbf{x}##. If ##\textbf{x}## was picked randomly, then you would have to repeat this process enough to feel confident that you have found a minimum over ##\textbf{x}##, or do a methodical search for the minimum.
But it is necessary to find the minimum over values in ##S## only. How does an unconstrained minimization of ##f## over all values of ##\textbf{x}## result in its minimum over values in ##S##? :confused:
 
SilverSoldier said:
But it is necessary to find the minimum over values in ##S## only. How does an unconstrained minimization of ##f## over all values of ##\textbf{x}## result in its minimum over values in ##S##? :confused:
Oh, I am sorry. I missed the point. By maximizing over ##\boldsymbol{\lambda}## and ##\boldsymbol{\alpha} \ge \textbf{0}## for any given ##\textbf{x}##, you are finding the multipliers that provide the best penalty for violating the ##g## and ##h## constraints. So that forces the minimization over ##\textbf{x}## to stay within ##S##.

UPDATE: After thinking more about this, I do not feel I am knowledgeable enough to give authoritative help here. I will leave this to others.
 
I think FactChecker has it. If you pick a value of ##\textbf{x}## for which ##g^{(i)}\left(\textbf{x}\right)>0##, you can pick ##\lambda_i\gg0## and make ##L## arbitrarily large. Similarly, if ##g^{(i)}\left(\textbf{x}\right)<0##, you can pick ##\lambda_i\ll0##. So minimizing ##L## over ##\textbf{x}## forces ##g^{(i)}\left(\textbf{x}\right)=0##.

But for the inequality constraints you only get ##h^{(j)}\left(\textbf{x}\right)\leq0##, since otherwise we can make ##\alpha_j\gg0##.
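To make this precise (spelling out the standard argument): for a fixed ##\textbf{x}##, each inner maximization has a closed form, $$\max_{\lambda_i}\lambda_i\,g^{(i)}\left(\textbf{x}\right)=\begin{cases}0,&g^{(i)}\left(\textbf{x}\right)=0,\\+\infty,&\text{otherwise,}\end{cases}\qquad\max_{\alpha_j\geq0}\alpha_j\,h^{(j)}\left(\textbf{x}\right)=\begin{cases}0,&h^{(j)}\left(\textbf{x}\right)\leq0,\\+\infty,&h^{(j)}\left(\textbf{x}\right)>0,\end{cases}$$ so that $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha},\,\boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\begin{cases}f\left(\textbf{x}\right),&\textbf{x}\in S,\\+\infty,&\textbf{x}\notin S.\end{cases}$$ The outer minimization over ##\textbf{x}## can therefore never prefer a point outside ##S##, since every such point has value ##+\infty##.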
 
I'm sorry, but I still don't understand, and I am quite confused. Shouldn't we have ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## or ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\max_{\boldsymbol{\lambda}}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## instead?

Say we wanted to find the minimum of ##\dfrac{x^2+y^2}{5}## subject to the constraints ##x^2+y-2\leq0## and ##10x-y-22=0##. The actual minimum, which is ##1.6## given these constraints, is achieved at ##x=2## and ##y=-2##. To find this minimum using the KKT approach, we first define the generalized Lagrangian as follows: $$\dfrac{x^2+y^2}{5}+\lambda\left(10x-y-22\right)+\alpha\left(x^2+y-2\right).$$ Say we start by picking ##\left(1.8, -4\right)## randomly for ##\left(x,y\right)##. Notice that ##\left(1.8, -4\right)## satisfies both constraints, and so ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(1.8, -4\right)=3.85##. What does it now mean to minimize this? Does this mean we now vary ##x## and ##y##, looking for values that minimize ##\dfrac{x^2+y^2}{5}##?

Can you also please describe how this approach may be implemented in code?
 
I think you're thinking about the ordering wrong. The first thing is the min over ##x##. So you can imagine having some function ##K(x)## that we are trying to minimize, where $$K(x)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha},\,\boldsymbol{\alpha}\geq0}L(x,\boldsymbol{\lambda},\boldsymbol{\alpha}).$$ So the ordering in the OP says: for each ##x##, pick the worst choice of ##\boldsymbol{\lambda}## and ##\boldsymbol{\alpha}## to compute ##K(x)##, then find the ##x## that makes this as small as possible.
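On the code question: below is a minimal numeric sketch in Python with SciPy (my own illustration, not from the book or the thread). It approximates the inner maximization by capping the multipliers at a large bound ##B##, which turns ##K## into the exact penalty ##f(x)+B\,\left|g(x)\right|+B\max\left(h(x),0\right)##; for this convex example, any ##B## larger than the optimal multipliers yields the true constrained minimizer. The starting point and the value of ##B## are arbitrary choices.

```python
from scipy.optimize import minimize

def f(v):                 # objective: (x^2 + y^2) / 5
    x, y = v
    return (x**2 + y**2) / 5.0

def g(v):                 # equality constraint: 10x - y - 22 = 0
    x, y = v
    return 10.0*x - y - 22.0

def h(v):                 # inequality constraint: x^2 + y - 2 <= 0
    x, y = v
    return x**2 + y - 2.0

B = 100.0                 # cap on |lambda| and alpha; the true inner max is the limit B -> infinity

def K(v):
    # Inner maximization in closed form over the capped multipliers:
    #   max over lambda in [-B, B] of lambda * g(v)  =  B * |g(v)|
    #   max over alpha  in [0, B]  of alpha  * h(v)  =  B * max(h(v), 0)
    return f(v) + B * abs(g(v)) + B * max(h(v), 0.0)

# Outer minimization over (x, y); Nelder-Mead, since K is not smooth.
res = minimize(K, x0=[1.8, -4.0], method="Nelder-Mead",
               options={"xatol": 1e-9, "fatol": 1e-9, "maxiter": 10000})
print(res.x, f(res.x))    # should approach (2, -2), where f = 1.6
```

With the cap removed (##B\to\infty##), ##K(x)## equals ##f(x)## on ##S## and ##+\infty## off it, which is exactly what forces the outer minimization to respect the constraints.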
 
SilverSoldier said:
I'm sorry I still don't understand it and I am quite confused. Shouldn't we have ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## or ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\max_{\boldsymbol{\lambda}}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## instead?
If the first thing you do is minimize wrt ##x## then you are already too late to try to keep ##x## within ##S##. The first thing you must do is to maximize wrt ##\lambda## and ##\alpha## to force ##x## to stay within ##S##.
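As a quick numeric illustration of this (same toy example as earlier in the thread; the bound ##B## is a stand-in for "##\lambda## and ##\alpha## as large as needed"): the unconstrained minimizer of ##f## is ##(0,0)##, but after the inner maximization that point is heavily penalized, so the outer minimization over ##x## avoids it.

```python
# Same example as above; B stands in for arbitrarily large multipliers.
f = lambda x, y: (x**2 + y**2) / 5.0
g = lambda x, y: 10.0*x - y - 22.0       # equality constraint
h = lambda x, y: x**2 + y - 2.0          # inequality constraint
B = 100.0
K = lambda x, y: f(x, y) + B*abs(g(x, y)) + B*max(h(x, y), 0.0)

print(K(1.8, -4.0))   # feasible point:   about 3.848, equal to f(1.8, -4)
print(K(0.0, 0.0))    # infeasible point: 2200.0, the penalty term dominates
```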
 