B Constrained Optimization with the KKT Approach

AI Thread Summary
The discussion revolves around the KKT (Karush-Kuhn-Tucker) approach to constrained optimization as presented in the book "Deep Learning." It highlights the formulation of the generalized Lagrangian and the process of minimizing a function subject to equality and inequality constraints. A key point is the necessity of maximizing over the Lagrange multipliers before minimizing over the variable to ensure that the solution remains within the feasible set defined by the constraints. Confusion arises regarding how unconstrained minimization can yield results constrained to the set S, which is clarified by emphasizing the role of the multipliers in enforcing the constraints. The conversation concludes with a focus on the correct order of operations in applying the KKT conditions to find the optimal solution.
SilverSoldier
I'm reading the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, and currently reading this chapter on numerical methods--specifically, the section on constrained optimization.

The book states the following.

Suppose we wish to minimize a function ##f\left(\textbf{x}\right)## over some set $$S=\left\{\textbf{x}\,|\,\forall\,i,g^{(i)}\left(\textbf{x}\right)=0,\,\forall\,j,h^{(j)}\left(\textbf{x}\right)\leq0\right\},$$ where ##g^{(i)}## and ##h^{(j)}## are functions called the equality constraints and inequality constraints, respectively; i.e., we wish to find ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##.

To do this, we start by defining a function ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## called the generalized Lagrangian, as follows: $$L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)+\sum_i\lambda_i\cdot g^{(i)}\left(\textbf{x}\right)+\sum_j\alpha_j\cdot h^{(j)}\left(\textbf{x}\right).$$ Then, $$\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right).$$ The way I understand this is that we start by picking some random values for ##\textbf{x}## and ##\boldsymbol{\lambda}##, and then find the best non-negative ##\boldsymbol{\alpha}## for which ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## is maximum.

Say the ##\textbf{x}## we picked randomly happened to lie in ##S##. Then we should end up at the result that ##\boldsymbol{\alpha}=\textbf{0}##.
It remains now to maximize ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## over ##\boldsymbol{\lambda}##, and then minimize the resulting function over ##\textbf{x}##. Again, because the ##\textbf{x}## we have picked randomly happens to lie in ##S##, ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)=f\left(\textbf{x}\right)## for every ##\boldsymbol{\lambda}##, since ##\forall\,i, g^{(i)}\left(\textbf{x}\right)=0##.

Therefore, in such a case where the ##\textbf{x}## we pick randomly just so happens to lie in ##S##, we see that $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right).$$ What I do not understand is how minimizing ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)## over ##\textbf{x}## now will keep ##\textbf{x}## within ##S##, and find a minimum in this constrained region.

If $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right),$$ mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?
 
SilverSoldier said:
It remains now to ..... minimize the resulting function over ##\textbf{x}##. .....

SilverSoldier said:
mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?
Yes. You already said that you need to minimize over ##\textbf{x}##. If ##\textbf{x}## was picked randomly, then you would have to repeat this process enough to feel confident that you have found a minimum over ##\textbf{x}##, or do a methodical search for the minimum.
 
FactChecker said:
Yes. You already said that you need to minimize over ##\textbf{x}##. If ##\textbf{x}## was picked randomly, then you would have to repeat this process enough to feel confident that you have found a minimum over ##\textbf{x}##, or do a methodical search for the minimum.
But it is necessary to find the minimum over values in ##S## only. How does an unconstrained minimization of ##f## over all values of ##\textbf{x}## result in its minimum over values in ##S##?
 
SilverSoldier said:
But it is necessary to find the minimum over values in ##S## only. How does an unconstrained minimization of ##f## over all values of ##\textbf{x}## result in its minimum over values in ##S##? :confused:
Oh, I am sorry. I missed the point. By maximizing over ##\boldsymbol{\lambda}## and ##\boldsymbol{\alpha} \ge 0## for any given ##\textbf{x}##, you are finding the multipliers that impose the largest penalty for violating the constraints ##g## and ##h##. So that forces the minimization over ##\textbf{x}## to stay within ##S##.

UPDATE: After thinking more about this, I do not feel I am knowledgeable enough to give authoritative help here. I will leave this to others.
 
I think FactChecker has it. If you pick a value of ##\textbf{x}## for which ##g^{(i)}\left(\textbf{x}\right)>0##, you can pick ##\lambda_i\gg0## and make ##L## arbitrarily large. Similarly, if ##g^{(i)}\left(\textbf{x}\right)<0##, you can pick ##\lambda_i\ll0##. So the inner maximum is infinite unless ##g^{(i)}\left(\textbf{x}\right)=0## for all ##i##, and minimizing over ##\textbf{x}## forces the equality constraints to hold.

But for the inequality constraints you only get ##h^{(j)}\left(\textbf{x}\right)\leq0##, since the maximization can send ##\alpha_j\to\infty## whenever ##h^{(j)}\left(\textbf{x}\right)>0##; when ##h^{(j)}\left(\textbf{x}\right)\leq0##, the maximizing choice is ##\alpha_j=0##.
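
To restate that penalty argument as a single worked equation: for any fixed ##\textbf{x}##, $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha},\,\boldsymbol{\alpha}\geq0}L\left(\textbf{x},\boldsymbol{\lambda},\boldsymbol{\alpha}\right)=\begin{cases}f\left(\textbf{x}\right)&\text{if }\textbf{x}\in S,\\+\infty&\text{otherwise}.\end{cases}$$ Off ##S##, some violated constraint lets its multiplier drive ##L## to ##+\infty##; on ##S##, every ##\lambda_i g^{(i)}\left(\textbf{x}\right)## equals ##0## and the maximizing choice of each ##\alpha_j## makes ##\alpha_j h^{(j)}\left(\textbf{x}\right)=0##, leaving exactly ##f\left(\textbf{x}\right)##. The outer minimization over ##\textbf{x}## therefore never selects a point outside ##S##, since every such point has value ##+\infty##.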
 
I'm sorry, I still don't understand, and I am quite confused. Shouldn't we have ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## or ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\max_{\boldsymbol{\lambda}}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## instead?

Say we wanted to find the minimum of ##\dfrac{x^2+y^2}{5}## subject to the constraints ##x^2+y-2\leq0## and ##10x-y-22=0##. The actual minimum, which is ##1.6## given these constraints, is achieved at ##x=2## and ##y=-2##. To find this minimum using the KKT approach, we first define the generalized Lagrangian, as follows: $$\dfrac{x^2+y^2}{5}+\lambda\left(10x-y-22\right)+\alpha\left(x^2+y-2\right).$$ Say we start by picking ##\left(1.8, -4\right)## randomly for ##\left(x,y\right)##. Notice that ##\left(1.8, -4\right)## satisfies both constraints, and so ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(1.8, -4\right)=3.848##. What does it now mean to minimize this? Does this mean we now vary ##x## and ##y##, looking for values that minimize ##\dfrac{x^2+y^2}{5}##?
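
As a check on these numbers: at ##\left(2,-2\right)## both constraints are active, and the KKT stationarity condition ##\nabla f+\lambda\nabla g+\alpha\nabla h=\textbf{0}## reads $$\begin{pmatrix}4/5\\-4/5\end{pmatrix}+\lambda\begin{pmatrix}10\\-1\end{pmatrix}+\alpha\begin{pmatrix}4\\1\end{pmatrix}=\begin{pmatrix}0\\0\end{pmatrix},$$ which is solved by ##\lambda=-\dfrac{2}{7}## and ##\alpha=\dfrac{18}{35}##. Since ##\alpha\geq0## and ##\alpha\cdot h\left(2,-2\right)=0##, all the KKT conditions hold at ##\left(2,-2\right)##.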

Can you also please describe how this approach may be implemented in code?
 
I think you're thinking about the ordering wrong. The first thing is the min over ##\textbf{x}##. So you can imagine having some function ##K\left(\textbf{x}\right)## that we are trying to minimize, where ##K\left(\textbf{x}\right)=\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)##. So the ordering in the OP says: for each ##\textbf{x}##, pick the worst (i.e., maximizing) choice of ##\boldsymbol{\lambda}## and ##\boldsymbol{\alpha}## to compute ##K\left(\textbf{x}\right)##. Then find the ##\textbf{x}## that makes ##K\left(\textbf{x}\right)## as small as possible.
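
Since code was requested: here is a minimal numerical sketch of that ordering in Python, on the example problem from the earlier post. The multiplier cap ##M##, the grid bounds, and the resolution are arbitrary illustrative choices, not anything from the book; capping the multipliers at ##M## makes the inner maximum computable in closed form, and it tends to ##f## on ##S## and to ##+\infty## off ##S## as ##M\to\infty##.

Code:
import numpy as np

# Example problem from the earlier post:
#   minimize f(x, y) = (x^2 + y^2)/5
#   subject to g(x, y) = 10x - y - 22 = 0 and h(x, y) = x^2 + y - 2 <= 0.

def f(x, y):
    return (x**2 + y**2) / 5

def g(x, y):
    return 10 * x - y - 22

def h(x, y):
    return x**2 + y - 2

def K(x, y, M=1e6):
    """Inner maximization of L over lambda in [-M, M] and alpha in [0, M].

    In closed form: the max over lambda of lambda*g is M*|g|, and the
    max over alpha >= 0 of alpha*h is M*max(h, 0). As M grows, this
    tends to f on the feasible set S and to +infinity off it.
    """
    return f(x, y) + M * abs(g(x, y)) + M * max(h(x, y), 0.0)

# Outer minimization over (x, y): a crude grid search.
grid = np.linspace(-5.0, 5.0, 501)
best_val, best_pt = np.inf, None
for x in grid:
    for y in grid:
        val = K(x, y)
        if val < best_val:
            best_val, best_pt = val, (x, y)

print(best_pt, best_val)   # near (2.0, -2.0) with value near 1.6

This is exactly the min-max order described above: ##K## is evaluated by maximizing over the multipliers first, and only then is ##K## minimized over ##\left(x,y\right)##.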
 
SilverSoldier said:
I'm sorry I still don't understand it and I am quite confused. Shouldn't we have ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## or ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\max_{\boldsymbol{\lambda}}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## instead?
If the first thing you do is minimize wrt ##x## then you are already too late to try to keep ##x## within ##S##. The first thing you must do is to maximize wrt ##\lambda## and ##\alpha## to force ##x## to stay within ##S##.
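
For completeness, here is one way to check the earlier example with an off-the-shelf solver; this is a sketch using SciPy's SLSQP method (my choice, not the book's recipe), which solves a KKT system internally. Note that SciPy's "ineq" convention is ##fun\left(v\right)\geq0##, so the constraint ##h\leq0## must be passed negated.

Code:
import numpy as np
from scipy.optimize import minimize

# Same example: minimize (x^2 + y^2)/5
# subject to 10x - y - 22 = 0 and x^2 + y - 2 <= 0.
res = minimize(
    fun=lambda v: (v[0]**2 + v[1]**2) / 5,
    x0=np.array([1.8, -4.0]),   # the feasible starting point from the thread
    method="SLSQP",             # sequential quadratic programming
    constraints=[
        {"type": "eq", "fun": lambda v: 10 * v[0] - v[1] - 22},
        # SciPy's "ineq" means fun(v) >= 0, so pass -(x^2 + y - 2):
        {"type": "ineq", "fun": lambda v: -(v[0]**2 + v[1] - 2)},
    ],
)
print(res.x, res.fun)   # approximately [2, -2] and 1.6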
 