Constrained Optimization with the KKT Approach

SUMMARY

The discussion centers on constrained optimization using the Karush-Kuhn-Tucker (KKT) approach as outlined in the book "Deep Learning" by Ian Goodfellow et al. The generalized Lagrangian function, defined as $$L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)+\sum_i\lambda_i\cdot g^{(i)}\left(\textbf{x}\right)+\sum_j\alpha_j\cdot h^{(j)}\left(\textbf{x}\right)$$, is crucial for finding the minimum of a function subject to equality and inequality constraints. The process involves maximizing over the Lagrange multipliers $$\boldsymbol{\lambda}$$ and $$\boldsymbol{\alpha}$$ to ensure that the solution remains within the feasible set $$S$$. The discussion emphasizes the necessity of this order of operations to maintain the constraints while minimizing the objective function.

PREREQUISITES
  • Understanding of constrained optimization principles
  • Familiarity with the Karush-Kuhn-Tucker (KKT) conditions
  • Knowledge of Lagrangian multipliers
  • Basic calculus and optimization techniques
NEXT STEPS
  • Study the derivation and application of the KKT conditions in optimization problems
  • Explore numerical methods for solving constrained optimization problems
  • Learn about the implementation of Lagrangian methods in programming languages such as Python
  • Investigate real-world applications of constrained optimization in machine learning
USEFUL FOR

Researchers, data scientists, and machine learning practitioners interested in optimization techniques, particularly those working with constrained optimization problems in deep learning and numerical methods.

SilverSoldier
I'm reading the book Deep Learning by Ian Goodfellow, Yoshua Bengio, and Aaron Courville, and currently reading this chapter on numerical methods--specifically, the section on constrained optimization.

The book states the following.

Suppose we wish to minimize a function ##f\left(\textbf{x}\right)## over some set $$S=\left\{\textbf{x}\,|\,\forall\,i,g^{(i)}\left(\textbf{x}\right)=0,\,\forall\,j,h^{(j)}\left(\textbf{x}\right)\leq0\right\},$$ where ##g^{(i)}## and ##h^{(j)}## are functions called the equality constraints and inequality constraints respectively; i.e., we wish to find ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##.

To do this, we start by defining a function ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## called the generalized Lagrangian as follows; $$L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right)+\sum_i\lambda_i\cdot g^{(i)}\left(\textbf{x}\right)+\sum_j\alpha_j\cdot h^{(j)}\left(\textbf{x}\right).$$ Then, $$\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right).$$ The way I understand this is that we start by picking some random values for ##\textbf{x}## and ##\boldsymbol{\lambda}##, and then find the best non-negative ##\boldsymbol{\alpha}## for which ##L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## is maximum.
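For concreteness, here is a minimal sketch of the generalized Lagrangian as Python code; the function names and the list-of-callables representation are illustrative choices of mine, not anything from the book:

```python
def generalized_lagrangian(f, gs, hs):
    """Return L(x, lam, alpha) = f(x) + sum_i lam[i]*g_i(x) + sum_j alpha[j]*h_j(x).

    f      -- objective, maps x to a float
    gs, hs -- lists of equality / inequality constraint functions
    """
    def L(x, lam, alpha):
        return (f(x)
                + sum(l * g(x) for l, g in zip(lam, gs))
                + sum(a * h(x) for a, h in zip(alpha, hs)))
    return L

# Toy usage: f(x) = x**2 with one equality constraint g(x) = x - 1 = 0.
L = generalized_lagrangian(lambda x: x**2, [lambda x: x - 1], [])
print(L(0.0, [5.0], []))   # 0 + 5*(-1) = -5.0
```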

Say the ##\textbf{x}## we picked randomly happened to lie in ##S##. Then we should end up at the result that ##\boldsymbol{\alpha}=\textbf{0}##.
It remains now to maximize ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## over ##\boldsymbol{\lambda}##, and then minimize the resulting function over ##\textbf{x}##. Again, because the ##\textbf{x}## we have picked randomly happens to lie in ##S##, ##L\left(\textbf{x}, \boldsymbol{\lambda}, \textbf{0}\right)## is already at its maximum value ##f\left(\textbf{x}\right)## for any ##\boldsymbol{\lambda}##, because ##\forall\,i, g^{(i)}\left(\textbf{x}\right)=0##.

Therefore, in such a case where the ##\textbf{x}## we pick randomly just so happens to lie in ##S##, we see that $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right).$$ What I do not understand is how now minimizing ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## over ##\textbf{x}## will keep ##\textbf{x}## within ##S## and find a minimum in this constrained region.

If $$\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(\textbf{x}\right),$$ mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?
 
SilverSoldier said:
It remains now to … minimize the resulting function over ##\textbf{x}## …

SilverSoldier said:
mustn't we have $$\min_{\textbf{x}}\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=\min_{\textbf{x}}f\left(\textbf{x}\right),$$ which is different from ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)##?
Yes. You already said that you need to minimize over ##\textbf{x}##. If ##\textbf{x}## was picked randomly, then you would have to repeat this process enough to feel confident that you have found a minimum over ##\textbf{x}##, or do a methodical search for the minimum.
 
FactChecker said:
Yes. You already said that you need to minimize over ##\textbf{x}##. If ##\textbf{x}## was picked randomly, then you would have to repeat this process enough to feel confident that you have found a minimum over ##\textbf{x}##, or do a methodical search for the minimum.
But it is necessary to find the minimum over values in ##S## only. How does an unconstrained minimization of ##f## over all values of ##\textbf{x}## result in its minimum over values in ##S##? :confused:
 
SilverSoldier said:
But it is necessary to find the minimum over values in ##S## only. How does an unconstrained minimization of ##f## over all values of ##\textbf{x}## result in its minimum over values in ##S##? :confused:
Oh, I am sorry. I missed the point. By maximizing over ##\lambda_x## and ##\alpha_x \ge 0## for any given ##x##, you are finding the multipliers that provide the best penalty for violating the constraints of ##g## and ##h##. So that forces the minimization over ##x## to stay within ##S##.

UPDATE: After thinking more about this, I do not feel I am knowledgeable enough to give authoritative help here. I will leave this to others.
 
I think FactChecker has it. If you pick a value of ##\textbf{x}## for which ##g^{(i)}\left(\textbf{x}\right)>0##, you can pick ##\lambda_i\gg0## and ##L## becomes arbitrarily large. Similarly, if ##g^{(i)}\left(\textbf{x}\right)<0##, you can pick ##\lambda_i\ll0##. So the outer minimization of ##L## is forced onto points where ##g^{(i)}\left(\textbf{x}\right)=0##.

But for the inequality constraints you only get ##h^{(j)}\left(\textbf{x}\right)\leq0##, since otherwise we would make ##\alpha_j\gg0##.
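A quick numerical illustration of this penalty argument, using a made-up one-dimensional problem (minimize ##f(x)=x^2## subject to ##g(x)=x-1=0##; all names here are illustrative): at an infeasible point the inner maximization can drive ##L## to ##+\infty##, while at a feasible point the multipliers can contribute nothing.

```python
# Toy problem (illustrative only): minimize f(x) = x**2 subject to g(x) = x - 1 = 0.
f = lambda x: x**2
g = lambda x: x - 1

L = lambda x, lam: f(x) + lam * g(x)

# Infeasible x = 0: g(0) = -1, so choosing lam << 0 makes L arbitrarily large.
for lam in (0.0, -1e3, -1e6):
    print(f"x=0, lam={lam:>10}: L = {L(0.0, lam)}")

# Feasible x = 1: g(1) = 0, so L = f(1) = 1.0 no matter what lam is.
for lam in (0.0, -1e3, -1e6):
    print(f"x=1, lam={lam:>10}: L = {L(1.0, lam)}")
```

So ##\max_{\lambda}L## is ##+\infty## off the constraint set and ##f(x)## on it, which is exactly why the outer minimization lands inside ##S##.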
 
I'm sorry I still don't understand it and I am quite confused. Shouldn't we have ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## or ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\max_{\boldsymbol{\lambda}}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## instead?

Say we wanted to find the minimum of ##\dfrac{x^2+y^2}{5}## subject to the constraints ##x^2+y-2\leq0## and ##10x-y-22=0##. The actual minimum given these constraints is ##1.6##, achieved at ##x=2## and ##y=-2##. To find this minimum using the KKT approach, we first define the generalized Lagrangian, as follows; $$\dfrac{x^2+y^2}{5}+\lambda\left(10x-y-22\right)+\alpha\left(x^2+y-2\right).$$ Say we start by picking ##\left(1.8, -4\right)## randomly for ##\left(x,y\right)##. Notice that ##\left(1.8, -4\right)## satisfies both constraints, and so ##\displaystyle\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)=f\left(1.8, -4\right)\approx3.85##. What does it now mean to minimize this? Does this mean we now vary ##x## and ##y##, looking for values that minimize ##\dfrac{x^2+y^2}{5}##?

Can you also please describe how this approach may be implemented in code?
 
I think you're thinking about the ordering wrong. The first thing is the min over ##x##. So you can imagine having some function ##K(x)## that we are trying to minimize, where ##K(x) = \max_{\lambda}\max_{\alpha, \alpha\geq0} L(x,\lambda,\alpha)##. So the ordering in the OP says: for each ##x##, pick the worst choice of ##\lambda## and ##\alpha## to compute ##K(x)##. Then find the ##x## that makes this as small as possible.
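Here is a rough sketch of that reading in code, applied to the example problem from the previous post. The inner ##K(\textbf{x})## can be evaluated directly, since it is ##+\infty## whenever a constraint is violated and ##f(\textbf{x})## otherwise; the outer minimization is then a crude grid search along the equality-constraint line ##y=10x-22##. The grid and tolerance are arbitrary choices of mine.

```python
import numpy as np

f = lambda x, y: (x**2 + y**2) / 5
g = lambda x, y: 10*x - y - 22        # must equal 0
h = lambda x, y: x**2 + y - 2         # must be <= 0

def K(x, y, tol=1e-9):
    """K(x, y) = max over lam and alpha >= 0 of L(x, y, lam, alpha)."""
    if abs(g(x, y)) > tol:            # some lam drives L to +inf
        return np.inf
    if h(x, y) > tol:                 # some alpha >= 0 drives L to +inf
        return np.inf
    return f(x, y)                    # feasible: the multipliers contribute 0

# Outer min: search along y = 10*x - 22 so the equality constraint holds.
xs = np.linspace(-15.0, 5.0, 20001)
vals = np.array([K(x, 10*x - 22) for x in xs])
i = int(np.argmin(vals))
print(xs[i], 10*xs[i] - 22, vals[i])  # ~2.0  ~-2.0  ~1.6
```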
 
SilverSoldier said:
I'm sorry I still don't understand it and I am quite confused. Shouldn't we have ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\lambda}}\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## or ##\displaystyle\min_{\textbf{x}\in S}f\left(\textbf{x}\right)=\max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\geq0}\max_{\boldsymbol{\lambda}}\min_{\textbf{x}}L\left(\textbf{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}\right)## instead?
If the first thing you do is minimize wrt ##x## then you are already too late to try to keep ##x## within ##S##. The first thing you must do is to maximize wrt ##\lambda## and ##\alpha## to force ##x## to stay within ##S##.
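As for implementing this in code (asked above): in practice one would hand the problem to an off-the-shelf constrained solver rather than run the min-max literally. Here is a sketch using SciPy's SLSQP method on the example problem from this thread; note that SciPy's inequality convention is fun(v) >= 0, so ##h(\textbf{x})\leq0## is passed as ##-h##. This is a generic solver, not the book's derivation.

```python
import numpy as np
from scipy.optimize import minimize

f = lambda v: (v[0]**2 + v[1]**2) / 5
g = lambda v: 10*v[0] - v[1] - 22          # equality: g(v) = 0
h = lambda v: v[0]**2 + v[1] - 2           # inequality: h(v) <= 0

res = minimize(
    f,
    x0=np.array([1.8, -4.0]),              # the feasible start used above
    method="SLSQP",
    constraints=[
        {"type": "eq",   "fun": g},
        {"type": "ineq", "fun": lambda v: -h(v)},  # SciPy wants fun(v) >= 0
    ],
)
print(res.x, res.fun)                      # ~[ 2. -2.]  ~1.6
```

SLSQP is a sequential quadratic programming method, i.e. it works with a quadratic model of the Lagrangian internally, which is why it fits the KKT framing of this thread.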
 
