My textbook, Deep Learning by Goodfellow, Bengio, and Courville, says the following in a section on constrained optimization:
The Karush-Kuhn-Tucker (KKT) approach provides a very general solution to constrained optimization. With the KKT approach, we introduce a new function called the generalized Lagrangian or generalized Lagrange function.
To define the Lagrangian, we first need to describe $\mathbb{S}$ in terms of equations and inequalities. We want a description of $\mathbb{S}$ in terms of $m$ functions $g^{(i)}$ and $n$ functions $h^{(j)}$ so that $\mathbb{S} = \{ \boldsymbol{x} \mid \forall i, g^{(i)}(\boldsymbol{x}) = 0 \ \text{and} \ \forall j, h^{(j)} (\boldsymbol{x}) \le 0 \}$. The equations involving $g^{(i)}$ are called the equality constraints, and the inequalities involving $h^{(j)}$ are called the inequality constraints.
We introduce new variables $\lambda_i$ and $\alpha_j$ for each constraint; these are called the KKT multipliers. The generalized Lagrangian is then defined as
$$L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\boldsymbol{x}) + \sum_i \lambda_i g^{(i)} (\boldsymbol{x}) + \sum_j \alpha_j h^{(j)}(\boldsymbol{x}) \tag{4.14}$$
We can now solve a constrained minimization problem using unconstrained optimization of the generalized Lagrangian. As long as at least one feasible point exists and $f(\boldsymbol{x})$ is not permitted to have value $\infty$, then
$$\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) \tag{4.15}$$
has the same optimal objective function value and set of optimal points $\boldsymbol{x}$ as
$$\min_{\boldsymbol{x} \in \mathbb{S}} f(\boldsymbol{x}). \tag{4.16}$$
This follows because any time the constraints are satisfied,
$$\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\boldsymbol{x}),$$
while any time a constraint is violated,
$$\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = \infty.$$
These properties guarantee that no infeasible point can be optimal, and that the optimum within the feasible points is unchanged.
I'm having difficulty understanding how $$\min_{\boldsymbol{x}} \max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha})$$
has the same optimal objective function value and set of optimal points $\boldsymbol{x}$ as
$$\min_{\boldsymbol{x} \in \mathbb{S}} f(\boldsymbol{x}).$$
Specifically, I am not seeing why it holds that any time the constraints are satisfied,
$$\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = f(\boldsymbol{x}),$$
while any time a constraint is violated,
$$\max_{\boldsymbol{\lambda}} \max_{\boldsymbol{\alpha}, \boldsymbol{\alpha}\ge 0} L(\boldsymbol{x}, \boldsymbol{\lambda}, \boldsymbol{\alpha}) = \infty.$$
I would greatly appreciate it if people could please take the time to clarify this.
Starting with the case where the constraints are satisfied:
If the constraints are satisfied, then $g^{(i)}(x)=0$ and $h^{(j)}(x)\leq0$ for all $i$ and $j$. The terms with $\lambda_i$ therefore all vanish, and each term $\alpha_j h^{(j)}(x)$ attains its maximum over $\alpha_j \ge 0$ at $\alpha_j=0$ (since $\alpha_j \ge 0$ and $h^{(j)}(x)\le 0$ make the product nonpositive), so those terms vanish as well, leaving you with $f(x)$.
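A tiny worked instance (my own example, not from the book) makes this concrete: take a single inequality constraint $h(x) = x - 1 \le 0$ and the feasible point $x = 0$, where $h(0) = -1 < 0$. Then $$\max_{\alpha \ge 0}\left[f(0) + \alpha h(0)\right] = \max_{\alpha \ge 0}\left[f(0) - \alpha\right] = f(0),$$ with the maximum attained at $\alpha = 0$: any positive $\alpha$ can only decrease the value.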
On the other hand, suppose a constraint is not satisfied. If $g^{(i)}(x)\neq 0$ for some $i$, you can let $\lambda_i g^{(i)}(x)$ go to infinity by letting $\lambda_i \to \infty$ if $g^{(i)}(x)>0$, and $\lambda_i \to -\infty$ if $g^{(i)}(x)<0$. Similarly, if $h^{(j)}(x)>0$ for some $j$, you can let $\alpha_j h^{(j)}(x)$ go to $\infty$ by letting $\alpha_j \to \infty$.
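Both cases can also be checked numerically. Here is a minimal sketch (my own illustration, not from the book) for the toy problem of minimizing $f(x) = x^2$ subject to $g(x) = x - 1 = 0$ and $h(x) = -x \le 0$, so that $\mathbb{S} = \{1\}$:

```python
# Toy problem: minimize f(x) = x^2
# subject to g(x) = x - 1 = 0  and  h(x) = -x <= 0, so S = {1}.

def f(x):
    return x ** 2

def g(x):
    return x - 1.0   # equality constraint: g(x) = 0

def h(x):
    return -x        # inequality constraint: h(x) <= 0

def L(x, lam, alpha):
    """Generalized Lagrangian, Eq. (4.14), with one g and one h."""
    return f(x) + lam * g(x) + alpha * h(x)

# Feasible point x = 1: g(1) = 0 kills the lambda term, and since
# h(1) = -1 < 0, the alpha term is maximized over alpha >= 0 at
# alpha = 0. So the sup of L over the multipliers is f(1), whatever
# value lambda takes.
for lam in (-50.0, 0.0, 50.0):
    assert L(1.0, lam, 0.0) == f(1.0)
    assert L(1.0, lam, 10.0) < f(1.0)   # positive alpha only hurts

# Infeasible point x = 2: g(2) = 1 != 0, so pushing lambda toward
# infinity drives L toward infinity.
print([L(2.0, lam, 0.0) for lam in (1.0, 1e3, 1e6)])
# → [5.0, 1004.0, 1000004.0]
```

The printed values grow without bound as $\lambda$ grows, which is exactly why the $\max$ over the multipliers equals $\infty$ at any infeasible point.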