Consider the problem
Maximize $f(\mathbf{x})$ subject to $g(\mathbf{x})=c$
Using the method of Lagrange multipliers, I would set up a Lagrangian like
$$L = f(\mathbf{x})-\lambda (g(\mathbf{x})-c)$$
I would then set $\frac{\partial L}{\partial x_1}=0$, $\frac{\partial L}{\partial x_2}=0$, ..., together with $\frac{\partial L}{\partial \lambda}=0$, and solve this system of equations to solve the problem.
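For example, on a toy problem this recipe looks like the following (a sketch in sympy; the specific $f$ and $g$ here are made up for illustration):

```python
import sympy as sp

# Toy instance of the recipe above (made-up f and g):
# maximize f(x1, x2) = x1*x2 subject to g(x1, x2) = x1 + x2 = 4.
x1, x2, lam = sp.symbols('x1 x2 lam', real=True)
f = x1 * x2
g = x1 + x2
L = f - lam * (g - 4)

# Set all partial derivatives of L (including d/d lambda) to zero and solve:
sols = sp.solve([sp.diff(L, v) for v in (x1, x2, lam)], [x1, x2, lam], dict=True)
# -> x1 = x2 = 2, lam = 2, giving the constrained maximum f = 4
```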
My Question:
What is $L$? Why does it have the special property that when I take these derivatives and solve, I can suddenly find the solution to the optimization problem? I just sort of compute without really understanding what I am doing. Why does this method work? I have heard that $\lambda$ is the scalar necessary to make $\nabla g(\mathbf{x})$ equal to $\nabla f(\mathbf{x})$, and that this has something to do with the normal vectors being parallel. But I don't really see how this helps me understand the role of $L$.
Let's start with the following setting. Suppose $$ \Sigma:=\left\{x\in\Bbb R^m \mid f(x)=0\in\Bbb R^q\right\},\qquad f\in\mathscr C^1(\Bbb R^m;\Bbb R^q),\quad 1\le q<m, $$ and there is a scalar field, the so-called "goal function" $\theta(x)\in\Bbb R$, on $\Sigma$. What we are going to do is seek $x_{*}$ such that $$\theta (x_*)=\sup_{x\in\Sigma}\theta(x)\quad\text{or}\quad\inf_{x\in\Sigma}\theta(x)$$ Now we apply the Implicit Function Theorem. We split $x\in\Bbb R^m$ into two parts $(\tilde{x},\hat{x})\in\Bbb R^p\times\Bbb R^q$ where $p+q=m$, so the constraint reads $f(\tilde{x},\hat{x})=0\in\Bbb R^q$. According to the theorem, if for all $x=(\tilde{x},\hat{x})\in \Sigma$ we have ($D$ denotes the Jacobian matrix) \begin{equation} \det (D_{\hat{x}}f)(x)\ne 0, \end{equation} then, at least locally, there exist a parameter domain $U_{\Sigma}\subset\Bbb R^p$ and an implicit function \begin{equation} \xi:U_{\Sigma}\ni\tilde{x}\mapsto\xi(\tilde{x})\in \Bbb R^q \end{equation} which is determined by the constraint $f(\tilde{x},\xi(\tilde{x}))=0\in\Bbb R^q$, or equivalently $(\tilde{x},\xi(\tilde{x}))\in\Sigma$.
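As a concrete sketch of this setup, take $m=2$, $p=q=1$, and the unit circle as the constraint set (this particular $f$, and the variable names `xt` for $\tilde{x}$ and `xh` for $\hat{x}$, are illustrative choices, not part of the general argument):

```python
import sympy as sp

# Constraint f(xt, xh) = xt^2 + xh^2 - 1 = 0, the unit circle (m = 2, p = q = 1).
xt = sp.symbols('xt', real=True)       # tilde-x, the free part
xh = sp.symbols('xh', positive=True)   # hat-x, restricted to the upper half circle
f = xt**2 + xh**2 - 1

# D_xh f = 2*xh != 0 on the upper half circle, so the Implicit Function
# Theorem applies and we can solve the constraint for xh:
xi = sp.solve(f, xh)[0]                # xi(xt) = sqrt(1 - xt^2)

# xi satisfies the constraint identically: f(xt, xi(xt)) = 0
assert sp.simplify(f.subs(xh, xi)) == 0
```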
Here comes the key part: we are going to recognize $\Sigma$ as a "hypersurface", a $p$-dimensional object sitting inside $\Bbb R^{p+q}$, and then parameterize it with the aid of the implicit function: if we regard $U_{\Sigma}$ as the "parameter domain" for the hypersurface $\Sigma$, then we are immediately able to define the parametrization mapping $\sigma$ for $\Sigma$ as \begin{equation}\sigma:U_{\Sigma}\ni\tilde{x}\mapsto \sigma(\tilde{x}):=(\tilde{x},\xi(\tilde{x}))\in\Sigma\subset \Bbb R^m\end{equation} Hence we can rewrite the "goal function" $\theta(x)$ as \begin{equation}\Theta:U_{\Sigma}\ni\tilde{x}\mapsto \Theta(\tilde{x}):=(\theta\circ\sigma)(\tilde{x})\in\Bbb R\end{equation} The significant difference between the original form $\theta(x)$ and the rewritten form $\Theta(\tilde{x})$ is that the latter is defined directly on an open domain $U_{\Sigma}$, "freed" from any constraint. Therefore, if we are to seek local extrema of $\Theta(\tilde{x})$, all we have to do is set \begin{equation} (D\Theta)(\tilde{x})=D(\theta\circ\sigma)(\tilde{x})=0\in\Bbb R^{1\times p} \end{equation}
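Continuing the unit-circle sketch with an illustrative goal function $\theta(\tilde{x},\hat{x})=\tilde{x}+\hat{x}$ (again a made-up choice), the pullback $\Theta=\theta\circ\sigma$ really is an unconstrained function of one variable:

```python
import sympy as sp

# Same toy setting: constraint xt^2 + xh^2 = 1 (upper half), theta = xt + xh.
xt = sp.symbols('xt', real=True)
xi = sp.sqrt(1 - xt**2)      # the implicit function on the upper half circle
Theta = xt + xi              # Theta = theta o sigma, unconstrained on (-1, 1)

# Local extrema of Theta: set D(Theta) = 0 on the open parameter domain.
crit = sp.solve(sp.diff(Theta, xt), xt)
# the only interior critical point is xt = sqrt(2)/2,
# i.e. (xt, xh) = (1/sqrt(2), 1/sqrt(2)), the maximum of xt + xh on the arc
```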
By the Chain Rule, we have $$ D(\theta\circ\sigma)(\tilde{x})=(D\theta)(\sigma(\tilde{x}))\,(D\sigma)(\tilde{x}) $$ Note that $\sigma(\tilde{x})=(\tilde{x},\xi(\tilde{x}))$, hence $$(D\theta)(\cdot)=\left[(D_{\tilde{x}}\theta)(\cdot),(D_{\hat{x}}\theta)(\cdot)\right]$$ and ($I_p$ denotes the $p\times p$ identity matrix) $$(D\sigma)(\tilde{x})=\begin{bmatrix} I_p \\ (D\xi)(\tilde{x}) \end{bmatrix}$$ so we obtain \begin{equation} D(\theta\circ\sigma)(\tilde{x})=(D_{\tilde{x}}\theta)(x)+(D_{\hat{x}}\theta)(x)(D\xi)(\tilde{x})=0\in\Bbb R^{1\times p} \end{equation} Again, aided by the Implicit Function Theorem (differentiate $f(\tilde{x},\xi(\tilde{x}))=0$ with respect to $\tilde{x}$), we have \begin{equation} (D\xi)(\tilde{x})=-(D_{\hat{x}}f)^{-1}(x)(D_{\tilde{x}}f)(x) \end{equation} Plugging it into the previous equation, we obtain \begin{equation}(D_{\tilde{x}}\theta)(x)-(D_{\hat{x}}\theta)(x)(D_{\hat{x}}f)^{-1}(x)(D_{\tilde{x}}f)(x)=0\in\Bbb R^{1\times p}\end{equation} Together with the constraint $f(x)=0\in\Bbb R^q$, we have $$ \left\{ \begin{array}{l} (D_{\tilde{x}}\theta)(x)-(D_{\hat{x}}\theta)(x)(D_{\hat{x}}f)^{-1}(x)(D_{\tilde{x}}f)(x)=0\in\Bbb R^{1\times p}\\ f(x)=0\in\Bbb R^q \end{array} \right. $$ Provided that $\Sigma$ is compact, these $m$ equations determine all the possible $x_*$ that are not located on $\partial\Sigma$; this system, in my opinion, is the intrinsic form of the so-called Lagrange multiplier conditions.
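On the toy example, the intrinsic system above (with $\tilde{x}=x$, $\hat{x}=y$, so $p=q=1$ and both Jacobians are scalars) can be written down and solved directly:

```python
import sympy as sp

# Intrinsic system on the toy example: theta = x + y, f = x^2 + y^2 - 1.
x, y = sp.symbols('x y', real=True)
theta = x + y
f = x**2 + y**2 - 1

# (D_xt theta) - (D_xh theta)(D_xh f)^(-1)(D_xt f) = 0, together with f = 0.
# Here every block is 1x1, so the matrix inverse is just division by D_y f.
intrinsic = sp.diff(theta, x) - sp.diff(theta, y) * sp.diff(f, x) / sp.diff(f, y)
sols = sp.solve([intrinsic, f], [x, y], dict=True)
# two interior critical points, x = y = ±sqrt(2)/2 (valid where D_y f = 2y != 0)
```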
To see how the common Lagrange function "coincides" with this form, let $$L:\Bbb R^m\times\Bbb R^q\ni (x,\lambda)\mapsto L(x,\lambda):=\theta(x)+\lambda^Tf(x)\in\Bbb R$$ (this differs from the $f-\lambda(g-c)$ convention in the question only by renaming $\lambda\to-\lambda$). Differentiate $L$ and we get \begin{align*} (DL)(x,\lambda)&=(DL)(\tilde{x},\hat{x},\lambda)=\left[(D_{\tilde{x}}L),(D_{\hat{x}}L),(D_{\lambda}L)\right](x,\lambda)\\ &=\left[(D_{\tilde{x}}\theta)(x)+\lambda^T(D_{\tilde{x}}f)(x),(D_{\hat{x}}\theta)(x)+\lambda^T(D_{\hat{x}}f)(x),(f(x))^T\right]\\ &=\left[0\in\Bbb R^{1\times p},0\in\Bbb R^{1\times q},0\in\Bbb R^{1\times q}\right] \end{align*} From the middle block it follows that $$(D_{\hat{x}}\theta)(x)+\lambda^T(D_{\hat{x}}f)(x)=0\in\Bbb R^{1\times q}\implies \lambda^T=-(D_{\hat{x}}\theta)(x)(D_{\hat{x}}f)^{-1}(x)\in\Bbb R^{1\times q}$$ Plugging this into $$(D_{\tilde{x}}\theta)(x)+\lambda^T(D_{\tilde{x}}f)(x)=0\in\Bbb R^{1\times p}$$ yields $$(D_{\tilde{x}}\theta)(x)-(D_{\hat{x}}\theta)(x)(D_{\hat{x}}f)^{-1}(x)(D_{\tilde{x}}f)(x)=0\in\Bbb R^{1\times p}$$ Together with $(f(x))^T=0\in\Bbb R^{q}$, we have returned to the $m$ equations of the intrinsic form. So $\lambda$ is nothing mysterious: it is precisely the combination $-(D_{\hat{x}}\theta)(D_{\hat{x}}f)^{-1}$ that eliminating $\hat{x}$ would have produced anyway, and introducing it as an extra unknown lets us write the conditions without ever choosing a split $x=(\tilde{x},\hat{x})$.
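One can check this coincidence on the same toy problem: the Lagrange function reproduces the two critical points of the intrinsic system, and the solved-out $\lambda$ matches the formula $\lambda^T=-(D_{\hat{x}}\theta)(D_{\hat{x}}f)^{-1}=-1/(2y)$:

```python
import sympy as sp

# Same toy problem via the Lagrange function L = theta + lambda^T f.
x, y, lam = sp.symbols('x y lam', real=True)
theta = x + y
f = x**2 + y**2 - 1
L = theta + lam * f

# DL = 0 in all of x, y, lambda:
sols = sp.solve([sp.diff(L, v) for v in (x, y, lam)], [x, y, lam], dict=True)
# same two critical points x = y = ±sqrt(2)/2 as the intrinsic system,
# with lambda = -1/(2y), i.e. lambda^T = -(D_xh theta)(D_xh f)^(-1)
```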