I am learning about the Lagrange multiplier. Here's what I understand so far.
Suppose a point $P$ is a minimizer of $f(x)$ subject to $g(x)=0$. Then any movement along that level curve of $g$ must leave $f$ unaffected, at least to first order, because otherwise (assuming $f$ is smooth) we could move along the curve to a point where $f$ is lower than at $P$, contradicting minimality. So the level curve of $g$ must be perpendicular to the gradient of $f$, denoted $\nabla f$.
I think the gradient of $g$ is perpendicular to the level curve of $g$ for a similar reason. So the gradients of both $g$ and $f$ are perpendicular to the level curve of $g$.
Therefore we can express this situation as
$$\nabla f(P) = -\lambda \nabla g(P)$$ with a conventional minus sign, and also keep in mind that $$g(P) = 0.$$
These constitute our two equations for minimizing $f$ subject to $g(x)=0$. I think this should be enough information to just solve it from here.
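For example (a toy problem I made up to check my understanding): take $f(x,y)=x^2+y^2$ with the constraint $g(x,y)=x+y-1=0$. The pair of equations reads $$\nabla f(P) = -\lambda \nabla g(P) \iff (2x,\, 2y) = -\lambda\,(1,\, 1),$$ so $x=y=-\lambda/2$, and substituting into the constraint $x+y=1$ gives $\lambda=-1$ and $x=y=\tfrac12$. So the pair of equations really does suffice here.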
But I often see a single-equation formulation written something like $$L(x,\lambda) = f(x) - \lambda g(x).$$
Questions (or clusters of questions):
- What is the relationship between these two formulations? It seems like the first is just some derivative of the second? How would we arrive at the single-equation formulation?
- What is the point of the second if the first (i.e. the pair of equations) suffices to solve the problem?
- In an applied problem would we be trying to get from the first to the second? It seems like the first is a manageable problem, and I don't really understand where the second comes from.
The relationship between the two formulations is that the partial derivatives of the second formulation give the vector components of the first, together with the constraint $g(\textbf x)=0$. To use the approach given by the second formulation, one finds the partial derivatives of $L$ with respect to $x_1, x_2, \dots$ and $\lambda$ and sets them all to zero: $$\frac{\partial L}{\partial x_1}=\frac{\partial f}{\partial x_1}-\lambda\frac{\partial g}{\partial x_1} \tag1$$ $$\frac{\partial L}{\partial x_2}=\frac{\partial f}{\partial x_2}-\lambda\frac{\partial g}{\partial x_2} \tag 2$$ $$\vdots$$ $$\frac{\partial L}{\partial \lambda}=-g(\textbf x) \tag 3$$ Now look at $\nabla f(\textbf x) = \lambda \nabla g(\textbf x)$ (this is your first equation; the sign attached to $\lambda$ is only a convention, so writing $-\lambda$ or $\lambda$ makes no difference): $$\left\langle \frac{ \partial f}{\partial x_1}, \frac{\partial f}{ \partial x_2}, \dots \right\rangle=\lambda\left\langle \frac{ \partial g}{\partial x_1}, \frac{\partial g}{\partial x_2}, \dots \right\rangle$$ $$\implies \frac{\partial f}{\partial x_1}-\lambda\frac{\partial g}{\partial x_1}=0, \qquad \quad \frac{\partial f}{\partial x_2}-\lambda\frac{\partial g}{\partial x_2}=0, \qquad \text{and so on,}$$ which are equations $(1)$ and $(2)$ set equal to $0$. Similarly, $g(\textbf x)=0$ is equation $(3)$ set equal to $0$ (you wrote this in your question as $g(P)=0$). Arriving at the single-equation formulation is then straightforward: $f(\textbf x)$ and $g(\textbf x)$ are given, so plug them into $f(\textbf x) - \lambda g(\textbf x)$ and define that expression as $L$.
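To see concretely that the two routes coincide, here is a small sketch using SymPy; the particular $f$ and $g$ below are invented purely for illustration, not part of your question:

```python
# Illustrative sketch: minimize f(x, y) = x^2 + y^2 subject to
# g(x, y) = x + y - 1 = 0, via the single-equation formulation.
import sympy as sp

x, y, lam = sp.symbols('x y lam', real=True)
f = x**2 + y**2
g = x + y - 1

# Single-equation formulation: L = f - lam * g
L = f - lam * g

# Setting every partial derivative of L to zero reproduces the pair
# grad f = lam * grad g  together with  g = 0.
eqs = [sp.diff(L, v) for v in (x, y, lam)]
sol = sp.solve(eqs, (x, y, lam), dict=True)
print(sol)  # x = y = 1/2, lam = 1
```

Solving the pair of equations directly, $(2x, 2y) = \lambda(1,1)$ with $x+y=1$, gives the same point $x=y=\tfrac12$, as it must.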
I think the point of the second formulation is to provide a more compact way to find the optimal point $P$: all of the conditions fall out of setting the partial derivatives of a single function to zero, with no need to reason about gradient vectors explicitly, and it is for this reason that other subjects such as economics prefer the second formulation. Personally, I prefer the first formulation because the idea of a gradient vector is the basis of the theory behind the method of Lagrange multipliers. Simply setting the partial derivatives of $L$ equal to $0$, taken on its own, seems to have no fundamental logical reason other than "that is the correct way to do it".
In an applied problem, either method works equally well.