I already understand the intuition behind why the gradient of a function $f$ at its maximum $(x,y)$ subject to some constraint $g$ satisfies:
$\nabla f(x,y) = \lambda\nabla g(x,y)$
for some constant $\lambda$. There are a lot of depictions online of the single-constraint case in 2D, where you see that the gradient of a function at a point is always perpendicular to the level set of the function through that point. You then conclude that the gradient of $f$ and the gradient of $g$ must be parallel (just a verbal way of expressing the equation above), because $\nabla f$ points in the direction of steepest ascent, and if $f$ is differentiable then it's continuous and the tangent plane is a good local approximation, and if you could move in some direction that increased $f$ but that was also perpendicular to $\nabla g$ (that is, tangent to the level set of $g$), you would be able to move along the level set of $g$ at $(x,y)$ and increase $f$ a little more without violating the constraint.
My problem is that this intuition falls apart with two or more constraints. Somehow this ends up being true for arbitrarily high dimension:
$\nabla f(x_1,\ldots,x_D) = \sum_{i=1}^n \lambda_i\nabla g_i(x_1,\ldots,x_D)$
I can see that if we stay in two dimensions and have two constraints, any two non-parallel vectors end up spanning the whole space, so some linear combination of them must equal $\nabla f$. But if the number of dimensions is high, and the number of constraints is smaller than the number of dimensions, it's not obvious to me why $\nabla f$ must be a linear combination of the $\nabla g_i$.
What I can accept is that, at the maximum, moving in the direction of $\nabla f$ must require moving in a direction that has a non-zero projection onto at least one $\nabla g_i$. In other words, if we consider one pair $(\nabla f, \nabla g_i)$, two vectors always lie in some plane, and we can consider $\nabla f$ to be the sum of two vectors: one that is parallel to $\nabla g_i$ and one that is perpendicular to $\nabla g_i$. Since it must be the case at the maximum that going further in the direction of $\nabla f$ would cause us to violate at least one constraint, there must be at least one $\nabla g_i$ for which the component of $\nabla f$ parallel to $\nabla g_i$ is non-zero. But I have no idea how we get from that to a linear combination of all the constraint gradients.
How do I get an intuition for this? Maybe there is an intuitive visualization for the multiple constraints case? I haven't been able to find one.
Here is how I personally interpret the theorem. You want to find the maxima/minima of a given function $f$ on a domain described by a set of Cartesian equations $g_i = 0$. Take a parametric curve $\gamma$ lying in this set (this is just the idea; nothing guarantees that you can always find a proper one-dimensional curve on a generic set described by Cartesian equations). If the extremal point lies on $\gamma$, then the derivative of the composite function $f(\gamma(t))$ must be zero when $\gamma(t)$ is the extremal point. Rewriting this with the chain rule gives $0=(f(\gamma))'=\nabla f \cdot \gamma'$, so the gradient of $f$ and the tangent of the curve at that point are orthogonal (as explained next, this means that the gradient of $f$ must be a linear combination of the gradients of the functions that define the equations). Since this holds for every such curve, a necessary condition for an extremal point is that the gradient of $f$ is orthogonal to the tangent space of the set at the given point. Now, if the set is described by Cartesian equations, then the gradients of those equations at a point give the Cartesian equations of the tangent space of the set at that point: a vector $v$ is tangent exactly when $\nabla g_i \cdot v = 0$ for every $i$ (at least when the $\nabla g_i$ are linearly independent). Requiring the gradient of $f$ to be a linear combination of those gradients is exactly the orthogonality condition we need. The key is that a vector space described by Cartesian equations automatically tells you which vectors are orthogonal to all the vectors of the space: treat the coefficients of each equation as a vector and read the equation as a scalar product with an unknown vector. The vectors orthogonal to the whole space are then precisely the span of those coefficient vectors. I hope this was not too strange.
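To make the orthogonality argument concrete, here is a small numerical sketch in three dimensions with two constraints. Everything in it is an illustrative assumption and not part of the original question: the objective $f(x,y,z) = x + 2y + 3z$, the constraints $g_1 = x^2+y^2+z^2-1 = 0$ and $g_2 = x+y+z = 0$, and the variable names. The feasible set is a circle, so it can be parametrized by a single curve $\gamma(t)$, and the code checks that at the maximum $\nabla f$ is orthogonal to $\gamma'$ and lies in the span of $\nabla g_1$ and $\nabla g_2$.

```python
import numpy as np

# Hypothetical example: maximize f(p) = c . p on the circle
# {g1 = |p|^2 - 1 = 0} intersected with {g2 = x + y + z = 0}.
c = np.array([1.0, 2.0, 3.0])

def grad_f(p):
    return c                 # f is linear, so its gradient is constant

def grad_g1(p):
    return 2.0 * p           # gradient of x^2 + y^2 + z^2 - 1

def grad_g2(p):
    return np.ones(3)        # gradient of x + y + z

# Orthonormal basis of the plane x + y + z = 0; the feasible circle is
# gamma(t) = cos(t) * u + sin(t) * v.
u = np.array([1.0, -1.0, 0.0]) / np.sqrt(2.0)
v = np.array([1.0, 1.0, -2.0]) / np.sqrt(6.0)

# f(gamma(t)) = (c.u) cos t + (c.v) sin t is maximal at t = atan2(c.v, c.u).
t_star = np.arctan2(c @ v, c @ u)
p_star = np.cos(t_star) * u + np.sin(t_star) * v      # constrained maximizer
tangent = -np.sin(t_star) * u + np.cos(t_star) * v    # gamma'(t_star)

# Solve grad_f = lam1 * grad_g1 + lam2 * grad_g2 in the least-squares sense.
A = np.column_stack([grad_g1(p_star), grad_g2(p_star)])
lam, *_ = np.linalg.lstsq(A, grad_f(p_star), rcond=None)
residual = np.linalg.norm(A @ lam - grad_f(p_star))

print("grad f . gamma'  =", grad_f(p_star) @ tangent)
print("grad g1 . gamma' =", grad_g1(p_star) @ tangent)
print("grad g2 . gamma' =", grad_g2(p_star) @ tangent)
print("multipliers      =", lam)
print("residual         =", residual)
```

All three dot products come out as zero up to rounding, and so does the residual. The two constraint gradients span the plane orthogonal to the tangent line, and $\nabla f$ has to sit in that plane, so it is a combination of both gradients even though it is parallel to neither one on its own.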