Here is a proof of the Lagrange multiplier method from Calculus Early Transcendentals by James Stewart (8th ed). It does not rely on the Implicit Function Theorem like all other "rigorous" proofs seem to. What is the missing piece from this proof (which I guess relies on the Implicit Function Theorem) that would make this rigorous?
Suppose that a function $f$ has an extreme value at a point $(x_0, y_0, z_0)$ on the surface $S$ and let $C$ be a curve with vector equation $\vec{r}(t)=(x(t), y(t), z(t))$ that lies on $S$ and passes through $(x_0, y_0, z_0)$. If $t_0$ is the parameter value corresponding to the point $(x_0, y_0, z_0)$, then $\vec{r}(t_0)=(x(t_0), y(t_0), z(t_0))$. The composite function $h(t)=f(x(t), y(t), z(t))$ represents the values that $f$ takes on the curve $C$. Since $f$ has an extreme value at $(x_0, y_0, z_0)$, it follows that $h$ has an extreme value at $t_0$, so $h'(t_0) = 0$. But if $f$ is differentiable, we can use the Chain Rule to write $$0 = h'(t_0) = \nabla f(x_0, y_0, z_0) \cdot \vec{r'}(t_0)$$
This shows that the gradient vector $\nabla f(x_0, y_0, z_0)$ is orthogonal to the tangent vector $\vec{r'}(t_0)$ to every such curve $C$. We know that the gradient of $g$, $\nabla g(x_0, y_0, z_0)$, is also orthogonal to $\vec{r'}(t_0)$ for every such curve. This means that the gradient vectors $\nabla f(x_0, y_0, z_0)$ and $\nabla g(x_0, y_0, z_0)$ must be parallel.
Alternatively, an even simpler proof from MIT OCW goes as follows:
Consider any unit vector $\hat{u}$ at the critical point that is tangent to the constraint surface. The directional derivative along $\hat{u}$ vanishes at the critical point, $D_{\hat{u}} f = \nabla f \cdot \hat{u} = 0$, so $\nabla f$ is perpendicular to any such $\hat{u}$. We know $\nabla g$ is perpendicular to the level sets of $g$, so $\nabla g$ is also perpendicular to any such $\hat{u}$, implying $\nabla f$ and $\nabla g$ are parallel.
What does introducing $\vec{r}(t)$ in the Stewart proof give us over this one? And, again, what is the piece here that needs to be shown more rigorously (presumably using the Implicit Function Theorem)?
The two proofs are equivalent (with slight, inconsequential differences that I will clarify below).
At this level, it's helpful to borrow some intuition from physics (after all, that's where calculus came from).
Let's use just two coordinates instead of three to make things easier to visualize:
We have a hill, and $f(x,y)$ is the height of the hill at $(x,y)$. The hiker's horizontal location (horizontal since we are not using $z$) at any time $t$ is given by $\vec{r}(t)$ in Stewart, which basically gives us the entire history of the hiker's movement. OCW concerns itself only with the hiker's movement near the extremum (and doesn't bother making the path explicit), since the movement elsewhere is irrelevant. OCW also specifies that the hiker travels at unit speed, which is inconsequential here; Stewart doesn't specify the speed. These are the slight differences.
Now, if we write out the derivative in OCW (making the location explicit as in Stewart), it is, evaluated at $t = 0$:
$$ \frac{d}{dt} f(\vec{r}(t_0)+\hat u t) $$
For Stewart, it is, evaluated at $t = t_0$:
$$ \frac{d}{dt} f(\vec{r}(t))$$
In the first case, applying the chain rule gives:
$$ \nabla f(\vec{r}(t_0)) \cdot \hat u$$
In the second case:
$$ \nabla f(\vec{r}(t_0)) \cdot \vec{r}'(t_0)$$
So, same conclusion.
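To see the two computations agree on a concrete case, here is a minimal numerical sketch. The example is my own assumption, not taken from Stewart or OCW: the hill is $f(x,y) = x + y$, the constraint is the unit circle $g(x,y) = x^2 + y^2 = 1$, the hiker's path is $\vec{r}(t) = (\cos t, \sin t)$, and the constrained maximum sits at $t_0 = \pi/4$. All function names are hypothetical helpers for illustration.

```python
import math

# Assumed example (not from the source): f(x, y) = x + y on the unit circle.
def f(x, y):
    return x + y

def grad_f(x, y):
    return (1.0, 1.0)

def grad_g(x, y):
    # g(x, y) = x^2 + y^2
    return (2.0 * x, 2.0 * y)

def r(t):
    # The hiker's path along the constraint curve.
    return (math.cos(t), math.sin(t))

def r_prime(t):
    return (-math.sin(t), math.cos(t))

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def central_diff(h, t, eps=1e-6):
    # Symmetric finite-difference estimate of h'(t).
    return (h(t + eps) - h(t - eps)) / (2.0 * eps)

t0 = math.pi / 4          # parameter value at the constrained maximum
p0 = r(t0)                # the point r(t0)
u_hat = r_prime(t0)       # unit tangent, since |r'(t)| = 1 on this path

# Stewart's version: d/dt f(r(t)) at t = t0.
stewart_numeric = central_diff(lambda t: f(*r(t)), t0)
stewart_chain = dot(grad_f(*p0), r_prime(t0))

# OCW's version: d/dt f(r(t0) + u_hat * t) at t = 0.
ocw_numeric = central_diff(
    lambda t: f(p0[0] + u_hat[0] * t, p0[1] + u_hat[1] * t), 0.0
)
ocw_chain = dot(grad_f(*p0), u_hat)

# grad f and grad g are parallel iff their 2D cross product vanishes.
gf, gg = grad_f(*p0), grad_g(*p0)
cross = gf[0] * gg[1] - gf[1] * gg[0]

print(stewart_numeric, stewart_chain)
print(ocw_numeric, ocw_chain)
print(cross)
```

All five printed values should be zero up to floating-point and finite-difference error. Because this path happens to have unit speed, $\hat u$ and $\vec{r}'(t_0)$ coincide; for a path with a different speed the two tangent vectors would differ by a scalar factor, which does not affect whether the dot product vanishes.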
Personally, I think Stewart's approach presents the argument in a more intuitive way (and painstakingly names every detail), so it is easier for beginners to understand. OCW's approach is more pragmatic, and you will be using that kind of notation later on. There is no difference between the two in terms of rigor.