For a linear system $Ax = y$, define $J_1 = ||Ax - y||^2$ and $J_2 = ||x||^2$. We wish to minimize the weighted-sum objective $J_1 + \mu J_2$. If we interpret $J_1$ as a cost function and $J_2$ as an effort penalty (e.g., in a signal-processing context where $x$ is a digital filter and $||x||$ signifies the gain/amplification introduced by the filter, or the filter's "effort" expended in processing an input signal), then as $\mu$ increases (decreases), we emphasize (de-emphasize) minimization of the effort penalty. So it seems to me that if we set $\mu = 0$, we're saying we don't really care about the large gains the filter might need to apply to get us as close as possible to our objective, $J_1$. On the other hand, if we increase $\mu$, we're restricting how much the filter can do to get us close to that objective.
The above interpretation makes sense to me when we're dealing with an overdetermined system. Specifically, when $A \in \mathcal{R}^{m\times n}$ with $m > n$ and full column rank, the solution is the standard least-squares solution with Tikhonov regularization, $\hat{x} = \left(A^{T}A + \mu I\right)^{-1}A^{T}y$. The more we increase $\mu$, the greater the perturbation we introduce into the matrix we're inverting (i.e., $A^{T}A + \mu I$), resulting in the latter having a smaller condition number: its eigenvalues are $\sigma_i^2 + \mu$, where the $\sigma_i$ are the singular values of $A$, so the ratio of largest to smallest eigenvalue shrinks toward $1$ as $\mu$ grows. If we (loosely) interpret the smaller condition number as restricting the dynamic range of the filter, then we're effectively reducing the filter effort, which aligns with the fact that a larger $\mu$ emphasizes minimization of the $J_2$ term in the original objective function.
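As a quick numerical sanity check of this interpretation (a sketch using NumPy with a random tall matrix; the dimensions and seed are arbitrary assumptions), increasing $\mu$ simultaneously shrinks $||\hat{x}||$ and drives the condition number of $A^{T}A + \mu I$ toward $1$:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 8, 3                          # overdetermined: m > n
A = rng.standard_normal((m, n))      # full column rank with probability 1
y = rng.standard_normal(m)

for mu in [0.0, 0.1, 1.0, 10.0]:
    M = A.T @ A + mu * np.eye(n)
    x_hat = np.linalg.solve(M, A.T @ y)   # Tikhonov-regularized solution
    print(f"mu = {mu:5.1f}   ||x_hat|| = {np.linalg.norm(x_hat):.4f}   "
          f"cond(M) = {np.linalg.cond(M):.2f}")
```

Both printed columns decrease monotonically in $\mu$: the filter's "effort" $||\hat{x}||$ is progressively suppressed, and the matrix being inverted becomes better conditioned.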
However, I'm struggling to reconcile the above interpretation with the mathematics of the underdetermined case. Specifically, when $m < n$ and $A$ has full row rank, the solution is still the standard least-squares solution with Tikhonov regularization, $\hat{x} = \left(A^{T}A + \mu I\right)^{-1}A^{T}y$, provided $\mu > 0$ (here $A^{T}A$ is singular, so the formula breaks down at $\mu = 0$). Again, I would expect that as $\mu$ increases, we emphasize minimization of $J_2$ over $J_1$. However, the minimum-norm solution (i.e., the one that minimizes $||x||$ subject to $Ax = y$) is $\hat{x} = A^{T}\left(AA^{T}\right)^{-1}y$, which is obtained in the limit $\mu \to 0$. So, as $\mu \to 0$, we get a solution to $Ax = y$ that also happens to have the smallest norm. Does the qualifier "smallest" pertain only to the other solutions of the optimization problem where only $||Ax - y||^2$ is minimized? In other words, is the norm of the minimizer of $J_1 + \mu J_2$ with $\mu > 0$ (or maybe $\mu > 1$?) smaller than that of the "minimum-norm" solution? Is there a nice way to show this mathematically?
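A numerical sketch of the underdetermined case (a random wide matrix; dimensions and seed are arbitrary assumptions) makes the relationship concrete. It checks the push-through identity $\left(A^{T}A + \mu I\right)^{-1}A^{T} = A^{T}\left(AA^{T} + \mu I\right)^{-1}$, valid for every $\mu > 0$, and shows the regularized solution's norm staying strictly below that of the minimum-norm solution, approaching it as $\mu \to 0$:

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 3, 8                           # underdetermined: m < n
A = rng.standard_normal((m, n))       # full row rank with probability 1
y = rng.standard_normal(m)

# minimum-norm solution: x = A^T (A A^T)^{-1} y
x_mn = A.T @ np.linalg.solve(A @ A.T, y)

for mu in [1.0, 0.1, 1e-6]:
    x_mu = np.linalg.solve(A.T @ A + mu * np.eye(n), A.T @ y)
    # same point via the smaller m-by-m system (push-through identity)
    x_alt = A.T @ np.linalg.solve(A @ A.T + mu * np.eye(m), y)
    assert np.allclose(x_mu, x_alt)
    print(f"mu = {mu:g}   ||x_mu|| = {np.linalg.norm(x_mu):.6f}   "
          f"||x_mn|| = {np.linalg.norm(x_mn):.6f}")
```

The same fact falls out of the SVD $A = U\Sigma V^{T}$: with $b = U^{T}y$, the solution's coefficients along the right singular vectors are $\sigma_i b_i/(\sigma_i^2 + \mu)$, each strictly decreasing in magnitude as $\mu$ grows, so $||x(\mu)||$ is strictly decreasing in $\mu$ and attains its supremum, the minimum-norm value $|b_i|/\sigma_i$ per coefficient, only in the limit $\mu \to 0$.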
If it's not too much trouble, could someone also comment on the case when $A$ is square and full rank? Thanks.
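For the square, full-rank case, a similar sketch (a random invertible matrix; the size and seed are arbitrary assumptions) may help: at $\mu = 0$ the regularized formula reduces to the unique exact solution $A^{-1}y$ with zero residual, and increasing $\mu$ trades residual for norm, shrinking $\hat{x}$ toward zero:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 4
A = rng.standard_normal((n, n))      # invertible with probability 1
y = rng.standard_normal(n)

x_exact = np.linalg.solve(A, y)      # unique exact solution at mu = 0

for mu in [0.0, 0.5, 5.0, 50.0]:
    x_mu = np.linalg.solve(A.T @ A + mu * np.eye(n), A.T @ y)
    print(f"mu = {mu:5.1f}   ||x_mu|| = {np.linalg.norm(x_mu):.4f}   "
          f"||A x_mu - y|| = {np.linalg.norm(A @ x_mu - y):.4f}")
```

As $\mu$ increases, the norm column decreases monotonically while the residual column increases: the square case behaves like the overdetermined one, except that the $\mu = 0$ endpoint fits $y$ exactly.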
In the underdetermined case you may think of the two edges of the regularization path: as $\mu \to 0$, the minimizer approaches the minimum-norm solution $\hat{\boldsymbol{x}} = A^{T}\left(AA^{T}\right)^{-1}y$, while as $\mu \to \infty$, it is driven toward $\boldsymbol{0}$.
Since the path traces the minimizers of $ {J}_{1} + \mu {J}_{2} $, the norm of the solution for any $ \mu > 0 $ is strictly smaller than the norm of $ \hat{\boldsymbol{x}} $.
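A small numerical sketch (a random wide matrix; dimensions and seed are arbitrary assumptions) confirms both edges of the path, using a tiny and a huge $\mu$ as stand-ins for the two limits:

```python
import numpy as np

rng = np.random.default_rng(3)
m, n = 2, 5                           # underdetermined: m < n
A = rng.standard_normal((m, n))
y = rng.standard_normal(m)

x_mn = A.T @ np.linalg.solve(A @ A.T, y)   # minimum-norm solution

# edge mu -> 0: the path converges to the minimum-norm solution
x_small = np.linalg.solve(A.T @ A + 1e-8 * np.eye(n), A.T @ y)
print("distance to x_mn:", np.linalg.norm(x_small - x_mn))

# edge mu -> infinity: the path is crushed toward the origin
x_big = np.linalg.solve(A.T @ A + 1e8 * np.eye(n), A.T @ y)
print("||x(mu)|| for huge mu:", np.linalg.norm(x_big))
```

Both printed numbers are tiny: the first shows the $\mu \to 0$ endpoint matching $\hat{\boldsymbol{x}}$, the second shows the $\mu \to \infty$ endpoint collapsing to the origin.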