I want to prove that the ridge regression estimator is the mean of the posterior distribution under a Gaussian prior.
$$y \sim N(X\beta,\sigma^2I),\quad \text{prior }\beta \sim N(0,\gamma^2 I).$$
$$\hat{\beta} = \left(X^TX + \frac{\sigma^2}{\gamma^2}I\right)^{-1}X^Ty.$$
What I'm trying to show is that $\mu = \hat{\beta}$, where $\mu$ appears in the quadratic form $$-\frac{1}{2}(\beta - \mu)^T\Sigma^{-1}(\beta - \mu)$$ and $\Sigma$ is the covariance matrix of the posterior distribution $p(\beta\mid X,y)$.
There is a solution to this question in the last couple of lines on page 3 of http://ssli.ee.washington.edu/courses/ee511/HW/hw3_solns.pdf, but I'm baffled as to how it gets there. (The problem is exercise 3.6.)
Edit: $\mu$ is the mean of the posterior.
Edit 2: The last couple of lines of problem 3.6 say "which is the single $\beta$ term in the $p(\beta|y, X)$ equation." What is the single $\beta$ term? This sentence makes no sense to me, and I don't see how pointing out that something is the single $\beta$ term is relevant to this proof.
Edit 2 continued: For convenience, the relevant lines read: "$m_b = \frac{1}{\sigma^2}\Sigma_b X^Ty$, and
$\beta^T\Sigma_b^{-1}m_b = \frac{1}{\sigma^2}\beta^T X^Ty$, which is the single $\beta$ term in the $p(\beta \mid X,y)$ equation." (Me: okay, how is this helpful to the proof?)
Ridge
The ridge estimator is defined as the solution to the following optimisation problem:
$$ \beta_R = \arg\min_\beta \frac{1}{2}\|X\beta -y\|_2^2 + \frac{\kappa}{2}\|\beta\|_2^2 $$
i.e., the usual OLS loss (first term, up to an irrelevant factor of $1/2$) with an L2 regularisation term and tuning "hyper"-parameter $\kappa\ge 0$. If $\kappa=0$, it reduces to the OLS case. This problem is convex and differentiable; setting the gradient with respect to $\beta$ to zero leads to
$$ X^T(X\beta_R -y) + \kappa \beta_R = 0 $$
which is equivalent to $(X^T X + \kappa I)\beta_R = X^Ty$. For $\kappa>0$ the matrix on the LHS is positive definite (for any $v\neq 0$, $v^T(X^TX+\kappa I)v = \|Xv\|_2^2 + \kappa\|v\|_2^2 > 0$) and can therefore be inverted, leading to the standard form:
$$\beta_R = (X^T X+\kappa I)^{-1} X^Ty. $$
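To make this concrete, here is a minimal numerical sketch (synthetic data, `numpy` only; the names `X`, `y`, `kappa` are mine, not from the exercise) that computes $\beta_R$ from the closed form and checks the first-order condition above:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 5
X = rng.normal(size=(n, p))
beta_true = rng.normal(size=p)
y = X @ beta_true + 0.1 * rng.normal(size=n)

kappa = 2.0  # arbitrary regularisation strength

# Closed-form ridge estimator: (X^T X + kappa I)^{-1} X^T y
beta_R = np.linalg.solve(X.T @ X + kappa * np.eye(p), X.T @ y)

# Check the first-order condition X^T (X beta_R - y) + kappa beta_R = 0
residual = X.T @ (X @ beta_R - y) + kappa * beta_R
print(np.allclose(residual, 0))  # True (up to floating point)
```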
Bayesian regression
The posterior in $\beta$ is proportional to the likelihood (gaussian) times the prior (gaussian):
$$ p(\beta | y) \propto p(y|\beta) \pi(\beta) $$
A Gaussian times a Gaussian is still Gaussian. So the posterior is a Gaussian with some mean $\beta_G$ and covariance $\Sigma_G$.
I will show that $\beta_G = \beta_R$ in two different ways.
1. Via the MAP
We know the posterior is a Gaussian, so its mean coincides with its mode, i.e. the point where the density is maximal. Therefore we can write:
$$ \beta_G = \arg\max_\beta L(y;\beta)\pi(\beta) $$
where $L(y;\beta)=p(y|\beta)$ is the likelihood of the data $y$ when the Gaussian generating process is parameterised by $\beta$, i.e. $\mathcal N(y;X\beta, \sigma^2I)$. Correspondingly, $\pi(\beta)$ is the prior, $\mathcal N(\beta;0,\tau^2 I)$.
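Spelling out the two densities explicitly (writing $n$ for the number of observations and $p$ for the number of coefficients; only the $\beta$-dependent parts will matter):
$$ L(y;\beta) = (2\pi\sigma^2)^{-n/2}\exp\!\left(-\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta)\right), \qquad \pi(\beta) = (2\pi\tau^2)^{-p/2}\exp\!\left(-\frac{1}{2\tau^2}\beta^T\beta\right). $$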
Now the $\arg\max$ does not change under a monotone increasing transformation such as the $\log$. Also, additive constants are irrelevant. Therefore, we can write
$$\beta_G = \arg\max_\beta \log L(\beta) + \log \pi(\beta)$$
removing additive constants, this leads to
$$\beta_G = \arg\max_\beta -{1\over 2\sigma^2}(y-X\beta)^T(y-X\beta) - {1\over 2\tau^2} \beta^T\beta $$
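To see where the rearranged objective below comes from, expand the first quadratic:
$$ (y-X\beta)^T(y-X\beta) = y^Ty - 2\beta^TX^Ty + \beta^TX^TX\beta, $$
where $y^Ty$ does not depend on $\beta$ and can be dropped.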
Changing sign turns this into a minimisation problem (see how we're getting close to ridge?). Rearranging a bit, and again dropping additive constants and positive multiplicative constants, you get
$$ \beta_G = \arg\min_\beta {1\over \sigma^2}\left(\beta^T X^TX\beta - 2\beta^T X^T y\right) + {1\over \tau^2} \beta^T\beta $$
massage a bit more (multiply the objective by $\sigma^2$)
$$ \beta_G = \arg\min_\beta \beta^T\left(X^TX+{\sigma^2\over \tau^2}I\right)\beta - 2\beta^TX^T y $$
setting the gradient with respect to $\beta$ to zero (for a symmetric matrix $A$, $\nabla_\beta\left(\beta^TA\beta - 2\beta^Tb\right) = 2A\beta - 2b$) leads to
$$\left(X^TX + {\sigma^2\over \tau^2}I\right)\beta_G = X^T y $$
which concludes the proof (just identify $\kappa \leftrightarrow \sigma^2/\tau^2$).
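As a sanity check for this first route, here is a small sketch (again synthetic data and my own names, e.g. the helper `neg_log_posterior`) that minimises the negative log-posterior numerically and compares the result with the closed-form ridge/MAP solution:

```python
import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)
n, p = 80, 4
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.3 * rng.normal(size=n)

sigma, tau = 0.3, 1.0
kappa = sigma**2 / tau**2

def neg_log_posterior(beta):
    # -log p(beta | y) up to an additive constant
    return ((y - X @ beta) ** 2).sum() / (2 * sigma**2) + (beta ** 2).sum() / (2 * tau**2)

beta_map = minimize(neg_log_posterior, x0=np.zeros(p)).x
beta_ridge = np.linalg.solve(X.T @ X + kappa * np.eye(p), X.T @ y)

print(np.allclose(beta_map, beta_ridge, atol=1e-4))  # True (up to optimiser tolerance)
```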
2. Quadratic form matching
You have a product of two Gaussians leading to a Gaussian. We can ignore the proportionality constants and work in the exponent, so that you're effectively matching quadratic forms in $\beta$:
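Concretely, up to additive constants in $\beta$, the exponent of the product $p(y|\beta)\,\pi(\beta)$ has to equal the exponent of the Gaussian posterior:
$$ -\frac{1}{2\sigma^2}(y-X\beta)^T(y-X\beta) - \frac{1}{2\tau^2}\beta^T\beta = -\frac{1}{2}(\beta-\beta_G)^T\Sigma_G^{-1}(\beta-\beta_G) + \text{const}. $$
Expanding both sides, it remains to compare the terms quadratic in $\beta$ and the terms linear in $\beta$.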
Matching the quadratic terms in $\beta$ ($-\frac{1}{2}\beta^T\left(X^TX/\sigma^2 + I/\tau^2\right)\beta$ from the likelihood-prior product against $-\frac{1}{2}\beta^T\Sigma_G^{-1}\beta$ from the posterior) gives
$$\Sigma_G^{-1} = (X^TX/\sigma^2 + I/\tau^2) $$
Matching the terms linear in $\beta$ ($\beta^TX^Ty/\sigma^2$ from the product against $\beta^T\Sigma_G^{-1}\beta_G$ from the posterior) gives $\Sigma_G^{-1}\beta_G = X^Ty/\sigma^2$, therefore
$$ \beta_G = \Sigma_G X^T y / \sigma^2 = \left(\frac{X^TX}{\sigma^2} + \frac{I}{\tau^2}\right)^{-1}\frac{X^T y}{\sigma^2} = \left(X^TX + \frac{\sigma^2}{\tau^2}I\right)^{-1} X^T y, $$
which is again the ridge estimator with $\kappa = \sigma^2/\tau^2$.
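And a quick numerical sketch for this second route (my own variable names again), confirming that the posterior mean built from $\Sigma_G$ coincides with the ridge closed form:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.normal(size=(n, p))
y = X @ rng.normal(size=p) + 0.5 * rng.normal(size=n)

sigma, tau = 0.5, 2.0

# Posterior precision and mean from the quadratic-form matching
Sigma_G_inv = X.T @ X / sigma**2 + np.eye(p) / tau**2
beta_G = np.linalg.solve(Sigma_G_inv, X.T @ y / sigma**2)

# Ridge estimator with kappa = sigma^2 / tau^2
kappa = sigma**2 / tau**2
beta_R = np.linalg.solve(X.T @ X + kappa * np.eye(p), X.T @ y)

print(np.allclose(beta_G, beta_R))  # True
```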
Side remark
The first approach corresponds to taking the MAP estimator (maximum a posteriori). MAP estimators can generally be read as regularised MLE estimators, with the log-prior playing the role of the regulariser. In this case the MLE estimator is the OLS, and regularising it with an L2 penalty corresponds to a MAP with a Gaussian likelihood and a Gaussian prior.
You have a similar thing with the Lasso, for example, which corresponds to the MAP of a Gaussian likelihood with a Laplace prior.