Background:
Suppose that I observe some data $\mathbf{y} = [y_{1}, \ldots, y_{N}]^{T}$ at specific time points $\mathbf{t} = [ t_{1}, \ldots, t_{N}]^{T}$. I am assuming that my data can be modeled as:
$$ y_{n} = \sum_{m=1}^{M} w_{m} \phi_{m}\left( t_{n} \right) + \epsilon_{n} \; \forall n$$
where $ \epsilon_{n} \sim \mathcal{N}\left(0, \beta^{-1} \right) $ are i.i.d.
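To make the setup concrete, here is a minimal NumPy sketch of this generative model. The polynomial basis $\phi_m(t) = t^{m-1}$, the sizes $N$ and $M$, and the "true" parameter values are assumptions for illustration only; nothing in the question fixes them.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed setup: M = 5 polynomial basis functions phi_m(t) = t**(m-1),
# N = 50 time points on [0, 1], true weights drawn once for the simulation.
N, M = 50, 5
beta_true = 25.0                          # noise precision -> noise std = 0.2
t = np.linspace(0.0, 1.0, N)
Phi = np.vander(t, M, increasing=True)    # Phi[:, m] = t**m for m = 0..M-1
w_true = rng.normal(size=M)               # a sample from the N(0, I) prior
y = Phi @ w_true + rng.normal(scale=beta_true**-0.5, size=N)
```

Here `Phi` is the $N \times M$ design matrix, so each observation is the row `Phi[n] @ w_true` plus Gaussian noise with precision `beta_true`.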
If we assume that $ \mathbf{w} \sim \mathcal{N}(0, \alpha^{-1} \mathbf{I}) $, where $\mathbf{w} = [w_{1}, \ldots, w_{M}]^{T}$, then we can find $\mathbf{w}$ by maximizing the posterior distribution:
$$ p(\mathbf{w} | \mathbf{y}, \mathbf{t}, \alpha, \beta) \propto p(\mathbf{y} | \mathbf{t}, \mathbf{w}, \beta) p(\mathbf{w} | \alpha) $$
Doing the math, this gives us:
$$ p(\mathbf{w} | \mathbf{y}, \mathbf{t}, \alpha, \beta) = \mathcal{N}\left( \beta ( \alpha \mathbf{I} + \beta \Phi^{T} \Phi)^{-1} \Phi^{T} \mathbf{y},\; (\alpha \mathbf{I} + \beta \Phi^{T} \Phi)^{-1} \right)$$ where $\Phi_{n,m} = \phi_{m}(t_{n})$ is the $N \times M$ design matrix.
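For reference, here is a small sketch of that posterior computation, with the mean $\mathbf{m} = \beta (\alpha \mathbf{I} + \beta \Phi^{T} \Phi)^{-1} \Phi^{T} \mathbf{y}$ and covariance $\mathbf{S} = (\alpha \mathbf{I} + \beta \Phi^{T} \Phi)^{-1}$ (the function name and example data are my own, for illustration):

```python
import numpy as np

def posterior(Phi, y, alpha, beta):
    """Posterior N(m, S) over the weights in the Gaussian linear model.

    S = (alpha*I + beta*Phi^T Phi)^{-1},  m = beta * S @ Phi^T @ y.
    """
    M = Phi.shape[1]
    S_inv = alpha * np.eye(M) + beta * Phi.T @ Phi
    S = np.linalg.inv(S_inv)
    m = beta * S @ Phi.T @ y
    return m, S

# Tiny usage example: straight-line basis phi_1(t) = 1, phi_2(t) = t.
Phi = np.array([[1.0, 0.0], [1.0, 1.0], [1.0, 2.0]])
y = np.array([0.0, 1.0, 2.0])
m, S = posterior(Phi, y, alpha=1.0, beta=1.0)
```

The MAP estimate of $\mathbf{w}$ is just the posterior mean `m`, since the posterior is Gaussian.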
The Problem:
In ordinary Bayesian inference, we would just pick an $\alpha$ and assume that $\beta$ is known. However, using the 'evidence approximation', it is claimed that you can find both $\alpha$ and $\beta$ by maximizing the marginal likelihood:
$$ p(\mathbf{y} | \alpha, \beta) = \int p(\mathbf{y} | \mathbf{t}, \mathbf{w}, \beta) p(\mathbf{w} | \alpha) d\mathbf{w} $$
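For this Gaussian model the integral is tractable, and the maximization is usually done with the standard re-estimation equations (e.g. Bishop, PRML, sec. 3.5.2). A sketch of those fixed-point updates, with my own function name, initialization, and iteration count:

```python
import numpy as np

def evidence_maximize(Phi, y, n_iter=100):
    """Fixed-point updates for alpha, beta that maximize p(y | alpha, beta).

    Standard re-estimation equations:
      gamma  = sum_i lam_i / (alpha + lam_i), lam_i eigvals of beta*Phi^T Phi
      alpha  = gamma / (m^T m)
      1/beta = ||y - Phi m||^2 / (N - gamma)
    where m is the current posterior mean.
    """
    N, M = Phi.shape
    alpha, beta = 1.0, 1.0                      # arbitrary starting values
    eig0 = np.linalg.eigvalsh(Phi.T @ Phi)      # eigenvalues of Phi^T Phi
    for _ in range(n_iter):
        S = np.linalg.inv(alpha * np.eye(M) + beta * Phi.T @ Phi)
        m = beta * S @ Phi.T @ y                # posterior mean
        lam = beta * eig0
        gamma = np.sum(lam / (alpha + lam))     # effective number of params
        alpha = gamma / (m @ m)
        beta = (N - gamma) / np.sum((y - Phi @ m) ** 2)
    return alpha, beta, m
```

Each update re-fits the posterior mean under the current $\alpha, \beta$ and then re-estimates $\alpha, \beta$ from that fit, which is exactly the apparent circularity the question is about.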
But my question is: isn't that circular logic? Aren't we trying to find the very quantities we assumed in the first place?
Assuming that we find $\alpha$ and $\beta$ this way, what will they do for us? Give us a better estimate of $\mathbf{w}$ via the posterior? But how can we know that estimate is actually better if we don't know what $\mathbf{w}$ is? Will the $\mathbf{w}$ found using these optimal values achieve a lower mean squared error than one found with arbitrary values?
Also, suppose we do find these parameter values: what do they then represent? The "actual" prior precision and noise precision? Does it even make sense to say there is an "actual" prior?
In other words, what is the point of evidence approximation? Is it not circular logic? Where, when and why is it used?