Minimizing function of a random variable; MLE over random parameters


I have a tricky homework question. I'm not looking for the answer, but rather some help understanding how to think about the question.

I have a vector $w \in \mathbb{R}^d$ with $w_i \sim N(0, \tau^2)$, vectors $x_i \in \mathbb{R}^d$ given for $i = 1, 2, \ldots, n$, and a formula for generating the outputs $y_i$ as follows:

$y_i = \sum_j w_j x_{ij} + \epsilon_i$, where $\epsilon_i \sim N(0, \sigma^2)$.
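(For concreteness, the generative model above can be sketched in code; the sizes $n, d$ and the values of $\tau, \sigma$ below are made up for illustration.)

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 100, 5                         # illustrative sizes
tau, sigma = 1.0, 0.5                 # illustrative parameter values

X = rng.normal(size=(n, d))           # the given inputs x_i (rows of X)
w = rng.normal(0.0, tau, size=d)      # w_j ~ N(0, tau^2)
eps = rng.normal(0.0, sigma, size=n)  # epsilon_i ~ N(0, sigma^2)
y = X @ w + eps                       # y_i = sum_j w_j x_ij + epsilon_i
```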

I am supposed to show that maximizing

$P(w_1, w_2, ... w_d|(x_1, y_1), ..., (x_n, y_n))$ is equivalent to minimizing the quantity $||Xw - y||^2 + \lambda ||w||^2$ over $w$.

I'm having a lot of difficulty figuring out how to proceed with this. My questions are as follows:

1) How do I minimize a function of a random variable? What does that even mean?

2) How would I maximize the probability? It's clearly a maximum likelihood type computation; I'm not even sure how to compute the probability density function for $(w_1, ..., w_d)$ or how to go about maximizing it.

What I do know

1) I see that the ridge regression objective must have a minimum in $w$ because it's quadratic in that variable. Or at least, I assume that still holds when the variable in question is random.

2) I see that (strange indexing aside) we have something like $w = X^{-1}(y - \epsilon)$, where $y, \epsilon \in \mathbb{R}^n$ and $X$ is an $n \times d$ matrix. Of course, I have no reason to believe that $X$ is invertible, so presumably the least squares solution (which is more likely to be what I get than an exact solution) somehow corresponds to the maximum likelihood $w$. Again, though, I'm not sure what to do with the random variable ($\epsilon$) on the right-hand side.

I'd appreciate any clarification or ideas about how to proceed. Thanks.


On BEST ANSWER

This is an extended comment elaborating on the theme I was pursuing in the comments above.


I think you're overcomplicating this. Suppose I tell you that I have the distribution $P(w;\phi) = C \exp ( - \|w - \phi \|^2).$ What value of $w,$ stated in terms of $\phi,$ will maximise $P$? This is a standard calculus problem. Similarly, suppose I gave you the function $\|w - \phi \|^2.$ What value of $w$ would minimise this, in terms of $\phi$? This is a standard convex optimisation question.

To see how this is relevant here, note that (in vector form) $y = Xw + \epsilon,$ and so $$P(w|y;X) = \frac{P(w,y;X)}{P(y;X)} = \frac{P(w) \cdot \phi(y - Xw, \sigma^2 I)}{P(y;X)} = \frac{\phi(w, \tau^2 I) \cdot \phi(y - Xw, \sigma^2 I)}{P(y;X)},$$ where $\phi(a, S)$ is the Gaussian density, and so you can set up the optimisation problem that maximises the likelihood of $w$ with everything else held constant (the denominator depends only on $X, y$, which is stuff you know/observe).
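Spelling out the key step (a sketch; the constant collects everything not depending on $w$): take negative logarithms of the two Gaussian densities in the numerator and drop the $w$-free terms,

$$-\log P(w \mid y; X) = \frac{1}{2\sigma^2}\,\|y - Xw\|^2 + \frac{1}{2\tau^2}\,\|w\|^2 + \text{const}.$$

Multiplying through by $2\sigma^2$ (which doesn't change the minimiser) leaves $\|y - Xw\|^2 + \frac{\sigma^2}{\tau^2}\|w\|^2$, i.e. the ridge objective with $\lambda = \sigma^2/\tau^2.$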

You should find that this is equivalent to the optimisation $ \min_w \mathcal{L}(w; y, X, \lambda) := \| y - Xw\|^2 + \lambda \|w\|^2$ for some choice of $\lambda.$

That's really the extent of the problem here. You're not maximising a function of a random variable; you're finding the most likely value of that random variable given your data.
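You can also check this numerically. The sketch below (with made-up sizes and parameters, and $\lambda = \sigma^2/\tau^2$, which is one natural choice consistent with the setup) computes the closed-form ridge minimiser $(X^\top X + \lambda I)^{-1} X^\top y$ and verifies that random perturbations never decrease the objective:

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 200, 4                       # illustrative sizes
tau, sigma = 1.0, 0.5               # illustrative parameter values
lam = sigma**2 / tau**2             # lambda = sigma^2 / tau^2 (assumed scaling)

# Simulate data from the model y = Xw + eps
X = rng.normal(size=(n, d))
w_true = rng.normal(0.0, tau, size=d)
y = X @ w_true + rng.normal(0.0, sigma, size=n)

def ridge_loss(w):
    """The objective ||y - Xw||^2 + lam * ||w||^2."""
    return np.sum((y - X @ w) ** 2) + lam * np.sum(w ** 2)

# Closed-form minimiser: solve (X^T X + lam I) w = X^T y
w_map = np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ y)

# Sanity check: random perturbations never improve the loss
base = ridge_loss(w_map)
assert all(ridge_loss(w_map + 0.1 * rng.normal(size=d)) >= base
           for _ in range(100))
```

Since the objective is strictly convex in $w$, the linear solve above is its unique minimiser, which is why the perturbation check passes.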