I'm new to converting formulas to matrix form. But this is required for efficient machine learning code. So I want to understand the "right" way, not the cowboy stuff I do.
Alright, here we go: I'm trying to convert the weighted sum of squares from the form below into matrix form. $$J(w)=\sum_{i=1}^m u_i (w^T x_i - y_i)^2$$
where $u_i$ is the weight on the $i$-th sample's error. Also, $x_i \in \mathbb{R}^n$, $w \in \mathbb{R}^n$, $y_i \in \mathbb{R}$, $u_i \in \mathbb{R}$, for $i=1,\dots,m$. $w^T x_i$ is the predicted value, i.e. the result of multiplying the weight vector with a feature vector.
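To make the dimensions concrete, here's a minimal sketch of this scalar (loop) form with made-up toy data (all values below are arbitrary, purely for illustration):

```python
import numpy as np

# Hypothetical toy data: m = 4 samples, n = 3 features (values are arbitrary).
rng = np.random.default_rng(0)
m, n = 4, 3
X = rng.normal(size=(m, n))        # row i is the feature vector x_i
y = rng.normal(size=m)             # labels y_i
u = rng.uniform(0.5, 2.0, size=m)  # per-sample weights u_i
w = rng.normal(size=n)             # weight vector

# J(w) = sum_i u_i * (w^T x_i - y_i)^2, computed exactly as the summation is written.
J_loop = sum(u[i] * (w @ X[i] - y[i]) ** 2 for i in range(m))
print(J_loop)
```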
Here's what I think, and I do get creative. So feel free to skip to the end if I go on a tangent.
Let $r$ be the column vector whose entries are the non-squared errors $r_i = w^T x_i - y_i$. We can represent $\sum_{i=1}^m (w^T x_i - y_i)^2$ as $$ r^T r = \begin{bmatrix}r_1 & r_2 & \cdots & r_m\end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \\ \end{bmatrix} \tag{1}\label{1}$$
The result of multiplying the $1 \times m$ vector by the $m \times 1$ vector is a $1 \times 1$ matrix (a scalar).
Let $u$ be the vector of weights, one per sample error. Since we need to weight the squared errors, we have to incorporate $u$ into Formula $\ref{1}$ before collapsing to a scalar. Since we want the first $r$ to remain a $1 \times m$ row vector, we define $U$ to be the diagonal matrix whose diagonal entries come from $u$. We now have:
$$ J(w) = \begin{bmatrix}r_1 & r_2 & \cdots & r_m\end{bmatrix} \begin{bmatrix} u_1 & 0 & \cdots & 0\\ 0 & u_2 & \cdots & 0\\ \vdots & \vdots & \ddots & \vdots\\ 0 & 0 & \cdots & u_m\\ \end{bmatrix} \begin{bmatrix} r_1 \\ r_2 \\ \vdots \\ r_m \\ \end{bmatrix} \tag{2}\label{2}$$
We can simplify this to $$ J(w) = r^T U r \tag{3}\label{3}$$
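Continuing the toy snippet above, formula $\ref{3}$ can be checked numerically like this (materializing the full diagonal $U$ is wasteful for large $m$, but fine for illustration):

```python
# Continuing the snippet above: the residual vector r = Xw - y stacks all m errors.
r = X @ w - y

# U is the m x m diagonal matrix with the weights u_i on its diagonal.
U = np.diag(u)

# Formula (3): J(w) = r^T U r -- a 1x1 quantity, i.e. a scalar.
J_matrix = r @ U @ r
print(np.isclose(J_loop, J_matrix))  # True
```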
Now we expand $r$. Stacking the feature vectors $x_i^T$ as the rows of an $m \times n$ matrix $X$, the predictions $w^T x_i$ over all samples become $Xw$, where $w$ is an $n \times 1$ column vector. Let $y$ be the $m \times 1$ column vector of labels $y_i$, $i=1,\dots,m$. Now $r = Xw - y$. We substitute this into $\ref{3}$, giving the final weighted sum of squares in matrix form: $$ J(w) = (Xw - y)^T U(Xw-y) \tag{4}\label{4} $$
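And, reusing the same toy data, formula $\ref{4}$ should agree with the loop version; the last line also shows the elementwise shortcut I'd probably use in practice instead of building $U$:

```python
# Formula (4): J(w) = (Xw - y)^T U (Xw - y), i.e. formula (3) with r expanded.
J_final = (X @ w - y) @ U @ (X @ w - y)

# Equivalent but cheaper in practice: weight the squared residuals elementwise,
# so the m x m matrix U never has to be materialized.
J_cheap = np.sum(u * (X @ w - y) ** 2)

print(np.isclose(J_final, J_loop), np.isclose(J_cheap, J_loop))  # True True
```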
First, does this make sense? Second, and most importantly, is this actually how you're supposed to do it?
Thanks
In general, you are right. If you can specify the weights $u_i$, then you can define the matrix $U^{1/2}=\operatorname{diag}(u_1^{1/2},\dots,u_m^{1/2})$ and premultiply your model by it, i.e., $$ U^{1/2}y=U^{1/2}X\beta + U^{1/2}\epsilon. $$ So you have to minimize $S(\beta)=(U^{1/2}(y-X\beta))'U^{1/2}(y-X\beta)=(y-X\beta)'U(y-X\beta)$. You can just denote $y^*=U^{1/2}y$ and $X^*=U^{1/2}X$, thus the OLS result is given by $$ \hat{\beta}=(X^{*\prime}X^*)^{-1}X^{*\prime}y^* = (X'UX)^{-1}X'Uy. $$
So, you can see that WLS is just OLS performed on a transformed model. As such, you can skip all these steps, just define $U=\operatorname{diag}(u_1,\dots,u_m)$, and then use the final result to get the WLS coefficients.
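If you want to sanity-check this numerically, here is a small sketch with made-up data (the toy values are arbitrary; `np.linalg.solve` is used instead of an explicit inverse for numerical stability):

```python
import numpy as np

# Toy data just for illustration (hypothetical values: m samples, n features).
rng = np.random.default_rng(0)
m, n = 6, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
u = rng.uniform(0.5, 2.0, size=m)   # the weights u_i

# WLS coefficients: beta_hat = (X' U X)^{-1} X' U y.
U = np.diag(u)
beta_wls = np.linalg.solve(X.T @ U @ X, X.T @ U @ y)

# Same estimate via the transformed model y* = U^{1/2} y, X* = U^{1/2} X, then plain OLS.
sqrt_u = np.sqrt(u)
beta_star, *_ = np.linalg.lstsq(sqrt_u[:, None] * X, sqrt_u * y, rcond=None)

print(np.allclose(beta_wls, beta_star))  # True: WLS is OLS on the transformed model
```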
EDIT:
Note that for a general structure of the covariance matrix, you assume $\epsilon \sim MVN(0, \sigma^2 V)$. In this case you will get a GLS estimator of the form $$ \hat{\beta} = (X'V^{-1}X)^{-1}X'V^{-1}y, $$ so WLS is just the special case of GLS where $V$ is diagonal but still differs from $kI$. Now, you have to construct $V^{-1}$, which is a positive definite matrix and thus can be factorized as $V^{-1}=V^{-1/2}V^{-1/2}$, so the transformed model is given by $$ y^* = V^{-1/2}y=V^{-1/2}X\beta + V^{-1/2}\epsilon, $$ such that $$ \operatorname{cov}(y^*) = \operatorname{cov}(V^{-1/2}\epsilon)=\sigma^2V^{-1/2}VV^{-1/2}=\sigma^2V^{-1/2}V^{1/2}V^{1/2}V^{-1/2}=\sigma^2I, $$ which solves the problem. So, for the technicalities, just denote the new matrices by $*$ or something like that and plug them into the previous form at the end of the calculations.
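A small numerical sketch of the GLS case with an arbitrary (made-up) positive definite $V$; note it whitens with the Cholesky factor of $V$ rather than the symmetric square root $V^{1/2}$, which yields the same estimator:

```python
import numpy as np

# Toy data; V is a hypothetical non-diagonal, positive definite covariance structure.
rng = np.random.default_rng(1)
m, n = 6, 3
X = rng.normal(size=(m, n))
y = rng.normal(size=m)
A = rng.normal(size=(m, m))
V = A @ A.T + m * np.eye(m)   # positive definite by construction

# GLS estimator: beta_hat = (X' V^{-1} X)^{-1} X' V^{-1} y.
Vinv = np.linalg.inv(V)
beta_gls = np.linalg.solve(X.T @ Vinv @ X, X.T @ Vinv @ y)

# Whitening route: factor V = L L' (Cholesky), transform the model by L^{-1},
# so the transformed errors have covariance sigma^2 I, and run plain OLS.
L = np.linalg.cholesky(V)
X_star = np.linalg.solve(L, X)
y_star = np.linalg.solve(L, y)
beta_white, *_ = np.linalg.lstsq(X_star, y_star, rcond=None)

print(np.allclose(beta_gls, beta_white))  # True
```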