Suppose we have a random vector $\textbf{x}\in \mathbb{R}^n$ with known probabilistic characteristics. We want to find a $\theta$ that makes the vector $z=H\theta$ as close as possible to $\textbf{x}$, where $H\in\mathbb{R}^{n\times p}$ is a full-rank matrix with $p<n$ and $\theta\in\mathbb{R}^{p\times 1}$.
One can find the desired $\theta$ by writing the sum of squared errors as $$J(\theta)=(\textbf{x}-H\theta)^T(\textbf{x}-H\theta).$$ The $\theta$ that minimizes this cost is
$$\hat\theta=(H^TH)^{-1}H^T\textbf{x}$$
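This follows from setting the gradient of $J$ to zero:
$$\nabla_\theta J(\theta) = -2H^T(\textbf{x}-H\theta) = 0 \quad\Longrightarrow\quad H^TH\hat\theta = H^T\textbf{x},$$
and since $H$ has full column rank ($p<n$), the $p\times p$ matrix $H^TH$ is invertible.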
which means $H\hat\theta$ is the orthogonal projection of $\textbf{x}$ onto the subspace spanned by the columns of $H$.
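A small numerical sketch of this projection property (with an arbitrary random $H$ and $\textbf{x}$, chosen here only for illustration): the residual $\textbf{x}-H\hat\theta$ should be orthogonal to every column of $H$.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 2
H = rng.standard_normal((n, p))   # full-rank n x p matrix (illustrative)
x = rng.standard_normal(n)        # measurement vector (illustrative)

# Normal-equation solution: theta_hat = (H^T H)^{-1} H^T x
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)

# The residual is orthogonal to the column space of H,
# so H theta_hat is the orthogonal projection of x onto Range(H)
residual = x - H @ theta_hat
print(np.allclose(H.T @ residual, 0))  # True
```

The same $\hat\theta$ is returned by `np.linalg.lstsq(H, x, rcond=None)`, which solves the identical least-squares problem.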
Now consider the weighted form of the error: $$J(\theta)=(\textbf{x}-H\theta)^TW(\textbf{x}-H\theta).$$ If $W$ is a positive definite matrix, then substituting $W=L^TL$ (obtainable from the Cholesky decomposition of $W$) gives $$J(\theta)=(L\textbf{x}-LH\theta)^T(L\textbf{x}-LH\theta).$$ Letting $y=L\textbf{x}$, the best $\theta$ is therefore the one for which $LH\hat\theta$ is the orthogonal projection of $y$ onto the subspace spanned by the columns of $LH$.
In other words, we start with the subspace spanned by the columns of $H$, but to minimize the weighted error we switch to the subspace Range($LH$), replace the measurement with $L\textbf{x}$, and find the orthogonal projection of this new measurement onto the new subspace.
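As a sanity check of this transformation (again with arbitrary illustrative data), the ordinary least-squares solution in the transformed coordinates $(y, LH)$ should agree with the direct weighted solution $\hat\theta=(H^TWH)^{-1}H^TW\textbf{x}$:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 2
H = rng.standard_normal((n, p))   # full-rank n x p matrix (illustrative)
x = rng.standard_normal(n)        # measurement vector (illustrative)

# Build an arbitrary positive definite weight matrix W
A = rng.standard_normal((n, n))
W = A @ A.T + n * np.eye(n)

# Direct weighted solution: theta_hat = (H^T W H)^{-1} H^T W x
theta_direct = np.linalg.solve(H.T @ W @ H, H.T @ W @ x)

# Factor W = L^T L: numpy's cholesky returns lower-triangular C
# with W = C C^T, so L = C^T satisfies W = L^T L
C = np.linalg.cholesky(W)
L = C.T
y, G = L @ x, L @ H   # transformed measurement and subspace basis

# Ordinary least squares in the transformed coordinates
theta_chol = np.linalg.solve(G.T @ G, G.T @ y)

print(np.allclose(theta_direct, theta_chol))  # True
```

The agreement holds because $G^TG = H^TL^TLH = H^TWH$ and $G^Ty = H^TW\textbf{x}$, so both routes solve the same normal equations.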
Why do we change from the original subspace to this new one?
What is the best $W$ (or $L$)?
Does this change relate to the shape of the data?