What is the rationale behind transformation of a subspace in machine learning


Suppose we have a random vector $\textbf{x}\in \mathbb{R}^n$ with known probabilistic characteristics. We want to find a $\theta$ that makes the vector $z=H\theta$ as close as possible to $\textbf{x}$, where $H\in\mathbb{R}^{n\times p}$ is a full-rank matrix with $p<n$ and $\theta\in\mathbb{R}^{p}$.

One can find the desired $\theta$ by writing the least-squares error as $$J(\theta)=(\textbf{x}-H\theta)^T(\textbf{x}-H\theta)$$ The $\theta$ that minimizes this cost is

$$\hat\theta=(H^TH)^{-1}H^T\textbf{x}$$

which means the best $\theta$ is the one that makes $H\hat\theta$ the orthogonal projection of $\textbf{x}$ onto the subspace spanned by the columns of $H$.
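This orthogonality can be checked numerically. The sketch below (with an arbitrary random $H$ and $\textbf{x}$ for illustration) solves the normal equations $H^TH\hat\theta=H^T\textbf{x}$ and verifies that the residual is orthogonal to every column of $H$:

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 6, 3
H = rng.standard_normal((n, p))   # full-rank n x p matrix (assumed)
x = rng.standard_normal(n)        # one realization of the random vector x

# Least squares: theta_hat = (H^T H)^{-1} H^T x, via the normal equations
theta_hat = np.linalg.solve(H.T @ H, H.T @ x)

# The residual x - H theta_hat is orthogonal to Range(H)
residual = x - H @ theta_hat
print(np.allclose(H.T @ residual, 0))  # True: residual is perpendicular to each column of H
```

Here `np.linalg.solve` is used instead of explicitly forming $(H^TH)^{-1}$, which is the numerically preferred way to evaluate that closed form.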

However, the weighted form of the error can be written as $$J(\theta)=(\textbf{x}-H\theta)^TW(\textbf{x}-H\theta)$$ If $W$ is a positive definite matrix, it admits a factorization $W=L^TL$ (for instance, take $L=C^T$ from the Cholesky decomposition $W=CC^T$), so the above can be written as $$J(\theta)=(L\textbf{x}-LH\theta)^T(L\textbf{x}-LH\theta)$$ With the earlier notion in mind, and letting $y=L\textbf{x}$, one can say the best $\theta$ is the one that makes $LH\hat\theta$ the orthogonal projection of $y$ onto the subspace spanned by the columns of $LH$.

The last step means that we start with the subspace Range($H$), but to minimize the weighted error we change the subspace to Range($LH$) and the measurement to $L\textbf{x}$, and then find the orthogonal projection of the new measurement onto the new subspace.
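The equivalence of the two viewpoints can also be checked numerically: projecting $y=L\textbf{x}$ onto Range($LH$) gives the same $\hat\theta$ as the weighted normal equations $\hat\theta=(H^TWH)^{-1}H^TW\textbf{x}$. A minimal sketch, with $W$ built arbitrarily here just so it is positive definite:

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 6, 3
H = rng.standard_normal((n, p))
x = rng.standard_normal(n)

# An arbitrary positive-definite weight matrix W (for illustration only)
A = rng.standard_normal((n, n))
W = A @ A.T + n * np.eye(n)

# Factor W = L^T L: np.linalg.cholesky returns C with W = C C^T, so take L = C^T
C = np.linalg.cholesky(W)
L = C.T

# Whitened problem: ordinary least squares of y = Lx on the columns of LH
y, G = L @ x, L @ H
theta_whitened = np.linalg.solve(G.T @ G, G.T @ y)

# Direct weighted least squares: theta = (H^T W H)^{-1} H^T W x
theta_direct = np.linalg.solve(H.T @ W @ H, H.T @ W @ x)

print(np.allclose(theta_whitened, theta_direct))  # True: same minimizer
```

Since $G^TG = H^TL^TLH = H^TWH$ and $G^Ty = H^TL^TL\textbf{x} = H^TW\textbf{x}$, the two linear systems are identical term by term.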

Why do we change our subspace to this new subspace?

What is the best $W$ or $L$?

Does this change relate to the shape of the data?