Can anyone give a real-life example to illustrate why the principal axis that has the maximal variance retains the most information?


One job in PCA is to maximize variance, because the principal axis that has the maximal variance retains the most information.

Why is that? How can one understand this in an easy or concrete way (such as with a real-life example)?


There are 4 best solutions below

Answer 1 (0 votes):

This could be quite difficult to see in a real-life example, since it is a theoretical or abstract assumption.

When you do PCA you want to project the data into a low-dimensional space. This projection will obviously lose information, so you want to retain as much of it as possible. Under the PCA assumptions, the eigenvalues of the covariance matrix show how much variance is retained along each axis.

This variance is tied to the reconstruction error of the data: you want a representation in the low-dimensional space rich enough that you can return to the high-dimensional space losing as little information as possible.

Think of it this way: if you project onto a direction with almost no variance, you are essentially projecting onto a point. How would you return to the original space? The more variance a direction captures, the better you can reconstruct the data from it.

PCA can be seen as maximizing this variance, so the more variance retained, the better the fit. If the fraction of variance retained is near 1, the fit is said to be very good and the data essentially lies in the space you are projecting onto.

There is a good lecture on this here: https://www.stat.cmu.edu/~cshalizi/350/lectures/10/lecture-10.pdf
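To make the reconstruction-error point concrete, here is a minimal numpy sketch (the data, the sample size of 500, and the standard deviations 3 and 0.3 are all made up for illustration): projecting onto the maximal-variance axis and reconstructing shows that the error we incur is essentially the variance we threw away.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical 2-D data with most variance along one axis (std 3 vs 0.3).
X = rng.normal(size=(500, 2)) @ np.array([[3.0, 0.0], [0.0, 0.3]])
X -= X.mean(axis=0)

# PCA via eigendecomposition of the covariance matrix.
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))  # ascending
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

# Project onto the top principal axis only, then reconstruct.
w = eigvecs[:, :1]                   # first PC as a column vector
X_rec = (X @ w) @ w.T

retained = eigvals[0] / eigvals.sum()                 # variance kept
rec_err = np.mean(np.sum((X - X_rec) ** 2, axis=1))   # what we lost

print(f"fraction of variance retained: {retained:.3f}")
print(f"mean reconstruction error:     {rec_err:.3f}")
# The mean reconstruction error matches the discarded (small) eigenvalue:
# keeping the max-variance axis is what minimizes it.
```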

Answer 2 (0 votes):

PCA itself does not reduce dimension. Moreover, since PCA is usually applied to the covariance matrix (of the scaled and centered data), it may be viewed as an intermediate step in predictor construction and selection, which in turn, as a by-product, may reduce dimensionality (by reducing the number of predictors). For dimensionality reduction of the data matrix itself, the SVD is much more useful.
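As a sketch of that SVD point (the 100 x 5 matrix below is synthetic, built so that two directions carry almost all the signal), a truncated SVD reduces the data matrix itself, not just the covariance:

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical 100 x 5 data matrix whose 5 columns are mostly redundant:
# they are linear mixes of 2 latent factors plus tiny noise.
base = rng.normal(size=(100, 2))
X = base @ rng.normal(size=(2, 5)) + 0.01 * rng.normal(size=(100, 5))
X -= X.mean(axis=0)

# Truncated SVD: keep the top k singular directions to get a
# k-dimensional representation of each row of the data.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
X_reduced = U[:, :k] * s[:k]     # 100 x 2 representation of the data

print(X_reduced.shape)           # (100, 2)
print(s)                         # two large singular values, three tiny
```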

Anyway, regarding concrete intuition. Assume you have data on the age and daily income of $100$ people (in years and dollars, respectively). Assume that you sampled exactly $1$ person of each age, from $1$ y.o. up to $100$ y.o. Assume, just for the sake of this example, that $$ \text{income}_i=2\,\text{age}_i+\epsilon_i, $$
with $\mathbb{E}[\epsilon_i]=0$ and $Var(\epsilon_i) = 1$. In such data, you will see that all $100$ points lie closely and symmetrically around the line $y=2x$. Hence, most of your variance points in the direction of $(1,2)^T$. Namely, the eigenvector that is the first principal component (PC) corresponds to $\text{income} = 2\,\text{age}$, with a large eigenvalue ($\gg 1$), as this component accounts for most of the information. Moreover, in our simplistic example, this is the only informative axis (direction). The second PC will be orthogonal to the first one, i.e., $y=-x/2$ ($\text{income} = -\text{age}/2$). This PC accounts mainly for the small noise (error) term. Its eigenvalue should be very small (much smaller than that of the first PC).
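The example above can be simulated directly (a minimal numpy sketch; the random seed and the particular noise draw are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(2)

# The setup above: income_i = 2 * age_i + eps_i, with Var(eps) = 1,
# and exactly one person of each age 1..100.
age = np.arange(1, 101, dtype=float)
income = 2 * age + rng.normal(0, 1, size=100)
X = np.column_stack([age, income])
X -= X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]

pc1 = eigvecs[:, 0]
print(pc1 / pc1[0])              # roughly (1, 2): income = 2 * age
print(eigvals[0] / eigvals[1])   # first eigenvalue utterly dominates
```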

Now assume the opposite situation. Assume that you sampled people only of ages $30$-$32$. Assume that the same structure holds, i.e., $ \text{income}_i=2\,\text{age}_i+\epsilon_i, $ however, now $Var(\epsilon_i) = 1000$. Can you see what your scatter plot looks like in this case? All your $100$ data points sit in a small horizontal strip from $30$ to $32$ with huge vertical dispersion. In such a case, what direction contains most of the information? The $y$ axis, that is, the income! Namely, your first PC is simply $\text{income}$ ($(0,1)^T$) and the second PC is $\text{age}$ ($(1,0)^T$).

Regarding the interpretation: due to the high variance of the noise term $\epsilon_i$ and the little variance of $\text{age}$, the data contains almost no evidence of the structure $\text{income} = 2\,\text{age}$, so you infer that the variables are uncorrelated. However, as income is much more informative, its eigenvalue will be much larger, and if you would like to reduce dimension, you can ignore the age variable altogether and still retain most of the information contained in the data.
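A quick simulation of this second scenario (again a synthetic sketch; sampling ages uniformly on $[30,32]$ is my own concrete choice, not specified above):

```python
import numpy as np

rng = np.random.default_rng(3)

# Same structure, but age restricted to [30, 32] and Var(eps) = 1000.
age = rng.uniform(30, 32, size=100)
income = 2 * age + rng.normal(0, np.sqrt(1000), size=100)
X = np.column_stack([age, income])
X -= X.mean(axis=0)

eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
pc1 = eigvecs[:, order[0]]

# The first PC is now essentially the income axis, not the y = 2x line.
print(np.abs(pc1))   # close to (0, 1)
```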

Answer 3 (0 votes):

It is simply not always true that

the principal axis that has the maximal variance retain the most information.

This is discussed with many examples and references here at Cross Validated and also here and here.

It simply cannot be answered in the abstract which PC is "most important" or "most informative". For whom? For what purpose? Tell us.

There is a very good, extensive post about PCA which, at this moment, is the most upvoted of all questions at CV!

Answer 4 (0 votes):

Say we’re trying to predict how good a football player is $(W)$ based on how tall he is $(X)$, what shoe size he wears $(Y)$, and the size of his hands $(Z)$.

There are some things you must know about these features of the football players.

  1. For one, all football players are required to wear the same exact shoes.

  2. Secondly, this football league requires every team to have players in every height category - all football teams will have dwarves, giants, and everything in between.

  3. Finally, hand sizes vary - but not all that much. Players have similarly sized hands - perhaps a little different, but not too different.

Again, remember what the goal is: to predict $W$ based on $(X,Y,Z)$.

The first thing we do is put $(X,Y,Z)$ on the same scale. We do this by standardizing the dataset. Now that they're all on the same scale, we can compare how good these features could potentially be for predicting $W$.

Let's first look at $Y$. Does taking account of the players' shoes really make sense? Obviously not - they all wear the same shoes, so we can't really compare two players by the shoes they wear.

Note that the "shoe" feature $Y$ has zero variance.

So, if our original dataset came with information about their shoes, and they all wear the same shoes, that feature would just be taking up space on our computer. There's no way to predict $W$ based on it - no way to compare two players based on it, as it doesn't vary between players.

What about height? Well, since there is such a large range of different possible heights which our players can take on, it does make sense to use that feature to compare them.

A taller player could perhaps be better than a shorter player... we don't know if there actually is a relationship between height and $W$. However, because height takes on such a big range of values, if there is a relationship between a player's height and how good they are, we'll be able to look for it.

Note that since the height feature has such a large range, it has a large variance.

Finally, what about hand size? Well, it varies a little bit... but not all that much. It may be good for comparing players, but we don't have as much "information" in it as in the "height" feature, simply because our players all have similar hand sizes. If the differences in hand sizes are really small, then it's hard to say whether that's what's contributing to one player being better or worse than another.

Now, what does PCA do? PCA combines the features in such a way as to create new features along which the dataset has maximum variance! For example - height is probably linearly correlated with hand size. Let's say it's really correlated, so that strictly taller players have strictly bigger hands.

It doesn't really make sense to keep both the hand-size feature and the height feature in our dataset, since the height difference alone tells us whether or not one player has bigger hands than another.

PCA might combine the height and hand-size features to create a new feature that preserves as much variance as possible from the original two features it combined.

It will also create a feature orthogonal to this one in which there is very little variance.

We'll only keep the one with the most variance because that's the one with the most information!
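The whole story can be sketched numerically (all player measurements below are invented for illustration; the point is the zero-variance shoe column and the near-perfect height/hand-size correlation):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 200

height = rng.uniform(150, 210, size=n)           # large range -> large variance
hands = 0.1 * height + rng.normal(0, 0.2, n)     # strongly tied to height
shoes = np.full(n, 42.0)                         # everyone wears the same shoe

X = np.column_stack([shoes, height, hands])

# The shoe feature has zero variance: it cannot distinguish players,
# so drop it before standardizing (its std would be 0, dividing by it fails).
keep = X.std(axis=0) > 0
X = X[:, keep]

# Standardize the remaining features, then run PCA.
X = (X - X.mean(axis=0)) / X.std(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(X, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals = eigvals[order]

print(keep)                          # [False  True  True]: shoes dropped
print(eigvals[0] / eigvals.sum())    # PC1 carries nearly all the variance
```

Because height and hand size are almost perfectly correlated here, the first principal component is essentially a "body size" feature combining both, and the orthogonal second component carries almost no variance, so we'd keep only the first.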