Statistical analyses

Imagine that you have run PCA on data from a questionnaire collected by a car manufacturer, in which 10,000 people gave their age, gender, country of residence and the car model they purchased.

i) You find that all N eigenvectors (N = dimensionality of the dataset) cover the same % of the data variance. What is N here? What is this % of variance?

ii) How would you interpret the results in i)? What if 1 eigenvector covers 99% of the data variance?

Thanks in advance.

Best answer

Assume you have a $(T \times N)$ matrix of data, $X$, whose columns are the variables you've listed and whose rows are the different observations. Then, once each column has been centered, $X^TX$ is a square $(N \times N)$ matrix proportional to the empirical covariance matrix of $X$ (divide by $T-1$ to get the covariance itself).

Here, $N$ refers to the number of variables in your dataset. One of the ways PCA can be useful is that it allows you to take your $N$ possibly correlated variables (whose correlations are summarized in $X^TX$) and transform them into $N$ linearly uncorrelated variables (called principal components). Linearly uncorrelated means that the off-diagonal elements of the principal components' covariance matrix are zero.
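
To make the "uncorrelated" claim concrete, here is a short sketch of the algebra. It assumes the columns of $X$ have been centered; the symbols $S$, $V$, $\Lambda$ and $Z$ are introduced here for illustration and are not part of the question.

$$S = \frac{1}{T-1} X^T X = V \Lambda V^T, \qquad V^T V = I, \qquad \Lambda = \mathrm{diag}(\lambda_1, \dots, \lambda_N)$$

$$Z = XV \quad \Rightarrow \quad \frac{1}{T-1} Z^T Z = V^T S V = \Lambda$$

The columns of $V$ are the eigenvectors, the columns of $Z$ are the principal components, and their covariance matrix $\Lambda$ is diagonal. The share of the total variance explained by component $k$ is $\lambda_k / \sum_{j=1}^{N} \lambda_j$.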

PCA is most helpful in cases like ii). If we find that 99% of the variance of $X$ is explained by one principal component (the projection of the data onto a single eigenvector), then we can focus our attention on modeling or analyzing that one principal component. In this case, we've reduced our analysis from an $N$-dimensional problem to a 1-dimensional one. If this isn't the case, as in i), where every component explains the same share $1/N$ of the variance, then PCA hasn't really done much for us: no direction is more informative than any other, and we still have an $N$-dimensional problem like before.
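
The arithmetic behind the two cases can be sketched in a few lines of Python. This is only an illustration: the eigenvalues below are made-up numbers, the helper names `explained_variance_ratio` and `components_needed` are hypothetical, and $N = 4$ assumes each of the four questionnaire fields (age, gender, country, car model) is encoded as a single numeric column.

```python
def explained_variance_ratio(eigenvalues):
    """Share of the total variance carried by each principal component."""
    total = sum(eigenvalues)
    return [ev / total for ev in eigenvalues]


def components_needed(eigenvalues, threshold):
    """Smallest number of leading components whose cumulative share reaches threshold."""
    shares = sorted(explained_variance_ratio(eigenvalues), reverse=True)
    cumulative = 0.0
    for count, share in enumerate(shares, start=1):
        cumulative += share
        if cumulative >= threshold:
            return count
    return len(eigenvalues)


# Case i): all N = 4 eigenvalues are equal (hypothetical values),
# so every component explains 1/N = 25% of the variance.
equal = [1.0, 1.0, 1.0, 1.0]
ratios_equal = explained_variance_ratio(equal)

# Case ii): one eigenvalue dominates (hypothetical values),
# so the first component explains 99% of the variance.
dominant = [99.0, 0.5, 0.25, 0.25]
ratios_dominant = explained_variance_ratio(dominant)

print(ratios_equal, components_needed(equal, 0.95))
print(ratios_dominant, components_needed(dominant, 0.95))
```

In case i) all four components are needed to reach 95% of the variance, so nothing is gained; in case ii) a single component is enough.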