How does correlation help find features for a machine learning model?


I have read articles saying that correlation is used to find features in ML, but I want to know exactly how it works.


I think it's fair to say that if you're interested in practical aspects, then the question should be on either the data science or stats SE.

However, the question does have a mathematical answer. Strictly speaking, correlation is not used to *find* features; rather, it is typically used to select, weed out, or rank them.

Suppose you have a dataset $X$, so $x_i=\text{row}_i(X)$ is a point and $f_j=\text{col}_j(X)$ is a feature, with label vector $Y$.

Two common measures of correlation are the Pearson correlation: $$ \rho(a,b) = \frac{\mathbb{E}[(a-\mu_a)(b-\mu_b)]}{\sigma_a\sigma_b} $$ where $\mu$ and $\sigma$ are the mean and standard deviation, and the mutual information, which is essentially a measure of non-linear correlation: $$ \mathfrak{I}(U,V) = \int_{V} \int_{U} P(u,v)\log\left( \frac{P(u,v)}{P(u)P(v)} \right)du\,dv $$ Denote $C$ as a correlation function (e.g. $C\in\{\rho,\mathfrak{I}\}$).
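As a concrete illustration of these two measures, here is a minimal NumPy sketch: Pearson correlation computed directly from the definition, and a crude histogram-based estimate of mutual information (the function names and the binning choice are mine, not from the answer; practical MI estimators, e.g. k-NN based ones, are more accurate):

```python
import numpy as np

def pearson(a, b):
    # rho(a, b) = E[(a - mu_a)(b - mu_b)] / (sigma_a * sigma_b)
    a, b = np.asarray(a, float), np.asarray(b, float)
    return np.mean((a - a.mean()) * (b - b.mean())) / (a.std() * b.std())

def mutual_info(a, b, bins=10):
    # Histogram estimate of I(a, b) in nats: discretize, then
    # sum p(u,v) * log(p(u,v) / (p(u) p(v))) over non-empty cells.
    p_ab, _, _ = np.histogram2d(a, b, bins=bins)
    p_ab /= p_ab.sum()
    p_a = p_ab.sum(axis=1, keepdims=True)   # marginal over rows
    p_b = p_ab.sum(axis=0, keepdims=True)   # marginal over columns
    nz = p_ab > 0
    return float(np.sum(p_ab[nz] * np.log(p_ab[nz] / (p_a @ p_b)[nz])))
```

Note that Pearson correlation only captures linear dependence, while mutual information is nonzero for any statistical dependence, which is why the answer calls it a measure of non-linear correlation.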

One use of correlation is straightforward: determining how much "information" a given feature contains about the labels. Obviously, a feature with lots of information about (or high absolute correlation with) the labels is useful. In other words, a feature with high $ C(f_k,Y) $ is more likely to be useful.
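This relevance criterion can be turned into a simple ranking, scoring each column of $X$ by $|C(f_k, Y)|$. A sketch using Pearson correlation as $C$ (function name is illustrative):

```python
import numpy as np

def rank_features(X, y):
    # Score each feature (column of X) by |Pearson correlation with y|,
    # then return column indices sorted from most to least relevant.
    X = np.asarray(X, float)
    y = np.asarray(y, float)
    scores = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                       for j in range(X.shape[1])])
    return np.argsort(scores)[::-1], scores
```

Feature selection then amounts to keeping the top-$k$ indices of the returned ranking.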

Another use is to remove redundancy. Having multiple features that share the same information content is essentially useless, and adds only bias and noise to the data. Hence we want low $ C(f_i,f_j)\,\forall\; i\ne j $.

Thus, we are looking for something similar to: $$ F^* = \arg\max_{F\in S_f} \left[ \sum_{f\in F} C(f,Y) - \sum_{f_1\in F}\sum_{f_2\in F} C(f_1,f_2) \right] $$ for $S_f$ being the set of possible combinations of features and $F$ a particular subset of features.

Indeed, these two criteria are the basis of the "Minimum Redundancy Maximum Relevance" (mRMR) feature selection algorithm.
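Since the exact optimum over all subsets is combinatorially expensive, mRMR is usually approximated greedily: at each step, add the feature with the best relevance-minus-mean-redundancy score. A self-contained sketch of that greedy idea, using Pearson correlation as $C$ (this is an illustration of the principle, not the reference implementation):

```python
import numpy as np

def mrmr(X, y, k):
    # Greedy mRMR: start with the most relevant feature, then repeatedly
    # add the feature j maximizing
    #     |C(f_j, Y)|  -  mean over selected s of |C(f_j, f_s)|
    X = np.asarray(X, float)
    n_features = X.shape[1]
    relevance = np.array([abs(np.corrcoef(X[:, j], y)[0, 1])
                          for j in range(n_features)])
    selected = [int(np.argmax(relevance))]
    while len(selected) < k:
        best, best_score = None, -np.inf
        for j in range(n_features):
            if j in selected:
                continue
            redundancy = np.mean([abs(np.corrcoef(X[:, j], X[:, s])[0, 1])
                                  for s in selected])
            score = relevance[j] - redundancy
            if score > best_score:
                best, best_score = j, score
        selected.append(best)
    return selected
```

The redundancy term is what distinguishes this from the plain ranking: a near-duplicate of an already-chosen feature scores poorly even if it is highly correlated with the labels.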