Why is it that large linear SVM coefficients denote the most important features?

I'm looking for an intuitive explanation, preferably a geometric one. Why is it that when I sort the coefficients of my linear SVM, the most indicative features turn out to be the ones with the largest coefficients?

Suppose I have only 2 features X1 and X2. Their corresponding coefficients are simply the slope of the line separating the 2D space, right? Why is it that if the coefficient for X1 is 3 while the one for X2 is 1, it means that X1 is more indicative of the target class variable than X2?

There is 1 best solution below

My colleague Doron convinced me with the following explanation:

Suppose we are trying to classify documents by an "is_fashion" target variable: do they deal with fashion or not? We have only 2 features: x is the TF-IDF score of the word "dress", and y is the TF-IDF score of the word "new". Each document can then be represented as a point in this 2D space.

Suppose that the y feature isn't relevant to the target at all, while the x feature is a great predictor: every document with an x score above 3 deals with fashion, and every document with an x score below 3 does not. The SVM separating line would look like 1*x + 0*y - 3 = 0, or x = 3.

The classification score of a new document would be 1*x + 0*y - 3, which is nothing but the projection (dot product) of the document onto the perpendicular to the separating line, shifted by the offset 3. This example shows clearly that a feature with a coefficient of 0 does not affect the classification.
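To make the zero-coefficient case concrete, here is a minimal sketch using NumPy. The weights and offset come from the line 1*x + 0*y - 3 = 0 above; the two documents are hypothetical values chosen only for illustration, and `score` is an illustrative helper, not part of any SVM library.

```python
import numpy as np

# Weights and offset taken from the separating line 1*x + 0*y - 3 = 0.
w = np.array([1.0, 0.0])
b = -3.0

def score(doc):
    """Decision value w . doc + b; a positive value means 'fashion'."""
    return float(np.dot(w, doc) + b)

# Hypothetical documents: same "dress" score (x), very different "new" score (y).
doc_a = np.array([4.0, 0.5])
doc_b = np.array([4.0, 9.0])

print(score(doc_a))  # 1.0
print(score(doc_b))  # 1.0
```

Both documents get the same score, because the y coordinate is multiplied by 0 before it reaches the sum.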

Let's complicate things a bit. What if the word "new" is somewhat relevant to fashion, but not as relevant as "dress"? The separating line (black in the image) would now be 5*x + 1*y - 3 = 0, or y = -5x + 3. The perpendicular (red in the image) would be y = x/5 + 3.

[Image: svm_plot]
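A quick numeric check of the geometry, as a sketch: the coefficient vector (5, 1) should be perpendicular to the separating line y = -5x + 3 and parallel to the red line y = x/5 + 3. The direction vectors below are derived from the two slopes; the variable names are illustrative only.

```python
import numpy as np

w = np.array([5.0, 1.0])          # coefficients of 5*x + 1*y - 3 = 0
line_dir = np.array([1.0, -5.0])  # direction of y = -5x + 3 (slope -5)
perp_dir = np.array([5.0, 1.0])   # direction of y = x/5 + 3 (slope 1/5)

# A zero dot product means w is perpendicular to the separating line.
dot_with_line = float(np.dot(w, line_dir))

# A zero 2D cross product means w is parallel to the red line.
cross_with_perp = float(w[0] * perp_dir[1] - w[1] * perp_dir[0])

print(dot_with_line)    # 0.0
print(cross_with_perp)  # 0.0
```

So the coefficients are not the slope of the separating line; they are the direction along which the score changes.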

The classification score is now proportional to the projection onto the red line. The point <2,2> would get a classification score of 5*2 + 1*2 - 3 = 9. If we subtract 1 from our important feature x, we need to move 5 steps higher in the y dimension to get the same score, reaching the point <1,7>, which also gets a score of 5*1 + 1*7 - 3 = 9. In other words, one unit of x is worth 5 units of y, which is why the larger coefficient marks the more indicative feature.
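The arithmetic above can be reproduced with a short NumPy sketch. The weights and points are the ones from the example; `score` is an illustrative helper. Note that reading coefficients as importance assumes the features are on comparable scales, as two TF-IDF scores are here.

```python
import numpy as np

# Weights and offset taken from the separating line 5*x + 1*y - 3 = 0.
w = np.array([5.0, 1.0])
b = -3.0

def score(point):
    """Decision value w . point + b."""
    return float(np.dot(w, point) + b)

p = np.array([2.0, 2.0])
q = np.array([1.0, 7.0])

print(score(p))  # 9.0
print(score(q))  # 9.0

# A one-unit step along an axis changes the score by that axis's coefficient.
step_x = score(p + np.array([1.0, 0.0])) - score(p)
step_y = score(p + np.array([0.0, 1.0])) - score(p)

print(step_x)  # 5.0
print(step_y)  # 1.0
```

A unit step in x moves the score 5 times as far as a unit step in y, which matches the 5-to-1 trade between the points <2,2> and <1,7>.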