I'm looking for an intuitive explanation, preferably geometric. Why is it that if I sort the coefficients of my linear SVM, the features with the largest coefficients are the most indicative ones?
Suppose I have only 2 features, X1 and X2. Their corresponding coefficients determine the slope of the line separating the 2D space, right? Why is it that if the coefficient of X1 is 3 while the coefficient of X2 is 1, X1 is more indicative of the target class than X2?
My colleague Doron convinced me with the following explanation:
Suppose we are trying to classify documents by the "is_fashion" target variable: do they deal with fashion or not? We have only 2 features: x is the TF-IDF score of the word "dress", and y is the TF-IDF score of the word "new", so each document can be represented as a point in this 2D space.

Suppose the y feature isn't relevant to the target at all, while the x feature is a great predictor: every document with an x score above 3 deals with fashion, and every document with an x score below 3 does not. The SVM separating line would then be 1x + 0y - 3 = 0, or x = 3. The classification score of a new document would be 1*x + 0*y - 3, which is nothing but the projection (dot product) of the document onto the perpendicular to the separating line. This example clearly shows how a feature with a 0 coefficient does not affect the classification.

Let's complicate things a bit. What if the word "new" is somewhat relevant to fashion, but not as relevant as "dress"? The separating line (black in the image) would now be 5*x + 1*y - 3 = 0, or y = -5x + 3. The perpendicular (red in the image) would be y = x/5 + 3, and classification would now be a projection onto the red line. The point <2,2> would get a classification score of 5*2 + 1*2 - 3 = 9. If we decrease the important feature x by 1, we have to move 5 steps up in the y dimension to get the same score: the point <1,7> also scores 5*1 + 1*7 - 3 = 9.
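Both scenarios can be checked numerically. Here's a minimal sketch in plain Python (no SVM library; the weight vectors, bias, and points are the ones from the example above, and `decision_score` is just a name I chose for the dot-product-plus-bias computation):

```python
# Decision score of a linear SVM: the dot product of the weight
# vector w with the document's feature vector, plus the bias b.
def decision_score(w, b, point):
    return sum(wi * xi for wi, xi in zip(w, point)) + b

# Scenario 1: w = (1, 0), b = -3, i.e. the line 1x + 0y - 3 = 0.
# The y coordinate (TF-IDF of "new") has no effect on the score.
w1, b1 = (1, 0), -3
print(decision_score(w1, b1, (5, 0)))    # 2
print(decision_score(w1, b1, (5, 100)))  # still 2: y is ignored

# Scenario 2: w = (5, 1), b = -3, i.e. the line 5x + 1y - 3 = 0.
# Losing 1 unit of x must be compensated by 5 units of y.
w2, b2 = (5, 1), -3
print(decision_score(w2, b2, (2, 2)))  # 5*2 + 1*2 - 3 = 9
print(decision_score(w2, b2, (1, 7)))  # 5*1 + 1*7 - 3 = 9
```

The 5:1 coefficient ratio is exactly the exchange rate between the two features: one unit of "dress" is worth five units of "new" to the classifier, which is the geometric sense in which the larger coefficient marks the more indicative feature.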