I'm looking for an intuitive explanation, preferably geometric. Why is it that if I sort the coefficients of my linear SVM, the features with the largest coefficients are the most indicative ones?
Suppose I have only 2 features, X1 and X2. Their corresponding coefficients determine the slope of the line separating the 2D space, right? Why is it that if the coefficient of X1 is 3 while the coefficient of X2 is 1, X1 is more indicative of the target class than X2?
My colleague Doron convinced me with the following explanation:
Suppose we are trying to classify documents by the "is_fashion" target variable: do they deal with fashion or not? We have only 2 features: x is the TF-IDF score of the word "dress", and y is the TF-IDF score of the word "new", so each document can be represented as a point in this 2D space.

Suppose the y feature isn't relevant to the target at all, while the x feature is a great predictor: every document with an x score above 3 deals with fashion, and every document with an x score below 3 does not. The SVM separating line would then be 1x + 0y - 3 = 0, or x = 3. The classification score of a new document would be 1*x + 0*y - 3, which is nothing but the projection (dot product) of the document onto the perpendicular to the separating line. This example clearly shows how a feature with a 0 coefficient does not affect the classification.

Let's complicate things a bit. What if the word "new" is somewhat relevant to fashion, but not as relevant as "dress"? The separating line (black in the image) would now be 5*x + 1*y - 3 = 0, or y = -5x + 3. The perpendicular (red in the image) would be y = x/5 + 3, and classification would now be a projection onto the red line. The point <2,2> would get a classification score of 5*2 + 1*2 - 3 = 9. If we decrease the important feature x by 1, we have to move 5 steps up in the y dimension to get the same score: the point <1,7> also scores 5*1 + 1*7 - 3 = 9.
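Both scenarios can be checked numerically. Here's a minimal sketch in plain Python (no SVM library; the weight vectors, bias, and points are the ones from the example above, and `decision_score` is just a name I chose for the dot-product-plus-bias computation):

```python
# Decision score of a linear SVM: the dot product of the weight
# vector w with the document's feature vector, plus the bias b.
def decision_score(w, b, point):
    return sum(wi * xi for wi, xi in zip(w, point)) + b

# Scenario 1: w = (1, 0), b = -3, i.e. the line 1x + 0y - 3 = 0.
# The y coordinate (TF-IDF of "new") has no effect on the score.
w1, b1 = (1, 0), -3
print(decision_score(w1, b1, (5, 0)))    # 2
print(decision_score(w1, b1, (5, 100)))  # still 2: y is ignored

# Scenario 2: w = (5, 1), b = -3, i.e. the line 5x + 1y - 3 = 0.
# Losing 1 unit of x must be compensated by 5 units of y.
w2, b2 = (5, 1), -3
print(decision_score(w2, b2, (2, 2)))  # 5*2 + 1*2 - 3 = 9
print(decision_score(w2, b2, (1, 7)))  # 5*1 + 1*7 - 3 = 9
```

The 5:1 coefficient ratio is exactly the exchange rate between the two features: one unit of "dress" is worth five units of "new" to the classifier, which is the geometric sense in which the larger coefficient marks the more indicative feature.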