Supervised version of TF-IDF (term frequency-inverse document frequency)


In document classification tasks, documents are commonly preprocessed by computing counts or tf-idf weights for each term-document pair. I understand how tf-idf works, but one shortcoming seems to be that it never directly looks at how the tf-idf weightings affect classification ability. Instead, it computes the weights in a separate, unsupervised step; you then train your classifier on the resulting feature vectors (possibly using measures like cosine similarity) and transform the test data the same way (with the same vocabulary).
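To make the pipeline concrete, here is a minimal sketch of that two-step process using scikit-learn (the toy corpus and labels are made up for illustration):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy corpus with binary labels (1 = spam, 0 = not spam) -- illustrative only
train_docs = [
    "win free money now",
    "free prize claim now",
    "meeting agenda attached",
    "project status report",
]
train_labels = [1, 1, 0, 0]

# Step 1: unsupervised tf-idf weighting (labels are never consulted here)
vectorizer = TfidfVectorizer()
X_train = vectorizer.fit_transform(train_docs)

# Step 2: train a classifier on the resulting feature vectors
clf = LogisticRegression()
clf.fit(X_train, train_labels)

# Transform unseen documents with the *same* fitted vocabulary
test_docs = ["claim your free money", "status meeting report"]
X_test = vectorizer.transform(test_docs)
preds = clf.predict(X_test)
```

Note that the vectorizer is fit only once, on the training data; the labels play no role until the classifier step.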

For example, consider a binary classification task such as spam / not spam. You can apply tf-idf in an unsupervised way, computing a weight for each term in each document, then train your classifier on the resulting matrix of feature vectors. If a word occurs across many documents, the idf term downweights it as a ubiquitous word. But if you have labeled training data (each document carries a category label) and you observe that the word almost always occurs in just one of the document classes, then it is actually a strong feature. So it seems you could build a better vocabulary, or better weights, by directly measuring their impact on classification ability.
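One simple way to quantify "the word almost always occurs in one class" is a per-term class-association score such as the chi-squared statistic, which scikit-learn exposes as `sklearn.feature_selection.chi2`. The sketch below (an illustrative heuristic, not a standard named method) computes those scores and rescales the tf-idf columns by them, so a class-concentrated term is boosted even if its idf is low:

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
from sklearn.feature_selection import chi2

# Toy labeled corpus: "free" is concentrated in the spam class,
# while "today" is spread across both classes
docs = [
    "free money free offer",
    "free money now",
    "free prize today",
    "meeting today ok",
    "meeting agenda today",
    "report agenda ok",
]
labels = [1, 1, 1, 0, 0, 0]

# Count term occurrences, then score each term's association with the labels
vec = CountVectorizer()
X = vec.fit_transform(docs)
scores, _ = chi2(X, labels)

# Chi-squared rewards class concentration: "free" scores higher than "today"
# even though both are fairly common in the corpus
free_score = scores[vec.vocabulary_["free"]]
today_score = scores[vec.vocabulary_["today"]]

# Heuristic "supervised tf-idf": scale each tf-idf column by its
# normalized chi-squared score (assumption: a simple multiplicative rescale)
tfidf = TfidfVectorizer(vocabulary=vec.vocabulary_).fit_transform(docs)
X_sup = tfidf.multiply(scores / scores.max())
```

This is only one possible construction; the broader point is that the labels give you a term-level signal that plain idf throws away.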

So my question is: are there any techniques that turn tf-idf into a supervised method by directly taking advantage of the known document labels? Basically, making TF-IDF a supervised approach?