Which machine learning algorithm to use?!

I have a training set consisting of essays written by students in response to a question. Human evaluators have scored each essay with a label such as 1, 2, or 3, which is the mark awarded to that essay. I want to know whether to use a regression or a classification algorithm. My reading on machine learning suggests going with classification, but then should I use numeric or nominal classification? I am thinking numeric. Am I correct?

There is 1 answer below

Since your labels are discrete, you are looking at a classification problem with text data. I recommend using Logistic Regression (LR) as implemented in the scikit-learn library for Python.

You would need to construct a $V \times D$ term-document matrix $A$, where $V$ is the vocabulary size and $D$ is the number of documents. Thus, each column of $A$ is a document represented as a collection of word counts (here we assume that words are exchangeable and only their counts matter, i.e. a bag-of-words model). I recommend standard text preprocessing such as tokenization, stop-word removal, and tf-idf weighting. The Gensim library implements many of these steps.
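To make the shape of $A$ concrete, here is a minimal sketch in plain Python that builds a small tf-idf weighted term-document matrix by hand. The essays, the tiny stop-word list, and the helper name `tokenize` are all made up for illustration, and the smoothed idf formula is just one common variant; in practice a library such as Gensim or scikit-learn would do this for you.

```python
import math
import re
from collections import Counter

# Hypothetical toy essays standing in for the real training set.
essays = [
    "The water cycle moves water through evaporation and rain.",
    "Water evaporates, forms clouds and falls again as rain.",
    "Plants need water and sunlight to grow.",
]

# Tiny illustrative stop-word list; a real one would be much longer.
STOP_WORDS = {"the", "and", "as", "to", "through", "a", "of"}


def tokenize(text):
    """Lowercase, keep alphabetic tokens, drop stop words."""
    return [t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOP_WORDS]


docs = [tokenize(e) for e in essays]
vocab = sorted({t for doc in docs for t in doc})
V, D = len(vocab), len(docs)

# Raw term counts per document.
counts = [Counter(doc) for doc in docs]

# Document frequency: in how many documents each term occurs.
df = {t: sum(1 for c in counts if t in c) for t in vocab}

# Smoothed inverse document frequency (one common variant, assumed here).
idf = {t: math.log((1 + D) / (1 + df[t])) + 1 for t in vocab}

# A is V x D: one row per term, one column per document.
A = [[counts[j][t] * idf[t] for j in range(D)] for t in vocab]

# X = A^T is D x V: one row per document, one column per term.
X = [list(column) for column in zip(*A)]

print(f"A is {len(A)} x {len(A[0])}, X is {len(X)} x {len(X[0])}")
```

Each column of `A` (equivalently, each row of `X`) is one essay described only by weighted word counts, which is exactly the exchangeability assumption mentioned above.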

The input to your logistic regression becomes an $(X, y)$ pair, where $X = A^{T}$ (each row of $X$ is a document with $V$ features) and $y$ holds one label from $\{1, \ldots, K\}$ per document. Once trained, you can use LR to predict the label of a new document (preprocessed in the same way as the training data).
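The training and prediction steps could look roughly like the sketch below. It uses scikit-learn's `TfidfVectorizer` for the preprocessing instead of Gensim, purely to keep everything in one library, and the essays and scores are invented toy data. Wrapping both steps in a pipeline is one way to make sure a new document is preprocessed exactly like the training data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Hypothetical toy data: essays and the marks (1, 2 or 3) given by evaluators.
essays = [
    "Water goes up and comes down.",
    "Rain is wet and it falls.",
    "Water evaporates, forms clouds and falls as rain.",
    "The sun heats water, which rises and later falls as rain.",
    "Evaporation, condensation and precipitation drive the water cycle.",
    "Solar energy drives evaporation; vapour condenses into clouds and precipitates.",
]
scores = [1, 1, 2, 2, 3, 3]

# The vectorizer builds X (documents x vocabulary) with tf-idf weights;
# logistic regression then learns to map each row of X to a label in y.
model = make_pipeline(
    TfidfVectorizer(lowercase=True, stop_words="english"),
    LogisticRegression(max_iter=1000),
)
model.fit(essays, scores)

# A new essay goes through the same fitted vectorizer before prediction.
new_essay = ["Clouds form when water vapour condenses, then rain falls."]
predicted = model.predict(new_essay)
probabilities = model.predict_proba(new_essay)

print("Predicted score:", predicted[0])
print("Class probabilities:", probabilities[0])
```

With only six essays the prediction itself is meaningless; the point is the flow from raw text to $(X, y)$ to a fitted classifier. On real data you would hold out some scored essays to check how well the predicted marks agree with the human ones.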