How to apply logistic regression to analyse and predict this kind of problem

Say I want to predict whether a new question posted on Stack Overflow will get an answer within 24 hours.

I was given details about previous questions, so I now have thousands of observations like the one below:

q_id(int)           => Question id,
q_text(string)      => Question,
q_mc(string)        => Question's main category,
q_mc_f(int)         => No of Main category followers,
q_rc[Array]         => Question's related categories,
q_rc_f[Array]       => No of followers of each related category,
got_ans(bool)       => Whether it got answer in 24hrs or not [Yes/No]

I'm aware that I need to use logistic regression or probit model (to be exact) to find whether a newly asked question with details

q_id,q_text,q_mc,q_mc_f,q_rc,q_rc_f

will get an answer in 24hrs or not. That is, what is got_ans for the new question?

I've gone through this link about probit model in detail but still I couldn't figure out how to apply it in my problem.

Best answer:

In general, you estimate the model parameters (coefficients) from the data you have, where the model you estimate is $$\Pr(gotans_i = 1 \mid \mathbf{X}_i)=f(\beta \mathbf{X}_i),$$ and $f$ might be the logistic function. The vector $\mathbf{X}_i$ is the vector of explanatory variables for question $i$. Once you have your model estimates, you can predict (that is, compute the probability) whether a question will get an answer within 24 hours, provided you have values for all the explanatory variables you used in your model.
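For reference, the linear index and the two usual choices of $f$ can be written out as follows. This is standard textbook notation, not something taken from the data above: $k$ is the number of explanatory variables, $\beta_0$ is an intercept, and $\Phi$ is the standard normal CDF.

$$\beta \mathbf{X}_i = \beta_0 + \beta_1 x_{i1} + \dots + \beta_k x_{ik}$$

$$\text{logit: } f(z) = \frac{1}{1+e^{-z}}, \qquad \text{probit: } f(z) = \Phi(z)$$

Either choice maps the index to a value between 0 and 1, which is read as the probability of an answer within 24 hours.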

Ideally, some linear combination of those variables, plugged into, say, the logistic function, can explain your binary outcome variable $gotans$. So if you want to predict whether a question will get an answer within 24 hours, you should select explanatory variables that explain the outcome well. In your example, the number of category followers might be a strong predictor. In a first run, you could include all variables and see what happens. You could also create interactions and include those.
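As a rough illustration of turning one question record into explanatory variables, here is a minimal Python sketch (Python rather than Stata, only so that it is easy to run). The field names come from the question. The function name `build_features`, the particular features, the log transform, and the example record are all assumptions made up for illustration, not a recommendation for this data.

```python
import math

def build_features(question):
    """Turn one question record into a numeric feature vector.

    The choice of features is hypothetical; q_id is left out because
    an identifier carries no predictive information.
    """
    log_main = math.log1p(question["q_mc_f"])          # main category followers
    log_related = math.log1p(sum(question["q_rc_f"]))  # total related followers
    n_related = len(question["q_rc"])                  # number of related categories
    n_words = len(question["q_text"].split())          # crude measure of question length
    return [
        1.0,                     # intercept
        log_main,
        log_related,
        float(n_related),
        float(n_words),
        log_main * log_related,  # interaction between the two follower counts
    ]

# Made-up example record, for illustration only.
example = {
    "q_id": 1,
    "q_text": "How do I fit a probit model?",
    "q_mc": "statistics",
    "q_mc_f": 1200,
    "q_rc": ["regression", "stata"],
    "q_rc_f": [800, 150],
}
features = build_features(example)
```

The categorical field `q_mc` is not used here. Including it would require one indicator (dummy) variable per category, which in Stata is what the `i.` prefix does.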

To avoid overfitting, use only a randomly selected subsample of your data to estimate the model, and then use the remaining data to test the predictions made by your model (using some criterion, such as the share of correct predictions). This is a holdout (train/test) split; repeating it over several different splits and averaging the results is called cross-validation. It might turn out that a model with fewer rather than all variables predicts better. This procedure can also be used to decide whether to use logit or probit (or OLS, i.e. a linear probability model).
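The estimate-on-one-part, test-on-the-rest idea can be sketched in plain Python as follows. This is only a sketch under stated assumptions. The data are synthetic stand-ins generated in the script, not real questions. The fit uses simple gradient ascent rather than the maximum-likelihood routine a statistics package would use. All function and variable names are made up for illustration.

```python
import math
import random

def sigmoid(z):
    """Logistic function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    ez = math.exp(z)
    return ez / (1.0 + ez)

def fit_logit(X, y, lr=0.5, epochs=300):
    """Estimate logit coefficients by gradient ascent on the average log-likelihood."""
    beta = [0.0] * len(X[0])
    for _ in range(epochs):
        grad = [0.0] * len(beta)
        for xi, yi in zip(X, y):
            err = yi - sigmoid(sum(b * v for b, v in zip(beta, xi)))
            for j, v in enumerate(xi):
                grad[j] += err * v
        beta = [b + lr * g / len(X) for b, g in zip(beta, grad)]
    return beta

def accuracy(beta, X, y):
    """Share of observations whose predicted class (cutoff 0.5) matches the outcome."""
    hits = 0
    for xi, yi in zip(X, y):
        p = sigmoid(sum(b * v for b, v in zip(beta, xi)))
        hits += int((p >= 0.5) == bool(yi))
    return hits / len(y)

# Synthetic stand-in data (assumption): x1 drives the outcome, x2 is pure noise.
rng = random.Random(0)
X, y = [], []
for _ in range(400):
    x1, x2 = rng.gauss(0, 1), rng.gauss(0, 1)
    X.append([1.0, x1, x2])  # leading 1.0 is the intercept
    y.append(1 if rng.random() < sigmoid(-0.5 + 2.0 * x1) else 0)

# Random 70/30 split: estimate on the first part, test on the held-out rest.
idx = list(range(len(X)))
rng.shuffle(idx)
cut = int(0.7 * len(idx))
train_idx, test_idx = idx[:cut], idx[cut:]
X_train, y_train = [X[i] for i in train_idx], [y[i] for i in train_idx]
X_test, y_test = [X[i] for i in test_idx], [y[i] for i in test_idx]

beta = fit_logit(X_train, y_train)
test_acc = accuracy(beta, X_test, y_test)
print("coefficients:", beta)
print("held-out accuracy:", test_acc)
```

To compare specifications, the same split can be reused: fit each candidate model on the training part and keep the one with the better held-out score.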

Quick note on how to do this in Stata. After you have imported your data, use

logit gotans x1 x2 x3 x4 ..

to estimate the model (alternatively, use command probit). To include interactions, use

logit gotans x3 x4 c.x1##c.x2

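To predict (compute the probability of) getting an answer for each of your observations, use (after running the logit command)

predict p_gotans

Here p_gotans is just an example name; predict needs a new variable name and will not overwrite the existing gotans. To predict the probability for specific values, just create an observation with those values and run predict again with another new variable name.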
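What the prediction step computes for one new observation can be sketched in a few lines of Python. The coefficient values and the new observation are hypothetical numbers chosen for illustration; in practice the coefficients would come from the estimated model.

```python
import math

def logistic(z):
    """Logistic function f(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical logit estimates: intercept and one slope coefficient.
beta = [-1.0, 0.8]

# New observation: 1.0 for the intercept, then the value of x1.
x_new = [1.0, 1.25]

# Linear index beta * X, then the predicted probability of an answer.
z = sum(b * v for b, v in zip(beta, x_new))
p_answer = logistic(z)
print(p_answer)
```

With these made-up numbers the linear index is zero, so the predicted probability is one half. Larger values of x1 push it above one half because the slope is positive.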

Final note: you do not "need" to use a binary regression model like logit or probit. You can also try machine learning techniques like random forests to predict. My guess is that they will predict better than the models discussed above, but they are less intuitive. As far as I know, however, you cannot do machine learning in Stata.