Finding best predictors of a classification function


I have a large dataset where each element has a number of "input" categories that are either present or absent (or if you like, true or false, 1 or 0, etc.). Each element also has a binary "output" category.

A simplified version of this would be the following set:

Rained_yesterday,Rained_more_than_10mm_yesterday -> Raining_today
Is_odd_date,Is_summer -> Raining_today
Is_odd_date,Is_summer -> ()
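For concreteness, the toy rows above could be encoded as a 0/1 feature matrix plus an outcome vector (the names come from the example; the encoding itself is just one plausible way to load such data):

```python
# Feature columns, in order:
#   Rained_yesterday, Rained_more_than_10mm_yesterday, Is_odd_date, Is_summer
X = [
    [1, 1, 0, 0],  # Rained_yesterday, Rained_more_than_10mm_yesterday
    [0, 0, 1, 1],  # Is_odd_date, Is_summer
    [0, 0, 1, 1],  # Is_odd_date, Is_summer
]
# Output category: 1 = Raining_today, 0 = not (the "()" row)
y = [1, 1, 0]
```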

From this dataset I want to find the categories that best explain/predict the output: first the most significant one, then the next most significant given that the first has already been used, and so on. In a more realistic version of the dataset above, the outcome might be:

Rained_yesterday
!Is_summer
Rained_more_than_10mm_yesterday
Is_odd_date

Note that I also need to be able to detect the negation or absence of a category as a predictor, and "Rained_more_than_10mm_yesterday" is likely to be ranked lower because it is strongly correlated with "Rained_yesterday". Ideally I would also like to be able to show that using the top n predictors, I can account for x % of all outcomes.
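One simple way to make the "most significant first, taking earlier picks into account" idea concrete is a greedy covering heuristic: at each step, pick the literal (a feature or its negation) that correctly classifies the most rows not yet explained by earlier picks. This is an illustrative sketch of my own, not a standard named algorithm, and the scoring rule (exact agreement with the output) is an assumption:

```python
def greedy_cover(X, y, feature_names, n_steps=4):
    """Greedy forward selection over literals (feature or its negation).
    A literal 'explains' row i if its truth value on row i equals y[i].
    Returns the chosen literals and the fraction of rows explained."""
    unexplained = set(range(len(y)))
    chosen = []
    for _ in range(n_steps):
        best_name, best_covered = None, set()
        for j, name in enumerate(feature_names):
            for neg in (False, True):
                covered = {
                    i for i in unexplained
                    if ((X[i][j] == 0) if neg else (X[i][j] == 1)) == (y[i] == 1)
                }
                if len(covered) > len(best_covered):
                    best_name, best_covered = ("!" + name if neg else name), covered
        if not best_covered:
            break
        chosen.append(best_name)
        unexplained -= best_covered
    coverage = 1 - len(unexplained) / len(y)
    return chosen, coverage

feature_names = ["Rained_yesterday", "Rained_more_than_10mm_yesterday",
                 "Is_odd_date", "Is_summer"]
X = [[1, 1, 0, 0], [0, 0, 1, 1], [0, 0, 1, 1]]
y = [1, 1, 0]
chosen, coverage = greedy_cover(X, y, feature_names)
```

Because rows already explained by "Rained_yesterday" are removed before the next pick, a correlated feature like "Rained_more_than_10mm_yesterday" naturally drops in the ranking, and the coverage figure directly answers the "top n predictors account for x %" question.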

Any pointers to algorithms to use, articles to read etc to help me get started on this would be appreciated.

Accepted answer:

You can use logistic regression, which is typically used to build a multivariable model for predicting a dichotomous outcome. Independent variables in logistic regression can be continuous or categorical. Using a "stepwise" procedure, most statistical software packages can produce a model that includes only significant predictors. Looking at the odds ratio reported for each predictor, you can also see which predictors have the strongest independent impact on the outcome.
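A minimal sketch of this approach in Python on synthetic data (the feature names follow the question; the data-generating coefficients are invented for illustration). Note that scikit-learn's LogisticRegression applies an L2 penalty by default, so the exponentiated coefficients are regularized odds ratios; an unpenalized fit such as statsmodels' Logit gives the classical ones:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Synthetic binary features loosely mimicking the question's example.
rained_yesterday = rng.integers(0, 2, n)
# Correlated with rained_yesterday: >10mm implies it rained at all.
rained_10mm = rained_yesterday * rng.integers(0, 2, n)
is_odd_date = rng.integers(0, 2, n)
is_summer = rng.integers(0, 2, n)

# True outcome depends mainly on rained_yesterday (+) and is_summer (-).
logit = -1.0 + 2.0 * rained_yesterday - 1.5 * is_summer
p = 1 / (1 + np.exp(-logit))
y = rng.random(n) < p

X = np.column_stack([rained_yesterday, rained_10mm, is_odd_date, is_summer])
names = ["Rained_yesterday", "Rained_more_than_10mm_yesterday",
         "Is_odd_date", "Is_summer"]

model = LogisticRegression().fit(X, y)
odds_ratios = np.exp(model.coef_[0])

# Rank predictors by the magnitude of their (log) odds ratio.
for name, oratio in sorted(zip(names, odds_ratios),
                           key=lambda t: -abs(np.log(t[1]))):
    print(f"{name}: odds ratio = {oratio:.2f}")
```

An odds ratio above 1 indicates a positive predictor, below 1 a negative one (which covers the "absence of a category as a predictor" requirement). For the stepwise part, many statistics packages (e.g. R's step()) do it directly; in scikit-learn, sklearn.feature_selection.SequentialFeatureSelector is the closest analogue.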