I have a large dataset in which each element has a number of "input" categories that are either present or absent (or, if you like, true or false, 1 or 0, etc.). Each element also has an output category, which is again binary.
A simplified version of this would be the following set:
Rained_yesterday,Rained_more_than_10mm_yesterday -> Raining_today
Is_odd_date,Is_summer -> Raining_today
Is_odd_date,Is_summer -> ()
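To make the format concrete, here is a small Python sketch of how records like these could be turned into 0/1 columns. The function name `parse_record` and the reading of `()` as "output absent" are assumptions made for illustration only.

```python
def parse_record(line):
    """Split 'A,B -> C' into (set of input categories, output present?)."""
    left, _, right = line.partition("->")
    inputs = {tok.strip() for tok in left.split(",") if tok.strip()}
    # Assumption: an empty right-hand side or "()" means the output is absent.
    outcome = right.strip() not in ("", "()")
    return inputs, outcome

lines = [
    "Rained_yesterday,Rained_more_than_10mm_yesterday -> Raining_today",
    "Is_odd_date,Is_summer -> Raining_today",
    "Is_odd_date,Is_summer -> ()",
]
records = [parse_record(line) for line in lines]

# One column per category seen anywhere in the data, in sorted order.
categories = sorted(set().union(*(inputs for inputs, _ in records)))
X = [[int(c in inputs) for c in categories] for inputs, _ in records]
y = [int(outcome) for _, outcome in records]
```

Each row of `X` then holds one 0/1 flag per category, and `y` holds the 0/1 output.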
From this dataset I want to find the categories that best explain/predict the output, starting with the most significant one and then continuing with the next most significant, taking into account that the first one has already been used. For example, with a more realistic version of the dataset above, the result might be:
Rained_yesterday
!Is_summer
Rained_more_than_10mm_yesterday
Is_odd_date
Note that I need to be able to also detect the negation or absence of a category as a predictor, and "Rained_more_than_10mm_yesterday" is likely to be ranked lower as it is strongly correlated with "Rained_yesterday". Ideally I would also like to be able to show that using the top n predictors, I can account for x % of all decisions.
Any pointers to algorithms to use, articles to read etc to help me get started on this would be appreciated.
You can use logistic regression, which is typically used to build a multivariable model for predicting a dichotomous (binary) outcome. The independent variables in logistic regression can be continuous or categorical. Most statistical software packages offer a "stepwise" procedure that adds or removes predictors one at a time and returns a model containing only the statistically significant ones. The odds ratio reported for each predictor then shows how strongly that predictor is associated with the outcome independently of the others in the model; an odds ratio below 1 means that the presence of the predictor makes the outcome less likely.
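As a rough illustration of the stepwise idea, here is a minimal pure-Python sketch of forward selection with logistic regression. It is not what any particular statistics package does internally. The toy data, the function names (`solve`, `prob`, `design`, `fit_logistic`, `forward_stepwise`), the Newton's-method fit and the entry threshold are all assumptions made for the example. The threshold is a log-likelihood gain of 1.92, which is half of 3.84, the chi-square cutoff for p < 0.05 with one degree of freedom.

```python
import math

def solve(A, b):
    """Solve the linear system A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def prob(w, x):
    """Logistic function of the linear predictor, clamped away from 0 and 1."""
    p = 1.0 / (1.0 + math.exp(-sum(wj * xj for wj, xj in zip(w, x))))
    return min(max(p, 1e-12), 1.0 - 1e-12)

def design(rows, cols):
    """Design matrix: an intercept column followed by the selected 0/1 columns."""
    return [[1.0] + [float(r[c]) for c in cols] for r in rows]

def fit_logistic(rows, y, cols, iters=25, ridge=1e-6):
    """Fit by Newton's method; returns (coefficients, log-likelihood)."""
    X = design(rows, cols)
    d = len(cols) + 1
    w = [0.0] * d
    for _ in range(iters):
        # Gradient and negated Hessian of the log-likelihood; the tiny ridge
        # term only keeps the linear solve numerically stable.
        grad = [-ridge * wj for wj in w]
        hess = [[ridge * (i == j) for j in range(d)] for i in range(d)]
        for x, t in zip(X, y):
            p = prob(w, x)
            for i in range(d):
                grad[i] += (t - p) * x[i]
                for j in range(d):
                    hess[i][j] += p * (1.0 - p) * x[i] * x[j]
        w = [wj + sj for wj, sj in zip(w, solve(hess, grad))]
    ll = sum(math.log(prob(w, x) if t else 1.0 - prob(w, x)) for x, t in zip(X, y))
    return w, ll

def forward_stepwise(rows, y, names, min_gain=1.92):
    """Greedily add the column with the largest log-likelihood gain
    until no remaining column improves the fit by at least min_gain."""
    chosen, report = [], []
    remaining = list(range(len(names)))
    _, best_ll = fit_logistic(rows, y, chosen)  # intercept-only model
    while remaining:
        fits = [(fit_logistic(rows, y, chosen + [c]), c) for c in remaining]
        (w, ll), c = max(fits, key=lambda f: f[0][1])
        if ll - best_ll < min_gain:
            break
        chosen.append(c)
        remaining.remove(c)
        best_ll = ll
        hits = sum((prob(w, x) > 0.5) == bool(t)
                   for x, t in zip(design(rows, chosen), y))
        # A negative coefficient means the absence of the category predicts the outcome.
        label = names[c] if w[-1] > 0 else "!" + names[c]
        report.append((label, math.exp(w[-1]), hits / len(y)))
    return report

# Made-up toy data: rainy days out of 10, keyed by (rained_yesterday, is_summer).
names = ["Rained_yesterday", "Rained_more_than_10mm_yesterday", "Is_summer", "Is_odd_date"]
rain_in_10 = {(1, 0): 9, (1, 1): 6, (0, 0): 4, (0, 1): 1}
rows, y = [], []
for (ry, summer), k in rain_in_10.items():
    for heavy in ((0, 1) if ry else (0,)):  # heavy rain implies rain yesterday
        for odd in (0, 1):
            rows += [[ry, heavy, summer, odd]] * 10
            y += [1] * k + [0] * (10 - k)

report = forward_stepwise(rows, y, names)
for label, odds_ratio, accuracy in report:
    print(f"{label}: odds ratio {odds_ratio:.2f}, accuracy so far {accuracy:.0%}")
```

On this toy data the sketch selects Rained_yesterday first and !Is_summer second. It leaves out Rained_more_than_10mm_yesterday, which looks predictive on its own but adds nothing once Rained_yesterday is in the model, and it leaves out Is_odd_date, which carries no information. The accuracy column is one simple way to express "the top n predictors account for x% of decisions". It does not have to rise at every step even when the likelihood improves. For real work, a statistics package's own logistic regression and selection routines would be the better choice.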