I am an undergraduate student working on some projects using probit regression. I have a question about dummy variables that I was hoping someone could help me with (I think it stems from an incomplete understanding of the theory).
I am using SAS to fit a probit model (and also logit models) for a binary dependent variable. The first explanatory variable I added to the model turned out to be significant (p-value < 0.05), as I expected from my intuition about the hypothesis. I then added a set of 25 binary dummy variables that together represent a single qualitative variable with 25 possible values. (Example: 25 ice cream flavors; 1st dummy variable: chocolate (y/n)? 2nd: mint (y/n)? Note that there are no combinations: you can't have chocolate AND mint, only chocolate OR mint.)
When I ran the model with my first explanatory variable and all the dummy variables, not a single dummy variable came up significant. I then re-ran the model, but instead of including all the dummy variables, I included the ones I thought should have been significant (this was 3 of them). When I ran the model with my first parameter and the 3 dummy variables, they were all significant!
I don't understand why I am getting this behavior, or what it means for the model. Is it bad practice to include only the 3 dummy variables because I know they are significant? Should I try every possible combination of the 25 variables to see which has the most significant variables (which may be computationally impossible)? Should I consider none of them significant because of the initial run?
(By the way, I have ~10,000 observations in my sample.)
With dummy variables, everything is relative to the omitted group. So let's say that in the first regression you included the dummies for categories 1-24. If none are significant, it suggests that you cannot tell the difference between those categories and the omitted 25th category. This could happen, for example, because the implied estimate for the 25th category is very noisy, so no comparison against it can be estimated precisely.
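To make the "noisy omitted group" point concrete, here is a minimal sketch under simplifying assumptions: a logit rather than a probit, no other explanatory variable, and made-up counts. In that special case the coefficient on a category dummy is just the difference in log-odds between that category and the omitted one, with the usual log-odds-ratio standard error. The function name `dummy_coefficient` and all the counts are hypothetical.

```python
import math

def dummy_coefficient(cell, reference):
    """Logit coefficient, standard error and Wald z for one category dummy.

    Assumes a logit model with an intercept and category dummies only.
    Each argument is a (successes, failures) pair of counts.
    """
    a, b = cell
    c, d = reference
    coef = math.log(a / b) - math.log(c / d)   # difference in log-odds
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return coef, se, coef / se

# Hypothetical counts: same success rates, different omitted-group sizes.
chocolate = (300, 100)          # 400 observations, 75% success
small_reference = (6, 4)        # 10 observations, 60% success
large_reference = (600, 400)    # 1000 observations, 60% success

coef_small, se_small, z_small = dummy_coefficient(chocolate, small_reference)
coef_large, se_large, z_large = dummy_coefficient(chocolate, large_reference)

print(f"small omitted group: coef={coef_small:.3f} se={se_small:.3f} z={z_small:.2f}")
print(f"large omitted group: coef={coef_large:.3f} se={se_large:.3f} z={z_large:.2f}")
```

The estimated difference is identical in both cases; only the precision changes. With a tiny omitted group, every dummy inherits that group's noise, which is one way to end up with 24 insignificant dummies.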
In your second regression, your omitted category is now "everything that is not one of the three main values". This is obviously a very different reference group.
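A rough illustration of how much the reference group matters, again assuming a logit with category dummies only and entirely hypothetical counts (four flavors standing in for the 25). The same chocolate dummy is compared first against a single small omitted category, then against all other categories pooled, which is what happens when only the chocolate dummy is included.

```python
import math

def contrast(cell, reference):
    """Difference in log-odds between two (successes, failures) cells, with Wald z."""
    a, b = cell
    c, d = reference
    coef = math.log(a / b) - math.log(c / d)
    se = math.sqrt(1 / a + 1 / b + 1 / c + 1 / d)
    return coef, coef / se

# Hypothetical (successes, failures) counts per flavor.
cells = {
    "chocolate": (140, 60),
    "vanilla": (13, 7),
    "mint": (300, 300),
    "strawberry": (250, 250),
}

# Full set of dummies, vanilla omitted: chocolate is compared with vanilla only.
coef_vs_vanilla, z_vs_vanilla = contrast(cells["chocolate"], cells["vanilla"])

# Only the chocolate dummy included: the omitted group is every other flavor pooled.
others = [counts for name, counts in cells.items() if name != "chocolate"]
pooled = (sum(s for s, _ in others), sum(f for _, f in others))
coef_vs_pooled, z_vs_pooled = contrast(cells["chocolate"], pooled)

print(f"vs vanilla only: coef={coef_vs_vanilla:.3f} z={z_vs_vanilla:.2f}")
print(f"vs pooled rest:  coef={coef_vs_pooled:.3f} z={z_vs_pooled:.2f}")
```

The two coefficients answer different questions ("chocolate versus vanilla" and "chocolate versus everything else"), so there is no contradiction in one being insignificant and the other significant.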
In one of your comments, you said "If I included 1-24, no difference... but if I included 2-25, I got great results." But this does not make sense: the two regressions are the same model with a different omitted category, so they should give the same results. Here's how to see this. First, run the regression including 1-24 (type 25 omitted). Note the coefficient and p-value on type 2. Then run the regression with 2-25 (type 1 omitted). Now, test the difference between the coefficients on type 25 and type 2. It should be the same as the test on type 2 in the previous case.
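The check described above can be sketched numerically. This assumes a logit with category dummies only (no other covariate) and hypothetical counts for three types standing in for types 1, 2 and 25; in that case each coefficient is a difference of cell log-odds, and the variance of a cell's log-odds is 1/successes + 1/failures. In a real probit with other covariates there is no closed form, but the two fits are reparameterizations of the same model, so the same equality should hold.

```python
import math

def log_odds_and_variance(successes, failures):
    """Cell log-odds and its estimated variance for a (successes, failures) cell."""
    return math.log(successes / failures), 1 / successes + 1 / failures

# Hypothetical counts for three of the types.
l1, v1 = log_odds_and_variance(120, 80)     # type 1
l2, v2 = log_odds_and_variance(300, 100)    # type 2
l25, v25 = log_odds_and_variance(90, 110)   # type 25

# Regression A: type 25 omitted. Test the coefficient on type 2 directly.
gamma_2 = l2 - l25
z_a = gamma_2 / math.sqrt(v2 + v25)

# Regression B: type 1 omitted. Both coefficients are measured against type 1,
# so they share type 1's sampling noise: cov(delta_2, delta_25) = v1.
delta_2 = l2 - l1
delta_25 = l25 - l1
var_diff = (v2 + v1) + (v25 + v1) - 2 * v1
z_b = (delta_2 - delta_25) / math.sqrt(var_diff)

# For contrast: the raw test on type 2 in regression B is a different question.
z_b_type2_alone = delta_2 / math.sqrt(v2 + v1)

print(f"A: z for type 2 (vs 25)            = {z_a:.3f}")
print(f"B: z for type 2 minus type 25      = {z_b:.3f}")
print(f"B: z for type 2 alone (vs type 1)  = {z_b_type2_alone:.3f}")
```

The first two statistics agree because type 1's contribution cancels out of the difference. The third does not, which is why comparing individual p-values across runs with different omitted categories can look like "different results".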