Determining Viability of Chi-squared for Feature Selection


How does one determine how likely it is that the chi-squared test is correctly identifying features during feature selection on categorical data? To summarize the rest of this post: the chi-squared results don't appear to follow a discernible pattern, so how can one be sure the test is not eliminating important features or keeping spurious ones? Is there a further test one can run?

There seem to be a number of heuristics surrounding the chi-squared test. For example, observed and expected values should be greater than or equal to 5. Also, the number of samples should be greater than 13, but not too large. I could not find a documented upper bound, but a little experimentation indicates that the result shifts considerably (up or down) at a sample size of 100, at least with my sample data. I also notice that many examples standardize the data prior to feature selection.
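For reference, the "at least 5" heuristic is usually stated in terms of the expected cell counts of a contingency table, where each expected count is the row total times the column total divided by the grand total. A minimal sketch of that check, in plain Python; the table and the helper names `expected_counts` and `meets_min_expected` are hypothetical and only for illustration:

```python
def expected_counts(observed):
    """Expected cell counts under independence: row_total * col_total / N."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    return [[r * c / n for c in col_totals] for r in row_totals]


def meets_min_expected(observed, minimum=5.0):
    """True when every expected cell count is at least `minimum`."""
    return all(e >= minimum for row in expected_counts(observed) for e in row)


# Hypothetical 2x3 contingency table (rows: feature level, columns: class).
table = [[12, 9, 14],
         [8, 11, 6]]

print(expected_counts(table))
print(meets_min_expected(table))
```

Note that this check only makes sense on counts of categories, not on raw continuous measurements.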

To better understand, I generated my own data as follows:

|A  |B  |C     |D    |E        |F  |
|---|---|------|-----|---------|---|
|30 |3  |-0.967|-98.5|37207417 |32 |
|730|203|6.829 |213.5|731059944|731|
|319|121|4.652 |116.5|312290199|319|
|492|265|5.396 |434.5|495552794|494|
|417|90 |3.515 |50.5 |412029676|420|

Feature A matches the response, F, apart from a small random difference. Feature B is unrelated to F. Feature C matches F, but at one hundredth of the magnitude and with a random difference of up to 30%. Feature D matches F, but at half the magnitude and with a random error of up to 100%. Feature E matches F, but scaled up massively and with a random error of 30%.
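A rough sketch of a generator in this spirit is shown here in Python. The noise models (uniform, multiplicative), the value ranges, the scale factor for E, and the name `make_rows` are all assumptions made for illustration. The actual generator evidently differs, since the first table contains negative values for C and D, which this sketch cannot produce.

```python
import random


def make_rows(n_rows, seed=0):
    """Generate rows (A, B, C, D, E, F) loosely following the description.

    All distributions and ranges here are assumptions; the post only
    gives rough magnitudes and error percentages.
    """
    rng = random.Random(seed)
    rows = []
    for _ in range(n_rows):
        f = rng.randint(1, 1000)                          # response
        a = f + rng.randint(-10, 10)                      # F plus a small random difference
        b = rng.randint(1, 500)                           # unrelated to F
        c = f / 100 * (1 + rng.uniform(-0.3, 0.3))        # 1/100 of F, up to 30% error
        d = f / 2 * (1 + rng.uniform(-1.0, 1.0))          # half of F, up to 100% error
        e = f * 1_000_000 * (1 + rng.uniform(-0.3, 0.3))  # heavily scaled, up to 30% error
        rows.append((a, b, round(c, 3), round(d, 1), round(e), f))
    return rows


for row in make_rows(5):
    print(row)
```

Fixing the seed makes a run repeatable, and changing it gives a fresh sample from the same process, which is useful for comparing selections across regenerated data.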

Without scaling, I generated chi-squared values and probabilities. A, D and E showed dependence on F with probabilities upwards of 0.9. D had a probability of 1!

I then standardized and scaled all the values between 0 and 1. All the features showed dependence with a probability near 1. This made some intuitive sense because I violated the heuristic of feature magnitudes >= 5.

Finally, I standardized, scaled, and multiplied by 100 so that the features are more reasonably sized. Now only features A and E show dependence.
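One property that may bear on these three runs: the Pearson statistic, the sum of (O - E)^2 / E over all cells, is not scale-invariant. If every cell is multiplied by a constant k, the statistic is multiplied by k as well, while the degrees of freedom stay the same. A minimal sketch in plain Python, assuming the values are treated as counts in a contingency table; the table and the name `pearson_chi2` are hypothetical:

```python
def pearson_chi2(observed):
    """Pearson statistic sum((O - E)^2 / E) for a contingency table."""
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / n  # expected count under independence
            stat += (o - e) ** 2 / e
    return stat


# Hypothetical 2x2 table, rescaled by a few constants.
table = [[30, 10],
         [20, 40]]

for k in (0.01, 1, 100):
    scaled = [[k * o for o in row] for row in table]
    print(k, pearson_chi2(scaled))
```

Because the statistic grows linearly with k, the same proportions can look strongly dependent or not dependent at all depending only on the units, which is consistent with the results changing between the 0 to 1 and 0 to 100 runs.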

I then regenerated the data with the same algorithm as follows:

|A  |B  |C    |D    |E        |F  |
|---|---|-----|-----|---------|---|
|426|498|4.982|165.0|427891218|426|
|378|89 |2.565|234.0|370449414|378|
|264|436|1.579|50.5 |253152641|256|
|854|219|8.429|511.5|847785255|862|
|443|260|3.507|437.0|441054258|435|

With the raw values, A, C and E show dependence upwards of 90%. This differs from the first run, where A, D and E showed dependence. The run standardized and scaled between 0 and 1 again showed high dependence across all features. Finally, the run standardized and scaled between 0 and 100 shows A and E with high dependence.

I realize this is a long question, but again, how does one know whether the results of the chi-squared tests are at all reliable? Further, is my notion of standardizing and scaling to 100 a reasonable approach?

Thanks in Advance!

P.S. I realize my sample data violates the heuristic of 13 samples, but I wanted to keep the tables simpler for this post.