How does one determine the likelihood that the chi-squared test is accurately identifying features during feature selection on categorical data? To summarize the rest of this post: the chi-squared results don't appear to follow a discernible pattern, so how can one be sure the test is not eliminating important features, or keeping spurious ones? Is there a further test one can do?
There seem to be a number of heuristics surrounding the chi-squared test. For example, observed and expected cell counts should be greater than or equal to 5. Also, the number of samples should be greater than 13, but not too large. I could not find a documented upper bound, but a little experimentation indicates the results swing considerably around a sample size of 100, at least with my sample data. I also notice that many examples standardize the data prior to feature selection.
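One way to check the expected-count heuristic directly is to look at the expected frequencies that `scipy.stats.chi2_contingency` returns for a contingency table. A minimal sketch with made-up counts (the table below is hypothetical, not from my data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical 2x3 table of observed counts
observed = np.array([[12, 5, 8],
                     [9, 14, 6]])

# chi2_contingency returns the statistic, p-value, degrees of
# freedom, and the table of expected counts under independence
stat, p, dof, expected = chi2_contingency(observed)

# the heuristic: every expected cell count should be at least 5
ok = (expected >= 5).all()
```

If `ok` is False, the asymptotic chi-squared approximation is suspect for that table.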
To better understand, I generated my own data as follows:
| A | B | C | D | E | F |
|---|---|---|---|---|---|
| 30 | 3 | -0.967 | -98.5 | 37207417 | 32 |
| 730 | 203 | 6.829 | 213.5 | 731059944 | 731 |
| 319 | 121 | 4.652 | 116.5 | 312290199 | 319 |
| 492 | 265 | 5.396 | 434.5 | 495552794 | 494 |
| 417 | 90 | 3.515 | 50.5 | 412029676 | 420 |
Feature A matches the response, F, though with a small random difference. Feature B does not match F. Feature C matches F, but at 1/100th the magnitude and with a random difference of up to 30%. Feature D matches F, but at half the magnitude and with a random error of up to 100%. E matches F, though massively scaled and with a random error of 30%.
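Since the post doesn't include the generation code, here is a hedged reconstruction of how such features might be produced. The exact noise model (additive offset for A, multiplicative error for C, D, and E), the ranges, and the seed are all my assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5  # kept small to match the post's tables

F = rng.integers(1, 1000, size=n)                           # response
A = F + rng.integers(-5, 6, size=n)                         # F plus a small random offset
B = rng.integers(0, 500, size=n)                            # unrelated to F
C = (F / 100) * (1 + rng.uniform(-0.3, 0.3, size=n))        # 1/100th magnitude, +/-30% error
D = (F / 2) * (1 + rng.uniform(-1.0, 1.0, size=n))          # half magnitude, up to 100% error
E = (F * 1_000_000) * (1 + rng.uniform(-0.3, 0.3, size=n))  # massively scaled, +/-30% error

df = pd.DataFrame({"A": A, "B": B, "C": C, "D": D, "E": E, "F": F})
```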
Without scaling, I generated chi-squared values and probabilities. A, D, and E showed dependence on F with probabilities upwards of 0.9; D had a probability of 1!
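A sketch of how such a run might look with scikit-learn's `chi2`, using the five rows from the table above. Two assumptions, since the post doesn't say which tool was used: scikit-learn's `chi2` requires non-negative features, so I clip the negative C and D entries to zero, and I read "probability of dependence" as `1 - p_value`:

```python
import numpy as np
from sklearn.feature_selection import chi2

# the five rows from the first table in the post
X = np.array([
    [30,  3,   -0.967, -98.5, 37207417],
    [730, 203,  6.829, 213.5, 731059944],
    [319, 121,  4.652, 116.5, 312290199],
    [492, 265,  5.396, 434.5, 495552794],
    [417, 90,   3.515,  50.5, 412029676],
])
y = np.array([32, 731, 319, 494, 420])  # response F, treated as class labels

# chi2 requires non-negative features; clipping is an assumption
# about how the original run handled the negative C and D values
X_nonneg = np.clip(X, 0, None)
chi2_stats, p_values = chi2(X_nonneg, y)
dependence = 1 - p_values  # reading "probability of dependence" as 1 - p
```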
I then standardized the data and scaled all the values between 0 and 1. All the features showed dependence with probabilities near 1. This made some intuitive sense, because I had violated the heuristic that values should be >= 5.
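A sketch of this step, assuming "standardized and scaled between 0 and 1" means `StandardScaler` followed by `MinMaxScaler` (that reading is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import chi2

# the five rows from the first table in the post
X = np.array([
    [30,  3,   -0.967, -98.5, 37207417],
    [730, 203,  6.829, 213.5, 731059944],
    [319, 121,  4.652, 116.5, 312290199],
    [492, 265,  5.396, 434.5, 495552794],
    [417, 90,   3.515,  50.5, 412029676],
])
y = np.array([32, 731, 319, 494, 420])

# standardize each column, then rescale into [0, 1]
# so every feature is non-negative (as chi2 requires)
X01 = MinMaxScaler().fit_transform(StandardScaler().fit_transform(X))
chi2_stats, p_values = chi2(X01, y)
```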
I then finally standardized, scaled and multiplied by 100 so the features are more reasonably sized. Now only features A and E show dependence.
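The same pipeline can express this run by passing `feature_range=(0, 100)` to `MinMaxScaler`; again, reading "multiplied by 100" this way is an assumption:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import chi2

# the five rows from the first table in the post
X = np.array([
    [30,  3,   -0.967, -98.5, 37207417],
    [730, 203,  6.829, 213.5, 731059944],
    [319, 121,  4.652, 116.5, 312290199],
    [492, 265,  5.396, 434.5, 495552794],
    [417, 90,   3.515,  50.5, 412029676],
])
y = np.array([32, 731, 319, 494, 420])

# standardize, then rescale into [0, 100] so the values
# comfortably clear the >= 5 magnitude heuristic
X100 = MinMaxScaler(feature_range=(0, 100)).fit_transform(
    StandardScaler().fit_transform(X))
chi2_stats, p_values = chi2(X100, y)
```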
I then regenerated the data with the same algorithm as follows:

| A | B | C | D | E | F |
|---|---|---|---|---|---|
| 426 | 498 | 4.982 | 165.0 | 427891218 | 426 |
| 378 | 89 | 2.565 | 234.0 | 370449414 | 378 |
| 264 | 436 | 1.579 | 50.5 | 253152641 | 256 |
| 854 | 219 | 8.429 | 511.5 | 847785255 | 862 |
| 443 | 260 | 3.507 | 437.0 | 441054258 | 435 |
With the raw values, A, C, and E show dependence upwards of 90%. This differs from the first run, where A, D, and E showed dependence. The run standardized and scaled between 0 and 1 again showed high dependence across all features. Finally, the run standardized and scaled between 0 and 100 shows A and E having high dependence.
I realize this is a long question, but again, how does one know whether the results of the chi-squared tests are at all reliable? Further, is my notion of standardizing and scaling to 100 a reasonable approach?
Thanks in Advance!
P.S. I realize my sample data violates the heuristic of 13 samples, but I wanted to keep the tables simpler for this post.