How does one determine the likelihood that the chi-squared test is accurately identifying features during feature selection on categorical data? To summarize the rest of this post: the chi-squared results don't appear to follow a discernible pattern, so how can one be sure the test is not eliminating important features, or keeping spurious ones? Is there a further test one can do?
There seem to be a number of heuristics surrounding the chi-squared test. For example, observed and expected cell counts should be greater than or equal to 5. Also, the number of samples should be greater than 13, but not too large. I could not find a documented upper bound, but a little experimentation indicates the results swing considerably around a sample size of 100, at least with my sample data. I also notice that many examples standardize the data prior to feature selection.
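One way to check the expected-count heuristic directly is to look at the expected frequencies that `scipy.stats.chi2_contingency` returns for a contingency table. A minimal sketch with made-up counts (the table below is hypothetical, not from my data):

```python
import numpy as np
from scipy.stats import chi2_contingency

# hypothetical 2x3 table of observed counts
observed = np.array([[12, 5, 8],
                     [9, 14, 6]])

# chi2_contingency returns the statistic, p-value, degrees of
# freedom, and the table of expected counts under independence
stat, p, dof, expected = chi2_contingency(observed)

# the heuristic: every expected cell count should be at least 5
ok = (expected >= 5).all()
```

If `ok` is False, the asymptotic chi-squared approximation is suspect for that table.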
To better understand, I generated my own data as follows:
| A | B | C | D | E | F |
|---|---|---|---|---|---|
| 30 | 3 | -0.967 | -98.5 | 37207417 | 32 |
| 730 | 203 | 6.829 | 213.5 | 731059944 | 731 |
| 319 | 121 | 4.652 | 116.5 | 312290199 | 319 |
| 492 | 265 | 5.396 | 434.5 | 495552794 | 494 |
| 417 | 90 | 3.515 | 50.5 | 412029676 | 420 |
Feature A matches the response, F, though with a small random difference. Feature B does not match F. Feature C matches F, but at 1/100th the magnitude and with a random difference of up to 30%. Feature D matches F, but at half the magnitude and with a random error of up to 100%. E matches F, though massively scaled and with a random error of 30%.
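Since the post doesn't include the generation code, here is a hedged reconstruction of how such features might be produced. The exact noise model (additive offset for A, multiplicative error for C, D, and E), the ranges, and the seed are all my assumptions:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 5  # kept small to match the post's tables

F = rng.integers(1, 1000, size=n)                           # response
A = F + rng.integers(-5, 6, size=n)                         # F plus a small random offset
B = rng.integers(0, 500, size=n)                            # unrelated to F
C = (F / 100) * (1 + rng.uniform(-0.3, 0.3, size=n))        # 1/100th magnitude, +/-30% error
D = (F / 2) * (1 + rng.uniform(-1.0, 1.0, size=n))          # half magnitude, up to 100% error
E = (F * 1_000_000) * (1 + rng.uniform(-0.3, 0.3, size=n))  # massively scaled, +/-30% error

df = pd.DataFrame({"A": A, "B": B, "C": C, "D": D, "E": E, "F": F})
```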
Without scaling, I generated chi-squared values and probabilities. A, D, and E showed dependence on F with probabilities upwards of 0.9; D had a probability of 1!
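A sketch of how such a run might look with scikit-learn's `chi2`, using the five rows from the table above. Two assumptions, since the post doesn't say which tool was used: scikit-learn's `chi2` requires non-negative features, so I clip the negative C and D entries to zero, and I read "probability of dependence" as `1 - p_value`:

```python
import numpy as np
from sklearn.feature_selection import chi2

# the five rows from the first table in the post
X = np.array([
    [30,  3,   -0.967, -98.5, 37207417],
    [730, 203,  6.829, 213.5, 731059944],
    [319, 121,  4.652, 116.5, 312290199],
    [492, 265,  5.396, 434.5, 495552794],
    [417, 90,   3.515,  50.5, 412029676],
])
y = np.array([32, 731, 319, 494, 420])  # response F, treated as class labels

# chi2 requires non-negative features; clipping is an assumption
# about how the original run handled the negative C and D values
X_nonneg = np.clip(X, 0, None)
chi2_stats, p_values = chi2(X_nonneg, y)
dependence = 1 - p_values  # reading "probability of dependence" as 1 - p
```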
I then standardized the data and scaled all the values between 0 and 1. All the features showed dependence with probabilities near 1. This made some intuitive sense, because I had violated the heuristic that values should be >= 5.
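A sketch of this step, assuming "standardized and scaled between 0 and 1" means `StandardScaler` followed by `MinMaxScaler` (that reading is an assumption):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import chi2

# the five rows from the first table in the post
X = np.array([
    [30,  3,   -0.967, -98.5, 37207417],
    [730, 203,  6.829, 213.5, 731059944],
    [319, 121,  4.652, 116.5, 312290199],
    [492, 265,  5.396, 434.5, 495552794],
    [417, 90,   3.515,  50.5, 412029676],
])
y = np.array([32, 731, 319, 494, 420])

# standardize each column, then rescale into [0, 1]
# so every feature is non-negative (as chi2 requires)
X01 = MinMaxScaler().fit_transform(StandardScaler().fit_transform(X))
chi2_stats, p_values = chi2(X01, y)
```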
I then finally standardized, scaled and multiplied by 100 so the features are more reasonably sized. Now only features A and E show dependence.
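The same pipeline can express this run by passing `feature_range=(0, 100)` to `MinMaxScaler`; again, reading "multiplied by 100" this way is an assumption:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler
from sklearn.feature_selection import chi2

# the five rows from the first table in the post
X = np.array([
    [30,  3,   -0.967, -98.5, 37207417],
    [730, 203,  6.829, 213.5, 731059944],
    [319, 121,  4.652, 116.5, 312290199],
    [492, 265,  5.396, 434.5, 495552794],
    [417, 90,   3.515,  50.5, 412029676],
])
y = np.array([32, 731, 319, 494, 420])

# standardize, then rescale into [0, 100] so the values
# comfortably clear the >= 5 magnitude heuristic
X100 = MinMaxScaler(feature_range=(0, 100)).fit_transform(
    StandardScaler().fit_transform(X))
chi2_stats, p_values = chi2(X100, y)
```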
I then regenerated the data with the same algorithm as follows:

| A | B | C | D | E | F |
|---|---|---|---|---|---|
| 426 | 498 | 4.982 | 165.0 | 427891218 | 426 |
| 378 | 89 | 2.565 | 234.0 | 370449414 | 378 |
| 264 | 436 | 1.579 | 50.5 | 253152641 | 256 |
| 854 | 219 | 8.429 | 511.5 | 847785255 | 862 |
| 443 | 260 | 3.507 | 437.0 | 441054258 | 435 |
With the raw values, A, C, and E show dependence upwards of 90%. This differs from the first run, where A, D, and E showed dependence. The run standardized and scaled between 0 and 1 again showed high dependence across all features. Finally, the run standardized and scaled between 0 and 100 shows A and E having high dependence.
I realize this is a long question, but again, how does one know whether the results of the chi-squared tests are at all reliable? Further, is my notion of standardizing and scaling to 100 a reasonable approach?
Thanks in Advance!
P.S. I realize my sample data violates the heuristic of 13 samples, but I wanted to keep the tables simpler for this post.