Equally Distributed Data Set Measurement

I will be creating my own dataset with scores ranging from 50.00 to 100.00. How can I show that the dataset I chose is equally distributed and unbiased? Is there a formula to check this?

Your question leaves room for some interpretation. Here is my interpretation. If it is not what you had in mind, please revise your question to be more informative, and maybe someone else will give an answer you find more useful.

If the population consists of the $5001$ numbers $50.00, 50.01, \dots, 99.99, 100.00,$ and you select a sample of size $n=20$ with replacement, then the sample should be difficult to distinguish from a random sample of size twenty from the distribution $\mathsf{Unif}(50,100).$ [Computations and sampling in R.]

s = seq(50, 100, by=0.01)
head(s);  length(s)
[1] 50.00 50.01 50.02 50.03 50.04 50.05  # first 6 pop values
[1] 5001  # population size

set.seed(123)                      # for reproducibility
x = sample(s, 20, replace=TRUE)    # sample of size 20, with replacement

At the 5% level, a Kolmogorov-Smirnov goodness-of-fit test does not reject the null hypothesis that the sample of size $n=20$ is from the distribution $\mathsf{Unif}(50,100):$ the P-value of the test is $0.4606 > 0.05.$

ks.test(x, punif, 50, 100)

        One-sample Kolmogorov-Smirnov test

data:  x
D = 0.183, p-value = 0.4606
alternative hypothesis: two-sided

With samples as small as $n = 20,$ it is difficult to know what the population might be, but the K-S test finds no evidence against the hypothesis that the sample came from this uniform distribution.

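par(mfrow=c(1,2))                             # two panels side by side
 hist(x, prob=T, col="skyblue2")              # left: density histogram
  rug(x)                                      # tick marks at the sample values
 plot(ecdf(x))                                # right: ECDF of the sample (black)
  curve(punif(x,50,100), add=T, col="blue")   # CDF of Unif(50,100) (blue)
par(mfrow=c(1,1))                             # restore single-panel layout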

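[Figure: histogram of the sample with rug marks (left panel); ECDF of the sample in black with the CDF of $\mathsf{Unif}(50,100)$ in blue (right panel).]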

The K-S test statistic $D = 0.183$ is the maximum vertical distance between the CDF (blue) of $\mathsf{Unif}(50,100)$ and the ECDF (black) of the sample of 20. [Right-hand panel.] To make the empirical CDF (ECDF) of a sample: sort the sample; begin at height $0$ on the left, jump up by $1/n$ at each sample value, and end at height $1$ on the right.
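
As a rough sketch of how $D$ can be obtained directly from this description, the code below evaluates the uniform CDF at the sorted sample values and compares it with the ECDF height just before and just after each jump. The helper name `ks_by_hand` is hypothetical (it is not part of R), and because the default behavior of `sample` differs across R versions, the sample drawn here, and hence the value of $D$, may not match the output shown above. The point is only that the hand computation should agree with `ks.test` on the same data.

```r
# Sketch: compute the K-S statistic D by hand and compare with ks.test().
# ks_by_hand is a hypothetical helper name, not a base R function.
ks_by_hand <- function(x, lo = 50, hi = 100) {
  n  <- length(x)
  Fx <- punif(sort(x), lo, hi)       # uniform CDF at the sorted sample values
  i  <- seq_len(n)
  d_above <- max(i / n - Fx)         # ECDF just after each jump, minus CDF
  d_below <- max(Fx - (i - 1) / n)   # CDF minus ECDF just before each jump
  max(d_above, d_below)              # largest vertical gap in either direction
}

set.seed(123)
s <- seq(50, 100, by = 0.01)
x <- sample(s, 20, replace = TRUE)

D_hand <- ks_by_hand(x)
# suppressWarnings() hides the tie warning that can occur with discrete data
D_test <- unname(suppressWarnings(ks.test(x, punif, 50, 100))$statistic)
all.equal(D_hand, D_test)
```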

Many goodness-of-fit tests are possible, but you should use only one of them in a practical situation. Another test is to count the frequencies (3, 3, 6, 2, 6) in the five histogram bins. For a uniform distribution we would expect $E = 4$ counts on average in each bin. A chi-squared test finds that the disagreement between the observed and expected frequencies is not greater than would be expected by chance.

f = hist(x, plot=F)$counts;  f     # counts in the five histogram bins
[1] 3 3 6 2 6
chisq.test(f, sim=T)

        Chi-squared test for given probabilities 
        with simulated p-value (based on 2000 replicates)

data:  f
X-squared = 3.5, df = NA, p-value = 0.5502
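
The statistic X-squared = 3.5 reported above can be checked by hand from the observed counts and the common expected count $E = 4$. This is a minimal sketch; the variable names `E` and `X2` are chosen here only for illustration.

```r
# Sketch: Pearson chi-squared statistic for equal expected bin counts.
f  <- c(3, 3, 6, 2, 6)       # observed counts in the five histogram bins
E  <- sum(f) / length(f)     # expected count per bin under uniformity: 20/5 = 4
X2 <- sum((f - E)^2 / E)     # (1 + 1 + 4 + 4 + 4) / 4 = 3.5
X2
```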

Notes on chisq.test in R: (1) Unless otherwise stated, the 'given probabilities' are taken to be equal in each category. (2) When expected category frequencies are small (as here), the test can simulate an accurate P-value. (If using software without this simulation capability, it would be better to have a sample size larger than twenty.)
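
To give an idea of what the simulated P-value in note (2) involves, here is a sketch under stated assumptions: many tables of 20 counts are drawn from a multinomial distribution with five equally likely bins, the chi-squared statistic is computed for each, and the P-value is the proportion of simulated statistics at least as large as the observed one, using a (1 + count)/(B + 1) convention. The seed and the variable names are arbitrary choices made here. Because the result depends on the random draws, it will be near, but generally not equal to, the value 0.5502 shown above.

```r
# Sketch: Monte Carlo P-value for the chi-squared test with equal bin probabilities.
set.seed(2024)                       # arbitrary seed, an assumption of this sketch
f   <- c(3, 3, 6, 2, 6)              # observed bin counts
n   <- sum(f)                        # sample size, 20
k   <- length(f)                     # number of bins, 5
E   <- n / k                         # expected count per bin
obs <- sum((f - E)^2 / E)            # observed statistic

B    <- 2000                         # number of simulated tables
sims <- rmultinom(B, size = n, prob = rep(1 / k, k))   # k x B matrix of counts
stat <- colSums((sims - E)^2 / E)    # statistic for each simulated table

# small tolerance guards against floating-point ties with the observed value
p_sim <- (1 + sum(stat >= obs - 1e-9)) / (B + 1)
p_sim
```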