Detecting corrupted data in birthdates of a population


I have a population of N birthdates. Let's assume that birthdates are uniformly distributed over the year.

I'm concerned that some of these records have been corrupted, for example by someone pasting over filtered rows in Excel, or by some other error.

I would like a test to identify those records in N that share a birthdate which is over-represented in the data, indicating that they may have false dates. Any record might have been corrupted with any date, but I'm assuming the nature of the corruption was to overwrite the dates on a bunch of records with a single (false) date.

If I count the number of records on each date, what is the number above which I should suspect that some of the dates on those records are false? Obviously random variation means that the counts of records will not be exactly N/365 for each date, but how much higher does the count need to be on any given date for me to be 95% confident that I'm not just seeing random variation?

Best answer

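Without corruption, the number of recorded birthdays $X_i$ on each day $i$ is binomial, $X_i\sim B(N,p)$, with $p^{-1}=365=n$. By the union bound over the $n$ days followed by the Chernoff bound, for any $b>p$
$$P(\exists i: X_i\ge Nb)\le nP(X_i\ge Nb)\le n\exp(-ND(b||p)),$$
where $D(b||p)$ is the Kullback-Leibler divergence between Bernoulli distributions with parameters $b$ and $p$. For an overall significance level $\alpha=0.05$ you want to pick $b$ so that
$$\exp(-ND(b||p))=\frac \alpha n\approx 1.37\cdot 10^{-4}.$$
Hence
$$D(b||p)=b\log \frac b p+(1-b)\log\frac {1-b}{1-p}= -\frac {\log(\frac \alpha n)}{N}=K\approx 2.22\cdot 10^{-6},$$
where the numerical value of $K$ corresponds to $N\approx 4\cdot 10^6$. Write $b=p(1+\epsilon)$ and expand the left-hand side to leading order in $\epsilon$ and $p$, which gives $D(b||p)\approx p\epsilon^2/2$, so
$$\epsilon\approx\sqrt {2Kn}=\sqrt{\frac {2n\log\frac n \alpha}{N}}\approx 0.04.$$
Hence if the count on any date exceeds the expected $N/365$ by more than about $4\%$, you can reject the hypothesis of no corruption at level $\alpha=0.05$. Both bounds are upper bounds, so the actual false-alarm probability is at most $\alpha$ and the test is conservative.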
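For other values of $N$, the threshold can be computed numerically rather than from the small-$\epsilon$ expansion. The following is a minimal sketch using only the Python standard library. The function names (`kl_bernoulli`, `frequency_threshold`, `approx_epsilon`, `flag_dates`) are made up for illustration, and it assumes each record has already been reduced to a hashable day-of-year label.

```python
import math
from collections import Counter


def kl_bernoulli(b, p):
    """D(b||p) between Bernoulli(b) and Bernoulli(p), natural log."""
    return b * math.log(b / p) + (1 - b) * math.log((1 - b) / (1 - p))


def frequency_threshold(N, n=365, alpha=0.05):
    """Solve D(b||p) = log(n/alpha)/N for b > p by bisection.

    Returns the per-date frequency b, or None if N is too small
    for any b < 1 to satisfy the bound.
    """
    p = 1.0 / n
    K = math.log(n / alpha) / N
    lo, hi = p, 1.0 - 1e-12
    if kl_bernoulli(hi, p) < K:
        return None
    # D(b||p) is increasing in b on (p, 1), so bisection converges.
    for _ in range(200):
        mid = 0.5 * (lo + hi)
        if kl_bernoulli(mid, p) < K:
            lo = mid
        else:
            hi = mid
    return hi


def approx_epsilon(N, n=365, alpha=0.05):
    """Leading-order relative excess: sqrt(2 n log(n/alpha) / N)."""
    return math.sqrt(2 * n * math.log(n / alpha) / N)


def flag_dates(days, n=365, alpha=0.05):
    """Return {day: count} for days whose count reaches N*b."""
    counts = Counter(days)
    N = len(days)
    b = frequency_threshold(N, n, alpha)
    if b is None:
        return {}
    return {d: c for d, c in counts.items() if c >= N * b}


if __name__ == "__main__":
    N = 4_000_000
    b = frequency_threshold(N)
    print("exact relative excess :", b * 365 - 1)
    print("approx relative excess:", approx_epsilon(N))
    print("count threshold       :", N * b)
```

For $N=4\cdot 10^6$ both the bisection result and the approximation come out close to the $4\%$ quoted above. For small $N$ the expansion is less accurate, so the bisection value is the safer one to use.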