Approximating statistics for huge dataset


I'm investigating user-account statistics for the VKontakte social network. There are $N\approx2 \cdot 10^8$ accounts with various metrics attached to them – boolean, discrete, and continuous. I have found that statistics for those metrics can be approximated well from much smaller random subsets of accounts.

For example, the vast majority of accounts ($\approx 90\%$) are deleted (blocked bots or self-deleted accounts). So there is a probability of deletion that can be found as $$P_{deletion}=\frac{N_{deleted}}{N}.$$ The same can be done with a much smaller subset of $n\ (n\ll N)$ accounts: $$P_{deletion}\approx\frac{n_{deleted}}{n}.$$ This approximation becomes more accurate as $n$ increases; I can see this on a graph of the function $P_{deletion}(n)$.
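The effect described above can be reproduced in a small R sketch. Everything here is hypothetical – a simulated, scaled-down population with an assumed true deletion rate of $0.9$, not the actual VKontakte data:

```r
# Hypothetical sketch: simulate a scaled-down population in which 90% of
# accounts are deleted, then estimate the deletion probability from a much
# smaller random subsample.
set.seed(1)
N = 2e6                                # scaled-down population for speed
population = rbinom(N, 1, 0.9)         # 1 = deleted, 0 = active
n = 1e4                                # subsample size, n << N
subsample = sample(population, n)      # simple random sample without replacement
p.hat = mean(subsample)                # n_deleted / n
p.hat                                  # close to the true 0.9
```

Re-running with larger `n` makes `p.hat` cluster ever more tightly around $0.9$, which is the behavior seen on the $P_{deletion}(n)$ graph.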

I believe the same kind of approximation works for the distribution parameters of discrete and continuous metrics. I checked it on the mean number of user posts over the last month.

What are the precise mathematical concepts behind this? How can I quantify the accuracy of such approximations?

Best Answer

First, suppose a sample of size $n$ is chosen at random from the population of size $N.$

Then the estimate of the fraction of deleted accounts is $\hat p_{del} = X/n,$ where $X \sim Binom(n, p_{del})$ is the number of deleted accounts among $n$. For $n$ as large as I suppose you would use, $X$ is approximately normally distributed.

This leads to a 95% confidence interval that expresses how far the true $p_{del}$ might be from the estimated $\hat p_{del},$ as follows: $$ \hat p_{del} \pm 1.96 \sqrt{\hat p_{del}(1-\hat p_{del})/n}.$$ Notice that the margin of error (the term after the $\pm$ sign) shrinks in proportion to $1/\sqrt{n}$ as $n$ increases. Also, the largest margins of error occur when $\hat p_{del} \approx 1/2.$
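In R, this interval is a one-liner. The sample counts below are hypothetical, chosen only to illustrate the arithmetic:

```r
# Hypothetical counts: n = 10000 sampled accounts, of which 9013 are deleted.
n = 10000; x = 9013
p.hat = x/n                              # estimated deletion probability
me = 1.96*sqrt(p.hat*(1 - p.hat)/n)      # margin of error, about 0.0058
c(p.hat - me, p.hat + me)                # 95% CI for p_del
```

So with $n = 10{,}000$ accounts the deletion probability is pinned down to within roughly $\pm 0.6$ percentage points.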

When $\hat p_{del}$ is between $.3$ and $.7,$ a rough rule of thumb is that the margin of error is about $1/\sqrt{n}$; a public opinion poll based on a random sample of $n = 2500$ subjects should be accurate to within about $\pm 2\%$.

For discrete and continuous measures, I guess you are interested in how close the sample mean $\bar X$ of a sample of size $n$ is to the population mean $\mu$ of the $N$ values in the population. Here the Central Limit Theorem takes effect to give $\bar X$ an approximately normal distribution. A 95% confidence interval for $\mu$ is $$\bar X \pm 1.96S/\sqrt{n},$$ where $S$ is the sample standard deviation. Again, the margin of error decreases in proportion to $1/\sqrt{n}$ as $n$ increases.
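A sketch of this interval in R. Since the real posts-per-month data are not available here, the metric is simulated as Poisson counts with an assumed mean of $3.7$ – both the distribution and the parameter are stand-ins:

```r
# Hypothetical sketch: a posts-per-month metric simulated as Poisson counts.
set.seed(2)
n = 5000
x = rpois(n, 3.7)                 # stand-in for the real metric values
x.bar = mean(x); S = sd(x)        # sample mean and sample SD
me = 1.96*S/sqrt(n)               # margin of error
c(x.bar - me, x.bar + me)         # 95% CI for the population mean
```

With $n = 5000$ the interval has half-width of only a few hundredths of a post, illustrating how quickly the $1/\sqrt{n}$ factor tightens the estimate.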

Comment: As requested, here is some theoretical background for the confidence intervals shown in my Answer. You can find further details in most basic statistics books, but it may be useful to have information that focuses primarily on your situation where $n$ is very large. 'Intuitive' is in the eye of the beholder, but maybe this will give you a good start.

Binomial. Suppose the fraction of individuals in a population having a particular trait is $p$ and that we draw a random sample of $n$ individuals from the population, of whom $X$ have the trait. Then $X \sim Binom(n, p),$ and $p$ is estimated by $\hat p = X/n.$

While there are methods for finding a confidence interval for $p$ directly from binomial distributions, it is easier and about as accurate for large $n$ to use a normal approximation. One can show that $Z = \frac{\hat p - p}{\sqrt{p(1-p)/n}}$ is approximately standard normal. This works because $E(\hat p) = p$ and $SD(\hat p) = \sqrt{p(1-p)/n}.$ Then

$$P\left(-1.96 \le Z = \frac{\hat p - p}{\sqrt{p(1-p)/n}} \le 1.96\right) \approx .95.$$

The plot below shows that, for $n$ as small as $100$ and $p = .3,$ binomial probabilities for $\hat p$ (bars) are very well matched by the density curve for $Norm(p = .3, \sigma=.0458),$ where $\sigma = \sqrt{.3(.7)/100}.$

[Plot: binomial probabilities for $\hat p$ (bars) overlaid on the density curve of $Norm(.3,\ .0458)$.]
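The quality of this normal approximation can be checked numerically rather than visually. The sketch below compares exact binomial CDF values with the (continuity-corrected) normal CDF for the same $n = 100,$ $p = .3$:

```r
# Check of the normal approximation for n = 100, p = .3: compare exact
# binomial CDF values with the continuity-corrected normal CDF.
n = 100; p = .3
x = 20:40                                # counts near the mean np = 30
exact  = pbinom(x, n, p)                 # exact binomial probabilities
z = (x + 0.5 - n*p)/sqrt(n*p*(1-p))      # continuity-corrected z-scores
approx = pnorm(z)                        # normal approximation
max(abs(exact - approx))                 # worst-case discrepancy is small
```

Even at $n = 100$ the two CDFs agree to within well under one percentage point, which is why the bars and the curve in the plot are nearly indistinguishable.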

Furthermore, as a second approximation for very large $n,$ we have $X/n = \hat p \approx p,$ so that

$$P\left(-1.96 \le \frac{\hat p - p}{\sqrt{\hat p(1- \hat p)/n}} \le 1.96\right) \approx .95.$$ After some algebra with the inequalities, one can 'isolate' $p$ to obtain

$$P\left(\hat p - 1.96\sqrt{\hat p(1-\hat p)/n} \le p \le \hat p + 1.96\sqrt{\hat p(1-\hat p)/n}\right) \approx .95.$$

This gives rise to the following 95% CI for $p$:

$$\hat p \pm 1.96\sqrt{\hat p(1-\hat p)/n}.$$

Notice that two approximations are involved: (1) Assuming that $\hat p$ is approximately normal, and (2) Assuming that it is OK to use $\hat p$ instead of $p$ in the expression for the $SD(\hat p).$ If $n$ is in the thousands, both assumptions are reasonable.
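Whether both approximations together still deliver the promised 95% coverage can be tested by simulation, in the same spirit as the dice experiment below. The parameters here are hypothetical but chosen to match the deleted-accounts setting ($p = .9$):

```r
# Simulation sketch (hypothetical parameters): coverage of the 95% CI for a
# proportion with true p = .9 and sample size n = 10000.
set.seed(3)
B = 10^4; n = 10^4; p = .9
x = rbinom(B, n, p)                      # B simulated deleted-account counts
p.hat = x/n                              # B estimates of p
me = 1.96*sqrt(p.hat*(1 - p.hat)/n)      # margin of error for each sample
cover = (p.hat - me <= p) & (p <= p.hat + me)
mean(cover)                              # proportion of CIs covering p; near .95
```

Despite $p = .9$ being far from $1/2$ and $\hat p$ standing in for $p$ in the standard deviation, the realized coverage stays very close to the nominal 95%.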

Numerical Data. Suppose that $n$ observations are taken at random from a distribution with mean $\mu$ and variance $\sigma^2.$ (The population distribution need not be normal, but ideally it should take a variety of different values.) Then the Central Limit Theorem says that $\bar X$ is approximately distributed as $Norm(\mu, \sigma/\sqrt{n}),$ where the second argument of $Norm$ is the standard deviation. Then $$Z = \frac{\bar X - \mu}{\sigma/\sqrt{n}} \sim Norm(0,1).$$ If the population standard deviation $\sigma$ is estimated by the sample standard deviation $S,$ then this ratio has, at least approximately, Student's t distribution with $n - 1$ degrees of freedom:

$$T = \frac{\bar X - \mu}{S/\sqrt{n}} \sim T(df = n-1).$$

For $n$ in the hundreds or thousands, the distribution $T(n-1)$ is very nearly the same as $Norm(0,1).$
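A quick check of this claim in R, comparing t critical values with the normal one:

```r
# t critical values approach the normal 1.96 as the degrees of freedom grow.
qt(.975, df = 99)      # n = 100: a bit above 1.96
qt(.975, df = 999)     # n = 1000: nearly 1.96
qnorm(.975)            # the normal critical value, about 1.96
```

So for samples in the thousands there is no practical difference between using $t$ and normal critical values in the CI.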

Thus, for large $n$ we have $$P\left(-1.96 \le \frac{\bar X - \mu}{S/\sqrt{n}} \le 1.96\right) \approx .95$$ and, after some algebra to 'isolate' $\mu,$ we have a 95% CI for $\mu$: $$\bar X \pm 1.96 S/\sqrt{n}.$$ Many people use $\bar X \pm 2\, S/\sqrt{n}.$

Finally, here is a simulation in R. Icosahedral dice have 20 equally likely sides, numbered 1 through 20. An 'experiment' is to roll 1000 such dice, average the numbers showing on the faces, and use the mean and SD of these 1000 numbers to make a 95% CI for $\mu = 10.5,$ the average number showing on such a die.

The frequentist interpretation of a CI is that the method produces CIs that cover the true $\mu$ 95% of the time over the long run. From 10,000 such thousand-die experiments we got just about 95% coverage.

 B = 10^4;  n = 1000
 x = sample(1:20, B*n, replace=TRUE)
 DTA = matrix(x, nrow=B)       # 10,000 x 1000: each row an experiment
 x.bar = rowMeans(DTA)         # 10,000 sample means
 s = apply(DTA, 1, sd)         # 10,000 sample SDs
 lcl = x.bar - 1.96*s/sqrt(n)  # lower CI limits
 ucl = x.bar + 1.96*s/sqrt(n)  # upper CI limits
 cover = (lcl < 10.5) & (ucl > 10.5)  # TRUE if CI covers 10.5, else FALSE
 mean(cover)                   # proportion of TRUEs
 ## 0.9502