Random sample after adding extra elements

73 Views Asked by At

I have $N$ sensor measurements ($N=5000000$) and a random sample of size $s$ ($s=20$) from this set of data. For each measurement is computed a rank as being the minimum distance to the sample values. So finally we will have $5000000$ ranks. The problem is how do we compute the ranks when $100$ new measurements are retrieved, so now we have $5000100$ data objects. We could generate a new random sample $s_1$ and repeat all computations for all $5000100$ measurements. I want to avoid this since it is computational expensive. So I was thinking to keep the already generated sample s and already generated $5000000$ ranks and compute the rank incrementally only for the new $100$ observations. Is any way to asses the error that would be introduced if we would compute the ranks for the $100$ new values referring to the old sample $s$?

Example

$N=10$ [$2$ $7$ $10$ $2$ $6$ $9$ $11$ $3$ $15$ $8$] $s=3$ [$10$ $7$ $2$]

rank[$2$]$=0$
rank[$7$]$=0$
rank[$10$]$=0$
rank[$2$]$=0$
rank[$6$]$=1$
rank[$9$]$=1$
rank[$11$]$=1$
rank[$3$]$=1$
rank[$15$]$=5$
rank[$8$]$=1$

suppose we have $3$ new observations [$5$ $19$ $4$] the new set of observations would be

[$2$ $7$ $10$ $2$ $6$ $9$ $11$ $3$ $15$ $8$ $5$ $9$ $14$]

Is not correct to consider that for this set a random sample is [$10$ $7$ $2$] since we give no chance of being selected to the new introduced elements. Still considering that the random sample is [$10$ $7$ $2$] how can we estimate the error of computing the ranks for [$5$ $19$ $4$] in respect to sample [$10$ $7$ $2$] , i.e rank[$5$]$=2$
rank[$19$]$=9$
rank[$4$]$=2$

what would be the error of taking these values as real ranks?

Thanks and regards Sorin

1

There are 1 best solutions below

7
On

let us see how often you end with the new measurements in the random sample:
$$p = 1- \frac{{N \choose s}}{{N+100 \choose s}}$$.

Using the known bounds such as $ \left(\frac{n}{k}\right)^k <{n \choose k} < \left(\frac{e\cdot n}{k}\right)^k $, we can show that $p \to 0$ as long as $\frac{s}{N} \to 0$.

In your case s=20 and N=5000000, so I would assume that your initial calculations would remain true with a very high probability.