I have $N$ sensor measurements ($N=5000000$) and a random sample of size $s$ ($s=20$) from this set of data. For each measurement is computed a rank as being the minimum distance to the sample values. So finally we will have $5000000$ ranks. The problem is how do we compute the ranks when $100$ new measurements are retrieved, so now we have $5000100$ data objects. We could generate a new random sample $s_1$ and repeat all computations for all $5000100$ measurements. I want to avoid this since it is computational expensive. So I was thinking to keep the already generated sample s and already generated $5000000$ ranks and compute the rank incrementally only for the new $100$ observations. Is any way to asses the error that would be introduced if we would compute the ranks for the $100$ new values referring to the old sample $s$?
Example
$N=10$ [$2$ $7$ $10$ $2$ $6$ $9$ $11$ $3$ $15$ $8$] $s=3$ [$10$ $7$ $2$]
rank[$2$]$=0$
rank[$7$]$=0$
rank[$10$]$=0$
rank[$2$]$=0$
rank[$6$]$=1$
rank[$9$]$=1$
rank[$11$]$=1$
rank[$3$]$=1$
rank[$15$]$=5$
rank[$8$]$=1$
suppose we have $3$ new observations [$5$ $19$ $4$] the new set of observations would be
[$2$ $7$ $10$ $2$ $6$ $9$ $11$ $3$ $15$ $8$ $5$ $9$ $14$]
Is not correct to consider that for this set a random sample is [$10$ $7$ $2$] since we give no chance of being selected to the new introduced elements.
Still considering that the random sample is [$10$ $7$ $2$] how can we estimate the error of computing the ranks for [$5$ $19$ $4$] in respect to sample [$10$ $7$ $2$] , i.e
rank[$5$]$=2$
rank[$19$]$=9$
rank[$4$]$=2$
what would be the error of taking these values as real ranks?
Thanks and regards Sorin
let us see how often you end with the new measurements in the random sample:
$$p = 1- \frac{{N \choose s}}{{N+100 \choose s}}$$.
Using the known bounds such as $ \left(\frac{n}{k}\right)^k <{n \choose k} < \left(\frac{e\cdot n}{k}\right)^k $, we can show that $p \to 0$ as long as $\frac{s}{N} \to 0$.
In your case s=20 and N=5000000, so I would assume that your initial calculations would remain true with a very high probability.