I have a question of a statistical nature; I think there should be some standard theory about this issue.
Suppose I have a large data set of $N$ items, of which $K<N$ are unwanted. I am interested in finding the value of $K$. Testing all items takes too much time, so I want to determine a suitable sample size $n<N$ of randomly selected items from the data set.
Suppose I just pick a value for $n$.
Then, in a random sample of size $n$, I search for the unwanted items and find some number $k\leq n$ of them. Let the number of unwanted items in the sample be a test statistic $T$, i.e. I will test on the probability $P(T \geq k)$. I can now find the smallest integer value $K_\min$ such that for the estimate $K = K_\min$ we have $P(T \geq k) \geq \alpha$. That is, for any smaller integer estimate $K<K_\min$ we have $P(T \geq k) < \alpha$. If I am correct, I can now state that at significance level $\alpha$ we have $K \geq K_\min$. Is that true?
If this is true, the question now is: How accurate is this lower bound?
This is also my main question. Based on the sample size $n$ and significance level $\alpha$, what can we say about the accuracy of $K_\min$? In other words, can we determine some confidence interval for $K$ in terms of $K_\min$ and $\alpha$?
Any tips or other approaches are very much appreciated!
Best, Koen
Edit 26 November:
Another formulation of the problem, as mentioned by David K, is as follows:
Given some "error" tolerance $\varepsilon$, how do we choose $n$ for a given $\alpha$ such that we can guarantee that $|K_\min-K|/N\leq \varepsilon$ (or some assurance like that)?
I suppose that when you compute $P(T \geq k),$ you have in mind a probability distribution of $T$ based on sampling $n$ items from $N$ of which $K$ are "unwanted". A hypergeometric distribution seems to fit the requirements.
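To make that assumption concrete: if $T$ is hypergeometric (sampling $n$ of $N$ items without replacement, $K$ of them unwanted), the tail probability would be

$$P(T \geq k) = \sum_{j=k}^{\min(n,K)} \frac{\binom{K}{j}\binom{N-K}{n-j}}{\binom{N}{n}}.$$

For fixed $N$, $n$ and $k$ this is non-decreasing in $K$, which is why a smallest $K$ with $P(T \geq k) \geq \alpha$ is well defined.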
In that light, I agree completely with your paragraph between "Suppose I just pick a value" and "If this is true".
It seems to me that then $K \geq K_\min$ is your confidence interval, that is, you have a one-sided confidence interval for $K$ whose lower bound is $K_\min.$
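As a rough illustration of how $K_\min$ could be computed, here is a minimal sketch that assumes the hypergeometric model above. The function names `hypergeom_tail` and `k_min` are made up for this example, and the linear scan over $K$ is just the simplest choice (it relies on the tail probability being non-decreasing in $K$), not necessarily the most efficient one.

```python
from math import comb


def hypergeom_tail(N, K, n, k):
    """P(T >= k) when n items are drawn without replacement
    from N items, K of which are unwanted (hypergeometric model)."""
    total = comb(N, n)
    upper = min(n, K)
    # comb(a, b) is 0 when b > a, so impossible outcomes drop out.
    favourable = sum(comb(K, j) * comb(N - K, n - j)
                     for j in range(k, upper + 1))
    return favourable / total


def k_min(N, n, k, alpha):
    """Smallest K such that P(T >= k) >= alpha.

    K can only lie between k (every unwanted item was sampled) and
    N - (n - k) (every wanted item was sampled), so only that range
    is scanned.
    """
    for K in range(k, N - (n - k) + 1):
        if hypergeom_tail(N, K, n, k) >= alpha:
            return K
    raise ValueError("no K satisfies the condition; is alpha <= 1?")
```

For a toy case with $N=10$, $n=3$, $k=2$ and $\alpha=0.1$: $K=2$ gives $P(T\geq 2)=8/120\approx 0.067<\alpha$, while $K=3$ gives $22/120\approx 0.183\geq\alpha$, so `k_min(10, 3, 2, 0.1)` returns `3`. For large $N$ a bisection over $K$ would be faster than the scan.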