Context: To get answers to sensitive questions, we sometimes use a method called the randomized response technique. Suppose, for instance, that we want to determine what percentage of the students at a large university take ketamine. We construct 20 flash cards, write ‘I take ketamine at least once a week’ on 12 of the cards (where 12 is an arbitrary choice) and ‘I do not take ketamine at least once a week’ on the others. Then we let each student (in the sample interviews) select one of the 20 cards at random and respond yes or no without divulging the question.
Establish a relationship between P(Y), the probability that a student will give a yes response, and P(K), the probability that a student randomly selected at the university takes ketamine at least once a week.
I received this question as an undergraduate statistics student, and I am confused about the whole idea of the "randomised response technique". The process describes each student choosing a card and saying yes or no, but how can we determine the percentage of students who actually take ketamine? If I say that the probability a student gives a yes response is 0.5, this would seem to imply that half the population of the university takes ketamine, which I suspect is incorrect. Could someone please explain how to derive the probability correctly?
Let $A$ be the event that the card drawn by the student interviewed (assumed to be picked uniformly at random among all students) says "I take ketamine at least once a week" and $B$ be the event that it says "I do not take ketamine at least once a week"; so that, here, $\Pr[A] = \frac{12}{20} = \frac{3}{5} $ and $\Pr[B] = \frac{8}{20} = \frac{2}{5}$.
By the law of total probability, the probability $\Pr[Y]$ satisfies $$\begin{align} \Pr[Y] &= \Pr[Y\mid A]\cdot \Pr[A] + \Pr[Y\mid B]\cdot \Pr[B] = \Pr[K]\cdot \Pr[A] + (1-\Pr[K])\cdot \Pr[B]\\ &= \frac{3}{5}\Pr[K]+ \frac{2}{5}(1-\Pr[K])\\ &= \frac{1}{5}\Pr[K]+\frac{2}{5} \tag{1} \end{align}$$ assuming, of course, that the students answer truthfully (that is, say "Yes" iff the statement they read on their card is true) and that the card is drawn independently of whether the student takes ketamine, so that $\Pr[Y\mid A] = \Pr[K]$ and $\Pr[Y\mid B] = 1-\Pr[K]$.
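As a quick sanity check of (1), here is a minimal sketch in Python using exact fractions. The function name `prob_yes` and its parameters are made up for this illustration; it assumes truthful answers and a card drawn independently of the student's behaviour.

```python
from fractions import Fraction


def prob_yes(p_k, p_a=Fraction(3, 5)):
    """Return Pr[Y] given Pr[K] = p_k and Pr[A] = p_a (hypothetical helper).

    Assumes every student answers truthfully and that the card drawn
    is independent of whether the student takes ketamine.
    """
    p_b = 1 - p_a  # probability of drawing an "I do not take..." card
    # "Yes" on card A means the student takes ketamine;
    # "Yes" on card B means the student does not.
    return p_k * p_a + (1 - p_k) * p_b


# Compare with the closed form (1/5) * Pr[K] + 2/5 for a few values.
for p_k in (Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(1)):
    assert prob_yes(p_k) == Fraction(1, 5) * p_k + Fraction(2, 5)
```

With the 12-out-of-20 deck, `prob_yes` ranges from $2/5$ (nobody takes ketamine) to $3/5$ (everybody does).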
Why would we do that? Well, our goal is to estimate $\Pr[K]$ without knowing for sure the response of any given student (as this would violate their privacy; it's sensitive information).
Here, we have added some randomness, so if a given student answers "Yes", we don't know whether it's because they take ketamine and got the first type of card, or because they don't take ketamine but got the second type of card. So we can't know for sure whether a given student takes ketamine: good!
But we can still estimate $\Pr[K]$! How? Because we can estimate $\Pr[Y]$: take sufficiently many students uniformly at random, get their answer, use that to estimate $\Pr[Y]$ (call the value of the estimate $p$). Now, compute $q = 5(p-\frac{2}{5})$: by (1), this $q$ is a suitable estimate for $\Pr[K]$. (Namely, if $|p-\Pr[Y]| \leq \varepsilon$, then $|q-\Pr[K]| \leq 5\varepsilon$.)
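The estimation procedure can be illustrated with a small simulation. This is only a sketch under stated assumptions: the names `simulate_survey` and `estimate_p_k`, the "true" value $0.25$, the sample size and the seed are all invented for the illustration, students are assumed to answer truthfully, and each student draws from the full 20-card deck.

```python
import random


def simulate_survey(true_p_k, n_students, p_a=12 / 20, seed=0):
    """Return the observed fraction of "Yes" answers (hypothetical helper).

    true_p_k is the (normally unknown) fraction of students who take
    ketamine; p_a is the probability of drawing an "I take..." card.
    """
    rng = random.Random(seed)
    yes_count = 0
    for _ in range(n_students):
        takes_ketamine = rng.random() < true_p_k
        drew_card_a = rng.random() < p_a
        # A truthful student says "Yes" iff the statement on the card is true:
        # card A and takes ketamine, or card B and does not.
        if takes_ketamine == drew_card_a:
            yes_count += 1
    return yes_count / n_students


def estimate_p_k(p):
    """Invert (1): q = 5 * (p - 2/5)."""
    return 5 * (p - 2 / 5)


p = simulate_survey(true_p_k=0.25, n_students=200_000)
q = estimate_p_k(p)
print(f"observed Pr[Y] estimate p = {p:.4f}, estimate of Pr[K] q = {q:.4f}")
```

The interviewer only ever sees the yes/no answers, never which card was drawn. Because the inversion multiplies the sampling error by 5, the sample needs to be larger than in a direct survey to reach the same accuracy.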
In your example, if the probability that a student gives the answer "Yes" is $1/2$, then the probability that a student takes ketamine is $5(1/2-\frac{2}{5})=1/2$ as well. (Note that, by (1), $\Pr[Y]$ must lie between $2/5$ and $3/5$.) But if the probability of "Yes" were, say, $45/100$, then the fraction of students taking ketamine would be $5(45/100-\frac{2}{5})=1/4$.
So we can estimate $\Pr[K]$ to very good accuracy without ever determining for sure if any given student takes ketamine: their privacy is preserved, and we got the statistical info we wanted.