Estimating probability of event given marginal information for discrete random variables


Let $X$ be a random variable that takes one of two mutually exclusive values $A$ and $B$, with $\mathbb{P}(X=A)=\alpha$ and $\mathbb{P}(X=B)=\beta\ \ (=1-\alpha)$, and suppose we want to estimate $\alpha$. However, we are only given samples of $Y$ from the pair $(X,Y)$ (without knowing whether $X=A$ or $X=B$), where $Y$ takes the values $C_k$ and the conditional distributions $C_{A}=Y|X=A$ and $C_B=Y|X=B$ satisfy

$\mathbb{P}(C_A=C_k)=p_k=1/N,$ (uniform) where $k=1,...,N$

$\mathbb{P}(C_B=C_k)=q_k$ for $k=1,...,N.$

Obviously, if $q_k$ is close to $p_k$ for all $k$, we cannot estimate $\alpha$, since the second-stage samples are then (nearly) identically distributed whichever of $A$ or $B$ occurred. But if $q_k$ and $p_k$ differ substantially, we should be able to get a good estimate. Is anyone aware of a documented solution to this problem, or can anyone suggest a good estimator?

I expect this is a well-documented problem, but I am not that comfortable with sample bias statistics. The accuracy of the estimate should depend on the number of samples $m$ and on the differences between the probabilities $p_k$ and $q_k$.

P.S. If anyone believes further clarification is needed, please let me know. I am trying to formulate mathematically the problem of estimating the number of samples that come from one of two datasets, where the two datasets take the same values but with different probabilities.


There are 1 best solutions below


I've thought about this more, and in the limit of infinitely many samples you can of course recover $\alpha$ exactly. The empirical frequency you get from the samples, $r_k:=\mathbb{P}_e(Y=C_k)$, converges to the actual probability $\mathbb{P}(Y=C_k)$ by the law of large numbers (LLN). By the law of total probability,

$$\mathbb{P}(Y=C_k)=\mathbb{P}(X=A)\cdot\mathbb{P}(Y=C_k|X=A)+\mathbb{P}(X=B)\cdot\mathbb{P}(Y=C_k|X=B)$$

$$\mathbb{P}(Y=C_k)=\alpha/N+(1-\alpha)q_k$$
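As a quick sanity check of the mixture formula above, here is a minimal Python sketch. The function name `mixture_pmf` and the example values of `alpha` and `q` are made up for illustration; they are not from the question.

```python
def mixture_pmf(alpha, q):
    """Return the list P(Y=C_k) = alpha/N + (1 - alpha) * q_k, k = 1..N.

    alpha : P(X=A), the weight of the uniform component
    q     : list of q_k = P(Y=C_k | X=B); assumed to sum to 1
    """
    n = len(q)
    return [alpha / n + (1 - alpha) * qk for qk in q]


# Hypothetical example: N = 4 values, B strongly favours the first one.
q_example = [0.7, 0.1, 0.1, 0.1]
pmf_example = mixture_pmf(0.4, q_example)
```

The resulting list is again a probability distribution, since it is a convex combination of the uniform distribution and $q$.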

So if you treat the empirical frequency as $r_k\approx\alpha/N+(1-\alpha)q_k$ and solve for $\alpha$, using any $k$ with $q_k\neq 1/N$, you should get a good estimate:

$$\alpha\approx \frac{r_k-q_k}{\frac{1}{N}-q_k}.$$

The probability that $\alpha$ differs from this by a given amount depends on the convergence behavior of the LLN for the discrete random variables $(X,Y)$. I guess this kind of thing is well known, but I've forgotten what the optimal bounds are here.
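The estimator can be tried out numerically. The sketch below is only an illustration under stated assumptions: the values $C_k$ are encoded as indices `0..N-1`, the function names are hypothetical, and the pooled version (a least-squares fit of $r_k-q_k=\alpha(1/N-q_k)$ over all $k$) is my own addition, not something derived in the answer.

```python
import random
from collections import Counter


def sample_y(alpha, q, m, rng):
    """Draw m samples of Y: uniform with prob. alpha, else distributed as q."""
    n = len(q)
    ys = []
    for _ in range(m):
        if rng.random() < alpha:          # X = A: uniform over the N values
            ys.append(rng.randrange(n))
        else:                             # X = B: distributed according to q
            ys.append(rng.choices(range(n), weights=q)[0])
    return ys


def estimate_alpha_single(ys, q, k):
    """Per-index estimate (r_k - q_k) / (1/N - q_k); needs q_k != 1/N."""
    n = len(q)
    r_k = Counter(ys)[k] / len(ys)
    return (r_k - q[k]) / (1.0 / n - q[k])


def estimate_alpha_pooled(ys, q):
    """Assumed variant: least-squares fit of r_k - q_k = alpha * (1/N - q_k)."""
    n, m = len(q), len(ys)
    counts = Counter(ys)
    num = den = 0.0
    for k in range(n):
        d_k = 1.0 / n - q[k]
        num += (counts[k] / m - q[k]) * d_k
        den += d_k * d_k
    if den < 1e-12:
        raise ValueError("q is (numerically) uniform; alpha is not identifiable")
    return num / den


# Hypothetical experiment: true alpha = 0.3, N = 4, m = 20000 samples.
rng = random.Random(0)
q_demo = [0.7, 0.1, 0.1, 0.1]
ys_demo = sample_y(0.3, q_demo, 20000, rng)
alpha_hat = estimate_alpha_pooled(ys_demo, q_demo)
```

Pooling over all $k$ avoids having to pick a single index, and it fails loudly when $q$ is uniform, which is exactly the case where $\alpha$ cannot be recovered.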