To sample or not to sample?


I have the following problem (I want to predict sports results). Let $X$ be a discrete random variable with $m$ possible outcomes, outcome $i$ having probability $p_i$, for $i=1,\ldots,m$. Assume that I have a large iid sample $X_1,\ldots,X_n$, where each variable has the same distribution as $X$. Now, to make a prediction for each $X_k$ (before its actual value is drawn), I could always predict the outcome $i^*$ with the largest probability $p^* = \max_i p_i$. Then, in expectation, I would be right on $n\cdot p^*$ of the samples.

Alternatively, I could sample from the distribution of $X$: for each $X_k$, predict outcome $\hat{X}_k = i$ with probability $p_i$, independently of everything else. Then the probability that my prediction for $X_k$ equals the actual draw of $X_k$ is

$$q=P[\hat{X}_k=X_k]=\sum_{i=1}^m P[\hat{X}_k=i]P[X_k=i]=\sum_{i=1}^m p_i^2.$$

Thus, under this approach, the probability that I am correct on exactly $r$ predictions is $B(r;n,q)=\binom{n}{r}q^r(1-q)^{n-r}$. Now, what is the probability that this approach gives me at least $n\cdot p^*$ correct results, i.e. $\sum_{r=\lceil n p^*\rceil}^{n} B(r;n,q)$? Are there any known general results on this? How does the answer depend on the distribution of $X$ (e.g. uniform, skewed, ...)?
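The tail probability above can be computed exactly for small cases. A minimal sketch, where the probability vector `p` and the sample size `n` are made-up illustrative choices, not anything from the question:

```python
# Exact binomial tail: probability that the sampling approach gets at
# least ceil(n * p_max) predictions right. The distribution p and the
# sample size n are hypothetical examples.
from math import comb, ceil

p = [0.5, 0.3, 0.2]                  # hypothetical outcome probabilities
n = 100
q = sum(pi ** 2 for pi in p)         # per-prediction match probability, 0.38
threshold = ceil(n * max(p))         # argmax benchmark: 50 correct

tail = sum(comb(n, r) * q ** r * (1 - q) ** (n - r)
           for r in range(threshold, n + 1))
print(tail)  # prints a small probability: sampling rarely reaches the benchmark
```

Already for $n = 100$ the tail is on the order of a percent, and by standard concentration it decays exponentially in $n$ whenever $q < p^*$.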

In other words, should I sample (approach 2) or just predict the most probable outcome (approach 1)? (Did I make a mistake somewhere?)

2 Answers

Accepted answer:

$$\sum_{i=1}^m p_i^2 \le \sum_{i=1}^m p_i\,p_{\max} = p_{\max}\sum_{i=1}^m p_i = p_{\max},$$

where $p_{\max}=\max_i p_i$. So the sampling approach's per-prediction success probability $q$ never exceeds that of always predicting the most probable outcome.

Another answer:

You can prove that your sampling method is never better than the simpler choose-the-most-probable-outcome method. That is, if $\sum_i p_i =1$ and $p_i > 0$ for all $i$, then $\sum_i p_i^2 \leq \max_i p_i$.

Ignoring the positivity constraint, use Lagrange multipliers to locate the critical points of the sum of squares on the simplex. The Lagrangian is $\mathcal L(p_1,\dots,p_m,\lambda):= \sum p_i^2 + \lambda(\sum p_i-1)$, so the necessary conditions are $\frac {\partial \mathcal L}{\partial p_i} = \frac {\partial \mathcal L}{\partial \lambda}=0$:

\begin{align} 0 &= \frac {\partial \mathcal L}{\partial p_i} = 2p_i+\lambda \implies p_i = -\frac \lambda 2, \\ 0 &= \frac {\partial \mathcal L}{\partial \lambda} = \sum p_i-1 = m\left(-\frac\lambda 2\right) - 1 \implies \lambda = -\frac 2m \implies p_i = \frac 1m. \end{align}

Note, however, that since $\sum_i p_i^2$ is convex, this interior critical point is a minimum, not a maximum: $\sum_i p_i^2 \ge \frac 1m$ for every distribution, with equality exactly in the uniform case, where $\sum_i p_i^2 = \frac 1m = \max_i p_i$ and the two methods tie. The upper bound we actually need follows by multiplying $p_i \le \max_j p_j$ through by $p_i \ge 0$ (this is where positivity is used) and summing: $\sum_i p_i^2 \le \max_j p_j \sum_i p_i = \max_i p_i$.

It's worth noting that the bound $\sum_i p_i^2 \le \max_i p_i$ is an equality exactly when the distribution is uniform on its support (i.e. every nonzero $p_i$ equals $\max_j p_j$); for any other distribution the inequality is strict, and the simpler method is strictly better.
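The equality cases are easy to check numerically. A small sketch with made-up distributions:

```python
# Check sum(p_i^2) <= max(p_i) and its equality cases on a few
# made-up distributions. Equality holds exactly when the distribution
# is uniform on its support.
def q(p):
    """Probability that a sampled prediction matches the true draw."""
    return sum(x * x for x in p)

cases = [
    [1/3, 1/3, 1/3],   # uniform: equality
    [0.5, 0.5, 0.0],   # uniform on its support: still equality
    [0.6, 0.3, 0.1],   # anything else: strict inequality
]
for p in cases:
    print(round(q(p), 6), max(p), q(p) <= max(p))
```

The second case shows why "non-uniform" alone is not enough for strictness: $(1/2, 1/2, 0)$ is not uniform over all three outcomes, yet $\sum_i p_i^2 = 1/2 = \max_i p_i$.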