To sample or not to sample?


I have the following problem (I want to predict sports results). Let $X$ be a discrete random variable with $m$ possible outcomes, outcome $i$ having probability $p_i$, for $i=1,\ldots,m$. Assume that I have a large iid sample $X_1,\ldots,X_n$, where each variable has the same distribution as $X$. Now, to make a prediction for each $X_k$ (before its actual value is drawn), I could always predict the outcome $i^*$ with the largest probability $p^* = \max_i p_i$. Then, in expectation, I would be right on $n\cdot p^*$ of the samples.

Alternatively, I could sample from the distribution of $X$: for each $X_k$, predict outcome $\hat{X}_k = i$ with probability $p_i$, independently of everything else. Then the probability that my prediction for $X_k$ equals the actual draw of $X_k$ is

$$q=P[\hat{X}_k=X_k]=\sum_{i=1}^m P[\hat{X}_k=i]P[X_k=i]=\sum_{i=1}^m p_i^2.$$

Thus, under this approach, the probability that I am correct on exactly $r$ predictions is $B(r;n,q)=\binom{n}{r}q^r(1-q)^{n-r}$. Now, what is the probability that this approach gives me at least $n\cdot p^*$ correct results, i.e. $\sum_{r=\lceil n p^*\rceil}^{n} B(r;n,q)$? Are there any known general results on this? How does the answer depend on the distribution of $X$ (e.g. uniform, skewed, ...)?
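The tail probability above can be computed exactly for small cases. A minimal sketch, where the probability vector `p` and the sample size `n` are made-up illustrative choices, not anything from the question:

```python
# Exact binomial tail: probability that the sampling approach gets at
# least ceil(n * p_max) predictions right. The distribution p and the
# sample size n are hypothetical examples.
from math import comb, ceil

p = [0.5, 0.3, 0.2]                  # hypothetical outcome probabilities
n = 100
q = sum(pi ** 2 for pi in p)         # per-prediction match probability, 0.38
threshold = ceil(n * max(p))         # argmax benchmark: 50 correct

tail = sum(comb(n, r) * q ** r * (1 - q) ** (n - r)
           for r in range(threshold, n + 1))
print(tail)  # prints a small probability: sampling rarely reaches the benchmark
```

Already for $n = 100$ the tail is on the order of a percent, and by standard concentration it decays exponentially in $n$ whenever $q < p^*$.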

In other words, should I sample (approach 2) or just predict the most probable outcome (approach 1)? (Did I make a mistake somewhere?)

2 Answers

Accepted answer:

$$\sum_{i=1}^m p_i^2 \le \sum_{i=1}^m p_i\,p_{\max} = p_{\max}\sum_{i=1}^m p_i = p_{\max},$$

where $p_{\max}=\max_i p_i$. So the sampling approach's per-prediction success probability $q$ never exceeds that of always predicting the most probable outcome.

Another answer:

You can prove that your sampling method is never better than the simpler choose-the-most-probable-outcome method. That is, if $\sum_i p_i =1$ and $p_i > 0$ for all $i$, then $\sum_i p_i^2 \leq \max_i p_i$.

Ignoring the positivity constraint, use Lagrange multipliers to locate the critical points of the sum of squares on the simplex. The Lagrangian is $\mathcal L(p_1,\dots,p_m,\lambda):= \sum p_i^2 + \lambda(\sum p_i-1)$, so the necessary conditions are $\frac {\partial \mathcal L}{\partial p_i} = \frac {\partial \mathcal L}{\partial \lambda}=0$:

\begin{align} 0 &= \frac {\partial \mathcal L}{\partial p_i} = 2p_i+\lambda \implies p_i = -\frac \lambda 2, \\ 0 &= \frac {\partial \mathcal L}{\partial \lambda} = \sum p_i-1 = m\left(-\frac\lambda 2\right) - 1 \implies \lambda = -\frac 2m \implies p_i = \frac 1m. \end{align}

Note, however, that since $\sum_i p_i^2$ is convex, this interior critical point is a minimum, not a maximum: $\sum_i p_i^2 \ge \frac 1m$ for every distribution, with equality exactly in the uniform case, where $\sum_i p_i^2 = \frac 1m = \max_i p_i$ and the two methods tie. The upper bound we actually need follows by multiplying $p_i \le \max_j p_j$ through by $p_i \ge 0$ (this is where positivity is used) and summing: $\sum_i p_i^2 \le \max_j p_j \sum_i p_i = \max_i p_i$.

It's worth noting that the bound $\sum_i p_i^2 \le \max_i p_i$ is an equality exactly when the distribution is uniform on its support (i.e. every nonzero $p_i$ equals $\max_j p_j$); for any other distribution the inequality is strict, and the simpler method is strictly better.
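The equality cases are easy to check numerically. A small sketch with made-up distributions:

```python
# Check sum(p_i^2) <= max(p_i) and its equality cases on a few
# made-up distributions. Equality holds exactly when the distribution
# is uniform on its support.
def q(p):
    """Probability that a sampled prediction matches the true draw."""
    return sum(x * x for x in p)

cases = [
    [1/3, 1/3, 1/3],   # uniform: equality
    [0.5, 0.5, 0.0],   # uniform on its support: still equality
    [0.6, 0.3, 0.1],   # anything else: strict inequality
]
for p in cases:
    print(round(q(p), 6), max(p), q(p) <= max(p))
```

The second case shows why "non-uniform" alone is not enough for strictness: $(1/2, 1/2, 0)$ is not uniform over all three outcomes, yet $\sum_i p_i^2 = 1/2 = \max_i p_i$.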