Greg and Fyodor are playing a game. To play the game they each bring their own die with the numbers $1$ to $6$ on it. The probability distribution of neither die is known. In each round, both players roll their dice; whoever rolls the higher number wins, and if they roll the same number they roll again until someone wins.
They play a game of $200$ rounds and only keep track of the number of wins and losses, but not the numbers on the dice in each round.
Fyodor wins $87$ rounds and Greg wins the other $113.$
Angry, Fyodor calls BS and tells Greg to his face, "No way, you were cheating."
I'm not sure that the statement "you were cheating" corresponds to a single hypothesis-test claim. However, there are two seemingly similar but different claims I would like to test on Fyodor's behalf.
The first claim is: "(at the 5% significance level) the EV of Greg's die is not equal to the EV of Fyodor's die." Now, assuming Fyodor's and Greg's EVs are equal, this implies that the probability of Fyodor winning a round is exactly $\frac{1}{2}.$ Let $X$ be the random variable "number of rounds Fyodor wins out of $200$"; then $X\sim B(200, \frac{1}{2})$ and $P(X \leq 87) = 0.0384 < 0.05,$ so at the $5$% significance level there is evidence to suggest that the EV of Fyodor's die is not equal to the EV of Greg's die.
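(For what it's worth, that tail probability can be checked exactly with a short standard-library computation; a quick sketch:)

```python
from math import comb

# Exact binomial tail: P(X <= 87) for X ~ B(200, 1/2).
n, k = 200, 87
p_tail = sum(comb(n, i) for i in range(k + 1)) / 2**n
print(p_tail)  # about 0.038, below the 0.05 threshold
```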
Now my second claim is: "(at the 5% significance level) the EV of Greg's die is greater than the EV of Fyodor's die." I'm not sure how to calculate this; I think it is much more difficult, if not impossible, without further information. Perhaps with more information we could take a double integral over all the possible EV values of Greg's die and all the possible EV values of Fyodor's die, though this might have to be weighted in some way depending on how the dice are made and how likely different EVs are. Is my thinking that we need more information to answer my second claim correct? Or am I overthinking it, and it's just some sort of one-tailed hypothesis test?
I came up with this question because I was thinking about how we could test claims of whether or not someone was cheating in a game - or whether the house or a player is cheating in a casino - and ultimately I think what you conclude depends on what your hypothesis test is.
No, you cannot perform the hypothesis test based on the expectations of each die, because "win" and "loss" are determined solely by which die rolled higher, not by how much higher. For an extreme example, Greg's die could have the distribution $$\Pr[X = x] = \begin{cases} 1/6, & x = 1 \\ 0, & x \in \{2, 3, 4, 5\} \\ 5/6, & x = 6 \end{cases}$$ in which case the expected value is $31/6$, and Fyodor's die could have the distribution $$\Pr[Y = y] = \begin{cases} 0, & y \in \{1, 2, 3, 4\} \\ 5/6, & y = 5 \\ 1/6, & y = 6 \end{cases}$$ with the same expected value, but Greg wins more than half the time because, with probability $(5/6)^2 = 25/36 > 1/2$, Greg rolls a $6$ and Fyodor rolls a $5$.
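To make the counterexample concrete, here is a quick check of these two distributions with exact rational arithmetic (a sketch; the variable names are just illustrative):

```python
from fractions import Fraction as F

# The two hypothetical dice above, encoded as {face: probability}.
greg   = {1: F(1, 6), 6: F(5, 6)}
fyodor = {5: F(5, 6), 6: F(1, 6)}

def ev(die):
    return sum(face * p for face, p in die.items())

assert ev(greg) == ev(fyodor) == F(31, 6)  # identical expected values

# On a single throw of both dice:
p_beat = sum(pg * pf for g, pg in greg.items() for f, pf in fyodor.items() if g > f)
p_tie  = sum(pg * pf for g, pg in greg.items() for f, pf in fyodor.items() if g == f)

# A tie forces a re-roll, so condition on the round being decided.
p_greg_wins = p_beat / (1 - p_tie)
print(p_greg_wins)  # 25/31, roughly 0.806
```

So Greg wins about $81\%$ of decided rounds despite the EVs being equal.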
Instead, the correct parameter on which to test the hypothesis that Greg cheated is the proportion of times Greg's die beats Fyodor's die. Fortunately, the relevant information captured in the sample is the number of outcomes in which one die beat the other; this is a sufficient statistic for the parameter.
To illustrate again with the above example: both $X$ and $Y$ are categorical random variables; specifically $$X \sim \operatorname{Categorical}(\pi_{x1} = 1/6, \pi_{x2} = \pi_{x3} = \pi_{x4} = \pi_{x5} = 0, \pi_{x6} = 5/6), \\ Y \sim \operatorname{Categorical}(\pi_{y1} = \pi_{y2} = \pi_{y3} = \pi_{y4} = 0, \pi_{y5} = 5/6, \pi_{y6} = 1/6),$$ and these are in a sense location-scale generalizations of Bernoulli variables because all but two of the parameters in each case equal $0$. The counts of the six outcomes for each variable form a multinomial random variable, a vector-valued generalization of a binomial random variable. However, you would need to write a complicated function that relates the $\pi_{ij}$ to the proportion $p$ of outcomes where $X$ beats $Y$. Again, this is not necessary, because what ultimately matters is testing the hypothesis $$H_0 : p = 1/2 \quad \text{vs.} \quad H_1 : p \ne 1/2,$$ where the number of times $X$ beats $Y$ in $n$ outcomes (where an outcome may require multiple rolls to determine a winner) is a binomial random variable $B \sim \operatorname{Binomial}(n, p)$. Therefore, under the null hypothesis, for a sufficiently large sample size $n$, the test statistic $$Z \mid H_0 = \frac{\hat p - 1/2}{\sqrt{1/2 (1-1/2)/n}}, \quad \hat p = \frac{B}{n}$$ is approximately standard normal.
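(As an aside, that function relating the $\pi_{ij}$ to $p$ is messy algebraically but short to state computationally. A sketch, conditioning on ties being re-rolled as in the game:)

```python
from fractions import Fraction as F

def win_probability(px, py):
    """P(X beats Y) given pmfs over faces 1..6 as {face: prob}, with ties re-rolled."""
    p_beat = sum(px.get(i, 0) * py.get(j, 0)
                 for i in range(1, 7) for j in range(1, 7) if i > j)
    p_tie = sum(px.get(i, 0) * py.get(i, 0) for i in range(1, 7))
    return p_beat / (1 - p_tie)

# Two fair dice: p is exactly 1/2, as the null hypothesis assumes.
fair = {i: F(1, 6) for i in range(1, 7)}
print(win_probability(fair, fair))  # 1/2
```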
To use this test for your example, we have $B = 113$ and $n = 200$. Then $$Z \mid H_0 = \frac{\frac{113}{200} - \frac{1}{2}}{\sqrt{1/2(1 - 1/2)/200}} \approx 1.83848.$$ The $2$-sided $p$-value for this test is approximately $\Pr[|Z| > 1.83848] \approx 0.0659921$. Note that this value is twice what you calculated, because the test should be two-tailed, not one-tailed: a fair test must also allow for the alternative that Fyodor's die wins more frequently than Greg's, which would of course be just as unfair the other way round. So if we are to use a $5\%$ significance level, the overall Type I error $\alpha$ must be allocated equally to the two tails, meaning that the critical value for the test is $z^*_{\alpha/2} \approx 1.95996$, and we cannot reject $H_0$.
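For reference, the arithmetic above can be reproduced with only the standard library (`math.erf` gives the normal CDF):

```python
from math import erf, sqrt

def normal_cdf(z):
    """Standard normal CDF via the error function."""
    return 0.5 * (1 + erf(z / sqrt(2)))

n, b = 200, 113  # rounds played, rounds Greg won
p_hat = b / n    # sample proportion, 0.565
z = (p_hat - 0.5) / sqrt(0.5 * 0.5 / n)
p_value = 2 * (1 - normal_cdf(abs(z)))  # two-tailed

print(round(z, 5))        # 1.83848
print(round(p_value, 4))  # 0.066 > 0.05, so H0 is not rejected
```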