Thanks in advance for any help!
So I am trying to figure out whether the number of hits per inning of baseball is random, or whether hits tend to come in bunches. To do this, I'm using a fairly small sample of 10 games, which came to 89 innings. Over those 89 innings, there were 74 total hits.
Here is a breakdown of the number of hits per inning (times occurred):
0 (42)
1 (27)
2 (14)
3 (5)
4 (1)
How should I go about this? Would I find the expected number of "zero" innings and compare my 42 with that? Find the probability of obtaining 42 "zeros" if it was actually random?
Maybe this is a stars-and-bars problem.
Let me know if this is unclear.
Any advice would help! Thank you!
It seems reasonable to hypothesize that the number of baserunners in each inning is not completely random. After all, the teams exercise strategy to try to win, by arranging the batting order, choosing when to replace the pitcher, and so forth, with the objective of scoring as many runs as they can (which requires getting baserunners) and preventing the other team from scoring runs. It's also reasonable to test the hypothesis by comparing the actual numbers of baserunners with the expected numbers you would get if the process of becoming a baserunner were random.
If you assume there had to be exactly $74$ baserunners in $89$ innings, then randomly distributing $74$ balls to $89$ boxes, where each ball has an equal likelihood to be in any box, seems like a reasonable model.
The probability that ball number $i$ is not in box number $j$ is then $88/89$, and the probability that box $j$ is empty is $(88/89)^{74} \approx 0.4333684068.$
Let $m = 74$ be the number of balls and $n = 89$ the number of boxes. Let $I_{jk}$ be $1$ if there are $k$ balls in box $j$, $0$ otherwise. Let $N_k$ be the number of boxes containing exactly $k$ balls. Then $$N_k = \sum_{1\leq j\leq n} I_{jk}.$$ Observe that $E[I_{jk}]$ is the probability of $k$ balls in box $j$, and that all boxes have an equal chance to contain $k$ balls. So if we let $X_j$ be the number of balls in box $j$, then $$E[N_k] = \sum_{1\leq j\leq n} E[I_{jk}] = \sum_{1\leq j\leq n} P[X_j = k] = nP[X_1 = k].$$
Therefore the expected number of innings with no baserunners would be $89 (88/89)^{74} \approx 38.5697882019.$ Compare to the observed value, $42$.
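As a rough check on the balls-in-boxes model, here is a minimal Python sketch that computes the exact expected counts for every $k$, not just $k = 0$. It assumes each of the $74$ balls lands in a given box independently with probability $1/89$, so the count in one box is binomial. The names `binom_pmf`, `expected`, and `observed` are only illustrative.

```python
from math import comb

m, n = 74, 89  # balls (baserunners) and boxes (innings)
observed = {0: 42, 1: 27, 2: 14, 3: 5, 4: 1}  # counts from the question


def binom_pmf(k, m, n):
    """P[X_1 = k] when each of m balls lands in box 1 with probability 1/n."""
    return comb(m, k) * (1 / n) ** k * (1 - 1 / n) ** (m - k)


# Expected number of boxes holding exactly k balls: n * P[X_1 = k]
expected = {k: n * binom_pmf(k, m, n) for k in range(6)}

for k, e in expected.items():
    print(f"k={k}: expected {e:.4f}, observed {observed.get(k, 0)}")
```

The $k = 0$ row should reproduce the $89 (88/89)^{74}$ figure above; the other rows come from the same formula.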
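With $m$ balls and $n$ boxes, the number of balls in each box is exactly binomial with $m$ trials and success probability $1/n$. It is reasonably well approximated by a Poisson distribution with mean $m/n$ when $m$ and $n$ are much larger than $k$ (which they are for all the numbers of interest in this problem):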
$$ P[X_j = k] \approx \frac{1}{k!} \left( \frac mn \right)^k e^{-m/n}.$$
For $m=74$, $n=89$, and $k = 0, 1, 2, 3, 4, 5$, we get
\begin{array}{crr}
k & P[X_1 = k] \quad & n P[X_1 = k]\quad\\
0 &0.4354128253 &38.7517414554\\
1 &0.3620286413 &32.2205490753\\
2 &0.1505062891 &13.3950597279\\
3 &0.0417133535 &3.7124884639\\
4 &0.0086707533 &0.7716970403\\
5 &0.0014418781 &0.1283271483\\
\end{array}
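The Poisson figures can be reproduced with a few lines of Python. This is only a sketch under the approximation stated above, with rate $m/n = 74/89$; `poisson_pmf` and `poisson_expected` are names chosen for illustration.

```python
from math import exp, factorial

m, n = 74, 89   # baserunners and innings
lam = m / n     # Poisson rate: mean baserunners per inning


def poisson_pmf(k, lam):
    """Poisson probability of exactly k events at rate lam."""
    return lam ** k * exp(-lam) / factorial(k)


# Expected number of innings with exactly k baserunners, k = 0..5
poisson_expected = [n * poisson_pmf(k, lam) for k in range(6)]

for k, e in enumerate(poisson_expected):
    print(f"k={k}: P = {poisson_pmf(k, lam):.10f}, expected innings = {e:.10f}")
```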
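So most of the numbers agree pretty well with the observed counts ($42, 27, 14, 5, 1$ for $k = 0, \ldots, 4$), although the model predicts somewhat fewer innings with no baserunners and somewhat more innings with exactly one than were observed.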
I considered a slightly different model of the problem. Rather than assume there were $74$ "baserunner" events that each had to find an inning in which to occur, I assume that each time a player goes to bat, he has some probability $p$ of becoming a baserunner. I then choose $p$ so that the expected number of baserunners is $74$.
A simplified model says that the inning ends when three players have failed to become baserunners. This assumes that the only way for a player to be "out" is while they are batting, not while they are running. This is slightly unrealistic (because players sometimes do get out while running) and overcounts the number of at-bats in $89$ innings. In this model the expected number of baserunners per inning is $3p/(1-p)$; setting that equal to $74/89$ gives $p = 74/341 \approx 0.2170087977$, and the number of baserunners per inning has the following distribution:
\begin{array}{crr}
k & P[X_1 = k] \quad & n P[X_1 = k]\quad\\
0 &0.4800325059 &42.7228930293\\
1 &0.3125138309 &27.8137309458\\
2 &0.1356365014 &12.0716486216\\
3 &0.0490571901 &4.3660899218\\
4 &0.0159687628 &1.4212198866\\
5 &0.0048515068 &0.4317841063\\
\end{array}
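The table appears consistent with a negative binomial distribution (number of successes before the third failure), so here is a small Python sketch under that assumption. The formula $P[X_1 = k] = \binom{k+2}{k} p^k (1-p)^3$ and the names `negbin_pmf`, `hits`, `innings`, and `outs` are my reading of the model, not something stated explicitly above.

```python
from math import comb

hits, innings, outs = 74, 89, 3

# Choose p so the mean baserunners per inning, outs * p / (1 - p),
# equals hits / innings. Solving gives p = hits / (hits + outs * innings).
mean = hits / innings
p = mean / (outs + mean)


def negbin_pmf(k, p, r=outs):
    """P[k successes before the r-th failure], success probability p per at-bat."""
    return comb(k + r - 1, k) * p ** k * (1 - p) ** r


print(f"p = {p:.10f}")
for k in range(6):
    prob = negbin_pmf(k, p)
    print(f"k={k}: P = {prob:.10f}, expected innings = {innings * prob:.10f}")
```

Changing `outs` or the way `p` is fitted would be one way to experiment with variations on this model.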
These figures are almost eerily close to the observed values.
It might be interesting to try to improve the model by accounting for the probability that an inning ends with fewer than three failed at-bats because baserunners get "out". There would then be fewer than $3\times 89$ failed at-bats, but the value of $p$ would be correspondingly greater.