Election Data and Combining Gaussians


I'm writing a paper and have this data problem from the 2012 presidential election.

In 2012, there were 4384 counties and 50 states. Obama was elected, and the standard deviation of his vote percentage computed from the county data is ~15 (percentage points), while the standard deviation computed from the state data is ~12.

As counties are gathered into states, variation decreases, which makes sense, and I know how to add Gaussians, but I can't quite figure out how to set this problem up.

How can I calculate the 12 from the 15? And is this similar to some standard statistics question that I can reference? Thanks!

Accepted Answer:

Suppose you have a population of 5 million subjects with 'scores' $X_i$ distributed $\mathsf{Norm}(\mu = 100, \sigma=3795).$ [The choice of the population mean is arbitrary because we are interested only in standard deviations.]

Step 1: They are sorted into 50 bins with 100,000 subjects each and we take the 50 bin averages $\bar X_{100000}.$ These averages have population SD $\sigma_1 = 3795/\sqrt{100000} \approx 12.$

Step 2: After mixing the subjects at random, we sort them a second time into 5000 bins with 1000 subjects each and take the 5000 bin averages $\bar X_{1000}.$ These have $\sigma_2 = 3795/\sqrt{1000} \approx 120.$

Thus with the probability model you proposed in your Comment, the averages based on larger groups have smaller variances.
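As a quick numerical check of the $\sigma/\sqrt{n}$ scaling in Steps 1 and 2 (here in Python for illustration; the $\sigma = 3795$ and bin sizes are the values assumed above):

```python
import math

sigma = 3795        # population SD assumed in the model above
n_big = 100_000     # subjects per bin in Step 1 (50 bins)
n_small = 1_000     # subjects per bin in Step 2 (5000 bins)

sigma1 = sigma / math.sqrt(n_big)    # SD of averages over the large bins
sigma2 = sigma / math.sqrt(n_small)  # SD of averages over the small bins

print(round(sigma1, 2))  # -> 12.0
print(round(sigma2, 2))  # -> 120.01
```

This reproduces the ~12 and ~120 quoted above, which is the whole point: only the bin size changes between the two steps.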


This can be simulated in R statistical software using two matrices: MAT1 has 50 rows (bins) and 100,000 columns (subjects in each bin). We take the 50 row averages a1 and check whether their sample standard deviation is near $\sigma_1 \approx 12.$ Because there are only 50 row averages, we should not expect an exact match.

Upon reapportionment, we make MAT2 with 5000 rows (bins) and 1000 columns, find the 5000 row averages a2, and compare their sample SD with $\sigma_2 \approx 120.$

set.seed(313)  # retain this row to repeat exact same simul, delete for fresh run
x1 = rnorm(5*10^6, 100, 3795)  # 5 million subject scores
MAT1 = matrix(x1, nrow=50);  dim(MAT1)
[1]     50 100000              # verify matrix is 50 x 100000
a1 = rowMeans(MAT1)            # 50 row means
summary(a1);  length(a1);  sd(a1)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   59.33   89.22   95.40   96.25  104.50  129.60 
[1] 50
[1] 13.68891                   # reasonably close to 12
x2 = sample(x1)                # scramble subjects
MAT2 = matrix(x2, nrow=5000);  dim(MAT2)
[1] 5000 1000 
a2 = rowMeans(MAT2)
summary(a2);  length(a2);  sd(a2)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
-386.10   15.35   99.00   96.25  176.50  625.60 
[1] 5000
[1] 119.859                    # close to 120

The only difference between the simulation procedure and your political data is that the subjects in states are not randomly scrambled into counties. But the simulation is an accurate imitation of the scenario you proposed in your comment.

Addendum per Comment: If one has a random sample of observations $X_i$ with $E(X_i)=\mu$ and $SD(X_i) = \sigma$ and one uses the sample mean $\bar X$ to estimate the population mean $\mu,$ then the standard deviation $SD(\bar X) = \sigma/\sqrt{n}$ is called the 'standard error of the mean'. Sometimes $\sigma$ is unknown, and so it is estimated by the sample standard deviation $S = \sqrt{\frac{1}{n-1}\sum_{i=1}^n (X_i - \bar X)^2}.$ Then the estimated standard error $S/\sqrt{n}$ is also called the 'standard error'. (The word 'estimated' is dropped when obvious from context, and sometimes even when not obvious.)
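A minimal sketch of the estimated standard error $S/\sqrt{n}$ described above, in Python for illustration (the data are arbitrary made-up numbers, not election data):

```python
import math

x = [98.0, 101.5, 100.2, 97.8, 102.4, 99.1]  # hypothetical sample
n = len(x)
xbar = sum(x) / n  # sample mean

# sample standard deviation S, with the n-1 (Bessel) denominator
S = math.sqrt(sum((xi - xbar) ** 2 for xi in x) / (n - 1))

se = S / math.sqrt(n)  # estimated standard error of the mean
print(xbar, S, se)
```

In R this is simply `sd(x)/sqrt(length(x))`, since R's `sd` already uses the $n-1$ denominator.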

When the sample size quadruples, the standard error is halved. In your example, the sample size increased by a factor of 100, so the standard error decreased by a factor of 10. You can use this terminology and formula to find a suitable statistics book to reference in your article.
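The $1/\sqrt{n}$ behavior can also be checked empirically. This Python sketch (with arbitrary parameters, not your election figures) simulates many sample means at sizes $n$ and $4n$ and compares their spreads:

```python
import math
import random
import statistics

random.seed(313)
mu, sigma = 50.0, 15.0   # hypothetical population parameters
n, reps = 100, 2000      # base sample size; number of simulated means

def sample_mean(size):
    """Mean of one random sample of the given size."""
    return sum(random.gauss(mu, sigma) for _ in range(size)) / size

# SD of sample means at size n and at size 4n
sd_n = statistics.stdev(sample_mean(n) for _ in range(reps))
sd_4n = statistics.stdev(sample_mean(4 * n) for _ in range(reps))

print(sd_n, sd_4n, sd_n / sd_4n)  # ratio should be near 2
```

Quadrupling the sample size roughly halves the spread of the means, matching $\sigma/\sqrt{n}$.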