Mean and Variance Estimation Methods: Whole Sample vs Subsets of Sample


Suppose there is a normal distribution with $\mu = 5$ and $\sigma = 5$. We have 100 random samples from this distribution.

  • Person 1 is given the 100 samples at once and is told to estimate $\mu$ and $\sigma$.
  • Person 2 is given the same 100 samples, but broken into 10 sets of 10. Person 2 estimates $\mu$ and $\sigma$ from each set, and then averages the estimates.

To write the estimation strategies more concisely:

  • Person 1 ($n=100$):

$$\hat{\mu} = \frac{1}{n}\sum_{i=1}^{n} x_i$$ $$\hat{\sigma}^2 = \frac{1}{n-1}\sum_{i=1}^{n} (x_i - \hat{\mu})^2$$

  • Person 2 ($n=100$, split into $m=10$ subsets of size $k = n/m = 10$, where $x_{ij}$ is the $i$-th observation in subset $j$):

$$\hat{\mu}_{\text{avg}} = \frac{1}{m}\sum_{j=1}^{m} \hat{\mu}_j = \frac{1}{m}\sum_{j=1}^{m} \left(\frac{1}{k}\sum_{i=1}^{k} x_{ij}\right)$$

$$\hat{\sigma}^2_{\text{avg}} = \frac{1}{m}\sum_{j=1}^{m} \hat{\sigma}^2_j = \frac{1}{m}\sum_{j=1}^{m} \left(\frac{1}{k-1}\sum_{i=1}^{k} (x_{ij} - \hat{\mu}_j)^2\right)$$
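As a worked check on these two strategies, here is a short derivation. It assumes the subsets are disjoint and all of the same size, which I write as $k = n/m$ (my notation), and that the data really are normal. Under those assumptions the two mean estimates coincide exactly:

$$\hat{\mu}_{\text{avg}} = \frac{1}{m}\sum_{j=1}^{m} \frac{1}{k}\sum_{i=1}^{k} x_{ij} = \frac{1}{mk}\sum_{j=1}^{m}\sum_{i=1}^{k} x_{ij} = \frac{1}{n}\sum_{i=1}^{n} x_i = \hat{\mu}$$

Both variance estimates are unbiased, but they rest on different degrees of freedom ($n-1$ versus $m(k-1) = n-m$). For normal data, where $(k-1)\hat{\sigma}^2_j/\sigma^2 \sim \chi^2_{k-1}$, this gives:

$$\operatorname{Var}\left(\hat{\sigma}^2\right) = \frac{2\sigma^4}{n-1}, \qquad \operatorname{Var}\left(\hat{\sigma}^2_{\text{avg}}\right) = \frac{2\sigma^4}{m(k-1)} = \frac{2\sigma^4}{n-m}$$

With $n=100$ and $m=10$ that is $2\sigma^4/99$ versus $2\sigma^4/90$, so in this normal, equal-subset setting any difference between the two people should show up only in the variance estimate, and it should be modest.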

Here is an R simulation for this situation:

mu <- 5
sigma <- 5
n_samples <- 100
n_pairs <- 10  # size of each subset; 100 / 10 also gives 10 subsets
n_iterations <- 100

# Preallocate vectors for the absolute estimation errors
person1_means_diff <- numeric(n_iterations)
person1_vars_diff <- numeric(n_iterations)
person2_means_diff <- numeric(n_iterations)
person2_vars_diff <- numeric(n_iterations)


for (i in 1:n_iterations) {
    # Generate samples 
    samples <- rnorm(n_samples, mean = mu, sd = sigma)
    
    # Person 1 
    person1_means_diff[i] <- abs(mean(samples) - mu)
    person1_vars_diff[i] <- abs(var(samples) - sigma^2)
    
    # Person 2: split into subsets of size n_pairs, estimate per subset, then average
    pair_means <- sapply(split(samples, rep(1:(n_samples/n_pairs), each = n_pairs)), mean)
    pair_vars <- sapply(split(samples, rep(1:(n_samples/n_pairs), each = n_pairs)), var)
    person2_means_diff[i] <- abs(mean(pair_means) - mu)
    person2_vars_diff[i] <- abs(mean(pair_vars) - sigma^2)
}

df_person1_diff <- data.frame(Iteration = 1:n_iterations, Mean = person1_means_diff, Variance = person1_vars_diff)
df_person2_diff <- data.frame(Iteration = 1:n_iterations, Mean = person2_means_diff, Variance = person2_vars_diff)

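library(ggplot2)  # required by plot_diff() below

# Returns a list of two line plots: absolute error of the mean and of the variance
plot_diff <- function(df, person) {
    p1 <- ggplot(df, aes(x = Iteration)) +
        geom_line(aes(y = Mean), color = "red") +
        labs(title = paste("Absolute Differences in Means for", person), x = "Iteration", y = "Absolute Difference in Mean") +
        theme_bw()
    
    p2 <- ggplot(df, aes(x = Iteration)) +
        geom_line(aes(y = Variance), color = "blue") +
        labs(title = paste("Absolute Differences in Variances for", person), x = "Iteration", y = "Absolute Difference in Variance") +
        theme_bw()
    
    list(p1, p2)
}

# Build the plots for both people; print an element (e.g. plots_person1[[1]]) to display it
plots_person1 <- plot_diff(df_person1_diff, "Person 1")
plots_person2 <- plot_diff(df_person2_diff, "Person 2")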
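Since line plots of per-iteration errors are hard to compare by eye, here is a minimal sketch of a numerical summary of the same experiment. It assumes equal-sized, disjoint subsets as in the simulation above. The names `subset_size`, `groups`, `res`, `max_mean_gap`, `var_of_var1` and `var_of_var2`, as well as the seed and the 2000 repetitions, are my own choices for illustration and not part of the original code.

```r
set.seed(1)  # arbitrary seed, only for reproducibility

mu <- 5
sigma <- 5
n_samples <- 100
subset_size <- 10      # hypothetical name: observations per subset
n_iterations <- 2000   # more repetitions than above, to stabilise the summary

# Subset label for each observation: 1,1,...,1,2,2,...,10
groups <- rep(seq_len(n_samples / subset_size), each = subset_size)

# One row per repetition: gap between the two mean estimates,
# Person 1's variance estimate, and Person 2's averaged variance estimate
res <- t(replicate(n_iterations, {
    x <- rnorm(n_samples, mean = mu, sd = sigma)
    c(mean_gap = abs(mean(x) - mean(tapply(x, groups, mean))),
      var1 = var(x),
      var2 = mean(tapply(x, groups, var)))
}))

max_mean_gap <- max(res[, "mean_gap"])  # largest disagreement between the two means
var_of_var1 <- var(res[, "var1"])       # spread of Person 1's variance estimate
var_of_var2 <- var(res[, "var2"])       # spread of Person 2's variance estimate

print(c(max_mean_gap = max_mean_gap,
        var_of_var1 = var_of_var1,
        var_of_var2 = var_of_var2))
```

Comparing `var_of_var1` with `var_of_var2` gives a single number per person for the variance estimate, and `max_mean_gap` shows whether the two mean estimates ever differ by more than floating-point rounding. Swapping `rnorm` for another sampler would be one way to probe non-normal cases.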


My Question: For finite sample sizes, are there any probability distributions (e.g. mixture distributions, https://en.wikipedia.org/wiki/Mixture_distribution) or situations in which Person 1's strategy significantly outperforms Person 2's strategy?

Thanks!
