Trend of median of ratios vs average of ratios

Suppose I want to plot a daily trend of support for 5 presidential candidates. For each day across a month, I have 20 small surveys, randomly selected from the whole country.

Instead of using raw numbers, I am using ratios: for each candidate, I compute the ratio of people supporting him out of the total sample.

Now suppose I create a daily trend graph for candidate X. I set up 3 approaches (please don't question the correctness of the approaches, as that is not the point):

  1. Sum all surveys together as if they were one big survey. Take the votes for candidate X per day out of the total per day, and look at the trend.

  2. In each of the 20 surveys, take the ratio of votes for candidate X out of that survey's total, and then average the 20 ratios.

  3. Same as 2, but instead of averaging the 20 survey ratios, take the median of the 20 ratios.

What I found out is that approaches 1 and 3 look more similar in trend while approach 2 is very different.

I don't understand why.

I understand why the average is different: when averaging, survey size is ignored and all surveys are treated equally, whereas in approach 1, which sums before taking the ratio, the bigger surveys are more dominant.
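A small sketch in R with made-up numbers illustrates this weighting. The survey sizes and vote counts below are purely illustrative assumptions, not data from my surveys. It suggests that the ratio from approach 1 is the same thing as a mean of the per-survey ratios weighted by survey size, whereas approaches 2 and 3 ignore size.

```r
# Hypothetical example: three surveys of very different sizes on one day
sizes <- c(10, 100, 1000)   # respondents per survey (assumed values)
votes <- c(6, 30, 250)      # respondents supporting candidate X (assumed values)

ratios <- votes / sizes     # per-survey ratios

pooled <- sum(votes) / sum(sizes)   # approach 1: one big survey
avg    <- mean(ratios)              # approach 2: unweighted mean of ratios
med    <- median(ratios)            # approach 3: median of ratios

# Approach 1 rewritten as a mean of the ratios weighted by survey size
weighted <- weighted.mean(ratios, w = sizes)

print(c(pooled = pooled, mean = avg, median = med, weighted = weighted))
```

With these particular numbers the small survey has an unusually high ratio, which pulls the unweighted mean upward much more than it moves the pooled ratio or the median.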

What I do not understand is why approach 3 would be any closer to approach 1. When using the median of ratios I also give no importance to survey size, so why wouldn't it be very different as well? I would think it should match approach 2 better, if anything.

There are 1 best solutions below

This is an issue of robustness: surveys with smaller sample sizes are more likely than larger surveys to be far from the true figure, as they have a larger standard error. When a small survey produces an extreme value for this reason:

  1. the effect on the "one big combined survey" is relatively small because the surveys with small sample sizes have less effect on the total

  2. the effect on the "simple mean" is larger because the surveys with a small sample size have the same weight as the others in calculating the average but are more likely to have an extreme value

  3. the effect on the "simple median" is smaller than on the simple mean because a survey with a small sample size and an extreme value only affects the direction of the error: the median value itself comes from a different survey with a less extreme value
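The standard-error point can be sketched numerically. The following is a minimal illustration, assuming a true proportion of 25% and three hypothetical survey sizes chosen only for this example; it uses the usual formula $\sqrt{p(1-p)/n}$ for the standard error of a sample proportion.

```r
# Standard error of a sample proportion: sqrt(p * (1 - p) / n)
p <- 0.25                    # true proportion, assumed for illustration
sizes <- c(10, 100, 1000)    # hypothetical survey sizes

se <- sqrt(p * (1 - p) / sizes)
names(se) <- paste0("n=", sizes)
print(round(se, 4))

# Ratio of the smallest survey's standard error to the largest survey's
print(unname(se[1] / se[3]))
```

Since the standard error scales with $1/\sqrt{n}$, a survey 100 times smaller has a standard error 10 times larger, which is why its ratio is far more likely to be the extreme one.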

To take a different example as an illustration, suppose three surveys are used, one with sample size $11$, one with sample size $101$ and one with sample size $1001$, all sampling something with true proportion of $25\%$. Using R and taking $50$ simulations (sorting them by the "simple mean" to emphasise what happens when it is extreme), you might get something like the following, where the plotted symbols $1,2,3$ are the ratios for the three surveys, the blue line is the "combined" ratio (largely driven by survey $3$), the red line is the "simple mean" (correlated with survey $1$ but typically only a third as extreme, as surveys $2$ and $3$ are usually closer to the true proportion), and the green line is the "simple median" (usually either survey $2$ or $3$).

The blue line is closest to the true proportion overall, followed by the green line, with the red line usually having the largest errors, so it should be no surprise that the blue line is usually closer to the green line than it is to the red line, as you have noted.

[Figure: survey proportions and averages]

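# Simulate one set of surveys and return the per-survey ratios
# followed by the three summary measures
surveys <- function(sizes, trueprop){
  yeses <- rbinom(length(sizes), sizes, trueprop)  # supporters in each survey
  props <- yeses / sizes                           # per-survey ratios
  comboratio <- sum(yeses) / sum(sizes)            # one big combined survey
  simplemean <- mean(props)                        # unweighted mean of ratios
  simplemedian <- median(props)                    # median of ratios
  return(c(props, comboratio=comboratio, 
           simplemean=simplemean, simplemedian=simplemedian))   
}

set.seed(2024)
sims <- replicate(50, surveys(c(11,101,1001), 0.25)) 
osm <- order(sims["simplemean",])  # sort simulations by the simple mean
# Plot the individual survey ratios as the symbols 1, 2, 3
matplot(t(sims[1:(nrow(sims)-3), osm]), col="black")
lines(sims["comboratio", osm], col="blue")
lines(sims["simplemean", osm], col="red")
lines(sims["simplemedian", osm], col="green")
# Horizontal reference line at the average combined ratio
abline(h=mean(sims["comboratio",]))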
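
As a rough numerical check of the ordering described above, the same kind of simulation can be repeated many more times and summarised instead of plotted. This is only a sketch: the helper `estimates`, the seed and the number of repetitions are arbitrary choices made for this illustration, and it redefines the simulation so that it runs on its own.

```r
# Hypothetical helper: the three summary estimates for one simulated set of surveys
estimates <- function(sizes, trueprop) {
  yeses <- rbinom(length(sizes), sizes, trueprop)
  props <- yeses / sizes
  c(combo = sum(yeses) / sum(sizes), mean = mean(props), median = median(props))
}

set.seed(1)        # arbitrary seed, used only for reproducibility
trueprop <- 0.25
sims <- replicate(10000, estimates(c(11, 101, 1001), trueprop))

# Root-mean-square error of each estimator against the true proportion
rmse <- sqrt(rowMeans((sims - trueprop)^2))
print(round(rmse, 4))

# Average absolute gap between the combined ratio and each of the other two
gap <- c(mean   = mean(abs(sims["combo", ] - sims["mean", ])),
         median = mean(abs(sims["combo", ] - sims["median", ])))
print(round(gap, 4))
```

Under these assumptions the combined ratio should show the smallest error, the simple median the next smallest and the simple mean the largest, with the combined ratio sitting closer on average to the median than to the mean.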