Sampling distribution. How large does the sample size need to be?


I am building a data analysis model for clinic visits. Basically, I want to capture the patient arrival distribution for each specific visit type at each specific time. Say previous data give us 10 records for visit type A scheduled at 8am: $[-20,-16,-13,-2,0,1,5,10,30,31]$, where $-20$ denotes arriving 20 minutes early and $1$ denotes arriving 1 minute late.

I want to build a discrete distribution (probability mass function, PMF) to summarize this pattern. Specifically, the support must consist of multiples of 5 minutes (this is a requirement from the downstream model). Accordingly, we can round all the numbers so that $$ [-20,-16,-13,-2,0,1,5,10,30,31] \approx [-20,-15,-15,0,0,0,5,10,30,30].$$ Thus, the PMF we get is $$P(x=-20) = 0.1,\; P(x=-15)=0.2,\; P(x=0)=0.3,\; P(x=5) =0.1,\; P(x=10)=0.1,\; P(x=30)=0.2.$$
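For concreteness, this rounding-and-tabulating step can be sketched in R (the variable names are just illustrative):

```r
x  <- c(-20, -16, -13, -2, 0, 1, 5, 10, 30, 31)  # arrival offsets in minutes
x5 <- 5 * round(x / 5)           # round each offset to the nearest multiple of 5
pmf <- table(x5) / length(x5)    # empirical PMF on the 5-minute grid
pmf
```

This reproduces the rounded values $[-20,-15,-15,0,0,0,5,10,30,30]$ and the PMF above.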

This is my initial thought. My questions are: 1) How large does the sample size have to be for this method to make sense? For the example above, it seems to me the underlying distribution should be Gaussian, and we are missing data points such as $x=-10,-5$, etc. 2) If 10 samples for visit type A scheduled at 8am are not enough, should I enlarge the sample by pooling? For example, include type B scheduled at 8am, or include type A scheduled at 8:30am? Should I calculate correlation coefficients for time and visit type to decide which dimension to pool over first?


Best Answer

In this kind of effort one must almost always make assumptions, and from what you say, it is difficult to know exactly what assumptions to make.

To begin, it may make sense to assume that the times are normally distributed in the vicinity of 0. A Shapiro-Wilk test for normality does not reject the null hypothesis that the data are normal, but ten observations are hardly enough to judge normality.
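For reference, that test can be run in R as follows; with so few observations the test has little power, so a non-rejection here is weak evidence at best:

```r
x <- c(-20, -16, -13, -2, 0, 1, 5, 10, 30, 31)  # the ten arrival offsets (minutes)
shapiro.test(x)   # H0: the data come from a normal distribution
```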

Rounding to the nearest multiple of 5 only loses information. If that is really a requirement 'downstream' (wherever that may be), then it is best to do the rounding at the end of any analysis rather than at the start.

It is hard to say what degree of accuracy would 'make sense' in your situation, but I can give you an idea what kind of accuracy you can expect for $n = 10$ observations, assuming the data to be normal:

A 95% confidence interval for the population mean based on Student's t distribution is $(-9.9, 15.1)$ and a 95% confidence interval (CI) for the population standard deviation (SD) based on the chi-squared distribution is $(12.0, 31.9).$ The sample mean is $\bar X = 2.6$ and the sample SD is 17.5. These CIs are standard and you can read about them in many elementary and mid-level texts on applied statistics.

So the 'best-fitting' normal distribution would be $\mathsf{Norm}(\mu = 2.6, \sigma=17.5).$ But considering the sloppiness of the estimates for $\mu$ and $\sigma$ suggested by the CIs above, this procedure is less than totally promising.

If you are happy using this particular normal distribution, then you could discretize its probabilities using 12 intervals of width 5 centered at the numbers -25, -20, ..., 30, obtaining the respective probabilities 0.033, 0.050, ..., 0.034, according to the computation in R statistical software below. (The probabilities don't quite add to 1, but there is very little probability in the tails beyond $\pm 30.$)

int = seq(-27.5, 32.5, by = 5)          # 13 breakpoints defining 12 width-5 intervals
round(diff(pnorm(int, 2.6, 17.5)), 3)   # probability mass in each interval
 [1] 0.033 0.050 0.069 0.088 0.103 0.112 0.113 0.104 0.089 0.070 0.050 0.034

If you used the distribution $\mathsf{Norm}(5, 15),$ with parameters still inside the CIs above, your 12 probabilities would be as follows:

round(diff(pnorm(int, 5, 15)), 3)       # same intervals, alternative parameters
 [1] 0.018 0.033 0.055 0.081 0.106 0.125 0.132 0.125 0.106 0.081 0.055 0.033

Of course, there are many plausible choices of parameters within those CIs. But if you find the difference between these two lists to be alarming, then you clearly don't have enough data.

It seems to me this would work considerably better if you had more than 10 observations. Whether it is best to achieve that by (a) including both type A and type B appointments, (b) including appointments from various times of day, or (c) both, is a matter to be decided by someone who knows how similar behavior is across types and times.


Addendum: Computations of CIs in R are as follows:

t.test(x)

         One Sample t-test

data:  x
t = 0.47049, df = 9, p-value = 0.6492
alternative hypothesis: true mean is not equal to 0
95 percent confidence interval:
 -9.900906 15.100906                     # Here's the CI for the pop mean
sample estimates:
mean of x 
      2.6 

mean(x) + qt(c(.025,.975),9)*sd(x)/sqrt(10)
[1] -9.900906 15.100906                  # CI for pop mean again


sqrt(9*var(x)/qchisq(c(.975,.025),9))
[1] 12.01996 31.90265                    # And the CI for the pop SD

The latter CI is based on the fact that $\frac{(n-1)S^2}{\sigma^2} \sim \mathsf{Chisq}(df = n-1),$ where $S^2$ and $\sigma^2$ denote the sample and population variances, respectively. Both CIs assume data are normal.