I am trying to understand the following statement about a collection of independent and identically distributed Bernoulli random variables.
We have $\theta$ as the probability of success of the Bernoulli random variable. Generally we can think of a specific observation of the probability parameter estimate as a random variable, $\hat{\theta}$, that can be estimated as $\hat{\theta} = \bar{Y} = \frac{1}{n}\sum_{i=1}^n Y_i$, where the $Y_i$ are Bernoulli random variables. We can see that our parameter estimate (a random variable), which equals the sample mean of the $Y_i$, approximately follows a normal distribution $\mathcal{N}(\theta,\theta(1-\theta)/n)$, and so we can use the z-statistic to build confidence intervals for the Bernoulli parameter $\theta$.
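To make the statement concrete, here is a minimal Python sketch (my own illustration, not part of the quoted statement; the true $\theta$, sample size, and random seed are arbitrary choices) that simulates i.i.d. Bernoulli draws, computes $\hat{\theta} = \bar{Y}$, and builds the z-based confidence interval described above:

```python
import math
import random

random.seed(42)   # fixed seed so the sketch is reproducible
theta = 0.3       # assumed true success probability (illustrative)
n = 1000          # assumed sample size (illustrative)

# Draw n i.i.d. Bernoulli(theta) observations Y_1, ..., Y_n.
y = [1 if random.random() < theta else 0 for _ in range(n)]

# Point estimate: theta_hat is the sample mean of the Y_i.
theta_hat = sum(y) / n

# By the CLT, theta_hat ~ N(theta, theta*(1-theta)/n) approximately.
# Plugging theta_hat in for theta gives the usual 95% z-interval.
z = 1.96
se = math.sqrt(theta_hat * (1 - theta_hat) / n)
ci = (theta_hat - z * se, theta_hat + z * se)

print(theta_hat, ci)
```

With a reasonably large $n$, the printed interval will typically cover the true $\theta$ used in the simulation.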
I am from a programming background and new to statistics. I think I understand that in statistics, parameters refer to populations, unlike in programming where parameters are passed as inputs to a function.
I am having trouble processing several things about the above statement.
What is meant by a specific observation of the probability parameter estimate? How does one observe an estimate? Would "observing" mean "calculating" here?
What name do I use to describe the inputs to this normal distribution $\mathcal{N}(\theta,\theta(1-\theta)/n)$ ?
As a programmer I would call them parameters. But I am understanding that is wrong in statistics.
I understand that the variance of a binomial distribution is given by $np(1-p)$, and I know that with the CLT we divide the variance by $n$. However, shouldn't that mean the input to the normal distribution would be just $\theta(1-\theta)$, since the $n$ would cancel out?
Is the $p$ in my understanding of a binomial probability the same as $\theta$ here?
I am understanding that
- $\hat{\theta}$ means "a point estimate of $\theta$".
- $\bar{Y}$ means "the sample mean of the $Y_i$".
[Update]
From River's answer I see that in $\frac{1}{n}\sum_{i=1}^n Y_i$ the fraction is causing a scaling, whereas to me it looks like ordinary multiplication. Is there any special notation to differentiate scaling and/or translation from ordinary multiplication and addition operations?
Let's take a specific example. Suppose a pollster wants to figure out the true percentage $\theta$ or $p$ of voters in their country who approve of Polly Politician. They pick out a series of random people, Voter 1, Voter 2, Voter 3, ..., Voter $n$ and ask them "Do you approve of Polly? (Y/N)". The responses they get are the random variables $Y_1, Y_2, Y_3, ..., Y_n$, where the value of $Y_i = 0$ if the voter does not approve, and $Y_i = 1$ if the voter does approve. If the sample is a) small enough compared to the entire population, and b) randomly enough selected from the population, then we may assume that $Y_i$ are approximately i.i.d. Bernoulli with $P(Y_i = 1) = \theta$.
A "specific observation" would be the actual observed sequence of responses (or corresponding $0$'s and $1$'s) gotten from Voters 1 through $n$ here when the pollster asked their opinion. We can loosely speak of any statistic based on/calculated from these observations, such as the sample mean $\hat{\theta}$ which we are using to estimate the true population parameter $\theta$, as also having been "observed" when the pollster performed their experiment.
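In programming terms (a hypothetical Python sketch; these particular responses are invented for illustration), a "specific observation" is just the concrete list of 0's and 1's the pollster recorded, and "observing" the estimate amounts to calculating the sample mean from that list:

```python
# One specific observation: the realized responses of 10 polled voters.
# (These particular values are made up for illustration.)
responses = [1, 0, 1, 1, 0, 1, 0, 0, 1, 1]
n = len(responses)

# "Observing" the estimate = calculating it from the observed data.
theta_hat = sum(responses) / n
print(theta_hat)  # → 0.6
```

A different batch of voters would yield a different list and hence a different observed value of $\hat{\theta}$, which is why $\hat{\theta}$ is treated as a random variable.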
You would still call $\theta$ and $\theta(1 - \theta)/n$ "parameters" of the distribution, statistically.
$n\hat{\theta} = Y_1 + ... + Y_n$ is binomial with mean $n\theta$ and variance $n\theta(1-\theta)$, which means (by the central limit theorem) it is well approximated by a normal random variable $X_n$ with that same mean and variance:
$$n\hat{\theta} \approx X_n, \text{ where } X_n \sim \mathcal{N} (n\theta, n\theta(1-\theta)).$$
Dividing by $n$ to get the sample mean $\hat{\theta}$ gives us the approximation
$$\hat{\theta} \approx \frac{1}{n} X_n,$$
and $\frac{1}{n} X_n$ will still be normal. Its mean has been divided by $n$, but its variance has been divided by $n^2$ (reason: variance is the average squared distance from the mean; if we divide all the distances by $n$, we divide the squared distances by $n^2$). So our normal approximation $\frac{1}{n} X_n$ to $\hat{\theta}$ has distribution $\frac{1}{n} X_n \sim \mathcal{N}(\theta,\theta(1-\theta)/n)$.
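You can verify this $1/n$ vs. $1/n^2$ scaling numerically with a small Monte Carlo sketch (my own illustration; the values of $\theta$, $n$, and the number of replications are arbitrary). The empirical variance of the simulated sample means should land near $\theta(1-\theta)/n$, not $\theta(1-\theta)$:

```python
import random

random.seed(0)
theta, n, reps = 0.3, 50, 20000  # illustrative choices

# For each replication, draw n Bernoulli(theta) values and record
# the sample mean theta_hat = (1/n) * sum(Y_i).
means = []
for _ in range(reps):
    s = sum(1 if random.random() < theta else 0 for _ in range(n))
    means.append(s / n)

emp_mean = sum(means) / reps
emp_var = sum((m - emp_mean) ** 2 for m in means) / reps

print(emp_mean)  # close to theta = 0.3
print(emp_var)   # close to theta*(1-theta)/n = 0.0042
```

The sum $n\hat{\theta}$ has variance $n\theta(1-\theta) = 10.5$ here, and dividing by $n = 50$ scales that variance by $1/n^2$, giving $0.0042$.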
Yep, different notation for the same thing: the $p$ in your binomial formula and the $\theta$ here both denote the success probability.