First, let me state the original problem (in my own wording):
Describe the decision procedure for testing the hypothesis about the parameter $p$ (success rate) of a Bernoulli distribution. The hypotheses are
\begin{gather} H_0: p = p_0 \\ H_1: p \ne p_0 \end{gather}
where $p_0$ is a fixed number. If $Y = \sum_{i=1}^n X_i$, where $X_i \sim \text{Bernoulli}(p)$ are i.i.d., is available, describe the decision procedure based on $Y$ that will guarantee that the probability of type 1 error does not exceed $\alpha$. (Assume $n$ is large.)
First, the solution without the assumption that $n$ is large:
Find two critical values $Y_{lower}$ and $Y_{upper}$ such that $P(Y_{lower} < U < Y_{upper}) \ge 1 - \alpha$, where $U \sim \text{Binomial}(n, p_0)$. (Exact equality is generally unattainable because $Y$ is discrete, and there are many possible choices.) We accept $H_0$ if $Y_{lower} < Y < Y_{upper}$.
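A minimal sketch of this exact (non-asymptotic) rule in Python, assuming an equal-tailed split of $\alpha$ (other splits are possible; the function names are my own):

```python
# Exact two-sided binomial test, equal-tailed version.  Because Y is
# discrete, we take the largest Y_lower and smallest Y_upper whose
# combined tail mass is <= alpha, so P(Y_lower < U < Y_upper) >= 1 - alpha.
from math import comb

def binom_cdf(n, p):
    """Return the list cdf with cdf[k] = P(U <= k), U ~ Binomial(n, p)."""
    cdf, s = [], 0.0
    for k in range(n + 1):
        s += comb(n, k) * p**k * (1 - p)**(n - k)
        cdf.append(s)
    return cdf

def critical_values(n, p0, alpha):
    cdf = binom_cdf(n, p0)
    # Largest Y_lower with P(U <= Y_lower) <= alpha/2 (or -1 if none).
    y_lower = max((k for k in range(n + 1) if cdf[k] <= alpha / 2),
                  default=-1)
    # Smallest Y_upper with P(U >= Y_upper) <= alpha/2 (or n+1 if none),
    # using P(U >= k) = 1 - P(U <= k - 1).
    y_upper = min((k for k in range(n + 1)
                   if 1 - (cdf[k - 1] if k > 0 else 0.0) <= alpha / 2),
                  default=n + 1)
    return y_lower, y_upper

def accept_H0(y, n, p0, alpha=0.05):
    y_lower, y_upper = critical_values(n, p0, alpha)
    return y_lower < y < y_upper
```

For example, with $n = 100$, $p_0 = 0.5$, $\alpha = 0.05$ this gives $Y_{lower} = 39$ and $Y_{upper} = 61$.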
Then if $n$ is assumed large and we are allowed to approximate the distribution of $Y$ with a normal distribution, the method simplifies to
Find two critical values $Z_{lower}$ and $Z_{upper}$ such that $P(Z_{lower} < U < Z_{upper}) = 1 - \alpha$ where $U \sim \mathcal N(0, 1)$. We accept $H_0$ if $Z_{lower} < Z < Z_{upper}$, where $Z = \sqrt{n}\frac{(Y/n) - p_0}{\sqrt{p_0(1-p_0)}}$.
This simplifies the problem slightly if we take the symmetric interval around the mean, i.e., $Z_{lower} = -Z_{upper}$.
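The symmetric large-sample version can be sketched in a few lines (the helper name and default $\alpha$ are my own choices):

```python
# Large-sample Z test with symmetric critical values
# Z_upper = -Z_lower = z_{1 - alpha/2}.
from math import sqrt
from statistics import NormalDist

def z_test_accepts_H0(y, n, p0, alpha=0.05):
    p_hat = y / n
    z = sqrt(n) * (p_hat - p0) / sqrt(p0 * (1 - p0))
    z_upper = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return -z_upper < z < z_upper
```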
Here comes my question:
I was told by a teacher that in practice, some people use $$ \tilde Z = \sqrt n \frac{(Y/n) - p_0}{\sqrt{(Y/n)(1 - (Y/n))}} $$ instead of $Z$. In other words, $p_0$ in the denominator of $Z = \sqrt{n}\frac{(Y/n) - p_0}{\sqrt{p_0(1-p_0)}}$ is replaced by $\hat p = \frac{Y}{n}$.
His explanation was that $\sqrt{p_0(1-p_0)}$ might not be representative of the true variance, and the sample variance $\sqrt{\hat p(1 - \hat p)}$ may be better. I don't think this reasoning is valid because we are testing the hypothesis!
However, after some reflection, I am starting to think that using $\tilde Z$ might not be such a wrong thing because we did employ the central limit approximation, and using $\tilde Z$ might correct the approximation in a proper way. Of course this would somewhat contradict the assumption that the problem allows you to use normal approximation, but is this kind of correction (if it's really a correction) valid to some degree?
Why I think it might be a correction in the right direction: Suppose $\frac 12 < \hat p < p_0$. Then $\sqrt{\hat p(1 - \hat p)} > \sqrt{p_0(1 - p_0)}$, so $Z < \tilde Z < 0$, making it more probable to accept $H_0$ using $\tilde Z$ than $Z$. On the other hand, if $\frac 12 < p_0 < \hat p$, it becomes less probable to accept $H_0$ using $\tilde Z$.
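The sign argument above can be checked numerically; the values $n = 100$, $p_0 = 0.7$, $\hat p = 0.6$ are hypothetical, chosen so that $\frac 12 < \hat p < p_0$:

```python
# Quick numeric check: with 1/2 < p_hat < p0 we should get Z < Z-tilde < 0.
from math import sqrt

n, p0, p_hat = 100, 0.7, 0.6
z  = sqrt(n) * (p_hat - p0) / sqrt(p0 * (1 - p0))        # denominator uses p0
zt = sqrt(n) * (p_hat - p0) / sqrt(p_hat * (1 - p_hat))  # denominator uses p_hat
assert z < zt < 0  # Z < Z-tilde < 0, as claimed
```

Here $z \approx -2.18$ while $\tilde z \approx -2.04$, so at $\alpha = 0.05$ (critical value $1.96$) the $Z$ test rejects while the $\tilde Z$ test does not.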
When constructing a hypothesis test, you have to balance Type I error and Type II error as best you can. The problem is that the two are usually inversely related (as in this problem): decreasing the probability of Type I error increases the probability of Type II error. So the way we construct hypothesis tests is to fix a threshold for how much Type I error probability we will allow (since it is impossible to get it to 0); this threshold is called the significance level. From there, we construct a test at that significance level with the most power, where power $= 1 - P(\text{Type II error})$. In general, as in this case, you look for a test that actually attains the significance level, since such a test will have more power than one whose Type I error probability falls below the significance level.
Now, as you know, when we do hypothesis testing we assume the null hypothesis is true (innocent until proven guilty), and our test should reject a true null hypothesis with probability at most the significance level (i.e., $P(\text{Type I error}) \leq \alpha$). This is why $Z$ usually performs better than $\tilde Z$. Remember, our highest priority is to ensure that $P(\text{Type I error}) \leq \alpha$. If the null hypothesis is true, the variance is exactly $\operatorname{Var}(\frac{Y}{n}) = \frac{p_0(1-p_0)}{n}$, not merely approximately $\frac{\hat p(1-\hat p)}{n}$. Since $Z$ relies on fewer approximations than $\tilde Z$, its Type I error probability will be closer to the nominal significance level.
The reason we bring up $\tilde Z$ at all is that it is the general-purpose test for a sample average (which ends up being a proportion in the Bernoulli/Binomial situation); but since you can compute the exact variance when the null hypothesis is true, it is not the best test to use here.
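A small Monte Carlo experiment (my own illustration, with hypothetical values $n = 30$, $p_0 = 0.1$) estimates $P(\text{Type I error})$ under $H_0$ for both statistics; with small $n$ and $p_0$ far from $\frac 12$, the $\tilde Z$ (Wald) statistic typically over-rejects while $Z$ stays close to the nominal level:

```python
# Monte Carlo estimate of the Type I error rate of the Z (score) and
# Z-tilde (Wald) tests under H0: p = p0, at nominal alpha = 0.05.
import random
from math import sqrt

def type1_rates(n=30, p0=0.1, reps=20000, seed=0):
    rng = random.Random(seed)
    z_crit = 1.959963984540054  # z_{0.975}
    rej_z = rej_zt = 0
    for _ in range(reps):
        y = sum(rng.random() < p0 for _ in range(n))  # Y ~ Binomial(n, p0)
        p_hat = y / n
        z = sqrt(n) * (p_hat - p0) / sqrt(p0 * (1 - p0))
        if abs(z) > z_crit:
            rej_z += 1
        if 0 < y < n:
            zt = sqrt(n) * (p_hat - p0) / sqrt(p_hat * (1 - p_hat))
            if abs(zt) > z_crit:
                rej_zt += 1
        else:
            # Z-tilde is undefined when p_hat is 0 or 1; here we count
            # such samples as rejections (one common convention).
            rej_zt += 1
    return rej_z / reps, rej_zt / reps
```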
Here are also some resources about the Wald and score tests: http://ocw.jhsph.edu/courses/methodsinbiostatisticsii/PDFs/lecture18.pdf http://www.biostat.umn.edu/~dipankar/bmtry711.11/lecture_02.pdf