According to this site, The MiniTab Blog, the results from hypothesis testing and confidence intervals always agree.
I found this to be true when testing for mean.
However, when testing for proportions, this does not seem to apply: even from the formulas you can see that they differ (the standard deviation is different). Do I have a misconception? And if not, which one would be a better way of testing?
Please explain in a simple way (I have just started this topic, with basic knowledge and some formulae at hand). I have found no answers online that I can understand so far.
(If you're interested, the question that puzzled me is as follows: A six came up 86 times in 420 rolls of a die. Comment on the fairness of the die. [I assume 95% confidence level/5% significance level])
I think you need to be clear about what test statistic you are using for your hypothesis, and what confidence interval, because there are a variety of choices available for a binomial proportion.
For a large-sample calculation as in your case ($n = 420$), and a hypothesized proportion $p_0 = 1/6$ that is not close to $0$ or $1$, a normal approximation to the binomial proportion is suitable, so that under the null hypothesis $$H_0 : p = p_0,$$ the sample proportion $\hat p = x/n$ is approximately $$\hat p \mid H_0 \sim \operatorname{Normal}\left(\mu = p_0, \sigma = \sqrt{\hat p(1-\hat p)/n}\right).$$ Then the test statistic $$Z \mid H_0 = \frac{\hat p - p_0}{\sqrt{\hat p(1-\hat p)/n}} \sim \operatorname{Normal}(0,1)$$ is compared against $z_{\alpha/2}^*$, the $100(1-\alpha/2)$ percentile of the standard normal distribution, and if $|Z| > z_{\alpha/2}^*$, we conclude the data furnishes sufficient evidence to reject $H_0$ at a significance level of $\alpha$. In your case, we observe $x = 86$, hence $\hat p = 0.204762$, and the test statistic is $$Z \approx 1.93474 < 1.95996 = z_{0.025}^*,$$ so we fail to reject $H_0$ at the $\alpha = 0.05$ significance level: there is insufficient evidence to suggest the die is unfair.
The associated confidence interval for this test is given by $$\hat p \pm z_{\alpha/2}^* \sqrt{\hat p (1 - \hat p)/n} = [0.16617, 0.243354],$$ which does contain $1/6$, so there is no contradiction here. Indeed, we see that there cannot be a contradiction, because $$\hat p - z_{\alpha/2}^* \sqrt{\hat p (1 - \hat p)/n} \le p_0 \le \hat p + z_{\alpha/2}^* \sqrt{\hat p (1 - \hat p)/n}$$ if and only if $$-z_{\alpha/2}^* \le \frac{p_0 - \hat p}{\sqrt{\hat p (1 - \hat p)/n}} \le z_{\alpha/2}^*,$$ the middle expression of which is simply the test statistic $Z$; hence $$|Z| \le z_{\alpha/2}^*.$$
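As a rough numerical check of the Wald test and its interval, here is a minimal Python sketch using only the standard library. The function name `wald_test_and_interval` is made up for illustration, and `statistics.NormalDist` assumes Python 3.8 or later.

```python
from math import sqrt
from statistics import NormalDist


def wald_test_and_interval(x, n, p0, alpha=0.05):
    """Wald z statistic and the matching confidence interval for a proportion."""
    p_hat = x / n
    # Wald standard error: built from the observed proportion p_hat
    se = sqrt(p_hat * (1 - p_hat) / n)
    # Two-sided critical value, e.g. about 1.95996 for alpha = 0.05
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    z = (p_hat - p0) / se
    return z, z_crit, (p_hat - z_crit * se, p_hat + z_crit * se)


z, z_crit, (lower, upper) = wald_test_and_interval(86, 420, 1 / 6)
reject = abs(z) > z_crit            # decision of the hypothesis test
inside = lower <= 1 / 6 <= upper    # is p0 inside the interval?

print(f"Z = {z:.5f}, critical value = {z_crit:.5f}, reject H0: {reject}")
print(f"Wald CI = [{lower:.5f}, {upper:.5f}], contains p0: {inside}")
```

Whatever the data, `reject` and `inside` should never both be true or both be false, because both are decided by the same inequality $|Z| \le z_{\alpha/2}^*$.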
But as I said at the beginning, you need to be clear about what statistic you are using. The one we used above is the Wald test statistic. If we use the (Wilson) score statistic, then $$Z \mid H_0 = \frac{\hat p - p_0}{\sqrt{p_0 (1 - p_0)/n}} \sim \operatorname{Normal}(0,1),$$ where you will notice that the standard deviation does not use the empirical (observed) proportion $\hat p$, but the null proportion $p_0$. This would give us $$Z \approx 2.09489 > 1.95996 = z_{\alpha/2}^*,$$ and we would reject $H_0$ at the $\alpha = 0.05$ level. Note that this conclusion differs from the one reached under the Wald test.
Then, what is the corresponding 95% confidence interval? It is not the Wald interval we calculated above. Instead, it is $$\left(1 + (z_{\alpha/2}^*)^2/n\right)^{-1} \left( \hat p + \frac{(z_{\alpha/2}^*)^2}{2n} \pm z_{\alpha/2}^* \sqrt{\frac{\hat p (1 - \hat p)}{n} + \frac{(z_{\alpha/2}^*)^2}{4n^2}} \right).$$ This is considerably more complicated than the Wald interval, and we also observe that unlike the Wald interval, the midpoint of the score interval is not the sample proportion $\hat p$ (the interval is asymmetric about $\hat p$). This gives us a 95% CI of approximately $$[0.16893, 0.24595],$$ and we see that $p_0 = 1/6 \approx 0.16667$ is not contained in this interval, again consistent with the corresponding hypothesis test.
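The score test and the interval obtained by inverting it can be sketched in Python in the same way. This is only an illustration: the function name `score_test_and_wilson_interval` is invented here, and `statistics.NormalDist` assumes Python 3.8 or later. Note that the centre and the half-width are both divided by $1 + (z_{\alpha/2}^*)^2/n$.

```python
from math import sqrt
from statistics import NormalDist


def score_test_and_wilson_interval(x, n, p0, alpha=0.05):
    """Score z statistic and the Wilson (score) interval for a proportion."""
    p_hat = x / n
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    # Score statistic: the standard error uses the null proportion p0
    z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
    # Wilson interval: solve |p_hat - p| <= z_crit * sqrt(p (1 - p) / n) for p
    shrink = 1 + z_crit**2 / n
    center = (p_hat + z_crit**2 / (2 * n)) / shrink
    half_width = (
        z_crit * sqrt(p_hat * (1 - p_hat) / n + z_crit**2 / (4 * n**2)) / shrink
    )
    return z, z_crit, (center - half_width, center + half_width)


z, z_crit, (lower, upper) = score_test_and_wilson_interval(86, 420, 1 / 6)
reject = abs(z) > z_crit
inside = lower <= 1 / 6 <= upper

print(f"Z = {z:.5f}, critical value = {z_crit:.5f}, reject H0: {reject}")
print(f"Wilson CI = [{lower:.5f}, {upper:.5f}], contains p0: {inside}")
```

For the die data the test rejects and the interval excludes $1/6$, so the test and its own interval agree here just as the Wald pair did.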
This leads one to ask, "if the conclusions are different, which test statistic is the 'right' one to use?" Generally speaking, the Wald test (and its CI) in this case has less power to detect the unfairness of the die. This is because the Wald standard error $\sqrt{\hat p (1 - \hat p)/n}$ is larger than it "should" be if indeed the die were truly fair, since the observed proportion was larger than the expected proportion under $H_0$: $$\hat p \approx 0.204762 > 0.166667 \approx p_0.$$ Both proportions are below $1/2$, so $\hat p (1 - \hat p) > p_0 (1 - p_0)$, and the larger denominator makes the Wald $|Z|$ smaller than the score $|Z|$.
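To see the size of this effect for the die data, the two standard errors can be compared directly. This is a small sketch with illustrative variable names, not a general rule about which test to prefer.

```python
from math import sqrt

n, x, p0 = 420, 86, 1 / 6
p_hat = x / n

se_wald = sqrt(p_hat * (1 - p_hat) / n)   # uses the observed proportion
se_score = sqrt(p0 * (1 - p0) / n)        # uses the null proportion

z_wald = (p_hat - p0) / se_wald
z_score = (p_hat - p0) / se_score

print(f"Wald SE  = {se_wald:.6f}, Z = {z_wald:.5f}")
print(f"Score SE = {se_score:.6f}, Z = {z_score:.5f}")
```

The numerator $\hat p - p_0$ is the same in both statistics, so the whole difference between the two conclusions comes from the denominator.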