I have been studying the basics of the Hypothesis Test in my statistics class, more specifically, studying the Analysis of Variance F-test. My question has to do with the $p$-value. Thus, here is my question:
If the p-value of the test comes out to be a small number, this fact is taken as justification for "rejecting the null hypothesis." Why is this a reasonable conclusion?
I am not sure what my professor meant by a "small number", but after doing some research, it turns out that a p-value $\leq .05$ is commonly taken to mean that we can reject the null hypothesis. He remarked that if I said, for example, "this is the rule that statisticians use", it would not be proper justification. So, if I do end up with a "small number" for my p-value, why can I reject the null hypothesis?
It might be helpful to do this exercise on a more concrete example than the F-test... so let's consider the classic example of coin flips.
Here our null hypothesis will be that the coin is fair. Let us say that we flipped the coin $100$ times and got $99$ heads. Pretty much anyone would be comfortable rejecting the null hypothesis under this circumstance. Why? Because the probability of a fair coin being flipped $100$ times and coming up heads $99$ of those times is tiny!
We were able to decide this on the basis of our intuition. But what if it were $75$ heads instead of $99$? How about $60$? We need a way to quantify. We use the idea in the last sentence of my previous paragraph. We said the probability of the experiment coming out that way was tiny, but how tiny was it? This probability is called the p-value.
Stated semi-formally, the p-value is the probability, under the null hypothesis, that the experiment will come out as extreme or more extreme than the data you have in front of you. We can calculate exactly the probability that a fair coin will come out with either $0$, $1$, $99$ or $100$ heads (the $0,1,100$ are included because they are as extreme as, or more extreme than, $99$ heads). It is $$ p = 2\cdot\frac{1}{2^{100}} + 2\cdot 100 \cdot\frac{1}{2^{100}} = \frac{202}{2^{100}} \approx 1.6\times 10^{-28}.$$ This is the p-value. Notice it is a tiny number, indicating that the results we saw are extremely unlikely under the null hypothesis. This gives us good confidence in rejecting it.
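As a quick sanity check on the arithmetic, here is a minimal Python sketch that counts the head/tail sequences giving $0$, $1$, $99$ or $100$ heads and divides by the $2^{100}$ equally likely sequences. The variable names are just illustrative choices of mine.

```python
from math import comb

n = 100
# Outcomes at least as extreme as 99 heads: 0, 1, 99 or 100 heads.
extreme_counts = [0, 1, 99, 100]

# Number of length-100 head/tail sequences that produce each of those counts.
favorable = sum(comb(n, k) for k in extreme_counts)

# Under the null hypothesis (fair coin) every sequence has probability 1 / 2**100.
p_value = favorable / 2**n

print(favorable)          # 202
print(f"{p_value:.1e}")
```

The count $202 = 1 + 100 + 100 + 1$ matches the numerator in the formula above.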
Now we can compare to the case where we see $75$ heads or $60$ heads out of $100$. For $75$ heads, we can compute a p-value of $5.6\times 10^{-7}$ and for $60$ heads we get $0.057.$ Now we have a little more perspective on these less obvious cases. It turns out that getting $75$ or more heads (or $75$ or more tails) is rarer than a one in a million occurrence if the coin is fair. So if it wasn't obvious before, we can feel more confident in rejecting the null hypothesis that the coin is fair.
And we see the probability of getting $60$ or more heads (or $60$ or more tails) is around $6\%.$ This is unlikely, but not that unlikely. So we might be more cautious about rejecting the null hypothesis... it's possible the coin is biased, but it's also plausible that it was just a statistical fluke that there were $60$ heads.
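If you want to reproduce the numbers for $75$ and $60$ heads, here is a small sketch of the two-sided calculation. The function name `two_sided_p_value` is my own, and "as extreme or more extreme" is taken to mean "at least as far from $50$ heads, in either direction", which is the convention used for the $99$ heads case.

```python
from math import comb

def two_sided_p_value(heads, n=100):
    """P(|X - n/2| >= |heads - n/2|) for X ~ Binomial(n, 1/2)."""
    distance = abs(heads - n / 2)
    # Count the sequences whose head count is at least as far from n/2.
    favorable = sum(
        comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance
    )
    return favorable / 2**n

for heads in (99, 75, 60):
    print(heads, f"{two_sided_p_value(heads):.2g}")
```

This should give values close to $1.6\times 10^{-28}$, $5.6\times 10^{-7}$ and $0.057$ respectively.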
Where to draw the line is subjective. A historical convention, as you mentioned, is to use a threshold of $5\%$ (whereby we would elect to retain the null hypothesis for the case of $60$ heads, since $0.057 > 0.05$). So we pick a line (before the experiment is done) based on how often we're comfortable being wrong, and then reject if the p-value from the experimental outcome is lower.
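To make "how often we're comfortable being wrong" concrete, one can compute how often a genuinely fair coin would land on an outcome that the $5\%$ rule rejects. The sketch below is only an illustration; the helper names are mine, and it assumes the same two-sided notion of "extreme" as in the coin examples.

```python
from math import comb

def two_sided_p_value(heads, n=100):
    """P(|X - n/2| >= |heads - n/2|) for X ~ Binomial(n, 1/2)."""
    distance = abs(heads - n / 2)
    favorable = sum(
        comb(n, k) for k in range(n + 1) if abs(k - n / 2) >= distance
    )
    return favorable / 2**n

def false_rejection_rate(alpha, n=100):
    """Chance that a fair coin produces an outcome with p-value <= alpha."""
    rejected = [k for k in range(n + 1) if two_sided_p_value(k, n) <= alpha]
    return sum(comb(n, k) for k in rejected) / 2**n

rate = false_rejection_rate(0.05)
print(f"{rate:.4f}")  # never exceeds the chosen threshold of 0.05
```

Because the number of heads is discrete, the actual rate of wrongly rejecting a fair coin comes out somewhat below $5\%$ rather than exactly equal to it.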
This is the sensible (though not unflawed) logic of hypothesis testing. And it explains why low p-values mean we should reject.