I am teaching myself further maths A Level Year 1 statistics from the OCR book (A).
Chapter $5$ is about Correlation and regression.
I get that the ppmcc is given by $r,$ which is just a number with $\vert r\vert \le 1,$ and represents how closely a sample of bivariate data matches a straight line when plotted on a graph.
Then it defines $\rho$ as the population correlation coefficient. So far so good.
Then it delves into how to carry out a hypothesis test for whether or not the two variables (of the bivariate data) have any correlation in the overall population, using the data from a sample. That is:
$H_0: \rho=0;\ H_1: \rho \neq 0$ for a two-tailed test, or either $H_1: \rho < 0$ or $H_1: \rho > 0$ for a one-tailed test.
Then it says, "If both variables follow a normal distribution then the table of critical values can be used to conduct the hypothesis test. The null hypothesis is rejected if the sample correlation coefficient is more than the critical value associated with the sample size $(n)$ and the significance level."
I get the idea of Hypothesis tests in general. However, here I don't know what it means by, "If both variables follow a normal distribution." What does this mean? If what follows a normal distribution?
The worked example given in the book is the following:
Phoebe is looking for evidence that the population of cod $(c)$ and the population of tuna $(t)$ (where population is assessed by kilograms of fish caught) in various seas are negatively correlated. She observes eleven pairs of values and summarises her results:
The statistical summary of results in the question is:
$$ \sum c = 165,\ \sum c^2 = 2585,\ \sum t = 81,\ \sum t^2 = 757,\ \sum ct = 1184. $$
$(a)$ Find the ppmcc for the data.
$(b)$ Conduct an appropriate test at the $5$% significance level.
$(c)...$ is irrelevant to my question.
I don't understand what I'm meant to be checking follows a normal distribution in this question, in order to justify using the table of critical values.
The next section of the chapter is about Spearman's rank correlation coefficient, and it says, "when the samples are drawn from populations following a normal distribution, the ppmcc is a very good way to test for correlation".
Again, what does it mean for a population to follow a normal distribution? The reason this doesn't make sense to me is that usually we talk about an attribute/variable of something following a normal distribution. For example, it makes sense to say, "the masses of tuna fish follow a normal distribution". But I don't get what it means by "a population follows a normal distribution". Anyway, now I am repeating myself, so I'll stop typing...
It just means that the values of the variable across the population can be modelled by a normal or, more usually, an approximately normal probability distribution (this is OK because “everything natural is basically normal”, which is good enough for A level). For example, the populations of cod and tuna following a bivariate normal distribution means that the probability of observing a pair $(c,t)$ in a given set (e.g. $\{(c,t):0\le c\le100\wedge t\le c\}$) can be found by integrating the bivariate normal pdf, with suitable parameters, over that set.
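To spell out what "integrating the bivariate normal pdf" looks like, here is the standard density. The symbols $\mu_c,\mu_t$ (means), $\sigma_c,\sigma_t$ (standard deviations) and the set name $A$ are my own notation, not the book's; $\rho$ is the population correlation coefficient from the question.

$$ f(c,t) = \frac{1}{2\pi\sigma_c\sigma_t\sqrt{1-\rho^2}} \exp\left(-\frac{1}{2(1-\rho^2)}\left[\frac{(c-\mu_c)^2}{\sigma_c^2} - \frac{2\rho(c-\mu_c)(t-\mu_t)}{\sigma_c\sigma_t} + \frac{(t-\mu_t)^2}{\sigma_t^2}\right]\right) $$

$$ P\big((c,t)\in A\big) = \iint_A f(c,t)\,dc\,dt $$

The normality assumption is the claim that some choice of these five parameters describes the population reasonably well.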
This is necessary since, under the hood, all these tests are following precise mathematical proofs where certain assumptions about the underlying data are necessary in order to show that, e.g., the probability distribution of $r$ is approximately (something) so that we can calculate critical values by integrating the pdf of (something). The further maths A level does not explain any of this as the exam boards prefer a wealth of formulae over any deep explanation... to be fair, for statistics the proofs are too dense for A level students (but I will never not be mad at the way they teach pure!!)
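To make "the probability distribution of $r$" a bit more concrete, here is a rough sketch in Python (standard library only). It computes $r$ for the cod/tuna summary in the question, then estimates the one-tailed $5\%$ critical value for $n=11$ by simulating many samples of two independent normal variables (so $\rho=0$). The function names, the seed and the number of trials are my own choices, and the simulation is only an approximation to what the printed table gives exactly.

```python
import math
import random

def pmcc_from_sums(n, sx, sxx, sy, syy, sxy):
    """Product moment correlation coefficient from summary statistics."""
    s_xx = sxx - sx * sx / n
    s_yy = syy - sy * sy / n
    s_xy = sxy - sx * sy / n
    return s_xy / math.sqrt(s_xx * s_yy)

def pmcc(xs, ys):
    """Product moment correlation coefficient from raw paired data."""
    return pmcc_from_sums(
        len(xs),
        sum(xs), sum(x * x for x in xs),
        sum(ys), sum(y * y for y in ys),
        sum(x * y for x, y in zip(xs, ys)),
    )

def simulated_lower_quantile(n, alpha, trials, seed=0):
    """Estimate the alpha-quantile of r when both variables are
    independent normals, i.e. when the null hypothesis rho = 0 holds."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    rs = []
    for _ in range(trials):
        xs = [rng.gauss(0, 1) for _ in range(n)]
        ys = [rng.gauss(0, 1) for _ in range(n)]
        rs.append(pmcc(xs, ys))
    rs.sort()
    return rs[int(alpha * trials)]

# Summary statistics from the worked example (n = 11 pairs).
r = pmcc_from_sums(11, 165, 2585, 81, 757, 1184)

# Lower-tail 5% point of r under H0, since H1 is rho < 0.
crit = simulated_lower_quantile(11, 0.05, 20000)

print(f"r = {r:.4f}")
print(f"simulated 5% lower critical value = {crit:.3f}")
print("reject H0" if r < crit else "do not reject H0")
```

By my arithmetic $r\approx-0.233$, and the simulated critical value should land close to the tabulated $0.5214$ (in magnitude) for $n=11$ at the one-tailed $5\%$ level, so the sample value is not extreme enough to reject $H_0$. The table in the book is just the exact version of this calculation, and it is only valid because the simulated data here are normal.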
A clearer example of the second paragraph can be found in the $\chi^2$ tests (clearer because I actually know what the distribution is ;) For the pmcc, I believe the relevant result is that, under $H_0$ with bivariate normal data, the statistic $r\sqrt{(n-2)/(1-r^2)}$ follows a Student's $t$ distribution with $n-2$ degrees of freedom). It can be shown (on Wikipedia there is a proof) that if you assume the observed category counts follow a multinomial distribution with the probabilities given by your hypothesised distribution, then the probability that the $\chi^2$ statistic computed from a sample is greater than a given value can be modelled (asymptotically for large samples) by a $\chi^2$ probability distribution with suitably many “degrees of freedom”.
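If you want to see that claim rather than take it on trust, here is a small simulation sketch (again plain Python, standard library only). The set-up is entirely my own illustration: a fair six-sided die rolled $120$ times, so the hypothesised probabilities are all $1/6$ and there are $5$ degrees of freedom. The value $11.070$ is the usual tabulated $5\%$ critical value for $\chi^2$ with $5$ degrees of freedom.

```python
import random

def chi_squared_statistic(observed, expected):
    """Sum of (O - E)^2 / E over all categories."""
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

def exceed_fraction(probs, n, critical, trials, seed=0):
    """Fraction of simulated multinomial samples whose chi-squared
    statistic exceeds the given critical value."""
    rng = random.Random(seed)  # fixed seed so the run is repeatable
    k = len(probs)
    expected = [n * p for p in probs]
    exceed = 0
    for _ in range(trials):
        counts = [0] * k
        # Draw n observations from the hypothesised distribution.
        for category in rng.choices(range(k), weights=probs, k=n):
            counts[category] += 1
        if chi_squared_statistic(counts, expected) > critical:
            exceed += 1
    return exceed / trials

# Hypothetical example: fair die, 120 rolls, 5 degrees of freedom.
fraction = exceed_fraction([1 / 6] * 6, 120, 11.070, 5000)
print(f"fraction above the 5% critical value: {fraction:.3f}")
```

The printed fraction should come out near $0.05$, which is what "the statistic approximately follows a $\chi^2$ distribution" means in practice. It will not be exactly $0.05$, both because of simulation noise and because the result is only asymptotic.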