I am reading the book: Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow by Aurélien Géron and am a little confused by the following segment:
> So far we have considered purely random sampling methods. This is generally fine if your dataset is large enough (especially relative to the number of attributes), but if it is not, you run the risk of introducing a significant sampling bias. ... For example, the US population is composed of 51.3% female and 48.7% male, so a well-conducted survey in the US would try to maintain this ratio in the sample: 513 female and 487 male. This is called stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances is sampled from each stratum to guarantee that the test set is representative of the overall population. If they used purely random sampling, there would be about 12% chance of sampling a skewed test set with either less than 49% female or more than 54% female.
I am not understanding two things here:
- Why would purely random sampling have any chance of sampling a skewed test set?
If the US population is 51.3% female and 48.7% male and the sample is drawn purely at random, then every person is equally likely to be selected. Shouldn't that lead to a representative sample?
- If there is indeed a chance of sampling a skewed test set, why is that 12%, and why does it hinge on having a test set with fewer than 49% females or more than 54% females? What is the relevance of 49% and 54%? They seem completely arbitrary.
I am aware this question has been posted here: https://stats.stackexchange.com/questions/294151/could-someone-explain-how-this-estimate-number-is-being-arrived-at
However, the answer there only addresses how to get the 12% number (and it actually arrives at 12.5%, not 12%); the two points I raised above are not addressed.
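For reference, here is a minimal sketch of how I understand that figure can be reproduced. It assumes a sample of 1,000 people (implied by the 513/487 split in the quote) and that each sampled person is female independently with probability 0.513, so the number of females follows a binomial distribution. The function names are my own, and this is not necessarily how the book or the linked answer computed it. I compute the result twice, once excluding and once including the 49% and 54% boundaries, since that choice may account for the 12% versus 12.5% difference.

```python
import math

def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p), computed in log space to avoid underflow."""
    log_pmf = (
        math.lgamma(n + 1) - math.lgamma(k + 1) - math.lgamma(n - k + 1)
        + k * math.log(p) + (n - k) * math.log(1 - p)
    )
    return math.exp(log_pmf)

def prob_skewed(n, p, low, high, include_bounds):
    """Probability that the female count falls below `low` or above `high`.

    If include_bounds is True, counts equal to `low` or `high` also count as skewed.
    """
    total = 0.0
    for k in range(n + 1):
        if include_bounds:
            skewed = k <= low or k >= high
        else:
            skewed = k < low or k > high
        if skewed:
            total += binom_pmf(k, n, p)
    return total

n = 1000          # assumed sample size
p_female = 0.513  # population share from the quoted passage
low, high = 490, 540  # 49% and 54% of 1,000

p_strict = prob_skewed(n, p_female, low, high, include_bounds=False)
p_inclusive = prob_skewed(n, p_female, low, high, include_bounds=True)

print(f"fewer than 490 or more than 540 females: {p_strict:.4f}")
print(f"at most 490 or at least 540 females:     {p_inclusive:.4f}")
```

Both numbers are well above zero, so even under this simple model a purely random sample of 1,000 lands outside the 49% to 54% range fairly often. That still leaves my questions about why this happens and why those particular thresholds were chosen.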