Suppose you have a set of observations of some quantity and two hypotheses, $f(x)$ and $g(x)$, for its probability density function. Suppose for simplicity that you know one of these two models accurately represents the distribution. For example, $f(x)$ might be a Gaussian distribution, whereas $g(x)$ is the sum of $f(x)$ and another normal distribution with a different mean and smaller amplitude. This could correspond to measuring people's heights, where the null hypothesis is that heights are normally distributed and the alternative hypothesis is that heights have two peaks, one for men and one for women.
How would one compare how likely it is that the data come from one model versus the other? What would you do if you do not start from the assumption that the single normal distribution is the "more likely" one, but your data are not significant enough to rule out a normal distribution completely?
This seems like a pretty fundamental question, so I'm open to reference links.
Judging Bimodality of Mixture Distributions.
A couple of examples to illustrate my comment:
First example. Means of the two normal components are several standard deviations apart.
Combined data fail a Shapiro-Wilk test of normality.
Histogram shows bimodality. (Also, tick marks along the horizontal axis, at individual observations, show two clusters.)
A normality plot of the 200 values is not anywhere near linear. The 'S' shape in this plot is typical of bimodal data.
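A minimal Python sketch of a setup like this first example, in case it helps to experiment. The component means, standard deviations, the 100/100 split, and the seed are my own assumptions for illustration, not the values behind the figures described above. It uses `scipy.stats.shapiro` for the test and the correlation coefficient from `scipy.stats.probplot` as a rough numerical stand-in for how linear the normal probability plot is.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2024)  # arbitrary seed, only for reproducibility

# Assumed parameters (hypothetical): two equally weighted normal
# components whose means are 5 standard deviations apart.
x1 = rng.normal(loc=100.0, scale=10.0, size=100)
x2 = rng.normal(loc=150.0, scale=10.0, size=100)
x = np.concatenate([x1, x2])  # 200 combined values

# Shapiro-Wilk test of normality on the combined sample.
w_stat, p_value = stats.shapiro(x)

# Normal probability plot data; r is the correlation between the ordered
# sample and the theoretical normal quantiles (1.0 would be perfectly linear).
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

print(f"Shapiro-Wilk W = {w_stat:.4f}, p-value = {p_value:.3g}")
print(f"probability-plot correlation r = {r:.4f}")
```

With components this far apart, the p-value should be far below 0.05. Plotting `osm` against `osr` should show the 'S' shape mentioned above.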
Second example. The means of the two components are more nearly equal, and the standard deviations are a little different. [Very roughly speaking, perhaps with a little exaggeration of differences, this may be more like the mixture distribution of heights of adult humans (men and women together).]
Shapiro-Wilk test still detects non-normality (at 5% level).
However, a histogram does not clearly show bimodality.
Judging normality by looking at histograms is subjective. A histogram of the same data, but with different binning, might give a slightly different impression. If there are not too many observations, it may help to put tick marks along the horizontal axis to show individual observations. The histogram with more bins seems 'uneven', but I see no evidence of bimodal clustering among the tick marks.
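To see how much the impression depends on binning without drawing anything, here is a rough sketch that bins one simulated sample several ways and counts local peaks in the bin counts. The mixture parameters (loosely height-like, in inches), the seed, and the helper `count_local_peaks` are hypothetical choices of mine, and a local peak in a histogram is only a crude indicator, not a test of bimodality.

```python
import numpy as np

rng = np.random.default_rng(7)  # arbitrary seed

# Assumed parameters (hypothetical): means fairly close together,
# standard deviations slightly different, 100 observations per component.
x = np.concatenate([
    rng.normal(loc=64.0, scale=2.5, size=100),
    rng.normal(loc=69.0, scale=3.0, size=100),
])

def count_local_peaks(counts):
    """Count bins strictly higher than both neighbours (ends padded with 0)."""
    padded = np.concatenate([[0], counts, [0]])
    mid = padded[1:-1]
    return int(np.sum((mid > padded[:-2]) & (mid > padded[2:])))

# Same data, three different binnings.
for bins in (8, 16, 32):
    counts, edges = np.histogram(x, bins=bins)
    print(bins, counts.tolist(), "local peaks:", count_local_peaks(counts))
```

Finer binning typically produces more local peaks from sampling noise alone, which is one reason an 'uneven' histogram by itself is weak evidence of two modes.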
Here a normal probability plot does not clearly show bimodality, but it is not as close to linear as one would expect for normal data. Roughly speaking, this may be the reason for failing the Shapiro-Wilk test.
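For a case like this second example, a sketch along the following lines reports the Shapiro-Wilk result together with the probability-plot correlation. Again, the parameters and seed are assumptions of mine rather than the values used for the results described above. With components this close, whether the test rejects at the 5% level depends on the particular sample, so no specific output is claimed here.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(11)  # arbitrary seed

# Assumed parameters (hypothetical): overlapping components with
# slightly different standard deviations, 200 values in total.
x = np.concatenate([
    rng.normal(loc=64.0, scale=2.5, size=100),
    rng.normal(loc=69.0, scale=3.0, size=100),
])

# Shapiro-Wilk test of normality.
w_stat, p_value = stats.shapiro(x)

# Correlation between ordered data and normal quantiles; values close to 1
# correspond to a nearly linear normal probability plot.
(osm, osr), (slope, intercept, r) = stats.probplot(x, dist="norm")

print(f"Shapiro-Wilk W = {w_stat:.4f}, p-value = {p_value:.3g}")
print(f"probability-plot correlation r = {r:.4f}")
print("reject normality at 5% level:", p_value < 0.05)
```

Rerunning with different seeds gives a feel for how often a mixture with this much overlap is flagged as non-normal at this sample size.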