I have been doing some reading and thinking about the nature of statistical inference, and the way formal statistics models an event seems a bit strange to me. Here is how I like to think about it:
In the deterministic world, the most basic way of modelling something is to fit a function $y=f(x)$ that maps input $x$ to output $y$. This is usually done in one of two ways: either some theoretical calculation suggests the form of $f(x)$, or you look at a lot of data and make a guess (or use some computational fitting procedure). In the first case you simply test your guessed function against the data by computing the mean absolute error between the observed outputs $y_{obs}$ and the values $y_{calc}$ calculated via $f$ from the inputs $x$. In the second case, where you train $f(x)$ on data, you fit on a training subset and test on a validation subset, but again using the mean absolute error.
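As a toy illustration of the deterministic case (the quadratic form, noise level, and split sizes here are all made up for the sketch):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: a noisy quadratic relationship y = f(x) + noise.
x = rng.uniform(-2, 2, 200)
y = 1.5 * x**2 - x + rng.normal(0, 0.3, size=x.size)

# Split into a training subset and a validation subset.
x_train, x_val = x[:150], x[150:]
y_train, y_val = y[:150], y[150:]

# Fit f(x) as a degree-2 polynomial on the training subset.
f = np.poly1d(np.polyfit(x_train, y_train, deg=2))

# Evaluate with the mean absolute error on both subsets.
mae_train = np.mean(np.abs(y_train - f(x_train)))
mae_val = np.mean(np.abs(y_val - f(x_val)))
print(mae_train, mae_val)
```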
On the other hand, when you want to model your data probabilistically, i.e. when it is too hard or impossible to build $f(x)$, you try to come up with a probability distribution $p(x)$ that tells you how likely each set of outcomes is every time you run the experiment. In this case you have to measure your error at the level of the probability distribution. The way I would naturally do this is as follows: assume you have some guess $p_0(x)$ for what the distribution should look like. Then take a set of experimental outcomes $\{x_i\}_{i=1}^N$ and form the empirical measure $p^N(x)= \frac{1}{N}\sum_{i=1}^N \delta_{x_i}(x)$. One can calculate the difference between $p^N$ and $p_0$ using a norm defined on the space of probability measures. If my guess $p_0$ is really good and $p^N$ converges to $p_0$ strongly, then the distance measured in any of these norms will tend to $0$ as $N \to \infty$. You can probably only spot differences for a low number of data points, but then again, with little data, modelling becomes more of a choice, as there is too much flexibility in what you can fit.
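A minimal sketch of this comparison, assuming a standard normal guess $p_0$ and synthetic data (the sample size and distances shown are arbitrary choices for the example):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(1)
x = np.sort(rng.normal(0.0, 1.0, 500))  # experimental outcomes {x_i}

# Empirical CDF of p^N: at the i-th sorted sample point it equals i/N.
F_N = np.arange(1, x.size + 1) / x.size

# Guess p_0 = N(0, 1); compare CDFs at the sample points using the
# sup (Kolmogorov-style) distance and a mean absolute CDF distance.
F_0 = norm.cdf(x, loc=0.0, scale=1.0)
sup_dist = np.max(np.abs(F_N - F_0))
mean_abs_dist = np.mean(np.abs(F_N - F_0))
print(sup_dist, mean_abs_dist)
```

Both distances shrink towards $0$ as $N$ grows when the guess is correct, which is the convergence described above.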
Below is an example where I model data points drawn from a normal distribution (their standard deviation is denoted in the titles as "data sigma", and all have mean 0). Once I construct $p^N$ from the data, I iterate over different values of the variance and pick as $p_0$ the one whose variance gives the smallest distance to $p^N$ (I fix mean $=0$, but that could be varied too). The distance I use is the mean absolute distance between the cumulative distribution functions of $p_0$ and $p^N(x)= \frac{1}{N}\sum_{i=1}^N \delta_{x_i}(x)$ (a modified version of the Kolmogorov distance). I also compare it to the normal distribution with the sample mean and variance of the data, which I call the unbiased fit and denote by $\hat{p}$. In the first figure I compare $p_0$ and $\hat{p}$ with the data on the x-axis. In the second figure I plot the CDFs of $p_0$, $\hat{p}$, and $p^N$. As can be seen, when there is a lot of data they are almost identical, and they are only noticeably different for a low number of data points. In the first figure the sigma found for $p_0$ is denoted in the titles as "Min Dist Gaussian sigma". The distances from $p_0$ and $\hat{p}$ to the data are given in the legends.
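A rough sketch of this fitting procedure (the sigma grid, sample size, and true sigma here are arbitrary stand-ins, not the values from my figures):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(2)
data_sigma = 2.0  # the "data sigma" used to generate the sample
x = np.sort(rng.normal(0.0, data_sigma, 1000))
F_N = np.arange(1, x.size + 1) / x.size  # empirical CDF of p^N

def mean_abs_cdf_dist(sigma):
    # Mean absolute difference between the N(0, sigma^2) CDF and the
    # empirical CDF, evaluated at the data points.
    return np.mean(np.abs(norm.cdf(x, loc=0.0, scale=sigma) - F_N))

# Scan a grid of candidate sigmas; p_0 is the minimiser (mean fixed at 0).
sigmas = np.linspace(0.5, 4.0, 400)
dists = [mean_abs_cdf_dist(s) for s in sigmas]
min_dist_sigma = sigmas[int(np.argmin(dists))]

# The "unbiased" comparison p-hat just plugs in the sample moments.
hat_sigma = x.std(ddof=1)
print(min_dist_sigma, hat_sigma)
```

With plenty of data the two sigmas come out nearly identical, matching what the figures show.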
To me this seems like the only sane way of thinking about modelling probabilistic phenomena. On the other hand, in the statistics domain, what people do is first guess a $p_0$ and then an alternative, not-$p_0$, which I denote $p_1$. Looking at $p_0$ and $p_1$, you somehow determine a region such that, if most of the data falls into it, $p_0$ is rejected. This region should lie more towards the "high concentration regions" of $p_1$ and the "low concentration regions" of $p_0$, so that you are more confident in your decision. This just seems to me like a very roundabout way of measuring some sort of distance between the probability measures $p_0$, $p_1$ and the empirical measure $\frac{1}{N}\sum_{i=1}^N \delta_{x_i}(x)$. However, I was not able to formalise this. So my questions are:
1- Is it possible to see this as some sort of norm comparison between these probability measures?
2- If so, what is the reason this particular comparison is used?
3- And finally, if we are looking at distances, why do we not try to find the $p(x)$ that minimises the distance, rather than just comparing one $p_0$ against its negation $p_1$?
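For concreteness, here is the kind of identification I suspect exists for question 1: the one-sample Kolmogorov–Smirnov test (a standard tool, nothing from my figures) uses as its test statistic exactly the sup distance $\sup_t |F_N(t)-F_0(t)|$ between the empirical CDF and the CDF of $p_0$, and rejects when that distance is large:

```python
import numpy as np
from scipy.stats import norm, kstest

rng = np.random.default_rng(3)
x = rng.normal(0.0, 1.0, 300)

# The KS statistic is the sup distance between the empirical CDF of the
# sample and the CDF of the hypothesised p_0 = N(0, 1).
stat, p_value = kstest(x, norm(loc=0.0, scale=1.0).cdf)
print(stat, p_value)
```

So at least one classical test is literally a distance comparison between $p^N$ and $p_0$, which is what makes me suspect the general construction can be phrased this way.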

