Currently I am working through an introduction to parameter estimation and fitting probability distributions to sets of data. As a brief synopsis, my understanding of the whole process is the following:
1) We collect a large amount of raw data, which comes from an underlying probability distribution. We then "graph" the data (perhaps in the form of a bar chart or something similar at least in the 2D and 3D cases).
2) Observing this visual presentation, we go through our list of known probability distributions and form an opinion on which distribution appears to fit the data most closely.
3) We then take a large sample from this data and attempt to estimate the parameters of our chosen probability distribution using the array of techniques at our disposal.
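The three steps above can be sketched in Python. Everything here is illustrative: the gamma family, the simulated data, and the maximum-likelihood fit are assumptions standing in for whatever data and family you actually have.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)
# Step 1: "collect" raw data -- simulated here from a gamma distribution
# purely for illustration (an assumption; real data would replace this).
data = rng.gamma(shape=3.0, scale=2.0, size=200)

# Step 2 would be visual: histogram the data and eyeball which family
# looks plausible. Suppose we settle on the gamma family.

# Step 3: estimate the family's parameters by maximum likelihood.
# floc=0 fixes the location at zero so only shape and scale are fitted.
shape_hat, loc_hat, scale_hat = stats.gamma.fit(data, floc=0)
```

With 200 observations the fitted shape and scale land near the true values (3 and 2), which is the whole point of step 3: the family is chosen by eye, but its parameters are estimated formally.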
I have a few questions:
i) Is the outline above the procedure used to get parameter estimates?
ii) (more important) What stops us from making any function a probability distribution? What I mean is: we have this visual representation of the data, and perhaps none of the known probability distributions align with it. What stops us from just saying "this continuous function will now be a distribution, as long as it satisfies the necessary axioms"? Is there something more rigorous to this? (Perhaps I just haven't arrived there yet in my studies.)
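On question (ii): nothing, in fact, stops you. Any nonnegative function with a finite integral becomes a valid density once divided by its normalizing constant; the practical obstacles are interpretability and tractability, not the axioms. A minimal sketch (the particular function is an arbitrary choice, not a named distribution):

```python
import numpy as np
from scipy.integrate import quad

# An arbitrary nonnegative, integrable function on [0, inf) --
# deliberately not any standard 'named' density.
f = lambda x: np.exp(-x) * (1.0 + np.sin(x) ** 2)

# Divide by the total integral (the normalizing constant) to get a density.
Z, _ = quad(f, 0, np.inf)
pdf = lambda x: f(x) / Z

# The normalized function now integrates to 1, satisfying the density axiom.
total, _ = quad(pdf, 0, np.inf)
```

The catch is everything that comes after: computing moments, quantiles, and fitted parameters for such an ad hoc density is usually harder than for a named family, which is why named families are preferred when they fit.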
Comments: Here is a story based on an actual industrial problem. Some details have been changed to protect proprietary information, but the essence has not been changed. In my experience, the approach and methods used are somewhat more typical of actual probability modeling than are attempts at distribution ID and parameter estimation.
Data on $n = 100$ runs are available. The usual descriptive statistics (from R statistical software) are as follows:
Here is a histogram of the 100 observations, along with a kernel density estimator (KDE) of the population distribution. Tick marks on the horizontal axis show exact values of runs (sometimes nearly-tied observations show as one tick mark).
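A KDE like the one described can be produced with `scipy.stats.gaussian_kde`. Since the 100 original observations are not shown, the sample below is a right-skewed stand-in chosen only for illustration:

```python
import numpy as np
from scipy.stats import gaussian_kde

rng = np.random.default_rng(0)
# Stand-in for the 100 run times (the real data are proprietary):
# a right-skewed lognormal sample, an assumption for illustration.
runs = rng.lognormal(mean=3.2, sigma=0.4, size=100)

kde = gaussian_kde(runs)          # Gaussian kernel, Scott's-rule bandwidth
grid = np.linspace(runs.min(), runs.max(), 400)
density = kde(grid)               # smooth curve to draw over the histogram

# The KDE is itself a proper density: it integrates to 1 over the real line.
area = kde.integrate_box_1d(-np.inf, np.inf)
```

Plotting `density` against `grid` on top of the histogram gives the picture described above.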
A Shapiro-Wilk normality test has a P-value of about 0.002, so we should not try to model $S$ as normally distributed. I think a search for a 'named' distribution that fits these data would not be fruitful.
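The test itself is one line in R (`shapiro.test`) or Python. The data below are again a skewed stand-in for the real sample, since the original observations are not available:

```python
import numpy as np
from scipy.stats import shapiro

rng = np.random.default_rng(2)
# Heavily skewed stand-in for the 100 run times (real data not shown).
runs = rng.lognormal(mean=3.2, sigma=0.8, size=100)

# Null hypothesis: the sample comes from a normal distribution.
stat, p = shapiro(runs)
# A tiny p-value, as here, is strong evidence against normality.
```

With a sample this skewed the p-value is far below 0.05, mirroring the 0.002 reported for the actual data.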
The percentage of these 100 runs taking more than 35 time units is 18%. That was our preliminary estimate of $P(S > 35).$
Of course, another sample of 100 runs could give a different estimate of $P(S > 35).$ A nonparametric bootstrap procedure gives a 95% confidence interval of about 11% to 26% for the true probability.
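The point estimate and percentile bootstrap interval can be sketched as follows. The sample is a stand-in calibrated so that roughly 18% of runs exceed 35 time units (an assumption; the real data are not available):

```python
import numpy as np

rng = np.random.default_rng(3)
# Stand-in sample of 100 run times (assumption: lognormal, tuned so
# that P(S > 35) is roughly 0.18, matching the observed proportion).
runs = rng.lognormal(mean=3.1, sigma=0.5, size=100)

# Point estimate: the sample proportion exceeding 35 time units.
p_hat = np.mean(runs > 35)

# Nonparametric bootstrap: resample the data with replacement many
# times and recompute the proportion each time.
B = 5000
boot = np.array([
    np.mean(rng.choice(runs, size=runs.size, replace=True) > 35)
    for _ in range(B)
])

# 95% percentile interval for P(S > 35).
lo, hi = np.percentile(boot, [2.5, 97.5])
```

Note that nothing here assumes a distributional family: the bootstrap resamples the data themselves, which is exactly why it suits a problem where no named distribution fits.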
Subsequently, it was possible to see data for a very large number of runs. That large sample had 15% of runs taking longer than 35 time units. So our current best guess is that $P(S > 35) \approx 0.15.$ There is more to the whole story than that, but I hope this quick view has given you some things to think about.
For the present you should probably take this course as it comes and learn what you can. But keep an open mind about statistical analysis and probability modeling. In particular when you have time, you might want to look on this site, our sister statistics site (Cross-Validated), and more generally online to learn more about KDEs and bootstrapping.