Is "Probability Theory" an Inseparable Aspect of Machine Learning?


I have always had the following question about Probability and Machine Learning.

  • As a simple example, suppose we have some data (e.g. heights of students: 175 cm, 181 cm, 162 cm, etc.). If we assume that this data comes from a Normal Probability Distribution, we can use the Likelihood Function of the Normal Distribution to estimate the specific Normal Distribution (i.e. the mean and standard deviation) that would have been "most likely" (i.e. optimal) to produce this data.

  • As a slightly more involved example, when working with Regression Models, we are interested in optimizing the "conditional expectation of the response given the covariates", provided we assume some underlying Probability Distribution. In many ways, this is just like the first example: in most cases, we assume that the difference between the Actual Response and the Predicted Response (i.e. the "error") is Normally Distributed - effectively, this Normal Distribution has a mean of "b0 + b1x1 + ...". We are then trying to find the optimal values of these beta regression coefficients under some assumed underlying Probability Distribution.
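Both bullet points can be made concrete in a few lines. The sketch below uses made-up numbers: the Normal MLE for the heights has a closed form, and in simple linear regression, minimizing squared error yields the same coefficients that maximize a Normal likelihood for the errors:

```python
import math

# --- Example 1: Normal MLE for hypothetical student heights (cm) ---
heights = [175.0, 181.0, 162.0, 170.0, 168.0]
n = len(heights)
mu_hat = sum(heights) / n  # MLE of the mean is the sample mean
# MLE of the variance uses 1/n (not the 1/(n-1) of the sample variance)
sigma_hat = math.sqrt(sum((h - mu_hat) ** 2 for h in heights) / n)

# --- Example 2: simple linear regression via least squares ---
# Minimizing squared error is equivalent to maximizing a Normal
# likelihood for the errors, so b0, b1 below are also the MLEs.
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.1, 3.9, 6.2, 8.0]
x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) \
     / sum((x - x_bar) ** 2 for x in xs)
b0 = y_bar - b1 * x_bar

print(mu_hat, sigma_hat)  # roughly 171.2 and 6.4
print(b0, b1)             # roughly 0.05 and 2.0
```

The point is that in both cases the "best" parameters are defined through an assumed Normal distribution, even though the regression fit can also be computed without ever mentioning probability.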

Now consider a more sophisticated model such as a Neural Network. In Neural Networks, we want to find the optimal parameters (i.e. neuron weights) of the model - this is done by optimizing the Loss Function associated with the Neural Network (based on the observed data).
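To make the "optimize a loss, no distribution required" idea concrete, here is a minimal gradient-descent sketch on a one-parameter model y_hat = w * x (made-up data, not a real network). Nothing below assumes a probability distribution on the errors; we simply follow the gradient of the squared-error loss:

```python
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]  # data generated by y = 2x

w = 0.0    # initial weight
lr = 0.05  # learning rate
for _ in range(200):
    # gradient of L = (1/n) * sum (w*x - y)^2 with respect to w
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad

print(w)  # converges toward 2.0
```

Whether this procedure has a probabilistic interpretation depends entirely on the loss: with squared error it coincides with Gaussian maximum likelihood, but the optimization itself never invokes that fact.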

I have been thinking about this for a while: unlike the earlier two examples, the error of the Neural Network does not seem to be required to follow any Probability Distribution - yet the general idea still seems to apply: given a specific input (i.e. combination of covariates), we would like to identify the response that has the highest probability "conditional" on this observed input.

Thus, my question: do the loss functions of Neural Networks have an inherent Probability component? Can we say that the predictions made by some specific Neural Network model are "probabilistically the most likely responses" given the observed inputs? Or does the notion of Probability have no real relevance here?

In the end, I am thinking that perhaps all Statistical Decision Theory (e.g. optimal classification label for a new observation, optimal prediction for a new observation) might have an inherent probabilistic interpretation, regardless of the Machine Learning model being used.

Thanks!

Note: I was thinking that maybe all of this (i.e. Probability and Machine Learning Loss Functions) comes together through Empirical Risk Minimization (https://en.wikipedia.org/wiki/Empirical_risk_minimization) - I have heard people say that Machine Learning Models are trying to "learn" a High-Dimensional Joint Probability Distribution corresponding to the observed data. Would it be a stretch to say that the prediction made by a Neural Network model for a specific set of inputs is in fact the "probabilistically most likely prediction" corresponding to those inputs?
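One small illustration of how empirical risk minimization and likelihood can coincide (a toy sketch with made-up labels): for a constant Bernoulli model with parameter p, the empirical risk under log-loss is the average negative log-likelihood, and its minimizer is the sample proportion of positive labels - exactly the MLE:

```python
import math

labels = [1, 1, 0, 1, 0, 1, 1, 0, 1, 1]  # hypothetical binary outcomes

def empirical_risk(p):
    # average log-loss (negative log-likelihood) over the observed labels
    return -sum(math.log(p) if y == 1 else math.log(1 - p)
                for y in labels) / len(labels)

# naive grid search over candidate values of p
grid = [i / 100 for i in range(1, 100)]
p_hat = min(grid, key=empirical_risk)
print(p_hat)  # the sample proportion of 1s, here 0.7
```

So for this particular loss, "empirical risk minimizer" and "maximum likelihood estimate" are the same object; for other losses (e.g. hinge loss) no such likelihood correspondence exists.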

Note: On a more abstract level, can we say that Machine Learning algorithms are trying to minimize the "Risk of Misclassification" (http://www.cs.cmu.edu/~aarti/Class/10704_Fall16/lec11.pdf)?

Note: Just thinking about this - I don't think that, when estimating the Weight Parameters (i.e. Neuron Weights) of a Neural Network, the standard Gradient Descent approach will automatically produce Confidence Intervals for each Weight Parameter: we almost never hear about Maximum Likelihood Estimation and Neural Networks together. For example, if the final value of Neuron Weight #34 is 556.2, I don't think we have any reason to believe that this value corresponds to a Normal Distribution centered at 556.2 (i.e. with a mean of 556.2).

Your notes are pointing in the right direction. One does not need to define a sampling probability distribution to train a model; the neural network example is a good one. Remember, though, that you can also train a neural network by defining a probability distribution on the outcome and doing maximum likelihood. Another example is one of the many interpretations of least squares: we can think of least squares as finding the vector that best solves a system of linear equations. No probabilistic interpretation required.
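The purely algebraic view of least squares mentioned above can be sketched directly (made-up data): fitting the line y = b0 + b1*x reduces to solving the 2x2 normal equations, and probability never enters.

```python
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]  # exactly y = 1 + 2x

n = len(xs)
sx = sum(xs)
sxx = sum(x * x for x in xs)
sy = sum(ys)
sxy = sum(x * y for x, y in zip(xs, ys))

# Normal equations:  n*b0 + sx*b1 = sy   and   sx*b0 + sxx*b1 = sxy
# Solved here by Cramer's rule on the 2x2 system.
det = n * sxx - sx * sx
b0 = (sy * sxx - sx * sxy) / det
b1 = (n * sxy - sx * sy) / det
print(b0, b1)  # recovers 1.0 and 2.0
```

The same numbers can be reinterpreted as Gaussian MLEs if one chooses to add that assumption, which is exactly the point: the probabilistic reading is optional.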

Where does probability come in, then? It enters in many ways. For example, in Bayesian methods you need to define a prior probability distribution (in most cases, you also have to define a sampling distribution/likelihood), which "automatically" gives you a probabilistic interpretation of the estimated parameters. In (classical) frequentist statistics, you need to make assumptions about the asymptotics of your data or methods, which allow you to think of the derived parameters as following some distribution.

Edit 29/Mar/2022

This is an edit to reflect on your note. It is fairly common to train a neural network by minimizing the negative log-likelihood of a distribution; this is the approach taken, for example, in mixture density networks. With respect to confidence intervals, the answer is no: when you estimate a neural network through any "vanilla" gradient descent method, you do not get confidence intervals. Finding ways to efficiently quantify uncertainty in neural networks is a big area of research nowadays. One way is through Bayesian NNs, where you get a posterior distribution on the weights of the NN. Another is conformal prediction, where you get calibrated (un)certainty on the output of your neural network.
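To give a flavor of the conformal prediction route: the split conformal recipe wraps any fitted point predictor in a prediction interval with approximate coverage 1 - alpha, using only held-out residuals. The sketch below uses hypothetical numbers, and `predict` stands in for whatever model you have already trained:

```python
import math

def predict(x):
    # stand-in for any fitted point predictor; here simply y = 2x
    return 2.0 * x

# held-out calibration data, never used for fitting the predictor
calib_x = [1.0, 2.0, 3.0, 4.0, 5.0]
calib_y = [2.2, 3.7, 6.4, 8.1, 9.8]

# nonconformity scores: absolute residuals on the calibration set
scores = sorted(abs(y - predict(x)) for x, y in zip(calib_x, calib_y))

alpha = 0.2
n = len(scores)
# conformal quantile: the ceil((n + 1) * (1 - alpha))-th smallest score
k = math.ceil((n + 1) * (1 - alpha))
q = scores[min(k, n) - 1]

x_new = 6.0
interval = (predict(x_new) - q, predict(x_new) + q)
print(interval)  # point prediction plus/minus the calibrated margin
```

Note that the guarantee is on the output (the interval contains the true response with probability about 1 - alpha), not on the weights - so this complements, rather than replaces, the per-parameter uncertainty you would get from a Bayesian NN.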