What is the sample space, outcomes, event space, random variables in machine learning?


Reading through machine learning materials, I see people treating problems as if they were working with probability.

In particular, consider linear regression. I cannot figure out what the sample space, outcomes, events, and random variables are. In what sense is the term "probability measure" used in the field of machine learning?

For instance, please take a look at this article.

Moreover, in general, why do we need probability for machine learning? I don't see how we would calculate the probability of anything in machine learning. The need for linear algebra is obvious, since we work with vectors and matrices of numbers, but what properties of probability make it essential for machine learning?


There are 2 best solutions below


Well, linear regression is used to model the relationship between input X and output/outcome Y.

Outcome: The y-value for each sample.

Sample space: all possible outcomes. As in the figure in the Wikipedia article, this could be all real numbers or a subset of them. (Compare with a binary classification problem, where the sample space is {-1, 1}.)

Event: anything you want to assign a probability to, e.g. "How likely is it to observe an outcome between 1 and 3?"

Random variables: the X values.
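To make that mapping concrete, here is a small sketch (the numbers and noise model are my own illustration, not from the answer above): linear regression viewed probabilistically, where the outcome $Y$ is a random variable distributed around the regression line, and the probability of an event such as "an outcome between 1 and 3" is computed from that distribution.

```python
import numpy as np
from math import erf, sqrt

rng = np.random.default_rng(0)

# Assumed model: Y | X=x ~ Normal(w*x + b, sigma^2).
# w_true, b_true, sigma are made-up values for this illustration.
w_true, b_true, sigma = 2.0, 1.0, 0.5
x = rng.uniform(0, 10, size=100)                          # random variable X
y = w_true * x + b_true + rng.normal(0, sigma, size=100)  # outcomes, in R

# Least-squares fit -- equivalent to maximum likelihood
# under the Gaussian-noise assumption.
A = np.column_stack([x, np.ones_like(x)])
(w_hat, b_hat), *_ = np.linalg.lstsq(A, y, rcond=None)

# Probability of an *event*, e.g. P(1 <= Y <= 3) at a given x,
# under the fitted model, via the normal CDF.
def normal_cdf(z):
    return 0.5 * (1 + erf(z / sqrt(2)))

mu = w_hat * 0.5 + b_hat  # predicted mean at x = 0.5
p_event = normal_cdf((3 - mu) / sigma) - normal_cdf((1 - mu) / sigma)
```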

I'm not sure I understand "Why do we need probability for machine learning?". Maybe I haven't thought about it enough yet, but for me probability is the point of machine learning.

You need probability to find out how "good" or likely your model (classifier, regressor, ...) is. You need concepts like a probability distribution to make sense of data and to find out which features are actually useful (e.g. PCA). (Not all problems have one input variable x and one output variable y; that would be way too nice!)
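As a toy illustration of the PCA point (the data and sizes here are hypothetical), one way to find useful feature directions is to ask which directions carry the most variance:

```python
import numpy as np

rng = np.random.default_rng(1)
# Toy data: 200 samples, 3 features, where one direction
# carries most of the variance by construction.
X = rng.normal(size=(200, 3)) @ np.diag([3.0, 1.0, 0.1])

# PCA via SVD of the centered data matrix.
Xc = X - X.mean(axis=0)
U, S, Vt = np.linalg.svd(Xc, full_matrices=False)

# Fraction of total variance explained by each principal component;
# the first component should dominate.
explained_variance_ratio = S**2 / np.sum(S**2)
```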

I used this document and the wikipedia article for explanations.


I'd like to answer your question in the context of computer vision. An example is image classification, where the labels are mutually exclusive (one label per image). Let's say there are $n$ classes.

The input features are the image itself (which can be preprocessed in a multitude of ways, such as RGB channel normalization, resizing, various data augmentation schemes, etc.). The features are transformed by the convolutional layers of a backbone (such as VGG-$16$, AlexNet, or ResNet) and then fed through a multi-layer perceptron and a softmax layer to ultimately get a probability vector, say $p$, with $n$ entries, one per class (sometimes $n+1$, with an extra background class, depending on whether there are "empty" images).

During training, using a cross-entropy loss function, we compute for a single image (ignoring batching for now)

$\sum_{i=1}^{n} -g_{i} \log(p_i)$, where $g_i$ is $1$ if the image's ground-truth label is $i$ and $0$ otherwise. For the true class $i$, as $p_i$ approaches $1$ the loss shrinks, meaning a good prediction; as $p_i$ approaches $0$ the loss grows large, meaning a bad prediction. The network learns to model the true conditional distribution $p(Class_{i}|Image)$, which is $1$ for images whose classification label belongs to class $i$ and $0$ otherwise.
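Numerically, the sum above reduces to $-\log(p_i)$ for the true class $i$, since every other $g_j$ is zero. A minimal NumPy sketch (the logits and class index are made up for illustration):

```python
import numpy as np

def softmax(logits):
    z = logits - logits.max()  # subtract max for numerical stability
    e = np.exp(z)
    return e / e.sum()

# Hypothetical logits for n = 4 classes; ground-truth class index is 2.
logits = np.array([1.0, 0.5, 3.0, -1.0])
p = softmax(logits)          # probability vector, entries sum to 1
g = np.zeros(4)
g[2] = 1.0                   # one-hot ground-truth vector

# Cross-entropy: sum_i -g_i * log(p_i); only the true class contributes.
loss = -np.sum(g * np.log(p))
```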

The optimizer (stochastic gradient descent, Adam, etc.) then takes gradients of the loss with respect to the model variables that require gradient updates, and updates the weights according to the optimizer's algorithm.

For some other computer vision uses of probability...

In object detection, confidence scores (probabilities) are assigned to the detections output by a model, in addition to bounding boxes. This is done to threshold which detections are evaluated when computing mean average precision or mean average recall, the commonly used evaluation metrics.

For a detection whose bounding-box IoU with a ground-truth box of category label $c$ exceeds some threshold, say $0.5$, the model should learn to output high confidence values reflecting how certain the prediction is. The probability threshold yields a trade-off between precision and recall. Set it too low? You will have a less precise model. Set it too high? You will miss predictions.
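For reference, the IoU used in that matching step can be sketched as follows (the `(x1, y1, x2, y2)` box format is an assumption on my part; detection frameworks vary):

```python
def iou(box_a, box_b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    # Coordinates of the intersection rectangle.
    x1 = max(box_a[0], box_b[0])
    y1 = max(box_a[1], box_b[1])
    x2 = min(box_a[2], box_b[2])
    y2 = min(box_a[3], box_b[3])
    # Clamp at zero in case the boxes don't overlap.
    inter = max(0, x2 - x1) * max(0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter)

# Example: 25 units of overlap over a union of 175 units,
# below a 0.5 matching threshold.
score = iou((0, 0, 10, 10), (5, 5, 15, 15))  # 25 / 175
```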

In semantic segmentation of an image of size $(600, 600)$ with, say, $C$ classes (including the background), the model outputs a feature-map tensor of size $(600, 600, C)$; by choosing at each pixel the channel with the maximal output, we decide which class that pixel belongs to for the inference prediction.
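That per-pixel decision can be sketched in a few lines (the random feature map stands in for real model output; $C = 5$ is an arbitrary choice): a softmax over the channel axis turns scores into a per-pixel probability distribution, and an argmax picks each pixel's class.

```python
import numpy as np

rng = np.random.default_rng(2)
H, W, C = 600, 600, 5  # hypothetical sizes; C includes the background class

# Stand-in for the model's per-pixel class scores (logits).
feature_map = rng.normal(size=(H, W, C))

# Per-pixel softmax over the channel axis: each pixel gets a
# probability distribution over the C classes.
e = np.exp(feature_map - feature_map.max(axis=-1, keepdims=True))
probs = e / e.sum(axis=-1, keepdims=True)

# Argmax over channels gives the predicted class per pixel.
pred = probs.argmax(axis=-1)  # shape (600, 600), values in {0, ..., C-1}
```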