Can I think of probabilities as proportions instead?


I am new to probability theory, so bear with me if I do not nail all of the terminology (I will still try my best)! Also, I gave a short "What is my question" sentence, but I invite you to read the whole question if you want the full picture. To help with that, I added an index that summarizes what the separate sections are about at a glance!

My Question: [Thinking of probabilities as proportions helps me understand what the probabilities of abstract mathematical concepts really mean. Is this allowed and accurate for me to do? Moreover, is this a common technique? Lastly, why do people describe the probability from the SoftMax function as if it were the "chance" of a context-word being near a target-word in the raw text, when it is most certainly not that prior to the final trained model?]

Index:

1.) Introduction to Probability as a Proportion of Magnitude: Viewing probability through the lens of a normalization technique, and how that allows it to be seen as a proportion.

2.) Application to Word2Vec Model: Brief description of where the probability seemingly lies in the Word2Vec model.

3.) Misconceptions in Terminology: How this probability is misinterpreted because it is thought of as a probability in the conventional sense (how most of society uses the word), and how that misinterpretation hides the way the probability's meaning changes from the untrained to the trained model.

4.) Mathematical Interpretation of SoftMax Generated Probability: Showing how mathematically the probability is better described as a proportion.

5.) From Inherent Meaning to Human Interpretation: How we as humans draw conclusions about similarity between words from probabilities that seem to have no inherent meaning, along with the meaning of the word embeddings.

Introduction to Probability as a Proportion of Magnitude:

Probabilities are really just scalars in the range $[0,1]$ that tell you what proportion of the entire set's magnitude is made up by a single item in that set (since the probabilities over a set always sum to $1$). So probability is really just a normalization technique, and on that same note, this means that literally anything with a magnitude can be normalized into a probability. Going even deeper, items without an obvious magnitude can still be made into a probability as long as there is a set of them, since treating an item as a proportion of a whole set inherently gives that item a magnitude, right? For instance, when we talk about the probability of a coin landing heads or tails, we are actually counting what proportion one side of the coin makes up out of the entire set of sides (summed together).
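To make the normalization view concrete, here is a minimal sketch (the function name `to_proportions` and the example numbers are my own, purely for illustration): any set of non-negative magnitudes can be turned into values in $[0,1]$ that sum to $1$ just by dividing each item by the total.

```python
def to_proportions(magnitudes):
    """Normalize non-negative magnitudes into proportions that sum to 1."""
    total = sum(magnitudes)
    return [m / total for m in magnitudes]

# A fair coin: each side is one item out of a two-item set,
# so each side's proportion of the set is 1/2.
coin = to_proportions([1, 1])      # [0.5, 0.5]

# Any magnitudes work the same way, e.g. hypothetical word counts.
counts = to_proportions([30, 50, 20])
print(coin, counts)
```

Nothing probabilistic happened here; the outputs only become "probabilities" once we interpret the shares as chances.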

Application to Word2Vec Model:

The abstract mathematical concept turned into a probability that I am personally referring to is cosine similarity. More specifically, it is the probability produced by the SoftMax function in the Word2Vec model, which gets described as the "likelihood of a word being next to a target word in a text". That phrasing implies a "probability" in the conventional sense, where we can calculate the true chance of something happening from the start, with just the item and the set given, and no need for observational analysis (like a coin flip). In reality, the Word2Vec probability is one of proportion of total magnitude.

Misconceptions in Terminology:

This is a problem because people keep describing this probability scalar as (like I said earlier):

"The likelihood/probability of a context-word being next to a target-word in the raw text"

Which leads down a slippery slope! This is only true at the very END of training; that is the only time the above quote holds. But many people I have interacted with have fallen down this slope. They reason that, since the probability of a coin toss describes, from start to end, the "chance" of which side the coin will land on, they can apply the same thinking to the probability scalar in Word2Vec, saying that from start to end it describes the "chance"/probability/likelihood of a context-word being next to a target-word in the raw text.
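A small sketch of why the untrained numbers cannot be "chances" (the embedding dimensions and vocabulary size here are arbitrary, invented for illustration): before training, the embeddings are random, so the SoftMax outputs are perfectly valid proportions yet say nothing about which words actually co-occur in the raw text.

```python
import math
import random

def softmax(scores):
    """Normalize scores into proportions that sum to 1."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

random.seed(0)

# Hypothetical untrained model: the target and vocabulary embeddings
# start as random vectors, so their dot products reflect nothing
# about the raw text yet.
target = [random.uniform(-1, 1) for _ in range(5)]
vocab = [[random.uniform(-1, 1) for _ in range(5)] for _ in range(4)]

scores = [sum(t * w for t, w in zip(target, v)) for v in vocab]
probs = softmax(scores)
print(probs)  # valid proportions, but not "chances" of co-occurrence
```

The outputs sum to $1$ and look like probabilities from the very first step; only training gives them any connection to the text.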

Mathematical Interpretation of SoftMax Generated Probability:

THIS NEXT PART IS WORDY: In a purely mathematical sense, the probability from the SoftMax function is the proportion that one score, the dot product between a single context word-embedding vector and the target word-embedding vector (one item in a set), makes up out of the sum of the scores generated by every word in the vocabulary via the same dot-product process (the set of items). Strictly speaking, SoftMax exponentiates each dot product before normalizing, but the proportion picture still holds: each probability is one item's share of the exponentiated total. And this is consistent with the fact that a higher dot product between two word embeddings will always produce a higher probability than a smaller dot product.
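The proportion reading can be written out directly. A minimal sketch (the scores here stand in for hypothetical dot products between a target embedding and each vocabulary word's embedding):

```python
import math

def softmax(scores):
    """Each output is one exponentiated score's share of the total."""
    exps = [math.exp(s) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical dot-product scores, one per vocabulary word.
scores = [2.0, 1.0, 0.5]
probs = softmax(scores)

print(probs)       # proportions of the exponentiated total
print(sum(probs))  # always sums to 1
```

Because `exp` is strictly increasing, the ordering of the proportions always matches the ordering of the dot products, which is exactly the "higher score, higher probability" fact above.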

From Inherent Meaning to Human Interpretation:

In a purely mathematical sense, there is no inherent meaning created by this probability, because the proportions of magnitudes come from an abstract mathematical concept, cosine similarity. They are only proportions, not "chances". By "no inherent meaning" I mean meaning in the conventional sense of giving you the "chance" of something happening, like a coin flip landing heads.

We as human beings recognize that the cosine similarity scalar (simplified here to the dot product between two vectors) tells us how close two vectors are in vector space. This means that two word-embedding vectors that are very close to each other in vector space will give a high probability scalar (because they first give a high cosine similarity scalar). So we as humans can assign meaning to the probability scalar by stating the fact that the higher the probability scalar, the closer the two word-embedding vectors are in vector space. Using this statement/fact like an axiom, we can then try to make words that appear close to each other in the raw text have word-embedding vectors that are close to each other in vector space (similar).

But this alone still gives no inherent link between word similarity and two words being close to each other in vector space. We are missing the final, and arguably most important, puzzle piece: the distributional hypothesis, which assigns meaning to the word-embedding vectors (and to the probability/proportion as well). Since two word-embedding vectors being close in vector space means the two words appear close together in the raw text, the distributional hypothesis lets us state that two word-embedding vectors being close in vector space means the two words represented by those vectors are semantically similar.
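The "closeness in vector space" that we humans read off can be computed directly. A small sketch with invented 3-dimensional embeddings (real Word2Vec vectors have hundreds of dimensions; these toy vectors and words are mine, chosen only to illustrate the geometry):

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between u and v: near 1 means nearby directions."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Hypothetical toy embeddings.
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.75, 0.2]
banana = [0.1, 0.05, 0.9]

print(cosine_similarity(king, queen))   # high: nearby in vector space
print(cosine_similarity(king, banana))  # low: far apart in vector space
```

The numbers themselves carry no semantics; it is only the distributional hypothesis, as described above, that licenses reading "nearby vectors" as "semantically similar words".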

So by the end of training the pipeline looks like this: two semantically similar words are trained to have word-embedding vectors near each other in vector space -> which means the cosine similarity is higher -> which means the probability scalar is higher -> which finally means we can assign meaning to this probability, stating, "The higher the probability scalar, the more semantically similar two words are (in a trained model)".

And so, in conclusion, I think that probability in many cases, and especially here, should be called proportion instead, because what we are really observing most of the time from the "probability" is the proportion of the magnitude that one item of a set makes up out of the whole set, normalized to $[0,1]$.