Reading this PDF, I came across a seemingly simple derivation that I can't reproduce. Basically, it asks for the probability of a word $w_{n}$ occurring given that we know another word $w_{n-1}$ appeared immediately before it. Usually this is computed with the Maximum Likelihood Estimate, which gives the probability:
$$P(w_{n}|w_{n-1}) = \displaystyle \frac{C(w_{n-1}w_{n})}{C(w_{n-1})}$$
where $C(\cdot)$ denotes the frequency (count) of a word or bigram in the corpus.
However, an alternative estimator called Laplace's law, or Expected Likelihood Estimation, gives the probability of the bigram $w_{n-1}w_{n}$ as
$$P(w_{n-1}w_{n}) = \frac{C(w_{n-1}w_{n})+1}{N + B} \tag{*}$$
where $N$ is the number of tokens in our sample and $B$ is the number of types, which in this case would be $B = V$ ($V$ = vocabulary size) for unigrams and $B = V^{2}$ for bigrams.
The PDF says that $P(w_{n}|w_{n-1}) = \displaystyle \frac{C(w_{n-1}w_{n})+1}{C(w_{n-1})+V}$, but I can't derive that. I expected to get that answer by using the definition of conditional probability:
$$P(w_{n}|w_{n-1})=\displaystyle \frac{P(w_{n-1}w_{n})}{P(w_{n-1})}$$
and substituting (*), but I don't get anything similar.
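To make the mismatch concrete, here is a quick numeric check with toy counts (all numbers hypothetical): the ratio of the two ELE estimates from (*) does not equal the formula in the slides.

```python
# Toy counts (hypothetical): V-word vocabulary, N tokens in the sample
V = 1000           # vocabulary size
N = 10_000         # number of tokens
c_bigram = 5       # C(w_{n-1} w_n)
c_prev = 50        # C(w_{n-1})

# The slide's add-one smoothed conditional: (C(w_{n-1}w_n)+1) / (C(w_{n-1})+V)
slide = (c_bigram + 1) / (c_prev + V)

# Ratio of the two ELE estimates from (*): joint uses B = V^2, marginal uses B = V
p_joint = (c_bigram + 1) / (N + V**2)
p_prev = (c_prev + 1) / (N + V)
ratio = p_joint / p_prev

print(slide, ratio)  # the two values differ
```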
Some help would be appreciated.
By the way, the part I'm referring to is contained in a slide titled "Laplace Add-One Smoothing"
Regards
Laplace smoothing is the result of maximum a posteriori (MAP) estimation of the conditional probability $P(w_n|w_{n-1})$ under a Dirichlet prior.
Specifically, we want to estimate the $|V|$-valued multinomial distribution $p_k = P(w_n=k|w_{n-1})$, where $V$ is the vocabulary. Let us impose a Dirichlet prior on this distribution: let $(p_1,\ldots,p_{|V|})$ be drawn from $\mathrm{Dir}(\alpha,\ldots,\alpha)$, where $\alpha > 0$ is the concentration parameter of the (symmetric) Dirichlet distribution.
The MAP estimate of $p_k$ maximizes the log-posterior:

$\hat{p}_k = \arg \max_{p_1,\ldots,p_{|V|}} \sum_{k_1=1}^{|V|} C(w_{n-1}, k_1)\log p_{k_1} + \log{\Gamma(\alpha|V|)} - |V|\log{\Gamma(\alpha)} + \sum_{k_1=1}^{|V|}(\alpha-1)\log p_{k_1}$

subject to $\sum_{k_1=1}^{|V|} p_{k_1} = 1$
($\Gamma(x)$ is the Gamma function)
The first term in the objective is the multinomial log-likelihood, while the remaining terms come from the Dirichlet prior. We can now use Lagrange multipliers to solve this constrained optimization problem: the stationarity conditions give $p_k \propto C(w_{n-1}, k) + \alpha - 1$, and normalizing yields the Laplace-smoothed bigram probability estimate:
$\hat{p}_k = \frac{C(w_{n-1}, k) + \alpha - 1}{C(w_{n-1}) + |V|(\alpha - 1)}$
Setting $\alpha = 2$ will result in the add one smoothing formula.
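A minimal sketch of the resulting estimator (toy counts, hypothetical values), checking that $\alpha = 2$ reproduces the add-one formula from the slides:

```python
def map_bigram_prob(c_bigram, c_prev, V, alpha):
    """MAP estimate of P(w_n = k | w_{n-1}) under a symmetric Dir(alpha) prior."""
    return (c_bigram + alpha - 1) / (c_prev + V * (alpha - 1))

# Toy counts: C(w_{n-1}, k) = 5, C(w_{n-1}) = 50, |V| = 1000
V = 1000
map_est = map_bigram_prob(5, 50, V, alpha=2)
add_one = (5 + 1) / (50 + V)   # the slide's add-one smoothed conditional
print(map_est, add_one)        # identical: alpha = 2 reduces to add-one smoothing
```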