On the notation of the likelihood function


Let $X$ be a random variable with observed realization $x$, i.e., the event $(X=x)$ has occurred. The corresponding likelihood function is given by

$$\mathcal{L}_x:\Theta\rightarrow[0,1]$$ $$\theta\mapsto P(X=x|\theta)$$

for a space $\Theta$ of parameter configurations $\theta$.

In the literature, $\mathcal{L}_x(\theta)$ is sometimes written as $\mathcal{L}(\theta|X=x)$. I assume this is done to emphasize that the event $(X=x)$ is 'given'. However, this notation leads to confusion, since it suggests that $\mathcal{L}_x$ is a probability density ('conditioning' on the event $X=x$), which appears not to be true in general (cf. the second answer in this thread on math.overflow).

So my questions are:

  1. Is $\mathcal{L}(\theta|X=x)$ just 'overloading' the notation $f(\cdot|\cdot)$, or is there some hidden meaning/analogy to conditional probability $P(\cdot|\cdot)$ which I am missing?
  2. Are there other areas in mathematics where $f(\cdot|\cdot)$ is used? Could you provide an example?

Currently, I think $\mathcal{L}(\theta|X=x)$ is a bad notational choice, because it caused me confusion when I was trying to understand the likelihood function, especially since at every point $\theta$ one has $\mathcal{L}(\theta|X=x)=P(X=x|\theta).$

There are two answers below.

Answer 1

In classical (frequentist) statistics, $\theta$ is an unknown constant, so there is no sense in viewing $L(\theta|X)$ in any probabilistic manner. As you can see in the linked thread, $L(\theta|X)$ does not even need to integrate (w.r.t. $\theta$) to one. Hence, the more common notation is $L(\theta; X=x)$, or its shorthands $L(\theta; X)$ and $L(\theta; x)$, which simply designates the fact that we view it as a function of $\theta$ over the parameter space $\Theta$ and regard $X$ as the constant $X=x$.
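The point that the likelihood need not integrate to one over $\Theta$ is easy to check numerically. A minimal R sketch, borrowing the binomial example from the other answer ($x = 8$ successes in $n = 12$ trials) purely for illustration:

```r
# Likelihood of theta given x = 8 successes in n = 12 Bernoulli trials:
# L(theta; x) = choose(12, 8) * theta^8 * (1 - theta)^4 on (0, 1).
L <- function(theta) dbinom(8, 12, theta)

# Integrating over the parameter space (0, 1) does NOT give 1:
# by the beta integral, the exact value is 1/(n + 1) = 1/13.
area <- integrate(L, lower = 0, upper = 1)$value
area    # about 0.0769, not 1
```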

If you are a Bayesian, then you usually denote the posterior distribution of $\theta$ by $f(\theta|...)$ or $p(\theta|...)$, so as not to confuse it with the classical likelihood function. But for this notation to make sense, you must assume a prior distribution $f(\theta)$ for $\theta$. That is, from the very beginning you regard $\theta$ as a random variable and not as a constant.
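For contrast, here is a sketch of the Bayesian route in R, assuming (for illustration only) a uniform prior on $\theta$ and the same binomial data: once a prior is in place, the posterior is a genuine density and does integrate to one.

```r
# Binomial data (x = 8, n = 12) with a uniform prior f(theta) = 1 on (0, 1).
# Posterior: f(theta | x) is proportional to L(theta; x) * f(theta).
prior      <- function(theta) dunif(theta)          # f(theta)
likelihood <- function(theta) dbinom(8, 12, theta)  # L(theta; x)
evidence   <- integrate(function(t) likelihood(t) * prior(t), 0, 1)$value
posterior  <- function(theta) likelihood(theta) * prior(theta) / evidence

# Unlike the bare likelihood, the posterior integrates to 1;
# with this uniform prior it is exactly the Beta(9, 5) density.
integrate(posterior, 0, 1)$value    # 1
```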

Answer 2

Consider data $x = 8$ sampled from a binomial population with success parameter $\theta$ (unknown and to be estimated) and a known number of Bernoulli trials $n = 12.$

PDF. If it happens that $\theta = 0.5,$ then the PDF gives you the probability of each value $X = x:$ $$f(x |n=12, \theta=0.5) = {n \choose x}\theta^x(1-\theta)^{n-x}\\ = {12 \choose x}.5^x(1-.5)^{12-x} = {12\choose x}(.5)^{12},$$ for $x = 0, 1, \dots, 12.$

plot(0:12, dbinom(0:12, 12, 1/2), type="h", 
     lwd=2, col="blue", ylab="PDF", xlab="x", 
     main="PDF of BINOM(12, 1/2)")
 abline(h = 0, col="green2")

[Figure: PDF of BINOM(12, 1/2)]

Likelihood. Now, if you have observed $x = 8$ and wish to find the corresponding estimate $\hat \theta$ of $\theta,$ then the PDF, considered now as a likelihood function, might be written as

$$\mathcal{L}(\theta|x,n) = \mathcal{L}(\theta|8,12)\propto \theta^x(1-\theta)^{n-x} = \theta^8(1-\theta)^4,$$ for $0 < \theta < 1,$ where the symbol $\propto$ (read 'proportional to') is used as a reminder that the (now irrelevant) constant ${n\choose x}$ has been omitted. Maximizing $\mathcal{L}(\theta|8,12)$ in $\theta,$ we obtain the maximum likelihood estimate (MLE) $\hat\theta = x/n = 8/12 = 2/3.$

th = seq(0, 1, by=.01)
like = dbinom(8, 12, th)
plot(th, like, type= "l", ylab="Likelihood", 
     xlab = "theta", lwd=2, 
     main="Likelihood Function")
 abline(h=0, col="green2")
 abline(v=8/12, col="maroon")

mle = th[like==max(like)];  mle
[1] 0.67

[Figure: likelihood function of theta for x = 8, n = 12, with a vertical line at the MLE 8/12]
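The grid above resolves $\theta$ only to two decimal places (hence 0.67). As a quick sanity check, refining the maximization with R's optimize() recovers the exact MLE $x/n = 8/12 \approx 0.6667:$

```r
# Maximize the likelihood over (0, 1) directly instead of on a grid.
opt <- optimize(function(th) dbinom(8, 12, th),
                interval = c(0, 1), maximum = TRUE)
opt$maximum    # about 0.6667, i.e. the exact MLE x/n = 8/12
```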

Comparison. It is tempting to say that the PDF and the likelihood function are the 'same thing', but that is not exactly true.

  • The PDF is a function of $x,$ for given parameters $n$ and $\theta,$ as in the first plot above.

  • The likelihood function is a function of the unknown $\theta$ for the known values $n = 12$ and $x = 8,$ as in the second plot.

It is not surprising that this difference in viewpoint shows up explicitly in the notation.
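The two bullet points can be seen directly in code: the same dbinom call, read as a function of $x$ with $\theta$ fixed, sums to one over its support, while read as a function of $\theta$ with $x$ fixed it is just a nonnegative function peaking at the MLE. A sketch using the same $n = 12,$ $x = 8:$

```r
# Reading 1 (PDF): a function of x, with n = 12 and theta = 0.5 fixed.
pdf_vals <- dbinom(0:12, size = 12, prob = 0.5)
sum(pdf_vals)               # 1: probabilities over the support sum to one

# Reading 2 (likelihood): a function of theta, with n = 12 and x = 8 fixed.
th <- seq(0.01, 0.99, by = 0.01)
like_vals <- dbinom(8, 12, th)
th[which.max(like_vals)]    # 0.67 on this grid, near the MLE 2/3
```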