Understanding Empirical Data Distribution

I've been trying to understand this paper and am having trouble understanding this part:

"We can approximate $p(x,y)=p(x)p(y|x)$ using the empirical data distribution $$p(x,y) =\frac{1}{N} \sum_{n=1}^N \delta_{x_n}(x) \delta_{y_n}(y)$$"

In another part of the paper they say $p(y|y_n) =\delta_{y_n}(y)$.

I have some background in probability but none in statistics; I was able to figure out what an Empirical CDF is, but not a pdf like here, so I'm not sure exactly what the authors are doing. Does the $\delta$ refer to the Dirac delta distribution?

There is 1 best solution below

The empirical data distribution is a probability distribution which allocates probability $1/N$ to each point in the training dataset and 0 to everything else. More formally, it is supported on the $N$ points $(x_i,y_i)$ of the training set, each having probability mass $1/N$, so all other points have mass 0.

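Yes, $\delta_{x_n}(x)$ and $\delta_{y_n}(y)$ are point masses. In the discrete setting, $\delta_{x_n}(x)$ is the indicator function which is $1$ when $x=x_n$ and $0$ otherwise; similarly, $\delta_{y_n}(y)$ is $1$ when $y=y_n$ and $0$ otherwise. If $x$ and $y$ are continuous, the same symbols denote Dirac deltas, which express the same idea (all mass concentrated at one point) as a density.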
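
It may help to see why this form is convenient. As a short worked step (the test function $f$ is introduced here only for illustration and does not come from the paper), taking an expectation under the empirical distribution turns into a plain average over the training set, because each pair of deltas picks out exactly one data point:

$$\mathbb{E}_{p(x,y)}[f(x,y)] = \int\!\!\int f(x,y)\,\frac{1}{N}\sum_{n=1}^N \delta_{x_n}(x)\,\delta_{y_n}(y)\,dx\,dy = \frac{1}{N}\sum_{n=1}^N f(x_n,y_n)$$

In the discrete case the integrals are sums and the result is the same.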

$p(x,y)$ is then the joint probability mass, and you can check that it is 0 for any point not in the training set and $1/N$ for a point in the training set (assuming the training points are distinct; a point that appears $k$ times gets mass $k/N$).
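
To make the check concrete, here is a minimal sketch in Python for the discrete case. The toy dataset and the function name `empirical_pmf` are made up for illustration and are not from the paper; the function evaluates $\frac{1}{N}\sum_n \delta_{x_n}(x)\,\delta_{y_n}(y)$ directly, treating each delta as an indicator.

```python
from fractions import Fraction

# Hypothetical toy training set of (x_n, y_n) pairs, chosen only for illustration.
data = [(0, "a"), (1, "b"), (2, "a"), (3, "b")]
N = len(data)


def empirical_pmf(x, y):
    """Mass at (x, y): (1/N) * sum over n of 1[x == x_n] * 1[y == y_n]."""
    hits = sum(1 for (x_n, y_n) in data if x == x_n and y == y_n)
    return Fraction(hits, N)


print(empirical_pmf(0, "a"))  # a training point: 1/4
print(empirical_pmf(0, "b"))  # not a training point: 0
print(sum(empirical_pmf(x, y) for (x, y) in set(data)))  # total mass: 1
```

Exact fractions are used instead of floats so that the masses come out as $1/N$ exactly and the total is exactly 1.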