What does "Choose N ~ Poisson(ξ), Choose θ ~ Dir ( α )" mean in the context of Latent Dirichlet Allocation


I'm reading http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf and trying to understand the notation and concepts behind LDA, in order to implement it myself. I've followed some tutorials about the Poisson and Dirichlet distributions, but I'm not super comfortable with them as topics yet.

Can someone explain what is meant on page 4 of the PDF:

LDA assumes the following generative process for each document w in a corpus D:

  1. Choose N ~ Poisson(ξ).
  2. Choose θ ~ Dir(α).

What are these symbols referring to? Extracting words from the Poisson Distribution? How is that even possible? And extracting parameters from a Dirichlet distribution is equally confusing.


There are 2 best solutions below

Best answer

LDA is a hierarchical Bayesian model that represents each document $w$ in a corpus $D$ as a "bag of words" of variable length $N$ with a particular mixture of $k$ topics.

  • $N=$ the number of words that the document contains
  • $\theta=(\theta_1,\theta_2,\ldots,\theta_k)$, where $\theta_i$ is the probability that a randomly selected "generic" word in $w$ belongs to topic $i$, for $i\in \{1,\ldots,k\}$

Therefore, the paper is saying that document lengths are modelled as Poisson distributed (i.e., $N_{w_i}\sim Poi(\xi)$) and that the probability that a given "word" in that document belongs to topic $j$ is $\theta_j$, where $\theta=(\theta_1,\theta_2,\ldots,\theta_k)$ is a random vector of multinomial parameters drawn from the Dirichlet distribution $Dir(\alpha)$.

The paper goes on to say that each word in the overall "vocabulary" $V$ of the corpus has a different frequency of occurrence under each topic. Therefore, the overall process for generating a document within a corpus is:

  • The number of word "slots" is determined by a random variable $N_i$ that is distributed $Poi(\xi)$.
  • Each word "slot" needs to be assigned a topic according to a multinomial distribution whose parameter is the k-vector $\theta$. The k-vector is itself a random variable that follows $Dir(\alpha)$.
  • Now, fill in each "slot" with an actual word from a predetermined vocabulary $V$. You do this for each slot by first generating a topic $T$ from the multinomial distribution determined by the previously generated $\theta$. The frequency of each word in $V$ will vary by topic, so once the topic is chosen, you pick a word ($p_i$) from $V$ using the multinomial distribution $P(p_i=p \mid T)$, $p\in V$.
  • Repeat for all $N_i$ slots.
  • Repeat entire process for each document.

That is the Bayesian generative process that is used for determining the word distributions for each topic.
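The steps above can be sketched in code. This is only a minimal illustration, assuming NumPy and a made-up setup: the name `beta` (a $k \times |V|$ matrix whose row $t$ holds the word probabilities under topic $t$), the function name `generate_document`, and all numeric values are hypothetical choices for the example, not something fixed by the description above.

```python
import numpy as np

def generate_document(xi, alpha, beta, rng):
    """Sample one document from the generative process described above.

    xi    : Poisson rate for the document length (illustrative value)
    alpha : length-k Dirichlet parameter (all entries positive)
    beta  : k x V matrix; row t is the word distribution under topic t
            (an assumed input for this sketch)
    """
    n_words = rng.poisson(xi)      # N ~ Poisson(xi): number of word slots
    theta = rng.dirichlet(alpha)   # theta ~ Dir(alpha): topic proportions
    k, vocab_size = beta.shape

    topics, words = [], []
    for _ in range(n_words):
        # Pick a topic for this slot using the document's topic proportions.
        topic = int(rng.choice(k, p=theta))
        # Pick a word id from the vocabulary using that topic's distribution.
        word = int(rng.choice(vocab_size, p=beta[topic]))
        topics.append(topic)
        words.append(word)
    return theta, topics, words

rng = np.random.default_rng(0)
alpha = np.array([0.5, 0.5, 0.5])          # k = 3 topics (hypothetical)
beta = rng.dirichlet(np.ones(6), size=3)   # arbitrary word distributions, V = 6
theta, topics, words = generate_document(xi=8.0, alpha=alpha, beta=beta, rng=rng)
print(theta, topics, words)
```

Calling `generate_document` once per document, each time with a fresh $N$ and $\theta$, corresponds to the "repeat for each document" step. Fitting LDA to real data runs in the opposite direction: the words are observed and the topic structure is inferred.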


Poisson(ξ): Poisson distribution with parameter ξ (a positive real number).

Dir(α): Dirichlet distribution with parameter α (a vector of positive real numbers).
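To make "Choose N ~ Poisson(ξ)" and "Choose θ ~ Dir(α)" concrete, here is a small sketch, assuming NumPy. The values of `xi` and `alpha` are made up for illustration. "Choose" simply means drawing one random sample: the Poisson draw is a single non-negative integer (used as the document length), and the Dirichlet draw is a vector of non-negative numbers that sum to 1 (used as the topic proportions).

```python
import numpy as np

rng = np.random.default_rng(42)

# N ~ Poisson(xi): one non-negative integer, here the number of words.
xi = 10.0  # hypothetical rate; also the expected document length
n_words = rng.poisson(xi)

# theta ~ Dir(alpha): a vector with one entry per topic.
# Its entries are non-negative and sum to 1, so it is itself a
# probability distribution over topics.
alpha = np.array([0.5, 0.5, 0.5])  # hypothetical parameter for k = 3 topics
theta = rng.dirichlet(alpha)

print("N =", n_words)
print("theta =", theta, "sum =", theta.sum())
```

No words are drawn from the Poisson distribution itself; it only decides how many words the document will have.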