I'm reading http://machinelearning.wustl.edu/mlpapers/paper_files/BleiNJ03.pdf and trying to understand the notation and concepts behind LDA, in order to implement it myself. I've followed some tutorials about the Poisson and Dirichlet distribution but I'm not super comfortable with them as topics yet.
Can someone explain what is meant on page 4 of the PDF:
LDA assumes the following generative process for each document w in a corpus D:
- Choose N ~ Poisson(ξ).
- Choose θ ~ Dir(α).
What are these symbols referring to? Extracting words from the Poisson Distribution? How is that even possible? And extracting parameters from a Dirichlet distribution is equally confusing.
LDA is a hierarchical Bayesian model that represents each document $w$ in a corpus $D$ as a "bag of words" of variable length $N$, with a particular mixture of $k$ topics.
Therefore, the paper is saying that document lengths are modelled as Poisson distributed (i.e., $N \sim \mathrm{Poisson}(\xi)$), and that the probability that a given "word" in that document belongs to topic $j$ is $\theta_j$, where $\theta = (\theta_1, \theta_2, \dots, \theta_k)$ is the parameter vector of a multinomial distribution, itself drawn at random from the Dirichlet distribution $\mathrm{Dir}(\alpha)$. So nothing is being "extracted from" the Poisson or Dirichlet distributions directly: the Poisson draw gives the document's length, and the Dirichlet draw gives the document's topic proportions.
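To make those two draws concrete, here is a minimal sketch in NumPy. The values $\xi = 8$, $k = 3$, and $\alpha = 0.5$ are arbitrary illustrative choices, not values from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Document length: a single draw from a Poisson with mean xi.
xi = 8
N = rng.poisson(xi)          # a non-negative integer

# Topic mixture: a single draw from a symmetric Dirichlet over k = 3 topics.
alpha = np.full(3, 0.5)
theta = rng.dirichlet(alpha)  # a length-3 probability vector summing to 1

print(N)
print(theta)
```

So a "draw from the Dirichlet" is just a random probability vector; it becomes the parameter $\theta$ of the multinomial that assigns topics to words.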
The paper goes on to say that each word in the overall vocabulary $V$ of the corpus has a different frequency of occurrence under each topic. Therefore, the overall process for generating a document within a corpus is:

- Choose the document length $N \sim \mathrm{Poisson}(\xi)$.
- Choose the topic mixture $\theta \sim \mathrm{Dir}(\alpha)$.
- For each of the $N$ words $w_n$: choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$, then choose the word $w_n$ from $p(w_n \mid z_n, \beta)$, the word distribution of topic $z_n$.

That is the Bayesian generative process that is used for determining the word distributions for each topic.
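The whole generative story can be simulated end to end. In this sketch every concrete number (3 topics, a 10-word vocabulary, mean length 8, the $\beta$ matrix drawn at random) is a toy assumption for illustration, not something taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(1)

k, V, xi = 3, 10, 8        # toy setup: 3 topics, 10 vocabulary words, mean length 8
alpha = np.full(k, 0.5)    # Dirichlet hyperparameter (illustrative)

# beta[j, v] = probability of vocabulary word v under topic j.
# Each row is drawn from a Dirichlet here simply to get valid distributions.
beta = rng.dirichlet(np.ones(V), size=k)

def generate_document():
    N = rng.poisson(xi)                # 1. document length
    theta = rng.dirichlet(alpha)       # 2. per-document topic mixture
    words = []
    for _ in range(N):
        z = rng.choice(k, p=theta)     # 3a. pick a topic for this word slot
        w = rng.choice(V, p=beta[z])   # 3b. pick a word from that topic's distribution
        words.append(w)
    return words

doc = generate_document()
print(doc)  # a list of word indices into the vocabulary
```

Running `generate_document` repeatedly yields documents of varying length whose word usage reflects each document's own random topic mixture, which is exactly the process the quoted passage describes.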