Math Behind Latent Dirichlet Allocation.


I am trying to implement LDA from scratch but cannot work out the correct update formulas for the parameters. To implement LDA one needs to be familiar with a few concepts, which are given in the snippet below. Only the scipy and numpy libraries may be used, without any for loops.

Simplified LDA (each document has only one theme)

Now we will train the simplified LDA from the seminar. In this model, every document is paired with exactly one theme. The probabilistic model looks as follows:

Random variables:

  • $w_{dn}$ -- word at $n$'th position in document $d$ (observed variable)
  • $t_d$ -- theme of document $d$ (latent variable)

Model parameters:

  • $\Phi = \{\phi_{tw}\} \in \mathbb{R}^{T\times V}$ -- matrix of word distribution in topics ($T$ is number of topics, $V$ is dictionary size)
  • $\pi \in \mathbb{R}^T$ -- vector of the distribution of themes over the document corpus

Joint distribution on words in documents and themes of documents is defined as:

$$p(W, t \;|\; \Phi, \pi) = p(W \;|\; t, \Phi) p(t \;|\; \pi) = \prod_{d=1}^D \prod_{n=1}^{N_d}p(w_{dn}\;|\;t_d, \Phi) p(t_d \;|\; \pi) = \prod_{d=1}^D \prod_{n=1}^{N_d} \phi_{t_d w_{dn}}\pi_{t_d}$$
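As a sanity check, the joint density above can be evaluated directly for fixed theme assignments. Below is a minimal sketch (the function name `log_joint` is my own, not from the post), working in log-space to avoid underflow:

```python
import numpy as np

# Log of the joint density p(W, t | Phi, pi) for fixed theme assignments t,
# assuming bow is a (D, V) count matrix, t is a (D,) array of theme indices,
# pi is (T,), phi is (T, V); eps guards log(0).
def log_joint(bow, t, pi, phi, eps=1e-15):
    # log p(W, t | Phi, pi) = sum_d [ log pi_{t_d} + sum_w BOW_dw * log phi_{t_d, w} ]
    return (np.log(pi[t] + eps) + (bow * np.log(phi[t] + eps)).sum(axis=1)).sum()
```

Note that `phi[t]` fancy-indexes the (D, V) matrix of per-document theme rows, so the whole expression stays vectorized.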

Since our model contains latent variables, we will optimize its parameters with EM-algorithm:

  • E-step: $KL(q(t) \;||\; p(t\;|\;\Phi, \pi) ) \to \min_{q(t)}$
  • M-step: $\mathbb{E}_{q(t)} \log p(W, t \;|\; \Phi, \pi) \to \max_{\Phi, \pi}$

In the next task you will need to implement this simplified model. Let's go over the necessary formulas for the E-step, the M-step, and the optimized functional (a lower bound on the log-likelihood).

E-step

E-step requires finding the posterior distribution $q(t_d)$ for each document d. Since this is a categorical random variable, it is enough to know $q(t_d = t)$ for each theme $t = 1, \ldots, T$. This whole distribution can be represented as matrix $\mu \in \mathbb{R}^{D \times T}$, where $\mu_{dt} = q(t_d = t)$.

The explicit formula here is quite large. A more convenient way to define it looks like this:

$$ \log\mu_{dt} = \log\pi_t + \sum\limits_{w = 1}^{V}BOW_{d, w}\log\phi_{t, w} + const $$

Here $BOW_{d, w}$ is the number of times word $w$ appears in document $d$; this is exactly the format in which we store our texts.
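For concreteness, here is a toy illustration of the BOW format assumed above (the vocabulary and documents are made up for illustration): rows are documents, columns are vocabulary words, entries are occurrence counts.

```python
import numpy as np

vocab = ["cat", "dog", "fish"]                # V = 3
docs = [["cat", "cat", "dog"], ["fish"]]      # D = 2
index = {w: i for i, w in enumerate(vocab)}   # word -> column index

# Count occurrences of each vocabulary word in each document
bow = np.zeros((len(docs), len(vocab)), dtype=int)
for d, doc in enumerate(docs):
    for w in doc:
        bow[d, index[w]] += 1
# bow is now [[2, 1, 0], [0, 0, 1]]
```

(The for loops here are only for building the toy input; the LDA computations themselves stay vectorized.)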

$const$ is defined by the condition $\sum\limits_{t = 1}^{T}\mu_{dt} = 1$.

Essentially, the simplest way to calculate it in practice is to first compute

$$ A_{dt} = \log\pi_t + \sum\limits_{w = 1}^{V}BOW_{d, w}\log\phi_{t, w} $$ and then take $$ \mu_{dt} = Softmax(A_{dt}), $$

where Softmax is taken over the last dimension, corresponding to $T$. This will ensure that the rows of the matrix indeed form a probability distribution.
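Putting the two formulas together, the E-step can be sketched in a few vectorized lines (a sketch assuming `bow` is a dense (D, V) count matrix; `scipy.special.softmax` performs the row-wise normalization):

```python
import numpy as np
from scipy.special import softmax

# Vectorized E-step: compute mu[d, t] = q(t_d = t), assuming pi is (T,),
# phi is (T, V); eps guards log(0).
def e_step(bow, pi, phi, eps=1e-15):
    # A[d, t] = log pi_t + sum_w BOW[d, w] * log phi[t, w]
    A = np.log(pi)[None, :] + bow @ np.log(phi + eps).T   # shape (D, T)
    # softmax over the topic axis normalizes each row to a distribution
    return softmax(A, axis=1)                             # mu, shape (D, T)
```

Working with `A` in log-space and normalizing via softmax is numerically stable, since softmax internally subtracts the row maximum.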

M-step

Using the matrix $\mu$, defined previously, it is easy to write M-step formulas for parameters of the model $\pi$ and $\Phi$:

$$ \pi_t = \frac{\sum\limits_{d}\mu_{dt}}{D} $$

$$ \phi_{tw} = \frac{\sum\limits_{d}\mu_{dt}BOW_{d,w}}{\sum\limits_{w}\sum\limits_{d}\mu_{dt}BOW_{d, w}} $$
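These two updates translate almost literally into numpy (a sketch, assuming `mu` is a dense (D, T) matrix; the denominator of $\phi_{tw}$ is just the row sums of the numerator):

```python
import numpy as np

# Vectorized M-step: update pi (T,) and phi (T, V) from mu (D, T) and bow (D, V).
def m_step(mu, bow):
    pi = mu.sum(axis=0) / mu.shape[0]                 # pi_t = sum_d mu_dt / D
    counts = mu.T @ bow                               # (T, V): sum_d mu_dt * BOW_dw
    phi = counts / counts.sum(axis=1, keepdims=True)  # normalize over words
    return pi, phi
```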

Evidence lower bound

Since the ELBO is the functional optimized by the EM-algorithm, it is crucial to keep an eye on its values (at the very least, it should increase after each step, which makes it a very useful debugging tool). Here it is equal to:

$$ ELBO = \sum\limits_{d}\sum\limits_{t}\mu_{dt}\left[ \sum\limits_{w = 1}^{V} \log\phi_{tw}BOW_{d, w} + \log\pi_t\right] - \sum\limits_{d}\sum\limits_{t}\mu_{dt}\log\mu_{dt} $$

Tip: every time you see a sum like $\sum\limits_{k}A_{ik}B_{kj}$, it corresponds to some matrix multiplication: $$ (AB)_{ij} = \sum\limits_{k}A_{ik}B_{kj} $$

And if the indices are permuted, e.g. $\sum\limits_{k}A_{ik}B_{jk}$, this corresponds to a matrix product with one of the matrices transposed :) Use these observations whenever you need to calculate the corresponding functional.
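Following that tip, the ELBO reduces to two matrix products and an entropy term; a hedged sketch (the `np.where` implements the $0 \log 0 = 0$ convention for near-zero entries of $\mu$):

```python
import numpy as np

# Vectorized ELBO, assuming mu (D, T), bow (D, V), pi (T,), phi (T, V);
# eps guards log(0).
def elbo(bow, mu, pi, phi, eps=1e-15):
    # A[d, t] = sum_w BOW_dw * log phi_tw + log pi_t
    A = bow @ np.log(phi + eps).T + np.log(pi)[None, :]       # (D, T)
    # entropy term: -sum mu * log mu, skipping entries too small to matter
    entropy = np.where(mu > eps, mu * np.log(mu + eps), 0.0)  # (D, T)
    return (mu * A).sum() - entropy.sum()
```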

Make all the calculations vectorized and do not use for loops.

Could anyone familiar with this kindly help?

Here is the skeleton code.

import numpy as np


class SimpleLDA(object):
    def __init__(self, n_topics, epsilon=1e-15, tol=1e-5):
        self.n_topics = n_topics
        self.epsilon = epsilon
        self.tol = tol
        
    def fit(self, bow, verbose=True):
        self._initialize(bow)
        elbo = -np.inf
        for it in range(1000):
            self._e_step(bow)
            self._m_step(bow)
            new_elbo = self._count_elbo(bow)
            diff = new_elbo - elbo
            if verbose:
                print("\n{}:\n elbo: {}\n increase: {}".format(it, new_elbo, diff))
            if diff < self.tol:
                break
            elbo = new_elbo
    
    
    def _initialize(self, bow):
        D, V = bow.shape
        # mu is the (D, T) matrix of posterior theme probabilities; the E-step fills it in
        self.mu = np.full((D, self.n_topics), 1.0 / self.n_topics)
        self.pi = np.abs(np.random.randn(self.n_topics))
        self.pi = self.pi / self.pi.sum()
        self.phi = np.abs(np.random.randn(self.n_topics, V))
        self.phi = self.phi / self.phi.sum(axis=1)[:, np.newaxis]
        
    def _e_step(self, bow):
        self.mu = ...
        ### YOUR CODE HERE ###

    def _m_step(self, bow):
        self.pi = ...
        self.phi = ...
        
        ### YOUR CODE HERE ###

    def _count_elbo(self, bow):
        # mask of positions to sum: 1 means the value is large enough to take into account,
        # 0 means the value is too small and is omitted from the sum discussed above;
        # it is not necessary to use it, but it can be helpful
        mask = self.mu > self.epsilon
        elbo = ...  # YOUR CODE HERE
        return elbo