Clustering elements according to covariance matrix

I'm doing a bit of topic modelling (which is not really my area) on tweets. The situation is the following: I have a (sort of) covariance matrix $C$ whose entry $C_{ij}$ is the frequency with which the words $i$ and $j$ occur together in a tweet.

Given this matrix $C$, I would like to automatically cluster words into different topics. However, since my background is neither statistics nor data analysis, I think I need the correct terms to search for. I'm not sure whether k-means or PCA is what I need.

Ideally, I would end up with a number of topics that is not specified in advance, each combining only the words that are really correlated. In particular, I don't want every word to be assigned, since some words correlate only weakly with the others.

There is 1 best solution below

There is an interesting approach from topic modelling. Let's define $P(w|d)$ as the probability that the word $w$ appears in a given document $d$. Here the documents are tweets. Then we define a set of themes (topics) $T = \{t_1, t_2, \dots, t_n\}$ and use the following equality:

$P(w|d) = \sum_{t_i \in T}P(w|t_i)P(t_i|d)$.

This is not directly applicable to our case, because we do not have the empirical data $P(w|d)$. However, we can use a similar approach.

The value $C_{ij}$ can be considered (up to normalization) as $P(w_i\cap w_j) = P(w_i | w_j)P(w_j)$. Suppose that $P(w_j)$ can be calculated. Then we can get the required conditional probability $P(w_i | w_j)$ and, assuming $w_i$ is independent of $w_j$ given the theme, expand it over themes: $P(w_i | w_j) = \sum_{t_k}P(w_i|t_k)P(t_k|w_j)$. So, if $S$ is the matrix of $P(w_i | w_j)$, this is just a matrix factorization: $S = S_{wt}\cdot S_{tw}$, where $S_{wt}$ holds $P(w_i|t_k)$ and $S_{tw}$ holds $P(t_k|w_j)$. You can see that $S_{tw}$ is a distribution of themes over the words, and it could be used for clustering.
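As a rough illustration of the first step, here is a minimal sketch of how one might turn co-occurrence counts into a conditional-probability matrix. It assumes that rescaling column $j$ of $C$ to sum to 1 is an acceptable estimate of $P(w_i | w_j)$. That is my assumption, not something given in the question. The function name `conditional_from_cooccurrence` and the toy counts are hypothetical.

```python
import numpy as np

def conditional_from_cooccurrence(C):
    """Column-normalise a co-occurrence count matrix.

    Assumption: column j of C, rescaled to sum to 1, is used as an
    estimate of P(w_i | w_j).
    """
    C = np.asarray(C, dtype=float)
    col_sums = C.sum(axis=0, keepdims=True)
    # Avoid division by zero: an all-zero column is left as zeros.
    safe = np.where(col_sums > 0, col_sums, 1.0)
    return C / safe

# Toy counts for four words (hypothetical data).
C = np.array([
    [10, 8, 0, 1],
    [ 8, 9, 1, 0],
    [ 0, 1, 7, 6],
    [ 1, 0, 6, 8],
])
S = conditional_from_cooccurrence(C)
print(S.round(3))
```

Each column of `S` is then a probability distribution over words, which is the form the factorization needs.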

The problem is how to build the factorization $S = S_{wt}\cdot S_{tw}$. It's a nontrivial problem and I'm not good at it. But for the first steps towards solving it, you can search for "PLSA and LDA models" and "PLSA and LDA models with regularization".
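To give an idea of what such a factorization can look like in practice, here is a small sketch. It does not use PLSA or LDA. Instead it uses non-negative matrix factorization with a KL-divergence objective (Lee-Seung multiplicative updates), which is known to be closely related to PLSA. The names `kl_nmf` and `assign_topics`, the toy counts, and the threshold of 0.6 are all hypothetical choices for illustration. Note that the number of themes still has to be fixed in advance here.

```python
import numpy as np

def kl_nmf(S, n_topics, n_iter=500, seed=0, eps=1e-12):
    """Factor S ~ W @ H with non-negative W and H (KL multiplicative updates)."""
    rng = np.random.default_rng(seed)
    n_rows, n_cols = S.shape
    W = rng.random((n_rows, n_topics)) + eps
    H = rng.random((n_topics, n_cols)) + eps
    for _ in range(n_iter):
        WH = W @ H + eps
        H *= (W.T @ (S / WH)) / (W.sum(axis=0)[:, None] + eps)
        WH = W @ H + eps
        W *= ((S / WH) @ H.T) / (H.sum(axis=1)[None, :] + eps)
    # Rescale so each column of W sums to 1 and reads as P(w | t);
    # H absorbs the scale, so the product W @ H is unchanged.
    scale = W.sum(axis=0) + eps
    W /= scale
    H *= scale[:, None]
    return W, H

def assign_topics(H, threshold=0.6):
    """Label each word with its most probable topic, or -1 if none is clear."""
    P_t_given_w = H / (H.sum(axis=0, keepdims=True) + 1e-12)
    labels = P_t_given_w.argmax(axis=0)
    # Words whose best topic is not dominant enough stay unassigned.
    labels[P_t_given_w.max(axis=0) < threshold] = -1
    return labels

# Toy co-occurrence counts (hypothetical): words 0-1 and words 2-3 co-occur.
C = np.array([
    [10, 8, 0, 1],
    [ 8, 9, 1, 0],
    [ 0, 1, 7, 6],
    [ 1, 0, 6, 8],
], dtype=float)

# Assumption: column-normalised counts stand in for P(w_i | w_j).
S = C / C.sum(axis=0, keepdims=True)

W, H = kl_nmf(S, n_topics=2)   # W plays the role of S_wt, H of S_tw
labels = assign_topics(H)
print(labels)
```

The threshold is what lets weakly correlated words remain unassigned (label `-1`), which is one way to get the behaviour asked for in the question. The right value depends on the data.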