Clustering elements according to covariance matrix

51 Views Asked by Bumbble Comm At 25 Mar 2026 - 11:17

I'm doing a little bit of topic modelling (which is not really my area) with tweets from Twitter. The situation is the following: I have a (sort of) covariance matrix where the entry $C_{ij}$ corresponds to the frequency with which the words $i$ and $j$ occur together in a tweet.

Given this matrix $C$, I would like to automatically cluster words into different topics. However, since my background is neither statistics nor data analysis, I think I might need the correct terms to search for. I'm not sure if k-means or PCA is what I need.

In the optimal case, I end up with a number of topics that is not specified in advance, each of which combines only the words that are really correlated. In particular, I don't want all words to be assigned, given that some words correlate only very little.


There is 1 best solution below

Bumbble Comm On 30 May 2015 - 1:15

There is an interesting approach in topic modelling. Let's define $P(w|d)$ as the probability that the word $w$ appears in a given document $d$. Here the documents are tweets. Then we define a set of themes $T = \{t_1, t_2, \dots, t_n\}$ and use the following equality:

$P(w|d) = \sum_{t_i \in T}P(w|t_i)P(t_i|d)$.

But this is not directly applicable in our case, as we have no empirical data for $P(w|d)$. However, we can use a similar approach.

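The value $C_{ij}$ can be considered (up to normalization) as the joint probability $P(w_i\cap w_j) = P(w_i | w_j)P(w_j)$. Suppose that $P(w_j)$ can be calculated. Then we can get the required conditional probability $P(w_i | w_j) = P(w_i\cap w_j)/P(w_j)$ and expand it over the themes: $P(w_i | w_j) = \sum_{t_k \in T}P(w_i|t_k)P(t_k|w_j)$ (the theme index is written $k$ here so that it does not clash with the word index $i$). So, if $S$ is the matrix with entries $S_{ij} = P(w_i | w_j)$, this is just a matrix factorization: $S = S_{wt}\cdot S_{tw}$, where $S_{wt}$ holds the values $P(w_i|t_k)$ and $S_{tw}$ holds the values $P(t_k|w_j)$. You can see that $S_{tw}$ gives a distribution over themes for each word, which can be used for clustering.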

The problem is how to build the factorization $S = S_{wt}\cdot S_{tw}$. It's a nontrivial problem and I'm not an expert in it. As a first step, you can search for "PLSA and LDA models" and "PLSA and LDA models with regularization".
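To make the factorization idea a bit more concrete, here is a minimal sketch in Python with NumPy. It is not PLSA or LDA proper: it swaps in plain non-negative matrix factorization with multiplicative updates for the Kullback-Leibler divergence, which is closely related to the EM procedure used for PLSA. The function names, the toy matrix `C`, and the `min_weight` threshold are all hypothetical choices for illustration, and the column normalization assumes that each column of $C$ can simply be rescaled to approximate $P(w_i | w_j)$.

```python
import numpy as np

def conditional_from_cooccurrence(C):
    """Rescale each column of C to sum to 1, so column j approximates P(w_i | w_j)."""
    C = np.asarray(C, dtype=float)
    col_sums = C.sum(axis=0, keepdims=True)
    col_sums[col_sums == 0] = 1.0  # all-zero columns stay zero
    return C / col_sums

def factorize(S, n_themes, n_iter=1000, seed=0, eps=1e-12):
    """Approximate S ~ S_wt @ S_tw using multiplicative KL-divergence NMF updates."""
    rng = np.random.default_rng(seed)
    n_words = S.shape[0]
    S_wt = rng.random((n_words, n_themes)) + eps  # words x themes
    S_tw = rng.random((n_themes, n_words)) + eps  # themes x words
    for _ in range(n_iter):
        R = S / (S_wt @ S_tw + eps)
        S_tw *= (S_wt.T @ R) / (S_wt.sum(axis=0)[:, None] + eps)
        R = S / (S_wt @ S_tw + eps)
        S_wt *= (R @ S_tw.T) / (S_tw.sum(axis=1)[None, :] + eps)
    # Rescale so each column of S_wt sums to 1 (read as P(w | t));
    # the product S_wt @ S_tw is unchanged by this rescaling.
    scale = S_wt.sum(axis=0) + eps
    S_wt /= scale
    S_tw *= scale[:, None]
    return S_wt, S_tw

def assign_words(S_tw, min_weight=0.6):
    """Label each word with its dominant theme, or -1 if no theme is dominant enough."""
    weights = S_tw / (S_tw.sum(axis=0, keepdims=True) + 1e-12)
    labels = weights.argmax(axis=0)
    labels[weights.max(axis=0) < min_weight] = -1  # hypothetical cut-off
    return labels

# Toy co-occurrence counts (made up): words 0-2 co-occur, words 3-4 co-occur.
C = np.array([
    [10, 8, 7,  0,  0],
    [ 8, 9, 6,  0,  0],
    [ 7, 6, 8,  0,  0],
    [ 0, 0, 0, 12,  9],
    [ 0, 0, 0,  9, 11],
])
S = conditional_from_cooccurrence(C)
S_wt, S_tw = factorize(S, n_themes=2)
labels = assign_words(S_tw)
print(labels)
```

Words labelled `-1` are left unassigned, which is one simple way to avoid forcing weakly associated words into a topic. Note that the number of themes still has to be chosen by hand in this sketch, and the theme numbering itself is arbitrary.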
