I'm trying to understand K-Means||, a scalable version of K-Means++, which itself is an "improved" version of the clustering algorithm K-Means.
Please find here the link to K-Means|| paper http://theory.stanford.edu/~sergei/papers/vldb12-kmpar.pdf
Without detail What does "sampling with an oversampling factor mean? If $P(x)$ defines a probability, and $l$ is an integer $\ge 2$ say, what does sampling from $l\cdot P(x)$ mean?
Within context
You need to know:
- K-Means++ samples from a distribution $P(x)=\frac{d^2(x,C)}{\phi_X(C)}$ where $d^2(x,C)$ is the distance between the point $x$ and its nearest cluster in clustering $C$, and $\phi_X(C)$ is the cost function of the given clustering $C$, corresponding to the square distance between every point x and its nearest cluster $C$. At the end of the day, $\phi_X(C) = \sum_{x \in X} d^2(x,C)$ and $p_x$ defines a correct probability distribution.
- K-Means|| says it samples "each point $x \in X$ independently with probability $P(x) = l \cdot \frac{d^2(x,C)}{\phi_X(C)}$ where $l$ is said to be an oversampling factor.
I would love to understand what "an oversampling factor" genuinely means. To my mind, it seems weird to sample from something that is not a proper probability since it does not sum to one, but to $l$. Another concept I don't know must be implied here. (e.g. if $P(x_0) = 0.6$ then with $l=2$ what would sampling from $P(x)$ mean?
Ready to explain again if I'm not being clear,
Thanks a lot
I have recently been learning about the K-Means|| implementation, and I think it's good to note a clarification made in this paper Good Seedings for k-Means, where they limit the probability of sampling a data point $x \in X$ to $min(1, \frac{\ell d(x, C)^2}{\phi_X(C)})$.
Btw, just for verify something I have been thinking, is the expected number of samples drawn from $X$ $\ell$ and if so, why it is equal to the sum of the probabilities? Like,
$$ E \left(P(X) \right) = \sum_{i=1}^n P(x_i) = \sum_{i=1}^n \ell \frac{d(x_i, C)^2}{\sum_{i=1}^n d(x_i, C)^2} = \ell \frac{1}{\phi_X(C) } \sum_{i=1}^n d(x_i, C)^2 = \ell \frac{\phi_X(C)}{\phi_X(C)} = \ell $$
Regards