Integrating over Euclidean spaces: A formula from a seminal paper on $k$-means clustering


I am studying the seminal paper "Some methods for classification and analysis of multivariate observations" (MacQueen), the first presentation of the $k$-means clustering model. I am a computer science student, so my formal mathematical education has been highly practical and rather incomplete.

In the paper I found the following paragraph:

[Image: excerpt from the paper, not reproduced here]

Before reading the paper I did not know what a Lebesgue measure was. My understanding so far is purely intuitive: a Lebesgue measure assigns a volume to subsets of an $n$-dimensional space. I know nothing more than this.

What I have trouble understanding, and I suspect this relates to the limitation above, are the integrals over the sets $S_i$. Each $S_i$ is defined as a subset of $E_n$. With $x = (x_1, \ldots, x_k)$, where each $x_i \in E_n$, we have

\begin{align} S_1(x) &= T_1(x) \\ S_2(x) &= T_2(x)S_1'(x) \\ &\vdots \\ S_k(x) &= T_k(x)\prod_{j=1}^{k-1}S'_j(x) \end{align}

where

$$T_i(x) = \Big\{ \alpha : \alpha \in E_n, |\alpha - x_i| \leq |\alpha - x_j| \text{ for all } 1 \leq j \leq k \Big\}$$

That is, $T_i(x)$ is the set of points in $E_n$ that are at least as close to $x_i$ as to any other $x_j$ in $x$.
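To check my reading of these definitions, here is a minimal sketch in Python. It assumes that writing sets side by side (and the product sign) means intersection and that the prime means complement, so a point tied between several $x_i$ ends up in the set with the lowest index. The function name `min_distance_partition` and the sample points are my own and not from the paper.

```python
import math

def min_distance_partition(points, centers):
    """For each point z, return the 0-based index i such that z lies in S_{i+1}(x).

    Assumption: ties in distance go to the lowest index, which is how I read
    S_i(x) = T_i(x) intersected with the complements of S_1(x), ..., S_{i-1}(x).
    """
    labels = []
    for z in points:
        # Euclidean distances |z - x_j| to every center x_j
        dists = [math.dist(z, c) for c in centers]
        # list.index returns the first position of the minimum,
        # i.e. the lowest index among tied centers
        labels.append(dists.index(min(dists)))
    return labels

# Two centers in the plane (hypothetical values for illustration)
centers = [(0.0, 0.0), (2.0, 0.0)]
# The middle point is exactly equidistant from both centers
points = [(0.5, 0.0), (1.0, 0.0), (1.5, 1.0)]

labels = min_distance_partition(points, centers)
print(labels)  # prints [0, 0, 1]
```

The tied point $(1, 0)$ belongs to both $T_1(x)$ and $T_2(x)$, but under this reading only to $S_1(x)$, so the sets $S_i(x)$ do not overlap.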

Fair enough: $S_1, \ldots, S_k$ are the sets of points in $E_n$ closest to $x_1, \ldots, x_k$ respectively. But what is the meaning of $\int_{S_i} \ldots$? What does it mean when the subscript of an integral sign is a set? Furthermore, in the context above, why is the integration taken with respect to $p(z)$ rather than simply $z$?

Since I am aware of my limitations on this subject, I would also welcome, apart from specific answers, book recommendations for studying the theory behind this part of the paper.