Problem (Kernelization of $k$-means) $\quad$ Let $\mathcal{X}$ be a set and consider a PSD kernel $K$ on $\mathcal{X}$ with corresponding RKHS $\mathcal{H}$ and feature map $\varphi$. The goal of this exercise is to show that, given observations $x_1, \ldots, x_n$ in $\mathcal{X}$, we can solve the $k$-means problem on $\mathcal{H}$ with observations $\varphi\left(x_1\right), \ldots, \varphi\left(x_n\right)$ through the manipulation of the Gram matrix $G:=\left(K\left(x_i, x_j\right)\right)_{i j}$ alone. For this, we recall that the loss function of the $k$-means problem is given by $$ L\left(c_1, \ldots, c_k\right):=\sum_{i=1}^n \min _{j=1, \ldots, k}\left\|\varphi\left(x_i\right)-c_j\right\|_{\mathcal{H}}^2, $$ where $c_1, \ldots, c_k \in \mathcal{H}$.
- Show that $$ L\left(c_1, \ldots, c_k\right)=\sum_{i=1}^n\left\|\varphi\left(x_i\right)-c_{s_i}\right\|_{\mathcal{H}}^2, $$ where $s_i=\arg \min _j\left\|\varphi\left(x_i\right)-c_j\right\|_{\mathcal{H}}^2$.
- Show that the map $c \mapsto \sum_{i=1}^n\left\|\varphi\left(x_i\right)-c\right\|_{\mathcal{H}}^2$ is minimized when $c=\frac{1}{n} \sum_{i=1}^n \varphi\left(x_i\right)$. Beware that $\mathcal{H}$ may be an infinite-dimensional Hilbert space, so you cannot simply rely on gradient computations as is usually done on $\mathbb{R}^d$.
- Introducing $C_j:=\left\{i, s_i=j\right\}$ for $j=1, \ldots, k$, deduce that minimizing (9) boils down to minimizing $$ \left(s_1, \ldots, s_n\right) \mapsto \sum_{i=1}^n\left\|\varphi\left(x_i\right)-\frac{1}{\left|C_{s_i}\right|} \sum_{j \in C_{s_i}} \varphi\left(x_j\right)\right\|_{\mathcal{H}}^2, $$ where $s_1, \ldots, s_n$ are now variables in $\{1, \ldots, k\}$.
- Deduce that performing $k$-means in $\mathcal{H}$ boils down to an optimization problem on $\left(s_1, \ldots, s_n\right)$ that only requires the knowledge of $K$. Note: you can reach a simple formula after careful simplification. Nonetheless, the optimization problem remains highly non-trivial.
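As a sanity check on the last item, here is a short sketch (the function name and variable names are mine, not from the exercise). Expanding the squared norm and using the cluster means as centers, the loss depends only on the Gram matrix: $$\sum_{i=1}^n \left\|\varphi(x_i)-\tfrac{1}{|C_{s_i}|}\textstyle\sum_{j\in C_{s_i}}\varphi(x_j)\right\|_{\mathcal{H}}^2 = \operatorname{tr}(G) - \sum_{j=1}^k \frac{1}{|C_j|}\sum_{a,b\in C_j} G_{ab}.$$

```python
import numpy as np

def kernel_kmeans_loss(G, s, k):
    """Kernel k-means loss of the assignment s (labels in {0,...,k-1}),
    computed from the Gram matrix G only:
        tr(G) - sum_j (1/|C_j|) * sum_{a,b in C_j} G[a, b]."""
    loss = np.trace(G)
    for j in range(k):
        idx = np.where(s == j)[0]       # cluster C_j
        if len(idx) > 0:
            loss -= G[np.ix_(idx, idx)].sum() / len(idx)
    return loss

# Cross-check with the linear kernel K(x, y) = <x, y>, where phi is the
# identity and the loss must match the ordinary k-means loss in R^d.
rng = np.random.default_rng(0)
X = rng.normal(size=(10, 3))
G = X @ X.T
s = np.array([0, 1] * 5)                # an arbitrary assignment into k = 2 clusters
direct = sum(((X[s == j] - X[s == j].mean(axis=0)) ** 2).sum() for j in range(2))
assert abs(kernel_kmeans_loss(G, s, 2) - direct) < 1e-8
```

As the exercise notes, evaluating the loss is easy; minimizing it over the $k^n$ possible assignments remains a hard combinatorial problem (in practice one alternates assignment and implicit-center updates, as in Lloyd's algorithm).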
This is an exercise from my class that I am working on. I have a question about the second item: how do I find the minimizer when $\mathcal{H}$ is an infinite-dimensional Hilbert space? Any hints would be appreciated.
Hint: Assume that the average is not the minimizer. Then, using properties of the inner product, simplify the cost to arrive at a contradiction.
Solution: Let $h_1, \dots, h_n$ be any collection of points in a (possibly infinite-dimensional) Hilbert space $\mathcal{H}$, and let $\bar h = \frac1n \sum_{i=1}^n h_i$ be their average. Assume that $\bar h$ does not minimize the map $$ h \mapsto \sum_{i=1}^n \|h_i - h\|^2. $$ Then there exists some $\epsilon \in \mathcal{H}\setminus\{0\}$ such that $\bar h + \epsilon$ achieves a strictly smaller value. Expanding the squared norm, $$ \begin{aligned} \sum_{i=1}^n \|h_i - \bar h - \epsilon\|^2 &= \sum_{i=1}^n \left( \|h_i - \bar h\|^2 - 2 \langle \epsilon, h_i - \bar h \rangle + \|\epsilon\|^2 \right) \\ &= \left( \sum_{i=1}^n \|h_i - \bar h\|^2 \right) - 2 \Big\langle \epsilon, \sum_{i=1}^n (h_i - \bar h) \Big\rangle + n \|\epsilon\|^2 \\ &= \left( \sum_{i=1}^n \|h_i - \bar h\|^2 \right) + n \|\epsilon\|^2 \\ &> \sum_{i=1}^n \|h_i - \bar h\|^2, \end{aligned} $$ where the third equality uses $\sum_{i=1}^n (h_i - \bar h) = n\bar h - n\bar h = 0$, and the strict inequality uses $\epsilon \neq 0$. This contradicts our assumption that $\bar h + \epsilon$ achieves a strictly smaller value than $\bar h$, so $\bar h$ is the minimizer.
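The key identity in the derivation above can also be checked numerically in $\mathbb{R}^d$ (a finite-dimensional Hilbert space), which may help build intuition; the variable names below are my own:

```python
import numpy as np

# Verify the identity from the proof:
#   sum_i ||h_i - hbar - eps||^2 = sum_i ||h_i - hbar||^2 + n * ||eps||^2,
# i.e. shifting the candidate center away from the mean by any eps != 0
# increases the cost by exactly n * ||eps||^2.
rng = np.random.default_rng(1)
H = rng.normal(size=(7, 4))          # n = 7 points in R^4 standing in for H
hbar = H.mean(axis=0)
eps = rng.normal(size=4)             # an arbitrary nonzero perturbation

cost_at_mean = ((H - hbar) ** 2).sum()
cost_shifted = ((H - hbar - eps) ** 2).sum()
assert np.isclose(cost_shifted, cost_at_mean + len(H) * (eps ** 2).sum())
assert cost_shifted > cost_at_mean
```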