I've read the Wikipedia article and a lot of posts on Stack Exchange (like this really thorough one) on determining the number of clusters in a data set. Based on that, I am currently using silhouette analysis in MATLAB.
Clustering $2$-dimensional data (around $10^3$ points) works fine: I compute the average silhouette value for each $k=2,3,\ldots,k_{\mathrm{max}}$, for some $k_{\mathrm{max}}\in\mathbb{N}$, and pick the $k$ with the highest average silhouette value. That takes less than a minute to run.
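For concreteness, the selection procedure described above can be sketched roughly as follows. This is a minimal pure-Python illustration, not my actual MATLAB code: the function names (`kmeans`, `mean_silhouette`, `best_k`) are made up for this sketch, and the farthest-point initialization is an assumption chosen only to keep the run deterministic.

```python
import math


def kmeans(points, k, iters=100):
    """Lloyd's algorithm; returns one cluster label per point.

    Assumption for this sketch: deterministic farthest-point initialization
    (first center is points[0], each next center is the point farthest from
    the centers chosen so far).
    """
    centers = [points[0]]
    while len(centers) < k:
        centers.append(max(points, key=lambda p: min(math.dist(p, c) for c in centers)))
    labels = [0] * len(points)
    for _ in range(iters):
        # Assignment step: each point goes to its nearest center.
        labels = [min(range(k), key=lambda j: math.dist(p, centers[j])) for p in points]
        # Update step: each center moves to the mean of its members.
        new_centers = []
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                new_centers.append(tuple(sum(c) / len(members) for c in zip(*members)))
            else:
                new_centers.append(centers[j])  # empty cluster: leave its center in place
        if new_centers == centers:
            break
        centers = new_centers
    return labels


def mean_silhouette(points, labels):
    """Average silhouette value; needs the distance between every pair of points."""
    clusters = {}
    for i, lab in enumerate(labels):
        clusters.setdefault(lab, []).append(i)
    total = 0.0
    for i, p in enumerate(points):
        own = clusters[labels[i]]
        if len(own) == 1:
            continue  # convention: a singleton cluster contributes s(i) = 0
        # a: mean distance to the other members of the point's own cluster.
        a = sum(math.dist(p, points[j]) for j in own if j != i) / (len(own) - 1)
        # b: smallest mean distance to the members of any other cluster.
        b = min(
            sum(math.dist(p, points[j]) for j in idx) / len(idx)
            for lab, idx in clusters.items()
            if lab != labels[i]
        )
        m = max(a, b)
        total += (b - a) / m if m > 0 else 0.0
    return total / len(points)


def best_k(points, k_max):
    """Return the k in 2..k_max with the highest average silhouette, plus all scores."""
    scores = {k: mean_silhouette(points, kmeans(points, k)) for k in range(2, k_max + 1)}
    return max(scores, key=scores.get), scores
```

Calling `best_k(points, k_max)` with `points` given as a list of coordinate tuples returns the chosen $k$ together with the per-$k$ average silhouette values. Note that `mean_silhouette` loops over every pair of points, whereas each k-means iteration only compares points against the $k$ centers.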
However, with $4$-dimensional data (around $10^5$ points), this approach takes a long time. The clustering itself (using kmeans in MATLAB) is still fairly quick, but calculating the silhouette value is slow. So my thought was: perhaps one of the other methods is faster. Hence my question:
Can anyone provide insight into the performance in higher dimensions of the different methods for choosing the optimal number of clusters in $k$-means clustering?
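For a sense of scale, here is a rough count of distance evaluations for the two data sets. It assumes the silhouette is computed exactly from all pairwise distances and that one k-means iteration compares every point with every center; the helper names are hypothetical and the counts ignore any implementation-specific shortcuts.

```python
def silhouette_distance_count(n):
    """Number of distinct point pairs, i.e. distances an exact silhouette needs."""
    return n * (n - 1) // 2


def kmeans_distance_count(n, k, iterations=1):
    """Point-to-center distances evaluated over the given number of k-means iterations."""
    return n * k * iterations


small = silhouette_distance_count(10**3)   # 499500
large = silhouette_distance_count(10**5)   # 4999950000
growth = large / small                     # roughly 10^4 times more work
per_iteration = kmeans_distance_count(10**5, 10)  # 1000000 for k = 10
```

Under these assumptions, going from $10^3$ to $10^5$ points multiplies the silhouette work by roughly $10^4$, while the k-means work per iteration grows only by a factor of $10^2$. The number of dimensions only changes the cost of each individual distance.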