Let $A=\{A_1,A_2,\cdots,A_m\}$ and $B=\{B_1,B_2,\cdots,B_n\}$ be two sets of points in $k$-dimensional Euclidean space. Each point $A_i$ or $B_j$ can be thought of as the feature vector of a data sample. I want to know whether the distributions of $A$ and $B$ are similar or not.
I could perform a univariate analysis by drawing $k$ histograms for $A$ and $B$, respectively, and comparing them in each of the $k$ dimensions.
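For concreteness, the per-dimension histogram comparison could be sketched as follows (the sample data and the choice of 10 shared bins are my own assumptions, just for illustration):

```python
import numpy as np

# Hypothetical sample data: m=50 and n=60 points in k=3 dimensions
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(50, 3))
B = rng.normal(1.0, 1.0, size=(60, 3))

# Compare the marginal distribution of each of the k dimensions
for dim in range(A.shape[1]):
    # Shared bin edges so the two histograms are directly comparable
    bins = np.histogram_bin_edges(np.concatenate([A[:, dim], B[:, dim]]), bins=10)
    hist_A, _ = np.histogram(A[:, dim], bins=bins, density=True)
    hist_B, _ = np.histogram(B[:, dim], bins=bins, density=True)
    print(dim, np.abs(hist_A - hist_B).max())
```

This only inspects the marginals, so it can miss differences in the joint distribution, which is part of why I am looking for a cluster-level distance below.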
Alternatively, I could treat $A$ and $B$ as two clusters of points in Euclidean space and measure the distance between these two clusters; this is what I am asking about. There are various ways to define such a distance. I could use the minimal pairwise distance,
$$d(A,B)=\min_{i,j}||A_i-B_j||$$
where $||\cdot||$ is the L2 norm, or the distance between the centroids,
$$d(A,B)=||C_A-C_B||$$
where
$$ \begin{align*} C_A&=\frac1m\sum_{i=1}^mA_i\\ C_B&=\frac1n\sum_{j=1}^nB_j \end{align*} $$
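In code, these two definitions would look roughly like this (the sample data is a hypothetical placeholder):

```python
import numpy as np

# Hypothetical sample data: m=50 and n=60 points in k=3 dimensions
rng = np.random.default_rng(0)
A = rng.normal(0.0, 1.0, size=(50, 3))
B = rng.normal(2.0, 1.0, size=(60, 3))

# Minimal pairwise distance: min_{i,j} ||A_i - B_j||
diffs = A[:, None, :] - B[None, :, :]            # shape (m, n, k)
d_min = np.sqrt((diffs ** 2).sum(axis=2)).min()

# Centroid distance: ||C_A - C_B||
d_centroid = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
```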
The former is bad: in practice $A$ and $B$ overlap in some region, so the minimal distance is almost always nearly zero. The latter is better but has a limitation: if $A'$ has the same centroid as $A$ but $A'$ is more scattered than $A$, then it is undesirable that $d(A',B)=d(A,B)$; it should be $d(A',B)<d(A,B)$.
As an alternative way to establish the distance, I could take the standard deviations of the clusters into account:
$$d(A,B)=\frac{||C_A-C_B||}{s_As_B}$$
where $s_A$ and $s_B$ are the standard deviations of $A$ and $B$, respectively. Or I could define it as
$$ \begin{align*} d(A,B)&=\frac{||C_A-C_B||}{{s_A}^2{s_B}^2}\\ d(A,B)&=\frac{||C_A-C_B||}{{s_A}^2+{s_B}^2} \end{align*} $$
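A sketch of the last candidate definition. Note that the question leaves $s_A$ unspecified for $k>1$, so the root-mean-square distance of the points to their centroid is my own assumption for a scalar spread:

```python
import numpy as np

def spread(X):
    """Scalar spread of a cluster: RMS distance of its points to the centroid.

    This is one possible multivariate analogue of a standard deviation;
    other choices (e.g. per-dimension deviations) are equally plausible.
    """
    return np.sqrt(((X - X.mean(axis=0)) ** 2).sum(axis=1).mean())

def d_normalized(A, B):
    """||C_A - C_B|| / (s_A^2 + s_B^2), the last candidate definition."""
    num = np.linalg.norm(A.mean(axis=0) - B.mean(axis=0))
    return num / (spread(A) ** 2 + spread(B) ** 2)
```

With this definition, scattering $A$ about its centroid leaves the numerator unchanged while growing the denominator, so the desired property $d(A',B)<d(A,B)$ holds.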
Is there a standard way of defining this distance?
Note 1: I have heard the term "within-cluster sum of squares" in the context of k-means clustering, but it does not seem to involve the standard deviation directly.
Note 2: ChatGPT recommended the last equation.