Measure of distance between two Gaussian distributions


Let's say I have two different classes of phenomena, and for each I extract two different kinds of values. For example, comparing two different kinds of leaves, I measure the length and weight of several hundred instances of each.

From these observed values I compute the mean and standard deviation of each measurement, and assume the measurements follow normal distributions, so:

$L_{1}\sim\mathcal N(\mu_{11},\sigma_{11})$ Length distribution for first class of leaf

$W_{1}\sim\mathcal N(\mu_{21},\sigma_{22})$ Weight distribution for first class of leaf

$L_{2}\sim\mathcal N(\mu_{31},\sigma_{32})$ Length distribution for second class of leaf

$W_{2}\sim\mathcal N(\mu_{41},\sigma_{42})$ Weight distribution for second class of leaf

I want to select the characteristic that best distinguishes the two classes, so I need some kind of measure of distance between $L_{1},L_{2}$ and between $W_{1},W_{2}$, and I will keep the characteristic with the larger distance. Which mathematical notion helps me here?
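For context, estimating the per-class parameters from the measurements could look like the following minimal sketch (the numbers and array contents are made up; in practice the arrays would hold the several hundred real measurements):

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical samples standing in for the measured leaf lengths
lengths_class1 = rng.normal(5.0, 0.8, size=300)
lengths_class2 = rng.normal(6.5, 1.1, size=300)

# Sample mean and standard deviation for each class
mu1, sd1 = lengths_class1.mean(), lengths_class1.std(ddof=1)
mu2, sd2 = lengths_class2.mean(), lengths_class2.std(ddof=1)
```

The same would be done for the weight measurements, giving the four fitted normal distributions above.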


BEST ANSWER

This answer may come late, but it may benefit future readers.

In pattern recognition, when two distributions share the same variance, the Mahalanobis distance is commonly applied. However, the Mahalanobis distance depends only on the separation of the means relative to a common covariance: it ignores differences in variance, so two distributions with equal means but different variances would be at distance zero (for details, see the Wikipedia article on Mahalanobis distance).

For your case, the Bhattacharyya distance would work. It is commonly used to compare Gaussian distributions with different variances (and it can also be applied to other distributions).

For your example, the distance between $L_1$ and $L_2$ can be computed by the following equation (here $\sigma_{11}$ and $\sigma_{32}$ denote variances, i.e. squared standard deviations):

\begin{equation} D_{L_1L_2} = \frac{1}{8} \frac{(\mu_{11}-\mu_{31})^2}{\sigma} + \frac{1}{2} \ln \left(\frac{\sigma}{\sqrt{\sigma_{11}\sigma_{32}}}\right) \end{equation}

where $\sigma=\frac{\sigma_{11}+\sigma_{32}}{2}$.

Similarly, you can compute distance between $W_1$ and $W_2$.
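As a concrete sketch (the means and standard deviations below are made up; note that the formula takes variances, so the sample standard deviations are squared):

```python
import math

def bhattacharyya(mu1, var1, mu2, var2):
    """Bhattacharyya distance between N(mu1, var1) and N(mu2, var2),
    where var1 and var2 are variances (squared standard deviations)."""
    var = (var1 + var2) / 2.0
    return ((mu1 - mu2) ** 2 / (8.0 * var)
            + 0.5 * math.log(var / math.sqrt(var1 * var2)))

# Hypothetical estimates from the leaf samples (mean, standard deviation)
mu_L1, sd_L1 = 5.0, 0.8   # length, class 1
mu_L2, sd_L2 = 6.5, 1.1   # length, class 2
mu_W1, sd_W1 = 2.0, 0.5   # weight, class 1
mu_W2, sd_W2 = 2.2, 0.6   # weight, class 2

d_length = bhattacharyya(mu_L1, sd_L1**2, mu_L2, sd_L2**2)
d_weight = bhattacharyya(mu_W1, sd_W1**2, mu_W2, sd_W2**2)
best_feature = "length" if d_length > d_weight else "weight"
```

With these made-up numbers, length separates the two classes far better than weight, so it would be the feature to keep.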


I guess the general topic you are looking for is 'discriminant analysis'. See Wikipedia.

The issue is not to test whether the two populations are different (presumably already established), but to find a (possibly linear) function of length and weight that would allow you to classify a new leaf as type 1 or type 2.

Discriminant analysis was introduced by Fisher in the 1930s. It is implemented in Minitab, R, and other statistical packages. Fisher's introductory paper used a dataset of Iris flowers of three varieties, classifying them according to a linear function of sepal width and length and petal width and length. It is a famous dataset.

The plot below shows that a line can separate the varieties 'Iris Setosa' and 'Iris Versicolor' based on measurements of the two variables sepal length and sepal width. Roughly speaking: connect the centers of the data clouds for I. Setosa and I. Versicolor with a line segment; the perpendicular bisector of that segment is the linear discriminator. This basic method from Fisher's first paper assumes equal variances for all variables, which is not exactly true here, so you may be able to find a better discriminator by eye. Some implementations of linear discriminant analysis (LDA), as in R, adjust for unequal variances. [Figures from Trumbo (2002)]

[Figure: scatterplot of the two varieties with the linear discriminant line]

Slightly more complex: with three varieties (I. Setosa, I. Versicolor, and I. Virginica) and two variables (sepal width and petal width), it is possible to discriminate correctly much of the time, but there is a slight overlap between observed specimens of I. Versicolor and I. Virginica. The idealized LD boundaries are again perpendicular bisectors of the segments connecting the centers of the data clouds. The bisectors meet at a point (in 2-D), splitting the plane into three regions, one for each variety.

[Figure: scatterplot of the three varieties with the three discriminant boundaries]

Notes: (1) Fisher's initial analysis assumed (based on genetic information) that the centers of the three data clouds should fall on a straight line, which is not exactly true here. Most modern versions of LDA do not make such assumptions. (2) There are also Bayesian versions of LDA that take into account prior probabilities for the varieties along with the measurements of the new specimen. [Implementations of LDA in many statistical software packages permit the use of prior information, where applicable.] (3) The second plot uses 50 specimens of each variety (with some overplotting of coincident points). Fisher's dataset has 100. The idea was to see whether analysis of the other 50 would give about the same linear discriminators. (It did.)
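As a minimal sketch of LDA on Fisher's iris data, here is how it could look in Python with scikit-learn (assuming scikit-learn is available; R's `MASS::lda` plays the analogous role among the packages mentioned above):

```python
from sklearn.datasets import load_iris
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

# 150 specimens, 4 measurements (sepal/petal length and width), 3 varieties
X, y = load_iris(return_X_y=True)

lda = LinearDiscriminantAnalysis()
lda.fit(X, y)

# Resubstitution accuracy; the slight I. Versicolor / I. Virginica
# overlap keeps it just below 100%.
accuracy = lda.score(X, y)
```

A `priors` argument can be passed to `LinearDiscriminantAnalysis` to supply the prior probabilities mentioned in note (2).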