Multiset entropy metric


Assume that we have a multiset $M$. I am looking for a metric (ideally something algorithmically efficient) that evaluates the multiset and produces an output representing how different its elements are from one another. My multiset always contains values from the interval $[0,1]$.

I have the following requirements:

  • A multiset whose elements are all identical must have a metric value of 0, e.g. $$Metric(\{0.1,0.1,0.1\})=0$$
  • It must take the distances between different elements into account, and it should not depend on the average (so no standard deviation...), e.g. $$Metric(\{0.11,0.12,0.13,0.14\})<Metric(\{0.1,0.2,0.3,0.4\})$$ $$Metric(\{0.1,0.11,0.91,0.9\})<Metric(\{0.1,0.3,0.6,0.9\})$$ $$Metric(\{0.1,0.2,0.8,0.9\})<Metric(\{0.1,0.3,0.6,0.9\})$$

So far I have only come up with the idea of computing the Euclidean distance between every pair of values (and taking the sum or average of those), but this seems somewhat inefficient to implement. I suspect it may be some kind of entropy, but I have not found an entropy that takes the distance between elements into account. It would be ideal if it were already implemented in one of the Python libraries. Sorry if this question is a bit 'dirty'; I am not a mathematician. Thanks in advance.
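For reference, the brute-force version of that idea can be sketched in a few lines of NumPy (`mean_pairwise_distance` is just an illustrative name; this is $O(n^2)$ in time and memory):

```python
import numpy as np

def mean_pairwise_distance(values):
    # Average |v_i - v_j| over all unordered pairs of elements.
    v = np.asarray(values, dtype=float)
    d = np.abs(v[:, None] - v[None, :])   # n x n matrix of absolute differences
    iu = np.triu_indices(len(v), k=1)     # indices of the strict upper triangle
    return d[iu].mean()

print(mean_pairwise_distance([0.1, 0.1, 0.1]))           # 0.0
print(mean_pairwise_distance([0.11, 0.12, 0.13, 0.14]))  # small
print(mean_pairwise_distance([0.1, 0.2, 0.3, 0.4]))      # larger
```

Note, though, that a plain average of pairwise distances ranks {0.1, 0.11, 0.91, 0.9} above {0.1, 0.3, 0.6, 0.9}, the opposite of the second ordering example above, which is one more reason to look for something entropy-like.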


I was a bit intrigued by the idea of an "entropy" metric-like measure for your multiset, so I took a look at a few possibilities.

I'll put the code at the end. Note that I test your example cases after duplicating each set repeatedly and adding Gaussian noise, to prevent numerical issues.

Note also that some of the issues can be mitigated by enforcing the finite support $[0,1]$; the expansion code below does this crudely by appending 0 and 1 to each example multiset.

1) Just looking at the discrete Shannon entropy of the values, using a binning approach to estimate the probability mass function. This satisfies the first two ordering examples but not the third (note the False in the output below), and it does not give the identical-valued set zero entropy either; the latter is due to the added noise and to not accounting for the actual finite support.

2) Using kernel density estimation, I computed the differential entropy of the estimated probability density. This one passes all of your ordering requirements, BUT it can take negative values (meaning [0.11,0.12,0.13,0.14] may appear to have less entropy than [1,1,1,1], for instance), which is presumably not what you want.
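(As a side note on why a differential entropy can go negative, consider the textbook case of a uniform density on an interval of width $w$: $$ h(U_w) = -\int_0^w \frac{1}{w}\ln\frac{1}{w}\,dx = \ln w, $$ which is negative whenever $w<1$. A KDE fitted to tightly clustered points is essentially such a narrow bump, hence the large negative values.)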

3) Then I tried to look at the differential entropy of the multiset of pairwise distances. That is, given multiset $S=\{s_i\}$, compute $$ D = \{ |s_i - s_j| \;\forall\;s_i,s_j\in S \} $$ so that $|D|=|S|^2$. Again we have an issue with negativity, but for the examples considered [0.1,0.1,0.1] had the lowest value, so I took: $$ \mathcal{E}[S] = \exp( \mathscr{D}[ \mathbb{P}_S ] ) $$ as the "metric entropy" of the multiset $S$ where $\mathbb{P}_S$ is the approximate probability distribution of values and $\mathscr{D}$ is the differential entropy.

This approach satisfies your ordering constraints, except that the multiset of identical elements still does not get exactly zero. Furthermore, if you truly have an array of identical elements, it will output $e^0=1.0$.

Anyway, just some ideas...

Here is the code:

import numpy as np, statsmodels.api as sm

# Test arrays
xa_init = [ [0.1, 0.1, 0.1, 0.1],
            [0.11, 0.12, 0.13, 0.14], 
            [0.1, 0.2, 0.3, 0.4], 
            [0.1, 0.11, 0.91, 0.9], 
            [0.1, 0.3, 0.6, 0.9],
            [0.1, 0.2, 0.8, 0.9]   ]

# Expand array values by duplication plus Gaussian noise;
# 0.0 and 1.0 are also appended, pinning down the support [0,1]
def noiseExpand(a, sigma=0.0005, n=50):
    noisy = np.array([ np.random.normal(size=n)*sigma + elem for elem in a ]).flatten()
    return np.concatenate((noisy, np.array([0.0, 1.0])), axis=0)
xa = [ noiseExpand(u) for u in xa_init ]

# Discrete entropy (histogram estimate of the pmf)
# See: http://stackoverflow.com/questions/24144777/how-to-compute-the-shannon-entropy-and-mutual-information-of-n-variables
def entropy(x, bins=None):
    # np.histogram needs an integer bin count; ~2*sqrt(n) is a simple rule of thumb
    if bins is None: bins = int(2 * np.sqrt(len(x)))
    c = np.histogram(x, bins)[0]
    c_normalized = c / float(np.sum(c))
    c_normalized = c_normalized[np.nonzero(c_normalized)]
    return -np.sum(c_normalized * np.log(c_normalized))

print("Discrete Entropy Values")
for a,ai in zip(xa,xa_init): print(str(ai) + "    " + str(entropy( a )))
print(entropy(xa[1]) < entropy(xa[2]))
print(entropy(xa[3]) < entropy(xa[-2]))
print(entropy(xa[-1]) < entropy(xa[-2]))

# KDE entropy
def contEntropy(a):
    dens = sm.nonparametric.KDEUnivariate( np.array(a) )
    dens.fit()
    return dens.entropy # Differential entropy of kde dist

print("\nContinuous Entropy Values")
contEntropies = [ contEntropy(a) for a in xa ]
for i,(a,ai) in enumerate(zip(xa,xa_init)): 
    print( str(ai) + "    " + str(contEntropies[i]) )
print(contEntropies[1] < contEntropies[2])
print(contEntropies[3] < contEntropies[-2])
print(contEntropies[-1] < contEntropies[-2])

# Entropy of pairwise distances:
# all |x - y| over ordered pairs, so |D| = |S|^2 (self-distances of 0 included)
def getPairwiseDists(a):
    return [ abs(x - y) for x in a for y in a ]

distSet = [ getPairwiseDists(a) for a in xa ]
print("\nDistance Entropy Values")
contEntropies = [ np.exp( contEntropy(a) ) for a in distSet ]
for i,ai in enumerate(xa_init): 
    print( str(ai) + "    " + str(contEntropies[i]) )
print(contEntropies[1] < contEntropies[2])
print(contEntropies[3] < contEntropies[-2])
print(contEntropies[-1] < contEntropies[-2])

which gives the following output (remember that the sets are duplicated and noise is added):

Discrete Entropy Values
[0.1, 0.1, 0.1, 0.1]    0.0624089186408
[0.11, 0.12, 0.13, 0.14]    0.0624089186408
[0.1, 0.2, 0.3, 0.4]    1.43497759302
[0.1, 0.11, 0.91, 0.9]    1.09183542442
[0.1, 0.3, 0.6, 0.9]    1.43497759302
[0.1, 0.2, 0.8, 0.9]    1.43497759302
True
True
False

Continuous Entropy Values
[0.1, 0.1, 0.1, 0.1]    8.01080946894e-27
[0.11, 0.12, 0.13, 0.14]    -2.89669758564
[0.1, 0.2, 0.3, 0.4]    -0.716414093314
[0.1, 0.11, 0.91, 0.9]    0.19069278778
[0.1, 0.3, 0.6, 0.9]    0.201091210836
[0.1, 0.2, 0.8, 0.9]    0.133030025965
True
True
True

Distance Entropy Values
[0.1, 0.1, 0.1, 0.1]    0.00183372085171
[0.11, 0.12, 0.13, 0.14]    0.0215692440406
[0.1, 0.2, 0.3, 0.4]    0.159330383004
[0.1, 0.11, 0.91, 0.9]    0.433608602748
[0.1, 0.3, 0.6, 0.9]    0.733649786038
[0.1, 0.2, 0.8, 0.9]    0.610790776053
True
True
True