Why is the entropy of a data set not defined as $H(p)=p(1-p)$?


I'm learning an ML algorithm online that uses entropy as a measure of the impurity of a dataset. Suppose the dataset contains two types of objects: objects with positive labels and objects with negative labels. The entropy of the dataset is defined as $H(p)=-p\log_2(p)-(1-p)\log_2(1-p)$, where $p$ is the proportion of objects with positive labels. The idea is that impurity should be maximized when $p=1/2$ and minimized when $p=0$ or $p=1$. It seems this could also be achieved with the much simpler function $H_1(p)=p(1-p)$. So what is special about the functional form $H(p)=-p\log_2(p)-(1-p)\log_2(1-p)$? Why is this form preferred? What is the rationale behind it, and where does it come from?
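A quick numerical sketch (the function names here are my own) confirming that both candidates share the qualitative behavior described above: zero at $p=0$ and $p=1$, maximal at $p=1/2$.

```python
import math

def binary_entropy(p):
    """H(p) = -p*log2(p) - (1-p)*log2(1-p), with H(0) = H(1) = 0 by convention."""
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def gini_like(p):
    """The simpler candidate impurity H1(p) = p*(1-p)."""
    return p * (1 - p)

# Both vanish at the endpoints and peak at p = 1/2 (values 1 and 1/4 respectively).
for p in [0.0, 0.1, 0.25, 0.5, 0.75, 0.9, 1.0]:
    print(f"p={p:4.2f}  H(p)={binary_entropy(p):.4f}  H1(p)={gini_like(p):.4f}")
```

So the two functions are indeed interchangeable as far as these endpoint and maximum conditions go; the difference lies in the extra axioms discussed in the answer below.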

Best answer:

There is nothing special about the base-$2$ logarithm in your entropy expression. You could multiply the function by any positive constant and it would retain the properties you want, and rescaling by a constant amounts to changing the base of the logarithm.

Anyway, for a theorem giving conditions under which an entropy function must take the form you ask about, see Theorem 6.1 here. It assumes that for each $n\geq 2$ there is an "entropy" function $H_n$ on $n$-tuples $(p_1,\ldots, p_n)$ of numbers in $[0,1]$ summing to $1$, where each $H_n$ is continuous and symmetric, and there is a consistency condition relating $H_{n+1}$ to $H_n$ and $H_2$. The conclusion is that there is a constant $k$ such that $H_n(p_1,\ldots,p_n) = -k\sum_{i=1}^n p_i\log p_i$ for every $n$.
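A numerical sketch of why the consistency (grouping) condition singles out the logarithmic form. I'm using one common formulation of that condition (the exact statement is in the linked theorem): splitting an outcome of probability $q=p_1+p_2$ into two sub-outcomes should add $q\,H_2(p_1/q, p_2/q)$ to the entropy. Shannon entropy satisfies this exactly; the Gini-style analogue of $p(1-p)$ does not.

```python
import math

def H(ps):
    """Shannon entropy (base 2) of a probability vector, with 0*log(0) := 0."""
    return -sum(p * math.log2(p) for p in ps if p > 0)

def gini(ps):
    """Gini-style impurity 1 - sum(p_i^2); equals 2*p*(1-p) when n = 2."""
    return 1 - sum(p * p for p in ps)

p1, p2, p3 = 0.2, 0.3, 0.5
q = p1 + p2

# Grouping condition: describe the distribution in two stages,
# first {q, p3}, then split q into p1 and p2.
lhs = H([p1, p2, p3])
rhs = H([q, p3]) + q * H([p1 / q, p2 / q])
print(f"entropy: {lhs:.6f} vs {rhs:.6f}")   # these agree

lhs_g = gini([p1, p2, p3])
rhs_g = gini([q, p3]) + q * gini([p1 / q, p2 / q])
print(f"gini:    {lhs_g:.6f} vs {rhs_g:.6f}")  # these do not
```

This is the property $H_1(p)=p(1-p)$ lacks, which is why the theorem's hypotheses force the $-k\sum p_i\log p_i$ form rather than a polynomial.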

When $n = 2$, the conclusion is saying $H_2(p,1-p) = -k(p\log p + (1-p)\log(1-p))$.
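To connect this back to the base-$2$ convention: a small check (my own illustration) that choosing $k = 1/\ln 2$ in the natural-log form recovers exactly the base-$2$ expression from the question.

```python
import math

def H2(p, k=1.0):
    """H_2(p, 1-p) = -k * (p*ln(p) + (1-p)*ln(1-p)), natural-log form."""
    if p in (0.0, 1.0):
        return 0.0
    return -k * (p * math.log(p) + (1 - p) * math.log(1 - p))

p = 0.3
# The base-2 binary entropy from the question.
base2 = -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Setting k = 1/ln(2) converts natural logs to base-2 logs.
print(H2(p, k=1 / math.log(2)), base2)  # same value
```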