Why does Benford's Law (or Zipf's Law) hold?


Both Benford's Law (if you take a list of values, the distribution of the most significant digit is roughly logarithmic: digit $d$ appears first with probability about $\log_{10}(1+1/d)$) and Zipf's Law (given a corpus of natural-language utterances, the frequency of any word is roughly inversely proportional to its rank in the frequency table) are not theorems in a mathematical sense, but they hold quite well in real life.

Does anyone have an idea why this happens?

(see also this question)


Accepted answer (score 3)

As a rough, somewhat intuitive explanation of why Benford's Law makes sense, consider it with respect to amounts of money. The amount of time (or effort, or work) needed to get from \$1000 to \$2000 (a 100% increase) is a lot greater than the amount of time needed to get from \$8000 to \$9000 (a 12.5% increase); increasing money is usually done in proportion to the money one has. Thought about in the other direction, it should take a fixed amount of time to, say, double one's money, so going from \$1000 to \$2000 takes as long as going from \$2000 to \$4000 and from \$4000 to \$8000. The leading digit therefore spends more time at lower values than at higher ones. Because the growth is exponential, the time spent at each leading digit is roughly logarithmic.
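As a numerical check of this picture (a minimal sketch, not part of the original answer; the growth rate and step count are arbitrary choices), here is a small Python simulation that grows a balance by a fixed fraction per step and tallies how many steps each leading digit is showing:

```python
import numpy as np

# Grow a quantity by a fixed percentage per step (proportional growth)
# and count how many steps its leading digit spends at each of 1-9.

balance = 1000.0
counts = np.zeros(10)
for _ in range(100_000):
    counts[int(f"{balance:e}"[0])] += 1  # first significant digit
    balance *= 1.001                     # grow by 0.1% per step

print(np.round(counts[1:] / counts[1:].sum(), 4))          # simulated
print([round(np.log10(1 + 1/d), 4) for d in range(1, 10)]) # Benford
```

The two rows come out very close: exponential growth spends about $\log_{10}(1+1/d)$ of its time with leading digit $d$.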

Answer (score 3)

For the case of Benford's Law, scale invariance is of course a necessary condition: the law must hold whether we measure things in meters, feet, or furlongs, that is, after multiplying the given data by a constant, and the only leading-digit distribution invariant under such rescaling is the logarithmic one. But being necessary does not mean that it is the whole answer, of course.
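To make the scale-invariance point concrete (a minimal sketch; the log-uniform data model and the unit-conversion factors are my own arbitrary choices), one can check numerically that the logarithmic leading-digit distribution survives multiplication by any constant:

```python
import numpy as np

# Numbers whose log10 is uniform over several decades follow Benford's
# law; rescaling them (e.g. meters -> feet -> furlongs) should not
# change the leading-digit frequencies.

rng = np.random.default_rng(0)
data = 10 ** rng.uniform(0, 6, size=1_000_000)

def leading_digit(x):
    """First significant decimal digit of each positive number."""
    return (x / 10 ** np.floor(np.log10(x))).astype(int)

for scale in (1.0, 3.2808, 0.004971):  # unit-conversion factors
    freq = np.bincount(leading_digit(data * scale), minlength=10)[1:]
    print(scale, np.round(freq / len(data), 3))

# Every row is (up to sampling noise) log10(1 + 1/d) for d = 1..9.
```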

Scale invariance is not relevant for Zipf's Law, however, since we have an absolute rank.

Answer (score 1)

Benford's law is easy to prove, but a little harder to explain. The easiest way to understand why it works is to think of each number in a random data set as not one number but two: the number itself and the highest possible number it could have been. Every random number therefore comes with a potential range of possible values. For a good data set, that potential range should itself be random for each number. If the potential range is the same for every number, then Benford's law will not correctly predict the leading digit.

For example, if you set a computer to randomly pick 15 numbers, each uniformly from the range $1$-$300$, the odds that the leading digit is a $1$ or a $2$ are exactly the same ($111$ out of $300$ each), while the chance that the leading digit is a $3$ through $9$ is significantly less. Such random numbers would not line up well with Benford's law.

In order for Benford's law to work, each random number must come with its own random potential: let $n$ be the random number and $d$ the top of its range $1$ through $d$. For example, the first random number is picked out of $25$, the next out of $1679$, the next out of $500$, the next out of $12478$, the next out of $20$, the next out of $36$, and so on. This sort of random set is more likely to lead with a $1$, followed by a $2$, consistent with Benford's law.
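A quick simulation of this "random range" idea (a sketch only; the particular distribution of upper bounds is my own arbitrary choice):

```python
import numpy as np

# Pick a random upper bound d for each draw, then a number uniformly
# from [1, d); compare with draws from a single fixed range 1-300.

rng = np.random.default_rng(1)

def leading_digit(x):
    """First significant decimal digit of each positive number."""
    return (x / 10 ** np.floor(np.log10(x))).astype(int)

n = 1_000_000
upper = 10 ** rng.uniform(1, 5, size=n)        # random upper bounds
varied = rng.uniform(1, upper)                 # random ranges
fixed = rng.uniform(1, 300, size=n)            # one fixed range

print(np.bincount(leading_digit(varied), minlength=10)[1:] / n)
print(np.bincount(leading_digit(fixed), minlength=10)[1:] / n)
```

The first row comes out close to Benford's law; the second gives digits $1$ and $2$ nearly equal weight, as described above.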

Answer (score 0)

Benford's Law

To examine the distribution of the mantissas of a dataset, we can examine the fractional parts of the common logarithms of the data. That's because the fractional part of the common logarithm is the common logarithm of the mantissa.

For example, consider numbers with mantissa $2.5$: $$ \begin{align} \log(.025) &= .39794000867203760957 - 2\\ \log(.25) &= .39794000867203760957 - 1\\ \log(2.5) &= .39794000867203760957 + 0\\ \log(25) &= .39794000867203760957 + 1\\ \log(250) &= .39794000867203760957 + 2\\ \log(2500) &= .39794000867203760957 + 3 \end{align} $$ Thus, if the mantissa of a number is $2.5$, the fractional part of its common logarithm is $.39794000867203760957$.
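This can be verified in a couple of lines (a trivial check, not part of the original answer):

```python
import math

# The fractional part of log10(x) equals log10 of the mantissa of x.
for x in (0.025, 0.25, 2.5, 25, 250, 2500):
    print(x, math.log10(x) % 1)   # always ~0.3979400086720376

print(math.log10(2.5))            # 0.3979400086720376
```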

If the data spans several decades (powers of $10$, not years; see Decade (log scale)), then when we combine the data from all of the decades, the distribution tends to even out. Thus, the fractional parts of the common logarithms of the data should be evenly distributed.

For example, suppose the common logarithm of the data is distributed across $4$ decades as shown below:

[Figure: the distribution of the common logarithm of the data, spread across four decades]

Summing the distributions of the fractional parts of the logarithms, that is, moving the decades on top of each other and adding curves, we get the black curve at the top of the image below, which is close to evenly distributed:

[Figure: the per-decade distributions of the fractional parts stacked on one interval; their sum, the black curve at the top, is nearly flat]

Thus, we arrive at the principal assumption of Benford's Law: the logarithm of the mantissa of data which spans several decades is typically distributed evenly.

Note the hash marks on the bottoms of the graphs above. These marks separate where the different leading digits of the mantissa live. On the line segment below, we expand these hash marks and align the leading digit of the mantissa with the fractional part of the common logarithm. The leading digit of the mantissa is $1$ if the fractional part of the common logarithm is between $0$ and $.30103$; the leading digit is $2$ if the fractional part is between $.30103$ and $.47712$; and so on.

[Figure: the unit interval of fractional parts divided at $\log(2),\log(3),\dots,\log(9)$ into the regions where each leading digit lives]

If the principal assumption of Benford's Law holds, the fractional part of the common logarithm is evenly distributed. In view of the previous diagram, it is obvious that the probability of $1$ being the leading digit is greater than that of $2$ being the leading digit; the probability of $2$ is greater than that of $3$; and so on. This is made precise below.

Data that has a mantissa starting with the digit $1$ has a common logarithm whose fractional part ranges from $\log(1)$ to $\log(2)$. If the fractional part of the common logarithm of the data is evenly distributed, then the portion of the data that starts with $1$ would be $$ \frac{\log(2)-\log(1)}{\log(10)-\log(1)}=.30102999566398119521 $$ Similarly, data that has a mantissa starting with the digit $2$ has a common logarithm whose fractional part ranges from $\log(2)$ to $\log(3)$. Thus, the portion of the data starting with $2$ would be $$ \frac{\log(3)-\log(2)}{\log(10)-\log(1)}=.17609125905568124208 $$ In the same manner, data that has a mantissa starting with the digit $d$ has a common logarithm whose fractional part ranges from $\log(d)$ to $\log(d+1)$. Thus, the portion of the data starting with $d$ would be $$ \frac{\log(d+1)-\log(d)}{\log(10)-\log(1)}\tag{1} $$ Using $(1)$, we can compute the probability that such data will start with the digit $d$: $$ \begin{array}{} d&P(d)\\ \hline\\ 1&.30102999566398119521\\ 2&.17609125905568124208\\ 3&.12493873660829995313\\ 4&.096910013008056414359\\ 5&.079181246047624827723\\ 6&.066946789630613198203\\ 7&.057991946977686754929\\ 8&.051152522447381288949\\ 9&.045757490560675125410 \end{array} $$ This distribution of leading digits is called Benford's Law.
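Formula $(1)$ simplifies to $\log_{10}(1+1/d)$, and the table above can be reproduced directly:

```python
import math

# First-digit probabilities from formula (1):
# P(d) = (log(d+1) - log(d)) / (log(10) - log(1)) = log10(1 + 1/d)
for d in range(1, 10):
    print(d, math.log10(1 + 1 / d))
```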


Further Digits

The probability that the first two digits are $10$ is $$ \frac{\log(11)-\log(10)}{\log(100)-\log(10)}=.041392685158225040750 $$ The probability that the first two digits are $20$ is $$ \frac{\log(21)-\log(20)}{\log(100)-\log(10)}=.021189299069938072794 $$ Adding the probabilities for all first digits, we can compute the probability that the second digit is $0$ to be $.11967926859688076667$.

In this manner, we can compute the probability that the second digit is $d$: $$ \begin{array}{} d&P(d)\\ \hline\\ 0&.11967926859688076667\\ 1&.11389010340755643889\\ 2&.10882149900550836859\\ 3&.10432956023095946693\\ 4&.10030820226757934031\\ 5&.096677235802322528359\\ 6&.093374735783036121570\\ 7&.090351989269603369600\\ 8&.087570053578861399175\\ 9&.084997352057692199898 \end{array} $$ For reference, here are the probabilities that the third digit is $d$: $$ \begin{array}{} d&P(d)\\ \hline\\ 0&.10178436464421710175\\ 1&.10137597744780144287\\ 2&.10097219813704129959\\ 3&.10057293211092617495\\ 4&.10017808762794737592\\ 5&.099787575692177452606\\ 6&.099401309944962084127\\ 7&.099019206561896092170\\ 8&.098641184154777437875\\ 9&.098267163678253538152 \end{array} $$ Here are the probabilities that the fourth digit is $d$: $$ \begin{array}{} d&P(d)\\ \hline\\ 0&.10017614693993552632\\ 1&.10013688811757926504\\ 2&.10009767259461432585\\ 3&.10005850028348653742\\ 4&.10001937109690488020\\ 5&.099980284947840433784\\ 6&.099941241749525329518\\ 7&.099902241415451708313\\ 8&.099863283859370683672\\ 9&.099824368995291309873 \end{array} $$
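The later-digit tables can be generated the same way, by summing over all possible earlier digits (a small sketch; the helper name `kth_digit_prob` is mine):

```python
import math
from itertools import product

def kth_digit_prob(k, d):
    """Probability that the k-th significant digit is d, for k >= 2."""
    ranges = [range(1, 10)] + [range(10)] * (k - 2)  # earlier digits
    total = 0.0
    for prefix in product(*ranges):
        n = int("".join(map(str, prefix + (d,))))
        total += math.log10(n + 1) - math.log10(n)
    return total

print(kth_digit_prob(2, 0))  # 0.119679...
print(kth_digit_prob(3, 0))  # 0.101784...
print(kth_digit_prob(4, 0))  # 0.100176...
```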

Answer (score 0)

I have a mathematics background but was not familiar with Benford's Law until I saw an excellent documentary by Lateef Hussein. When I thought about why it occurs, I immediately suspected that ranges would bias toward earlier digits. The lead digit is not selected uniformly but from a range of numbers, which could be $1$-$9$, or $1$-$5$, or $1$-$3$, and so forth; hence we would actually expect a bias toward lower digits. The law and the probability curve now make sense. This is only my opinion, as an informal answer.

Answer (score 0)

My explanation comes from a storage perspective: uniformly distributed high-precision numbers that range over many powers are very expensive to store. For example, storing $9.9999989$ and $99999989$ is much more expensive than storing $10^1$ and $10^8$, assuming "they" are okay with the anomaly. The "they" refers to our simulators ;-) Yes, I'm referring to the possibility that we are in a simulation. Instead of truly random numbers, using small randomness just around whole powers might result in huge cost savings. This type of pseudo-precision probably works really well in producing "realistic" simulations while keeping storage costs to a minimum.

Answer (score 0)

Recently learned about Benford's law. Here are my thoughts.

At first glance, this doesn't make sense because we want to assume that the datasets to which it applies are random numbers. But, they are not random numbers. These are numbers of things that are being accumulated, and numbers that accumulate always start at zero and move upward.

A random dart thrown at a randomly distributed set of numbers (having the same number of numbers starting with each digit) would certainly not follow Benford's law because the number can go from 0 to 75483 with a single throw of a dart, then to 35, then to 8203.

So, if you notice, as you are counting, every time you get to a new order of magnitude ($10$, $100$, $1000$, ..., $1{,}000{,}000$), you encounter a block of numbers starting with $1$ that is as large as the entire block of numbers you have already counted through. In other words, to avoid a number starting with $1$ you have to double the count so far. Then you get to a block of $2$s that is $1/2$ the size of the block counted so far; the $3$s are $1/3$ the size, the $4$s $1/4$ the size, and eventually the $9$s are only $1/9$ the size of the count so far. So you only have to increase the count by about $11\%$ to get past the $9$s, and then you are back at a block of $1$s that requires a $100\%$ increase to be avoided.

One person above said this applies to things that are growing exponentially. This is not correct. It applies to anything that is accumulating at any rate.

Another said it is due to a second-order distribution, where you first select an upper limit and then select a random number below it. If that were the case, then accumulated counts with upper limits starting with $9$ would follow one law and accumulated counts with upper limits starting with $1$ would follow a different one.

It comes down to the path of numbers through which an accumulated count has to travel: always through a big block of $1$s to get to a block of $2$s. And the higher the first digit, the smaller the percentage increase needed to bypass it.
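One way to see this numerically (a rough sketch; the run lengths are my own arbitrary choice) is to accumulate a count at a constant rate, record the leading digit at every step, and average over many different stopping points:

```python
import numpy as np

def leading_digit(n):
    """First digit of a positive integer."""
    while n >= 10:
        n //= 10
    return n

rng = np.random.default_rng(2)
counts = np.zeros(10)
for _ in range(100):
    stop = rng.integers(10, 30_000)       # random stopping point
    for n in range(1, stop + 1):
        counts[leading_digit(n)] += 1     # time spent at each digit

print(np.round(counts[1:] / counts.sum(), 3))  # biased toward 1, 2, ...
```

The bias toward low leading digits is clear, though the exact frequencies depend on how the stopping points are distributed.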

Answer (score 0)

Let's take two quantities that we're counting:

  • Quantity L grows linearly. It starts at 100 and grows by 1 unit per day
  • Quantity E grows exponentially. It starts at 100 and grows by 1% per day

If we record these quantities every day for a long time, we'd see that it takes L as much time to go from $1000$ to $2000$ as it does to go from $8000$ to $9000$, whereas E takes longer to go from $1000$ to $2000$ than to go from $8000$ to $9000$. This remains true across all orders of magnitude. It generalises to all quantities that grow exponentially: any such process spends longer with its leading digit at $1$ than at $2$, longer at $2$ than at $3$, and so on.
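A small sketch of this comparison (the time horizon and growth rates are the arbitrary ones from the setup above; only E should come out close to Benford):

```python
import numpy as np

def digit_histogram(values):
    """Frequency of each first significant digit 1-9."""
    digits = [int(f"{float(v):e}"[0]) for v in values]
    return np.round(np.bincount(digits, minlength=10)[1:] / len(digits), 3)

t = np.arange(5000)
L = 100 + t              # linear: +1 unit per day
E = 100 * 1.01 ** t      # exponential: +1% per day

print("L:", digit_histogram(L))  # depends entirely on where you stop
print("E:", digit_histogram(E))  # close to log10(1 + 1/d)
```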

As many processes in nature grow exponentially (the size of a settlement, the number of trees in a forest, salaries, the number of covid cases, ...), the leading digit of the things we measure is disproportionately $1$.

Note: As a previous poster pointed out, this applies to anything growing at a 'rate', not just to exponential growth.