How do you know that tree depth is $\left\lfloor \frac{\log(n-1)}{\log(m)} \right\rfloor + 1$?

217 Views Asked by At

I recently saw this SO answer which says that a tree depth is equal to:

Math.floor(Math.log(dataSize - 1) / Math.log(chunkSize)) + 1

where dataSize is equal to the length of an array, and chunkSize is the size of each chunk you chunk an array into, such as through this divide function:

function divide(data, size) {
    if (data.length <= size) return data;
    const result = [];
    for (let i = 0; i < data.length; i += size) {
        result.push(data.slice(i, i + size))
    }
    return divide(result, size);
}

const tree = divide(data, 5);

How do they know this? What is the math behind this logarithm etc. stuff. How do you determine from base principles that this is in fact how you calculate the depth of the corresponding tree? To me it is just pure magic, like this equation was stumbled upon by brute force and "just works, let's leave it at that". Would love to see how you can prove this is true, and how to explain it from base principles.

2

There are 2 best solutions below

0
On BEST ANSWER

A comment first: The current top answer states that if you have $n$ data entries and block size $m$, then the tree depth $D$ is $D = \lceil \log_m(n) \rceil$, which looks right to me (notice that this is basically the same quantity, just with an offset of $-1$ before taking the logarithm, and $+1$ after taking it).

So how do we find this formula ourselves, and where does the log come from? (intuitively at least, there are some additional technicalities that I won't cover).

Let the number of data entries be $n$, and the size $m$. Each loop of divide(data,size) is grouping the $n$ data entries into chunks of size at most $m$. The algorithm repeats this division on the groups, grouping these again into chunks of size at most $m$. It does this until there is one single chunk.

Thus, the depth of the tree is the number of times we repeat this procedure. So the depth $D$ of the tree is, in purely numerical terms, the number of times we must divide $n$ by $m$ to get something less than or equal to $1$. That is, $D$ is the smallest number such that

$$n \times \left(\frac{1}{m}\right) \times \dots \text{$D$ times} \dots \times \left(\frac{1}{m}\right) \leq 1$$

[At this point, it suffices to know that $\log_m(n)$ is a generalization of the number of times you need to divide $n$ by $m$ to get 1. Since we are working with integers, you'd need to take the ceiling to make sure you end up with at most one full chunk, possibly less. However, it seems that this is where your issue is, so let's explain further.]

If you're confused with how the logarithm tells us this, it's worth starting from the ground up. The logarithm is the inverse of the exponential. So the equation $\log_m(n) = k$ means that $m^k = n$. For $m,n,k$ all integers, $m^k = n$ just means $n$ is equal to $1$, multiplied by the number $m$ a total of $k$ separate times. So $n = 1\times m\times \dots \text{$k$ times} \dots \times m = m^k$. Since the logarithm is the inverse of the exponential, we see that $\log_m(n) = k$ can be interpreted as $k$ is the number of times you need to multiply $1$ by $m$ to obtain $n$, or, equivalently, $k$ is the number of times you need to divide $n$ by $m$ to obtain $1$.

With this in mind, let's go back to the algorithm: It starts with $n$ chunks (each is just one data point). At step 1, it divides the number of chunks by $m$ (by grouping), and makes $1$ tree layer. It repeats this, adding $1$ layer to the tree, until there it encounters a step at which there are $m$ or fewer chunks to group --- at which point, it makes one final return (which we can think of as the last division, guaranteeing we have at most $1$ full chunk left). So starting with $n$ chunks, each layer of the tree corresponds to $1$ division of the number of chunks by $m$, until we have "at most" one full chunk. So the depth $D$ is the smallest integer such that dividing $n$ by $m$ $D$ times yields a value $\leq 1$. So $D = \lceil \log_m(n) \rceil$.

For understanding algorithm complexity, it's very useful to think of exponentials and logarithms like this, in terms of "repeated multiplications" and "repeated divisions".


To make sure this is extra clear, let's work through an example. Say data =$\{1,2,\dots, 12 \}$ and size=$3$. Note that $\lceil \log_3(12) \rceil = 3$.

enter image description here

In the first loop, the algorithm divides the $12$ data points into $4$ chunks $A$, $B$, $C$ and $D$, creating level $1$ of the tree. Since $4>3$, it divides by $3$ again, turning the $4$ chunks into $\lceil \frac{4}{3} \rceil = 2$ chunks, and creating layer $2$. The final iteration notices that $2<3$, so the function makes its final returning, creating the third and final layer of the tree. So we simply tracked the number of times we divided $12$ by $3$ before reaching something less than or equal to $1$ (there's some trickiness with the ceilings, but that's a minor detail you can work out), so we divided $\lceil \log_3(12) \rceil$ times.

[Keeping in the theme of seeing logs as inverses of exponentials, you can also think of this as a reverse question: How many layers deep must I make an $m$-ary tree before the last layer contains at least $n$ leaves?]

EDIT: lots of typos

2
On

Given an integer $n$, if you represent in base $b$, how many digits will that representation have?

Example: $54321$ in base-$10$ is $5$ digits. How do you calculate the number of digits in base-10. Of course, 54321 is already in base-10, so you could just count the digits. What if the base was $5$?

$54321 = 3214241_5$, the number of digits is 7.

How can we calculate the number of digits as a function of the given number and base?

Recall the base conversion algorithm. It is repeated division of the partial quotients by the base, collecting partial remainders every step of the way until we get to quotient of $0$.

For $54321_{10}$ in base-$5$, we have

$$54321 = 5 \times 10864 + 1$$ $$10864 = 5 \times 2172 + 4$$ $$2172 = 5 \times 434 + 2$$ $$434 = 5 \times 86 + 4$$ $$86 = 5 \times 17 + 1$$ $$17 = 5 \times 3 + 2$$ $$3 = 5 \times 0 + 3$$

Reading the remainders from the bottom up, we have the base-$5$ representation of $54321_{10} = 3214241_5$. The number of digits of $z$ in base-$b$ is given by

$$\lfloor log_b z \rfloor + 1$$

Proof: Let $z$ be a positive integer. The base-$b$ representation of $z$ has $d$ digits if $b^{d-1} \le n \lt b^d$. Taking logarithms (base-$b$) both sides, we get $d - 1 \le log_b (z) \lt d$. $\therefore d = \lfloor log_b (z) \rfloor + 1$,

The logarithm in base-$b$ can be converted to natural logarithms in base-$e$ and we have the above expression equivalent to

$$\lfloor log_b (z) \rfloor + 1 = \left\lfloor {log (z) \over log (b) }\right\rfloor + 1$$

The algorithm complexity of the $divide()$ function in the question is closely related to the above problem. Substitute $size$ for $5$ and $length(data)$ for $54321$. The number of digits is equivalent to the number of $divide()$ calls.

This is equivalent to the repeated division of the quotient in the base conversion procedure. The $length(result)$ is the quotient after dividing $length(data)$ by $size$.

In other words, if you drew a quotient-remainder tree where the left node was the quotient and the right node was the remainder, you will have a tree with depth equal to the number of digits in base-$b$.