I recently saw this SO answer, which says that the depth of the tree is equal to:
Math.floor(Math.log(dataSize - 1) / Math.log(chunkSize)) + 1
where dataSize is equal to the length of an array, and chunkSize is the size of each chunk you chunk an array into, such as through this divide function:
function divide(data, size) {
  if (data.length <= size) return data;
  const result = [];
  for (let i = 0; i < data.length; i += size) {
    result.push(data.slice(i, i + size));
  }
  return divide(result, size);
}
const tree = divide(data, 5);
How do they know this? What is the math behind this logarithm stuff? How do you determine from first principles that this is in fact how you calculate the depth of the corresponding tree? To me it is pure magic, as if this equation was stumbled upon by brute force and "just works, so let's leave it at that". I would love to see a proof that this is true, explained from first principles.
A comment first: The current top answer states that if you have $n$ data entries and block size $m$, then the tree depth $D$ is $D = \lceil \log_m(n) \rceil$, which looks right to me (notice that this is basically the same quantity, just with an offset of $-1$ before taking the logarithm, and $+1$ after taking it).
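To see that the two expressions really do compute the same quantity, here is a quick numerical check (a sketch of my own; `depthFloor` and `depthCeil` are names I'm making up, and the sample values are chosen so that floating-point rounding of the logarithms is not an issue):

```javascript
// Depth per the question's formula: floor(log_m(n - 1)) + 1.
const depthFloor = (n, m) => Math.floor(Math.log(n - 1) / Math.log(m)) + 1;

// Depth per this answer's formula: ceil(log_m(n)).
const depthCeil = (n, m) => Math.ceil(Math.log(n) / Math.log(m));

console.log(depthFloor(12, 3), depthCeil(12, 3));   // 3 3
console.log(depthFloor(7, 2), depthCeil(7, 2));     // 3 3
console.log(depthFloor(100, 7), depthCeil(100, 7)); // 3 3
```

(One caveat: when $n$ is an exact power of $m$, `Math.log(n) / Math.log(m)` can land a hair above the true integer value and `Math.ceil` then overshoots by one, so a careful implementation would compute this with integer arithmetic instead.)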
So how do we find this formula ourselves, and where does the log come from? (Intuitively, at least; there are some additional technicalities that I won't cover.)
Let the number of data entries be $n$, and the chunk size $m$. Each call of divide(data, size) groups the $n$ data entries into chunks of size at most $m$. The algorithm then repeats this division on the resulting groups, grouping these again into chunks of size at most $m$, until only a single chunk remains.
Thus, the depth of the tree is the number of times we repeat this procedure. So the depth $D$ of the tree is, in purely numerical terms, the number of times we must divide $n$ by $m$ to get something less than or equal to $1$. That is, $D$ is the smallest number such that
$$n \times \left(\frac{1}{m}\right) \times \dots \text{$D$ times} \dots \times \left(\frac{1}{m}\right) \leq 1$$
[At this point, it suffices to know that $\log_m(n)$ is a generalization of the number of times you need to divide $n$ by $m$ to get 1. Since we are working with integers, you'd need to take the ceiling to make sure you end up with at most one full chunk, possibly less. However, it seems that this is where your issue is, so let's explain further.]
If you're confused with how the logarithm tells us this, it's worth starting from the ground up. The logarithm is the inverse of the exponential. So the equation $\log_m(n) = k$ means that $m^k = n$. For $m,n,k$ all integers, $m^k = n$ just means
$n$ is equal to $1$ multiplied by the number $m$ a total of $k$ separate times. So $n = 1\times m\times \dots \text{$k$ times} \dots \times m = m^k$. Since the logarithm is the inverse of the exponential, we see that $\log_m(n) = k$ can be interpreted as

"$k$ is the number of times you need to multiply $1$ by $m$ to obtain $n$",

or, equivalently,

"$k$ is the number of times you need to divide $n$ by $m$ to obtain $1$".

With this in mind, let's go back to the algorithm: it starts with $n$ chunks (each is just one data point). At step 1, it divides the number of chunks by $m$ (by grouping), creating $1$ tree layer. It repeats this, adding $1$ layer to the tree each time, until it encounters a step at which there are $m$ or fewer chunks to group --- at which point it makes one final return (which we can think of as the last division, guaranteeing we have at most $1$ full chunk left). So, starting with $n$ chunks, each layer of the tree corresponds to $1$ division of the number of chunks by $m$, until we have "at most" one full chunk. Hence the depth $D$ is the smallest integer such that dividing $n$ by $m$ $D$ times yields a value $\leq 1$, i.e. $D = \lceil \log_m(n) \rceil$.
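This repeated-division view translates directly into code. Here is a small sketch (`depthByDivision` is a name of my own invention) that counts ceiling divisions until at most one full chunk remains, which avoids the logarithm entirely:

```javascript
// Count how many times we must divide n by m (rounding up, since a
// partial chunk still occupies a slot) before at most 1 chunk remains.
function depthByDivision(n, m) {
  let depth = 0;
  while (n > 1) {
    n = Math.ceil(n / m); // one grouping pass = one tree layer
    depth++;
  }
  return depth;
}

console.log(depthByDivision(12, 3)); // 3, and ceil(log_3(12)) = 3
console.log(depthByDivision(50, 4)); // 3, and ceil(log_4(50)) = 3
```

Because this loop uses only integer-valued arithmetic, it also sidesteps the floating-point pitfalls of computing $\log_m(n)$ as a ratio of `Math.log` calls.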
For understanding algorithm complexity, it's very useful to think of exponentials and logarithms like this, in terms of "repeated multiplications" and "repeated divisions".
To make sure this is extra clear, let's work through an example. Say data $= \{1, 2, \dots, 12\}$ and size $= 3$. Note that $\lceil \log_3(12) \rceil = 3$.
In the first call, the algorithm divides the $12$ data points into $4$ chunks $A$, $B$, $C$ and $D$, creating level $1$ of the tree. Since $4 > 3$, it divides by $3$ again, turning the $4$ chunks into $\lceil \frac{4}{3} \rceil = 2$ chunks and creating layer $2$. The final iteration notices that $2 < 3$, so the function makes its final return, creating the third and final layer of the tree. We simply tracked the number of times we divided $12$ by $3$ before reaching something less than or equal to $1$ (there's some trickiness with the ceilings, but that's a minor detail you can work out), so we divided $\lceil \log_3(12) \rceil = 3$ times.
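You can confirm this by running the question's divide on that example and measuring how deeply nested the result is (`nestingDepth` is a hypothetical helper I'm adding, not part of the original code):

```javascript
// The divide function from the question.
function divide(data, size) {
  if (data.length <= size) return data;
  const result = [];
  for (let i = 0; i < data.length; i += size) {
    result.push(data.slice(i, i + size));
  }
  return divide(result, size);
}

// Depth of array nesting: a plain value is depth 0, [1, 2, 3] is depth 1, etc.
const nestingDepth = (x) =>
  Array.isArray(x) ? 1 + Math.max(...x.map(nestingDepth)) : 0;

const data = Array.from({ length: 12 }, (_, i) => i + 1); // [1, ..., 12]
console.log(divide(data, 3));
// [[[1,2,3],[4,5,6],[7,8,9]],[[10,11,12]]]
console.log(nestingDepth(divide(data, 3))); // 3 = ceil(log_3(12))
```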
[Keeping in the theme of seeing logs as inverses of exponentials, you can also think of this as a reverse question: How many layers deep must I make an $m$-ary tree before the last layer contains at least $n$ leaves?]
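In code, that reverse question reads: keep multiplying the leaf capacity by $m$ until it reaches $n$ (again a sketch; `layersNeeded` is my own name for it):

```javascript
// Smallest D such that an m-ary tree of depth D has m^D >= n leaves.
function layersNeeded(n, m) {
  let layers = 0;
  let leaves = 1; // a depth-0 tree is a single node
  while (leaves < n) {
    leaves *= m; // each extra layer multiplies leaf capacity by m
    layers++;
  }
  return layers;
}

console.log(layersNeeded(12, 3)); // 3, since 3^2 = 9 < 12 but 3^3 = 27 >= 12
```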