Min and max value of index of coincidence

58 Views Asked by At

I am pretty new to cryptography and trying to understand some mathematical aspects. Studying different types of cipher, I came across the definition of index of coincidence which is as follows:

Index of coincidence for a ciphertext of length $n$, in which the letter $A$ occurs $f_A$ times, $B$ occurs $f_B$ times, etc., is given by the formula $\frac{f_A(f_A−1)+f_B(f_B−1)+...+f_Z(f_Z−1)}{n(n−1)}$

The definition seems to be understandable for me, but I have a problem with one of the task connected with this topic:

Let $n$ be a natural number. Determine the largest and smallest possible value of index of coincidence for a string of length $n$. Give an example of a string of length $n$ for which the value of index of coincidence is the largest/the smallest.

Determining the maximum value of index of coincidence is quite obvious - it is $1$ and occurs when all letters in the ciphertext are the same (then we have $\frac{n(n-1)} {n(n-1)} = 1$).

Regarding the minimum value, I thought about the case when $n \le 26$ (the length of ciphertext is less than or equal to the number of letters in English alphabet). Then, the minimum value of index of coincidence is $0$ and it is when each letter in ciphertext occurs at most once.

But what if $n > 26$? Then it is not possible that each letter occurs at most once and some of them have to be repeated. I have no idea how to determine the minimum value of index of coincidence then. Can you help me/give me some tips?

1

There are 1 best solutions below

7
On

To summarize the discussion in the comments:

For any alphabet and any $n$, the minimal solution is realized when, given any two distinct letters $A,B$, we have $|f_A-f_B|≤1$. If the alphabet has $k$ letters and the string is $n$ letters long, then either a letter occurs $\big \lfloor \frac nk \big \rfloor$ times, or one more than that.

Proof: as there are only finitely many possible words of a given length, the minimum configuration must exist (it might not be unique, of course). Pick a minimal configuration. Suppose, to the contrary of our claim, that in this minimal configuration we can find two letters with $f_A≥f_B+2$. We need to derive a contradiction. To do so, consider the same configuration only let $f_A'=f_A-1$ and $f_B'=f_B+1$. That is to say, make a new configuration by replacing one of the $A's$ with a $B$. Now, the new numerator differs from the old only in the terms corresponding to $A,B$. We note that $$f_A'(f_A'-1)+f_B'(f_B'-1)=(f_A-1)(f_A-2)+(f_B+1)f_B=$$ $$=f_A(f_A-1) +f_B(f_B-1)-2(f_A-1)+2f_B$$

And now we just have to remark that $f_A≥f_B+2\implies 2(f_A-1)≥2f_B+2$ so this switch lowers the numerator. This contradicts our assumption of minimality, so we are done.

To describe the minimal configuration precisely, write $n=qk+r$ using the usual division algorithm. Thus, $0≤r<k$ and $q=\big \lfloor \frac nk \big \rfloor$. Now use $q$ of each letter in your word (in whatever locations, that's not important), this will leave $r$ blanks. Fill those blanks with $r$ distinct letter, doesn't matter which. In this way, we see that, for $r$ letters, $f_*=\big \lfloor \frac nk \big \rfloor+1$ while for the other $k-r$ letters $f_*=\big \lfloor \frac nk \big \rfloor$