How much data can I store on used punch cards?


I have inherited a bunch of used punch cards filled with random data, which I plan to reuse to store my own data. The question is: how efficiently can I store my data?

Assume each punch card has some fixed number of squares, and each square can either have a punch or not, so each square holds 1 bit of information. When I get the cards, each square already has a punch independently with probability 1/2. A punch can be added to any square that does not already have one, but a square that is punched must stay punched.

If I ignore the current state of the cards when I am writing my data, then I can view this as a noisy channel. If $X$ is the bit I try to write and $Y$ is the bit that ends up on the card (say 0=no punch, 1=punch), then $P(Y=1|X=1) = 1$ and $P(Y=1|X=0) = 1/2$. If $P(X=1)=p$, then I calculate that the mutual information of $X$ and $Y$ is $$ -\frac{1}{2}\left((1-p)\log(1-p) + (1+p)\log(1+p)\right) + p\log(2), $$ which attains a maximum value of $\log(5/4)\approx 0.322$ at $p=\frac{3}{5}$ (using the base 2 logarithm). This means that with a cleverly chosen coding scheme, I can record around 0.322 bits of information per square. This makes sense: half the squares are already filled, so they can store no additional information, and the uncertainty in where those squares are means that the information capacity should be somewhat less than 0.5.
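As a numerical sanity check of this maximum, here is a Python sketch (the grid search and the name `mutual_info` are my own, not part of any standard library):

```python
import math

def mutual_info(p):
    """I(X;Y) in bits for the write-once channel with half the squares pre-punched.

    Closed form from the text: -(1/2)[(1-p)lg(1-p) + (1+p)lg(1+p)] + p.
    """
    return -0.5 * ((1 - p) * math.log2(1 - p) + (1 + p) * math.log2(1 + p)) + p

# Brute-force scan for the maximizing p on a fine grid over (0, 1)
best_p = max((i / 100000 for i in range(1, 100000)), key=mutual_info)
print(best_p)                 # 0.6, i.e. p = 3/5
print(mutual_info(0.6))       # ≈ 0.3219
print(math.log2(5 / 4))       # ≈ 0.3219, same value
```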

The analysis above assumes that I am deciding where to punch without looking at where the cards are already punched. If I examine the used cards before deciding where to punch, is it possible I can do better than $\log(5/4)$? Certainly 1/2 is an upper bound.

Edit: Calculation of mutual information. If $P(X=1)=p$, then $P(Y=0)=(1-p)/2$ and $P(Y=1)=(1+p)/2$. The joint distribution and outer product distribution of $X$ and $Y$ have probability mass functions, respectively, $$ \begin{array}{l|l|l|} X\backslash Y&0&1\\ \hline 0&\frac{1-p}{2}&\frac{1-p}{2}\\ \hline 1&0&p\\ \hline \end{array},\hspace{2cm} \begin{array}{l|l|l|} X\backslash Y&0&1\\ \hline 0&\frac{(1-p)^2}{2}&\frac{(1-p)(1+p)}{2}\\ \hline 1&\frac{p(1-p)}{2}&\frac{p(1+p)}{2}\\ \hline \end{array} $$ The mutual information is the Kullback–Leibler divergence between the joint distribution and the outer product distribution, which is $$ \begin{gather*} \frac{1-p}{2}\log\left(\frac{1}{1-p}\right) + \frac{1-p}{2}\log\left(\frac{1}{1+p}\right)+0+p\log\left(\frac{2}{1+p}\right)\\=-\frac{1}{2}\left((1-p)\log(1-p) + (1+p)\log(1+p)\right) + p\log(2). \end{gather*} $$
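The table-based derivation can be checked numerically as well; in this Python sketch, `kl_check` is my own helper name, and it compares the KL divergence computed from the tables against the closed form:

```python
import math

def kl_check(p):
    """Compare KL(joint || product-of-marginals) with the closed-form expression."""
    # Joint distribution of (X, Y): P(Y=1|X=1)=1, P(Y=1|X=0)=1/2, P(X=1)=p
    joint = {(0, 0): (1 - p) / 2, (0, 1): (1 - p) / 2, (1, 0): 0.0, (1, 1): p}
    px = {0: 1 - p, 1: p}
    py = {y: sum(joint[x, y] for x in (0, 1)) for y in (0, 1)}
    # KL divergence term by term (0 log 0 terms are skipped)
    kl = sum(q * math.log2(q / (px[x] * py[y]))
             for (x, y), q in joint.items() if q > 0)
    closed = -0.5 * ((1 - p) * math.log2(1 - p) + (1 + p) * math.log2(1 + p)) + p
    return kl, closed

print(kl_check(0.6))   # both entries ≈ 0.3219
```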

Best Answer

I think your bounds $0.322<R<0.5$ are right, and that the upper bound is achievable.

Here's an AEP argument, adapted from Heegard, Chris; El Gamal, Abbas A., On the capacity of computer memory with defects, IEEE Trans. Inf. Theory 29, 731-739 (1983). ZBL0512.94005.

Let's do it in general, with $p_e$ being the "erasure" probability (here, the proportion of punched squares).

We consider binary sequences of length $n$, generated i.i.d. with some fixed $p > p_e$. The number of typical sequences will be approximately $N=2^{n h(p)}$, where $h(p)=-p \log p - (1-p) \log (1-p)$ is the binary entropy function. We randomly divide this set into $M=2^{n R}$ equally populated bins, and transmit a number in $1 \cdots M$ as a bin index.

Now, imagine we are given an "erasure pattern" (a punched card). How many sequences in our typical set are compatible with this pattern? We have approximately $n p_e$ erasures, hence $n (1-p_e)$ free bits, with a reduced probability $p'=(p-p_e)/(1-p_e)$. The answer is then $E \approx 2^{n (1-p_e) h(p')}$.

We want $E \gg M$, to be reasonably sure of finding, inside the to-be-transmitted bin, at least one compatible sequence. For this, asymptotically, it is necessary (and sufficient) that:

$$R<(1-p_e) h(p')$$

The limit value $R_0 = (1-p_e)\,h(p')$ is maximized by $p=(1+p_e)/2$, which gives $p'=\frac12$, and the optimal rate is $R_0 = 1-p_e$.

In our case, where $p_e=\frac12$, we have $p=\frac34$ and $R_0=\frac12$.
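A quick numeric check of this maximization (Python sketch; the grid search and names `h`, `rate` are mine):

```python
import math

def h(p):
    """Binary entropy in bits."""
    return 0.0 if p in (0.0, 1.0) else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def rate(p, pe):
    """R_0 = (1 - p_e) h(p') with p' = (p - p_e) / (1 - p_e)."""
    return (1 - pe) * h((p - pe) / (1 - pe))

pe = 0.5
# Scan p over (pe, 1) and pick the rate-maximizing value
best_p = max((pe + i * (1 - pe) / 10000 for i in range(1, 10000)),
             key=lambda p: rate(p, pe))
print(best_p)            # ≈ 0.75 = (1 + p_e)/2
print(rate(best_p, pe))  # ≈ 0.5  = 1 - p_e
```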

Of course, all this makes sense in the limit $n \to \infty$. Devising an efficient and practical encoding for finite $n$ is another matter.


A concrete encoding (neither too practical nor too optimal) for $p_e=\frac12$

Take $n=32$, and consider the $X=\binom{n}{n/4}=10518300$ sequences of $n$ bits which have $n_0 = \frac14 n=8$ zeros and $n_1 = \frac34 n = 24$ ones. Here I'll consider the punched squares as ones.

We randomly divide that set into $M=2^{14}=16384$ bins of $X/M \approx 642$ sequences each, and we index these bins. These indexes will be our messages. All this is known by the decoder.

The encoder proceeds as follows: to send message $i \in \{ 1,2, \cdots, M\}$, we search inside bin $i$ for some sequence that is compatible with the punched card (i.e., every punched square is a one in the sequence). If we succeed, we send that sequence. If not, we send the dummy all-ones sequence (which the decoder will understand as fail-and-retry), and retry with the next punched card.
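The encoder/decoder loop can be simulated at toy scale. This Python sketch uses my own parameters ($n=16$, $M=64$) and helper names, not the ones from the scheme above; it represents each sequence by the set of its zero positions, and a card by its set of punched squares:

```python
import itertools
import random

random.seed(0)
n, n0 = 16, 4                # toy size: 4 zeros, 12 ones per sequence
M = 2 ** 6                   # 64 bins

# All sequences with exactly n0 zeros, identified by their zero positions
seqs = [frozenset(c) for c in itertools.combinations(range(n), n0)]
random.shuffle(seqs)
bins = [seqs[i::M] for i in range(M)]    # random, nearly equal bins

def encode(msg, card_punched):
    """Return a card-compatible sequence from bin `msg`, or None (fail-and-retry)."""
    for zeros in bins[msg]:
        if not (zeros & card_punched):   # zeros must all land on unpunched squares
            return zeros
    return None

def decode(zeros):
    """Recover the message as the index of the bin containing the sequence."""
    return next(i for i, b in enumerate(bins) if zeros in b)

# Round-trip random messages over random half-punched cards
ok = 0
for trial in range(200):
    card = frozenset(i for i in range(n) if random.random() < 0.5)
    msg = random.randrange(M)
    sent = encode(msg, card)
    if sent is not None and decode(sent) == msg:
        ok += 1
print(ok, "of 200 round-trips succeeded")
```

At this toy size a noticeable fraction of cards force a retry; the point is only that whenever a compatible sequence exists in the bin, the decoder recovers the message exactly.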

The probability that a single sequence in the bin is incompatible with the card is $q_1=1-2^{-n_0}=0.99609375$ (its $n_0$ zeros must all land on unpunched squares). The probability that every sequence in the bin fails is then (approximately) $q_* = q_1^{X/M} \approx 0.081$.

Hence the expected number of sends is $1/(1-q_*) \approx 1.0882$.

Then our rate is

$$R = \frac{\log_2 M}{n/(1-q_*)} = \frac{14}{34.82} \approx 0.4020$$

By taking larger numbers we can improve the rate. For example, for $n=64$ and $M=1609263900$ I get $R=0.4732$.
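Both rates can be reproduced numerically (Python sketch; `scheme_rate` is my own helper name, and it treats $X/M$ as a real-valued bin size):

```python
import math

def scheme_rate(n, M):
    """Expected rate of the bin-indexing scheme with n/4 zeros per codeword."""
    n0 = n // 4
    X = math.comb(n, n0)          # sequences with n/4 zeros and 3n/4 ones
    q1 = 1 - 2.0 ** (-n0)         # one bin member incompatible with the card
    q_star = q1 ** (X / M)        # entire bin incompatible -> fail-and-retry
    expected_sends = 1 / (1 - q_star)
    return math.log2(M) / (n * expected_sends)

print(scheme_rate(32, 2 ** 14))       # ≈ 0.4020
print(scheme_rate(64, 1609263900))    # ≈ 0.4732
```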