What would be a good error correction code for recovering from 50% data corruption?


For my use case it's quite likely that much of the data will be corrupted in such a way that I do not know which bits are corrupted and which bits aren't. As in, half or so of the data may be completely randomized. Hamming codes look good but I'm not sure how to extend it to more than 1 bit of correction. Reed-Solomon codes are also very good, but I don't know how to apply it without knowing which bits are corrupted. What would be a good coding scheme for this use case?


BEST ANSWER

If by "half the data is corrupted" you mean that the bit-error probability is $p = 1/2$, then you are in trouble. For a BSC (Binary Symmetric Channel), the channel capacity per symbol is:

$$C_S = 1 + p \log_2 p + (1-p)\log_2(1-p)$$

This gives $C_S = 0$ for $p = 1/2$.

This implies that no code can correct such a randomized channel.
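To see how the capacity collapses as the error probability approaches $1/2$, the formula above can be evaluated directly (a small illustration, not part of the original answer):

```python
import math

def bsc_capacity(p: float) -> float:
    """Capacity per symbol of a binary symmetric channel with
    bit-error probability p: C = 1 + p*log2(p) + (1-p)*log2(1-p)."""
    if p in (0.0, 1.0):
        return 1.0  # limit value: p*log2(p) -> 0 as p -> 0
    return 1.0 + p * math.log2(p) + (1.0 - p) * math.log2(1.0 - p)

for p in (0.0, 0.1, 0.25, 0.5):
    print(f"p = {p:<4}  capacity = {bsc_capacity(p):.4f}")
```

At $p = 0.25$ the capacity is already down to about 0.19 bits per transmitted bit, and at $p = 0.5$ it is exactly zero: the channel output is statistically independent of the input.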

In a comment, you explain that the data are corrupted in blocks (this should have been stated in the post itself): for example, one half is heavily corrupted and the other half is correct. In such an extreme situation, the best approach is block transmission with a CRC on each block, so that you can detect which blocks are corrupted and which are not. Blocks detected as corrupted (heavily incorrect) must be thrown away. To recover the whole message, you then need some form of data-block repetition.

The important point is that if a given block is heavily corrupted, the capacity analysis above tells you that the best thing you can do with it is to discard it.

In a real transmission system with a return channel, this is handled by retransmitting the erroneous blocks (an ARQ process).
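The block-plus-CRC idea can be sketched as follows (a minimal illustration; the 1024-byte block size and the use of CRC-32 are assumptions, since the answer does not prescribe a specific check):

```python
import struct
import zlib

BLOCK = 1024  # payload bytes per block (assumed)

def encode(data: bytes) -> bytes:
    """Split data into fixed-size blocks and append a CRC-32 to each."""
    out = bytearray()
    for i in range(0, len(data), BLOCK):
        chunk = data[i:i + BLOCK]
        out += chunk + struct.pack(">I", zlib.crc32(chunk))
    return bytes(out)

def decode(stream: bytes):
    """Return (block_index, payload) pairs for blocks whose CRC checks out.
    Blocks that fail the check are treated as erasures and skipped;
    a real system would request their retransmission (ARQ)."""
    good = []
    step = BLOCK + 4
    for n, i in enumerate(range(0, len(stream), step)):
        record = stream[i:i + step]
        chunk, (crc,) = record[:-4], struct.unpack(">I", record[-4:])
        if zlib.crc32(chunk) == crc:
            good.append((n, chunk))
    return good
```

Corrupting any byte of a block makes its CRC check fail, so the receiver knows exactly which blocks to discard and re-request, turning random errors into erasures at block granularity.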

ANSWER

You will need some very significant amount of redundancy here.

You can work under the weakest assumption that 50% of all bits are randomised. So 50% are correct because they were not randomised, 25% are correct because randomising gave the correct value, and 25% are wrong because randomising gave the wrong value.

As you describe it, you can use a better assumption: that large ranges are either completely correct or completely randomised. Say ranges of 1024 bytes are either completely correct or completely random. A CRC check will identify the correct ranges at a cost of, say, 8 bytes per block. So n blocks of 1024 bytes will give you on average n/2 × 1016 bytes of known-correct data and n/2 × 1024 bytes of random rubbish. You then only need to transmit about twice as much data as you would without corruption. It gets harder if the unmodified chunks sit at random places and have smaller sizes.
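The overhead estimate works out as follows (the 1024-byte block, 8-byte check, and 50% survival rate are the answer's own rough figures):

```python
BLOCK = 1024    # bytes transmitted per block
CHECK = 8       # per-block CRC/check bytes (the answer's rough figure)
P_GOOD = 0.5    # fraction of blocks assumed to arrive uncorrupted

payload = BLOCK - CHECK                # useful bytes in a surviving block
useful_per_block = P_GOOD * payload    # expected useful bytes per block sent
overhead = BLOCK / useful_per_block    # bytes sent per useful byte recovered
print(f"overhead factor: {overhead:.3f}")  # about 2.02x
```

So the "about twice as much data" figure is the 2× from losing half the blocks, plus a small extra factor (1024/1016) for the per-block check bytes.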

You would be much better off if your reading process can tell you which data are damaged. For a scratched CD, for example, it should be possible to identify the damaged regions, turning unknown errors into known erasures.