Splitting a file into $m$ pieces of size $1/n$, such that any $n$ pieces allow you to recover the file?


Let's say we have a file, which we can define as a finite sequence of 0s and 1s (or any other two symbols).

For $m > n$, can you create $m$ pieces (which are themselves files), each $\frac 1n$th the size of the original file, such that any $n$ pieces can recreate the original file?

(I have a feeling this may relate to , but I'm not sure.)

Note: I think it is impossible exactly as stated, so how about requiring only that the pieces approach $\frac 1n$th the size of the original file as the size of the original file approaches $\infty$?

Note: I've solved the specific case of $m=3$ and $n=2$. For every two bits of the original file (say $a$ and $b$), you give each piece one bit. The first piece gets $a$, the second piece gets $b$, and the third piece gets $a \oplus b$. If you have the first and second pieces, you have all the bits. If you have the third piece and one other piece, you can XOR them bit by bit to recover the missing piece.
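The $m=3$, $n=2$ scheme above is small enough to sketch in code. A minimal Python illustration (splitting at byte rather than bit granularity, and assuming an even-length file; the function names are my own):

```python
# Sketch of the m=3, n=2 scheme: for each pair (a, b), piece 0 stores a,
# piece 1 stores b, and piece 2 stores a XOR b.

def split(data: bytes):
    """Split data (assumed even length) into three half-size pieces."""
    a = data[0::2]                          # even-indexed bytes
    b = data[1::2]                          # odd-indexed bytes
    c = bytes(x ^ y for x, y in zip(a, b))  # parity piece
    return a, b, c

def recover(pieces):
    """Recover the file from any two pieces; pieces is a dict {index: bytes}."""
    if 0 in pieces and 1 in pieces:
        a, b = pieces[0], pieces[1]
    elif 0 in pieces:                       # have a and parity: b = a XOR c
        a = pieces[0]
        b = bytes(x ^ y for x, y in zip(a, pieces[2]))
    else:                                   # have b and parity: a = b XOR c
        b = pieces[1]
        a = bytes(x ^ y for x, y in zip(b, pieces[2]))
    out = bytearray()
    for x, y in zip(a, b):                  # re-interleave the two halves
        out += bytes([x, y])
    return bytes(out)

data = b"hello world!"                      # even length
a, b, c = split(data)
assert recover({0: a, 1: b}) == data
assert recover({0: a, 2: c}) == data
assert recover({1: b, 2: c}) == data
```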

3 Answers

What you need is a so-called MDS (maximum distance separable) code. You use symbols over some $q$-ary alphabet, where each symbol is a "sub-file". Over small alphabets these codes are exceedingly rare for most parameter combinations; you've happened upon one of the few examples of a binary MDS code, namely the single parity-check code. Check out Reed-Solomon codes for an example with huge applications, both theoretical and practical.
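The single parity-check code tolerates only one missing piece, but the Reed-Solomon idea handles general $m$ and $n$: view $n$ data symbols as the coefficients of a polynomial and give each piece one evaluation point; any $n$ evaluations determine the polynomial. A minimal Python sketch, working over the prime field $GF(257)$ for simplicity rather than the $GF(2^8)$ used in practice (all names here are my own illustration, not a standard API):

```python
# Reed-Solomon-style splitting, sketched over the prime field GF(257).
# (With GF(257) a symbol can take the value 256, so pieces are lists of
# small integers rather than raw bytes -- a sketch, not production code.)
P = 257

def encode(chunk, m):
    """Treat the n symbols of chunk as polynomial coefficients and give
    piece x the evaluation f(x), at m distinct points x = 1..m."""
    return [(x, sum(s * pow(x, i, P) for i, s in enumerate(chunk)) % P)
            for x in range(1, m + 1)]

def polymul_linear(poly, xk):
    """Multiply a coefficient list (low-to-high) by (x - xk), mod P."""
    out = [0] * (len(poly) + 1)
    for i, b in enumerate(poly):
        out[i] = (out[i] - xk * b) % P
        out[i + 1] = (out[i + 1] + b) % P
    return out

def decode(pieces, n):
    """Recover the n coefficients from any n pieces (Lagrange interpolation)."""
    xs, ys = zip(*pieces[:n])
    coeffs = [0] * n
    for j in range(n):
        basis, denom = [1], 1
        for k in range(n):
            if k != j:
                basis = polymul_linear(basis, xs[k])
                denom = denom * (xs[j] - xs[k]) % P
        scale = ys[j] * pow(denom, P - 2, P) % P   # modular inverse via Fermat
        coeffs = [(c + scale * b) % P for c, b in zip(coeffs, basis)]
    return coeffs

chunk = [104, 105, 33]        # n = 3 data symbols
pieces = encode(chunk, 5)     # m = 5 pieces
assert decode([pieces[4], pieces[1], pieces[2]], 3) == chunk
```

Any $n$ of the $m$ pieces suffice, and each piece is one symbol per $n$ symbols of data, i.e. $\frac 1n$ of the file (plus the evaluation point as overhead).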


What you described can be attained by random linear network coding. Assume a group of $g$ packets (segments of a file) $\{\mathbf{u}_1,\cdots,\mathbf{u}_g\}$, each of size $l$. To generate a new coded packet, these $g$ packets are combined by a random coding vector ${\mathbf c}_i$ containing $g$ random elements $(c_{i1},\ldots,c_{ig})$ to form a new packet of the same length. Denoting the coded packet by ${\mathbf y}_i$, the encoding can be represented by \begin{equation} \label{eq_yi} {{\mathbf y}_i}_{(1\times l)}={{\mathbf c}_i}_{(1\times g)}{\mathbf U}_{(g\times l)}, \end{equation} where the sizes are given in the parentheses and the original message (file) is denoted by \begin{equation} \mathbf{U}=\begin{bmatrix} \mathbf{u}_1\\ \mathbf{u}_2\\ \vdots \\ \mathbf{u}_g \end{bmatrix}, \end{equation} in which $\mathbf{u}_i, \,\forall i \in [1,g]$ is the row-vector representation of the $i$-th original packet. For convenience, we can denote the process of generating $T$ encoded packets in matrix form:
\begin{equation} \label{eq_y} {\mathbf Y}_{(T\times l)}={\mathbf C}_{(T\times g)}{\mathbf U}_{(g\times l)}, \end{equation} where \begin{equation} \begin{matrix} \mathbf{Y}=\begin{bmatrix} \mathbf{y}_1\\ \mathbf{y}_2\\ \vdots \\ \mathbf{y}_T \end{bmatrix} ,& \mathbf{C}=\begin{bmatrix} \mathbf{c}_1\\ \mathbf{c}_2\\ \vdots \\ \mathbf{c}_T \end{bmatrix}. \end{matrix} \end{equation} The decoding problem is a system of linear equations, and it can be solved as soon as any $g$ linearly independent coded packets are available (the random coefficients are assumed known).

To recover ${\mathbf U}$ one needs to solve \begin{equation} \label{eq_rlncdec} \hat{{\mathbf U}}={\mathbf C}_c^{-1}{\mathbf Y}_c, \end{equation} where ${\mathbf C}_c$ is a $g\times g$ matrix formed from $g$ linearly independent coding vectors and ${\mathbf Y}_c$ is the matrix of the corresponding coded packets.

It can be shown that if you generate slightly more than $g$ coded packets ($T>g$), then with high probability you will have $g$ linearly independent coding vectors.

Note: operations are performed in $GF(q)$. A convenient case would be $q=2$.
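Over $GF(2)$ the scheme above reduces to XORing random subsets of the packets, and decoding is Gaussian elimination on the coding vectors. A minimal sketch, assuming equal-length packets (the function names are my own illustration):

```python
# Random linear network coding over GF(2): each coded packet is the XOR
# of a random subset of the g original packets; decoding is Gauss-Jordan
# elimination on the coding vectors.
import random

def rlnc_encode(packets, T, rng=random.Random()):
    """Generate T coded packets from g equal-length original packets."""
    g = len(packets)
    coded = []
    for _ in range(T):
        c = [rng.randrange(2) for _ in range(g)]       # coding vector c_i
        y = bytearray(len(packets[0]))
        for j, bit in enumerate(c):
            if bit:
                y = bytearray(a ^ b for a, b in zip(y, packets[j]))
        coded.append((c, bytes(y)))
    return coded

def rlnc_decode(coded, g):
    """Return the g original packets, or None if the received coding
    vectors do not have full rank (rank < g)."""
    rows = [(list(c), bytearray(y)) for c, y in coded]
    pivots = []
    for col in range(g):
        piv = None
        for r in rows:                                  # find a pivot row
            if r[0][col] and all(r is not p for p in pivots):
                piv = r
                break
        if piv is None:
            return None                                 # rank deficient
        pivots.append(piv)
        for r in rows:                                  # eliminate this column
            if r is not piv and r[0][col]:
                r[0][:] = [a ^ b for a, b in zip(r[0], piv[0])]
                r[1][:] = bytearray(a ^ b for a, b in zip(r[1], piv[1]))
    return [bytes(p[1]) for p in pivots]

packets = [b"ab", b"cd", b"ef"]          # g = 3 original packets
decoded = None
while decoded is None:                   # retry if the random vectors are rank deficient
    decoded = rlnc_decode(rlnc_encode(packets, T=6), 3)
assert decoded == packets
```

The retry loop illustrates the point above: a batch of $T>g$ random $GF(2)$ vectors is full rank only with high probability, not with certainty, so a receiver may need a few extra packets.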


While the basic idea is already captured by network coding techniques and by Reed-Solomon codes used in erasure-correction mode, much of the recent state-of-the-art work concentrates on distributed storage codes, which study the trade-offs between parameters such as encoding and decoding speed, repair bandwidth, and storage overhead, as well as techniques such as repair-by-transfer.

A good place to start is the collection at http://storagewiki.ece.utexas.edu/doku.php