How do I model time to data unavailability given the following parameters? Data unavailability means that every machine that hosts a replica of data is down at the same time. All replicas for a single piece of data exist in a single cell on different machines.
- $n$ - number of machines. A machine exists in exactly 1 cell.
- $c$ - the number of cells. A cell is a group of machines. Assume each cell contains the same number of machines.
- $f$ - mean time to failure of an individual machine.
- $r$ - the number of replicas for each piece of data. If data is replicated twice, it exists on 2 separate machines in the same cell.
- $d$ - the number of pieces of data in the system. Data is uniformly distributed across cells and each cell uniformly distributes data across machines.
- $t$ - the recovery time for a machine after it fails.
For example, I might have:
- $n$ = 100 machines
- $c$ = 5 cells each containing 20 machines.
- $f$ = 6 months is the mean time to failure for a machine.
- $r$ = 2, data is replicated twice in the same cell on different machines.
- $d$ = 1000 pieces of data. Meaning each cell has 200 pieces and each machine has 10 pieces.
- $t$ = 1 week to recover a failed machine.
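Before simulating, a crude steady-state estimate is possible: a machine is down roughly a fraction $t/(f+t)$ of the time, so if failures are independent, all $r$ replicas of one piece are down about $(t/(f+t))^r$ of the time. A minimal sketch with the example numbers (the independence assumption is mine and ignores correlated failures):

```python
# Rough steady-state sketch: probability that all r replicas of one
# piece of data are down at the same instant, assuming independent
# machine failures and ignoring correlated outages.
f = 26.0  # mean time to failure, in weeks (6 months)
t = 1.0   # recovery time, in weeks
r = 2     # replicas per piece of data
d = 1000  # pieces of data

p_down = t / (f + t)     # fraction of time one machine is down
p_piece = p_down ** r    # all r replicas down at once (independence)
expected_unavailable = d * p_piece

print(f"p(machine down) = {p_down:.4f}")
print(f"p(piece down)   = {p_piece:.6f}")
print(f"E[unavailable pieces] = {expected_unavailable:.2f}")
```

This is only a sanity check on the simulation's output, not a model of time to first unavailability.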
The paper *Availability in Globally Distributed Storage Systems* suggests an exponential distribution might work:

> The exponential distribution is a reasonable approximation for the following reasons. First, the Weibull distribution is a generalization of the exponential distribution that allows the rate parameter to increase over time to reflect the aging of disks. In a large population of disks, the mixture of disks of different ages tends to be stable, and so the average failure rate in a cell tends to be constant. When the failure rate is stable, the Weibull distribution provides the same quality of fit as the exponential. Second, disk failures make up only a small subset of failures that we examined, and model results indicate that overall availability is not particularly sensitive to them. Finally, other authors ([24]) have concluded that correlation and non-homogeneity of the recovery rate and the mean time to a failure event have a much smaller impact on system-wide availability than the size of the event.
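Under the exponential assumption, per-machine failure intervals can be drawn directly. A minimal sketch with NumPy; the variable names and seed are my choices:

```python
import numpy as np

# Mean time to failure from the example: 6 months, about 26 weeks.
f_weeks = 26.0

rng = np.random.default_rng(seed=42)

# Inter-failure times for one machine: exponential with mean f
# (equivalently, rate 1/f).
intervals = rng.exponential(scale=f_weeks, size=100_000)

# The sample mean should be close to f.
print(intervals.mean())
```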
I ran a Monte Carlo simulation because I didn't know how to model the problem analytically. Graphs are at the bottom.
Python code: https://github.com/jschaf/cellarch
The simulation works as follows:
1. Partition NUM_MACHINES into NUM_PARTITIONS separate groups. The simulation uses *partition* to refer to separate groups instead of *cell* used by the math StackExchange question.
2. Uniformly distribute NUM_DATA pieces of data so that each piece is assigned to NUM_REPLICAS distinct machines within a single partition.
3. Generate machine failure start times by drawing samples from an exponential distribution.
4. Take the cumulative sum of the failure times to generate subsequent failure times for a machine. Meaning, turn `[1, 3, 2, 7]` into `[1, 4, 6, 13]`.
5. Create an outage for each failure by adding the time to repair to the failure start time. The time to repair is drawn from a normal distribution. Meaning, assuming we draw repair times `R1..R4` from the normal distribution, turn `[1, 4, 6, 13]` into `[(1, 1 + R1), (4, 4 + R2), (6, 6 + R3), (13, 13 + R4)]`.
6. Find all outages where N machines are down at the same time. This is an outage clique. When `N == NUM_REPLICAS`, we might have an outage for some subset of data.
7. Find all outage cliques where each machine in the clique hosts the same piece of data. The found cliques mean some data is completely unavailable.
8. Run the above steps many times to get the distribution of time to first data unavailability.
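The steps above can be sketched end to end. This is a simplified re-implementation, not the code in the linked repo; the repair-time standard deviation and the 10-year horizon per trial are assumptions of mine:

```python
import numpy as np

rng = np.random.default_rng(seed=0)

NUM_MACHINES = 100
NUM_PARTITIONS = 5
NUM_REPLICAS = 2
NUM_DATA = 1000
MTTF = 26.0         # mean time to failure, in weeks
REPAIR_MEAN = 1.0   # mean repair time, in weeks
REPAIR_STD = 0.1    # assumed spread of the normal repair distribution
HORIZON = 520.0     # simulate 10 years per trial

def intersect(a, b):
    """Intersection of two sorted, disjoint interval lists."""
    out, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        lo = max(a[i][0], b[j][0])
        hi = min(a[i][1], b[j][1])
        if lo < hi:
            out.append((lo, hi))
        # Advance whichever interval ends first.
        if a[i][1] < b[j][1]:
            i += 1
        else:
            j += 1
    return out

def first_unavailability():
    """One trial: time of first complete data unavailability (or inf)."""
    per_part = NUM_MACHINES // NUM_PARTITIONS

    # Steps 1-2: place each piece on NUM_REPLICAS distinct machines
    # inside one partition, spreading data evenly over partitions.
    placements = []
    for i in range(NUM_DATA):
        part = i % NUM_PARTITIONS
        machines = rng.choice(per_part, size=NUM_REPLICAS, replace=False)
        placements.append(part * per_part + machines)

    # Steps 3-5: cumulative exponential failure times, each followed by
    # a normally distributed repair window.
    outages = []
    for _ in range(NUM_MACHINES):
        t, windows = 0.0, []
        while t < HORIZON:
            t += rng.exponential(MTTF)  # next failure start
            repair = max(rng.normal(REPAIR_MEAN, REPAIR_STD), 0.0)
            windows.append((t, t + repair))
            t += repair
        outages.append(windows)

    # Steps 6-7: a piece is unavailable whenever the outage windows of
    # all its replica machines overlap.
    first = np.inf
    for machines in placements:
        windows = outages[machines[0]]
        for m in machines[1:]:
            windows = intersect(windows, outages[m])
        if windows:
            first = min(first, windows[0][0])
    return first

# Step 8: repeat to estimate the distribution of the time to first
# data unavailability (in weeks).
times = [first_unavailability() for _ in range(20)]
print(sorted(times)[:5])
```

Intersecting the full interval lists, rather than checking only the first overlapping window, keeps the clique search correct for any NUM_REPLICAS, not just 2.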