Mean time to sharded data being unavailable in a distributed storage system


How do I model time to data unavailability given the following parameters? Data unavailability means that every machine that hosts a replica of data is down at the same time. All replicas for a single piece of data exist in a single cell on different machines.

  • $n$ - number of machines. A machine exists in exactly 1 cell.
  • $c$ - the number of cells. A cell is a group of machines. Assume each cell contains the same number of machines.
  • $f$ - mean time to failure of an individual machine.
  • $r$ - the number of replicas for each piece of data. If data is replicated twice, it exists on 2 separate machines in the same cell.
  • $d$ - the number of pieces of data in the system. Data is uniformly distributed across cells and each cell uniformly distributes data across machines.
  • $t$ - the recovery time for a machine after it fails.

For example, I might have:

  • $n$ = 100 machines
  • $c$ = 5 cells each containing 20 machines.
  • $f$ = 6 months is the mean time to failure for a machine.
  • $r$ = 2, data is replicated twice in the same cell on different machines.
  • $d$ = 1000 pieces of data. Meaning each cell has 200 pieces and each machine has 10 pieces.
  • $t$ = 1 week to recover a failed machine.
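As a rough sanity check on these numbers (my own back-of-envelope estimate, not part of the question): if each machine is independently down with steady-state probability $t/(f+t)$, then a specific piece of data is unavailable when all $r$ machines hosting its replicas are down at once. This ignores correlated failures and repair dynamics, so it only bounds intuition:

```python
# Back-of-envelope steady-state estimate (assumes independent machine
# failures and ignores correlated outages -- a rough sanity check only).
f = 26.0  # mean time to failure, in weeks (~6 months)
t = 1.0   # mean time to repair, in weeks

# Fraction of time a single machine is down.
p_down = t / (f + t)

r = 2  # replicas per piece of data
# Probability a specific piece of data is unavailable at a random
# instant: all r machines hosting its replicas are down simultaneously.
p_data_unavailable = p_down ** r

print(f"p(machine down)      = {p_down:.4f}")
print(f"p(piece unavailable) = {p_data_unavailable:.6f}")
```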

The paper "Availability in Globally Distributed Storage Systems" suggests an exponential distribution might work:

The exponential distribution is a reasonable approximation for the following reasons. First, the Weibull distribution is a generalization of the exponential distribution that allows the rate parameter to increase over time to reflect the aging of disks. In a large population of disks, the mixture of disks of different ages tends to be stable, and so the average failure rate in a cell tends to be constant. When the failure rate is stable, the Weibull distribution provides the same quality of fit as the exponential. Second, disk failures make up only a small subset of failures that we examined, and model results indicate that overall availability is not particularly sensitive to them. Finally, other authors ([24]) have concluded that correlation and non-homogeneity of the recovery rate and the mean time to a failure event have a much smaller impact on system-wide availability than the size of the event.
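To make the exponential-failure assumption concrete, here is a minimal sketch (variable names and parameters are my own, matching the example above) of drawing inter-failure times with mean $f$ using NumPy:

```python
import numpy as np

rng = np.random.default_rng(seed=42)

f_weeks = 26.0          # mean time to failure, in weeks (~6 months)
num_machines = 100
failures_per_machine = 50

# Inter-failure times: exponential with mean f (rate 1/f),
# one row of samples per machine.
inter_failure = rng.exponential(scale=f_weeks,
                                size=(num_machines, failures_per_machine))

# With this many samples, the sample mean should sit near f.
print(inter_failure.mean())
```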

Best answer:

I didn't know how to solve this analytically, so I ran a Monte Carlo simulation. Graphs are at the bottom.

Python code: https://github.com/jschaf/cellarch

The simulation works as follows:

  • Partition NUM_MACHINES machines into NUM_PARTITIONS separate groups. The simulation uses the term partition for what the question above calls a cell.

  • Uniformly distribute NUM_DATA pieces of data across subsets of machines such that:

    1. All machines in a subset reside in the same partition.
    2. Every subset contains exactly NUM_REPLICAS machines.
  • Generate machine failure start times by drawing samples from an exponential distribution.

  • Take the cumulative sum of the failure times to generate each machine's subsequent failure times. For example, [1, 3, 2, 7] becomes [1, 4, 6, 13].

  • Create an outage interval for each failure by adding a repair time, drawn from a normal distribution, to the failure start time. For example, drawing repair times R1 through R4 turns [1, 4, 6, 13] into [(1, 1 + R1), (4, 4 + R2), (6, 6 + R3), (13, 13 + R4)].

  • Find all sets of N machines that are down at the same time; call such a set an outage clique. When N == NUM_REPLICAS, some subset of data might be unavailable.

  • Find all outage cliques where every machine in the clique hosts the same piece of data. Any such clique means some data is completely unavailable.

  • Run the above steps many times to get the time to first data unavailability.
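The steps above can be sketched end to end as follows. This is a minimal re-implementation under assumed parameters (constants, names, and the repair-time distribution parameters are my own), not the code in the linked repo:

```python
import itertools
import numpy as np

rng = np.random.default_rng(0)

NUM_MACHINES = 100
NUM_PARTITIONS = 5
NUM_REPLICAS = 2
NUM_DATA = 1000
MTTF_WEEKS = 26.0       # exponential mean time to failure
REPAIR_MEAN = 1.0       # normal mean time to repair, in weeks (assumed)
REPAIR_STD = 0.2        # assumed spread of repair times
HORIZON = 52.0 * 20     # simulate 20 years, in weeks

def simulate_once():
    """Return the first time any piece of data is fully unavailable."""
    per_part = NUM_MACHINES // NUM_PARTITIONS
    partitions = [list(range(i * per_part, (i + 1) * per_part))
                  for i in range(NUM_PARTITIONS)]

    # Assign each piece of data to a NUM_REPLICAS-subset of machines
    # inside a uniformly chosen partition.
    replica_sets = set()
    for _ in range(NUM_DATA):
        part = partitions[rng.integers(NUM_PARTITIONS)]
        machines = rng.choice(part, NUM_REPLICAS, replace=False)
        replica_sets.add(tuple(sorted(machines)))

    # Outage intervals per machine: cumulative exponential failure
    # times, each extended by a normally distributed repair time.
    outages = {}
    for m in range(NUM_MACHINES):
        n = int(HORIZON / MTTF_WEEKS * 4) + 10  # more than enough draws
        starts = np.cumsum(rng.exponential(MTTF_WEEKS, n))
        starts = starts[starts < HORIZON]
        repairs = np.maximum(
            rng.normal(REPAIR_MEAN, REPAIR_STD, len(starts)), 0.0)
        outages[m] = list(zip(starts, starts + repairs))

    # Earliest moment every machine in some replica set is down at once.
    first = np.inf
    for rs in replica_sets:
        for ivals in itertools.product(*(outages[m] for m in rs)):
            overlap_start = max(s for s, _ in ivals)
            overlap_end = min(e for _, e in ivals)
            if overlap_start < overlap_end:  # intervals overlap
                first = min(first, overlap_start)
    return first
```

Running `simulate_once()` many times and taking the time to first unavailability per run yields the kind of distribution the graphs below summarize.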

[Graphs: outages with a MTTF up to 60 months; outages with a MTTF up to 30 months; outages by partition]