Failure recovery problem

48 Views Asked by At

I have a problem like this:

I have a computer that fails with chance p. I split a job into sequential parts each 1 step. I run them 1 by 1 on computer and each one takes time T. To be able to recover from fails, I make checkpoints every x step with time C*T, so when fail, I recover from last checkpoint. how I need to select x, to have best time for system overall? if I make backup every turn, overall time will so very big,because making checkpoint takes a lot time if I make backup very rarely, I need to fall back too many steps when a failure happens. so x must be a good number.

Any idea how I can solve this problem?

Update: I have tried to calculate expected time between 2 checkpoints in this way: O-----f--O O=checkpoint -=step f=failure x=number of steps $$ E(t) = x*T*(1-p)^x + \sum_{f=1}^\infty (x*T*(f+1)*x^f*p^f*(1-p)^{f*(x-1)}) $$ which can be completely wrong..O.O