I am trying to prove a structural property of a Markov Decision Process (MDP), but I have not been able to do so. I am wondering if someone can give me some insight into how to prove it, or into why it cannot be proven.
The problem description
A production machine puts stickers on toys. Placing a sticker on a toy takes exactly one time unit. The machine takes stickers from a roll that is empty after 100 stickers, at which point the roll must be replaced, which takes T_replace time units on average. During replacement the machine is down, which costs C_downtime per time unit. While placing a sticker on a toy, the machine can run into a problem with probability P_problem. If a problem occurs, the sticker being placed is useless and is thrown away. Recovering from the problem takes T_recover time units on average, and the machine is down during this time. When the machine faces a problem, one can choose to preventively replace the sticker roll. This costs C_sticker per sticker still on the roll (those stickers have to be thrown away), and replacing the roll during the problem takes the maximum of the two durations T_replace and T_recover.
The model
I modeled the process as a Markov Decision Process in which each sticker quantity > 1 during production is a decision epoch, at which you have to choose what to do if a problem occurs while the next sticker is being placed:
a_1: throw away the roll;
a_2: wait until the machine has recovered from the problem and continue production with the stickers remaining on the roll.
The states are then given by a variable for the machine status, M, and a variable for the number of stickers on the roll, B.
M can be: In production, Undergoing roll replacement, Facing a problem, Facing a problem while undergoing roll replacement
B can be: 1,…,50
P_recover is the probability of recovering from a problem in one time unit
P_replace is the probability of replacing the roll in one time unit
P_both is the probability of performing the replacement and recovering from a problem in one time unit
C_downtime>C_sticker
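One possible way to connect the mean durations T_replace and T_recover from the problem description to the one-step probabilities above is sketched below. It assumes (this is not stated above) that both durations are independent and geometrically distributed on 1, 2, ..., so that a mean duration T corresponds to a one-step probability 1/T. The function names are made up for illustration. Note that the maximum of two geometric durations is not itself geometric, so choosing P_both as one over the mean of the maximum is only an approximation that matches the mean.

```python
def one_step_probability(mean_duration):
    """Success probability of a geometric duration with the given mean."""
    if mean_duration < 1:
        raise ValueError("a duration of at least one time unit is assumed")
    return 1.0 / mean_duration


def mean_of_max(p, q):
    """Mean of max(X, Y) for independent geometric X, Y with parameters p, q.

    Uses E[max] = E[X] + E[Y] - E[min], where min(X, Y) is geometric
    with parameter 1 - (1 - p)(1 - q) = p + q - p*q.
    """
    return 1.0 / p + 1.0 / q - 1.0 / (p + q - p * q)


def p_both_from_means(t_replace, t_recover):
    """Mean-matching approximation of P_both (an assumption, see lead-in)."""
    p_replace = one_step_probability(t_replace)
    p_recover = one_step_probability(t_recover)
    return 1.0 / mean_of_max(p_replace, p_recover)


if __name__ == "__main__":
    # Hypothetical example values for T_replace and T_recover.
    print("P_replace:", one_step_probability(5.0))
    print("P_recover:", one_step_probability(4.0))
    print("P_both   :", p_both_from_means(5.0, 4.0))
```

With this choice P_both is never larger than P_replace or P_recover, which matches the idea that doing both takes at least as long as doing either one alone.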
Note that it is already known that a_1 is optimal when the machine faces a problem while placing the last sticker: there are no stickers left on the roll, so replacing the roll is always optimal. Furthermore, let's choose the parameters such that for some B > 1 it is optimal to throw away the remaining stickers on the roll.
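To experiment with parameter values, the model can be sketched numerically. The code below is only a rough sketch under several assumptions that are not part of the description above: costs are discounted with a factor gamma (instead of, say, the long-run average cost), a problem always wastes the sticker in process, production itself has no cost, and all numerical parameter values are made up. It runs value iteration and then lists the sticker quantities at which having one more sticker on the roll is not at least as cheap.

```python
def solve_sticker_mdp(n=4, p_problem=0.1, p_recover=0.3, p_replace=0.2,
                      p_both=0.15, c_down=10.0, c_sticker=1.0,
                      gamma=0.95, tol=1e-10):
    """Value iteration for a discounted-cost version of the sticker-roll MDP.

    Returns (v_prod, policy): v_prod[b] is the expected discounted cost when
    in production with b stickers on the roll (index 0 is unused), and
    policy[b] is the action taken if a problem occurs on the next sticker.
    """
    v_prod = [0.0] * (n + 1)  # in production, b stickers on the roll
    v_prob = [0.0] * (n + 1)  # facing a problem, b stickers on the roll
    v_repl = 0.0              # undergoing roll replacement
    v_both = 0.0              # facing a problem while replacing the roll
    while True:
        new_repl = c_down + gamma * (p_replace * v_prod[n]
                                     + (1 - p_replace) * v_repl)
        new_both = c_down + gamma * (p_both * v_prod[n]
                                     + (1 - p_both) * v_both)
        new_prod = [0.0] * (n + 1)
        new_prob = [0.0] * (n + 1)
        policy = {}
        for b in range(1, n + 1):
            left = b - 1  # the sticker in process is used up either way
            success = v_prod[left] if left > 0 else v_repl
            if left == 0:
                # Nothing left on the roll: replacing is the only option.
                failure, policy[b] = v_both, "a_1"
            else:
                throw = c_sticker * left + v_both  # a_1: discard the roll
                wait = v_prob[left]                # a_2: wait for recovery
                if throw <= wait:
                    failure, policy[b] = throw, "a_1"
                else:
                    failure, policy[b] = wait, "a_2"
            new_prod[b] = gamma * ((1 - p_problem) * success
                                   + p_problem * failure)
            new_prob[b] = c_down + gamma * (p_recover * v_prod[b]
                                            + (1 - p_recover) * v_prob[b])
        delta = max(abs(new_repl - v_repl), abs(new_both - v_both),
                    max(abs(x - y) for x, y in zip(new_prod, v_prod)),
                    max(abs(x - y) for x, y in zip(new_prob, v_prob)))
        v_prod, v_prob, v_repl, v_both = new_prod, new_prob, new_repl, new_both
        if delta < tol:
            return v_prod, policy


def monotonicity_violations(v_prod):
    """Return every b for which b + 1 stickers is costlier than b stickers."""
    return [b for b in range(1, len(v_prod) - 1)
            if v_prod[b + 1] > v_prod[b] + 1e-9]


if __name__ == "__main__":
    v_prod, policy = solve_sticker_mdp()
    print("V(production, b):", [round(v, 3) for v in v_prod[1:]])
    print("action on a problem:", policy)
    print("monotonicity violations:", monotonicity_violations(v_prod))
```

An empty list of violations for some parameter choices is of course no proof, but a non-empty list would be a counterexample for this discounted variant of the model.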
Question
My question is whether it is possible to prove that the long-term expected costs are lower in a state with more stickers than in a state with fewer stickers. Intuitively this seems correct, since having fewer stickers on the roll means you will have to replace it sooner. However, I have not been able to find a formal proof or a counterexample. I drew a smaller version of the model with B = {1,…,4} (figure: model with 4 stickers). If I described something unclearly, please let me know and I will update the description. Thanks to everyone who gives it a shot!