Each week, we can either do a hunt or skip it. Starting at week $1$, the shard available from that week's hunt is yellow, then blue, then red, repeating in that order (so, e.g., week $4$ is the second opportunity to obtain a yellow shard).
If we do a hunt, there is a base probability of $0.2$ that the shard is "tau-forged." Each time a hunt fails to produce a tau-forged shard, this probability increases by $0.2$ (up to a maximum of $1$), and it resets to the base $0.2$ as soon as a tau-forged shard is obtained.
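To make the pity mechanic concrete: if we hunted every week, the distribution of the number of hunts until the next tau-forged shard follows directly from the probability sequence (`probs` below just restates it):

```python
# Probability that the first tau-forged shard arrives on hunt k (k = 1..5):
# P(fail on hunts 1..k-1) * P(succeed on hunt k).
probs = [0.2, 0.4, 0.6, 0.8, 1.0]
first_success = []
fail_so_far = 1.0
for p in probs:
    first_success.append(fail_so_far * p)
    fail_so_far *= 1 - p

total = sum(first_success)  # 1.0: a tau-forged shard is guaranteed within 5 hunts
expected_hunts = sum((k + 1) * q for k, q in enumerate(first_success))  # 2.5104
```

So a shard is tau-forged after about $2.51$ hunts on average, and never more than $5$.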
I'd like to determine a policy, as a function of the current available shard and tau-forged probability, that maximises the number of red tau-forged shards obtained over an infinite horizon.
It is straightforward to write a simulation to estimate the expected number of red tau-forged shards under a given policy, but given the number of potential policies, it seems more feasible to model this as a Markov decision process.
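For reference, this is roughly the kind of simulation I have in mind (a sketch; it assumes skipping a week leaves the tau-forged probability unchanged, and it indexes the probability by a "pity level" $0..4$ rather than by the float values):

```python
import random

COLOURS = ["yellow", "blue", "red"]
PROBS = [0.2, 0.4, 0.6, 0.8, 1.0]

def simulate(policy, weeks=100_000, seed=0):
    """Estimate red tau-forged shards per week under a fixed policy.

    policy maps (colour, pity level) -> 0 (skip) or 1 (hunt).
    """
    rng = random.Random(seed)
    level = 0   # current tau-forged probability is PROBS[level]
    reds = 0
    for week in range(weeks):
        colour = COLOURS[week % 3]
        if policy[(colour, level)]:
            if rng.random() < PROBS[level]:   # tau-forged!
                if colour == "red":
                    reds += 1
                level = 0                     # reset on success
            else:
                level = min(level + 1, 4)     # pity increment on failure
    return reds / weeks

# Example policy: hunt only when the available shard is red.
only_red = {(c, l): int(c == "red") for c in COLOURS for l in range(5)}
rate = simulate(only_red)   # roughly 0.13 red tau-forged shards per week
```

The obstacle is the size of the policy space: there are $2^{15}$ deterministic stationary policies, which is why the MDP formulation below seems more attractive than brute-force simulation of each one.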
The state space should be $\{\text{yellow}, \text{blue}, \text{red}\}\times\{0.2,0.4,0.6,0.8,1\}$ and the action space $\{0,1\}$ (skip or hunt). The reward would be $1$ exactly when a red tau-forged shard is obtained, i.e. on the transition from a $(\text{red}, p)$ state under action $1$ into $(\text{yellow}, 0.2)$, and $0$ otherwise. Note that the reward cannot be defined by the destination state alone: skipping at $(\text{red}, 0.2)$ also transitions into $(\text{yellow}, 0.2)$ but earns nothing, so the reward must depend on the full $(s, a, s')$ triple (equivalently, assign the state-action pair $((\text{red}, p), 1)$ an expected reward of $p$).
From here, given a discount factor $0<\gamma<1$, we should be able to use an algorithm such as value iteration to determine an optimal policy. It's been some time since I've studied MDPs, though, so I wanted to confirm this is a suitable (and computationally tractable) way of approaching the problem before proceeding to complete and solve the model.
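For what it's worth on tractability: with only $3 \times 5 = 15$ states and $2$ actions, value iteration converges in a few thousand sweeps at $\gamma = 0.99$. A sketch of the model and solver (again assuming skipping leaves the probability unchanged; colours are indexed $0$ = yellow, $1$ = blue, $2$ = red, and the probability by pity level $0..4$):

```python
import numpy as np

PROBS = np.array([0.2, 0.4, 0.6, 0.8, 1.0])
N_COLOURS, N_LEVELS = 3, 5            # colours: 0=yellow, 1=blue, 2=red
S, A = N_COLOURS * N_LEVELS, 2        # 15 states; actions: 0=skip, 1=hunt

def idx(colour, level):
    return colour * N_LEVELS + level

P = np.zeros((A, S, S))               # P[a, s, s'] transition probabilities
R = np.zeros((A, S))                  # expected immediate reward for (s, a)
for c in range(N_COLOURS):
    nc = (c + 1) % N_COLOURS          # the colour always advances
    for l in range(N_LEVELS):
        s, p = idx(c, l), PROBS[l]
        P[0, s, idx(nc, l)] = 1.0     # skip: pity level unchanged (assumption)
        P[1, s, idx(nc, 0)] += p                   # hunt, success: level resets
        P[1, s, idx(nc, min(l + 1, 4))] += 1 - p   # hunt, failure: level rises
        if c == 2:                    # red shard on offer
            R[1, s] = p               # expected reward = P(tau-forged)

def value_iteration(P, R, gamma=0.99, tol=1e-10):
    """Plain value iteration; returns optimal values and a greedy policy."""
    V = np.zeros(P.shape[1])
    while True:
        Q = R + gamma * (P @ V)       # Q[a, s], batched over actions
        V_new = Q.max(axis=0)
        if np.abs(V_new - V).max() < tol:
            return V_new, Q.argmax(axis=0)
        V = V_new

V, policy = value_iteration(P, R)
```

As a basic sanity check, `policy[idx(2, 4)]` must come out as $1$: with a guaranteed tau-forged red on offer, hunting dominates skipping.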
(And of course, if anything is unclear about the question, feel free to ask and I will clarify to the best of my ability.)