I want an optimal strategy for a very long time horizon, say $K=100000$. I have a dynamic decision-making problem where the next state $x_{k+1}$ is drawn from the probability distribution $f(x_{k+1}|u_{k},x_{k})$. I want to find a decision strategy $\pi$ such that $u_k=\pi(x_k)$ maximizes the overall expected reward $$ \mathcal{E}\left[R|\pi\right] = \mathcal{E}\left[\sum_{k=1}^K \gamma^k r(x_k,u_k)\Big|\pi\right], $$ where $r(x_k,u_k)$ is the single-step reward. I know there are many ADP (approximate dynamic programming) and reinforcement learning approaches to this problem. Specifically, I am using fitted Q iteration, since I have no information about the dynamic model. Its convergence is assured for $\gamma<1$.
The problem I face is the following: for $\gamma$ to be realistic, it has to be very close to one, say $\gamma = 0.9999999$. In that case, the iterative algorithms do not converge in any practical number of iterations.
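One way to quantify this: the Q-iteration error bound contracts by a factor $\gamma$ per sweep, $\|Q_n - Q^*\| \le \gamma^n \|Q_0 - Q^*\| \le \gamma^n R_{\max}/(1-\gamma)$, so the number of sweeps needed for a fixed tolerance blows up as $\gamma \to 1$. A small sketch (the tolerance and reward bound are assumptions for illustration):

```python
import math

def iterations_needed(gamma, r_max=1.0, tol=1e-3):
    """Sweeps of (fitted) Q iteration needed to guarantee
    ||Q_n - Q*|| <= tol, via the bound gamma**n * r_max / (1 - gamma)."""
    return math.ceil(math.log(tol * (1 - gamma) / r_max) / math.log(gamma))

for g in (0.9, 0.99, 0.9997, 0.9999999):
    print(g, iterations_needed(g))
```

For $\gamma = 0.9999999$ this guarantee requires on the order of $10^8$ sweeps, which is what "does not converge" looks like in practice.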
The reason for such a high $\gamma$ is the long time horizon. I want to provide decision support on an hourly basis for the next 10 years, i.e. $K=87600$ hours. Taking the interest rate to be $0.1$ per year, i.e. an annual discount factor of $0.9$ spread over $8760$ hourly steps, the per-hour discount factor is $$ \gamma = 0.9^{1/8760} \doteq 0.99999. $$ My question is: how can I cope with such a long horizon and such a large discount factor?
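To make the numbers concrete, here is the per-hour discount factor implied by an annual discount of $0.9$, and the resulting effective horizon $1/(1-\gamma)$ in hours (assuming a non-leap year of $24 \times 365 = 8760$ hours):

```python
import math

hours_per_year = 24 * 365          # 8760
annual_discount = 0.9              # interest rate 0.1 per year
gamma = annual_discount ** (1 / hours_per_year)

print(f"per-hour gamma     = {gamma:.8f}")
# 1/(1 - gamma) is the characteristic number of steps the discounted
# sum still "sees"; here it is comparable to the full 10-year horizon.
print(f"effective horizon ~ {1 / (1 - gamma):,.0f} hours")
```

The effective horizon comes out comparable to $K = 87600$ itself, which is consistent with the intuition that discounting barely truncates the problem at all.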