Consider a state $x_t$ that probabilistically evolves over time according to a controlled Markov chain, i.e., according to known probabilities
$$\mathbb P(x_{t+1}=x' \,|\, x_t=x,a_t=a)$$
where $a_t$ is the control action at time $t$. To each state $x$, a known reward $r(x)$ is assigned. Actions have no cost. Our objective is to choose the actions so as to maximize the expected reward at a given final time $T$, i.e. $\mathbb E[r(x_T)]$. Rewards before the final time do not matter.
It seems that this problem can naturally be addressed by dynamic programming. I've defined the final value function as follows
\begin{align} V_T(x_T) = r(x_T) \end{align}
and value functions for each $t<T$ as
\begin{align} V_t(x) &= \max_{a\in A}\mathbb{E} \Big( V_{t+1}(x_{t+1}) \mid x_t=x,a_t=a \Big)\\ &= \max_{a\in A}\sum_{x'} \mathbb P(x_{t+1}=x' \,|\, x_t=x,a_t=a)\, V_{t+1}(x') \end{align}
I am confused about how to actually implement this (I'm trying to write pseudocode). In particular, I don't know how to initialize the recursion: many dynamic programming treatments set the value function at the final time to zero, but that doesn't seem to make sense here.
Question: How do I implement the dynamic programming equations above?
My ideas so far: Since the reward function is known, perhaps we can initialize $V_T(x)$ to $r(x)$ for every possible state $x$? Then, applying the recursion for $V_t(x)$, $t<T$, we can step backward all the way to $V_0(x)$. Doing this for every state $x$ and recording the maximizing action at each step would give a table of optimal actions $a$ for any state $x$ and time $t$ we may encounter. Is this correct?
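To make my idea concrete, here is a sketch of the backward recursion in Python. The transition probabilities, rewards, and horizon here are made up purely for illustration; I store the kernel as a 3-D array `P[a, x, x']`:

```python
import numpy as np

# Hypothetical example: 2 states, 2 actions, horizon T = 3.
# P[a, x, x'] = P(x_{t+1} = x' | x_t = x, a_t = a); each row over x' sums to 1.
P = np.array([
    [[0.9, 0.1],    # action 0, from state 0
     [0.5, 0.5]],   # action 0, from state 1
    [[0.2, 0.8],    # action 1, from state 0
     [0.1, 0.9]],   # action 1, from state 1
])
r = np.array([0.0, 1.0])   # terminal reward r(x)
T = 3

n_actions, n_states, _ = P.shape
V = r.copy()                          # initialize V_T(x) = r(x)
policy = np.zeros((T, n_states), dtype=int)

for t in range(T - 1, -1, -1):        # step back from t = T-1 down to t = 0
    # Q[a, x] = sum over x' of P(x'|x,a) * V_{t+1}(x')
    Q = P @ V                         # shape (n_actions, n_states)
    policy[t] = Q.argmax(axis=0)      # record the maximizing action per state
    V = Q.max(axis=0)                 # V_t(x)

print(V)        # V_0(x): optimal expected terminal reward from each state
print(policy)   # policy[t, x]: optimal action at time t in state x
```

The loop computes the expectation in the recursion as a matrix-vector product per action, then maximizes over actions; `policy` is exactly the lookup table of optimizers I described above.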