The following quote is from Bertsekas's Dynamic Programming and Optimal Control. I'm only looking for a nudge in the right direction as to how to interpret the following equations, particularly equation (1.2) and last equation. The rest is provided for context.
What exactly is going on with the expected value, and why does equation (1.2) differ from the infinite summation (i.e., why does the latter require a more "complex mathematical formulation")? Lastly, can the disturbances $w_k$ be i.i.d. random variables, given that they may depend on the current state and the control?
Consider the stationary discrete-time dynamic system $$x_{k+1} = f(x_k,u_k,w_k), \quad k=0,1,\ldots, \quad (1.1)$$ where for all $k$, the state $x_k$ is an element of a space $S$, the control $u_k$ is an element of a space $C$, and the random disturbance $w_k$ is an element of a space $D$. We assume that $D$ is a countable set. The control $u_k$ is constrained to take values in a given nonempty subset $U(x_k)$ of $C$, which depends on the current state $x_k$ $[u_k \in U(x_k)$ for all $x_k \in S]$. The random disturbances $w_k$, $k=0,1,\ldots,$ are characterized by probability distributions $P(\cdot \mid x_k,u_k)$ that are independent of $k$, where $P(w_k \mid x_k,u_k)$ is the probability of occurrence of $w_k$ when the current state and control are $x_k$ and $u_k$, respectively. Thus the probability of $w_k$ may depend explicitly on $x_k$ and $u_k$, but not on values of prior disturbances $w_{k-1},\ldots,w_0$.
Given an initial state $x_0$, we want to find a policy $\pi = \{\mu_0,\mu_1,\ldots\}$, where $\mu_k : S \mapsto C$, $\mu_k(x_k) \in U(x_k)$ for all $x_k \in S$, $k=0,1,\ldots,$ that minimizes the cost function $$J_\pi(x_0)=\lim_{N\to \infty} \mathop{E}_{\substack{w_k \\ k=0,1,\ldots}}\left\{\sum_{k=0}^{N-1}\alpha^k g(x_k,\mu_k(x_k),w_k)\right\} \quad (1.2)$$ subject to the system equation constraint (1.1). The cost per stage $g: S \times C \times D \mapsto \mathbb{R}$ is given, and $\alpha$ is a positive scalar referred to as the discount factor.
We denote by $\Pi$ the set of all admissible policies $\pi$, i.e., the set of all sequences of functions $\pi = \{\mu_0, \mu_1,...\}$ with $\mu_k : S \mapsto C$, $\mu_k(x) \in U(x)$ for all $x \in S$, $k=0,1,...$ The optimal cost function $J^*$ is defined by
$$J^*(x) = \min_{\pi \in \Pi}J_\pi(x), \quad x \in S$$
An optimal policy, for a given initial state $x$, is one that attains the optimal cost $J^*(x)$. This policy may depend on $x$, but we will generally find that for most problems, an optimal policy, when it exists, may be taken to be stationary, i.e., have the form $\pi =\{\mu,\mu,...\}$, in which case it is referred to as the stationary policy $\mu$. We say that $\mu$ is optimal if $J_\mu(x) = J^*(x)$ for all states $x$.
The cost $J_\pi(x_0)$ given by Eq. (1.2) represents the limit of expected finite horizon costs. These costs are well defined, as discussed in Section 1.5 of Vol. I. Another possibility would be to minimize over $\pi$ the expected infinite horizon cost $$\mathop{E}_{\substack{w_k \\ k=0,1,\ldots}}\left\{\sum_{k=0}^{\infty}\alpha^k g(x_k,\mu_k(x_k),w_k)\right\}.$$ Such a cost would require a far more complex mathematical formulation (a probability measure on the space of all disturbance sequences; see [BeS78]). However, we mention that, under the assumptions that we will be using, the preceding expression is equal to the cost given by Eq. (1.2). This may be proved using the monotone convergence theorem (see Section 3.1) and other stochastic convergence theorems, which allow interchange of limit and expectation under appropriate conditions.
We finally note that while we have restricted the disturbances to take values in a countable set, our model is considerably more general than a model where the system is a controlled Markov chain with a countable number of states. For example our model includes as a special case deterministic problems with arbitrary state and control spaces.
Generally, our goal in reinforcement learning is to take actions that maximize our expected total reward, from now until the end of time. Often the rewarding states will be many moves away, so we want an unlimited time horizon to allow us to plan far in advance. The discount factor $\alpha$ helps make this more computationally manageable, since it gives us a "soft" horizon: we care less and less about things farther and farther in the future.
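To see concretely how the discount gives a "soft" horizon, here's a small sketch with made-up numbers (the $\alpha = 0.9$ and cost bound are my own choices, not from the book): if the per-stage cost satisfies $|g| \le g_{\max}$, then truncating the discounted sum at stage $N$ changes it by at most $\alpha^N g_{\max}/(1-\alpha)$, which shrinks geometrically in $N$.

```python
alpha, g_max = 0.9, 1.0   # hypothetical discount factor and per-stage cost bound

# Worst case: constant cost g_max at every stage. Infinite-horizon value:
exact = g_max / (1 - alpha)

for N in (10, 50, 100):
    truncated = sum(alpha**k * g_max for k in range(N))
    tail_bound = alpha**N * g_max / (1 - alpha)   # geometric tail bound
    # Truncation error is within the bound, and the bound decays geometrically.
    assert abs(exact - truncated) <= tail_bound + 1e-12
    print(f"N={N:3d}  truncated={truncated:.6f}  error <= {tail_bound:.2e}")
```

So in practice, everything beyond some horizon $N$ contributes a vanishingly small amount, which is exactly what makes the infinite-horizon objective computationally tractable.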
The only difference between (1.2) and the last equation is a technical one: (1.2) is the limit of an expectation, while the last equation is the expectation of a limit (an infinite sum). The second is far more complicated, because it requires a probability measure on the infinite-dimensional space of entire disturbance sequences. (1.2) gets around this by taking expectations over successively larger finite-dimensional distributions instead. As the text notes, under the assumptions being used these give the same answer, but (1.2) is mathematically simpler.
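Here's a quick Monte Carlo sanity check of that interchange on a made-up system (the uniform cost distribution and parameters are my own assumptions, not from the book): take nonnegative per-stage costs $g_k = w_k \sim \mathrm{Uniform}[0,1]$. The finite-horizon expectations in (1.2) then increase monotonically in $N$ toward the infinite-horizon value $E[w]/(1-\alpha) = 0.5/0.1 = 5$, just as the monotone convergence theorem predicts for nonnegative costs.

```python
import random

def expected_truncated_cost(N, alpha=0.9, trials=20000, seed=0):
    """Monte Carlo estimate of E[sum_{k<N} alpha^k w_k] for a toy system
    with nonnegative costs w_k ~ Uniform[0,1] (hypothetical example)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        total += sum(alpha**k * rng.random() for k in range(N))
    return total / trials

# Nonnegative costs: the finite-horizon expectations increase in N,
# approaching the infinite-horizon value E[w]/(1-alpha) = 5.
for N in (1, 5, 20, 100):
    print(N, expected_truncated_cost(N))
```

With costs that can change sign, this monotonicity argument no longer applies directly, which is why the text appeals to "other stochastic convergence theorems" (e.g. dominated convergence when $g$ is bounded) for the general case.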
The disturbances will generally not be i.i.d., since they can depend on the state. Although they have no direct dependence on each other, they are still coupled by the state. For example, imagine we bounce deterministically between states 1 and 2 on every timestep, and that the disturbance distribution looks very different in each state. Then if you tell me the disturbance at one time, I can guess what it's going to be at the next time.
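That two-state example is easy to simulate (a hypothetical sketch; the Gaussian state-dependent distributions are my own choice to make the effect obvious). Each $w_k$ is drawn independently given the current state, yet the sequence is clearly not i.i.d.: even-indexed disturbances come from state 1 and odd-indexed ones from state 2, so knowing one disturbance tells you the distribution of the next.

```python
import random

def simulate(T, seed=None):
    """Bounce deterministically between states 1 and 2; the disturbance
    distribution differs by state (toy example, parameters assumed)."""
    rng = random.Random(seed)
    x, ws = 1, []
    for _ in range(T):
        # state 1: disturbances near 0; state 2: disturbances near 10
        w = rng.gauss(0.0, 1.0) if x == 1 else rng.gauss(10.0, 1.0)
        ws.append(w)
        x = 2 if x == 1 else 1   # deterministic bounce between states
    return ws

ws = simulate(1000, seed=0)
# Even-indexed draws came from state 1 (mean ~0), odd-indexed from state 2
# (mean ~10): seeing w_k pins down the state, hence the distribution of
# w_{k+1}. Conditionally independent given the state, but not i.i.d.
evens, odds = ws[0::2], ws[1::2]
print(sum(evens) / len(evens), sum(odds) / len(odds))
```

This is exactly the coupling-through-the-state effect: the $w_k$ are conditionally independent given $(x_k, u_k)$, which is all the theory requires, but marginally they can be strongly dependent on each other.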