I am having trouble interpreting the joint and conditional probabilities over the observations and states of HMMs in Appendix A of Speech and Language Processing by Jurafsky and Martin. More specifically, the forward trellis is defined as follows on page 6 of the appendix:
The forward algorithm trellis $\alpha_t(j)$ represents the probability of being in state $j$ after seeing the first $t$ observations.
Formally,
$\alpha_t(j) = P(o_1, \ldots, o_t, q_t = j \mid \lambda)$
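To make the joint reading concrete, here is a minimal sketch of how I understand the forward recursion. The toy parameters `pi`, `A`, `B` and the observation sequence `obs` are made up for illustration and are not from the book, and I am assuming an HMM with an initial distribution rather than explicit start and end states. Summing the last column of the trellis over states should give $P(O \mid \lambda)$, which is below 1 because the observations are part of the joint event.

```python
def forward(pi, A, B, obs):
    """Forward trellis for a discrete HMM.

    With 0-based t, alpha[t][j] = P(obs[0..t], state at t = j | model).
    pi: initial state probabilities, A: transitions, B: emissions.
    """
    n_states = len(pi)
    # Initialisation: start in state j and emit the first observation.
    alpha = [[pi[j] * B[j][obs[0]] for j in range(n_states)]]
    # Recursion: sum over all predecessor states, then emit obs[t].
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n_states)) * B[j][obs[t]]
            for j in range(n_states)
        ])
    return alpha


# Hypothetical two-state, two-symbol model (values chosen arbitrarily).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1]

alpha = forward(pi, A, B, obs)
# Marginalising the final column over states gives P(O | model).
likelihood = sum(alpha[-1])
print(alpha)
print(likelihood)
```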
On the other hand, on page 13, $\xi_t(i, j)$ is defined as:
... the probability of being in state $i$ at time $t$ and state $j$ at time $t+1$.
Also formally,
$\xi_t(i, j) = P(q_t = i, q_{t+1} = j \mid O, \lambda)$
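For comparison, here is a minimal sketch of how I understand this quantity to be computed from the forward and backward probabilities, using the standard form $\alpha_t(i)\, a_{ij}\, b_j(o_{t+1})\, \beta_{t+1}(j) / P(O \mid \lambda)$. The toy model and the function names are my own assumptions, not taken from the book. The numerator is a joint probability of the two states and the whole observation sequence; the division by $P(O \mid \lambda)$ is what makes it conditional on $O$, so the values at each time step should sum to 1.

```python
def forward(pi, A, B, obs):
    """alpha[t][j] = P(obs[0..t], state at t = j | model), 0-based t."""
    n = len(pi)
    alpha = [[pi[j] * B[j][obs[0]] for j in range(n)]]
    for t in range(1, len(obs)):
        prev = alpha[-1]
        alpha.append([
            sum(prev[i] * A[i][j] for i in range(n)) * B[j][obs[t]]
            for j in range(n)
        ])
    return alpha


def backward(A, B, obs):
    """beta[t][i] = P(obs[t+1..T-1] | state at t = i, model), 0-based t."""
    n, T = len(A), len(obs)
    beta = [[1.0] * n for _ in range(T)]
    for t in range(T - 2, -1, -1):
        beta[t] = [
            sum(A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j] for j in range(n))
            for i in range(n)
        ]
    return beta


def xi(pi, A, B, obs):
    """xi[t][i][j] = P(state at t = i, state at t+1 = j | obs, model)."""
    alpha = forward(pi, A, B, obs)
    beta = backward(A, B, obs)
    likelihood = sum(alpha[-1])  # P(O | model)
    n = len(pi)
    return [
        [
            [
                # Joint of (i at t, j at t+1, all observations), then
                # normalised by P(O | model) to condition on O.
                alpha[t][i] * A[i][j] * B[j][obs[t + 1]] * beta[t + 1][j]
                / likelihood
                for j in range(n)
            ]
            for i in range(n)
        ]
        for t in range(len(obs) - 1)
    ]


# Hypothetical two-state, two-symbol model (values chosen arbitrarily).
pi = [0.6, 0.4]
A = [[0.7, 0.3], [0.4, 0.6]]
B = [[0.5, 0.5], [0.1, 0.9]]
obs = [0, 1, 1]

xi_vals = xi(pi, A, B, obs)
for t, table in enumerate(xi_vals):
    # Each table is a distribution over state pairs (i, j) at step t.
    print(t, table, sum(sum(row) for row in table))
```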
I understand why, in the case of the forward trellis, the observations are part of the joint probability. But why are the states conditioned on the observations when computing $\xi_t(i, j)$?