I have this example for a simple binary symmetric channel (BSC), which bounds the mutual information of $X$ and $Y$ as
\begin{align*} I(X;Y) &= H(Y) - H(Y|X)\\ &= H(Y) - \sum p(x) H(Y \mid X = x) \\ &= H(Y) - \sum p(x) H(p) \\ &= H(Y) - H(p) \\ &\leq 1 - H(p) \end{align*}
However, as the title states, I don't really understand why I can write
\begin{align*} \sum p(x) H(Y \mid X = x) = \sum p(x) H(p) \end{align*}
I know that
\begin{align*} \mathbb{P}[Y = 0 \mid X = 0 ] &= 1 - p \\ \mathbb{P}[Y = 1 \mid X = 0 ] &= p \\ \mathbb{P}[Y = 1 \mid X = 1 ] &= 1 - p \\ \mathbb{P}[Y = 0 \mid X = 1 ] &= p \end{align*}
but suppose I set $p = \frac{1}{3}$; would that mean that I have
\begin{align*} I(X;Y) \leq 1- H(p) = 1- H(\frac{1}{3}) \approx 0.4716 \text{ bit} \end{align*}
I ask because if this is the case, why is it not
\begin{align*} I(X;Y) \leq 1- H(1-p) = 1- H(\frac{2}{3}) \approx 0.61 \text{ bit} \end{align*}
instead?
Or, and this would make the most sense to me, it's actually $p = (p_{\text{error}}, 1-p_{\text{error}}) = (\frac{1}{3}, \frac{2}{3})$ and thus we have
\begin{align*} I(X;Y) \leq 1- H(p) = 1- H(\frac{1}{3}, \frac{2}{3}) \approx 0.0817 \text{ bit} \end{align*}
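To make the discrepancy concrete, here is a quick numerical check of the three candidate values (a sketch in Python with base-2 logarithms; the variable names are mine, not standard notation):

```python
from math import log2

p = 1 / 3

# Interpretation 1: read H(p) as the single term -p * log2(p)
h_single = -p * log2(p)                        # ≈ 0.528
# Interpretation 2: the same, but with 1 - p
h_single_comp = -(1 - p) * log2(1 - p)         # ≈ 0.390
# Interpretation 3: the binary entropy H(p, 1-p)
h_binary = h_single + h_single_comp            # ≈ 0.918

print(1 - h_single)       # ≈ 0.4717
print(1 - h_single_comp)  # ≈ 0.6100
print(1 - h_binary)       # ≈ 0.0817
```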
The reason for the validity of the equation
\begin{equation} \sum p(x) H(Y \mid X = x) = \sum p(x) H(p) \end{equation}
can perhaps be better seen if we denote the right-hand side by
\begin{equation} \sum p(x) H_b(p) \end{equation}
where $H_b(\cdot)$ is the binary entropy function (https://en.wikipedia.org/wiki/Binary_entropy_function). To see this, note that the defining property of the BSC is precisely that, independent of what the source symbol $X$ is, an error, i.e., a bit flip, occurs with a fixed probability $p$. In other words:
\begin{equation} \forall x: H(Y \mid X = x) = H(Err \mid X = x) = H_b(p) \end{equation}
where the first equality is due to the fact that for a binary input the entropy of the "error" is equal to the entropy of $Y$, and the second equality follows from the paragraph above. Finally, since the binary entropy function is symmetric, $H_b(p) = H_b(1-p)$, there is no ambiguity between $H(p)$ and $H(1-p)$: your third interpretation is the correct one, giving $I(X;Y) \leq 1 - H_b(\frac{1}{3}) \approx 0.0817$ bit.
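This can be verified numerically (a small sketch in Python, not part of the original post; `Hb` is my name for the binary entropy function). For a uniform input the output is also uniform, so $H(Y) = 1$ bit and the bound $1 - H_b(p)$ is attained with equality:

```python
from math import log2

def Hb(p):
    """Binary entropy H_b(p) = -p log2(p) - (1-p) log2(1-p)."""
    return -p * log2(p) - (1 - p) * log2(1 - p)

p = 1 / 3                              # crossover (bit-flip) probability
px = [0.5, 0.5]                        # uniform input P(X=0), P(X=1)
pyx = [[1 - p, p], [p, 1 - p]]         # BSC transitions P(Y=y | X=x)

# Output distribution: P(Y=y) = sum_x P(X=x) P(Y=y | X=x)
py = [sum(px[x] * pyx[x][y] for x in range(2)) for y in range(2)]

HY = -sum(q * log2(q) for q in py)               # H(Y), here 1 bit
HY_given_X = sum(px[x] * Hb(p) for x in range(2))  # = H_b(p)
I = HY - HY_given_X                              # I(X;Y) = 1 - H_b(p)

print(round(I, 4))                    # ≈ 0.0817 bit
print(abs(Hb(1/3) - Hb(2/3)) < 1e-12)  # symmetry: H_b(p) = H_b(1-p)
```

The last line confirms why the choice between $H(p)$ and $H(1-p)$ does not matter: the two binary entropies coincide.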