So for the single valued model we have:
$p(C_1|\textbf{x}) = \frac{p(\textbf{x}|C_1)p(C_1)}{p(\textbf{x}|C_1)p(C_1)+p(\textbf{x}|C_2)p(C_2)}$
If we rearrange the terms, we can write this as a sigmoid function:
$p(C_1|\textbf{x}) = \sigma(a) = \frac{1}{1+\exp(-a)}$ ...(4.57)
where $a=\ln \frac{p(\textbf{x}|C_1)p(C_1)}{p(\textbf{x}|C_2)p(C_2)}$ ...(4.58)
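Spelling out the rearrangement (this is the step between the two formulas): divide the numerator and denominator of Bayes' theorem by the numerator $p(\textbf{x}|C_1)p(C_1)$:
$$p(C_1|\textbf{x}) = \frac{1}{1+\frac{p(\textbf{x}|C_2)p(C_2)}{p(\textbf{x}|C_1)p(C_1)}} = \frac{1}{1+\exp(-a)},$$
since by 4.58, $\exp(-a) = \frac{p(\textbf{x}|C_2)p(C_2)}{p(\textbf{x}|C_1)p(C_1)}$.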
Then we moved on to the continuous input case, where we assumed the class-conditional densities $p(\textbf{x}|C_k)$ were Gaussian:
$p(\textbf{x}|C_k) = \frac{1}{(2\pi)^{D/2}}\frac{1}{|\Sigma|^{1/2}}\exp\left(-\frac{1}{2}(\textbf{x}-\mu_k)^T\Sigma^{-1}(\textbf{x}-\mu_k)\right)$ ...(4.64)
Then I became confused when the text said that, using 4.57 and 4.58, we have $p(C_1|\textbf{x}) = \sigma(\textbf{w}^T\textbf{x}+w_0)$
where:
$\textbf{w} = \Sigma ^{-1}(\mu_1-\mu_2)$
$w_0=-\frac{1}{2}\mu_1^T\Sigma^{-1}\mu_1+\frac{1}{2}\mu_2^T\Sigma^{-1}\mu_2+\ln\frac{p(C_1)}{p(C_2)}$
Is it saying that if I plug everything into the sigmoid, I will recover $p(C_1|\textbf{x}) = \frac{p(\textbf{x}|C_1)p(C_1)}{p(\textbf{x}|C_1)p(C_1)+p(\textbf{x}|C_2)p(C_2)}$,
where $p(\textbf{x}|C_1)$ and $p(\textbf{x}|C_2)$ are the normal distributions as in 4.64? Why can't we just use Bayes' theorem as it is? Why do we have to create a sigmoid function seemingly out of nowhere?
"Can't we just use the Bayes theorem as it is?"
Yes we can. And yes, Bayes' theorem does indeed reproduce the formula you quoted:
\begin{align} p(C_1|\mathbf x) & = \frac{p(\mathbf x | C_1) p(C_1)}{p(\mathbf x | C_1) p(C_1) + p(\mathbf x|C_2) p(C_2)} \\ & = \frac{p(C_1)\exp\left( - \tfrac 1 2 (\mathbf x - \mu_1)^T\Sigma^{-1}(\mathbf x - \mu_1) \right)}{p(C_1) \exp\left( - \tfrac 1 2 (\mathbf x - \mu_1)^T\Sigma^{-1}(\mathbf x - \mu_1) \right)+ p(C_2) \exp\left( - \tfrac 1 2 (\mathbf x - \mu_2)^T\Sigma^{-1}(\mathbf x - \mu_2) \right)} \\ & = \frac{\exp\left( ( \mu_1 - \mu_2)^T\Sigma^{-1} \mathbf x - \tfrac 1 2 \mu_1^T\Sigma^{-1} \mu_1+ \tfrac 1 2 \mu_2^T\Sigma^{-1} \mu_2 + \ln \tfrac {p(C_1)}{p(C_2)} \right)}{\exp\left( ( \mu_1 - \mu_2)^T\Sigma^{-1} \mathbf x - \tfrac 1 2 \mu_1^T\Sigma^{-1} \mu_1+ \tfrac 1 2 \mu_2^T\Sigma^{-1} \mu_2 + \ln \tfrac {p(C_1)}{p(C_2)} \right) + 1} \\ & = \sigma(\mathbf w^T \mathbf x + w_0) \end{align} [NB in the second line, I omitted the normalization factors $\frac{1}{(2\pi)^{D/2}|\Sigma|^{1/2}}$ in the numerator and denominator, since they cancel out. In the third line, I divided the numerator and denominator by $p(C_2)\exp\left( - \tfrac 1 2 (\mathbf x - \mu_2)^T\Sigma^{-1}(\mathbf x - \mu_2) \right)$ and expanded the quadratic forms; the $\mathbf x^T \Sigma^{-1} \mathbf x$ terms cancel because both classes share the same $\Sigma$.]
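If you want to convince yourself numerically that the two expressions agree, here is a small sketch (the means, covariance, and priors are made-up illustrative values, not from the text) that evaluates the posterior both ways:

```python
import numpy as np

# Hypothetical 2-D example: two Gaussian classes sharing one covariance Sigma.
mu1 = np.array([1.0, 0.0])
mu2 = np.array([-1.0, 1.0])
Sigma = np.array([[2.0, 0.5],
                  [0.5, 1.0]])
p1, p2 = 0.6, 0.4          # priors p(C1), p(C2)

Sinv = np.linalg.inv(Sigma)

def gauss(x, mu):
    """Multivariate normal density (4.64) with shared covariance Sigma."""
    d = x - mu
    norm = 1.0 / ((2 * np.pi) ** (len(x) / 2) * np.sqrt(np.linalg.det(Sigma)))
    return norm * np.exp(-0.5 * d @ Sinv @ d)

def posterior_bayes(x):
    """p(C1|x) straight from Bayes' theorem."""
    n1, n2 = p1 * gauss(x, mu1), p2 * gauss(x, mu2)
    return n1 / (n1 + n2)

# The linear-sigmoid form, with w and w0 defined as above.
w = Sinv @ (mu1 - mu2)
w0 = -0.5 * mu1 @ Sinv @ mu1 + 0.5 * mu2 @ Sinv @ mu2 + np.log(p1 / p2)

def posterior_sigmoid(x):
    return 1.0 / (1.0 + np.exp(-(w @ x + w0)))

x = np.array([0.3, -0.7])
# The two computations agree up to floating-point rounding.
print(posterior_bayes(x), posterior_sigmoid(x))
```

The agreement holds for any choice of means, shared covariance, and priors, which is exactly what the algebra above shows.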
"What is the benefit in writing it in this way?"
Writing the result in this way makes it clear that the decision as to which class $\mathbf x$ is most likely to belong to is given in terms of a linear function of $\mathbf x$:
$$ p(C_1|\mathbf x ) > \tfrac 1 2 \ \iff \ \sigma(\mathbf w^T \mathbf x + w_0) > \tfrac 1 2 \ \iff \ \mathbf w^T \mathbf x + w_0 > 0.$$
In other words, the decision boundary is the hyperplane $\mathbf w^T \mathbf x + w_0 = 0$. (The boundary is linear in $\mathbf x$ precisely because the two classes share the same covariance $\Sigma$, which makes the quadratic terms cancel.)
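The equivalence of the probability threshold and the sign test can be checked directly; a minimal sketch with made-up values for $\mathbf w$ and $w_0$ (not taken from the text):

```python
import math

# Illustrative 2-D weight vector and bias, chosen arbitrarily.
w = [1.5, -0.8]
w0 = 0.2

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def activation(x):
    """a = w^T x + w0, the argument of the sigmoid."""
    return sum(wi * xi for wi, xi in zip(w, x)) + w0

# Because sigma is monotone with sigma(0) = 1/2, the two decision
# criteria agree: sigmoid(a) > 1/2 if and only if a > 0.
x = [0.5, 1.0]
a = activation(x)
assert (sigmoid(a) > 0.5) == (a > 0)
```

This is why one can classify using only the sign of the linear function, without ever evaluating the sigmoid.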