I was reading *Pattern Recognition and Machine Learning* by Christopher Bishop, and in Section 4.1.3 (page 186), on the failure of least-squares classification, I stumbled on this phrase:

"The failure of least squares should not surprise us when we recall that it corresponds to maximum likelihood under the assumption of a Gaussian conditional distribution."

However, I cannot understand this. What does least squares have to do with a conditional distribution? Why are we talking about a conditional distribution at all, and how does it relate to a Gaussian? I would be grateful for any help.

Suppose the relationship between the feature vectors $\mathbf x_i$ and the target variables $y_i$ is modelled as
$$y_i = f(\mathbf x_i) + \epsilon_i,$$
where the function $f$ represents the "true model", and the noise terms $\epsilon_i \sim \mathcal N(0, \sigma^2)$ are independent Gaussian.
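This is where the conditional distribution comes in: the noise model says that, *given* $\mathbf x_i$, the target $y_i$ is Gaussian-distributed around $f(\mathbf x_i)$,
$$ p(y_i \mid \mathbf x_i) = \mathcal N\!\left(y_i \mid f(\mathbf x_i), \sigma^2\right) = \frac{1}{\sqrt{2\pi\sigma^2}} \exp\!\left( -\frac{(y_i - f(\mathbf x_i))^2}{2\sigma^2} \right), $$
and by independence the likelihood of the whole dataset is the product $\prod_{i=1}^N p(y_i \mid \mathbf x_i)$. Taking the logarithm turns this product into a sum.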
Then the log likelihood for the dataset is $$ \log P(y_1, \dots, y_N | \mathbf x_1 , \dots, \mathbf x_N) = - \frac{1}{2\sigma^2} \sum_{i=1}^N (y_i - f(\mathbf x_i))^2 - \frac{N}{2} \log (2\pi \sigma^2).$$
Treating $\sigma^2$ as a constant, and ignoring the constant term, we see that this log-likelihood is, up to the negative factor $-\frac{1}{2\sigma^2}$, exactly the least-squares loss function,
$$ L(y_1, \dots, y_N | \mathbf x_1, \dots, \mathbf x_N) = \sum_{i=1}^N (y_i - f(\mathbf x_i))^2.$$
So maximising the log-likelihood (under the assumption that the noise is Gaussian) is equivalent to minimising the least-squares loss function.
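You can check this equivalence numerically. The sketch below (a toy example, not from the book) fits a linear model two ways: once by ordinary least squares, and once by gradient descent on the Gaussian negative log-likelihood. The two solutions coincide.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic regression data: y = 1 + 2x + Gaussian noise
X = np.column_stack([np.ones(50), rng.normal(size=50)])
w_true = np.array([1.0, 2.0])
y = X @ w_true + rng.normal(scale=0.5, size=50)

# 1) Least-squares solution: minimises sum_i (y_i - w.x_i)^2
w_ls, *_ = np.linalg.lstsq(X, y, rcond=None)

# 2) Gaussian maximum likelihood: minimise the negative log-likelihood
#    (1/2σ²) Σ_i (y_i - w.x_i)²  + const   by gradient descent
sigma2 = 0.25  # treated as a known constant, as in the derivation above
w_ml = np.zeros(2)
for _ in range(5000):
    grad = -(X.T @ (y - X @ w_ml)) / sigma2  # gradient of the NLL in w
    w_ml -= 1e-3 * grad

print(np.allclose(w_ls, w_ml, atol=1e-4))  # same minimiser
```

Note that $\sigma^2$ only rescales the objective, so its value does not change the location of the minimum, only the step size a fixed learning rate corresponds to.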
The point that Bishop is making here is that, for classification problems, this Gaussian noise model is not very sensible. For one thing, $y_i$ should always be $0$ or $1$ for classification! But the Gaussian noise model can give you fractional values for $y_i$, and even negative values or values greater than one!
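A small illustration of that failure (my own toy data, in the spirit of Bishop's Figure 4.4): treat binary labels as regression targets, fit by least squares, and look at the fitted values. Points far from the decision boundary drag the line so that some "predictions" fall outside $[0, 1]$, which makes no sense as class probabilities.

```python
import numpy as np

rng = np.random.default_rng(1)

# Binary data: class 0 near x=0, class 1 near x=4, plus two far-out
# class-1 points (the kind of "too correct" outliers Bishop discusses)
x = np.concatenate([rng.normal(0, 1, 20), rng.normal(4, 1, 20), [12.0, 14.0]])
y = np.concatenate([np.zeros(20), np.ones(22)])

# Least-squares fit of y on x, as if this were a regression problem
X = np.column_stack([np.ones_like(x), x])
w, *_ = np.linalg.lstsq(X, y, rcond=None)
pred = X @ w

# The fitted values are not confined to {0, 1}, nor even to [0, 1]
print(pred.max())  # greater than 1 at the outlying points
```

This is exactly why classification uses models like logistic regression, whose output is squashed into $(0, 1)$ and whose likelihood is Bernoulli rather than Gaussian.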