We learned in class to use a linear model to predict a real-valued target $y$. We made the assumption that $$ y = w^T x + w_0 + \epsilon $$ where $x$ is the input vector, $w$ is the vector of linear model parameters and $\epsilon$ is a Gaussian-distributed error. The professor showed us that since $y|x$ is a Gaussian random variable, maximizing the likelihood (with respect to $w$) is equivalent to minimizing the MSE, namely minimizing
$$ \frac{1}{N} \sum_{i=1}^N (y_i - (w^Tx_i + w_0 ))^2 $$ for $w$. This is all good.
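(To spell out that step, as I understood it: with $\epsilon \sim \mathcal{N}(0,\sigma^2)$, the conditional density of $y_i$ given $x_i$ is Gaussian with mean $w^T x_i + w_0$, so the log-likelihood of the data is

$$ \log L(w, w_0) = -\frac{N}{2}\log(2\pi\sigma^2) - \frac{1}{2\sigma^2}\sum_{i=1}^N \big(y_i - (w^T x_i + w_0)\big)^2, $$

and maximizing it over $(w, w_0)$ is the same as minimizing the sum of squares.)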
But now he said that minimizing the MSE expression above is also valid for a binary classification problem where $y \in \{-1,1\}$. What is the statistical justification for this? It is clear that $y = w^Tx + w_0 + \epsilon$ no longer holds, and $y|x$ is not a Gaussian random variable.
The validity comes from the close relationship between linear regression and linear discriminant analysis (LDA) in the two-class case. This result is detailed in ESL Section 4.3 and Exercise 4.2, whose solution is available online. To sum up: when there are only two classes and both classes contain the same number of observations, least-squares classification of $y\in\{-1,1\}$ by minimizing the MSE criterion produces the same decision rule as LDA.
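Here is a minimal numerical sketch of that equivalence (my own illustration, not from ESL): for two Gaussian classes with a shared covariance and equal sample sizes, the least-squares coefficient vector on $\pm 1$ targets is proportional to the LDA direction $\hat\Sigma^{-1}(\hat\mu_1 - \hat\mu_0)$, where $\hat\Sigma$ is the pooled within-class covariance, so both give the same separating hyperplane direction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Two Gaussian classes with a shared covariance and equal sample sizes
n = 500
mu0, mu1 = np.array([0.0, 0.0]), np.array([2.0, 1.0])
cov = np.array([[1.0, 0.3], [0.3, 1.0]])
X0 = rng.multivariate_normal(mu0, cov, size=n)
X1 = rng.multivariate_normal(mu1, cov, size=n)
X = np.vstack([X0, X1])
y = np.concatenate([-np.ones(n), np.ones(n)])

# Least-squares fit of y on [1, x], i.e. minimizing the MSE criterion
A = np.column_stack([np.ones(2 * n), X])
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
w_ls = coef[1:]  # drop the intercept, keep the direction

# LDA direction: pooled within-class covariance, then Sigma^{-1}(mu1 - mu0)
Sigma = 0.5 * (np.cov(X0, rowvar=False) + np.cov(X1, rowvar=False))
w_lda = np.linalg.solve(Sigma, X1.mean(axis=0) - X0.mean(axis=0))

# The two vectors should be proportional, i.e. define the same boundary
ratio = w_ls / w_lda
print(ratio)  # the components agree up to floating-point error
```

Note the direction is proportional regardless of class sizes (that is the content of ESL Exercise 4.2); it is the intercept, and hence the full decision rule, that matches LDA only in the balanced case.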