While learning more about the Normal distribution, I came upon the following sentence in the first section of the Normal Distribution article on Wikipedia:
Moreover, many results and methods (such as propagation of uncertainty and least squares parameter fitting) can be derived analytically in explicit form when the relevant variables are normally distributed.
My question is why? Can somebody show me where in the derivation of the least squares fitting formula the assumption of normally-distributed data must be made, and/or why least squares fitting breaks down if the data are not so-distributed? As an example of why I am confused, here is a derivation of least squares fitting (in the linear case), and I cannot figure out where the assumption of being normally distributed comes in.
Same question for propagation of uncertainty.
Thank you!
Define the data set to be a sequence $\{(Y_i, X_i)\}_{i=1}^n$ of pairs of real numbers. Fitting a line to this sequence amounts to finding $\theta = (\alpha, \beta) \in \mathbb R^2$ such that the equation $Y_i = \alpha X_i + \beta$ "captures" the relationship between each $Y_i$ and $X_i$ for each $i$ in some sense. But in what sense?

One way to make this question meaningful is to assume a statistical model. Namely, assume that, for each $i$, $$Y_i = \alpha X_i + \beta + e_i,$$ where each $e_i$ is a random variable with known distribution. In other words, assume that each $X_i$ is a deterministic variable and that $Y_i$ is related to $X_i$ in a linear but "noisy" way.

Under this assumption, the question of finding $\theta$ can be cast as a question of finding the $\theta$ that maximizes the "likelihood" that $\{(Y_i, X_i)\}_{i=1}^n$ was observed given that $\theta$ was the parameter of the above statistical model.
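The statistical model above can be simulated in a few lines. (A sketch only: the parameter values $\alpha = 2$, $\beta = 1$, $\sigma = 0.5$ are illustrative choices, not anything stated in the question.)

```python
import numpy as np

# Illustrative parameter values (assumptions, not from the question).
rng = np.random.default_rng(0)
alpha, beta, sigma = 2.0, 1.0, 0.5

X = np.linspace(0, 10, 50)                 # deterministic X_i
e = rng.normal(0.0, sigma, size=X.shape)   # i.i.d. normal noise e_i
Y = alpha * X + beta + e                   # Y_i = alpha * X_i + beta + e_i
```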
One way to define likelihood is by use of a conditional probability density function (p.d.f.), which we denote by $f(Y_i \; | \; X_i, \theta)$ for each $i$. (I use this notation to emphasize that $X_i$ is deterministic, so it admits no useful notion of probability density. Each $X_i$ will instead play a role in parametrizing the distribution of $Y_i$.) For example, if each $e_i$ is a normal random variable with zero mean and variance $\sigma^2$, and the $e_i$ are all mutually independent, then the $Y_i$ are also independent normal variables with variance $\sigma^2$, but each $Y_i$ has a mean of $\alpha X_i + \beta$.

Hence, we may define the likelihood of the data to be the following joint conditional p.d.f.: \begin{align*} \ell((Y,X); \theta) &= \prod_{i=1}^n f(Y_i \; | \; X_i, \theta) \\ &= \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left(\frac{-1}{2\sigma^2}\left(\sum_{i=1}^n(Y_i - \alpha X_i - \beta)^2\right)\right). \end{align*} Note that the mutual independence of the $e_i$ was essential for writing this formula, for otherwise the joint conditional p.d.f. would not factorize in the first step above.

Clearly, maximizing this likelihood function over $\theta$ is equivalent to minimizing $$\sum_{i=1}^n(Y_i - \alpha X_i - \beta)^2$$ over $\theta$. (Again, keep in mind that $\theta = (\alpha, \beta)$.) This last expression is the squared-error formula used in least-squares fitting; the qualifier "least" indicates that we are minimizing it (or, equivalently, maximizing the above likelihood formula). This equivalence between minimizing squared error and maximizing likelihood could only be shown because the $e_i$ in the statistical model are i.i.d. normally distributed with zero mean and finite variance. Hence, the normality assumption is essential in motivating the use of least-squares fitting from a statistical standpoint.
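The equivalence can be checked numerically: under i.i.d. normal errors the log-likelihood is a constant minus the squared error divided by $2\sigma^2$, so whichever $\theta$ has the smaller squared error necessarily has the larger likelihood. (The data and candidate parameter values below are illustrative assumptions.)

```python
import numpy as np

# Illustrative data from the model Y = 2*X + 1 + normal noise.
rng = np.random.default_rng(1)
X = np.linspace(0, 10, 50)
Y = 2.0 * X + 1.0 + rng.normal(0.0, 0.5, size=X.shape)

def sse(theta):
    """Squared error sum_i (Y_i - alpha*X_i - beta)^2."""
    a, b = theta
    return np.sum((Y - a * X - b) ** 2)

def log_likelihood(theta, sigma=0.5):
    """Gaussian log-likelihood: constant minus sse/(2*sigma^2)."""
    n = len(Y)
    return -n / 2 * np.log(2 * np.pi * sigma**2) - sse(theta) / (2 * sigma**2)

theta_good = (2.0, 1.0)   # near the true parameters
theta_bad = (0.0, 0.0)    # far from the true parameters
```

Since `log_likelihood` is a strictly decreasing function of `sse`, `theta_good` has both the smaller squared error and the larger likelihood.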
However, as mentioned by Pantelis Sopasakis in the comments, it is not essential to the actual computation of $\theta$, which is the focus of the derivations in the link you posted.
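To illustrate that last point: the least-squares computation itself makes no distributional assumption, so it can be carried out on data with decidedly non-normal noise. (A sketch with illustrative values; `np.polyfit` with degree 1 returns the squared-error-minimizing slope and intercept.)

```python
import numpy as np

# Data with uniform (non-normal) noise; least squares still computes fine.
rng = np.random.default_rng(2)
X = np.linspace(0, 10, 200)
e = rng.uniform(-1.0, 1.0, size=X.shape)   # non-normal noise
Y = 2.0 * X + 1.0 + e

# Minimizes sum_i (Y_i - a*X_i - b)^2 regardless of the noise distribution.
a_hat, b_hat = np.polyfit(X, Y, 1)
```

The estimates land close to the true values (2 and 1); it is only the *statistical interpretation* of the result as a maximum-likelihood estimate that leans on normality.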