I would like to understand the least squares method better, but it remains unclear to me where these equations come from.
For a linear approximation $y=ax+b$:
$a=\frac{m\sum_{i=1}^{m} x_iy_i-(\sum_{i=1}^{m}x_i)(\sum_{i=1}^{m}y_i)} {m\sum_{i=1}^{m}x_i^2 - (\sum_{i=1}^{m}x_i)^2}$
and
$b = \frac{(\sum_{i=1}^{m}x_i^2)(\sum_{i=1}^{m}y_i)-(\sum_{i=1}^{m}x_iy_i)(\sum_{i=1}^{m}x_i)}{m\sum_{i=1}^{m}x_i^2 - (\sum_{i=1}^{m}x_i)^2}$
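These closed-form expressions can be checked numerically against a library fit; here is a minimal sketch with NumPy (the sample data is made up):

```python
import numpy as np

# Made-up sample data (any x values that are not all equal work)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 1.9, 3.2, 3.8, 5.1])
m = len(x)

# Closed-form least-squares slope a and intercept b
denom = m * np.sum(x**2) - np.sum(x)**2
a = (m * np.sum(x * y) - np.sum(x) * np.sum(y)) / denom
b = (np.sum(x**2) * np.sum(y) - np.sum(x * y) * np.sum(x)) / denom

# Compare with NumPy's degree-1 polynomial fit
a_ref, b_ref = np.polyfit(x, y, 1)
print(np.isclose(a, a_ref), np.isclose(b, b_ref))  # prints: True True
```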
What I have tried so far:
In the least squares method, the quadratic error should be minimized.
This means
$E(a,b)=\sum_{i=1}^{m}[y_i-(ax_i+b)]^2$
should be minimal.
For a minimum to exist, the partial derivatives of this expression must be zero.
$\frac{\partial{E}}{\partial{a}}$ = $2 \sum_{i=1}^{m}[y_i-(ax_i+b)]\hspace{2pt}(-x_i)=0$
$\frac{\partial{E}}{\partial{b}}$ = $2 \sum_{i=1}^{m}[y_i-(ax_i+b)]\hspace{2pt}(-1)=0$
Rearranging these yields the following:
$a\sum_{i=1}^{m}x_i^2 + b\sum_{i=1}^{m}x_i = \sum_{i=1}^{m}x_iy_i$
$a\sum_{i=1}^{m}x_i+bm = \sum_{i=1}^{m}y_i$
This is where I'm stuck. I see the terms $x_iy_i$, etc. that also appear in the expressions in question, but I don't know how to rearrange to get there.
You are doing everything right, but for this kind of derivation it is easier to use matrix calculus (link). The rules are quite similar to the usual differentiation rules for $\mathbb R\to\mathbb R$ functions. This is the one we are going to use: $$d\langle A\textbf{x}, \textbf{x}\rangle=\langle (A + A^T)\textbf{x}, d\textbf{x}\rangle $$ Let me briefly show how: $$f(\textbf{x}) = \|A\textbf{x}-\textbf{b}\|_2^2 \to \min$$ To apply the rule above we need a different representation of the function: $$f(\textbf{x}) = \langle A\textbf{x} - \textbf{b}, A\textbf{x} - \textbf{b} \rangle = \langle A\textbf{x}, A\textbf{x} \rangle - 2\langle A\textbf{x}, \textbf{b} \rangle + \langle \textbf{b}, \textbf{b} \rangle= $$ $$= \langle A^TA\textbf{x}, \textbf{x} \rangle - 2\langle A^T\textbf{b}, \textbf{x}\rangle + \langle \textbf{b}, \textbf{b} \rangle$$ Now it is easy to apply the rule: $$df(\textbf{x}) = 2 \langle A^TA \textbf{x}, d\textbf{x} \rangle - 2\langle A^T\textbf{b}, d\textbf{x}\rangle \Rightarrow \nabla f = 2(A^TA\textbf{x} - A^T\textbf{b})$$ To find the extremum we set the gradient to zero: $$A^TA\textbf{x} = A^T\textbf{b} \Rightarrow \textbf{x} = (A^TA)^{-1}A^T\textbf{b}$$ We get the same result as the least squares method. But the proof is not finished: we still need to compute the Hessian and prove that it is positive definite (here the Hessian is $2A^TA$, which is positive definite whenever $A$ has full column rank).
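Numerically, the normal-equations solution $\textbf{x} = (A^TA)^{-1}A^T\textbf{b}$ agrees with a library least-squares routine; a small sketch with NumPy (the data is made up, and in practice one solves the system rather than forming the inverse explicitly):

```python
import numpy as np

# Made-up overdetermined system A x ≈ b
rng = np.random.default_rng(0)
A = rng.normal(size=(10, 2))
b = rng.normal(size=10)

# Normal equations: A^T A x = A^T b
# (np.linalg.solve is preferred over an explicit inverse for stability)
x_normal = np.linalg.solve(A.T @ A, A.T @ b)

# Reference: NumPy's least-squares solver
x_ref, *_ = np.linalg.lstsq(A, b, rcond=None)
print(np.allclose(x_normal, x_ref))  # prints: True
```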
UPD: For your example: $$A = \begin{bmatrix} x_1 & 1\\ x_2 & 1 \\ \vdots & \vdots\\ x_m & 1 \end{bmatrix}, \qquad \textbf{b} = \begin{bmatrix} y_1\\ y_2\\ \vdots\\ y_m \end{bmatrix}$$ Then: $$ A^TA = \begin{bmatrix} x_1 & x_2 & \dots & x_m\\ 1 & 1 & \dots & 1 \end{bmatrix} \begin{bmatrix} x_1 & 1\\ x_2 & 1 \\ \vdots & \vdots\\ x_m & 1 \end{bmatrix}= \begin{bmatrix} \sum_{k=1}^m x_k^2 & \sum_{k=1}^m x_k\\ \sum_{k=1}^m x_k & m \end{bmatrix} $$ Now it is easy to get expressions for $a$ and $b$.
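To make that last step explicit (a small completion in the same notation): note that the bottom-right entry of $A^TA$ is $\sum_{k=1}^m 1 = m$, and with $\textbf{b}=(y_1,\dots,y_m)^T$ we have $$A^T\textbf{b} = \begin{bmatrix} \sum_{k=1}^m x_ky_k\\ \sum_{k=1}^m y_k \end{bmatrix}, \qquad (A^TA)^{-1} = \frac{1}{m\sum_{k=1}^m x_k^2 - \left(\sum_{k=1}^m x_k\right)^2} \begin{bmatrix} m & -\sum_{k=1}^m x_k\\ -\sum_{k=1}^m x_k & \sum_{k=1}^m x_k^2 \end{bmatrix}$$ so that $$\begin{bmatrix} a\\ b \end{bmatrix} = (A^TA)^{-1}A^T\textbf{b} = \frac{1}{m\sum_{k=1}^m x_k^2 - \left(\sum_{k=1}^m x_k\right)^2} \begin{bmatrix} m\sum_{k=1}^m x_ky_k - \left(\sum_{k=1}^m x_k\right)\left(\sum_{k=1}^m y_k\right)\\ \left(\sum_{k=1}^m x_k^2\right)\left(\sum_{k=1}^m y_k\right) - \left(\sum_{k=1}^m x_ky_k\right)\left(\sum_{k=1}^m x_k\right) \end{bmatrix}$$ which is exactly the pair of formulas for $a$ and $b$ you started from.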