Matrix formulation is straightforward:
$\mathbf{y} = \mathbf{X} \boldsymbol{\hat{\beta}} +\boldsymbol{\hat{\varepsilon}}$
cost function: $E = {\boldsymbol{\hat{\varepsilon}}}^T{\boldsymbol{\hat{\varepsilon}}} = {(\mathbf{y} - \mathbf{X}\boldsymbol{\hat{\beta}})}^T(\mathbf{y} - \mathbf{X}\boldsymbol{\hat{\beta}})$
... differentiating with respect to $\boldsymbol{\hat{\beta}}$ and searching for an extremum:
$\frac{\partial E}{\partial \boldsymbol{\hat{\beta}}} = 2 \mathbf{X}^T\mathbf{X} \boldsymbol{\hat{\beta}} - 2 \mathbf{X}^T \mathbf{y} = 0$
thus the OLS estimate of $\boldsymbol{\hat{\beta}}$ is: $\boldsymbol{\hat{\beta}} = (\mathbf{X}^T\mathbf{X})^{-1} \mathbf{X}^T \mathbf{y}$
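The closed-form estimate above is easy to check numerically. A minimal sketch with NumPy, using made-up data (the design matrix, coefficients, and noise level are all arbitrary choices for illustration): solve the normal equations directly and compare against NumPy's least-squares routine.

```python
import numpy as np

# Hypothetical small data set to illustrate the closed-form OLS estimate.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))            # design matrix: 50 observations, 3 regressors
beta_true = np.array([1.5, -2.0, 0.5])  # arbitrary "true" coefficients
y = X @ beta_true + 0.1 * rng.normal(size=50)

# Normal-equation solution: beta_hat = (X^T X)^{-1} X^T y
# (np.linalg.solve is preferred over forming the inverse explicitly)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)

# Cross-check against NumPy's built-in least-squares solver
beta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(np.allclose(beta_hat, beta_lstsq))  # the two solutions agree
```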
So there must be some limitation on the previous relation (e.g. $(\mathbf{X}^T\mathbf{X})^{-1}$ has to exist, which requires $\mathbf{X}$ to have full column rank) ... am I right?
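One can see this limitation numerically: a toy design matrix with a duplicated (linearly dependent) column makes $\mathbf{X}^T\mathbf{X}$ rank-deficient, so the inverse does not exist. The matrix below is an arbitrary example chosen for illustration:

```python
import numpy as np

# If X has linearly dependent columns, X^T X is singular and the
# closed-form estimate breaks down. Toy example with a duplicated column:
X = np.array([[1., 2., 2.],
              [1., 3., 3.],
              [1., 4., 4.],
              [1., 5., 5.]])        # third column duplicates the second

XtX = X.T @ X
print(np.linalg.matrix_rank(XtX))  # 2 < 3, so XtX is not invertible

try:
    np.linalg.inv(XtX)
except np.linalg.LinAlgError:
    print("X^T X is singular")
```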
If I try to do the same in component notation, I run into a problem with this formula (I'll come back to this later).
In component formalism (using Einstein's summation convention):
$E = (X_{ij} \beta_j - y_i)^2 = (X_{ij}\beta_j)^2 - 2 X_{ij}\beta_j y_i + y_i^2$
$\frac{\partial E}{\partial \beta_j} = 2X_{ij} \beta_j X_{ij} - 2X_{ij} y_i = 0$
$X_{ij} \beta_j X_{ij} = X_{ij} y_i$
Now every term is just a scalar, so it's tempting to cancel $X_{ij}$ on both sides. However, this leads only to the trivial relation $y_i = X_{ij} \beta_j$.
Can someone help me clarify this, please? Is it somehow connected to the use of only lower indices? Do I need to distinguish lower and upper indices (tensors and duals)?
Thank you!
The error in your component derivation: When you differentiate wrt $\beta_j$, the index $j$ now has two roles: one as the generic index in the sum, and one as the index specifying which $\beta$ you are differentiating with respect to. The partial derivative wrt $\beta_j$ should treat the other $\beta$'s as constant, but your notation can no longer distinguish them!
Better to use a new index, say $k$, to perform the differentiation. When you do this, you find that the partial derivative will be $$\frac{\partial E}{\partial\beta_k}=2X_{ij}\beta_jX_{ik} - 2X_{ik}y_i.\tag1$$ There's still summing going on (involving $i$ and $j$, with $k$ held constant) when you set (1) to zero, so it doesn't make sense to factor out $X_{ij}$. Convert back into matrix notation and you'll get $$X^TX\beta=X^Ty.\tag2$$ Specifically, $\sum_i\sum_jX_{ij}\beta_jX_{ik}$ is the $k$th member of the vector $X^TX\beta$, while $\sum_iX_{ik}y_i$ is the $k$th member of $X^Ty$.
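The identification of the sums in (1) with the matrix products in (2) can be verified numerically. A small sketch with NumPy on arbitrary made-up data, using `np.einsum` to write the component sums with the indices made explicit:

```python
import numpy as np

# Check that sum_i sum_j X_ij * beta_j * X_ik is the k-th entry of X^T X beta,
# and that sum_i X_ik * y_i is the k-th entry of X^T y.
rng = np.random.default_rng(1)
X = rng.normal(size=(6, 3))   # arbitrary 6x3 design matrix
beta = rng.normal(size=3)
y = rng.normal(size=6)

# Component form: repeated indices i, j are summed; k is free.
lhs = np.einsum('ij,j,ik->k', X, beta, X)
rhs = X.T @ X @ beta          # matrix form
print(np.allclose(lhs, rhs))  # True

# Right-hand side of the normal equations, component vs matrix form.
print(np.allclose(np.einsum('ik,i->k', X, y), X.T @ y))  # True
```

Note that `einsum` makes the distinction between the summed indices ($i$, $j$) and the free index $k$ explicit, which is exactly the distinction lost in the original derivation.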