I was reading about linear regression and mean squared error in machine learning, and I came across this explanation:
Suppose that we have a design matrix of $m$ example inputs that we will not use for training, only for evaluating how well the model performs. We also have a vector of regression targets providing the correct value of $y$ for each of these examples. Because this dataset will only be used for evaluation, we call it the test set. We refer to the design matrix of inputs as $\mathbf{X}^{\text{(test)}}$ and the vector of regression targets as $\mathbf{y}^{\text{(test)}}$.
One way of measuring the performance of the model is to compute the mean squared error of the model on the test set. If $\hat{\mathbf{y}}^{\text{(test)}}$ gives the predictions of the model on the test set, then the mean squared error is given by
$$\text{MSE}_{\text{test}} = \dfrac{1}{m} \sum_{i} (\hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}})_i^2.$$
Intuitively, one can see that this error measure decreases to $0$ when $\hat{\mathbf{y}}^{\text{(test)}} = \mathbf{y}^{\text{(test)}}$. We can also see that
$$\text{MSE}_{\text{test}} = \dfrac{1}{m} \vert\vert \hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}} \vert\vert_2^2,$$
so the error increases whenever the Euclidean distance between the predictions and the targets increases.
I have two (related) areas of confusion here.
What is the $i$ iterating over in the sum?
For the latter equation, we have the $2$-norm (the Euclidean norm). But, unless I'm misunderstanding the notation here, we don't necessarily have that $\text{MSE}_{\text{test}} = \dfrac{1}{m} \sum_{i} (\hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}})_i^2 = \dfrac{1}{m} \vert\vert \hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}} \vert\vert_2^2$ for $i = 2$, right? Again, I think I might be confused about the notation here (specifically, for the first equation), so that might be where my confusion comes from. Can someone please clarify this?
Thank you.
$\mathbf{y}^{\text{(test)}}$ and $\hat{\mathbf{y}}^{\text{(test)}}$ are vectors of length $m$, and thus so is their difference. The index $i$ runs from $1$ to $m$, iterating over the entries of the vector $\hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}}$.
As for the second equation: the subscript $2$ in $\vert\vert \cdot \vert\vert_2$ is part of the norm notation (it labels the $2$-norm, i.e. the Euclidean norm), not a value of the index $i$. The Euclidean norm of the vector $\hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}}$ is defined as $\sqrt{\sum_{i} (\hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}})_i^2}$. Squaring the norm cancels the square root, so $\text{MSE}_{\text{test}} = \dfrac{1}{m} \sum_{i} (\hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}})_i^2 = \dfrac{1}{m} \vert\vert \hat{\mathbf{y}}^{\text{(test)}} - \mathbf{y}^{\text{(test)}} \vert\vert_2^2$ does indeed hold.
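To make this concrete, here is a minimal NumPy sketch (with made-up test-set values, so the numbers themselves are just for illustration) showing that summing the squared entries and squaring the $2$-norm give the same MSE:

```python
import numpy as np

m = 4
y_test = np.array([1.0, 2.0, 3.0, 4.0])  # hypothetical regression targets
y_hat = np.array([1.1, 1.9, 3.2, 3.8])   # hypothetical model predictions

err = y_hat - y_test  # the vector whose entries i runs over

# MSE via the explicit sum over entries i = 1, ..., m
mse_sum = (err ** 2).sum() / m

# MSE via the squared Euclidean (2-) norm
mse_norm = np.linalg.norm(err, ord=2) ** 2 / m

print(mse_sum, mse_norm)  # both give the same value
```

The two computations agree up to floating-point rounding, which is exactly the identity in the second equation.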