Any help is appreciated. I would like to know whether I missed something, or whether this is correct.
- $m$ is the number of training examples
- $L$ is the Loss-Function
- $\hat{\mathbf{y}}^{(i)}$ is the output vector for training example $i$
- $\mathbf{y}^{(i)}$ is the target vector for training example $i$
- I used the factor $\frac{1}{2}$ to simplify the derivative
$$ E = \frac{1}{2m} \sum_{i=1}^{m} L^{(i)} = \frac{1}{2m} \sum_{i=1}^{m} \frac{1}{2} \| \hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)} \|^2 $$
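To make the "simplify the derivative" point explicit (this step is my own addition, using the notation above): differentiating a single term with respect to $\hat{\mathbf{y}}^{(i)}$, the $2$ from the power rule cancels the inner $\frac{1}{2}$,

$$ \frac{\partial L^{(i)}}{\partial \hat{\mathbf{y}}^{(i)}} = \frac{\partial}{\partial \hat{\mathbf{y}}^{(i)}} \left( \frac{1}{2} \| \hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)} \|^2 \right) = \hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)} $$

which is the usual motivation for including the factor.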
For the purpose of simplifying the derivative, it suffices to let
$$E = \frac{1}{2m}\sum_{i=1}^m \|\hat{\mathbf{y}}^{(i)} - \mathbf{y}^{(i)}\|^2$$
though for the purpose of minimization the two definitions are equivalent, since they differ only by a positive scalar factor and therefore have the same minimizer.
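As a quick numerical sanity check of the scalar-factor claim, the sketch below (my own illustration, using NumPy and arbitrary random data) evaluates both definitions and their gradients on the same vectors; the two losses and the two gradients differ by exactly the factor $2$, so both gradients vanish at the same point.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 5, 3  # m training examples, d-dimensional output vectors
y_hat = rng.normal(size=(m, d))  # predictions, one row per example
y = rng.normal(size=(m, d))      # targets

# E1: the question's definition, with the extra 1/2 inside the sum
E1 = (1.0 / (2 * m)) * np.sum(0.5 * np.sum((y_hat - y) ** 2, axis=1))
# E2: the simplified definition, without the inner 1/2
E2 = (1.0 / (2 * m)) * np.sum(np.sum((y_hat - y) ** 2, axis=1))

# The losses differ by exactly the positive scalar 2 ...
assert np.isclose(2 * E1, E2)

# ... and so do their gradients with respect to y_hat,
# so both are zero precisely when y_hat == y.
g1 = (1.0 / (2 * m)) * (y_hat - y)  # gradient of E1
g2 = (1.0 / m) * (y_hat - y)        # gradient of E2
assert np.allclose(2 * g1, g2)
```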