Questions about normal equations (Deep Learning Book equation 5.12)


I am reading "Deep Learning" (Goodfellow et al.), and I am trying to understand the derivation of the normal equations, starting from the MSE minimization.

I have been trying to understand this derivation for far too long, and I think I am missing one element.

The predicted output is defined as a basic linear function:

$$ \hat{y} = w^Tx \tag{5.3}$$ with $w \in\mathbb{R}^n$ a vector of parameters and $x \in\mathbb{R}^n$ the input vector.

The Mean Squared Error is defined as:

$$MSE_{test}=\frac{1}{m}\sum_{i}\left(\hat{y}^{(test)}-y^{(test)}\right)^2_i\tag{5.4}$$

The sum is then rewritten as a squared Euclidean ($L^2$) norm (the subscript $2$ denotes the norm, not an index):

$$MSE_{train}=\frac{1}{m}\left \|(\hat{y}^{(train)}-y^{(train)})\right \|^2_2 \tag{5.5}$$

The goal here is to find the optimal weights by minimizing the Mean Squared Error. To do so, we solve for the point where the gradient with respect to $w$ is zero:

$$\nabla_{w}\,MSE_{train}=0 \tag{5.6}$$

$$\nabla_{w}\,\frac{1}{m}\left\|\hat{y}^{(train)}-y^{(train)}\right\|_2^2=0 \tag{5.7}$$

$$\frac{1}{m}\nabla_{w}\left\|X^{(train)}w-y^{(train)}\right\|_2^2=0 \tag{5.8}$$

$$\nabla_{w}\left(X^{(train)}w-y^{(train)}\right)^{T}\left(X^{(train)}w-y^{(train)}\right)=0 \tag{5.9}$$

$$\nabla_{w}\left(w^{T}X^{(train)T}X^{(train)}w-2w^{T}X^{(train)T}y^{(train)}+y^{(train)T}y^{(train)}\right)=0 \tag{5.10}$$

$$2X^{(train)T}X^{(train)}w-2X^{(train)T}y^{(train)}=0 \tag{5.11}$$

$$w=\left(X^{(train)T}X^{(train)}\right)^{-1}X^{(train)T}y^{(train)} \tag{5.12}$$
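As a sanity check, equation $(5.12)$ can be verified numerically. Below is a minimal NumPy sketch with made-up data (the sizes `m`, `n` and the random seed are arbitrary choices), comparing the closed-form solution against NumPy's least-squares solver:

```python
import numpy as np

rng = np.random.default_rng(0)
m, n = 50, 3                  # m training examples, n features
X = rng.normal(size=(m, n))   # plays the role of X^(train)
y = rng.normal(size=m)        # plays the role of y^(train)

# Closed-form normal equations, equation (5.12): w = (X^T X)^{-1} X^T y
w_closed = np.linalg.inv(X.T @ X) @ X.T @ y

# Reference solution: lstsq minimizes ||Xw - y||_2^2 directly
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

print(np.allclose(w_closed, w_lstsq))  # expected: True
```

In practice one would solve the linear system (e.g. `np.linalg.solve(X.T @ X, X.T @ y)`) or use `lstsq` directly rather than explicitly forming the inverse, but the explicit form mirrors equation $(5.12)$.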

My questions might be simple, but it has been a while since I last did matrix operations, and I still need some practice:

  1. In equation $(5.3)$, $\hat{y}=w^Tx$; however, from equation $(5.7) \rightarrow (5.8)$, the following equality applies: $\hat{y}^{(train)}=X^{(train)}w$. Why isn't it $\hat{y}^{(train)}=w^{T}X^{(train)}$ instead?

  2. $(5.9) \rightarrow (5.10)$ When I develop it, I obtain the following equation: $$ \bigtriangledown _{w}({w^{T}}{X^{(train)T}}{X^{(train)}}w-{w^{T}}{X^{(train)T}}{y^{(train)}}-{y^{(train)T}}{X^{(train)}}w+{y^{(train)T}}{y^{(train)})}=0 $$

Matching terms, I see that I have the first and last terms, but I don't have the $-2w^{T}X^{(train)T}y^{(train)}$ term. The closest I can get is:

$$-w^{T}X^{(train)T}y^{(train)}-y^{(train)T}X^{(train)}w=-w^{T}X^{(train)T}y^{(train)}-\left(w^{T}X^{(train)T}y^{(train)}\right)^T$$

I know I am almost there, but I am missing the final property that would help me close the deal.

If anyone with more experience than me could teach me this piece of knowledge, I would gladly learn.

Best regards,

Valentin

Answer:

  1. $\hat{y}=w^Tx=x^Tw$ when $x$ is a vector in $\mathbb{R}^n$.

$\hat{y}^{(train)}\in \mathbb{R}^m$ is a column vector where each entry corresponds to a training instance.

$X^{(train)} \in \mathbb{R}^{m \times n}$ and $w \in \mathbb{R}^{n}$, hence $X^{(train)}w \in \mathbb{R}^m$, where each row corresponds to a training instance.

Since $w^T \in \mathbb{R}^{1 \times n}$ and $X^{(train)} \in \mathbb{R}^{m \times n}$, we can't compute $w^TX^{(train)}$: there is a dimension mismatch in the matrix multiplication ($1 \times n$ times $m \times n$).
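A quick shape check in NumPy (with hypothetical sizes) makes the dimension argument concrete:

```python
import numpy as np

m, n = 50, 3
X = np.zeros((m, n))    # X^(train) in R^{m x n}
w = np.zeros(n)         # w in R^n

print((X @ w).shape)    # (50,): one prediction per training instance
# w @ X would multiply shapes (3,) and (50, 3): a dimension mismatch,
# so NumPy raises ValueError -- mirroring why w^T X^(train) is undefined.
```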

  2. $w^TX^{(train)T}y^{(train)}$ is a scalar, hence it is equal to its own transpose; that is, it is equal to $y^{(train)T}X^{(train)}w$. The two middle terms in your expansion are therefore equal, and they combine as $-w^{T}X^{(train)T}y^{(train)}-y^{(train)T}X^{(train)}w=-2w^{T}X^{(train)T}y^{(train)}$, which recovers $(5.10)$.
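As a sanity check of that scalar-transpose property, here is a tiny NumPy sketch with made-up data of compatible shapes:

```python
import numpy as np

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 3))   # stand-in for X^(train)
y = rng.normal(size=50)        # stand-in for y^(train)
w = rng.normal(size=3)

# w^T X^T y and y^T X w are the same scalar, so the two cross terms
# in the expansion combine into -2 w^T X^T y.
print(np.allclose(w @ X.T @ y, y @ X @ w))  # expected: True
```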