I'm working through the derivation of the weights solution for multivariate linear regression, but keep getting tripped up when I try to solve the case of multiple targets. Starting with the residual sum of squares, defined as:
$$ RSS = \sum_{n=1}^N (\mathbf{t}_n - \mathbf{x}_n^T\mathbf{w})^2 = \sum_{n=1}^N (\mathbf{t}_n - \mathbf{x}_n^T\mathbf{w})(\mathbf{t}_n - \mathbf{x}_n^T\mathbf{w})^T $$
where the targets at each sample $n$ form the row vector $\mathbf{t}_n = \left[t_{n,1} \: t_{n,2} \: \dotsc \: t_{n,K} \right]$, so that we predict $K$ targets for each sample $\mathbf{x}_n$ in $D$ dimensions, expressed (with a leading bias element) as the column vector:
$$ \mathbf{x}_n = \begin{bmatrix} x_{0,n}=1 \\ x_{1,n} \\ x_{2,n} \\ \vdots \\ x_{D,n} \end{bmatrix} $$
This implies that the weights $\mathbf{w}$ should be a $(D+1) \times K$ matrix of the form:
$$ \mathbf{w} = \begin{bmatrix} w_{0,1} & w_{0,2} & w_{0,3} & \dotsc & w_{0,K} \\ w_{1,1} & w_{1,2} & w_{1,3} & \dotsc & w_{1,K} \\ \vdots \\ w_{D,1} & w_{D,2} & w_{D,3} & \dotsc & w_{D,K} \end{bmatrix} $$
My goal is to get to:
$$ \mathbf{X}^T \mathbf{T} = \mathbf{X}^T \mathbf{X} \mathbf{w} $$
and not the explicit closed-form solution for $\mathbf{w}$, because I don't want to compute any matrix inverses explicitly; instead I'll have Python solve the above equation with a line like `numpy.linalg.lstsq(numpy.dot(X.T, X), numpy.dot(X.T, T))`. The Python is just to provide context.
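To make that solver line concrete, here is a minimal runnable sketch under the setup above ($\mathbf{X}$ is $N \times (D+1)$ with a bias column, $\mathbf{T}$ is $N \times K$); the toy data and names (`W_true`, the noise scale) are illustrative assumptions, not part of the original problem:

```python
import numpy as np

# Toy data: N = 50 samples, D = 3 features (plus bias), K = 2 targets.
rng = np.random.default_rng(0)
N, D, K = 50, 3, 2
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # N x (D+1), bias column first
W_true = rng.normal(size=(D + 1, K))                        # hypothetical ground-truth weights
T = X @ W_true + 0.01 * rng.normal(size=(N, K))             # N x K targets, small noise

# Solve X^T X w = X^T T without forming an explicit inverse.
W, *_ = np.linalg.lstsq(X.T @ X, X.T @ T, rcond=None)
print(W.shape)  # (D+1, K), one weight column per target
```

Incidentally, `numpy.linalg.lstsq(X, T)` solves the same least-squares problem directly and avoids squaring the condition number by forming $\mathbf{X}^T\mathbf{X}$.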
I know the overall trajectory of the derivation is driven by finding the $\mathbf{w}$ that minimizes $RSS$. To minimize $RSS$, start by taking the partial derivative of $RSS$ with respect to $\mathbf{w}$, setting it to $0$, and rearranging:
$$ \begin{align*} \frac{\partial RSS}{\partial \mathbf{w}} &= \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N (\mathbf{t}_n - \mathbf{x}_n^T\mathbf{w})(\mathbf{t}_n - \mathbf{x}_n^T\mathbf{w})^T \\ &= \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N (\mathbf{t}_n - \mathbf{x}_n^T\mathbf{w})(\mathbf{t}_n^T - \mathbf{w}^T\mathbf{x}_n) \\ &= \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N (\mathbf{t}_n \mathbf{t}_n^T - \mathbf{t}_n \mathbf{w}^T \mathbf{x}_n - \mathbf{x}_n^T \mathbf{w} \mathbf{t}_n^T + \mathbf{x}_n^T \mathbf{w} \mathbf{w}^T \mathbf{x}_n) \\ &= \frac{\partial}{\partial \mathbf{w}} \left[ \operatorname{tr}(\mathbf{T} \mathbf{T}^T) - \sum_{n=1}^N \mathbf{t}_n \mathbf{w}^T \mathbf{x}_n - \sum_{n=1}^N \mathbf{x}_n^T \mathbf{w} \mathbf{t}_n^T + \sum_{n=1}^N \mathbf{x}_n^T \mathbf{w} \mathbf{w}^T \mathbf{x}_n \right] \\ &= - \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{t}_n \mathbf{w}^T \mathbf{x}_n - \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{x}_n^T \mathbf{w} \mathbf{t}_n^T + \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{x}_n^T \mathbf{w} \mathbf{w}^T \mathbf{x}_n \\ &= 0 \end{align*} $$
(Note that $\sum_{n=1}^N \mathbf{t}_n \mathbf{t}_n^T$ is a scalar, $\operatorname{tr}(\mathbf{T}\mathbf{T}^T)$ rather than $\mathbf{T}\mathbf{T}^T$, and it drops out of the derivative since it does not depend on $\mathbf{w}$.)
I have 3 specific questions:
- Is the math to this point correct? If not, what needs to be corrected?
- How do you convert the remaining three summation terms into matrices? I can see simple dot products like $\sum_{n=1}^N \mathbf{t}_n \mathbf{t}_n^T = \operatorname{tr}(\mathbf{T} \mathbf{T}^T)$, but the other terms are what I'm having trouble with.
- How do you take the $\frac{\partial}{\partial \mathbf{w}}$ of the three matrix terms obtained from question 2)?
After getting my last equation into matrix form, the rest is pretty easy. I'll start digging into my matrix calculus in the meantime. Thanks.
UPDATE 6/8/2016
The first two summation terms are transposes of each other, but since both are scalars, they are equal. As an exercise, I expanded both terms out and can show that:
$$ \mathbf{t}_n \mathbf{w}^T \mathbf{x}_n = \mathbf{x}_n^T \mathbf{w} \mathbf{t}_n^T = \sum_{k=1}^K t_{n,k} \sum_{d=0}^D w_{d,k} x_{d,n} $$
So the above equation: $$ - \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{t}_n \mathbf{w}^T \mathbf{x}_n - \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{x}_n^T \mathbf{w} \mathbf{t}_n^T + \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{x}_n^T \mathbf{w} \mathbf{w}^T \mathbf{x}_n = 0 $$ can be simplified a little further as: $$ 2 \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{t}_n \mathbf{w}^T \mathbf{x}_n = \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{x}_n^T \mathbf{w} \mathbf{w}^T \mathbf{x}_n $$
This means that if I can show that:
$$ \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{t}_n \mathbf{w}^T \mathbf{x}_n = \mathbf{X}^T \mathbf{T} $$
and that:
$$ \frac{\partial}{\partial \mathbf{w}} \sum_{n=1}^N \mathbf{x}_n^T \mathbf{w} \mathbf{w}^T \mathbf{x}_n = 2 \mathbf{X}^T \mathbf{X} \mathbf{w} $$
then I'm done. It looks like the answer to my first question is "yes", provided these last two expressions can be proved.
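As a sanity check, the combined gradient implied by the two expressions above, $\partial RSS / \partial \mathbf{w} = -2\,\mathbf{X}^T\mathbf{T} + 2\,\mathbf{X}^T\mathbf{X}\mathbf{w}$, can be verified numerically against a finite-difference gradient of $RSS$; this is a sketch with arbitrary toy dimensions:

```python
import numpy as np

rng = np.random.default_rng(1)
N, D, K = 20, 2, 3
X = np.hstack([np.ones((N, 1)), rng.normal(size=(N, D))])  # N x (D+1)
T = rng.normal(size=(N, K))                                # N x K
W = rng.normal(size=(D + 1, K))                            # arbitrary test point

def rss(W):
    """RSS summed over all samples and all K targets."""
    R = T - X @ W
    return np.sum(R * R)

# Closed-form gradient implied by the two identities above.
grad_closed = -2 * X.T @ T + 2 * X.T @ X @ W

# Central finite differences, entry by entry.
eps = 1e-6
grad_fd = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        E = np.zeros_like(W)
        E[i, j] = eps
        grad_fd[i, j] = (rss(W + E) - rss(W - E)) / (2 * eps)

print(np.max(np.abs(grad_closed - grad_fd)))
```

Since $RSS$ is quadratic in $\mathbf{w}$, central differences agree with the closed form up to floating-point roundoff.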
Let us assume $Y = X\beta + \epsilon$. Given data points $(x_{1}, y_{1}), \dotsc, (x_{n}, y_{n})$, we want to minimize
$$ \sum_{i} (y_{i} - x_{i}\beta)^2 = (y - X\beta)^{T}(y - X\beta), \quad y = (y_{1}, \dotsc, y_{n})^{T} $$
Differentiating with respect to $\beta$ gives
$$ -X^{T}(y - X\beta) - (y - X\beta)^{T}X = -2X^{T}(y - X\beta) $$
where we used the fact that the two cross terms are scalars, so $a^{T}b = b^{T}a$. Setting this to zero, we have
$$ X^{T}y = X^{T}X\beta \;\rightarrow\; \widehat{\beta} = (X^{T}X)^{-1}X^{T}y $$

Note that usually each $x_{i}$ is a vector of $p$ inputs (or features, in machine-learning language). Thus in general $X$ is an $n \times p$ matrix, $\beta$ is $p \times 1$, and $X^{T}X$ is a $p \times p$ matrix, so the best guess we have is
$$ \widehat{y} = X(X^{T}X)^{-1}X^{T}y $$
where $H = X(X^{T}X)^{-1}X^{T}$ is often called the hat matrix, as it denotes the projection from the space of $Y$ onto the linear space spanned by the columns of $X$.
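The projection reading of the hat matrix can be checked numerically; a small sketch with toy data (using `np.linalg.solve` in place of an explicit inverse), confirming that $H = X(X^{T}X)^{-1}X^{T}$ is symmetric and idempotent:

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 30, 4
X = rng.normal(size=(n, p))   # n samples, p features (toy data)
y = rng.normal(size=(n, 1))

# Hat matrix H = X (X^T X)^{-1} X^T projects y onto the column space of X.
H = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = H @ y                 # fitted values, i.e. "puts the hat on y"

# Projection properties: H is idempotent and symmetric.
print(np.allclose(H @ H, H), np.allclose(H, H.T))  # True True
```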