Deriving the vectorized form of linear regression


We have a $D$-dimensional weight vector $w$ and $D$-dimensional predictor vectors $x$, both indexed by $j$. There are $N$ observations, each $D$-dimensional, and $t$ holds the targets, i.e., the ground-truth values. We then derive the cost function as follows:

$$\mathcal{E} = \frac{1}{2N}\sum_{i=1}^N \left(y^{(i)} - t^{(i)}\right)^2, \qquad y^{(i)} = \sum_{j=1}^D w_j x_j^{(i)} + b.$$

We then compute the partial derivative of $\mathcal{E}$ with respect to $w_j$:

$$\frac{\partial \mathcal{E}}{\partial w_j} = \frac{1}{N}\sum_{i=1}^N \left(\sum_{j'=1}^D w_{j'} x_{j'}^{(i)} + b - t^{(i)}\right) x_j^{(i)}.$$

I'm confused as to where the $j'$ is coming from, and what it would represent.

We then write it as:

$$\frac{\partial \mathcal{E}}{\partial w_j} = \sum_{j'=1}^D A_{jj'} w_{j'} - c_j, \qquad A_{jj'} = \frac{1}{N}\sum_{i=1}^N x_j^{(i)} x_{j'}^{(i)}, \quad c_j = \frac{1}{N}\sum_{i=1}^N x_j^{(i)}\left(t^{(i)} - b\right).$$

Then, we vectorize it as:

$$\nabla_w \mathcal{E} = A w - c, \qquad A = \frac{1}{N} X^T X, \quad c = \frac{1}{N} X^T \left(t - b\mathbf{1}\right).$$

I'm confused about how $A$ is vectorized from $A_{jj'}$, likely because I don't know what $j'$ represents. How would the vectorization go, in terms of steps and intuition?
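To see the vectorization numerically, here is a small NumPy sketch (with hypothetical random data) that builds $A$ both ways: once entry by entry from the definition $A_{jj'} = \frac{1}{N}\sum_i x_j^{(i)} x_{j'}^{(i)}$, and once as $\frac{1}{N}X^TX$. The names and data are illustrative, not from the handout.

```python
import numpy as np

# Hypothetical small design matrix: N = 5 observations, D = 3 features.
rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.standard_normal((N, D))

# Elementwise definition: A[j, j'] = (1/N) * sum_i x_j^(i) * x_j'^(i).
A_elementwise = np.zeros((D, D))
for j in range(D):
    for jp in range(D):  # jp plays the role of j'
        A_elementwise[j, jp] = np.mean(X[:, j] * X[:, jp])

# Vectorized form: A = (1/N) X^T X.
A_vectorized = (X.T @ X) / N

print(np.allclose(A_elementwise, A_vectorized))  # True
```

The double loop over $(j, j')$ is exactly what the matrix product $X^TX$ computes in one step, which is the whole point of the vectorized form.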

Here is the link to the handout.

EDIT: I do want to know where the $j'$ factors in.



BEST ANSWER

I think the calculation is much nicer if we make good use of vector notation. Let's combine the numbers $w_1, \ldots, w_D, b$ into a vector $$ w = \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_D \end{bmatrix} $$ and let's define $$ \hat x_i = \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix}. $$ (So $\hat x_i$ is an "augmented" feature vector.) Our goal is to minimize $$ \mathcal{E}(w) = \frac{1}{2N} \sum_{i=1}^N \| \hat x_i^T w - t^{(i)} \|^2. $$ We can do this by setting the gradient of $\mathcal{E}$ equal to $0$. By the chain rule, $$ \mathcal{E}'(w) = \frac{1}{N} \sum_{i=1}^N (\hat x_i^T w - t^{(i)} ) \hat x_i^T. $$ If we use the convention that the gradient is a column vector, then $$ \nabla \mathcal{E}(w) = \mathcal{E}'(w)^T = \frac{1}{N} \sum_{i=1}^N (\hat x_i^T w - t^{(i)} ) \hat x_i. $$ This is the formula you wanted to derive, but in vector form.
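As a numerical sanity check of the gradient formula above, the following sketch (with assumed synthetic data) compares the sum-over-observations expression $\frac{1}{N}\sum_i (\hat x_i^T w - t^{(i)})\hat x_i$ with the equivalent matrix form $\frac{1}{N}\hat X^T(\hat X w - t)$, where $\hat X$ stacks the augmented feature vectors as rows.

```python
import numpy as np

# Assumed synthetic data: N observations, D raw features.
rng = np.random.default_rng(1)
N, D = 50, 4
X = rng.standard_normal((N, D))
t = rng.standard_normal(N)

# Augmented feature vectors: prepend a 1 so the bias b rides along in w.
X_hat = np.column_stack([np.ones(N), X])   # shape (N, D + 1)
w = rng.standard_normal(D + 1)             # w = [b, w_1, ..., w_D]

# Gradient as the sum in the answer: (1/N) * sum_i (x_hat_i . w - t_i) * x_hat_i
grad_sum = sum((X_hat[i] @ w - t[i]) * X_hat[i] for i in range(N)) / N

# The same gradient in fully matrix form: (1/N) * X_hat^T (X_hat w - t).
grad_matrix = X_hat.T @ (X_hat @ w - t) / N

print(np.allclose(grad_sum, grad_matrix))  # True
```

Setting this gradient to zero gives the normal equations $\hat X^T \hat X \, w = \hat X^T t$, the matrix counterpart of the componentwise derivation in the question.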

ANSWER

After deriving the formulas myself, it turns out that $j'$ is just a second summation index over the same range as $j$: when you expand the prediction inside the derivative with respect to $w_j$, the inner sum needs a fresh index so it doesn't clash with the fixed $j$. And $A_{jj'}$ is the mean, over all $N$ observations, of the product of the $j$-th and $j'$-th predictors. As a matrix this is $\frac{1}{N}X^TX$, since that product sums every pairing $x_j^{(i)} x_{j'}^{(i)}$ across observations. I will be upvoting littleO's answer since it is a good derivation of the optimal weights using matrices.
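Once $A$ and $c$ are in matrix form, the optimal weights come from solving $Aw = c$. A quick sketch on toy data (the data and variable names here are illustrative assumptions), cross-checked against NumPy's least-squares solver:

```python
import numpy as np

# Toy data (assumed), with a bias column absorbed into X.
rng = np.random.default_rng(2)
N, D = 100, 3
X = np.column_stack([np.ones(N), rng.standard_normal((N, D - 1))])
t = rng.standard_normal(N)

# A = (1/N) X^T X and c = (1/N) X^T t; the 1/N factors cancel in A w = c.
A = X.T @ X / N
c = X.T @ t / N
w_normal = np.linalg.solve(A, c)

# Cross-check against least squares solved directly on (X, t).
w_lstsq, *_ = np.linalg.lstsq(X, t, rcond=None)

print(np.allclose(w_normal, w_lstsq))  # True
```

In practice `np.linalg.lstsq` (or a QR-based solver) is preferred over explicitly forming $X^TX$, which can be badly conditioned, but solving the normal equations directly matches the derivation here.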