We first have a $D$-dimensional weight vector $w$ and a $D$-dimensional predictor vector $x$, both indexed by $j$. There are $N$ observations, each $D$-dimensional. $t$ is our vector of targets, i.e., the ground-truth values. We then derive the cost function as follows:
We then compute the partial derivative of $\varepsilon$ with respect to $w_j$:
I'm confused as to where the $j'$ is coming from, and what it would represent.
Then, we vectorize it as:
I'm confused about the derivation of the vectorized form $A$ from $A_{jj'}$, likely because I don't know what $j'$ is. How would the vectorization go, in terms of steps and intuition?
Here is the link to the handout.
EDIT: I do want to know where the $j'$ factors in.


I think the calculation is much nicer if we make good use of vector notation. Let's combine the numbers $w_1, \ldots, w_D, b$ into a vector
$$ w = \begin{bmatrix} b \\ w_1 \\ \vdots \\ w_D \end{bmatrix} $$
and let's define
$$ \hat x_i = \begin{bmatrix} 1 \\ x_1^{(i)} \\ \vdots \\ x_D^{(i)} \end{bmatrix}. $$
(So $\hat x_i$ is an "augmented" feature vector.) Our goal is to minimize
$$ \mathcal{E}(w) = \frac{1}{2N} \sum_{i=1}^N \| \hat x_i^T w - t^{(i)} \|^2. $$
We can do this by setting the gradient of $\mathcal{E}$ equal to $0$. By the chain rule,
$$ \mathcal{E}'(w) = \frac{1}{N} \sum_{i=1}^N (\hat x_i^T w - t^{(i)} ) \hat x_i^T. $$
If we use the convention that the gradient is a column vector, then
$$ \nabla \mathcal{E}(w) = \mathcal{E}'(w)^T = \frac{1}{N} \sum_{i=1}^N (\hat x_i^T w - t^{(i)} ) \hat x_i. $$
This is the formula you wanted to derive, but in vector form.

As for $j'$: in the component-wise derivation it is just a dummy summation index over the features. The prediction for observation $i$ is $\sum_{j'} w_{j'} x_{j'}^{(i)}$, and a second index distinct from $j$ is needed because you differentiate with respect to one particular weight $w_j$ while the sum inside the prediction runs over *all* weights. If, as is standard, the handout defines $A_{jj'} = \frac{1}{N} \sum_{i=1}^N x_j^{(i)} x_{j'}^{(i)}$, then collecting those double-indexed entries into a matrix gives $A = \frac{1}{N} \sum_{i=1}^N \hat x_i \hat x_i^T$, with $j$ indexing rows and $j'$ indexing columns.
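To build intuition, here is a minimal NumPy sketch (the data and variable names are my own, not from the handout) checking that the vectorized gradient agrees with the component-wise double sum over $i$ and $j'$:

```python
import numpy as np

# Hypothetical small example: N = 5 observations, D = 3 features.
rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.standard_normal((N, D))
t = rng.standard_normal(N)

# Augment each feature vector with a leading 1 to absorb the bias b.
X_hat = np.hstack([np.ones((N, 1)), X])   # shape (N, D+1); row i is x_hat_i^T
w = rng.standard_normal(D + 1)            # w = [b, w_1, ..., w_D]

# Vectorized gradient: (1/N) * sum_i (x_hat_i^T w - t_i) x_hat_i
grad_vec = X_hat.T @ (X_hat @ w - t) / N

# Component-wise gradient: dE/dw_j = (1/N) sum_i (sum_{j'} w_{j'} x_{j'}^{(i)} - t_i) x_j^{(i)}
grad_comp = np.zeros(D + 1)
for j in range(D + 1):
    for i in range(N):
        # Inner sum over the dummy index j' (all weights), distinct from j.
        pred_i = sum(w[jp] * X_hat[i, jp] for jp in range(D + 1))
        grad_comp[j] += (pred_i - t[i]) * X_hat[i, j] / N

assert np.allclose(grad_vec, grad_comp)
```

The two loops mirror the indices in the handout's formula: $i$ over observations and $j'$ over features, with the outer $j$ selecting which weight we differentiate with respect to; the vectorized line collapses both sums into one matrix-vector product.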