Suppose we have fixed, known matrices $X,Y$, and let $W$ be a weight matrix whose entries are variable. Let $X_i$ denote the $i$th column of the matrix $X$, and similarly for the other matrices. The goal is to find a $W$ which minimizes this function:
$$ E(W) = \sum\limits_{i} ||WX_i - Y_i||^{2} $$
In a class I saw someone say that we can find the optimum $W$ by calculating $\frac{d}{dW}E(W)$. Then, using the fact that \begin{align} ||V||^{2} = V^{T}V \ \ \ \ \ \ \ \ \ \ \ \ \ \ & (1) \end{align} when $V$ is a column vector, he continued:
\begin{align*} \frac{d}{dW}E(W) = 0 & \Rightarrow (WX - Y)X^{T} = 0 \\ & \Rightarrow W = YX^{T}(X^{2})^{-1} \end{align*}
He claimed that the above formula gives the $W$ which minimizes $E(W)$. But I do not know whether that is true, and a more important question is: how does he calculate $\frac{d}{dW}E(W)$ when $W$ is a matrix? Also, can you provide references for further study?
The last line should be $W=YX^T(XX^T)^{-1}$, not $W=YX^T(X^2)^{-1}$. We first discuss the problem by answering your question (the calculus optimization approach, which is of independent interest, if for no other reason than how the symbol $\frac{d}{dW}$ should be interpreted). At the end we provide a linear algebra approach, which makes use of the fact that minimization of squares goes hand in hand with orthogonal projections, and that the space of $n\times m$ matrices has a natural orthogonality structure.
I will note that in this problem $Y_i$ and $WX_i$ are vectors, whereas in ordinary least squares (where problems similar to this frequently arise), $WX_i$ and $Y_i$ are typically single numbers. We can reduce this problem to that case by separately considering $$E(W)=\sum_i \|WX_i-Y_i\|^2=\sum_j\sum_i \bigl((WX_i-Y_i)[j]\bigr)^2,$$ where $(WX_i-Y_i)[j]$ is the entry in the $j^{th}$ row of the vector $WX_i-Y_i$. Note that this splits into distinct optimization problems, one for each $j$: minimizing the contribution from a given $j$ involves the $j^{th}$ row of $W$ and no others, so the optimization problem completely separates. However, I think the problem is more natural as stated, so I will respond without this separation.
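The row-by-row separation can be checked numerically. The sketch below (with illustrative names and random data, not taken from the original post) fits each row of $W$ as an independent ordinary least-squares problem and compares against the full closed-form solution $W=YX^T(XX^T)^{-1}$ discussed in this answer:

```python
import numpy as np

rng = np.random.default_rng(2)
X = rng.standard_normal((3, 8))   # p x m
Y = rng.standard_normal((2, 8))   # n x m

# Full solution, all rows of W at once.
W_full = Y @ X.T @ np.linalg.inv(X @ X.T)

# Row-by-row: the j-th row of W solves the ordinary least-squares
# problem  X^T w ~ (j-th row of Y)^T, independently of the other rows.
W_rows = np.vstack([
    np.linalg.lstsq(X.T, Y[j], rcond=None)[0]
    for j in range(Y.shape[0])
])

assert np.allclose(W_full, W_rows)
```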
Suppose we have $T:\mathbb{R}^{n\times p}\to \mathbb{R}$. Then $T$ is a function of $n\times p$ matrices. We can look at $T(W)=T(w_{ij})$, $1\leqslant i\leqslant n$, $1\leqslant j\leqslant p$. When we write $\frac{d}{dW}T(W)$, we can think of this as the matrix of partial derivatives $$\frac{d}{dW}T(W)=\begin{pmatrix} \frac{\partial T}{\partial w_{1,1}} & \frac{\partial T}{\partial w_{1,2}} & \ldots & \frac{\partial T}{\partial w_{1,p}} \\ \frac{\partial T}{\partial w_{2,1}} & \frac{\partial T}{\partial w_{2,2}} & \ldots & \frac{\partial T}{\partial w_{2,p}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial T}{\partial w_{n,1}} & \frac{\partial T}{\partial w_{n,2}} & \ldots & \frac{\partial T}{\partial w_{n,p}}\end{pmatrix}.$$ When we look at the equation $\frac{d}{dW}T(W)=0$, we are setting it equal to the zero matrix.
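This matrix-of-partials interpretation is easy to sanity-check with finite differences. The sketch below (my own illustrative code, with assumed random dimensions) approximates each entry $\frac{\partial T}{\partial w_{i,j}}$ numerically for $T(W)=E(W)$, and compares against the closed form $2(WX-Y)X^T$ derived later in this answer:

```python
import numpy as np

def grad_matrix(T, W, eps=1e-6):
    """Finite-difference approximation of the matrix of partials dT/dw_ij."""
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            Wp = W.copy(); Wp[i, j] += eps
            Wm = W.copy(); Wm[i, j] -= eps
            G[i, j] = (T(Wp) - T(Wm)) / (2 * eps)
    return G

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 5))   # p x m
Y = rng.standard_normal((2, 5))   # n x m
W = rng.standard_normal((2, 3))   # n x p

E = lambda W: np.sum((W @ X - Y) ** 2)
G = grad_matrix(E, W)
# Agrees with the closed form 2(WX - Y)X^T derived in this answer.
assert np.allclose(G, 2 * (W @ X - Y) @ X.T, atol=1e-5)
```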
For an $n\times 1$ column vector $U$, let $U[j]$ denote the entry on the $j^{th}$ row of $U$. This is defined for $1\leqslant j\leqslant n$.
In order for $WX_i$ to be defined, $W$ must be $n\times p$ for some $n,p$, and $X_i$ must be $p\times 1$. Moreover, for $WX_i-Y_i$ to be defined, $Y_i$ must be $n\times 1$. Since we sum over $i$, I will assume the index $i$ ranges over the columns of $X$, and that $X$ and $Y$ have an equal number of columns. Let $m$ be this common number of columns. So $WX$ is $n\times m$ and $Y$ is $n\times m$. We have \begin{align*} \|WX_i-Y_i\|^2 & =(WX_i-Y_i)^T(WX_i-Y_i)=X_i^TW^TWX_i-X_i^T W^TY_i-Y_i^TWX_i+Y_i^TY_i \\ & = X_i^TW^TWX_i-2Y^T_iWX_i+Y_i^TY_i,\end{align*} using that $X_i^TW^TY_i=(Y_i^TWX_i)^T=Y_i^TWX_i$, both being scalars. Thus $$E(W)=\sum_{i=1}^m X_i^TW^TWX_i-2\sum_{i=1}^m Y^T_iWX_i+\sum_{i=1}^m Y_i^TY_i=\sum_{i=1}^m \|WX_i\|^2-2\sum_{i=1}^m Y^T_iWX_i+\sum_{i=1}^m \|Y_i\|^2.$$
Note that $Y^T_iY_i$ does not depend on $W$, so $\frac{\partial}{\partial w_{j,k}}Y_i^TY_i=0$ for all $j,k$.
Let us define the Hilbert-Schmidt norm of a matrix $M=(m_{i,j})$ by $$\|M\|_{HS}=\Bigl(\sum_{i,j}m_{i,j}^2\Bigr)^{1/2}.$$ Let us also define the matricial inner product $M\cdot N=\text{tr}(M^TN)$, where $M,N$ are matrices of the same dimension (say, both are $a\times b$). Then $M^T$ is $b\times a$ and the trace is $\text{tr}(M^TN) = \sum_{i,j}m_{i,j}n_{i,j}$. Note that $M\cdot M=\|M\|_{HS}^2$. This gives $\mathbb{R}^{a\times b}$ a Hilbert space structure identical to that of $\mathbb{R}^{ab}$.
The reason we introduce these notions is that $$\sum_{i=1}^m \|WX_i\|^2=\|WX\|_{HS}^2=WX\cdot WX,$$ $$\sum_{i=1}^m Y^T_iWX_i=Y\cdot WX,$$ and $$\sum_{i=1}^m \|Y_i\|^2=\|Y\|_{HS}^2=Y\cdot Y.$$
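These identities, together with $\text{tr}(M^TN)=\sum_{i,j}m_{i,j}n_{i,j}$, can be verified on random data. A small illustrative check (variable names are my own, not from the post):

```python
import numpy as np

rng = np.random.default_rng(5)
X = rng.standard_normal((3, 6))   # p x m
Y = rng.standard_normal((2, 6))   # n x m
W = rng.standard_normal((2, 3))   # n x p

HS2 = lambda M: np.trace(M.T @ M)     # squared Hilbert-Schmidt norm M . M
cols = range(X.shape[1])

# tr(M^T N) is the entrywise sum  sum_{i,j} m_ij n_ij.
assert np.isclose(np.trace(Y.T @ (W @ X)), np.sum(Y * (W @ X)))

# The three column-sum identities from the answer.
assert np.isclose(sum(np.sum((W @ X[:, i]) ** 2) for i in cols), HS2(W @ X))
assert np.isclose(sum(Y[:, i] @ (W @ X[:, i]) for i in cols), np.trace(Y.T @ W @ X))
assert np.isclose(sum(np.sum(Y[:, i] ** 2) for i in cols), HS2(Y))
```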
Let's again take $M,N$, whose entries may depend on the $w_{p,q}$. Then $$M\cdot N=\sum_{i,j}m_{i,j}n_{i,j},$$ and, by the product rule, $$\frac{\partial}{\partial w_{r,s}} (M\cdot N)=\sum_{i,j}\frac{\partial m_{i,j}}{\partial w_{r,s}}n_{i,j}+\frac{\partial n_{i,j}}{\partial w_{r,s}}m_{i,j}.\tag{1}$$ Letting $[WX]_{i,j}$ denote the $i,j$ entry of $WX$ and $[Y]_{i,j}$ the $i,j$ entry of $Y$, we have $$[WX]_{i,j}=\sum_{l=1}^p w_{i,l}[X]_{l,j},$$ so $$\frac{\partial}{\partial w_{r,s}}[WX]_{i,j}=[X]_{s,j}$$ if $i=r$ (because of the term $w_{r,s}[X]_{s,j}$ in the sum), and otherwise the partial derivative is $0$ (because $w_{r,s}$ does not occur in $[WX]_{i,j}$ at all).
Since $Y$ does not depend on $W$, $\frac{\partial}{\partial w_{r,s}}[Y]_{i,j}=0$ for all $i,j,r,s$.
By $(1)$, we have \begin{align*} \frac{\partial}{\partial w_{r,s}} (WX\cdot WX)=2\sum_{i,j}[WX]_{i,j}\frac{\partial}{\partial w_{r,s}}[WX]_{i,j}=2\sum_j [WX]_{r,j}[X]_{s,j}=2\,[WX]^r\cdot X^s,\end{align*} where $[WX]^r$, $X^s$ denote the $r^{th}$ and $s^{th}$ rows of $WX,X$, respectively, and $\cdot$ denotes the usual dot product of $1\times m$ row vectors.
Similarly, \begin{align*} \frac{\partial}{\partial w_{r,s}} (WX\cdot Y)=\sum_{i,j}[Y]_{i,j}\frac{\partial}{\partial w_{r,s}}[WX]_{i,j}=\sum_{j}[Y]_{r,j}[X]_{s,j}=Y^r\cdot X^s,\end{align*} where $Y^r$, $X^s$ denote the $r^{th}$ and $s^{th}$ rows of $Y,X$, respectively, and $\cdot$ denotes the usual dot product of $1\times m$ row vectors.
\begin{align*}\frac{\partial}{\partial w_{r,s}}Y\cdot Y=0. \end{align*}
So we obtain \begin{align*} \frac{\partial}{\partial w_{r,s}}E & = \frac{\partial }{\partial w_{r,s}}\Bigl[(WX)\cdot (WX)-2 Y\cdot WX+Y\cdot Y\Bigr] = 2\Bigl[[WX]^r\cdot X^s-Y^r\cdot X^s\Bigr]=2[WX-Y]^r\cdot X^s. \end{align*} So $\frac{d}{dW}E(W)$ is the matrix with these entries. We can check directly from the definition of matrix multiplication that $2\Bigl[[WX]^r\cdot X^s-Y^r\cdot X^s\Bigr]$ is exactly the $r,s$ entry of $2(WX-Y)X^T$, so the (matrix) equation $\frac{d}{dW}E(W)=0$ is exactly equivalent to $2(WX-Y)X^T=0$.
Expanding, $2(WX-Y)X^T=0$ gives $WXX^T=YX^T$; multiplying on the right by $(XX^T)^{-1}$ (assuming it exists, as discussed at the end), we arrive at $W=YX^T(XX^T)^{-1}$ (which is not the same as $YX^T(X^2)^{-1}$).
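As a numerical sanity check (a sketch on random data with assumed dimensions), the candidate $W=YX^T(XX^T)^{-1}$ does make the gradient $2(WX-Y)X^T$ vanish, and random perturbations of it never achieve a smaller $E$:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((3, 10))  # p x m; rows linearly independent with prob. 1
Y = rng.standard_normal((2, 10))  # n x m

E = lambda W: np.sum((W @ X - Y) ** 2)
W_star = Y @ X.T @ np.linalg.inv(X @ X.T)

# The gradient 2(WX - Y)X^T vanishes at W_star ...
assert np.allclose(2 * (W_star @ X - Y) @ X.T, 0, atol=1e-8)
# ... and nearby random perturbations never do better.
assert all(E(W_star) <= E(W_star + 0.1 * rng.standard_normal(W_star.shape))
           for _ in range(100))
```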
LINEAR ALGEBRA APPROACH In any finite dimensional inner product space $H$ (let's think of this as a vector space $H$ with a dot product $g\cdot h$ for $g,h\in H$ having the property that $\|g\|^2=g\cdot g$), we have a fact: For any subspace $M$ of $H$ and any $h\in H$, there is a unique element $g$ of $M$ which is closest to $h$. That is, there is a unique $g\in M$ such that $$\|h-g\|\leqslant \|h-m\|$$ for all $m\in M$. Moreover, $g$ is obtained as the orthogonal projection of $h$ onto $M$.
We have another fact: An operator $P:H\to H$ is the orthogonal projection onto $M$ if $1.$ its range is $M$, $2.$ $(Pg)\cdot h=g\cdot (Ph)$ for all $g,h\in H$ (in other words, $P$ is self-adjoint), and $3.$ $P\circ P=P$ (that is, $P$ is idempotent). To see why an operator with these three properties is the orthogonal projection onto $M$, first note that for any $g\in H$, we can write $$g=Pg+(I-P)g.$$ We want to know that $Pg\in M$ and $(I-P)g\cdot h=0$ for all $h\in M$. That is, we have decomposed $g$ into a piece in $M$ (that piece is $Pg$) and a piece orthogonal to $M$ (that piece is $(I-P)g$). We have assumed that the range of $P$ is $M$, so we know $Pg\in M$ by assumption. Fix any $h\in M$. Since $M$ is the range of $P$, there exists $u\in H$ such that $Pu=h$. Then $$(I-P)g\cdot h=(I-P)g\cdot Pu=P(I-P)g\cdot u = (P-P^2)g\cdot u=(P-P)g\cdot u=0\cdot u=0.$$
Now let $H=\mathbb{R}^{n\times m}$ with the trace dot product and Hilbert-Schmidt norm defined above. Let $M=\{WX:W\in \mathbb{R}^{n\times p}\}$, where $X$ is a fixed $p\times m$ matrix. This is a subspace. For a fixed $n\times m$ matrix $Y$, the unique minimizer of $$E(W)=\sum_i \|WX_i-Y_i\|^2=\|Y-WX\|_{HS}^2$$ will occur when $WX=PY$, where $P$ is the orthogonal projection onto $M$. I claim that the orthogonal projection onto $M$ is given by $PU=UX^T(XX^T)^{-1}X$ for any $n\times m$ matrix $U$. In particular, we get the minimizer at $WX=YX^T(XX^T)^{-1}X$. "Striking" the rightmost $X$ from both sides of this equation gives $W=YX^T(XX^T)^{-1}$.
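The claimed projection can be checked against the three properties above. The following sketch (illustrative random data; $XX^T$ is invertible with probability one here) verifies that $PU=UX^T(XX^T)^{-1}X$ has range in $M$, is self-adjoint for the trace inner product, and is idempotent:

```python
import numpy as np

rng = np.random.default_rng(4)
X = rng.standard_normal((3, 7))          # p x m with linearly independent rows
Q = X.T @ np.linalg.inv(X @ X.T) @ X     # m x m, so that P(U) = U Q

P = lambda U: U @ Q
dot = lambda A, B: np.trace(A.T @ B)     # the matricial (trace) inner product

U = rng.standard_normal((2, 7))
V = rng.standard_normal((2, 7))

# 1. P(U) lies in M: it equals WX with W = U X^T (XX^T)^{-1}.
W = U @ X.T @ np.linalg.inv(X @ X.T)
assert np.allclose(P(U), W @ X)
# 2. Self-adjoint: P(U) . V = U . P(V).
assert np.isclose(dot(P(U), V), dot(U, P(V)))
# 3. Idempotent: P(P(U)) = P(U).
assert np.allclose(P(P(U)), P(U))
```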
How do we know we can strike the $X$? In general, we can't. But in problems such as this, we assume that we can; otherwise there is no unique minimizer $W$. Being able to "strike" the $X$ is equivalent to $X$ having linearly independent rows, which is equivalent to $XX^T$ being invertible, which in turn is a necessary and sufficient condition for the unique minimizing $W$ to exist, and is typically assumed in such problems (including in the provided solution).