Let $g: \mathbb{R}^n \rightarrow \mathbb{R}^m$, $x \in \mathbb{R}^n$. Let $\frac{\partial g}{\partial x}$ be the Jacobian matrix, so that $\frac{\partial g}{\partial x} =\begin{bmatrix} \frac{\partial g_1}{\partial x_1} & \frac{\partial g_1}{\partial x_2} & \dots & \frac{\partial g_1}{\partial x_n} \\[1ex] \frac{\partial g_2}{\partial x_1} & \frac{\partial g_2}{\partial x_2} & \dots & \frac{\partial g_2}{\partial x_n} \\[1ex] \vdots & \vdots & \ddots & \vdots \\[1ex] \frac{\partial g_m}{\partial x_1} & \frac{\partial g_m}{\partial x_2} & \dots & \frac{\partial g_m}{\partial x_n} \end{bmatrix}$.
If $m = 1$, then $\frac{\partial g}{\partial x}$ is a gradient. In my notes, the gradient is expressed as a column, instead of a row, so I've gotten a little bit confused with dimensionality.
Prove that
- If $a \in \mathbb{R}^n$, $x \in \mathbb{R}^n$, then $\frac{\partial(a^{\intercal}x)}{\partial x}= a.$
- If $\mathbf{A} \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, then $\frac{\partial(\mathbf{A}x)}{\partial x}= \mathbf{A}$.
- If $\mathbf{A} \in \mathbb{R}^{m \times n}$, $x \in \mathbb{R}^n$, then $\frac{\partial(x^\intercal\mathbf{A}x)}{\partial x} = (\mathbf{A} + \mathbf{A^\intercal})x$; in particular, if $\mathbf{A}^\intercal = \mathbf{A}$, then $\frac{\partial(x^\intercal\mathbf{A}x)}{\partial x} = 2\mathbf{A}x$.
- If $x \in \mathbb{R}^n$, then $\frac{\partial \|x\|^2}{\partial x} = 2x$.
I believe it should not be too hard.
- Multiplying out the product, we obtain $a^\intercal x = a_1x_1 + \dots + a_nx_n$. Therefore, $\frac{\partial(a^{\intercal}x)}{\partial x}= \left[\frac{\partial(a^{\intercal}x)}{\partial x_1}, \dots, \frac{\partial(a^{\intercal}x)}{\partial x_n}\right] = [a_1, \dots, a_n] = a.$
- Similarly to the first, writing $a_i$ for the $i$-th row of $\mathbf{A}$ and stacking the rows, $\frac{\partial(\mathbf{A}x)}{\partial x} = \left[\frac{\partial(a_1x)}{\partial x}, \dots, \frac{\partial(a_mx)}{\partial x}\right] = [a_1,\dots, a_m] = \mathbf{A}$.
- For symmetric $\mathbf{A}$, we could write out $x^\intercal\mathbf{A}x = \sum_{i = 1}^{n} \sum_{j = 1}^{n} x_i a_{ij} x_j$ and use that $a_{ij} = a_{ji}$. How do I proceed when $\mathbf{A}$ is a non-symmetric $m \times n$ matrix?
- $\frac{\partial\|x\|^2}{\partial x} = \frac{\partial}{\partial x}\sum_i x^2_i = [2x_1, \dots, 2x_n] = 2x$.
Could you please check this and point out any mistakes, perhaps making it more rigorous? Thanks.
Everything you write is fine. Regarding point 3, first note that it makes sense only if $m=n$. After that, decompose $A$ into its symmetric and antisymmetric parts: $$ A=\frac{A+A^T}{2}+\frac{A-A^T}{2}. $$ Only the symmetric part of $A$ contributes to the expression $x^T A x$. Indeed, if $B$ is an antisymmetric matrix, i.e., if $B^T=-B$, then $$ x^T B x=Bx\cdot x=x\cdot B^T x=- x\cdot B x=-x^T B x, $$ so $2 x^T B x=0$ and hence $x^T B x=0$.
Therefore, $x^T A x=x^T \frac{A+A^T}{2}x$ and you can apply the result you computed for $A$ symmetric. Namely $$ \partial_x (x^T A x)=\partial_x\left(x^T \frac{A+A^T}{2}x\right)=2\left(\frac{A+A^T}{2}\right)x=(A+A^T)x. $$ In summary, you just need to prove the formula for $A$ symmetric.
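For completeness, here is one way the symmetric case could be written out componentwise. This is only a sketch: the index $k$ and the entries $a_{ij}$ of $A$ are notation introduced here for illustration. Fix $k$ and apply the product rule to each term of the double sum:

$$
\frac{\partial}{\partial x_k}\sum_{i=1}^{n}\sum_{j=1}^{n} x_i a_{ij} x_j
=\sum_{j=1}^{n} a_{kj}x_j+\sum_{i=1}^{n} x_i a_{ik}
=(Ax)_k+(A^Tx)_k .
$$

If $A^T=A$, the two sums coincide, so the $k$-th partial derivative is $2(Ax)_k$, the $k$-th component of $2Ax$.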
PS: Your book is "right"; the gradient must be a column vector. When $m=1$, it is better to think of the Jacobian matrix as the transposed gradient. You will see the reason for that in future classes.
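If it helps, the four identities can also be sanity-checked numerically. The following is a minimal sketch assuming NumPy is available; `numerical_jacobian` is a hypothetical helper written for this illustration (not a library function), and the dimensions and random test data are arbitrary choices. It approximates the Jacobian by central differences, so scalar-valued functions come out as a $1 \times n$ row, i.e. the transposed gradient.

```python
import numpy as np


def numerical_jacobian(g, x, h=1e-6):
    """Central-difference Jacobian of g at x (hypothetical helper).

    Returns an array of shape (m, n); for scalar-valued g, m = 1.
    """
    x = np.asarray(x, dtype=float)
    cols = []
    for k in range(x.size):
        e = np.zeros_like(x)
        e[k] = h  # perturb only the k-th coordinate
        diff = np.atleast_1d(g(x + e)) - np.atleast_1d(g(x - e))
        cols.append(diff / (2 * h))
    return np.stack(cols, axis=1)  # column k holds d g / d x_k


rng = np.random.default_rng(0)
n, m = 4, 3
x = rng.standard_normal(n)
a = rng.standard_normal(n)
A = rng.standard_normal((m, n))   # rectangular, for the map x -> A x
S = rng.standard_normal((n, n))   # square but not symmetric, for x^T S x
B = S - S.T                       # antisymmetric: B^T = -B

J1 = numerical_jacobian(lambda v: a @ v, x)      # row version of a
J2 = numerical_jacobian(lambda v: A @ v, x)      # should match A
J3 = numerical_jacobian(lambda v: v @ S @ v, x)  # row version of (S + S^T) x
J4 = numerical_jacobian(lambda v: v @ v, x)      # row version of 2 x

print(np.allclose(J1.ravel(), a, atol=1e-5))
print(np.allclose(J2, A, atol=1e-5))
print(np.allclose(J3.ravel(), (S + S.T) @ x, atol=1e-5))
print(np.allclose(J4.ravel(), 2 * x, atol=1e-5))
print(abs(x @ B @ x) < 1e-10)  # the antisymmetric part contributes nothing
```

Note that `J1`, `J3` and `J4` have shape `(1, n)`: with the Jacobian convention the derivative of a scalar function is a row, and transposing it gives the column gradient.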