I have the following problem at hand, for which I am unsure if my reasoning was right.
Assume we have $N$ datapoints $\mathbf{x}_n \in \mathbb{R}^{D\times 1}$ with corresponding scalar outputs $y_n$. We predict $y_n$ using the model: $$ \hat{y}_n = e^{\mathbf{x}_n^T\boldsymbol{\theta}+b} $$ where $b$ is a scalar and $\boldsymbol{\theta} \in \mathbb{R}^{D\times 1}$. Assume we want to minimize the weighted sum of squared errors w.r.t. $\boldsymbol{\theta}$ and $b$: $$ \mathcal{L} = \sum_n \omega_n(y_n - \hat{y}_n)^2 $$ for some weights $\omega_n > 0$. Compute the gradient of the above function by first casting $\mathcal{L}$ into matrix notation and then applying the chain rule of differentiation. Note that the exponentiation is done on a scalar - use the notation $e^{\mathbf{A}}$ to denote element-wise exponentiation.
I proceeded as follows:
We begin the conversion by first noting the transformations: $$ y_n \rightarrow \mathbf{y} \in \mathbb{R}^{N\times 1}\\ \hat{y}_n \rightarrow \hat{\mathbf{y}} \in \mathbb{R}^{N\times 1} $$
Next, we construct the diagonal $N\times N$ matrix of weights $\omega_n$:
$$ \mathbf{W} \in \mathbb{R}^{N\times N}\; \ni:\; W_{ij} := \begin{cases} \omega_i > 0 &\text{if}\quad i = j \\ 0 & \text{otherwise} \\ \end{cases} $$
Then, the weighted square loss function is constructed as:
$$ \mathcal{L} = (\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W}(\mathbf{y} - \hat{\mathbf{y}}) \in \mathbb{R}^{1\times 1} $$
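As a quick sanity check that the quadratic form matches the original component-wise sum (a minimal NumPy sketch; the array names and random data are my own, not from the problem statement):

```python
import numpy as np

rng = np.random.default_rng(0)
N = 5
y = rng.normal(size=N)
y_hat = rng.normal(size=N)
w = rng.uniform(0.1, 1.0, size=N)      # weights omega_n > 0

# component-wise loss: sum_n omega_n (y_n - y_hat_n)^2
loss_sum = np.sum(w * (y - y_hat) ** 2)

# matrix form: (y - y_hat)^T W (y - y_hat) with W = diag(omega)
W = np.diag(w)
loss_mat = (y - y_hat) @ W @ (y - y_hat)

assert np.isclose(loss_sum, loss_mat)
```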
Lastly, we need to explicitly write out $\hat{\mathbf{y}}$. We can generate a data matrix $\mathbf{X}$ by simply stacking the column vectors $\mathbf{x}_n$:
$$ \mathbf{X} = (\mathbf{x}_1 \dots \mathbf{x}_N) \in \mathbb{R}^{D\times N} $$
giving us $\mathbf{X}^T\boldsymbol{\theta} \in \mathbb{R}^{N\times 1}$. Adding to it the constant vector $\mathbf{b}\in \mathbb{R}^{N\times 1}\; \ni:\; b_i = b\;\forall i$, we finally write:
$$ \hat{\mathbf{y}} = e^{\mathbf{X}^T\boldsymbol{\theta} + \mathbf{b}} \in \mathbb{R}^{N\times 1} $$
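The vectorized prediction can be checked against the per-sample definition (again a small NumPy sketch with my own illustrative names):

```python
import numpy as np

rng = np.random.default_rng(3)
N, D = 4, 2
xs = [rng.normal(size=D) for _ in range(N)]   # the individual x_n
theta = rng.normal(size=D)
b = 0.5

X = np.column_stack(xs)                       # D x N data matrix
y_hat_vec = np.exp(X.T @ theta + b)           # element-wise exponentiation

# per-sample predictions y_hat_n = exp(x_n^T theta + b)
y_hat_loop = np.array([np.exp(x @ theta + b) for x in xs])
assert np.allclose(y_hat_vec, y_hat_loop)
```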
Next, we apply the chain rule: $$ \nabla_{\boldsymbol{\theta}, \mathbf{b}}\mathcal{L}(\boldsymbol{\theta}, b) = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}\frac{\partial \hat{\mathbf{y}}}{\partial (\boldsymbol{\theta}, \mathbf{b})} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \left[ \frac{\partial \hat{\mathbf{y}}}{\partial \boldsymbol{\theta}} \;\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{b}} \right] \equiv \left[\nabla_{\boldsymbol{\theta}}\mathcal{L}\; \nabla_{\mathbf{b}}\mathcal{L} \right] $$ The partial derivatives $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}$ and $\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{b}}$ are fairly straightforward:
$$ \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} = -2(\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W} \in \mathbb{R}^{1\times N}\\ \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{b}} = \frac{\partial}{\partial \mathbf{b}}e^{\mathbf{X}^T\boldsymbol{\theta} + \mathbf{b}} = \mathbf{I}\hat{\mathbf{y}} = \hat{\mathbf{y}} \in \mathbb{R}^{N\times 1} $$
However, $\frac{\partial \hat{\mathbf{y}}}{\partial \boldsymbol{\theta}}$ proves rather tricky. The expected dimensions of this Jacobian are $N\times D$. If we differentiate naively:
$$ \frac{\partial \hat{\mathbf{y}}}{\partial \boldsymbol{\theta}} = \frac{\partial}{\partial \boldsymbol{\theta}}e^{\mathbf{X}^T\boldsymbol{\theta} + \mathbf{b}} = \hat{\mathbf{y}}^T\frac{\partial}{\partial \boldsymbol{\theta}}(\mathbf{X}^T\boldsymbol{\theta} + \mathbf{b}) = \hat{\mathbf{y}}^T\mathbf{X}^T \in \mathbb{R}^{1\times D} $$
which gives us the wrong dimensions. To check how the real derivative looks, let us set the matrix notation aside for now and differentiate component-wise:
$$ \frac{\partial \hat{y}_i}{\partial \boldsymbol{\theta}} = \hat{y}_i\mathbf{x}_i^T \in \mathbb{R}^{1\times D} $$
which gives us the correct total dimensions of $N\times D$. In matrix notation, this can be written as:
$$ \frac{\partial \hat{\mathbf{y}}}{\partial \boldsymbol{\theta}} = diag(\mathbf{X}^T)\hat{\mathbf{y}} = \begin{bmatrix} \mathbf{x}_1^T & & 0 \\ & \ddots & \\ 0 & & \mathbf{x}_N^T \end{bmatrix}\hat{\mathbf{y}} \in \mathbb{R}^{N\times D\times 1} \equiv \mathbb{R}^{N\times D} $$
where $diag(\mathbf{X}^T) \in \mathbb{R}^{N\times D\times N}$ (a rank-3 tensor). The product with $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}$ is now defined and produces the final result with dimensions $1\times D$.
Therefore, we have:
$$ \nabla_{\boldsymbol{\theta}}\mathcal{L} = -2(\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W}diag(\mathbf{X}^T)\hat{\mathbf{y}} \in \mathbb{R}^{1\times D} \\ \nabla_{\mathbf{b}}\mathcal{L} = -2(\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W}\hat{\mathbf{y}} \in \mathbb{R}^{1\times 1} $$
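Using the equivalent flat form $\nabla_{\boldsymbol{\theta}}\mathcal{L} = -2(\mathbf{y}-\hat{\mathbf{y}})^T\mathbf{W}\,diag(\hat{\mathbf{y}})\,\mathbf{X}^T$, these expressions can be verified against central finite differences (a NumPy sketch under my own variable names; not part of the original derivation):

```python
import numpy as np

rng = np.random.default_rng(1)
N, D = 6, 4
X = rng.normal(size=(D, N))          # columns are the x_n
theta = rng.normal(size=D)
b = 0.3
y = rng.normal(size=N)
w = rng.uniform(0.1, 1.0, size=N)
W = np.diag(w)

def loss(theta, b):
    y_hat = np.exp(X.T @ theta + b)
    return (y - y_hat) @ W @ (y - y_hat)

y_hat = np.exp(X.T @ theta + b)
# analytic gradients from the derivation above
grad_theta = -2 * (y - y_hat) @ W @ np.diag(y_hat) @ X.T   # shape (D,)
grad_b = -2 * (y - y_hat) @ W @ y_hat                      # scalar

# central finite-difference check
eps = 1e-6
num_theta = np.array([
    (loss(theta + eps * np.eye(D)[i], b) - loss(theta - eps * np.eye(D)[i], b)) / (2 * eps)
    for i in range(D)
])
num_b = (loss(theta, b + eps) - loss(theta, b - eps)) / (2 * eps)

assert np.allclose(grad_theta, num_theta, rtol=1e-4, atol=1e-6)
assert np.isclose(grad_b, num_b, rtol=1e-4)
```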
My problem is especially with the last partial derivative: I could not reason my way to it analytically. Instead, I looked at how the component-wise differentiation would proceed and then constructed the respective matrices so that the matrix products were defined. Is my reasoning correct, and is there an analytical way of arriving at this last partial derivative?
Thank you kindly for your help and answers.
When computing the derivative $\frac{\partial\hat{\mathbf{y}}}{\partial\boldsymbol{\theta}}$, we can apply the chain rule to get:
$$ \frac{\partial\hat{\mathbf{y}}}{\partial\boldsymbol{\theta}} = \frac{\partial\hat{\mathbf{y}}}{\partial\left(\mathbf{X}^\intercal\boldsymbol{\theta} + \mathbf{b}\right)} \cdot \frac{\partial\left(\mathbf{X}^\intercal\boldsymbol{\theta} + \mathbf{b}\right)}{\partial\boldsymbol{\theta}}. $$
Note that the exponential acts element-wise, so its Jacobian with respect to its vector argument is the diagonal matrix of its outputs, while the second factor is just the Jacobian of an affine map:
$$ \frac{\partial\hat{\mathbf{y}}}{\partial\left(\mathbf{X}^\intercal\boldsymbol{\theta} + \mathbf{b}\right)} = \mathrm{diag}(\hat{\mathbf{y}}), \qquad \frac{\partial\left(\mathbf{X}^\intercal\boldsymbol{\theta} + \mathbf{b}\right)}{\partial\boldsymbol{\theta}} = \mathbf{X}^\intercal. $$
By putting the two terms together, we get:
$$ \frac{\partial\hat{\mathbf{y}}}{\partial\boldsymbol{\theta}} = \mathrm{diag}(\hat{\mathbf{y}}) \cdot \mathbf{X}^\intercal \in \mathbb{R}^{N\times D}. $$
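One can confirm this Jacobian numerically as well (a small NumPy sketch; the setup is my own illustration):

```python
import numpy as np

rng = np.random.default_rng(2)
N, D = 5, 3
X = rng.normal(size=(D, N))
theta = rng.normal(size=D)
b = -0.1

f = lambda t: np.exp(X.T @ t + b)    # y_hat as a function of theta

y_hat = f(theta)
J = np.diag(y_hat) @ X.T             # analytic Jacobian, shape (N, D)

# finite differences, one column (i.e. one theta component) at a time
eps = 1e-6
J_num = np.column_stack([
    (f(theta + eps * np.eye(D)[:, i]) - f(theta - eps * np.eye(D)[:, i])) / (2 * eps)
    for i in range(D)
])

assert np.allclose(J, J_num, rtol=1e-4, atol=1e-6)
```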