Gradient of the weighted least-square function, given a non-linear model for the data


I have the following problem at hand, for which I am unsure if my reasoning was right.

Assume we have $N$ datapoints $\mathbf{x}_n \in \mathbb{R}^{D\times 1}$ with corresponding scalar outputs $y_n$. We predict $y_n$ using the model: $$ \hat{y}_n = e^{\mathbf{x}_n^T\boldsymbol{\theta}+b} $$ where $b$ is a scalar and $\boldsymbol{\theta} \in \mathbb{R}^{D\times 1}$. Assume we want to minimize the weighted sum of squared errors w.r.t. $\boldsymbol{\theta}$ and $b$: $$ \mathcal{L} = \sum_n \omega_n(y_n - \hat{y}_n)^2 $$ for some weights $\omega_n > 0$. Compute the gradient of the above function by first casting $\mathcal{L}$ into matrix notation and then applying the chain rule of differentiation. Note that the exponentiation is done on a scalar - use the notation $e^{\mathbf{A}}$ to denote element-wise exponentiation.

I proceeded as follows:

We begin the conversion by first noting the transformations: $$ y_n \rightarrow \mathbf{y} \in \mathbb{R}^{N\times 1}\\ \hat{y}_n \rightarrow \hat{\mathbf{y}} \in \mathbb{R}^{N\times 1} $$

Next, we construct the diagonal $N\times N$ matrix of weights $\omega_n$:

$$ \mathbf{W} \in \mathbb{R}^{N\times N}\; \ni: W_{ij} := \begin{cases} \omega_i > 0 &\text{if}\quad i = j \\ 0 & \text{otherwise} \\ \end{cases} $$

Then, the weighted square loss function is constructed as:

$$ \mathcal{L} = (\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W}(\mathbf{y} - \hat{\mathbf{y}}) \in \mathbb{R}^{1\times 1} $$
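To sanity-check the matrix form, one can compare it numerically against the original sum. A minimal NumPy sketch; all sizes and values below are toy assumptions, not part of the problem:

```python
import numpy as np

# Toy problem (sizes and values are illustrative assumptions)
rng = np.random.default_rng(0)
N, D = 5, 3
X = rng.normal(size=(D, N))             # columns are the data points x_n
theta = rng.normal(size=D)
b = 0.1
y = rng.normal(size=N)                  # targets
omega = rng.uniform(0.5, 2.0, size=N)   # positive weights

y_hat = np.exp(X.T @ theta + b)         # model predictions
W = np.diag(omega)                      # diagonal weight matrix
r = y - y_hat                           # residuals

L_sum = np.sum(omega * r ** 2)          # sum form of the loss
L_mat = r @ W @ r                       # matrix form (y - y_hat)^T W (y - y_hat)
```

The two scalars agree, as expected.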

Lastly, we need to explicitly write out $\hat{\mathbf{y}}$. We can generate a data matrix $\mathbf{X}$ by simply stacking the column vectors $\mathbf{x}_i$:

$$ \mathbf{X} = (\mathbf{x}_1 \;\cdots\; \mathbf{x}_N) \in \mathbb{R}^{D\times N} $$

giving us $\mathbf{X}^T\boldsymbol{\theta} \in \mathbb{R}^{N\times 1}$. Adding to it the constant vector $\mathbf{b}\in \mathbb{R}^{N\times 1}\; \ni:\; b_i = b\;\forall i$, we finally write:

$$ \hat{\mathbf{y}} = e^{\mathbf{X}^T\boldsymbol{\theta} + \mathbf{b}} \in \mathbb{R}^{N\times 1} $$
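A quick numerical check (toy values assumed) that the stacked expression reproduces the per-sample definition $\hat{y}_n = e^{\mathbf{x}_n^T\boldsymbol{\theta} + b}$:

```python
import numpy as np

# Toy data (assumed for illustration)
rng = np.random.default_rng(1)
N, D = 4, 2
X = rng.normal(size=(D, N))   # columns are the x_n
theta = rng.normal(size=D)
b = -0.3

# Stacked form: element-wise exp of X^T theta + b
y_hat_vec = np.exp(X.T @ theta + b)
# Per-sample form: y_hat_n = exp(x_n^T theta + b)
y_hat_loop = np.array([np.exp(X[:, n] @ theta + b) for n in range(N)])
```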

Next, we apply the chain rule: $$ \nabla_{\boldsymbol{\theta}, \mathbf{b}}\mathcal{L}(\boldsymbol{\theta}, b) = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}\frac{\partial \hat{\mathbf{y}}}{\partial (\boldsymbol{\theta}, \mathbf{b})} = \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} \left[ \frac{\partial \hat{\mathbf{y}}}{\partial \boldsymbol{\theta}} \;\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{b}} \right] \equiv \left[\nabla_{\boldsymbol{\theta}}\mathcal{L}\; \nabla_{\mathbf{b}}\mathcal{L} \right] $$ The partial derivatives $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}$ and $\frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{b}}$ are fairly straightforward:

$$ \frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}} = -2(\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W} \in \mathbb{R}^{1\times N}\\ \frac{\partial \hat{\mathbf{y}}}{\partial \mathbf{b}} = \frac{\partial}{\partial \mathbf{b}}e^{\mathbf{X}^T\boldsymbol{\theta} + \mathbf{b}} = \mathbf{I}\hat{\mathbf{y}} = \hat{\mathbf{y}} \in \mathbb{R}^{N\times 1} $$
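Both expressions can be verified with finite differences; below is a sketch for $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}$, treating $\hat{\mathbf{y}}$ as a free vector (toy values assumed):

```python
import numpy as np

# Toy setup (assumed values)
rng = np.random.default_rng(2)
N = 6
y = rng.normal(size=N)
omega = rng.uniform(0.5, 2.0, size=N)
W = np.diag(omega)
y_hat = rng.normal(size=N)   # treated as a free variable here

def loss(yh):
    return (y - yh) @ W @ (y - yh)

# Analytic row vector: dL/dy_hat = -2 (y - y_hat)^T W
grad_analytic = -2 * (y - y_hat) @ W

# Central finite differences, one coordinate at a time
eps = 1e-6
grad_fd = np.array([
    (loss(y_hat + eps * np.eye(N)[i]) - loss(y_hat - eps * np.eye(N)[i])) / (2 * eps)
    for i in range(N)
])
```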

However, $\frac{\partial \hat{\mathbf{y}}}{\partial \boldsymbol{\theta}}$ proves rather tricky. The expected dimensions of this gradient are $N\times D$. If we differentiate this as one might expect:

$$ \frac{\partial \hat{\mathbf{y}}}{\partial \boldsymbol{\theta}} = \frac{\partial}{\partial \boldsymbol{\theta}}e^{\mathbf{X}^T\boldsymbol{\theta} + \mathbf{b}} = \hat{\mathbf{y}}^T\frac{\partial}{\partial \boldsymbol{\theta}}(\mathbf{X}^T\boldsymbol{\theta} + \mathbf{b}) = \hat{\mathbf{y}}^T\mathbf{X}^T \in \mathbb{R}^{1\times D} $$

which gives us the wrong dimensions. To check what the real derivative actually looks like, let us set matrix notation aside for now and differentiate component-wise:

$$ \frac{\partial \hat{y}_i}{\partial \boldsymbol{\theta}} = \hat{y}_i\mathbf{x}_i^T \in \mathbb{R}^{1\times D} $$

which gives us the correct total dimensions of $N\times D$. In matrix notation, this can be written as:

$$ \frac{\partial \hat{\mathbf{y}}}{\partial \boldsymbol{\theta}} = \mathrm{diag}(\mathbf{X}^T)\hat{\mathbf{y}} = \begin{bmatrix} \mathbf{x}_1^T & & 0 \\ & \ddots & \\ 0 & & \mathbf{x}_N^T \end{bmatrix}\hat{\mathbf{y}} \in \mathbb{R}^{N\times D\times 1} \equiv \mathbb{R}^{N\times D} $$

where $\mathrm{diag}(\mathbf{X}^T) \in \mathbb{R}^{N\times D\times N}$ is a rank-3 tensor. The product with $\frac{\partial \mathcal{L}}{\partial \hat{\mathbf{y}}}$ is now well-defined and produces the final result with dimensions $1\times D$.

Therefore, we have:

$$ \nabla_{\boldsymbol{\theta}}\mathcal{L} = -2(\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W}\,\mathrm{diag}(\mathbf{X}^T)\hat{\mathbf{y}} \in \mathbb{R}^{1\times D} \\ \nabla_{\mathbf{b}}\mathcal{L} = -2(\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W}\hat{\mathbf{y}} \in \mathbb{R}^{1\times 1} $$

My problem is especially with the last partial derivative: I could not reason my way to it analytically. I simply looked at how the component-wise derivation would proceed and then constructed the respective matrices such that the matrix products were defined. Is my reasoning correct, and is there an analytical way of arriving at the last partial derivative?
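One way to gain confidence in both final gradients is a finite-difference check. Note that stacking the component-wise rows $\hat{y}_i\mathbf{x}_i^T$ is the same as forming $\mathrm{diag}(\hat{\mathbf{y}})\mathbf{X}^T$, which gives a tensor-free way to evaluate $\nabla_{\boldsymbol{\theta}}\mathcal{L}$. A NumPy sketch with assumed toy data:

```python
import numpy as np

# Toy problem (sizes and values are assumptions for illustration)
rng = np.random.default_rng(3)
N, D = 5, 3
X = rng.normal(size=(D, N))
y = rng.normal(size=N)
omega = rng.uniform(0.5, 2.0, size=N)
W = np.diag(omega)
theta = rng.normal(size=D)
b = 0.2

def loss(th, bb):
    y_hat = np.exp(X.T @ th + bb)
    r = y - y_hat
    return r @ W @ r

y_hat = np.exp(X.T @ theta + b)
# The rank-3 contraction W diag(X^T) y_hat realized as W diag(y_hat) X^T
grad_theta = -2 * (y - y_hat) @ W @ (np.diag(y_hat) @ X.T)   # 1 x D
grad_b = -2 * (y - y_hat) @ W @ y_hat                        # scalar

# Central finite differences for comparison
eps = 1e-6
fd_theta = np.array([
    (loss(theta + eps * np.eye(D)[j], b) - loss(theta - eps * np.eye(D)[j], b)) / (2 * eps)
    for j in range(D)
])
fd_b = (loss(theta, b + eps) - loss(theta, b - eps)) / (2 * eps)
```

Both analytic gradients match their finite-difference counterparts, which supports the component-wise construction.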

Thank you kindly for your help and answers.


There is 1 answer below.


When computing the derivative $\frac{\partial\hat{\mathbf{y}}}{\partial\boldsymbol{\theta}}$, we can apply the chain rule to get:

$$ \frac{\partial\hat{\mathbf{y}}}{\partial\left(\mathbf{X}^\intercal\boldsymbol{\theta} + \mathbf{b}\right)} \cdot \frac{\partial\left(\mathbf{X}^\intercal\boldsymbol{\theta} + \mathbf{b}\right)}{\partial\boldsymbol{\theta}}. $$

Note that

  • the first derivative is an $N \times N$ matrix with non-zero entries only on the diagonal, since the function ($\exp$ in our case) is applied element-wise; its $i$-th diagonal entry is $e^{\mathbf{x}_i^\intercal\boldsymbol{\theta} + b} = \hat{y}_i$,
  • the second derivative is $\mathbf{X}^\intercal$, which is an $N \times D$ matrix (as you already observed in your derivation).

By putting the two terms together, we get:

$$ \frac{\partial\hat{\mathbf{y}}}{\partial\boldsymbol{\theta}} = \mathrm{diag}(\hat{\mathbf{y}}) \cdot \mathbf{X}^\intercal. $$

Substituting this back into the chain rule gives $$ \nabla_{\boldsymbol{\theta}}\mathcal{L} = -2(\mathbf{y} - \hat{\mathbf{y}})^T\mathbf{W}\,\mathrm{diag}(\hat{\mathbf{y}})\,\mathbf{X}^\intercal \in \mathbb{R}^{1\times D}, $$ which matches your component-wise result without resorting to a rank-3 tensor.
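This Jacobian can likewise be checked numerically; a short NumPy sketch with assumed toy sizes:

```python
import numpy as np

# Check the Jacobian dy_hat/dtheta = diag(y_hat) X^T against finite
# differences (toy sizes and values, assumed for illustration)
rng = np.random.default_rng(4)
N, D = 4, 3
X = rng.normal(size=(D, N))
theta = rng.normal(size=D)
b = 0.0

def f(th):
    # y_hat as a function of theta
    return np.exp(X.T @ th + b)

y_hat = f(theta)
J_analytic = np.diag(y_hat) @ X.T     # N x D Jacobian

# Central finite differences, one column of the Jacobian per parameter
eps = 1e-6
J_fd = np.column_stack([
    (f(theta + eps * np.eye(D)[j]) - f(theta - eps * np.eye(D)[j])) / (2 * eps)
    for j in range(D)
])
```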