I have a function $f_k$ defined as $f_k(x_1, \cdots, x_k; \theta_k)$ that outputs a $2^M$-dimensional complex vector, where $\theta_k$ is also a $2^M$-dimensional complex vector and the $x_i$ are integers. This function can be defined in multiple ways. For example, we could have the following:
- Identity model:
$$ f_k(x_1, \cdots, x_k; \theta_k) = \mathrm{Norm}(\theta_k) = \frac{\theta_k}{\|\theta_k\|} $$
- Linear basis linear regression:
$$ f_k(x_1, \cdots, x_k; \theta_k) = \mathrm{Norm}(w_1^{(k)}x_1 + \cdots + w_{k-1}^{(k)}x_{k-1} + b^{(k)})$$
Where in the second case $\theta_k$ is a collection of $2^M$ dimensional complex vectors $\theta_k=\{w_1^{(k)}, w_2^{(k)}, \cdots, w_{k-1}^{(k)}, b^{(k)}\}$ and $\| \theta \| = \sqrt{\theta^\dagger \theta}$.
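To make the two parameterizations concrete, here is a small numpy sketch (function and variable names are my own, and the shapes assume $2^M$-dimensional complex vectors as above):

```python
import numpy as np

def norm(v):
    # Norm(v) = v / ||v||, with ||v|| = sqrt(v^dagger v)
    return v / np.linalg.norm(v)

def f_identity(theta):
    # Identity model: the output ignores the x_i and depends only on theta_k
    return norm(theta)

def f_linear(xs, ws, b):
    # Linear model: Norm(w_1 x_1 + ... + w_{k-1} x_{k-1} + b)
    # xs: integers x_1..x_{k-1}; ws: list of 2^M-dim complex vectors; b: 2^M-dim complex vector
    v = b + sum(w * x for w, x in zip(ws, xs))
    return norm(v)

M = 2
dim = 2 ** M
rng = np.random.default_rng(0)
theta = rng.normal(size=dim) + 1j * rng.normal(size=dim)
out = f_identity(theta)
print(np.linalg.norm(out))  # 1.0 by construction
```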
To compute the gradient of a loss function, I need the partial derivative $\frac{\partial f_k}{\partial \theta_k}$. For the first case, I guess I could simply use the formula given here for the 2-norm together with the quotient rule.
However, I'm not sure how to compute this partial derivative for models like the second case I showed. Also, what shape would this derivative take? My guess is that it has the same structure as $\theta_k$, i.e., a collection of $2^M$-dimensional complex vectors; is that correct?
For reference, I'm computing equation 13 from the paper *Quantum self-learning Monte Carlo with quantum Fourier transform sampler*.
Assuming what you need is the 2-norm, the gradient with respect to the real parameters can be obtained by writing each $w_{l}^{(k)}=\alpha_{l}^{(k)}+i\beta_{l}^{(k)}$ with $\alpha_{l}^{(k)},\ \beta_{l}^{(k)}$ real.
Then the total differential is
$$d\big(\|w_{l}^{(k)}x_l+\dots+b^{(k)}\|\big)=d\big(\|\alpha_{l}^{(k)}x_l+i\beta_{l}^{(k)}x_l+\dots\|\big)$$
$$=\frac{1}{2\|\dots\|}\,d\big[(\alpha_{l}^{(k)}x_l+i\beta_{l}^{(k)}x_l+\dots)^H(\alpha_{l}^{(k)}x_l+i\beta_{l}^{(k)}x_l+\dots)\big]$$
$$=\frac{1}{2\|\dots\|}\,d\big[(\alpha_{l}^{(k)}x_l-i\beta_{l}^{(k)}x_l+\dots)^T(\alpha_{l}^{(k)}x_l+i\beta_{l}^{(k)}x_l+\dots)\big].$$
By linearity and the product rule for differentials, this equals
$$\frac{1}{2\|\dots\|}\big[(d\alpha_{l}^{(k)}x_l-i\,d\beta_{l}^{(k)}x_l+\dots)^T(\alpha_{l}^{(k)}x_l+i\beta_{l}^{(k)}x_l+\dots)+(\alpha_{l}^{(k)}x_l-i\beta_{l}^{(k)}x_l+\dots)^T(d\alpha_{l}^{(k)}x_l+i\,d\beta_{l}^{(k)}x_l+\dots)\big].$$
Rearranging the transposes and opening the brackets, for each $l$ we have
$$\frac{1}{2\|\dots\|}\big[(\alpha_{l}^{(k)}x_l+i\beta_{l}^{(k)}x_l+\dots)^T+(\alpha_{l}^{(k)}x_l-i\beta_{l}^{(k)}x_l+\dots)^T\big]x_l\,d\alpha_{l}^{(k)}$$
$$+\frac{1}{2\|\dots\|}\big[-i(\alpha_{l}^{(k)}x_l+i\beta_{l}^{(k)}x_l+\dots)^T+i(\alpha_{l}^{(k)}x_l-i\beta_{l}^{(k)}x_l+\dots)^T\big]x_l\,d\beta_{l}^{(k)}+\dots$$
$$=\frac{1}{\|\dots\|}\Big(\sum_i\alpha_{i}^{(k)}x_i\Big)^Tx_l\,d\alpha_{l}^{(k)}+\frac{1}{\|\dots\|}\Big(\sum_i\beta_{i}^{(k)}x_i\Big)^Tx_l\,d\beta_{l}^{(k)}+\dots$$
From the correspondence between the gradients $\nabla_z$ and the differential $dg$ of any $g(z_1,z_2,\dots)$,
$$dg=(\nabla_{z_1}g)^T\,dz_1+(\nabla_{z_2}g)^T\,dz_2+\dots,$$
we read off the required gradients for each $l$:
$$\nabla_{\alpha^{(k)}_l}f_k=\frac{1}{\|\dots\|}\Big(\sum_i\alpha_{i}^{(k)}x_i\Big)x_l,\qquad
\nabla_{\beta^{(k)}_l}f_k=\frac{1}{\|\dots\|}\Big(\sum_i\beta_{i}^{(k)}x_i\Big)x_l,$$
where the index $i$ above also runs over the corresponding $\alpha$ and $\beta$ parts of $b^{(k)}$.
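As a sanity check, the final gradient formulas can be verified numerically with central finite differences of $\|v\|$, where $v=\sum_l w_l^{(k)}x_l+b^{(k)}$. All sizes and values below are made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(1)
dim = 4                              # 2^M with M = 2
xs = np.array([2, -1])               # integer inputs x_1, x_2
alphas = rng.normal(size=(2, dim))   # real parts of w_1, w_2
betas = rng.normal(size=(2, dim))    # imaginary parts of w_1, w_2
b = rng.normal(size=dim) + 1j * rng.normal(size=dim)

def norm_of_v(alphas, betas):
    # ||v|| with v = sum_l (alpha_l + i beta_l) x_l + b
    v = b + sum((a + 1j * bt) * x for a, bt, x in zip(alphas, betas, xs))
    return np.linalg.norm(v)

# Analytic gradients from the derivation: Re(v) x_l / ||v|| and Im(v) x_l / ||v||
# (the alpha/beta sums including b's real and imaginary parts are Re(v) and Im(v))
v = b + sum((a + 1j * bt) * x for a, bt, x in zip(alphas, betas, xs))
grad_alpha = np.real(v) * xs[:, None] / np.linalg.norm(v)   # shape (2, dim)
grad_beta = np.imag(v) * xs[:, None] / np.linalg.norm(v)

# Central finite differences with respect to alpha_1 (index 0)
eps = 1e-6
num = np.zeros(dim)
for j in range(dim):
    ap = alphas.copy(); ap[0, j] += eps
    am = alphas.copy(); am[0, j] -= eps
    num[j] = (norm_of_v(ap, betas) - norm_of_v(am, betas)) / (2 * eps)

print(np.max(np.abs(num - grad_alpha[0])))  # maximum discrepancy; should be near zero
```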