I've been unable to solve $(12.11)$ to $(12.13)$ for the derivation of $z_{n i}$ and $b_j$
I've tried solving this for the last 3 hours, and I always end up with a completely different expression which includes the terms $z_{n i}$ and $b_j$. This is taken from Bishop's Pattern Recognition and machine learning book. Specifically I am trying to obtain the term below, i.e step 12.13. $$ b_{j}=\overline{\mathbf{x}}^{\mathrm{T}} \mathbf{u}_{j} \tag{12.13}$$
I know I need to use the chain rule, but I am not getting the correct result!
$$ J=\frac{1}{N} \sum_{n=1}^{N}\left\|\mathbf{x}_{n}-\tilde{\mathbf{x}}_{n}\right\|^{2} \tag{12.11} $$ Consider first of all the minimization with respect to the quantities $\left\{z_{n i}\right\} .$ Substituting for $\tilde{\mathbf{x}}_{n},$ setting the derivative with respect to $z_{n j}$ to zero, and making use of the orthonormality conditions, we obtain $$ z_{n j}=\mathbf{x}_{n}^{\mathrm{T}} \mathbf{u}_{j} \tag{12.12}$$ where $j=1, \ldots, M .$ Similarly, setting the derivative of $J$ with respect to $b_{i}$ to zero, and again making use of the orthonormality relations, gives $$ b_{j}=\overline{\mathbf{x}}^{\mathrm{T}} \mathbf{u}_{j} \tag{12.13}$$
For context, the $\mathbf x_n$'s are points in $\mathbf R^D$, and the $\widetilde{\mathbf x}_n$'s are approximations of the $\mathbf x_n$'s, given by formula $$ \widetilde{\mathbf x}_n := \sum_{i = 1}^M z_{ni} \mathbf u_i + \sum_{i = M + 1}^Db_i \mathbf u_i,$$ where $\{ \mathbf u_i \}$ is some fixed orthonormal basis of $\mathbb R^D$ and $M$ is some integer between $1$ and $D$.
The coefficients $z_{ni}$ are specific to each point $\mathbf x_n$, whereas the coefficients $b_i$ are shared between all the $\mathbf x_n$'s.
Our task is to find the values of the coefficients $z_{ni}$ and $b_i$ that minimise the mean square error $$ J := \frac{1}{N } \sum_n\|\mathbf x_n - \widetilde{\mathbf x}_n \|^2.$$
We can indeed use the chain rule. \begin{align} \frac{\partial J}{\partial b_i} &= \frac{1}{N} \sum_{n=1}^N \frac{\partial(\| \mathbf x_n - \widetilde{\mathbf x}_n\|^2)}{\partial \mathbf x}\frac{\partial \widetilde{\mathbf x}_n}{\partial b_i} \\ &= \frac 1 N \sum_{n=1}^N 2(\mathbf x_n - \widetilde{\mathbf x}_n)^T \mathbf u_i \\ &= 2 \left( \tfrac 1 N \sum_{n=1}^N \mathbf x_n\right)^T \mathbf u_i - \tfrac 2 N \sum_n \left(\sum_{j = 1}^M z_{nj} \mathbf u_j + \sum_{j = M + 1}^Db_j \mathbf u_j \right)^T \mathbf u_i \\ & = 2 \overline{\mathbf x}^T \mathbf u_i - 2b_i\end{align} where in the final line, I used that fact that $\mathbf u^T_i \mathbf u_i = 1$, and $\mathbf u^T_j \mathbf u_i = 0$ if $j \neq i$.
Note on differentiating the squared norm.
Why is $\frac{\partial \| \mathbf x - \widetilde{\mathbf x} \|^2}{\partial \widetilde{\mathbf x}} = 2(\mathbf x - \widetilde{\mathbf x})^T$?
Let's calculate this, component-wise. If $\widetilde x_k$ is a component of $\widetilde{\mathbf x}$, then
$$\frac{\partial \left(\| \mathbf x - \widetilde{\mathbf x} \|^2\right)}{\partial \widetilde{x}_k} = \frac{\partial \left( \sum_{k'=1}^K (x_{k'} - \widetilde x_{k'})^2\right)}{\partial \widetilde{x}_k} = 2(x_{k'} - \widetilde x_{k'}).$$ (In the final step, only the $k' = k$ term contributes.)