On page 29 of Christopher Bishop's *Pattern Recognition and Machine Learning*, he gives the following log-likelihood (equation 1.62):
$$ \ln p(t|\pmb{x}, \pmb{w}, \beta) = - \frac{\beta}{2} \sum_{n=1}^N \{ y(x_n, \pmb{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) $$
He then explains that maximizing this with respect to $\pmb{w}$ is equivalent to minimizing the sum-of-squares error, since every term that does not depend on $\pmb{w}$ can be dropped, which makes sense:
$$ \pmb{w}_{ML} = \arg\min_{\pmb{w}} \frac{1}{2} \sum_{n=1}^N \{ y(x_n, \pmb{w}) - t_n \}^2 $$
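As a quick sanity check, here is a sketch (the synthetic data, noise level, and polynomial model are assumptions, loosely following Bishop's running curve-fitting example) showing that the least-squares $\pmb{w}$ also maximizes the log-likelihood (1.62) for any fixed $\beta$:

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative synthetic data (an assumption): t_n = sin(2*pi*x_n) + noise
N = 50
x = rng.uniform(0.0, 1.0, N)
t = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, N)

# Polynomial model y(x, w) = sum_j w_j x^j, expressed via a design matrix
M = 4
Phi = np.vander(x, M, increasing=True)

def log_lik(w, beta):
    """The log-likelihood (1.62) as a function of w and the precision beta."""
    sq_err = np.sum((Phi @ w - t) ** 2)
    return -beta / 2.0 * sq_err + N / 2.0 * np.log(beta) - N / 2.0 * np.log(2.0 * np.pi)

# Least-squares solution, i.e. the minimizer of (1/2) * sum of squared errors
w_ml, *_ = np.linalg.lstsq(Phi, t, rcond=None)

# For any fixed beta > 0, perturbing w_ml can only lower the log-likelihood,
# because the only w-dependent term in (1.62) is -beta/2 times the squared error
beta = 5.0
```

Since $\beta$ multiplies the (negated) squared error and nothing else depends on $\pmb{w}$, the maximizing $\pmb{w}$ is the same for every $\beta > 0$.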
Immediately following this, the claim is made that maximizing with respect to $\beta$ gives
$$ \frac{1}{\beta_{ML}} = \frac{1}{N} \sum_{n=1}^N \{ y(x_n, \pmb{w}_{ML}) - t_n \}^2 $$
I can't quite get there by the same logic. How would you arrive at this?
I was wrongly assuming that the $\frac{N}{2} \ln \beta$ term was part of the summand. Taking the derivative with respect to $\beta$ and setting it to zero gives the stated result.
$$ \begin{aligned} 0 &= \frac{\partial}{\partial \beta} \Big\lbrack -\frac{\beta}{2} \sum_{n=1}^N \{ y(x_n, \pmb{w}) - t_n \}^2 + \frac{N}{2} \ln \beta - \frac{N}{2} \ln(2\pi) \Big\rbrack \\ 0 &= -\frac{1}{2} \sum_{n=1}^N \{ y(x_n, \pmb{w}) - t_n \}^2 + \frac{N}{2\beta} \\ \frac{N}{2\beta} &= \frac{1}{2} \sum_{n=1}^N \{ y(x_n, \pmb{w}) - t_n \}^2 \\ \frac{1}{\beta} &= \frac{1}{N} \sum_{n=1}^N \{ y(x_n, \pmb{w}) - t_n \}^2 \end{aligned} $$
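The closed form can also be checked numerically. This sketch uses made-up residuals standing in for $y(x_n, \pmb{w}) - t_n$ (an assumption for illustration; any fitted model's residuals would do) and confirms that a fine grid search over $\beta$ peaks essentially at the closed-form $\beta_{ML}$:

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up residuals standing in for y(x_n, w) - t_n (illustrative assumption)
residuals = rng.normal(0.0, 0.5, 200)
N = residuals.size
sq_err = np.sum(residuals ** 2)

def log_lik(beta):
    # The beta-dependent part of (1.62); the constant -N/2 * ln(2*pi) is dropped
    return -beta / 2.0 * sq_err + N / 2.0 * np.log(beta)

# Closed form from the derivation: 1/beta_ML = (1/N) * sum of squared errors
beta_ml = N / sq_err

# Grid search for the maximizer of the log-likelihood over beta
grid = np.linspace(0.1 * beta_ml, 10.0 * beta_ml, 200001)
beta_grid = grid[np.argmax(log_lik(grid))]
```

Note that $1/\beta_{ML}$ is just the mean squared residual, i.e. the usual maximum-likelihood estimate of the noise variance.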