I am learning about nonlinear optimization, where I have a data vector $\mathbf{d}$ and an unknown model vector $\mathbf{m}$. I can calculate predicted data from a model using a nonlinear forward function $\mathbf{F}(\mathbf{m})$.
The standard way to solve this nonlinear problem is to minimize a cost function of the form
$$U = ||\mathbf{d}-\mathbf{F}(\mathbf{m})||_2^2 +\mathrm{Some \;Regularization}$$
where $||\cdot||_2$ is the L2 (Euclidean) norm.
This is done by differentiating $U$ with respect to $\mathbf{m}$ and setting the derivative to zero:
$$\frac{\partial U}{\partial \mathbf{m}} = 0$$
The solution is then often stated as:
$$0 = \mathbf{J^Td}-\mathbf{J^T Jm} + \mathrm{Some \; Regularization\; Stuff}$$
where $\mathbf{J} = \partial\mathbf{F}/\partial\mathbf{m}$ is the Jacobian of the forward function.
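To make the question concrete, here is a minimal numerical check of the part I do follow, using a toy exponential model I made up ($F_i(\mathbf{m}) = m_0 e^{-m_1 t_i}$, no regularization): the gradient of $U$, computed by finite differences, matches $-2\,\mathbf{J}^T(\mathbf{d}-\mathbf{F}(\mathbf{m}))$, so the $\mathbf{J}^T$ term really does appear. What I am missing is the step-by-step derivation of why.

```python
import numpy as np

def F(m, t):
    # Toy nonlinear forward model (my own example, not from any particular text)
    return m[0] * np.exp(-m[1] * t)

def cost(m, d, t):
    # U = ||d - F(m)||_2^2, with no regularization term
    r = d - F(m, t)
    return r @ r

def jacobian(m, t, h=1e-7):
    # Central-difference Jacobian: J[i, j] = dF_i / dm_j
    J = np.zeros((t.size, m.size))
    for j in range(m.size):
        e = np.zeros_like(m)
        e[j] = h
        J[:, j] = (F(m + e, t) - F(m - e, t)) / (2 * h)
    return J

t = np.linspace(0.0, 1.0, 20)
d = F(np.array([2.0, 1.5]), t)   # noise-free synthetic data
m = np.array([1.0, 1.0])         # trial model, away from the truth

# Gradient of U by central differences...
g = np.zeros_like(m)
h = 1e-7
for j in range(m.size):
    e = np.zeros_like(m)
    e[j] = h
    g[j] = (cost(m + e, d, t) - cost(m - e, d, t)) / (2 * h)

# ...agrees with -2 J^T (d - F(m)), which is where J^T enters
J = jacobian(m, t)
print(np.allclose(g, -2 * J.T @ (d - F(m, t)), atol=1e-5))  # True
```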
Can someone explain, step by step, where the Jacobian and its transpose come from?