Context
In engineering we can perform a parameter estimation on a system when the dynamics are linear w.r.t. the parameters:
$$ f(\ddot{q}, \dot{q}, q) = \tau \\ \downarrow \\ \phi(\ddot{q}, \dot{q}, q)b = \tau $$
Here $q$ is a vector of system coordinates (e.g. positions), $\tau$ a vector of system inputs (e.g. forces) and $f$ some function describing the dynamics. $b$ is a vector of the parameters (e.g. inertias, spring constants).
If we now have a data set of an experiment with that system, with measured states and inputs, we can stack these equations together:
$$ \begin{bmatrix} \phi_1 \\ \vdots \\ \phi_N \end{bmatrix} b = \begin{pmatrix} \tau_1 \\ \vdots \\ \tau_N \end{pmatrix} \\ A b = y $$
Question
We can find the parameters vector $b$ using the Moore-Penrose pseudo inverse (effectively doing a least squares fit):
$$ b = A^\dagger y $$
But how can I find the uncertainty in the parameters in $b$? (As a standard deviation, or a confidence interval.)
I think I found an answer from MATLAB's documentation: https://nl.mathworks.com/help/curvefit/confidence-and-prediction-bounds.html
I noticed that when $b$ has two parameters and $\tau$ is a scalar, we are fitting a surface. I used MATLAB's fitting toolbox to fit a surface to some dummy data, while doing a manual fit on the same data using the method above. This toolbox also outputs a CI95% interval, so I used the documentation to make the same calculations myself.
It boils down to:
$$ e_b = t\sqrt{S} \\ S = \mathrm{diag}((A^T A)^{-1} * \mathrm{MSE}) \\ \mathrm{MSE} = \frac{\mathrm{SSE}}{\nu} \\ \mathrm{SSE} = \sum_i (y_i - A_ib)^2 $$
Here $e_b$ is the error in $b$ such that $b \pm e_b$ describes the confidence interval.
SSE is the sum-squared-error (the sum of squared residuals) and MSE is the mean-squared-error.
$S$ is the diagonal of the estimated covariance matrix of the coefficient estimates.
$\nu$ is the statistical degrees-of-freedom in the data, typically taken as the number of measurements minus the number of fitted parameters.
$t$ is a value of the inverse cumulative student distribution (corresponding to a selected confidence interval and $\nu$).
$A$ is the design matrix of the data (rows of each datapoint, same as defined in my question).
However, I doubt if this will still apply for dynamic systems with multiple equations of motion, i.e. when $\tau$ is a vector with more than a single element. In this case the large vector $y$ will be mixing multiple 'kinds' of values.
Maybe the matrix $(A^T A)^{-1}$ compensates for this in the error computation?
I would appreciate other thoughts on the matter.