Point set registration
Point set registration is the problem of finding the rigid transformation that best aligns a set of points $\{\mathbf{a}_i\}$ to a set of corresponding points $\{\mathbf{b}_i\}$.
More precisely, it finds the rotation matrix $R$ (an orthogonal matrix with $\det R = 1$) and translation vector $\mathbf{t}$ that minimize $\sum_i\lVert\mathbf{a}_i-R\mathbf{b}_i-\mathbf{t}\rVert^2$.
Papers such as this one show that the problem can be simplified by subtracting from each set of points its own centroid. This reduces the optimal $\mathbf{t}$ to zero, leaving only $R$ to be solved for.
Finding the optimal $R$ is then Wahba's problem, which can be solved in closed form via the singular value decomposition of the matrix $\sum_i\mathbf{a}_i\mathbf{b}_i^T$ into the form $USV^T$, followed by
$$R=U\begin{bmatrix}1&0&0\\0&1&0\\0&0&\det(U)\det(V)\end{bmatrix}V^T$$
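As a minimal sketch, this closed-form solution can be written in Python with NumPy; the function name and array layout here are my own choices, not from the text above:

```python
import numpy as np

def register(a, b):
    """Find the proper rotation R and translation t minimizing
    sum_i || a_i - R b_i - t ||^2.

    a, b: (N, 3) arrays of corresponding points.
    """
    a0 = a.mean(axis=0)          # centroid of {a_i}
    b0 = b.mean(axis=0)          # centroid of {b_i}
    H = (a - a0).T @ (b - b0)    # sum_i (a_i - a0)(b_i - b0)^T
    U, S, Vt = np.linalg.svd(H)
    # Correction term ensures det(R) = +1 (a rotation, not a reflection).
    D = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt.T)])
    R = U @ D @ Vt
    t = a0 - R @ b0              # translation recovered from the centroids
    return R, t
```

The centroid subtraction inside the function is exactly the simplification described above: it decouples $\mathbf{t}$ from $R$, so the SVD only has to solve the rotation part.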
Question
Suppose that each point in the sets to be registered is associated not only with a position but also with a direction (such as a surface normal). Can we extend the closed-form solution above to registration of such oriented point sets?
Instead of minimizing $\sum_i\lVert\mathbf{a}_i-R\mathbf{b}_i\rVert^2$ we would have to minimize $\sum_i\left(\lVert\mathbf{a}_i-R\mathbf{b}_i\rVert^2+F(\mathbf{\hat{p}}_i,R\mathbf{\hat{q}}_i)\right)$ where
- $\{\mathbf{\hat{p}}_i\}$ are unit vectors associated with point positions $\{\mathbf{a}_i\}$
- $\{\mathbf{\hat{q}}_i\}$ are unit vectors associated with point positions $\{\mathbf{b}_i\}$
- $F$ is a function that penalizes mismatching directions
If you restrict to the typical case, $$F\left(\hat{\mathbf{p}}_i, \hat{\mathbf{q}}_i\right) = k^2 \left\lVert \hat{\mathbf{p}}_i - \hat{\mathbf{q}}_i \right\rVert^2$$ where $k^2$ is the relative weight of orientation error with respect to position error ($0 \le k \in \mathbb{R}$), then you can use ordered sets of $2N$ vectors $\mathbf{u}_i$ and $\mathbf{v}_i$ and your original algorithm, via $$\left\lbrace \begin{aligned} \mathbf{u}_i &= \mathbf{a}_i - \mathbf{a}_0 \\ \mathbf{u}_{i+N} &= k \left( \hat{\mathbf{p}}_i - \hat{\mathbf{p}}_0 \right) \\ \mathbf{v}_i &= \mathbf{b}_i - \mathbf{b}_0 \\ \mathbf{v}_{i+N} &= k \left( \hat{\mathbf{q}}_i - \hat{\mathbf{q}}_0 \right) \\ \end{aligned} \right.$$ where the subscript $0$ indicates the centroid, $$\mathbf{a}_0 = \frac{1}{N} \sum_{i=1}^N \mathbf{a}_i , \quad \mathbf{b}_0 = \frac{1}{N} \sum_{i=1}^N \mathbf{b}_i , \quad \hat{\mathbf{p}}_0 = \frac{1}{N} \sum_{i=1}^N \hat{\mathbf{p}}_i , \quad \hat{\mathbf{q}}_0 = \frac{1}{N} \sum_{i=1}^N \hat{\mathbf{q}}_i$$

Obviously, $k = 1$ weights the errors in the two vector types equally. $k \lt 1$ weights position error more heavily than orientation error, and $k \gt 1$ weights orientation error more heavily than position error.
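As a sketch, this construction can be layered on any plain SVD-based registration routine; the code below builds the concatenated sets exactly as defined above and then runs the same covariance-plus-SVD step as the question. Function and variable names are illustrative:

```python
import numpy as np

def register_oriented(a, b, p, q, k=1.0):
    """Rotation R for oriented point sets via the concatenation trick.

    a, b: (N, 3) corresponding positions; p, q: (N, 3) unit direction
    vectors attached to a and b respectively; k: relative weight of
    orientation error versus position error.
    """
    # Concatenated sets u, v of 2N vectors, each block centered on its
    # own centroid and the orientation block scaled by k.
    u = np.vstack([a - a.mean(axis=0), k * (p - p.mean(axis=0))])
    v = np.vstack([b - b.mean(axis=0), k * (q - q.mean(axis=0))])
    # Both concatenated sets now have centroid at the origin, so the
    # plain algorithm reduces to the SVD of sum_i u_i v_i^T.
    H = u.T @ v
    U, S, Vt = np.linalg.svd(H)
    D = np.diag([1.0, 1.0, np.linalg.det(U) * np.linalg.det(Vt.T)])
    R = U @ D @ Vt
    t = a.mean(axis=0) - R @ b.mean(axis=0)  # translation for the positions
    return R, t
```

When the two oriented sets are related by an exact rigid motion, this recovers the rotation and translation exactly for any $k > 0$, since every $\mathbf{u}_i$ is then precisely $R\mathbf{v}_i$.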
It can feel odd to treat two different types of vectors in this manner – concatenating them into a single ordered set! – but it works out correctly, both mathematically and in practical numerical calculations.
The idea is that you work with vectors whose centroid is at the origin. Then the quantity to be minimised, $C$, is $$\begin{aligned} C &= \sum_{i=1}^N \left\lVert \mathbf{a}_i - \mathbf{R} \mathbf{b}_i \right\rVert^2 + \sum_{i=1}^N k^2 \left\lVert \hat{\mathbf{p}}_i - \mathbf{R} \hat{\mathbf{q}}_i \right\rVert^2 \\ &= \sum_{i=1}^N \left\lVert \mathbf{a}_i - \mathbf{R} \mathbf{b}_i \right\rVert^2 + \sum_{i=1}^N \left\lVert k \hat{\mathbf{p}}_i - \mathbf{R} \left( k \hat{\mathbf{q}}_i \right) \right\rVert^2 \end{aligned}$$ and the approach of scaling the orientation vectors by $k$ (with respect to their centroid) and putting both types of vectors in a single ordered set should now be obvious: after this manipulation, each position and each orientation vector looks identical from the point of view of the algorithm.
The same approach works for any number of additional vectors per point, as long as a constant relative error weight $k$ (one per additional vector type) suffices.
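A sketch of that generalization, under the same assumptions as above: each vector type contributes its own centered, weighted block to the concatenated sets, and the resulting pair feeds the plain SVD algorithm unchanged. The helper name is my own:

```python
import numpy as np

def stacked_sets(pairs):
    """Build the concatenated sets from (A, B, k) triples, one per vector type.

    Each A, B is an (N, 3) block of corresponding vectors (positions with
    k = 1, normals, velocities, ...); k is that type's relative weight.
    Returns two stacked arrays u, v whose cross-covariance sum_i u_i v_i^T
    feeds the plain SVD algorithm.
    """
    u = np.vstack([k * (A - A.mean(axis=0)) for A, B, k in pairs])
    v = np.vstack([k * (B - B.mean(axis=0)) for A, B, k in pairs])
    return u, v
```

Because every block is centered on its own centroid, the stacked sets always have their overall centroid at the origin, which is what the single-set algorithm expects.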
Finally, note that the generic form of $F$ is typically a monotonically increasing nonnegative function $f$ of the norm of the difference of the two arguments, $$F\left(\hat{\mathbf{p}}, \hat{\mathbf{q}}\right) = f\left(\left\lVert \hat{\mathbf{p}} - \hat{\mathbf{q}} \right\rVert\right)$$ because anything else would introduce an angular dependence: the resulting matrix $\mathbf{R}$ would change if a common rotation were applied to both sets, which makes no geometric sense. If you require that a common rotation does not affect the results, you end up with $F$ having the above form. Monotonic increase is needed for the minimization to make sense (otherwise some larger orientation errors would be preferable to smaller ones, which again makes no geometric sense), and any common minimum cost $F_\min$ just pops out as a constant $N F_\min$ in the quantity to be minimized, so it has no effect at all on the results; you can therefore arbitrarily choose the minimum cost to be zero.
However, it is difficult to see why a nonlinear $f$ would be preferable to a linear one. If there is a perfectly matching rotation matrix $\mathbf{R}$, the linearity or nonlinearity of $f$ is irrelevant. If there is no rotation $\mathbf{R}$ that brings one ordered set into coincidence with the other, then linear least squares may not be the optimal choice in the first place.
For example, in classical molecular dynamics simulations, many temperature-control mechanisms are such that if the system contains rotating clusters, the control affects the clusters' angular velocity rather than the temperature per se. To deal with this, the angular velocity of each cluster needs to be estimated. A single time step contains too much noise, and over many time steps the atoms on the surface of the cluster may move about (even leave the cluster, or new atoms may impact on it) – and because they sit at the outer edge of the cluster, their weight in the angular-velocity estimate is the largest. Such approaches therefore first find the invariant part of the cluster – analogous to this question, removing any vector pairs that appear to contain error. The same approach then works there too, with both atom positions and velocity vectors, and a constant weight $k$ suffices to balance position against velocity when finding $\mathbf{R}$.
That said, there could be cases where a nonlinear $f$ is needed, but I haven't seen one; and the times I thought I needed one turned out to be cases where I needed to pre-filter my ordered sets of vectors first and then use a linear $f$.