I have two rotations R_1 and R_2 in 3D that are described by the unit quaternions q_1 and q_2. I know that the composition R_2 ∘ R_1 is described by the unit quaternion q_2 x q_1. Fair enough, and easy enough to implement.
However, calculating the product of two quaternions is a bit expensive (16 multiplications and 12 additions, if I am not mistaken). I know that there are smarter ways to compute the rotation of a vector v by a unit quaternion q than directly applying the q x [v, 0] x q* formula (see for example https://gamedev.stackexchange.com/questions/28395/rotating-vector3-by-a-quaternion and the discussion in the first answer; their "Rodrigues" formula is about 35% less expensive to compute).
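For reference, the cheaper vector-rotation trick discussed in that linked answer can be sketched in plain Python as follows (`rotate` is an illustrative name, not taken from the linked code):

```python
def rotate(q, v):
    """Rotate 3D vector v by unit quaternion q = (w, x, y, z) using
    t = 2 (u x v),  v' = v + w t + (u x t),  where u = (x, y, z).
    Cheaper than the full q x [v, 0] x q* sandwich product."""
    w, ux, uy, uz = q
    vx, vy, vz = v
    # t = 2 * cross(u, v)
    tx = 2.0 * (uy * vz - uz * vy)
    ty = 2.0 * (uz * vx - ux * vz)
    tz = 2.0 * (ux * vy - uy * vx)
    # v' = v + w * t + cross(u, t)
    return (vx + w * tx + uy * tz - uz * ty,
            vy + w * ty + uz * tx - ux * tz,
            vz + w * tz + ux * ty - uy * tx)
```

For example, rotating (1, 0, 0) by a 90° rotation about the z axis, q = (cos 45°, 0, 0, sin 45°), yields (0, 1, 0) up to rounding.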
My question is: are there similar "smarter ways" to compute the product of 2 unit quaternions / composition of 2 rotations in 3D, in such a way as to need fewer operations?
Edit 1: I truly need to be able to compose these rotations, not just to take a "rotation of a rotation", as discussed in the comments. This is because, while I "reduce" the problem description to composing 2 rotations, in reality I will be composing the "current rotation" with some "additional new rotation on top of it" many times.
This is not an answer, but an extended comment related to the "faster method" part of the question.
Quaternion multiplication, i.e. the Hamilton product of quaternions $\mathbf{q}_2 = r_2 + x_2 \mathbf{i} + y_2 \mathbf{j} + z_2 \mathbf{k}$ and $\mathbf{q}_1 = r_1 + x_1 \mathbf{i} + y_1 \mathbf{j} + z_1 \mathbf{k}$, is $$\begin{aligned} \mathbf{q} = \mathbf{q}_2 \times \mathbf{q}_1 = ~ & ( r_2 r_1 - x_2 x_1 - y_2 y_1 - z_2 z_1 ) \\ ~ + ~ & ( r_2 x_1 + x_2 r_1 + y_2 z_1 - z_2 y_1 ) ~ \mathbf{i} \\ ~ + ~ & ( r_2 y_1 - x_2 z_1 + y_2 r_1 + z_2 x_1 ) ~ \mathbf{j} \\ ~ + ~ & ( r_2 z_1 + x_2 y_1 - y_2 x_1 + z_2 r_1 ) ~ \mathbf{k} \\ \end{aligned}$$ (corresponding to a rotation by $\mathbf{q}_1$ followed by a rotation by $\mathbf{q}_2$), and indeed does consist of $16$ scalar multiplications and $12$ additions or subtractions, but it is also one of the operations that vectorizes extremely well (if suitably written) on hardware architectures with single-instruction, multiple-data (SIMD) support, especially SSE3 and AVX on AMD64/x86-64 (64-bit Intel and AMD processors).
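Written out directly, the product above is (a plain-Python sketch; `qmul` is an illustrative name):

```python
def qmul(q2, q1):
    """Hamilton product q2 x q1, quaternions given as (r, x, y, z) tuples.
    16 scalar multiplications and 12 additions/subtractions, as in the
    formula above; q2 x q1 rotates by q1 first, then by q2."""
    r2, x2, y2, z2 = q2
    r1, x1, y1, z1 = q1
    return (r2*r1 - x2*x1 - y2*y1 - z2*z1,
            r2*x1 + x2*r1 + y2*z1 - z2*y1,
            r2*y1 - x2*z1 + y2*r1 + z2*x1,
            r2*z1 + x2*y1 - y2*x1 + z2*r1)
```

As a sanity check, $\mathbf{i} \times \mathbf{j} = \mathbf{k}$: `qmul((0, 1, 0, 0), (0, 0, 1, 0))` returns `(0, 0, 0, 1)`.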
For example, using SSE, a single 128-bit register can describe a full quaternion in IEEE 754-2008 binary32 aka single-precision floating-point format; using AVX, a single 256-bit register can describe a full quaternion in IEEE 754 binary64 aka double-precision floating-point format. Then, the core cost of the Hamilton product is just four vector multiplications and three vector additions, if one of the multiplicands is shuffled (negating two components each time) into four different forms.
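That decomposition can be illustrated in NumPy: each term is one component of $\mathbf{q}_2$ broadcast against a sign-shuffled copy of $\mathbf{q}_1$. This mirrors the structure the SSE/AVX shuffles and sign flips would produce; it is a sketch of the data flow, not actual intrinsics code:

```python
import numpy as np

def qmul_simd_style(q2, q1):
    """Hamilton product q2 x q1, quaternions as arrays [r, x, y, z],
    arranged as four broadcast multiplies and three vector additions."""
    r1, x1, y1, z1 = q1
    # Four sign-shuffled copies of q1 (each of the last three negates
    # exactly two components) -- what the SIMD shuffles would prepare:
    p0 = np.array([ r1,  x1,  y1,  z1])
    p1 = np.array([-x1,  r1, -z1,  y1])
    p2 = np.array([-y1,  z1,  r1, -x1])
    p3 = np.array([-z1, -y1,  x1,  r1])
    # Four multiplies, three adds:
    return q2[0] * p0 + q2[1] * p1 + q2[2] * p2 + q2[3] * p3
```

Expanding the four terms component by component recovers exactly the $16$-multiplication formula above.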
Furthermore, since all the quaternion components lie between $-1$ and $+1$, the four-term sums in the product stay within $-4$ to $+4$, so that range suffices for the representation, and one can use Qm.n fixed-point integer representation for the quaternions; real component $f$ is then approximated by integer value $i = \lfloor f \cdot 2^n \rceil$ (where $\lfloor \rceil$ represents rounding to nearest integer), $-2^{m+n-1} \le i \lt 2^{m+n-1}$, so that even slow 8-bit microcontrollers can use unit quaternions efficiently (for example for orientation tracking using cheap IMU modules like the MPU9150). In some robotics etc. implementations on 32-bit ARM architectures without hardware floating-point support, this can yield very significant speedups (and using the Q3.28 format, with its 28 fractional bits, yields up to four additional bits of precision per component compared to single-precision floating point and its 24-bit significand).
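As an illustration of the fixed-point approach, here is a Q3.28 sketch in Python; Python's arbitrary-precision integers stand in for what would be 32-bit values and a 64-bit intermediate product in C:

```python
N = 28  # Q3.28: sign + 3 integer bits, 28 fractional bits (fits a 32-bit int)

def to_fixed(f):
    """Real value f -> Qm.n integer i = round(f * 2^n)."""
    return int(round(f * (1 << N)))

def from_fixed(i):
    """Qm.n integer back to a real value."""
    return i / (1 << N)

def fx_mul(a, b):
    """Product of two Q3.28 values, rounded back to Q3.28.
    In C this would be a 32x32 -> 64-bit multiply followed by a shift."""
    return (a * b + (1 << (N - 1))) >> N

def qmul_fixed(q2, q1):
    """Hamilton product q2 x q1 on fixed-point quaternions (r, x, y, z)."""
    r2, x2, y2, z2 = q2
    r1, x1, y1, z1 = q1
    return (fx_mul(r2, r1) - fx_mul(x2, x1) - fx_mul(y2, y1) - fx_mul(z2, z1),
            fx_mul(r2, x1) + fx_mul(x2, r1) + fx_mul(y2, z1) - fx_mul(z2, y1),
            fx_mul(r2, y1) - fx_mul(x2, z1) + fx_mul(y2, r1) + fx_mul(z2, x1),
            fx_mul(r2, z1) + fx_mul(x2, y1) - fx_mul(y2, x1) + fx_mul(z2, r1))
```

Composing two 90° z-axis rotations this way reproduces the 180° result $(0, 0, 0, 1)$ to within the Q3.28 quantization error.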
Therefore, depending on the hardware (and of course programming language and libraries) used, you probably want to use an optimized unit quaternion (versor) library for the best real-world performance, as many current processors have features that make the theoretically more costly form much more efficient (up to 4× faster on SSE/AVX) than it looks.
I have written a plain-Python module extending tuple to represent versors, complementing 3D vector and matrix types (and their respective operations), but my use cases have involved at most a few thousand quaternion operations per second; so optimizing that code would have had essentially zero impact on the efficiency of the Python code.
So, if these unit quaternion operations truly are a bottleneck for your use case – do check, do not just assume they are! – you might wish to consider asking this question on the programming sub-forum here instead. Personally, for Python, I'd probably implement a basic versor library in C, with both a vectorized `<immintrin.h>` implementation and a "naïve" non-vectorized one, plus both C and Python test suites to verify them. It would expose a four-component tuple-like Python data type describing a unit quaternion (and would use the vector components of the same type to describe ordinary 3D vectors, for optimal vectorization in exchange for a bit more memory used per vector).
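The tuple-like type could look roughly as follows; this is a minimal pure-Python sketch of the idea, with illustrative names, not the actual module described above:

```python
import math

class Versor(tuple):
    """Unit quaternion (r, x, y, z) as an immutable tuple subclass;
    * composes rotations via the Hamilton product."""

    def __new__(cls, r, x, y, z):
        # Renormalize on construction so the result stays a versor
        # despite floating-point drift.
        n = math.sqrt(r*r + x*x + y*y + z*z)
        return super().__new__(cls, (r/n, x/n, y/n, z/n))

    def __mul__(self, other):
        r2, x2, y2, z2 = self
        r1, x1, y1, z1 = other
        return Versor(r2*r1 - x2*x1 - y2*y1 - z2*z1,
                      r2*x1 + x2*r1 + y2*z1 - z2*y1,
                      r2*y1 - x2*z1 + y2*r1 + z2*x1,
                      r2*z1 + x2*y1 - y2*x1 + z2*r1)
```

With this, `q_new = q_extra * q_current` is the repeated "new rotation on top of the current one" composition the question asks about (and the renormalization keeps the accumulated product from drifting off the unit sphere).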