I'm observing a QR code from a device. I can detect precisely where in the screen are each of the 4 corners, and identify them uniquely. I'm trying to find the transformation (rotation + translation) my device did to see that from a standard position, centered on the QR code at a given distance.
I don't know the exact position of the QR code corners relatively to the device since I'm only observing a projection, but for each corner I know a line on which it lies.
I'm first trying to think in the camera coordinated, assuming the camera is fixed and the QR code moving.
Assuming the position of 4 points is known to initially be:
(-a, -a, 1), (a, -a, 1), (a, a, 1), (-a, a, 1)
with a known
which transformation leads to them to be on
αX, βY, γZ, δW
with α, β, γ, δ in [0, +inf] are unknown; and X, Y, Z, W known
?
I'm assuming the QR code shape is not changing, so the transformation should only be rotation + translation. I'm not sure I need the 4 corners since and 3 might be enough but I've knowledge about 4 corners.
In general, one proceeds by solving a system of equations generated by the known world/image correspondences to reconstruct the projection matrix, and then decomposes it as required. These equations are often nonlinear. Since you’ve got access to the intrinsic matrix $K$, though, you can reconstruct both the camera’s location and orientation directly. In the following, image points and vectors are indicated by bold lower-case letters and world/camera coordinates by upper-case. In both cases, a tilde indicates inhomogeneous coordinates.
As always, we can decompose the image projection matrix as $K[R\mid-R\tilde{\mathbf C}]$, where $K$ is the $3\times3$ intrinsic matrix, $R$ is a $3\times3$ rotation matrix, and $\mathbf C$ is the camera’s center in world coordinates. As with any transformation matrix, the columns of $R$ are the images of the world-coordinate basis vectors, so they give the camera’s axes in world coordinates. Since the inverse of an orthogonal matrix is its transpose, this means that the rows of $R$ are the camera-coordinate relative unit direction vectors of the world coordinate axes. Knowing $K$, we can take advantage of this to reconstruct $R$.
Let the corners of the QR code be $\mathbf Q_1$ through $\mathbf Q_4$, in the same order as you’ve got them in the question, and $\mathbf q_1$ through $\mathbf q_4$ the corresponding points in the image. Compute the vanishing points of opposite sides of the QR code image: $$\begin{align}\mathbf{vp}_{\text{hor}} & =(\mathbf q_1\times\mathbf q_2)\times(\mathbf q_4\times\mathbf q_3) \\ \mathbf{vp}_{\text{ver}} & =(\mathbf q_1\times\mathbf q_4)\times(\mathbf q_2\times\mathbf q_3). \end{align}$$ These are the images of the points at infinity of the world $x$- and $y$-axes, respectively. The matrix $K$ represents an affine map between points in the image and lines through the origin in camera coordinates, so $K^{-1}\mathbf{vp}_{\text{hor}}$ and $K^{-1}\mathbf{vp}_{\text{ver}}$ are direction vectors of these axes in inhomogeneous camera coordinates. Normalizing them will therefore give us the first two rows of $R$, and the third row is their normalized cross product. There’s a sign ambiguity that needs to be dealt with, though. If the horizontal vanishing point, for example, is to the “left” of the QR code image, the vector we get by back-mapping points in the wrong direction and needs to have its signs flipped. There are various ways to test for this. A fairly simple way is to scale $\mathbf{vp}_{\text{hor}}$ so that its third component is $1$ and then multiply the result of back-projection by the sign of $\det\begin{bmatrix}\mathbf q_2^T & \mathbf{vp}_{\text{hor}}^T & \mathbf q_3^T\end{bmatrix}$, or the corresponding scalar triple product $-(\mathbf q_2\times\mathbf q_3)\cdot\mathbf{vp}_{\text{hor}}$. The latter might be more convenient in practice since you’ve already had to compute the included cross product. A similar sign correction will need to be made for the vertical vanishing point. $K$ might not be orientation-preserving, either, so you should also multiply by the sign of $\det K$ as well.
With the rotation matrix in hand, all that remains is to compute the camera location $\mathbf C$. There are various ways to do this. One is to observe that $H=KR$ is a homography between the image and the plane at infinity in world coordinates, so $H^{-1}\mathbf q_i=R^TK^{-1}\mathbf q_i$ is the ideal point of the line through $\mathbf Q_i$ and $\mathbf C$. In theory, one simply needs to compute the intersection of two of these lines to find $\mathbf C$. In practice, though, it could be that no pair of them intersects because of numerical inaccuracies. One way to deal with this is to find the intersection of the planes defined by lines through adjacent vertices. The homogeneous vector that represents such a plane can be found as the null vector of the matrix that has these points as rows or via a generalization of the cross product, which will involve computing a $4\times4$ determinant. For instance, the vector $\mathbf \pi_{12}$ for the plane through $\mathbf C$, $\mathbf Q_1$ and $\mathbf Q_2$ is the null vector of $$\begin{bmatrix}\mathbf Q_1^T\\\mathbf Q_2^T\\(R^TK^{-1}\mathbf q_1)^T\end{bmatrix}.$$ For each such plane $\pi$ we get the equation $\mathbf\pi\cdot\mathbf C=0$, and so $\mathbf C$ is the solution to this system of equations, that is, the null vector of the matrix that has these plane vectors for rows. Using three of the four planes is sufficient, but you might get better results by using all four and computing a least-squares solution to the overdetermined system.
Another way to find $\mathbf C$ takes advantage of the fact that for central projection the images of an object on parallel image planes are similar. Since the plane of the QR code is parallel to the world-coordinate $x$-$y$ plane, its normal $\tilde{\mathbf N}=(n_x,n_y,n_z)$ in camera coordinates is the third row of $R$. It’s an easy exercise† to work out that the point at which a ray through the origin given by a direction vector $\tilde{\mathbf V}$ intersects the plane $(n_x,n_y,n_z,d)$ has inhomogeneous coordinates $$-{d\over\tilde{\mathbf N}\cdot\tilde{\mathbf V}}\tilde{\mathbf V}.$$ If you compute these values for each of $K^{-1}\mathbf q_i$ and set the area of the resulting figure equal to the area of the QR code, you will get an equation of the form $ad^2=\operatorname{area}(\text{QR code})$, from which $d=\sqrt{\operatorname{area}(\text{QR code})/a}$. (By construction, we want the positive root of the equation.) Substitute this value of $d$ into the back-projected points and find the intersection of the diagonals. This point is the origin of the world coordinate system, so rotating it by $R^T$ and negating finally produces $\mathbf C$ in world coordinates. (N.B.: I’ve assumed here that the QR code starts off at $z=0$ instead of $z=1$ as you had it, but that’s an easy adjustment to make to the computation.)
All of the above comes with a big caveat: without further information, it’s impossible to find the actual camera position. This is because the distance to the image plane is conflated with internal scale factors (e.g., sensor pixel size) in the intrinsic matrix $K$. Zooming in and out is indistinguishable from shifting the image plane.
One other caveat is that I’ve assume the convention that the camera points in the direction of the negative $z$-axis in camera coordinates. The opposite convention is quite commmon Adapting this solution to that convention is a matter a a few sign changes here and there. That is to say, a simple conversion that’ll likely take several tries to get right ;)
I’m sure that there are other ways to do this, perhaps even more computationally efficient or stable, but the above is a direct and, I hope, easy to follow computation.
Addition: If you’re going the intersecting planes route to compute $\mathbf C$, since you’ve already computed the lines that correspond to the edges of the QR code image, back-mapping those lines and using the results to generate a system of plane equations to be solved is much more efficient than using the vertex images.
Let $\mathbf l_{ij}=\mathbf q_i\times\mathbf q_j$ be the line in the image through adjacent vertices $\mathbf q_i$ and $\mathbf q_j$. This is a covariant vector, so it is transformed by the inverse transpose of the point transformation. The back projection $\tilde{\mathbf L}_{ij}^T=\mathbf l_{ij}^TKR$ is a line at infinity that corresponds to the normal of the plane that projects to $\mathbf l_{ij}$. We know that this plane passes through $\mathbf Q_i$, $\mathbf Q_j$ and $\mathbf C$, so this generates the two linear equations $\mathbf l_{ij}^TKR\tilde{\mathbf C}=\mathbf l_{ij}^TKR\tilde{\mathbf Q}_i$ and $\mathbf l_{ij}^TKR\tilde{\mathbf C}=\mathbf l_{ij}^TKR\tilde{\mathbf Q}_j$. Collect these up and solve them using your favorite method.
†: A handy formula to know is that the matrix of the central projection from a view point $\mathbf{VP}$ onto a plane $\mathbf\pi$ is $$M = \mathbf{VP}\otimes\mathbf\pi-\mathbf{VP}\cdot\mathbf\pi\,I = \mathbf{VP}\mathbf\pi^T-\mathbf \pi^T\mathbf{VP}\,I.$$ Taking the origin as the view point and the image plane $\mathbf\pi=(n_x,n_y,n_z,d)$, $$M=\begin{bmatrix}-d&0&0&0\\0&-d&0&0\\0&0&-d&0\\n_x&n_y&n_z&0\end{bmatrix}$$ and expanding $M\mathbf v$ produces the formula used above for the intersection of the ray $\mathbf v$ and the plane $\mathbf\pi$. The formula is also fairly easy to work out by examining similar triangles.