Iʼm trying to find the intrinsic and some of the extrinsic parameters of an ideal pinhole camera, based on four chosen points for which I know both world and image co‑ordinates.
I realize camera solving is a very commonly asked question, but the solutions Iʼve read about so far donʼt quite fit my problem:
Many “camera resectioning” or “camera calibration” algorithms involve solving from unknown world points and require multiple images from different camera stations, whereas I only have a single image but I do know the world co‑ordinates of chosen points.
The “Perspective-n-Point” problem is closer, but this one assumes you know the camera intrinsics, which I donʼt. However, I do have several known constraints that I suspect make my problem easier.
Often solutions involve bundle adjustment to try to solve imperfections in real-world cameras. In my case I can assume there are none.
Intuitively I suspect thereʼs enough information to obtain a solution, but if there isnʼt, Iʼd like to know why and perhaps what constraints could be added to make it solvable.
Context
Iʼve developed software to produce adventure games, which typically use a stationary third-person view for each play area, loosely called a “room.” My system tracks everything in realistic world co‑ordinates and as such has to be told the camera parameters that were used to render each “room.” Typically the scenes are produced in 3D DCC software such as Maya, and in that case the camera parameters are trivially known because the artist actually set them in the DCC software.
Usually this works great. Right now, however, I am trying to incorporate some artwork from an existing game which was drawn by hand, so in this case I have no camera parameters. Iʼm hoping to get a rough estimate of the parameters by picking a large orthogonal rectangle on the floor and estimating contextually what the lengths of its edges should be in my world units.
Obviously the accuracy of a perspective transform extracted from a hand painting is going to be terrible, but as long as it doesnʼt explode into some strange degeneracy, it beats spending hours of trial and error per picture in Maya. Some amount of error should go unnoticed due to how the artwork is used.
In solving, we can assume the camera has no rotation, no lateral translation ($x = 0$), and is above the floor ($y > 0$). We can assume that the points are on the floor ($y = 0$) and that they form a rectangle whose closest edge is parallel to the image plane. I donʼt think it matters but we could also set the closest points to $z = 0$ and the camera at $z < 0$.
One important thing is that we canʼt assume the principal point is $(0,0)$ in the image, and in fact it usually wonʼt be, at least not in $y$. The artists often manipulated the vanishing point in a manner equivalent to a substantial optical center shift, typically in the vertical axis. (Think of an old bellows camera, with the film plate mounted on rails so that you can shift around the film plate without moving the lens.)
The Math
Iʼll examine the problem from the perspective of how my system would transform world co‑ordinates to image co‑ordinates if it did know the camera parameters. The world is left-handed Y-up.
Let $N$ and $F$ be the near and far clipping distances, and $\Theta_H$ and $\Theta_V$ be the horizontal and vertical field of view.
Let $S_H$ and $S_V$ be the horizontal and vertical optical center shifts, in multiples of the respective image dimension.
Let $A$ be the world space point of the camera aperture.
Consider a chosen point $W$ in world space.
Given the camera parameters above, we can find a non-homogeneous co‑ordinate, $C$, in normalized clip space, wherein the image covers $[-1,+1]$ in $x$ and $y$ and $[0,1]$ in $z$.
The following are constant for a given perspective transform:
$$\begin{align} w = \frac{1}{\tan\left( \frac{\Theta_H}{2} \right)},\tag{1}\\[2ex] h = \frac{1}{\tan\left( \frac{\Theta_V}{2} \right)},\tag{2}\\[2ex] Q = \frac{F}{F - N}.\tag{3} \end{align}$$
Using that we transform $W$ to $C$ as follows, assuming there is no rotation on the camera:
$$\begin{align} C_x = \frac{w(W_x - A_x)}{W_z - A_z} - 2S_H,\tag{4}\\[2ex] C_y = \frac{h(W_y - A_y)}{W_z - A_z} - 2S_V,\tag{5}\\[2ex] C_z = \frac{Q(W_z - A_z) - QN}{W_z - A_z}.\tag{6} \end{align}$$
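As a concrete reading of equations (1) through (6), here is a minimal Python sketch of the forward transform. The function name `world_to_clip` and its parameter names are made up for illustration, and it assumes the fields of view are given in radians and that points are passed as `(x, y, z)` tuples.

```python
import math

def world_to_clip(W, A, fov_h, fov_v, near, far, s_h=0.0, s_v=0.0):
    """Map world point W to clip space for an unrotated camera at A.

    Hypothetical helper: fov_h and fov_v are assumed to be in radians,
    s_h and s_v are the optical center shifts S_H and S_V.
    """
    w = 1.0 / math.tan(fov_h / 2.0)        # equation (1)
    h = 1.0 / math.tan(fov_v / 2.0)        # equation (2)
    q = far / (far - near)                 # equation (3)

    dz = W[2] - A[2]                       # depth of the point in front of the camera
    if dz <= 0.0:
        raise ValueError("point is not in front of the camera")

    c_x = w * (W[0] - A[0]) / dz - 2.0 * s_h   # equation (4)
    c_y = h * (W[1] - A[1]) / dz - 2.0 * s_v   # equation (5)
    c_z = (q * dz - q * near) / dz             # equation (6)
    return c_x, c_y, c_z
```

With this form, a point exactly on the near plane maps to $C_z = 0$ and a point on the far plane maps to $C_z = 1$, which matches the stated clip-space range.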
We know $A_x$ and $W_y$ to always be zero. We donʼt know the clip space depth ($C_z$), nor do we immediately need $N$ and $F$, which we can easily pick from the image later if we know the other parameters. This suggests we can try to solve the simpler system:
$$\begin{align} C_x = \frac{w\,W_x}{W_z - A_z} - 2S_H,\tag{7}\\[2ex] C_y = \frac{-h\,A_y}{W_z - A_z} - 2S_V.\tag{8} \end{align}$$
We have several $(W_x,0,W_z)$ and several matching $(C_x,C_y)$, and from that want to find $(0,A_y,A_z)$, $S_V$, and one of $w$ or $h$. If we have one of $w$ or $h$, we can find the other easily as the image aspect ratio is known. Ideally weʼd also like to know $S_H$, but itʼs often zero, we know from context when itʼs zero, and when itʼs not zero we can instead solve a larger image in which it is zero, if need be.
I kind of suspect that my constraints may have turned this into a relatively straightforward linear algebra problem. Alas, this is the point where I must don my math dunce cap.
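For what it is worth, equations (7) and (8) do seem to invert in closed form under the constraints above. The following is only a sketch, not a vetted method. It assumes the near edge of the rectangle lies at $z = 0$ and the far edge at $z = D$ (called `depth` below), that the image points are already in clip space, and that pixels are square so that $h = w \cdot \text{aspect}$ with aspect = image width / height. The function name `solve_camera` and all parameter names are hypothetical.

```python
def solve_camera(x_left, x_right, depth, near_l, near_r, far_l, far_r, aspect):
    """Recover w, h, A_y, A_z, S_H, S_V from one floor rectangle.

    x_left, x_right: world x of the rectangle's side edges.
    depth: world distance D between near edge (z = 0) and far edge (z = D).
    near_l, near_r, far_l, far_r: (C_x, C_y) clip-space corners.
    aspect: image width / height (assumption: square pixels, so h = w * aspect).
    """
    dn = near_r[0] - near_l[0]   # clip-space width of the near edge
    df = far_r[0] - far_l[0]     # clip-space width of the far edge
    if dn <= df:
        raise ValueError("far edge must appear narrower than near edge")

    # From (7): dn / df = (D + d) / d, where d = -A_z is the camera's
    # distance from the near edge.
    d = depth * df / (dn - df)
    w = dn * d / (x_right - x_left)
    s_h = (w * x_left / d - near_l[0]) / 2.0

    h = w * aspect
    y_near = (near_l[1] + near_r[1]) / 2.0   # average to damp picking error
    y_far = (far_l[1] + far_r[1]) / 2.0

    # From (8): y_near - y_far = -h * A_y * (1/d - 1/(D + d)).
    a_y = (y_far - y_near) / (h * (1.0 / d - 1.0 / (d + depth)))
    s_v = (-h * a_y / d - y_near) / 2.0

    return {"w": w, "h": h, "A_y": a_y, "A_z": -d, "S_H": s_h, "S_V": s_v}
```

The horizontal equation alone fixes $A_z$, $w$ and $S_H$ from the apparent widths of the two edges, and the vertical equation then fixes $A_y$ and $S_V$. The obvious failure case is a rectangle whose far edge is drawn as wide as its near edge, since the camera distance then goes to infinity.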
You have the perspective image of two pairs of parallel edges (all parallel to the floor), each of which would normally meet at a point on the horizon. Unfortunately, one pair is parallel to the image plane, so we only obtain a single point from the other pair: the central point on the horizon (which runs horizontally across the image); the eye should be perpendicularly in front of that point. This means we need to resort to other lines: the diagonals. If we assume the rectangle is a square, the two diagonals intersect the horizon in two points $P$ and $Q$ (here $Q$ is an image point, not the depth constant of equation (3)). The Thales circle over $PQ$ is the locus of all points that "see" $P$ and $Q$ correctly at a 90° angle. Hence half the distance between $P$ and $Q$ is the correct distance between eye and screen. Now project back from the known eye position to a suitable (apparently horizontal) plane. The vertical position of that plane is then a matter of scale only.
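The construction above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the four corners are given in pixel co-ordinates with square pixels, the floor shape really is a square, and its near edge is parallel to the image plane so the horizon is a horizontal line. The helper names (`line_through`, `intersect`, `eye_from_square`) are made up for this example.

```python
def line_through(p, q):
    """Homogeneous line (a, b, c) with a*x + b*y + c = 0 through p and q."""
    (x1, y1), (x2, y2) = p, q
    return (y1 - y2, x2 - x1, x1 * y2 - x2 * y1)

def intersect(l1, l2):
    """Intersection point of two homogeneous lines."""
    a1, b1, c1 = l1
    a2, b2, c2 = l2
    det = a1 * b2 - a2 * b1
    if abs(det) < 1e-12:
        raise ValueError("lines are parallel in the image")
    return ((b1 * c2 - b2 * c1) / det, (c1 * a2 - c2 * a1) / det)

def eye_from_square(near_l, near_r, far_r, far_l):
    """Return (central vanishing point, eye-to-screen distance in pixels)."""
    # The two receding side edges meet at the central point on the horizon.
    v = intersect(line_through(near_l, far_l), line_through(near_r, far_r))
    # Assumption: no camera rotation, so the horizon is the line y = v_y.
    horizon = (0.0, 1.0, -v[1])
    # The diagonals of the square meet the horizon at P and Q.
    p = intersect(line_through(near_l, far_r), horizon)
    q = intersect(line_through(near_r, far_l), horizon)
    # Thales: the eye sees P and Q at 90 degrees, so its distance is |PQ| / 2.
    return v, abs(p[0] - q[0]) / 2.0
```

The returned distance is in the same units as the input points. If the drawn shape is a non-square rectangle, the diagonals are no longer perpendicular in the world and this distance will be off by a corresponding factor.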