Obtaining camera positions/extrinsics from 6 DoF object poses (interpret head movement as camera movement)


In short: I have a video of a person speaking and, for each frame, a 6-degree-of-freedom pose of the head (i.e. rotation and position). I also have the intrinsics of the camera the video was taken with. The goal is to obtain the camera extrinsics (the [R|T] matrix) for each frame from this information, such that the head movement is interpreted as camera movement instead.

The camera rotation can be obtained by inverting and reflecting the head rotation, but I am stuck on how to find the camera "position" (the T part of the [R|T] matrix). Here's a very crude sketch of the situation:
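For reference, interpreting head motion as camera motion amounts to inverting the rigid transform as a whole, not the rotation and translation separately. A minimal numpy sketch, assuming the head pose maps head coordinates into camera coordinates (x_cam = R @ x_head + t):

```python
import numpy as np

def invert_pose(R, t):
    """Invert a rigid transform: if x_cam = R @ x_head + t maps head
    coordinates into camera coordinates, the inverse gives the camera
    pose expressed in the head's coordinate frame."""
    R_inv = R.T            # rotation matrices are orthogonal, so R^-1 = R^T
    t_inv = -R_inv @ t     # camera centre in head coordinates; note the minus sign
    return R_inv, t_inv

# Example: head rotated 90 degrees about the y-axis, offset 2 units along z.
R_head = np.array([[ 0.0, 0.0, 1.0],
                   [ 0.0, 1.0, 0.0],
                   [-1.0, 0.0, 0.0]])
t_head = np.array([0.0, 0.0, 2.0])

R_cam, t_cam = invert_pose(R_head, t_head)
# Composing the transform with its inverse gives the identity.
assert np.allclose(R_cam @ R_head, np.eye(3))
assert np.allclose(R_cam @ t_head + t_cam, np.zeros(3))
```

The key point is that the camera centre is -Rᵀt, i.e. the translation must be rotated *and* negated, not taken over directly.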

Context: I am experimenting with Neural Radiance Fields (NeRFs), machine learning models that learn an implicit geometry of a scene from images taken at different positions. Training involves shooting rays through the scene, which requires the rays to live in a canonical space. My problem could therefore be reformulated as transforming the rays that hit the face from the monocular camera into a canonical space, or, more loosely, as making the scene appear static (ignoring the person's expressions).
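To make the ray formulation concrete, here is a sketch of back-projecting a pixel into a ray and moving it into the canonical (head) frame. The names `K`, `R_c2w`, and `t_c2w` are my own placeholders for the intrinsics and a camera-to-world (here: camera-to-head) pose, not anything from the question:

```python
import numpy as np

def pixel_ray_to_canonical(u, v, K, R_c2w, t_c2w):
    """Back-project pixel (u, v) through intrinsics K into a ray in
    camera coordinates, then rotate/translate it into the canonical
    frame using the camera-to-world pose (R_c2w, t_c2w)."""
    # Ray direction in camera coordinates (pinhole model).
    d_cam = np.linalg.inv(K) @ np.array([u, v, 1.0])
    # Directions only rotate; origins translate.
    d_canon = R_c2w @ d_cam
    d_canon /= np.linalg.norm(d_canon)
    o_canon = t_c2w          # the ray origin is the camera centre
    return o_canon, d_canon
```

This is exactly the ray generation step in typical NeRF pipelines, so once a correct per-frame (R_c2w, t_c2w) exists, the rays hitting the face end up in one shared space.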

Papers/models used: I obtain the face pose using this paper: https://github.com/vitoralbiero/img2pose

Edit: My current approach is to invert the rotation except about the z-axis, then multiply the head position by this newly obtained rotation matrix to get the camera translation. When I use this camera matrix to overlay a face on the original picture, the position is slightly off, yet frustratingly close. I can't show these renderings for privacy reasons, so here's another sketch:
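A small-but-consistent offset like this is often a sign issue (t = Rᵀt_head instead of -Rᵀt_head) or a principal-point/axis-convention mismatch. One way to pin it down is a round-trip reprojection check: project a known head-space point with the candidate extrinsics and compare against where it actually lands in the frame. The helper below is hypothetical, just a standard pinhole projection:

```python
import numpy as np

def reproject(X_head, K, R, t):
    """Project a head-space 3D point with candidate extrinsics [R|t]
    and intrinsics K into pixel coordinates."""
    x = K @ (R @ X_head + t)
    return x[:2] / x[2]   # perspective divide -> (u, v)

# Toy check: camera looking down +z from 5 units away, principal point (50, 50).
K = np.array([[100.0,   0.0, 50.0],
              [  0.0, 100.0, 50.0],
              [  0.0,   0.0,  1.0]])
uv = reproject(np.zeros(3), K, np.eye(3), np.array([0.0, 0.0, 5.0]))
# A point on the optical axis must land exactly on the principal point.
assert np.allclose(uv, [50.0, 50.0])
```

Sweeping candidate fixes (flipping the translation sign, toggling the z-axis handling) through this check against a few annotated frames should isolate which convention is off.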