Consider a video of someone holding up a photo of a face in front of a camera and translating and rotating it a bit. Imagine that we know the pixel coordinates of various features in the picture: eye corners, the corners of the mouth, and so on.
All the intrinsic and extrinsic camera parameters are known.
Is there a way using multiple view geometry, or projective geometry, to realize that we're looking at a photo and not an actual face?
Maybe one could compute a homography between two frames with different head poses in the video, and distinguish the two cases that way: feature points on a photo satisfy a homography exactly, while feature points on an actual face do not, since they are not coplanar?
A camera projecting from a planar photograph to a planar sensor image is essentially a projective transformation. One property of these is that they preserve incidences: if three points lie on a line in one image, they do so in the other; if three lines meet in a single point in one, they do so in the other. So you could conceivably find features of the face satisfying this property in one frame of the video, and then check other frames.
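To make the incidence test concrete, here is a small sketch in plain Python (the point coordinates and the homography entries are made up): three homogeneous points are collinear exactly when the determinant of the matrix they form vanishes, and under a projective map that determinant only picks up a constant factor of det(H), so it stays zero.

```python
def matvec(H, p):
    """Apply a 3x3 matrix to a homogeneous 3-vector."""
    return tuple(sum(H[i][j] * p[j] for j in range(3)) for i in range(3))

def det3(p, q, r):
    """Determinant of the 3x3 matrix with columns p, q, r; it is zero
    exactly when the three homogeneous points are collinear."""
    return (p[0] * (q[1] * r[2] - q[2] * r[1])
          - q[0] * (p[1] * r[2] - p[2] * r[1])
          + r[0] * (p[1] * q[2] - p[2] * q[1]))

# Three collinear points on the line y = x (homogeneous coords, w = 1).
p1, p2, p3 = (0, 0, 1), (1, 1, 1), (2, 2, 1)
assert det3(p1, p2, p3) == 0

# An arbitrary invertible homography (made-up integer entries), standing
# in for the frame-to-frame map induced by the plane of the photo.
H = [[2, 1, 3],
     [0, 1, 1],
     [1, 0, 2]]

q1, q2, q3 = (matvec(H, p) for p in (p1, p2, p3))
assert det3(q1, q2, q3) == 0  # still collinear after the mapping
```

Integer coordinates keep the arithmetic exact here; with real detected features you would compare the determinant against a noise-dependent threshold instead of zero.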
But finding exactly collinear points or concurrent lines might be tricky. Another preserved quantity is the cross ratio. Usually one speaks of the cross ratio of four points on a line, but thanks to projective duality you can define one for four lines through a point using essentially the same homogeneous formula. You can use this to tackle points in general position: pick five well recognizable features and treat them as points. Connect four of them to the fifth and you get four concurrent lines. Compute their cross ratio. For a video of a photo this quantity will remain fixed across frames.
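A sketch of that pencil-of-lines cross ratio, again with made-up coordinates: each line through the apex and a point corresponds to a 3x3 determinant, each determinant scales by det(H) under a homography H, and the four factors cancel in the ratio, so the cross ratio is projectively invariant.

```python
from fractions import Fraction

def det3(p, q, r):
    """Determinant of the 3x3 matrix with columns p, q, r (homogeneous points)."""
    return (p[0] * (q[1] * r[2] - q[2] * r[1])
          - q[0] * (p[1] * r[2] - p[2] * r[1])
          + r[0] * (p[1] * q[2] - p[2] * q[1]))

def matvec(H, p):
    """Apply a 3x3 matrix to a homogeneous 3-vector."""
    return tuple(sum(H[i][j] * p[j] for j in range(3)) for i in range(3))

def cross_ratio(apex, a, b, c, d):
    """Cross ratio of the pencil of four lines joining apex to a, b, c, d,
    expressed through determinants; every det3 scales by det(H) under a
    homography, and the factors cancel in the ratio."""
    return Fraction(det3(apex, a, c) * det3(apex, b, d),
                    det3(apex, a, d) * det3(apex, b, c))

# Five made-up features in general position (homogeneous coordinates);
# P is the apex the other four are connected to.
P, A, B, C, D = (0, 0, 1), (1, 0, 1), (0, 1, 1), (1, 1, 1), (2, 1, 1)
cr1 = cross_ratio(P, A, B, C, D)

# The same five features seen through an arbitrary invertible homography,
# as between two frames of a video of a flat photo.
H = [[2, 1, 3],
     [0, 1, 1],
     [1, 0, 2]]
cr2 = cross_ratio(*(matvec(H, p) for p in (P, A, B, C, D)))

assert cr1 == cr2  # the pencil's cross ratio survives the projective map
```

For a real face the five features are not coplanar, so the two frames are not related by any single homography and this equality would fail (up to the noise caveats below).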
Yet another possibility: given four reliably located features you could obtain the homography from one frame to another. Using this you can transform the scene from one frame into the coordinate system of the other, then compare the two to spot differences between them. The sum of absolute pixel differences is the naive thing that comes to mind, but you can probably do a lot better with some alternative quantities.
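A minimal sketch of that idea, using exact rational arithmetic and made-up coordinates rather than a real solver: fit the homography from four correspondences by direct linear transform (fixing h33 = 1), then check how well it predicts a fifth feature. For a flat photo the transfer residual is exactly zero in this idealized setting; an offset fifth point, standing in for the parallax of a real face, leaves a nonzero residual.

```python
from fractions import Fraction

def solve(A, b):
    """Gauss-Jordan elimination in exact rational arithmetic."""
    n = len(A)
    M = [[Fraction(x) for x in row] + [Fraction(v)] for row, v in zip(A, b)]
    for col in range(n):
        piv = next(r for r in range(col, n) if M[r][col] != 0)
        M[col], M[piv] = M[piv], M[col]
        M[col] = [x / M[col][col] for x in M[col]]
        for r in range(n):
            if r != col and M[r][col] != 0:
                f = M[r][col]
                M[r] = [a - f * c for a, c in zip(M[r], M[col])]
    return [row[n] for row in M]

def fit_homography(src, dst):
    """Direct linear transform from exactly four point pairs, fixing h33 = 1."""
    A, b = [], []
    for (x, y), (u, v) in zip(src, dst):
        A.append([x, y, 1, 0, 0, 0, -u * x, -u * y]); b.append(u)
        A.append([0, 0, 0, x, y, 1, -v * x, -v * y]); b.append(v)
    h = solve(A, b)
    return [h[0:3], h[3:6], h[6:8] + [Fraction(1)]]

def apply_h(H, pt):
    """Map an inhomogeneous point through a homography."""
    x, y = pt
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)

# Ground truth: an arbitrary homography (made-up), playing the role of the
# frame-to-frame map induced by the plane of the photo.
def true_map(pt):
    x, y = pt
    return (Fraction(2 * x + y + 3, x + 2), Fraction(y + 1, x + 2))

src = [(0, 0), (1, 0), (0, 1), (1, 1)]
dst = [true_map(p) for p in src]
Hfit = fit_homography(src, dst)

# A fifth coplanar feature transfers with zero residual...
assert apply_h(Hfit, (2, 1)) == true_map((2, 1))

# ...while a depth offset (a nose sticking out of the photo plane) shifts
# the observed position and leaves a nonzero transfer residual.
observed = (true_map((2, 1))[0] + Fraction(1, 10), true_map((2, 1))[1])
pred = apply_h(Hfit, (2, 1))
residual = abs(pred[0] - observed[0]) + abs(pred[1] - observed[1])
assert residual > 0
```

In practice you would use many more than four correspondences with a robust estimator (e.g. OpenCV's `cv2.findHomography` with RANSAC) and look at the distribution of transfer errors rather than a single residual.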
All of the above should have zero error in a perfect geometric world, so any non-zero value would disprove the hypothesis that this is a video of a photo. However, the world is not perfect, and even a photo would be subject to different lighting, different signal noise, and different quantisation. So you'd never get exactly zero, and I'm far from sure you could find a reasonable threshold to reliably distinguish between a picture subject to these error sources and an actual person with a fairly flat face and very little rotation.
Intuitively the most non-flat feature of most faces is probably the nose. So as a human I'd look for that: which side of it I see, and what it occludes. In terms of algorithmic feature detection, the nose and the areas surrounding it might be poor in sharp recognizable features, so I'm not sure how well this approach could be automated.
If you know so much about the camera, it makes me wonder whether you have some control over the environment where the video is taken. If you could change the lighting, that might go a long way toward distinguishing these cases: a real face will look very different when lit first from one side and then from the other.