I'm reading the OpenCV documentation on the mathematical model it uses for a camera. The text quoted below comes from the "Detailed Description" section of that documentation. I do not quite understand the alternative description that is given after the matrix equation.
The functions in this section use a so-called pinhole camera model. In this model, a scene view is formed by projecting 3D points into the image plane using a perspective transformation.
$$s \; m' = A [R|t] M'$$
or
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix}f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 &1\end{bmatrix} \begin{bmatrix} r_{11} & r_{12} & r_{13} & t_1 \\ r_{21} & r_{22} & r_{23} & t_2 \\ r_{31} & r_{32} & r_{33} & t_3 \end{bmatrix} \begin{bmatrix} X \\ Y \\ Z \\ 1 \end{bmatrix}$$
where:
- ($X$,$Y$,$Z$) are the coordinates of a 3D point in the world coordinate space
- ($u$,$v$) are the coordinates of the projection point in pixels
- $A$ is a camera matrix, or a matrix of intrinsic parameters
- ($c_x$,$c_y$) is a principal point that is usually at the image center
- $f_x$,$f_y$ are the focal lengths expressed in pixel units.
Thus, if an image from the camera is scaled by a factor, all of these parameters should be scaled (multiplied/divided, respectively) by the same factor. The matrix of intrinsic parameters does not depend on the scene viewed. So, once estimated, it can be re-used as long as the focal length is fixed (in case of zoom lens).
The joint rotation-translation matrix $[R|t]$ is called a matrix of extrinsic parameters. It is used to describe the camera motion around a static scene, or vice versa, rigid motion of an object in front of a still camera. That is, $[R|t]$ translates coordinates of a point $(X,Y,Z)$ to a coordinate system, fixed with respect to the camera.
The transformation above is equivalent to the following (when $z \neq 0$):
$$\begin{array}{l} \begin{bmatrix} x \\ y \\z \end{bmatrix}= R \begin{bmatrix} X \\ Y \\Z \end{bmatrix} + t \\ x' = x/\color{red}{z} \\ y' = y/\color{red}{z} \\ u = f_x*x' + c_x \\ v = f_y*y' + c_y \end{array}$$
It's stated that both representations are equivalent, but then why is there a division by $\color{red}{z}$ (marked red)?
As far as I can tell, $\begin{bmatrix} x \\ y \\z \end{bmatrix}$ is equivalent to the right part of the matrix equation, namely $[R|t] M'$. All that's left to do is multiply by the matrix $A$, which would look like this:
$$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} x \\ y \\z \end{bmatrix} = \begin{bmatrix}f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 &1\end{bmatrix} \begin{bmatrix} x \\ y \\z \end{bmatrix}= \begin{bmatrix}xf_x & 0 & zc_x \\ 0 & yf_y & zc_y \\ 0 & 0 &z\end{bmatrix}$$
Now to make both representations equivalent, I'd have to divide both sides of the equation by $z$, but then there would be a $/z$ on the left hand side of the equation which isn't there in the matrix equation.
Sure, the scaling factor $s$, which doesn't have a fixed value anyway, could absorb the $/z$, so the additional $/z$ on the left side is not necessarily a problem. But why is this division applied in the first place? It appears entirely arbitrary and out of place to me. What's the point?
I could not find an explanation in the text.
The division by $z$ is applied for exactly the reason you've discussed: it is there to make both representations equivalent.
You are right that $\begin{bmatrix} x \\ y \\z \end{bmatrix}$ is notation for $[R|t] M'$. However, your last calculation is a bit off: a $3\times 3$ matrix times a $3\times 1$ vector is a $3\times 1$ vector, so it should be $$s \begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = A \begin{bmatrix} x \\ y \\z \end{bmatrix} = \begin{bmatrix}f_x & 0 & c_x \\ 0 & f_y & c_y \\ 0 & 0 &1\end{bmatrix} \begin{bmatrix} x \\ y \\z \end{bmatrix}= \begin{bmatrix}xf_x + zc_x \\yf_y + zc_y \\ z\end{bmatrix}\tag1 $$ Now the whole purpose of these calculations is to return $u$ and $v$. Comparing the third components of the two sides of (1) gives $s = z$. So for the RHS of (1) to take the form $(u, v, 1)^T$ that appears on the LHS, its last entry $z$ must become $1$, which you achieve by dividing both sides by $z$ (this is where $z \neq 0$ is needed): $$\begin{bmatrix} u \\ v \\ 1 \end{bmatrix} = \begin{bmatrix}(x/z)f_x + c_x \\ (y/z)f_y + c_y \\ 1\end{bmatrix}$$ This explains why $u$ must equal $(x/z)f_x + c_x$ and $v$ must equal $(y/z)f_y + c_y$.
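If it helps to see this numerically, here is a minimal NumPy sketch that evaluates both forms for one point and compares them. The intrinsics, rotation, translation and world point are made-up values for illustration only, and the helper names `project_matrix` and `project_stepwise` are hypothetical, not part of OpenCV.

```python
import numpy as np

# Hypothetical intrinsics, chosen only for illustration.
fx, fy, cx, cy = 800.0, 820.0, 320.0, 240.0
A = np.array([[fx, 0.0, cx],
              [0.0, fy, cy],
              [0.0, 0.0, 1.0]])

# Hypothetical extrinsics: a 10 degree rotation about the y axis plus a translation.
theta = np.deg2rad(10.0)
R = np.array([[np.cos(theta), 0.0, np.sin(theta)],
              [0.0, 1.0, 0.0],
              [-np.sin(theta), 0.0, np.cos(theta)]])
t = np.array([0.1, -0.2, 2.0])

M = np.array([0.5, 0.3, 4.0])  # world point (X, Y, Z), also made up


def project_matrix(A, R, t, M):
    """Matrix form: compute s * (u, v, 1) = A [R|t] M', then divide by s."""
    Rt = np.hstack([R, t.reshape(3, 1)])   # 3x4 matrix [R|t]
    sm = A @ Rt @ np.append(M, 1.0)        # 3-vector equal to s * (u, v, 1)
    s = sm[2]                              # third component is the scale s
    return sm[:2] / s, s


def project_stepwise(fx, fy, cx, cy, R, t, M):
    """Step-by-step form: camera coordinates, divide by z, apply intrinsics."""
    x, y, z = R @ M + t
    x_prime, y_prime = x / z, y / z        # requires z != 0
    return np.array([fx * x_prime + cx, fy * y_prime + cy]), z


uv_matrix, s = project_matrix(A, R, t, M)
uv_steps, z = project_stepwise(fx, fy, cx, cy, R, t, M)

print("matrix form  :", uv_matrix, "s =", s)
print("stepwise form:", uv_steps, "z =", z)
print("same pixel:", np.allclose(uv_matrix, uv_steps), "| s equals z:", np.isclose(s, z))
```

Both routes should give the same $(u, v)$, and the scale $s$ recovered from the matrix form should coincide with the camera-frame depth $z$. That is the sense in which the division by $z$ is already hidden in the factor $s$.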
Another derivation of these equations, under simpler assumptions and different notation, can be found here: https://en.wikipedia.org/wiki/Pinhole_camera_model