In the paper "Attention is all you need" (Section 3.5), the authors choose the following function to encode the position of a word in a sequence:
$$ PE(\text{pos}, 2i) = \sin\!\left(\text{pos} / 10000^{2i/d_{\text{model}}}\right) $$
For the purposes of this question this function can be simplified to:
$ PE(pos) = sin(pos) $
The text states that "for any fixed offset $k$, $PE(pos+k)$ can be represented as a linear function of $PE(pos)$". This did not seem obvious to me, given the nonlinearity of the sine function. Other resources such as Attention is all you need Explained mention this property but do not go deeper into it.
I attempted to use linear regression in Python to derive such a transform, but was unable to find one that fits. As $k$ increases and the sine waves produced by $PE(pos)$ drift out of sync, the correlation between the transformed values and the ground truth decreases.
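A minimal sketch of the regression I attempted (the offset $k$ and the sample positions are illustrative choices of mine): fitting $PE(pos+k)$ as a scalar linear function of $\sin(pos)$ alone leaves a large residual.

```python
import numpy as np

# Illustrative setup: try to express sin(pos + k) as a linear function
# of the single scalar sin(pos), i.e. y ≈ a*x + b.
k = 5.0
pos = np.arange(0, 100, dtype=float)
x = np.sin(pos)        # simplified PE(pos)
y = np.sin(pos + k)    # simplified PE(pos + k)

# Least-squares fit of y against [x, 1].
A = np.stack([x, np.ones_like(x)], axis=1)
(a, b), residuals, *_ = np.linalg.lstsq(A, y, rcond=None)

print(residuals[0])    # large residual: no scalar linear map fits
```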
Did I misapprehend the statement in the paper, or is my code or my understanding of the underlying math at fault?
Upon closer inspection, the paper defines the function $\operatorname{PE}$ separately for even and odd dimensions as
\begin{eqnarray*} \operatorname{PE}(\text{pos},2d) &=&\sin(\text{pos}/c^d),\\ \operatorname{PE}(\text{pos},2d+1) &=&\cos(\text{pos}/c^d), \end{eqnarray*}
for the constant $c=10000^{\frac{2}{d_{\text{model}}}}$. The angle-addition identities
$$\sin(\alpha+\beta)=\sin(\alpha)\cos(\beta)+\cos(\alpha)\sin(\beta),\qquad \cos(\alpha+\beta)=\cos(\alpha)\cos(\beta)-\sin(\alpha)\sin(\beta)$$
then yield
\begin{eqnarray*} \operatorname{PE}(\text{pos}+k,2d) &=&\operatorname{PE}(\text{pos},2d)\cos(k/c^d)+\operatorname{PE}(\text{pos},2d+1)\sin(k/c^d),\\ \operatorname{PE}(\text{pos}+k,2d+1) &=&\operatorname{PE}(\text{pos},2d+1)\cos(k/c^d)-\operatorname{PE}(\text{pos},2d)\sin(k/c^d). \end{eqnarray*}
For a fixed offset $k$, the coefficients $\cos(k/c^d)$ and $\sin(k/c^d)$ do not depend on $\text{pos}$, so each pair $\bigl(\operatorname{PE}(\text{pos},2d),\operatorname{PE}(\text{pos},2d+1)\bigr)$ is mapped onto its shifted counterpart by a constant $2\times 2$ rotation matrix. This is what the authors mean by a linear function of $\operatorname{PE}(\text{pos})$. Note that the simplification $PE(pos)=\sin(pos)$ discards the cosine components, which is why no fitting linear transform could be found.
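A quick numerical check (a NumPy sketch; the single frequency and the chosen positions are my own illustrative setup, not from the paper) confirms that a constant $2\times 2$ matrix with entries $\cos(k/c^d)$ and $\sin(k/c^d)$ maps each sine/cosine pair exactly onto its shifted counterpart:

```python
import numpy as np

k = 5.0                          # fixed offset
freq = 1.0                       # stands in for 1 / c**d at one dimension d
pos = np.arange(0, 100, dtype=float)

# Stack the sine and cosine components as rows: shape (2, N).
pe = np.stack([np.sin(pos * freq), np.cos(pos * freq)])
pe_shifted = np.stack([np.sin((pos + k) * freq),
                       np.cos((pos + k) * freq)])

# Rotation matrix whose entries depend only on k, not on pos.
M = np.array([[ np.cos(k * freq), np.sin(k * freq)],
              [-np.sin(k * freq), np.cos(k * freq)]])

print(np.allclose(M @ pe, pe_shifted))   # True: the map is exactly linear
```

Once both components are kept, the transform is exact for every position, so no regression is needed at all.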