How can I quickly solve a linear system $f(A)x = b$, where $A$ is a tridiagonal matrix of the form $$ \begin{bmatrix} a & b & 0 & 0 & ... & 0 & 0 & 0 & 0\\\\ c & d & c & 0 & ... & 0 & 0 & 0 & 0\\\\ 0 & c & d & c & ... & 0 & 0 & 0 & 0\\\\ 0 & 0 & c & d & ... & 0 & 0 & 0 & 0\\\\ \cdots & \cdots & \cdots & \cdots & \cdots &\cdots& \cdots &\cdots&\cdots\\\\ 0 & 0 & 0 & 0 & ... & d & c & 0 & 0 \\\\ 0 & 0 & 0 & 0 & ... & c & d & c & 0 \\\\ 0 & 0 & 0 & 0 & ... & 0 & c & d & c \\\\ 0 & 0 & 0 & 0 & ... & 0 & 0 & b & a \\\\ \end{bmatrix} $$ and $f$ is a polynomial, for example $f(A) = I + A + A^2 + A^3$?
When the matrix $A$ is very large ($10^6 \times 10^6$; of course it is stored as a sparse matrix), the computational cost is very high. For example, I ran out of memory (64 GB) when solving the equation above.
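One approach I tried, which keeps memory at $O(n)$: factor the polynomial into its roots, $f(A) = \alpha \prod_i (A - r_i I)$, so that each factor is again tridiagonal and can be handled by a banded solver. Here is my sketch (the helper name `solve_poly_tridiag` is mine; the roots may be complex, so the intermediate solves are done in complex arithmetic):

```python
import numpy as np
from scipy.linalg import solve_banded

def solve_poly_tridiag(coeffs, a, b_corner, c, d, rhs):
    """Solve f(A) x = rhs, where f has coefficients `coeffs` (highest
    degree first) and A is the tridiagonal matrix from the question:
    interior rows (c, d, c), first row (a, b_corner), last row (b_corner, a).

    f(A) = lead * prod_i (A - r_i I), so we solve one tridiagonal
    system per root, each in O(n) time and memory."""
    n = rhs.size
    # Banded storage for solve_banded: row 0 = superdiagonal,
    # row 1 = main diagonal, row 2 = subdiagonal.
    diag = np.full(n, d, dtype=complex)
    diag[0] = diag[-1] = a
    upper = np.full(n, c, dtype=complex)
    upper[1] = b_corner            # A[0, 1] = b
    lower = np.full(n, c, dtype=complex)
    lower[-2] = b_corner           # A[n-1, n-2] = b
    roots = np.roots(coeffs)       # roots of f (possibly complex)
    x = rhs.astype(complex) / coeffs[0]
    for r in roots:
        ab = np.vstack([upper, diag - r, lower])  # banded form of A - r I
        x = solve_banded((1, 1), ab, x)
    # Imaginary parts cancel when f and A are real.
    return x.real
```

This does one $O(n)$ banded solve per root of $f$, so for my degree-3 example it is three tridiagonal solves instead of forming $f(A)$ explicitly.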
Diagonalization seems like a good idea, but diagonalizing such a large matrix directly is itself too expensive.
Some answers suggested that I can use a fast cosine transform to solve the equation. I would like to know how this is done, and I would be very grateful for relevant literature or Python code.
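My understanding so far (please correct me if this is wrong): the cosine-transform trick should only apply when the corner entries match a cosine boundary condition, e.g. $b = c$ and $a = d + c$. Then $A = (d + 2c)I - cB$, where $B$ is the "free-free" second-difference matrix, whose eigenvectors are the DCT-II basis (Strang, "The Discrete Cosine Transform", SIAM Review 1999), giving eigenvalues $\lambda_k = d + 2c\cos(k\pi/n)$. Since $f(A)$ has eigenvalues $f(\lambda_k)$ in the same basis, the solve becomes two DCTs and a pointwise division. My attempt at a sketch (function name is mine):

```python
import numpy as np
from scipy.fft import dct, idct

def solve_poly_dct(f, c, d, rhs):
    """Solve f(A) x = rhs when A has interior rows (c, d, c) and corner
    diagonal entries d + c (i.e. a = d + c, b = c in the question's
    notation).  This boundary is diagonalized by the DCT-II, so the
    solve costs O(n log n) time and O(n) memory."""
    n = rhs.size
    # Eigenvalues of A in the DCT-II basis: lambda_k = d + 2c cos(k pi / n)
    lam = d + 2.0 * c * np.cos(np.pi * np.arange(n) / n)
    bhat = dct(rhs, type=2, norm='ortho')   # coefficients in the eigenbasis
    xhat = bhat / f(lam)                    # f(A) acts as f(lambda_k) there
    return idct(xhat, type=2, norm='ortho') # back to the standard basis
```

For example, with `f = lambda t: 1 + t + t**2 + t**3` this would solve my $f(A) = I + A + A^2 + A^3$ case in $O(n \log n)$, if the boundary condition really has this form.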