Regression
Suppose I have data points in a matrix $X \in \mathbb{R}^{n \times m}$ as well as labels $y \in \mathbb{R}^n$, where $n$ is the number of data points and $m$ is the number of features per data point. For a new data point $x \in \mathbb{R}^m$ I want to predict a value $\hat{y} \in \mathbb{R}$.
Linear Regression
A simple way to do so is to assume that the data is generated by a linear function:
$$y = x^T \cdot w$$
where $w \in \mathbb{R}^m$ are parameters which have to be learned from the data we've collected so far.
A standard way to learn the parameters $w$ is ordinary least squares, whose closed-form solution (assuming $X^T X$ is invertible) is
$$w = (X^T X)^{-1} X^T y$$
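To make this concrete, here is a minimal numpy sketch of the normal equation above; the toy data is made up, and `np.linalg.solve` is used instead of forming the inverse explicitly, which is numerically preferable:

```python
import numpy as np

# Made-up toy data: n = 4 points, m = 2 features.
X = np.array([[1.0, 2.0],
              [2.0, 0.5],
              [3.0, 1.0],
              [4.0, 2.5]])
y = np.array([5.0, 4.5, 7.0, 10.5])

# Normal equation w = (X^T X)^{-1} X^T y, solved without an explicit inverse.
w = np.linalg.solve(X.T @ X, X.T @ y)

# Prediction for a new point x: y_hat = x^T w.
x_new = np.array([2.0, 1.0])
y_hat = x_new @ w
print(w, y_hat)
```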
Quadratic transformation sanity check
We can also add new features to the data points. For example, say we have $x \in \mathbb{R}$ and we transform the feature by $$\Phi(x) = (x, x^2)$$
Let
$$X = \begin{pmatrix}-1 \\ 0\\ 1\end{pmatrix}\;\;\; y = \begin{pmatrix}1\\0\\1\end{pmatrix}$$
and thus
$$\Phi(X) = \begin{pmatrix}-1 & 1 \\ 0 & 0\\ 1 & 1\end{pmatrix}$$
Now we can get $w$ by
$$ \begin{align} w &= (\Phi(X)^T \Phi(X))^{-1} \Phi(X)^T y\\ &= \begin{pmatrix}2 & 0\\ 0 & 2\end{pmatrix}^{-1} \begin{pmatrix}-1 & 1 \\ 0 & 0\\ 1 & 1\end{pmatrix}^T \begin{pmatrix}1\\0\\1\end{pmatrix}\\ &= \frac{1}{2} \cdot \begin{pmatrix}-1 & 0 & 1\\1 & 0 & 1\end{pmatrix} \begin{pmatrix}1\\0\\1\end{pmatrix}\\ &= \begin{pmatrix}0\\1\end{pmatrix} \end{align}$$
Hence the fitted model is
$$\hat{y} = x^2$$
which is exactly what I had in mind when I tried this example.
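For reference, a short numpy sketch that reproduces this sanity check, with the feature map written out by hand:

```python
import numpy as np

X = np.array([-1.0, 0.0, 1.0])
y = np.array([1.0, 0.0, 1.0])

# Feature map Phi(x) = (x, x^2), applied row-wise.
Phi = np.column_stack([X, X**2])

# Least-squares solution of Phi w = y; equivalent to the normal equation.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)  # [0. 1.]  ->  the fitted model is y_hat = x^2
```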
Transforming the labels
My first thought about the limitations of this method was that a model like $y = e^{w_1 x}$ could not be fitted. However, if we add a bijective label transformation $\Psi(y) = \log(y)$, we get the problem $\Psi(y) = w_1 x$, which, I guess, can again be solved by a linear regression model.
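A minimal sketch of that idea, assuming noiseless data generated as $y = e^{w_1 x}$ with a made-up value $w_1 = 0.7$:

```python
import numpy as np

# Made-up noiseless data from y = exp(w1 * x) with w1 = 0.7.
x = np.linspace(0.0, 2.0, 10)
y = np.exp(0.7 * x)

# Label transformation Psi(y) = log(y) turns the model into
# log(y) = w1 * x: linear regression with one feature and no intercept.
Phi = x.reshape(-1, 1)
w, *_ = np.linalg.lstsq(Phi, np.log(y), rcond=None)
print(w)  # [0.7]
```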
Question
My question is whether this always works. So, let's say the data is generated by a polynomial of degree 1337. Could I simply use a feature transformation $\Phi(x) = (1, x, x^2, \dots, x^{1337})$ and expect to recover the generating polynomial if I have enough (1338?) points?
I am pretty sure the answer is "yes" in this case, because the prediction is only a linear combination of the transformed features.
However, what about a model $y = w_1\,(1 - 2e^{w_2 x})$? Is it possible to find $\Psi$ and $\Phi$ so that one can use linear regression again?
Answer
Could I simply use a feature transformation $\Phi(x) = (1, x, x^2, \dots, x^{1337})$ and expect to recover the generating polynomial if I have enough (1338?) points?
Yes: any polynomial of degree $n$ can be written as a linear combination of $\{x^0, x^1, \dots, x^n\}$. Linear regression can learn any linear combination of its features, hence any such polynomial can be learned using the features you describe. And yes, you will need at least $n + 1$ distinct points to fit a polynomial of degree $n$ exactly.
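To illustrate with a degree-3 polynomial instead of 1337 (the coefficients below are made up): with $n + 1 = 4$ distinct points and the Vandermonde features $(1, x, x^2, x^3)$, least squares recovers the generating coefficients exactly.

```python
import numpy as np

# Made-up generating polynomial: y = 2 - x + 3x^2 + 0.5x^3.
coeffs = np.array([2.0, -1.0, 3.0, 0.5])
x = np.array([-1.0, 0.0, 1.0, 2.0])        # n + 1 = 4 distinct points
Phi = np.vander(x, N=4, increasing=True)   # columns: 1, x, x^2, x^3
y = Phi @ coeffs

# Phi is a square Vandermonde matrix, invertible for distinct points,
# so least squares recovers the generating coefficients exactly.
w, *_ = np.linalg.lstsq(Phi, y, rcond=None)
print(w)  # [ 2.  -1.   3.   0.5]
```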
However, what about a model $y = w_1\,(1 - 2e^{w_2 x})$? Is it possible to find $\Psi$ and $\Phi$ so that one can use linear regression again?
I strongly think that this is not possible. I don't have a formal proof yet; I will edit the answer as soon as I find a neat argument ;). Intuitively, the obstacle seems to be that $w_2$ sits inside the exponential while $w_1$ scales an additive term, so a single fixed $\Psi$, chosen without knowing the parameters, cannot make the model linear in $(w_1, w_2)$.
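In the meantime, the usual way to fit a model that is nonlinear in its parameters is nonlinear least squares rather than a linearizing transformation; here is a minimal sketch using scipy.optimize.curve_fit, with made-up true values $w_1 = 2$ and $w_2 = -0.5$:

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, w1, w2):
    return w1 * (1.0 - 2.0 * np.exp(w2 * x))

# Made-up noiseless data with w1 = 2, w2 = -0.5.
x = np.linspace(0.0, 5.0, 50)
y = model(x, 2.0, -0.5)

# Nonlinear least squares directly in (w1, w2); no Psi or Phi needed.
params, _ = curve_fit(model, x, y, p0=(1.0, -1.0))
print(params)  # ~[ 2.  -0.5]
```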