Suppose we have a random variable $X$ which undergoes a deterministic transformation to produce a random variable $Y = f(X)$. If I want the minimum mean square error (MMSE) estimate of $Y$ given $X$, we know it is the conditional expectation $\hat{Y}_{MMSE}=\mathbb{E}[Y|X]$, which in this case is simply $f(X)$.
However, if I consider the linear MMSE (LMMSE) estimator, we have: $\hat{Y}_{LMMSE} = \mathbb{E}[Y] + Cov(Y,X)Var(X)^{-1}(X- \mathbb{E}[X])$.
My main question is: is there a relationship between the two estimates? I read on Wikipedia that the second can be viewed as a first-order Taylor series of the first, but I am not sure why, or how that link can be made when we are dealing with random variables, and I cannot find a formal derivation of it. Mainly, I want to understand whether it is correct to view the linear form as giving an estimate of the deterministic function $f(\cdot)$ above, for example when $f$ is nonlinear.
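To make the two estimators concrete, here is a minimal Monte Carlo sketch in NumPy. The choices $f(x)=x^3$ and $X \sim N(0,1)$ are arbitrary illustrations; for them $Cov(X,Y)=\mathbb{E}[X^4]=3$, so the LMMSE estimator is $3X$, while the MMSE estimator $f(X)$ has zero error.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000
x = rng.standard_normal(n)

f = lambda t: t**3        # an arbitrary nonlinear deterministic transformation
y = f(x)

# MMSE estimate: E[Y|X] = f(X), so its error is exactly zero here.
y_mmse = f(x)

# LMMSE estimate: E[Y] + Cov(Y,X)/Var(X) * (X - E[X]); here beta ≈ E[X^4] = 3.
beta = np.cov(x, y)[0, 1] / np.var(x)
y_lmmse = y.mean() + beta * (x - x.mean())

print(np.mean((y - y_mmse) ** 2))   # exactly 0
print(np.mean((y - y_lmmse) ** 2))  # ≈ 6, since E[(X^3 - 3X)^2] = 15 - 18 + 9 = 6
```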
I don't think the connection with a first-order Taylor series is anything deeper than the fact that both are linear approximations of a nonlinear quantity. The Taylor estimator of $Y$ would more likely be something like $$ \hat{Y} = f(\mathbf{E}X) + f'(\mathbf{E} X) (X - \mathbf{E} X). $$
Indeed, there is an important difference here: the Taylor estimate is local, meaning it is accurate close to $\mathbf{E} X$ but does not even try to be optimal for values of $X$ far from the mean. The LMMSE estimator, on the other hand, is designed to be global: it is the linear function that best approximates $f$ over the whole space of possible values of $X$, weighted by the distribution of $X$.
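This local-versus-global distinction can be checked numerically. A sketch (NumPy, with the illustrative choices $f = \exp$ and $X \sim N(0,1)$, so the Taylor estimator is $1 + X$): the Taylor estimator wins near the expansion point, while the LMMSE estimator has the smaller overall MSE.

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.standard_normal(1_000_000)
y = np.exp(x)                      # nonlinear f

# First-order Taylor estimator around E[X] = 0:  f(0) + f'(0) * X = 1 + X
y_taylor = 1.0 + x

# LMMSE estimator: E[Y] + Cov(X,Y)/Var(X) * (X - E[X]); here beta = e^{1/2}
beta = np.cov(x, y)[0, 1] / np.var(x)
y_lmmse = y.mean() + beta * (x - x.mean())

local = np.abs(x) < 0.25           # samples near the expansion point
for name, yhat in [("taylor", y_taylor), ("lmmse", y_lmmse)]:
    print(name,
          np.mean((y - yhat)[local] ** 2),   # local MSE: Taylor is far better
          np.mean((y - yhat) ** 2))          # global MSE: LMMSE is better
```

The global MSE of the LMMSE estimator here is $\mathrm{Var}(Y) - \mathrm{Cov}(X,Y)^2/\mathrm{Var}(X) = e^2 - 2e \approx 1.95$, versus $e^2 - 4e^{1/2} + 2 \approx 2.79$ for the Taylor estimator.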
The better analogy here is linear regression. Suppose that we have $n$ observations $(x_i, y_i)$ from the linear model $$ Y = \beta_0 + X\beta + \epsilon. $$
The standard estimates of $\beta_0, \beta$ are then \begin{align*} \hat{\beta}_0 = \bar{y} - \hat{\beta} \bar{x} \qquad \hat{\beta} = \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2}. \end{align*}
Given a new $X$, our predicted corresponding value of $Y$ is then \begin{align*} \hat{Y} &= \hat{\beta}_0 + X\hat{\beta} \\ &= \bar{y} + \frac{\sum_{i=1}^n(x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^n(x_i - \bar{x})^2} (X - \bar{x}). \end{align*}
This is the sample version of the best predictor of $Y$ given $X$. The population version is essentially this given infinite samples, which becomes \begin{align*} \hat{Y} &= \mathbf{E} Y + \frac{\mathbf{E}(X - \mathbf{E} X)(Y - \mathbf{E} Y)}{\mathbf{E}(X - \mathbf{E} X)^2} (X - \mathbf{E}X) \\ &= \mathbf{E} Y + \frac{\mathrm{Cov}(X, Y)}{\mathrm{Var}(X)} (X - \mathbf{E}X). \end{align*}
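The convergence of the sample slope to the population quantity $\mathrm{Cov}(X,Y)/\mathrm{Var}(X)$ can be seen in a short simulation. A sketch, using a hypothetical linear model $Y = 2 + 3X + \epsilon$ chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)

def beta_hat(x, y):
    # Sample slope: sum (x_i - xbar)(y_i - ybar) / sum (x_i - xbar)^2
    return np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)

# Hypothetical model with beta_0 = 2, beta = 3, eps ~ N(0,1)
for n in (10, 1_000, 100_000):
    x = rng.normal(0.0, 1.0, n)
    y = 2.0 + 3.0 * x + rng.normal(0.0, 1.0, n)
    print(n, beta_hat(x, y))   # approaches Cov(X,Y)/Var(X) = 3 as n grows
```

As $n \to \infty$ the sums in $\hat{\beta}$ become the expectations in the population formula, which is exactly the LMMSE coefficient.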