I'm currently in a probability class learning about parameter estimation using the maximum likelihood estimator. The problem is as follows: we have a list of independent observations $y_1,\dots,y_n$ that came from some probability distribution $f_Y(y,\lambda)$ with an unknown parameter $\lambda$ (for example, exponential, Gaussian, Poisson, etc.).
We want to estimate the parameter $\lambda$ by maximizing the likelihood of the observations we see. Since all observations are independent, we have probability $P(Y,\lambda)= \prod_{i=1}^{n} f_Y(y_i,\lambda)$. To maximize this, we take the derivative with respect to $\lambda$ and set it to 0. $$ \hat\lambda= \arg \max_{\lambda} \left[ P(Y,\lambda) \right]$$
Something I noticed: in every example of this I've seen so far (only about 2 or 3 now), the end result is the same: the estimate is whatever value of the parameter makes the distribution's expected value $E[Y]$ equal the mean of your observation vector. For example, for an exponential distribution, we get $$\hat\lambda=\frac{1}{\frac{1}{n}\sum_{i}y_i} = \frac{1}{\bar{y}} $$ This makes intuitive sense, because for an exponential distribution, the expected value is $1/\lambda$. My question is this: can you always assume that the mean of your observations is the mean of your probability distribution and just solve for the unknown parameters using that assumption? It works for the few cases I've seen, but I don't know whether it generalizes to any probability distribution. I'm completely new to these topics, so any additional info would be appreciated.
Thanks in advance!
You're noticing that in some cases the MLE is equal to the result of setting the expected value of the observations (e.g. $EY=\frac{1}{\lambda}$) equal to the observed sample mean ($\bar{y}=\frac{1}{n}\sum_{i=1}^ny_i$) and solving for the parameter (e.g. $\hat{\lambda}=\frac{1}{\bar{y}}$). This latter method is called the method of moments (MOM) and does not generally give the same result as MLE. (However, there is a kind of connection between MLE and a generalized MOM.)
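As a quick numerical illustration of the exponential case (a sketch in Python; the true rate, sample size, and grid resolution are assumptions chosen just for the demo), the closed-form MLE $\hat{\lambda}=1/\bar{y}$ agrees with a brute-force maximization of the log-likelihood $\ell(\lambda)=n\log\lambda-\lambda\sum_i y_i$:

```python
import math
import random

random.seed(0)
true_rate = 2.0  # assumed "true" lambda for the simulation
ys = [random.expovariate(true_rate) for _ in range(10_000)]

n, s = len(ys), sum(ys)

# Closed-form MLE for the exponential: lambda-hat = n / sum(y_i) = 1 / sample mean
mle_closed = n / s

# Brute-force check: maximize l(lambda) = n*log(lambda) - lambda*sum(y_i) over a grid
def log_lik(lam):
    return n * math.log(lam) - lam * s

grid = [0.001 * k for k in range(1, 5000)]  # candidate lambdas in (0, 5)
mle_grid = max(grid, key=log_lik)

print(mle_closed, mle_grid)  # both should be close to each other and to true_rate
```

Both estimates land near the true rate, and near each other up to the grid spacing, which is just the exponential instance of the MLE-equals-MOM coincidence described above.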
As an example of how the two may differ, consider $X_1,X_2,X_3$ i.i.d. Uniform$(0,\theta)$. Then $\hat{\theta}_\text{MLE}=\max\{X_1,X_2,X_3\}$, whereas $\hat{\theta}_\text{MOM}=2\bar{X}$.
NB: The MOM estimator may sometimes be nonsensical; e.g., in the above example, if the observations are $(X_1,X_2,X_3)=(1,1,10)$, then $\hat{\theta}_\text{MOM}=2\bar{X}=2\cdot\frac{1+1+10}{3}=8$, even though a value of $10$ was observed!
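The mismatch is easy to check in code. This minimal Python sketch just reproduces the three observations from the NB above:

```python
# Observations from the uniform example above
xs = [1, 1, 10]

# MLE for Uniform(0, theta): the sample maximum
theta_mle = max(xs)

# MOM: solve E[X] = theta/2 = sample mean, giving theta-hat = 2 * mean
theta_mom = 2 * sum(xs) / len(xs)

print(theta_mle, theta_mom)  # prints "10 8.0"
```

The MOM estimate of $8$ falls below the observed value $10$, i.e. it assigns zero probability to a data point that actually occurred, while the MLE cannot do this by construction.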