examples of unbiased, biased, high variance, low variance estimator


I have just learned about bias and variance in machine learning and statistics, but I still don't understand what an estimator with high or low bias, or high or low variance, actually looks like.

If a function overfits the data, it is said to have high variance, but by my reading of the MSE loss formula that shouldn't be so: if the function fits every data point, then the MSE loss is zero, hence the bias and variance are both zero, which contradicts what I have been taught.

Please help me answer this question, and also give me examples of estimators with high/low bias and high/low variance.


So I think your question should be understood as asking for examples of high- and low-variance estimators rather than distributions. I believe you may be confusing, though I could be wrong, sampling distributions with the distribution of residuals.

The bias/variance tradeoff is, in a sense, a false construction. Adding bias does not by itself improve variance; adding information improves variance, but that information is also the source of the bias. I am also going to provide an example where the high-variance estimator is superior to the low-variance estimator, in the common-sense understanding of "superior."

My first example has a real-world analog, but it might be useful to just treat it as an abstraction since we are in the math forum.

The task is to estimate the location and scale parameters of an industrial process where the current value of the mean is bounded in the open set $(-1,1)$ with a known density for $\mu$ of $$\Pr(\mu)= \begin{cases} 1+\mu & \text{if } -1<\mu\le{0} \\ 1-\mu & \text{if } 0<\mu<1 \end{cases} $$ and an unknown variance. In other words, as the machine vibrates, it goes out of perfect calibration and the true mean moves around, according to this density, until recalibrated.
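As a quick sanity check on that density (a sketch, assuming NumPy): the symmetric triangle on $(-1,1)$ with peak at $0$ is exactly the distribution of the difference of two independent uniforms, so it is easy to simulate and to verify its moments.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_mu(n):
    # The density 1 - |mu| on (-1, 1) is the distribution of U1 - U2
    # for independent U1, U2 ~ Uniform(0, 1), so we sample it directly.
    u1, u2 = rng.random(n), rng.random(n)
    return u1 - u2

mu = sample_mu(100_000)
# Known moments of this triangular density: mean 0, variance 1/6.
print(mu.mean(), mu.var())
```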

For a prior, I used $$\Pr(\mu,\sigma)\propto \begin{cases} (1+\mu)/\sigma & \text{if } -1<\mu\le{0} \\ (1-\mu)/\sigma & \text{if } 0<\mu<1. \end{cases} $$

Please note that I did not verify that the posterior integrates to unity, though it should; because $\sigma^{-1}$ is a known reference prior, I cheated a bit. You definitely should perform that integration before using a prior like this one, since the prior itself does not integrate to one. I used the MAP estimator because programming a hill-climbing algorithm is computationally cheap; finding the posterior mean would have slowed the simulation down, and I didn't want to take the time.

The Bayesian estimator is biased. I really should have used the posterior mean, since its loss function is the same as that of the sample mean. The posterior mean is generally more efficient from a Frequentist perspective, and there would be less bias because of the shape of the distributions involved.

The goal, however, was to show you what is going on.
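The simulation code is not shown in the answer, but the MAP step can be sketched as follows. This is a minimal grid-search stand-in for the hill climb described above, assuming a normal data model; the sample, its true parameters, and the grid bounds are all hypothetical choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
# Hypothetical sample: true mu = 0.3, true sigma = 0.5, n = 20.
data = rng.normal(0.3, 0.5, size=20)

# Grid over the parameter space; mu is confined to (-1, 1).
mus = np.linspace(-0.999, 0.999, 400)
sigmas = np.linspace(0.05, 2.0, 400)
M, S = np.meshgrid(mus, sigmas)

# Log-prior from the answer: log((1 - |mu|) / sigma).
log_prior = np.log(1 - np.abs(M)) - np.log(S)
# Normal log-likelihood summed over the sample (constants dropped).
log_like = -len(data) * np.log(S) \
    - ((data[:, None, None] - M) ** 2).sum(axis=0) / (2 * S**2)

# The MAP estimate is the grid point maximizing the log-posterior.
i, j = np.unravel_index(np.argmax(log_prior + log_like), M.shape)
mu_map, sigma_map = M[i, j], S[i, j]
print(mu_map, sigma_map)
```

The triangular prior shrinks the MAP estimate of $\mu$ slightly toward zero relative to the sample mean, which is the regularization effect discussed below.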

The first image is of the sampling distribution of the estimator of the scale parameter. If the posterior mean of the variance had been used, it would have been narrower and slightly to the right. It may have been somewhat different in shape as well.

[Figure: sampling distribution of the estimator of the scale parameter]

Even though the prior provided no information about the variance, observe that providing information about the mean had the effect of regularizing the posterior estimate of the variance as well.

The sampling distribution of the mean should be the triangle created by the underlying process. Note that the sampling distribution of the MAP estimator rises above one, which is the asymptotic height of the triangle's vertex, while the Frequentist estimator is somewhat like a lump.

[Figure: sampling distributions of the MAP and Frequentist estimators of the mean]

The distribution of the actual set of means used in the simulation is roughly a triangle, but a bit too short.

Looked at together, the regularization of the Bayesian method tends to pull estimates toward the center, making its distribution a bit too tall, while the lack of regularization flattens the Frequentist estimator, with some estimates of the mean falling outside the viable range.

[Figure: the three distributions compared]

The concern of the posterior point estimates is not to create a sampling distribution but to estimate a location. In this case, the true mean for each sample was drawn from the distribution above. The real question is $\mu-\hat{\mu}$, the offset of the estimate from the true value. There is a slight improvement in precision with the Bayesian estimator over the Frequentist estimator.

[Figure: distribution of the errors $\mu-\hat{\mu}$ for the two estimators]

Now, let us consider another problem that has been simplified, but is related to a real problem in finance. I also made a few small changes to make it applicable to another real problem in another field.

The problem happens on a roulette wheel, numbered 0 to 40 with no 00. The wheel is spun in a room that you cannot see; then two coins are tossed. For each coin that comes up "heads," the result is reported as $(\theta+1)\bmod{41}$; otherwise it is reported as $(\theta-1)\bmod{41}$. We will assume the coins are fair.

So, if $\theta=3$, the sample space is $\{(2,2),(2,4),(4,2),(4,4)\}.$ The minimum-variance unbiased estimator is the sample mean. The sample means that map to the samples above are $\{2,3,3,4\}.$ The variance of this estimator is $(1+0+0+1)/4=1/2.$ If you were to gamble on the outcome with a $1:1$ payout, your expected value would be \$0.00.
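That enumeration can be checked directly. A small sketch (taking the wheel's 41 pockets modulo 41, with $\theta=3$ as in the text):

```python
from itertools import product

theta = 3
# The two equally likely reports for each toss: theta - 1 or theta + 1.
reports = [(theta - 1) % 41, (theta + 1) % 41]

# Enumerate the four equally likely two-toss samples and their sample means.
samples = list(product(reports, repeat=2))
means = [(a + b) / 2 for a, b in samples]

bias = sum(means) / 4 - theta
variance = sum((m - theta) ** 2 for m in means) / 4
print(means, bias, variance)  # means {2, 3, 3, 4}, bias 0, variance 1/2
```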

The Bayesian estimator depends on the likelihood function. The prior would be $$\Pr(\theta=k)=\frac{1}{41},\quad 0\le{k}\le{40},\ k\in\mathbb{Z}.$$ The likelihood puts a fifty percent chance on the observation being one unit above $\theta$ and a fifty percent chance on it being one unit below, modulo 41, and zero chance everywhere else.

When the sample is $(2,2)$ then the posterior gives a fifty percent mass to $1$ and $3$ each. When it is $(4,4)$, there is fifty percent mass on $3$ and $5$. Otherwise, one hundred percent of the mass is on $3$. The rational Bayesian procedure in the tied case is to toss a fair coin and let the coin decide the point estimator. In that scenario, $1/8$th of the time the estimator will be off by two units. The variance would be $(2^2)/8+(2^2)/8=1.$

The Bayesian estimator would be correct 75% of the time, but very wrong 25% of the time. It is unbiased, but it does not minimize the variance because there is no support in the posterior for $\bar{x}$ when the observations are equal.
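The 75% accuracy and the variance of 1 can be confirmed by simulation. This is a sketch of the tie-breaking procedure described above, holding $\theta=3$ fixed (an interior value, so the modular wraparound can be ignored):

```python
import random

random.seed(0)
theta = 3
trials = 100_000
hits = 0
sq_err = 0.0

for _ in range(trials):
    # Two independent reports: theta plus or minus 1, each with probability 1/2.
    x1 = theta + random.choice((-1, 1))
    x2 = theta + random.choice((-1, 1))
    if x1 != x2:
        # All posterior mass sits on the midpoint of the two reports.
        est = (x1 + x2) // 2
    else:
        # Tie: posterior splits its mass between x1 - 1 and x1 + 1,
        # so a fair coin decides the point estimate.
        est = x1 + random.choice((-1, 1))
    hits += est == theta
    sq_err += (est - theta) ** 2

print(hits / trials, sq_err / trials)  # roughly 0.75 and 1.0
```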

If you were gambling at odds that are fair for the Frequentist estimator ($1:1$), the Bayesian would expect to win \$4 over every eight bets: $1+1+1+1+1+1-1-1$.

Finally, you can see the information loss in moving from the mean to the median for data drawn from a standard normal distribution. The median makes less use of the information in the sample and, as such, has a sampling distribution with higher variance.
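A quick way to see this yourself (a sketch; the sample size and replication count are arbitrary choices):

```python
import numpy as np

rng = np.random.default_rng(0)
n, reps = 101, 20_000

# Draw many samples from a standard normal and compare the spread of the
# two estimators' sampling distributions.
samples = rng.standard_normal((reps, n))
var_mean = samples.mean(axis=1).var()
var_median = np.median(samples, axis=1).var()

# For normal data the ratio tends to pi/2 (about 1.57) as n grows.
print(var_median / var_mean)
```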

[Figure: sampling distributions of the median and the mean]

The bias/variance tradeoff is about long-run performance over many samples and is not about performance on a specific sample. The in-sample MSE loss is a separate discussion.