I have this graph here:
https://i.stack.imgur.com/2Zxjx.jpg
https://github.com/martin-varbanov96/fmi-fall-2016/blob/master/monte_python/hw_1/b1.py
It represents a Markov-chain sampling algorithm for estimating $\pi$ with 500 trials.
My question is: why do very small and very large values of delta yield a less precise result than intermediate values?
This method is a Metropolis-Hastings algorithm for sampling from the uniform distribution on the square with side length 2. This assigns a density of $1/4$ to each point inside the square and $0$ to each point outside the square. To build a MH algorithm to sample from this distribution, first you need a chain which can eventually reach any point in the square. Your underlying chain (drawing points uniformly from a square of side length $2 \delta$ centered at the current point) is such a chain. Then you choose a corresponding acceptance probability, so that the resulting chain is reversible with respect to the desired equilibrium distribution. In this case, this reversibility is attained if you simply always accept a candidate inside the square and reject a candidate outside the square.
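To make this concrete, here is a minimal sketch of such a sampler (not the linked script; the function name and parameters are my own). Proposals leaving the square are rejected, in which case the chain stays put, and the current point still counts as a sample:

```python
import random

def estimate_pi(delta, n_trials=500, seed=0):
    """Metropolis-Hastings on the uniform distribution over [-1, 1]^2;
    pi is estimated as 4 * (fraction of samples inside the unit circle)."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0  # start at the center of the square
    hits = 0
    for _ in range(n_trials):
        # propose a uniform move within a (2*delta)-square around the current point
        nx = x + rng.uniform(-delta, delta)
        ny = y + rng.uniform(-delta, delta)
        # accept iff the candidate stays inside the square;
        # on rejection the chain stays where it is
        if abs(nx) <= 1 and abs(ny) <= 1:
            x, y = nx, ny
        # the current state (moved or not) counts as a sample
        if x * x + y * y <= 1:
            hits += 1
    return 4.0 * hits / n_trials

print(estimate_pi(delta=1.0, n_trials=100_000))
```

Note that counting the repeated state after a rejection is essential: dropping rejected samples would bias the chain away from the boundary region and break the uniform equilibrium.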
Now, the convergence rate of a MCMC algorithm, including Metropolis-Hastings, largely hinges on the decorrelation time of the chain being sufficiently short. This requires the proposal chain to be both "ambitious", so that there is some chance for a significant distance to be covered by the proposal chain in a short time, and "conservative", so that a reasonable fraction of the proposed transitions are accepted. The larger $\delta$ is, the more ambitious (and hence less conservative) the proposal chain is. Thus the convergence rate is best for intermediate values of $\delta$.
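You can see the "ambitious vs. conservative" trade-off directly by measuring the acceptance rate for different $\delta$ (a quick sketch, with my own hypothetical function name):

```python
import random

def acceptance_rate(delta, n_steps=50_000, seed=1):
    """Fraction of proposed moves accepted by the chain on [-1, 1]^2."""
    rng = random.Random(seed)
    x, y = 0.0, 0.0
    accepted = 0
    for _ in range(n_steps):
        nx = x + rng.uniform(-delta, delta)
        ny = y + rng.uniform(-delta, delta)
        # the proposal is accepted iff it lands inside the square
        if abs(nx) <= 1 and abs(ny) <= 1:
            x, y = nx, ny
            accepted += 1
    return accepted / n_steps

for d in (0.05, 1.0, 20.0):
    print(d, acceptance_rate(d))
```

For tiny $\delta$ nearly every move is accepted but covers almost no distance; for huge $\delta$ the acceptance probability collapses to roughly $4/(4\delta^2) = 1/\delta^2$ (the square's area over the proposal window's area), so the chain barely moves at all. Intermediate $\delta$ gets useful movement at a reasonable acceptance rate.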
Making this rigorous is not so simple, because this problem is in continuous space. It requires you to estimate the largest nontrivial eigenvalue of the transition probability kernel, which in this case is the integral operator $(Pf)(x)=\frac{4\delta^2-A(x)}{4\delta^2} f(x)+\frac{1}{4\delta^2} \int_{\| x-y \|_\infty \leq \delta,\| y \|_\infty \leq 1} f(y) dy$, where $A(x)=\int_{\| x-y \|_\infty \leq \delta,\| y \|_\infty \leq 1} 1 dy$.
In this case of estimating $\pi$, you can get a limited idea of what's going on by imagining starting your chain at the center of the square and measuring how long it takes to get out of the circle. With small enough steps, most steps are accepted, so it takes roughly $O \left ( \frac{1}{\delta^2} \right )$ time to get out of the circle (the quadratic dependence arises because successive steps tend to cancel each other out, the mean velocity being zero over most of the square, so the chain only spreads diffusively). With large steps, the first accepted step tends to take us out of the circle, but a step is only accepted with probability about $1/\delta^2$, so it takes $O(\delta^2)$ steps to get out of the circle. If we assume the "constants" in these $O()$ expressions are comparable, the two balance when $\delta=1$, so the best $\delta$ should be in the general vicinity of $1$. Of course this assumption might be completely wrong, but it is actually quite often reasonable, as you can see from your data.
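This escape-time heuristic is easy to simulate (a rough illustrative sketch, with my own function name, not a rigorous mixing-time estimate):

```python
import random

def mean_exit_time(delta, n_runs=200, seed=2):
    """Average number of MH steps to leave the unit circle,
    starting from the center of the square [-1, 1]^2."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_runs):
        x, y = 0.0, 0.0
        steps = 0
        while x * x + y * y <= 1:
            nx = x + rng.uniform(-delta, delta)
            ny = y + rng.uniform(-delta, delta)
            if abs(nx) <= 1 and abs(ny) <= 1:
                x, y = nx, ny
            steps += 1  # rejected steps also cost time
        total += steps
    return total / n_runs

for d in (0.1, 1.0, 10.0):
    print(d, mean_exit_time(d))
```

The exit time should be large for both small $\delta$ (diffusive wandering) and large $\delta$ (near-universal rejection), with a minimum somewhere near $\delta = 1$, matching the pattern in your plot.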