Getting negative value in variance formula calculation


I want to derive a formula for the variance of this random variable: $$ F=\sum\limits_{i=1}^{N} \left ( \frac{f_i}{n} - w_i \right )^{2} $$

Some background on the experimental setup: suppose we performed an experiment and have $n$ data values bounded in the interval $(x_\text{min},x_\text{max})$. We divide that interval into $N$ bins and count the frequency $f_i$ of the data in each bin. Here $w_i$ is the expected relative frequency for the $i^\text{th}$ bin.

This is the first step of my calculation: \begin{align} V(n) &= Var\left [ F \right ] \\ &= Var\left [ \sum_{i=1}^{N}\left ( \frac{f_i}{n}-w_i \right )^{2} \right ] \\ &= E\left [ \left ( \sum_{i=1}^{N}\left ( \frac{f_i}{n}-w_i \right )^{2} \right )^2 \right ]-\left ( E\left [ \sum_{i=1}^{N}\left ( \frac{f_i}{n}-w_i \right )^{2} \right ] \right )^{2} \\ &= E\left [ \sum_{i=1}^{N}\sum_{j=1}^{N}\left ( \frac{f_i}{n}-w_i \right )^{2}\left ( \frac{f_j}{n}-w_j \right )^{2} \right ]-\left ( E\left [ F(n) \right ] \right )^{2} \\ &= \sum_{i=1,j=1}^{N,N}E\left [ \left ( \frac{f_i}{n}-w_i \right )^{2}\left ( \frac{f_j}{n}-w_j \right )^{2} \right ]-\left ( E\left [ F(n) \right ] \right )^{2} \end{align} I have calculated $E[F]$ separately, $$ E\left [ F \right ]=\frac{1}{n}\sum_{i=1}^{N}\left ( w_i-w_i^{2} \right ) \;, $$ and this formula agrees with my simulation results.
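The agreement between the $E[F]$ formula and simulation can be reproduced with a small Monte Carlo sketch. The values of `n`, `w`, the trial count and the seed below are illustrative assumptions, not the values from the actual experiment, and the helper name `simulate_mean_F` is made up for this example.

```python
import random

def simulate_mean_F(n, w, trials, seed=0):
    """Monte Carlo estimate of E[F], where F = sum_i (f_i/n - w_i)^2
    and the bin counts f are drawn from Multinomial(n, w)."""
    rng = random.Random(seed)
    bins = range(len(w))
    total = 0.0
    for _ in range(trials):
        counts = [0] * len(w)
        # Drop n data points into bins with probabilities w.
        for k in rng.choices(bins, weights=w, k=n):
            counts[k] += 1
        total += sum((c / n - wi) ** 2 for c, wi in zip(counts, w))
    return total / trials

# Hypothetical small example (4 bins, 50 data points).
n = 50
w = [0.1, 0.2, 0.3, 0.4]
estimate = simulate_mean_F(n, w, trials=20000)
predicted = sum(wi - wi ** 2 for wi in w) / n   # formula for E[F] quoted above
print(estimate, predicted)
```

The two printed numbers should agree up to Monte Carlo noise, which shrinks as the number of trials grows.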

Now, \begin{align} \left ( \frac{f_i}{n}-w_i \right )^{2}\left ( \frac{f_j}{n}-w_j \right )^{2} &= \frac{f_i^{2}f_j^{2}}{n^{4}}-\frac{2w_jf_i^{2}f_j}{n^{3}} \\ &+ \frac{w_j^{2}f_i^{2}}{n^{2}}-\frac{2w_if_if_j^{2}}{n^3} \\ &+ \frac{4w_iw_jf_if_j}{n^2}-\frac{2w_iw_j^{2}f_i}{n} \\ &+ \frac{w_i^2f_j^2}{n^2}-\frac{2w_i^2w_jf_j}{n}+w_i^2w_j^2 \end{align}

So, \begin{align} V(n) &= \frac{1}{n^4}\sum_{i,j}^{.} E[f_i^2f_j^2]-\frac{2}{n^3}\sum_{i,j}^{.}w_jE[f_i^2f_j] \\ &+ \frac{1}{n^2}\sum_{i,j}^{.}w_j^2E[f_i^2]-\frac{2}{n^3}\sum_{i,j}^{.}w_iE[f_if_j^2] \\ &+ \frac{4}{n^2}\sum_{i,j}^{.}w_iw_jE[f_if_j]-\frac{2}{n}\sum_{i,j}^{.}w_iw_j^2E[f_i] \\ &+ \frac{1}{n^2}\sum_{i,j}^{.}w_i^2E[f_j^2]-\frac{2}{n}\sum_{i,j}^{.}w_i^2w_jE[f_j] \\ &+ \sum_{i,j}^{.}w_i^2w_j^2-\frac{1}{n^2}\left [ \sum_{i}^{.}(w_i-w_i^2) \right ]^2 \end{align}

From the binomial distribution we know that $E(f_i)=nw_i$ and $E(f_i^2)=n(n-1)w_i^2+nw_i$, and from the multinomial distribution we can calculate \begin{align} E(f_if_j) &= n(n-1)w_iw_j \\ E(f_i^{2}f_j) &= n(n-1)(n-2)w_i^{2}w_j+n(n-1)w_iw_j \\ E(f_if_j^{2}) &= n(n-1)(n-2)w_iw_j^{2}+n(n-1)w_iw_j \\ E(f_i^{2}f_j^{2}) &= n(n-1)(n-2)(n-3)w_i^{2}w_j^{2}+n(n-1)(n-2)w_i^{2}w_j \\ &+ n(n-1)(n-2)w_iw_j^{2}+n(n-1)w_iw_j \end{align}
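One way to sanity-check these moment formulas is exact enumeration on a tiny multinomial. The sketch below (with hypothetical values `n = 6`, `w = [0.2, 0.3, 0.5]` and a made-up helper `expectation`) compares each mixed-moment formula against the exact expectation for two distinct bins $i \neq j$, which is the case the standard multinomial factorial-moment results cover. It also evaluates the case $i = j$ separately, where $f_if_j$ is just $f_i^2$ and the binomial result $n(n-1)w_i^2+nw_i$ applies.

```python
from itertools import product
from math import factorial, isclose

def expectation(g, n, w):
    """Exact E[g(f)] for f ~ Multinomial(n, w), summing over all outcomes."""
    total = 0.0
    for counts in product(range(n + 1), repeat=len(w)):
        if sum(counts) != n:
            continue
        coef = factorial(n)          # builds n! / (c_1! c_2! ...)
        prob = 1.0
        for c, wi in zip(counts, w):
            coef //= factorial(c)
            prob *= wi ** c
        total += g(counts) * coef * prob
    return total

# Hypothetical small case; i and j are two distinct bins.
n, w = 6, [0.2, 0.3, 0.5]
i, j = 0, 1
wi, wj = w[i], w[j]
n2 = n * (n - 1)
n3 = n2 * (n - 2)
n4 = n3 * (n - 3)

checks = {
    "E[f_i f_j]": (expectation(lambda f: f[i] * f[j], n, w),
                   n2 * wi * wj),
    "E[f_i^2 f_j]": (expectation(lambda f: f[i] ** 2 * f[j], n, w),
                     n3 * wi ** 2 * wj + n2 * wi * wj),
    "E[f_i^2 f_j^2]": (expectation(lambda f: f[i] ** 2 * f[j] ** 2, n, w),
                       n4 * wi ** 2 * wj ** 2 + n3 * wi ** 2 * wj
                       + n3 * wi * wj ** 2 + n2 * wi * wj),
}
for name, (exact, formula) in checks.items():
    print(name, exact, formula, isclose(exact, formula))

# Same bin twice (i = j): compare against the binomial second moment
# and against the i != j rule applied with j = i.
diag_exact = expectation(lambda f: f[i] * f[i], n, w)
diag_binomial = n2 * wi ** 2 + n * wi
diag_offdiag_rule = n2 * wi * wi
print("E[f_i f_i]", diag_exact, diag_binomial, diag_offdiag_rule)
```

Because a double sum over all $i,j$ includes the terms with $i=j$, it may be worth checking which of the two expressions is being used for those terms.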

Please refer here for further clarification of above results.

If I put all the values into the equation of $V(n)$, I get this [I have checked the calculations between the above step of the $V(n)$ formula and the below step, and I am sure that there is no mistake]: \begin{align} V(n) &= \left ( \frac{3}{n^2} -\frac{6}{n^3}\right )\sum_{i,j}^{.}w_i^2w_j^2+\left ( \frac{4}{n^3}-\frac{2}{n^2} \right )\sum_{i,j}^{.}w_iw_j^2 \\ &+ \left ( \frac{1}{n^2}-\frac{1}{n^3} \right )\sum_{i,j}^{.} w_iw_j-\frac{1}{n^2}\left [ \sum_{i}^{.}\left ( w_i-w_i^2 \right ) \right ]^2 \end{align}

In simplified form this becomes $$ V(n) = \left ( \frac{2}{n^2} -\frac{6}{n^3}\right )\sum_{i,j}^{.}w_i^2w_j^2+\left ( \frac{4}{n^3} \right )\sum_{i,j}^{.}w_iw_j^2 - \left ( \frac{1}{n^3} \right )\sum_{i,j}^{.} w_iw_j $$

The problem is that when I plug in the $w_i$ values from my simulation, the variance comes out negative, because the final formula becomes $$ V(n)=\frac{1.2209 \times 10^{-6}}{n^2}-\frac{0.9960}{n^3} \;. $$ But a variance cannot be negative: by definition it is a weighted sum of $(x-\bar{x})^2$. I suspect I made some trivial mistake, but I have checked the steps of my calculation many times and could not find any error.

Here, $N=5000$ and the maximum value of $w_i$ is of the order of $10^{-2}$.
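The non-negativity argument can be checked exactly on a small case. The following sketch enumerates every multinomial outcome for hypothetical values `n = 6`, `w = [0.2, 0.3, 0.5]` (chosen only so that the enumeration stays tiny) and computes the mean and variance of $F$ directly from the definition, without using any moment formulas. The function names are made up for this example.

```python
from itertools import product
from math import factorial

def outcomes(n, w):
    """Yield (counts, probability) for every outcome of Multinomial(n, w)."""
    for counts in product(range(n + 1), repeat=len(w)):
        if sum(counts) != n:
            continue
        coef = factorial(n)          # builds n! / (c_1! c_2! ...)
        prob = 1.0
        for c, wi in zip(counts, w):
            coef //= factorial(c)
            prob *= wi ** c
        yield counts, coef * prob

def exact_mean_var_F(n, w):
    """Exact E[F] and Var[F] for F = sum_i (f_i/n - w_i)^2."""
    mean = 0.0
    second = 0.0
    for counts, p in outcomes(n, w):
        F = sum((c / n - wi) ** 2 for c, wi in zip(counts, w))
        mean += p * F
        second += p * F * F
    return mean, second - mean ** 2

# Hypothetical small case, not the N = 5000 setup of the question.
n, w = 6, [0.2, 0.3, 0.5]
mean_F, var_F = exact_mean_var_F(n, w)
print("E[F]   =", mean_F)
print("Var[F] =", var_F)
```

A closed-form candidate for $V(n)$ can then be compared against `var_F` for the same `n` and `w`; any disagreement on such a small case points at the derivation rather than at the simulation.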

Some interesting observations:

When I derived the formula for $N=10$ (i.e. for 10 bins), I got

$$ V(n)=\frac{0.4185}{n^2}-\frac{0.4257}{n^3} $$

and this formula matches the experimental results.

If I denote $A=\sum_{i,j}^{.}w_i^2w_j^2$; $B=\sum_{i,j}^{.}w_i^2w_j$; $C=\sum_{i,j}^{.}w_iw_j$; $D=\sum_{i,j}^{.} w_i^2$; $E=\sum_{i,j}^{.} w_i$, then the values of those quantities for $N = 10, 5000, 10000, 20000$ bins are attached below:

Values of A,B,C,D,E

As you can see, when the number of bins increases, $A$, $B$, and $D$ all become negligible with respect to $C$ and $E$. This is basically the main reason for getting a negative value from the variance formula.
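This scaling can be made concrete by evaluating the coefficients of $1/n^2$ and $1/n^3$ in the simplified $V(n)$ expression derived above. The sketch assumes uniform weights $w_i = 1/N$, which is an illustrative assumption only (the actual $w_i$ are not uniform), and uses the fact that each double sum factorises into products of $\sum_i w_i$ and $\sum_i w_i^2$. The function name is made up for this example.

```python
def formula_coefficients(w):
    """Coefficients (c2, c3) such that the simplified expression reads
    V(n) = c2 / n**2 + c3 / n**3."""
    s1 = sum(w)
    s2 = sum(wi * wi for wi in w)
    A = s2 * s2      # sum_{i,j} w_i^2 w_j^2
    B = s1 * s2      # sum_{i,j} w_i w_j^2
    C = s1 * s1      # sum_{i,j} w_i w_j
    c2 = 2 * A
    c3 = -6 * A + 4 * B - C
    return c2, c3

# Uniform weights are an assumption made purely for illustration.
for N in (10, 5000, 10000, 20000):
    w = [1.0 / N] * N
    c2, c3 = formula_coefficients(w)
    # The expression c2/n^2 + c3/n^3 is negative whenever n < -c3 / c2.
    print(N, c2, c3, -c3 / c2)

c2_5000, c3_5000 = formula_coefficients([1.0 / 5000] * 5000)
```

Under this assumption the $1/n^2$ coefficient shrinks like $2/N^2$ while the $1/n^3$ coefficient stays close to $-1$, so the expression is negative for every $n$ below the printed threshold.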

I will be grateful if someone will provide a satisfactory answer.


There is 1 solution below


The random variable $F$ (the notation $F(n)$ doesn't make sense) is defined using a summation over $N$ samples "$f_i$". To evaluate the mean and variance of $F$, you will need a summation over $M$ samples "$F_j$", i.e. $NM$ samples $f_{ij}$ in total. This does not appear in your computation.