approximate mean absolute deviation using histogram

315 Views Asked by At

Given a sequence $A= a_1, a_2, \cdots,a_i, \cdots,a_n$, where $a_i \in \mathbb{R}$.

We can calculate the MAD(mean absolute deviation) via $G(A) = \frac{1}{n} \sum_{i=1}^n|a_i - \textbf{median}(A)|$, where $|.|$ is absolute value.

On the other hand, we can approximate the calculation of $G(A)$ using a $k$ bin histogram.

Compute a histogram of $A$ with $k$ bin and obtain $W=(w_1, \cdots, w_i, \cdots, w_k)$ and edges $E=(e_0, e_1, \cdots, e_k)$.

using $W, E$, we can compute an approximate version of $G(A)$, called $S(W, E)$.

I am wondering how can one bound the difference between $G(A)$ and $S(W, E)$?

And apparently, $k$ is an important factor here. With $k > n$, the two would be equvilant, i probably shouldn't use the histogram approach

But normally we would choose $k<<n$, what would be the consequences?

As Ian pointed out, there are different ways to define $S(W, E)$, e.g. 1, the exact MAD of the distribution with PDF given by the histogram. Or 2, collapse each bin into its midpoint and use a discrete distribution instead.

My goal is to find an as accurate as possible approximation, so whichever strategy is accurate would be the ideal implementation. Meanwhile, I also want to have some error bound, so if whichever is easy to have an error bound is also appreciated.

1

There are 1 best solutions below

0
On

Say you define the approximate MAD by $\tilde{M}=\frac{1}{n} \sum_{i=1}^n |m_{j(i)}-m|$ where

  • $b_j$ is the midpoint of bin $j$
  • $j(i)$ sends $i$ to the index of the bin where $a_i$ is located
  • $m$ is (for simplicity) the exact median. Again for simplicity let us say $n$ is odd so $m=a_{(n+1)/2}$.

For simplicity let us also say that the median is one of the edges, say $m=e_{k^*}$. Then

Then

$$\tilde{M}=\frac{1}{n} \sum_{i=1}^{(n-1)/2} -(m_{j(i)}-m) + \frac{1}{n} \sum_{i=(n+1)/2}^n (m_{j(i)}-m).$$

Meanwhile

$$M=\frac{1}{n} \sum_{i=1}^{(n-1)/2} -(a_i-m) + \frac{1}{n} \sum_{i=(n+1)/2}^n a_i-m.$$

It follows that the difference is

$$\frac{1}{n} \sum_{i=1}^{(n-1)/2} -m_{j(i)}+a_i + \frac{1}{n} \sum_{i=(n+1)/2}^{n} a_i-m_{j(i)}.$$

Thus in absolute value it is less than the average bin radius. This bound is not at all guaranteed to be tight, though.