Given a sequence $A= a_1, a_2, \cdots,a_i, \cdots,a_n$, where $a_i \in \mathbb{R}$.
We can calculate the MAD(mean absolute deviation) via $G(A) = \frac{1}{n} \sum_{i=1}^n|a_i - \textbf{median}(A)|$, where $|.|$ is absolute value.
On the other hand, we can approximate the calculation of $G(A)$ using a $k$ bin histogram.
Compute a histogram of $A$ with $k$ bin and obtain $W=(w_1, \cdots, w_i, \cdots, w_k)$ and edges $E=(e_0, e_1, \cdots, e_k)$.
using $W, E$, we can compute an approximate version of $G(A)$, called $S(W, E)$.
I am wondering how can one bound the difference between $G(A)$ and $S(W, E)$?
And apparently, $k$ is an important factor here. With $k > n$, the two would be equvilant, i probably shouldn't use the histogram approach
But normally we would choose $k<<n$, what would be the consequences?
As Ian pointed out, there are different ways to define $S(W, E)$, e.g. 1, the exact MAD of the distribution with PDF given by the histogram. Or 2, collapse each bin into its midpoint and use a discrete distribution instead.
My goal is to find an as accurate as possible approximation, so whichever strategy is accurate would be the ideal implementation. Meanwhile, I also want to have some error bound, so if whichever is easy to have an error bound is also appreciated.
Say you define the approximate MAD by $\tilde{M}=\frac{1}{n} \sum_{i=1}^n |m_{j(i)}-m|$ where
For simplicity let us also say that the median is one of the edges, say $m=e_{k^*}$. Then
Then
$$\tilde{M}=\frac{1}{n} \sum_{i=1}^{(n-1)/2} -(m_{j(i)}-m) + \frac{1}{n} \sum_{i=(n+1)/2}^n (m_{j(i)}-m).$$
Meanwhile
$$M=\frac{1}{n} \sum_{i=1}^{(n-1)/2} -(a_i-m) + \frac{1}{n} \sum_{i=(n+1)/2}^n a_i-m.$$
It follows that the difference is
$$\frac{1}{n} \sum_{i=1}^{(n-1)/2} -m_{j(i)}+a_i + \frac{1}{n} \sum_{i=(n+1)/2}^{n} a_i-m_{j(i)}.$$
Thus in absolute value it is less than the average bin radius. This bound is not at all guaranteed to be tight, though.