Since there's some controversy about the definition of the weighted median, I wonder whether what I'm trying to do is even possible:
I have a large 2D matrix
[[ 1.1, 7.8, 3.3, 4.9], #<- row 1
[ 6.1, 9.8, 5.3, 7.9], #<- row 2
[ 4.1, 4.8, 3.3, 7.1], #<- row 3
...
[ 1.1, 7.4, 3.1, 4.9],
[ 7.1, 3.8, 7.3, 8.1],
[ 19.1, 2.8, 3.2, 1.1]] #<- row n
and a set of weights, one for every position in a row (each row is 4 elements long, so there are 4 weights):
[0.64, 0.79, 0.91, 0]
Now: how does one calculate a weighted median for every single row, when the row elements are weighted by the weights list (warning: NOT multiplied)? Like this: the first element always has an impact of 0.64, the second of 0.79, ... and the last one has no impact (zero). By impact we mean a measure of occurrence.
How to do that efficiently?
Do the following for each row:
Here is some Python code to illustrate what I mean. The script below prints the median of the first row of the matrix.
Corrected code:
In detail, here's what the code does for the example presented.
First, get the normalized weight array `wgt`, which is `[0.274, 0.338, 0.389, 0]`. You can verify that these weights sum to $1$ (up to rounding).
Next, we take the (first) row `mat[:,0]` and produce `list(zip(mat[:,0],wgt))`, which is a list of tuples consisting of each row entry paired with its weight. That is, `list(zip(mat[:,0],wgt))` is the array `[(1.1, 0.274), (6.1, 0.338), (4.1, 0.389), (1.1, 0)]`. We sort this so that the row entries are increasing to get `row_tups`, which is `[(1.1, 0), (1.1, 0.274), (4.1, 0.389), (6.1, 0.338)]`.
I interpret the weights in the following way: 0% of the population has value 1.1, another 27.4% of the population has value 1.1, another 38.9% of the population has value 4.1, and another 33.8% of the population has value 6.1.
To compute the median value, I therefore go through the values in increasing order until I hit a total of 50%. In particular, we go through the loop to end up with $$ \text{prev_val} = 1.1, \quad \text{prev_wgt} = 0.274, \quad \text{next_val} = 4.1, \quad \text{next_wgt} = 0.389. $$
Here `next_wgt` is the first weight for which the running total of weights (in this case, $0 + 0.274 + 0.389$) exceeds $0.5 = 50\%$. At this point, I took the median to be the weighted average of the values $1.1$ and $4.1$. However, I think that the more appropriate behavior is to simply state that the median is $4.1$.
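The procedure described above can be sketched roughly as follows. This is only a minimal sketch under a few assumptions, not the listing from this answer: it assumes NumPy is available, the function names `weighted_median_row` and `weighted_median_rows` are hypothetical, and it adopts the convention that the median is the first value (in increasing order) at which the running total of normalized weights exceeds 50%. The second function applies the same idea to all rows at once, which may help with the efficiency concern in the question.

```python
import numpy as np


def weighted_median_row(values, weights):
    """Weighted median of a single row (hypothetical helper).

    Returns the first value, in increasing order, at which the running
    total of the normalized weights exceeds 0.5.
    """
    values = np.asarray(values, dtype=float)
    wgt = np.asarray(weights, dtype=float)
    wgt = wgt / wgt.sum()                      # normalize: weights sum to 1
    # Sort by value, breaking ties by weight (like sorting a list of tuples).
    order = np.lexsort((wgt, values))
    running = np.cumsum(wgt[order])            # running total of weights
    # Index of the first running total strictly greater than 0.5.
    idx = np.searchsorted(running, 0.5, side="right")
    return values[order][idx]


def weighted_median_rows(mat, weights):
    """Same convention, applied to every row of a 2D array at once."""
    mat = np.asarray(mat, dtype=float)
    wgt = np.asarray(weights, dtype=float)
    wgt = wgt / wgt.sum()
    order = np.argsort(mat, axis=1, kind="stable")        # per-row sort order
    sorted_vals = np.take_along_axis(mat, order, axis=1)
    running = np.cumsum(wgt[order], axis=1)               # weights follow their values
    idx = np.argmax(running > 0.5, axis=1)                # first column where total > 0.5
    return sorted_vals[np.arange(mat.shape[0]), idx]


weights = [0.64, 0.79, 0.91, 0]

# The four values used in the worked example above.
example = [1.1, 6.1, 4.1, 1.1]
print(weighted_median_row(example, weights))   # 4.1

# The first three rows of the matrix from the question.
mat = [[1.1, 7.8, 3.3, 4.9],
       [6.1, 9.8, 5.3, 7.9],
       [4.1, 4.8, 3.3, 7.1]]
print(weighted_median_rows(mat, weights))
```

Because the weights are the same for every row, the vectorized version only needs one `argsort` and one `cumsum` over the whole matrix instead of a Python loop per row. If a different convention is preferred (for example, interpolating between the two neighbouring values), only the final selection step would need to change.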
Original code: