weighted median, but manually typed weights, not frequencies


Since there's some controversy about the definition of the weighted median, I wonder whether what I have in mind is even possible:

I have a large 2d matrix

[[ 1.1,  7.8,  3.3, 4.9], #<- row 1
[ 6.1,  9.8,  5.3, 7.9],  #<- row 2
[ 4.1,  4.8,  3.3, 7.1],  #<- row 3
... 
[ 1.1,  7.4,  3.1, 4.9], 
[ 7.1,  3.8,  7.3, 8.1],  
[ 19.1,  2.8,  3.2, 1.1]] #<- row n

and a set of weights, one for each position in a row (each row is 4 elements long, so there are 4 weights):

[0.64, 0.79, 0.91, 0]

Now: how does one calculate a weighted median for every single row, where the row elements are weighted by the weights list (warning: NOT multiplied)? Like this: the first element always has an impact of 0.64, the second of 0.79, ..., and the last one has no impact at all (zero). By impact we mean a measure of occurrence.

How to do that efficiently?

Do the following for each row:

  1. Normalize the weights so they sum to 1.
  2. Sort the row, keeping track of the weight corresponding to each entry.
  3. Go through the sorted list and keep a running tally of the weights. When the tally reaches 0.5, obtain the median from the corresponding entry (or, under the interpolating convention, from the two entries straddling that point).

Here is some Python code to illustrate what I mean. The script below prints the weighted median of the first row of the matrix.

Corrected code:

import numpy as np

wgt = np.array([0.64, 0.79, 0.91, 0])
wgt /= wgt.sum()                     # get normalized weights

mat = np.array([[ 1.1,  7.8,  3.3, 4.9],
                [ 6.1,  9.8,  5.3, 7.9],
                [ 4.1,  4.8,  3.3, 7.1],
                [ 1.1,  7.4,  3.1, 4.9],
                [ 7.1,  3.8,  7.3, 8.1],
                [19.1,  2.8,  3.2, 1.1]])
n, m = mat.shape

row_tups = sorted(zip(mat[0], wgt))  #<- sort the first row's entries in increasing
i = 0                                #   order, keeping track of which entry is
wgt_sum = 0                          #   paired with which weight
while wgt_sum < 0.5:                 #<- stop when the cumulative weight reaches 0.5,
    wgt_sum += row_tups[i][1]        #   corresponding to the 50th percentile
    i += 1
med = row_tups[i-1][0]
print(med)                           # -> 3.3
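The snippet above handles a single row, while the question asks for every row. One way to do that efficiently is to vectorize the same sort-and-accumulate idea with numpy. A sketch; `weighted_median_rows` is a hypothetical helper name, not part of the original answer:

```python
import numpy as np

def weighted_median_rows(mat, wgt):
    """Weighted median of every row: sort each row, carry the per-position
    weights along, and pick the first value whose cumulative (normalized)
    weight reaches 0.5."""
    mat = np.asarray(mat, dtype=float)
    w = np.asarray(wgt, dtype=float)
    w = w / w.sum()                        # normalize once for all rows
    order = np.argsort(mat, axis=1)        # per-row sort order
    sorted_vals = np.take_along_axis(mat, order, axis=1)
    sorted_wgts = w[order]                 # weights follow their entries
    cum = np.cumsum(sorted_wgts, axis=1)   # running weight totals per row
    idx = np.argmax(cum >= 0.5, axis=1)    # first index reaching 0.5
    return sorted_vals[np.arange(mat.shape[0]), idx]

mat = [[ 1.1,  7.8,  3.3, 4.9],
       [ 6.1,  9.8,  5.3, 7.9],
       [ 4.1,  4.8,  3.3, 7.1],
       [ 1.1,  7.4,  3.1, 4.9],
       [ 7.1,  3.8,  7.3, 8.1],
       [19.1,  2.8,  3.2, 1.1]]
print(weighted_median_rows(mat, [0.64, 0.79, 0.91, 0]))
```

This avoids a Python-level loop entirely: the sort, the cumulative sum, and the 0.5-threshold search all run as whole-matrix numpy operations.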

In detail, here's what the code does for the first row of the example, [1.1, 7.8, 3.3, 4.9].

First, get the normalized weight array wgt, which is [0.274, 0.338, 0.389, 0]. You can verify that these weights sum to $1$.

Next, we take the first row mat[0] and produce list(zip(mat[0], wgt)), which is a list of tuples consisting of each row entry paired with its weight. That is, list(zip(mat[0], wgt)) is the array

[(1.1, 0.274),
 (7.8, 0.338),
 (3.3, 0.389),
 (4.9, 0.0)]

We sort this so that the row entries are increasing to get row_tups, which is

[(1.1, 0.274),
 (3.3, 0.389),
 (4.9, 0.0),
 (7.8, 0.338)]

I interpret the weights in the following way: 27.4% of the population has value 1.1, another 38.9% of the population has value 3.3, 0% of the population has value 4.9, and the remaining 33.8% of the population has value 7.8.

To compute the median value, I therefore go through the values in increasing order until I hit a total of 50%. In particular, the original code below goes through the loop to end up with $$ \text{prev\_val} = 1.1, \quad \text{prev\_wgt} = 0.274, \quad \text{next\_val} = 3.3, \quad \text{next\_wgt} = 0.389. $$ next_wgt is the first weight for which the running total of weights (in this case, $0.274 + 0.389$) exceeds $0.5 = 50\%$. At this point, the original code took the median to be the average of the values $1.1$ and $3.3$, weighted by $0.274$ and $0.389$, which gives about $2.39$.

However, I think that the more appropriate behavior is to simply state that the median is $3.3$, the first value at which the cumulative weight reaches $0.5$.
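Since the question defines the weights as a measure of occurrence, this convention can be sanity-checked by literally repeating each entry in proportion to its weight and taking an ordinary median. A sketch for the first row; the factor 100 is an assumption that happens to be exact here because these particular weights have two decimal places:

```python
import numpy as np

row = [1.1, 7.8, 3.3, 4.9]
wgt = [0.64, 0.79, 0.91, 0]

# Repeat each entry in proportion to its weight (100*w copies each) ...
expanded = np.repeat(row, [int(round(100 * w)) for w in wgt])

# ... and take the plain median of the expanded sample.
print(np.median(expanded))   # -> 3.3
```

The expanded sample has 234 entries, and its middle pair both equal 3.3, agreeing with the corrected (non-interpolating) convention.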


Original code:

import numpy as np

wgt = np.array([0.64, 0.79, 0.91, 0])
wgt /= wgt.sum()                     # get normalized weights

mat = np.array([[ 1.1,  7.8,  3.3, 4.9],
                [ 6.1,  9.8,  5.3, 7.9],
                [ 4.1,  4.8,  3.3, 7.1],
                [ 1.1,  7.4,  3.1, 4.9],
                [ 7.1,  3.8,  7.3, 8.1],
                [19.1,  2.8,  3.2, 1.1]])
n, m = mat.shape

row_tups = sorted(zip(mat[0], wgt))  #<- sort the first row's entries in increasing
i = 0                                #   order, keeping track of which entry is
prev_wgt = 0                         #   paired with which weight
next_wgt = 0
wgt_sum = 0
next_val = 0
while wgt_sum <= 0.5:                #<- stop when the cumulative weight exceeds 0.5,
    prev_val, prev_wgt = next_val, next_wgt   # corresponding to the 50th percentile
    next_val, next_wgt = row_tups[i]
    wgt_sum += next_wgt
    i += 1
med = (prev_wgt*prev_val + next_wgt*next_val)/(prev_wgt + next_wgt)
print(med)
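If you do prefer the interpolating convention, the same loop can be packaged as a reusable function and applied to each row in turn. A sketch; `interpolated_weighted_median` is a hypothetical name, not from the original answer, and it assumes at least one weight is positive:

```python
def interpolated_weighted_median(row, wgt):
    # Normalize the weights so they sum to 1, then pair and sort as above.
    total = sum(wgt)
    row_tups = sorted(zip(row, (w / total for w in wgt)))
    i = 0
    prev_val = prev_wgt = next_val = next_wgt = wgt_sum = 0
    while wgt_sum <= 0.5:            # stop once the running total passes 50%
        prev_val, prev_wgt = next_val, next_wgt
        next_val, next_wgt = row_tups[i]
        wgt_sum += next_wgt
        i += 1
    # Weighted average of the two entries straddling the 50% mark.
    return (prev_wgt * prev_val + next_wgt * next_val) / (prev_wgt + next_wgt)

mat = [[ 1.1,  7.8,  3.3, 4.9],
       [ 6.1,  9.8,  5.3, 7.9],
       [ 4.1,  4.8,  3.3, 7.1],
       [ 1.1,  7.4,  3.1, 4.9],
       [ 7.1,  3.8,  7.3, 8.1],
       [19.1,  2.8,  3.2, 1.1]]
wgt = [0.64, 0.79, 0.91, 0]

for row in mat:
    print(interpolated_weighted_median(row, wgt))
```

For the first row this returns about 2.39, the weighted average of 1.1 and 3.3 described above.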