Calculating quantiles of weighted array


First, a proviso: I'm only an aspiring mathematician. Secondly, forgive that I've articulated this question in Python, but it is essentially a computational problem I'm looking to solve.

I wonder if there's an efficient way to calculate quantile breaks for a weighted array (i.e. a NumPy array, so possibly a mathematical vector?). By a weighted array, consider an array x = [x₁, x₂, ..., xₙ] which has a corresponding array of weights w = [w₁, w₂, ..., wₙ]. In my current workflow I unpack x into a new array xw in which each element xᵢ is repeated wᵢ times, and I then calculate its distribution statistics (e.g. quartiles). But the unpacking is very computationally expensive, so I'm exploring the possibility of an alternative.

To describe the problem in Python:

import numpy as np

## random lognormal distribution
x = np.random.lognormal(mean = 7, sigma = 0.7, size = 10000)
x = np.int64(np.ceil(x))

View histogram:

import matplotlib.pyplot as plt
tmp = plt.hist(x, bins=1000, alpha=0.6, color='c', ec='c', log=True)


Apply weights w to array x:

def weighted_array(arr, weights):
    repeated_array = list(map(lambda f, i: [f] * int(i), arr, weights))
    return np.array([item for sublist in repeated_array for item in sublist])
    
weighted_array([6,4,2], [1,3,5])   # for example
#> array([6, 4, 4, 4, 2, 2, 2, 2, 2])

## For simplicity let's weight x array by itself (i.e. w = x)
xw = weighted_array(x, x)
len(xw)
#> 14092084

stats = np.quantile(xw, [.05, .25, .5, .75, .95])
print(stats)
#> [ 563. 1101. 1771. 2854. 5523.]

The process of generating xw is very expensive for large arrays of large weights and easily exhausts system memory. So I wonder: is there a mathematical way to calculate the stats from the original x and w arrays, without having to apply the weights to generate xw?

Thanks in advance!


3 Answers

BEST ANSWER

For simplicity, I'll assume that interpolation isn't needed, and that it suffices to find the individual nearest to the $q^\text{th}$ quantile point, where $0 \leqslant q \leqslant 1.$

Suppose that the population consists of $N$ individuals, sorted in ascending order of the values of some attribute. Suppose that there are $r$ different attribute values, and that $m_i$ individuals have the $i^\text{th}$ value of the attribute, for $i = 1, 2, \ldots, r.$ Then $m_1 + m_2 + \cdots + m_r = N.$

Represent the $k^\text{th}$ individual as the centre of a notional continuous interval $[k - \tfrac12, k + \tfrac12),$ for $k = 1, 2, \ldots, N.$ Then the entire population occupies the interval $[\tfrac12, N + \tfrac12),$ and the $q^\text{th}$ quantile point is at $Nq + \tfrac12.$ We simplistically replace this with the nearest integer, rounding down in the ambiguous case when $Nq$ is an integer. Thus we take the $q^\text{th}$ quantile to be individual number $\left\lfloor{Nq}\right\rfloor + 1,$ for $q \in [0, 1),$ or number $N,$ in the special case $q = 1.$

Define the partial sums $M_i = m_1 + m_2 + \cdots + m_i,$ for $i = 0, 1, \ldots, r.$ These form a strictly increasing sequence $M = (M_0, M_1, \ldots, M_r),$ where $M_0 = 0$ and $M_r = N.$ For $k = 1, 2, \ldots, N,$ therefore, there exists a unique positive integer $i = f(k, M) \leqslant r$ such that $M_{i-1} < k \leqslant M_i.$ That means that the $k^\text{th}$ individual in the population has the $i^\text{th}$ attribute value.

In terms of this function $f,$ if $s$ is the list of attribute values sorted into ascending order, then the $q^\text{th}$ quantile value of the attribute is (ignoring the special case $q = 1$): $$ s[f(\left\lfloor{Nq}\right\rfloor + 1, M)]. $$
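In NumPy terms, $f(k, M)$ is a search over the sorted partial sums for the first $M_i \geqslant k$, which `np.searchsorted` performs in $O(\log r)$. Here is a minimal sketch of the whole formula (the function name `weighted_quantile` is mine, not part of the module below):

```python
import numpy as np

def weighted_quantile(x, w, q):
    """q'th quantile (nearest-individual, no interpolation) of the
    population in which each value x_i occurs with weight w_i."""
    order = np.argsort(x)                  # sort attribute values ascending
    s = np.asarray(x)[order]               # sorted values s
    M = np.cumsum(np.asarray(w)[order])    # partial sums M_1, ..., M_r
    N = M[-1]                              # population size
    k = int(q * N) + 1 if q < 1 else N     # individual nearest the quantile
    i = np.searchsorted(M, k)              # first i with M_i >= k
    return s[i]

weighted_quantile([6, 4, 2], [1, 3, 5], 0.5)
```

With the example weights above, the expanded population is [2, 2, 2, 2, 2, 4, 4, 4, 6], so the median individual (k = 5) has value 2.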

Here's a toy Python 3 module that does the job. I haven't tried it on any large arrays. For all I know, the way I've coded it may use tons of resources. (You'll surely need to recode it anyway, for instance to use interpolation.)

"""Compute quantiles: see https://math.stackexchange.com/q/3721765."""

__all__ = ['weighted']

import math, operator, itertools

class weighted(object):
    """
    Structure of repeated attribute values in ascending order.
    """
    
    def __init__(self, x, w):
        """
        Create sorted data from unsorted attribute values and their "weights".
        """
        self.xs, self.ws = zip(*sorted(zip(x, w), key=operator.itemgetter(0)))
        self.subtotals = list(itertools.accumulate(self.ws))
        self.N = self.subtotals[-1]
    
    def individual(self, q):
        """
        Identify individual member of population nearest to the q'th quantile.
        """
        return math.floor(q * self.N) + 1 if q < 1 else self.N
    
    def attribute(self, k):
        """
        Compute attribute index of k'th individual member of the population.
        """
        for i, M in enumerate(self.subtotals):
            if M >= k:
                return i
    
    def quantile(self, q):
        """
        Compute q'th quantile value of the attribute.
        """
        return self.xs[self.attribute(self.individual(q))]

def main():
    print('median = {}'.format(weighted([6, 4, 2],[1, 3, 5]).quantile(.5)))

if __name__ == '__main__':
    main()

Version 0.2

This is still a toy implementation. In particular, it still might be hugely inefficient (I haven't given any thought to that question), and it still hasn't been tested on any large datasets. What is nice about it is that the new class multilist is obviously capable of being considerably elaborated. (No doubt I'll tinker with it a lot, but there isn't likely to be any good reason to post my tinkerings here.)


"""Lists of items with multiplicity, analogous to multisets."""

__all__ = ['individual', 'multilist', 'quantile']

import math, itertools

def individual(q, N):
    """
    Number (1 to N) of individual near q'th quantile of population of size N.
    """
    return math.floor(q*N) + 1 if q < 1 else N

def quantile(x, q):
    """
    Compute the q'th quantile value of the given *sorted* (N.B.!) multilist x.
    """
    return x[individual(q, len(x))]

class multilist(object):
    """
    List of elements with multiplicity: similar to a multiset, whence the name.
    
    The multiplicity of each element is a positive integer. The purpose of the
    multilist is to behave like a list in which each element occurs many times,
    without actually having to store all of those occurrences.
    """

    def __init__(self, x, w):
        """
        Create multilist from list of values and list of their multiplicities.
        """
        self.items = x
        self.times = w
        self.subtotals = list(itertools.accumulate(self.times))

    def __len__(self):
        """
        Get the number of items in a list with multiplicities.

        The syntax needed to call this function is "len(x)", where x is the
        name of the multilist.
        """
        return self.subtotals[-1]

    def __getitem__(self, k):
        """
        Find the k'th item in a list with multiplicities.

        If the multiplicities are m_1, m_2, ..., m_r (note that Python indices
        are 1 less, running from 0 to r - 1), and subtotals M_0, M_1, ..., M_r,
        where M_i = m_1 + m_2 + ... + m_i (i = 0, 1, ..., r), then we want the
        unique i (but the Python code uses i - 1) such that M_{i-1} < k <= M_i.

        The syntax needed to call this function is "x[k]", where x is the name
        of the multilist, and 1 <= k <= len(x).
        """
        for i, M in enumerate(self.subtotals):
            if M >= k:
                return self.items[i]

    def sorted(self):
        """
        Return a sorted copy of the given multilist.

        Note on the implementation: by default, 2-tuples in Python are compared
        lexicographically, i.e. by the first element, or the second in the case
        of a tie; so there is no need for parameter key=operator.itemgetter(0).
        """
        return multilist(*zip(*sorted(zip(self.items, self.times))))

def main():
    data = multilist([6, 4, 2], [1, 3, 5]).sorted()
    print('median = {}'.format(quantile(data, .5)))

if __name__ == '__main__':
    main()
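One efficiency note: the linear scan in `__getitem__` makes each lookup O(r). Since the subtotals are sorted, the standard-library `bisect` module can find the first $M_i \geqslant k$ in O(log r) instead. A standalone sketch of that replacement (my names, not part of the module above):

```python
import bisect
import itertools

def kth_item(items, times, k):
    """Item at 1-based position k of the expanded list, found without
    expanding it: binary search on the cumulative multiplicities."""
    subtotals = list(itertools.accumulate(times))   # M_1, ..., M_r
    return items[bisect.bisect_left(subtotals, k)]  # first i with M_i >= k

kth_item([2, 4, 6], [5, 3, 1], 5)   # 5th item of [2,2,2,2,2,4,4,4,6]
```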
ANSWER
import numpy as np
your_data    = [ 1.7 , 2.2 , 3.9 ]
your_weights = [ 2 , 1 , 5 ]
xw = np.repeat( your_data , your_weights )

You should obtain that your xw is

[ 1.7 , 1.7 , 2.2 , 3.9 , 3.9 , 3.9 , 3.9 , 3.9 ]

Unfortunately NumPy doesn't have built-in weighted versions of every function, but you can put things together in this way.
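That said, `np.repeat` still materializes the full expanded array, which is what exhausts memory in the question. For quantiles specifically, the repeat step can be skipped entirely: sort, accumulate the weights, and search. A sketch, assuming the nearest-rank quantile definition (so results can differ slightly from `np.quantile`, which interpolates):

```python
import numpy as np

def repeat_free_quantile(data, weights, qs):
    """Quantile breaks of np.repeat(data, weights), computed without
    materializing the repeated array (nearest-rank definition)."""
    order = np.argsort(data)
    s = np.asarray(data, dtype=float)[order]     # sorted values
    cw = np.cumsum(np.asarray(weights)[order])   # cumulative weights
    N = cw[-1]                                   # total weight
    ranks = np.minimum(np.floor(np.asarray(qs) * N).astype(int) + 1, N)
    return s[np.searchsorted(cw, ranks)]         # first index with cw >= rank

repeat_free_quantile([1.7, 2.2, 3.9], [2, 1, 5], [0.05, 0.5, 0.95])
```

This uses O(r) memory for r distinct values, independent of the total weight.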

ANSWER

I'm not sure if this is the best approach, but it is what I came up with.

import numpy as np
import pandas as pd

d = pd.DataFrame(data = {
    'x': np.random.lognormal(mean = 7, sigma = 0.7, size = 1000),
    'wgts': np.int64(np.random.sample(1000) * 1000)
    })

# Sorted-Cumulative-Weight-Breaks™ method

quantile = sum(d['wgts']) / 4
d = d.sort_values('x')                            # sort
d['cum_wgts'] = np.cumsum(d.wgts)                 # cumulative weights
d['qtl'] = np.int64(d.cum_wgts / quantile)        # quantile binning
d['new_qtl'] = np.invert(d.qtl.eq(d.qtl.shift())) # to filter at breaks
quartiles2 = d[d.new_qtl].x

My original method:

def weighted_array(arr, weights):
    repeated_array = list(map(lambda f, i: [f] * int(i), arr, weights))
    return np.array([item for sublist in repeated_array for item in sublist])
    
xw = weighted_array(d.x, d.wgts)
quartiles1 = np.quantile(xw, [0, .25, .5, .75, 1])

Results comparison:

print(np.int64(quartiles1))
print(np.int64(quartiles2))
#> [  170   679  1161  1860 12613]
#> [  170   679  1161  1860 12613]
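An interpolated variant of the same sorted-cumulative-weights idea (my sketch, not the method above): treat the normalized cumulative weights as an empirical CDF and invert it with `np.interp`. The breaks will differ slightly from the nearest-rank values, in the same way `np.quantile`'s interpolation does.

```python
import numpy as np

def interp_weighted_quantile(x, w, qs):
    """Interpolated weighted quantiles: cumulative weights, shifted by half
    a weight (midpoint convention), serve as an empirical CDF."""
    order = np.argsort(x)
    xs = np.asarray(x, dtype=float)[order]
    ws = np.asarray(w, dtype=float)[order]
    cdf = (np.cumsum(ws) - 0.5 * ws) / ws.sum()   # CDF value at each xs
    return np.interp(qs, cdf, xs)                 # invert CDF at each q

interp_weighted_quantile([1, 2, 3], [1, 1, 1], [0.5])
```

With equal weights this reduces to the usual midpoint empirical quantile, so the median of [1, 2, 3] comes out as 2.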

Views much appreciated.