First the proviso I'm only an aspiring mathematician. Secondly forgive how I've articulated this question in Python, but it is essentially a computational problem I'm looking to solve.
I wonder if there's an efficient way to calculate quantile breaks for weighted array (i.e. Numpy array, so possibly mathematical vector?). By weighted array, consider array x = [x₁, x₂, .., xn] which has a corresponding array of weights w = [w₁, w₂, .., wn]. In my current workflow I unpack x into new array xw in which each element xⁱ is repeated wⁱ times, and I then calculate its distribution statistics (e.g quartiles). But the unpacking is very computationally expensive so I'm exploring possibility of an alternative.
To describe the problem in Python (also see here):
import numpy as np
import random
## random log distribution
x = np.random.lognormal(mean = 7, sigma = 0.7, size = 10000)
x = np.int64(np.ceil(x))
View histogram:
import matplotlib
import matplotlib.pyplot as plt
tmp = plt.hist(x, bins=1000, alpha=0.6, color='c', ec='c', log=True)
Apply weights w to array x:
def weighted_array(arr, weights):
repeated_array = list(map(lambda f, i: [f] * int(i), arr, weights))
return np.array([item for sublist in repeated_array for item in sublist])
weighted_array([6,4,2], [1,3,5]) # for example
#> array([6, 4, 4, 4, 2, 2, 2, 2, 2])
## For simplicity let's weight x array by itself (i.e. w = x)
xw = weighted_array(x, x)
len(xw)
#> 14092084
stats = np.quantile(xw, [.05, .25, .5, .75, .95])
print(stats)
#> [ 563. 1101. 1771. 2854. 5523.]
The process of generating xw is very expensive for large arrays of large numbers and easily exhaust system memory. So I wonder if there is a mathematical way to calculate stats from the original x and w arrays without having to apply the weights to generate xw?
Thanks in advance!

For simplicity, I'll assume that interpolation isn't needed, and that it suffices to find the individual nearest to the $q^\text{th}$ quantile point, where $0 \leqslant q \leqslant 1.$
Suppose that the population consists of $N$ individuals, sorted in ascending order of the values of some attribute. Suppose that there are $r$ different attribute values, and that $m_i$ individuals have the $i^\text{th}$ value of the attribute, for $i = 1, 2, \ldots, r.$ Then $m_1 + m_2 + \cdots + m_r = N.$
Represent the $k^\text{th}$ individual as the centre of a notional continuous interval $[k - \tfrac12, k + \tfrac12),$ for $k = 1, 2, \ldots, N.$ Then the entire population occupies the interval $[\tfrac12, N + \tfrac12),$ and the $q^\text{th}$ quantile point is at $Nq + \tfrac12.$ We simplistically replace this with the nearest integer, rounding down in the ambiguous case when $Nq$ is an integer. Thus we take the $q^\text{th}$ quantile to be individual number $\left\lfloor{Nq}\right\rfloor + 1,$ for $q \in [0, 1),$ or number $N,$ in the special case $q = 1.$
Define the partial sums $M_i = m_1 + m_2 + \cdots + m_i,$ for $i = 0, 1, \ldots, r.$ These form a strictly increasing sequence $M = (M_0, M_1, \ldots, M_r),$ where $M_0 = 0$ and $M_r = N.$ For $k = 1, 2, \ldots, N,$ therefore, there exists a unique positive integer $i = f(k, M) \leqslant r$ such that $M_{i-1} < k \leqslant M_i.$ That means that the $k^\text{th}$ individual in the population has the $i^\text{th}$ attribute value.
In terms of this function $f,$ if $s$ is the list of attribute values sorted into ascending order, then the $q^\text{th}$ quantile value of the attribute is (ignoring the special case $q = 1$): $$ s[f(\left\lfloor{Nq}\right\rfloor + 1, M)]. $$
Here's a toy Python 3 module that does the job. I haven't tried it on any large arrays. For all I know, the way I've coded it may use tons of resources. (You'll surely need to recode it anyway, for instance to use interpolation.)
Version 0.2
This is still a toy implementation. In particular, it still might be hugely inefficient (I haven't given any thought to that question), and it still hasn't been tested on any large datasets. What is nice about it is that the new class
multilistis obviously capable of being considerably elaborated. (No doubt I'll tinker with it a lot, but there isn't likely to be any good reason to post my tinkerings here.)I'm not sure how to post code in Maths.SE, so the indentation of the code isn't quite consistent.