Given some statistical measures, reconstruct a list of numbers

Suppose I have a list of numbers $(y_1, y_2, y_3, \dots, y_N)$ with these properties:

$$ \sum_{i=1}^{N}y_i = 13\, 776\, 663, $$ $$ \bar{y} = \dfrac{1}{N} \sum_{i=1}^{N}y_i = 17\,135, $$ $$ s^2 = \dfrac{1}{N-1} \sum_{i=1}^{N}(y_i - \bar{y})^2 = 139\,147^2. $$

The list has the following order statistics:

  • The lowest is $19$.
  • The $5$th percentile is $336$.
  • The $25$th percentile is $800$.
  • The median is $1\,668$.
  • The $75$th percentile is $5\,050$.
  • The $95$th percentile is $30\,295$.
  • The highest is $2\,627\,319$.

These percentiles give some idea of how the numbers are distributed. I can construct a list with that mean and standard deviation as long as it does not matter whether some $y_i$ is negative or whether the list follows the described distribution. The problem I face is constructing a list with mean $\bar y$ and standard deviation $s$ under the condition that every $y_i$ is greater than zero (the list does not have to follow that distribution or contain those particular numbers).

So I am looking for a way to do that. If anybody has any ideas about this, I'm happy to hear them!

There are 3 best solutions below

2
On BEST ANSWER

The solution is probably not unique, and you would want to find one numerically. I would use an approach like the one behind the Datasaurus dataset. The first step is to find $N$. From the first two equations you get $N\approx804$. Since $N$ is not exactly an integer, that is the first indication that these numbers are only approximations. The last equation gives you $\bar{y^2}$, because $s^2 = \frac{N}{N-1}\left(\bar{y^2} - \bar y^2\right)$. Now choose $y_1=19$ and $y_{804}=2627319$. You can then recalculate the targets for $\bar y$ and $\bar{y^2}$ without those two values. Put $401$ values on the median and the remaining $401$ at a single value chosen so that the average (or the sum) equals your desired value. At this point $\bar{y^2}$ is going to be wrong. Move one value from the median down, to somewhere below the 5th percentile. To keep the same average, you must move at least one value from the upper group upward. Check whether moving one value or two values upward brings $\bar{y^2}$ closer to its target. Repeat this procedure until all your conditions are met.
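The first steps of this answer can be sketched in Python. This is only an illustration under stated assumptions: it takes $N = 804$ (the rounded value), so the $802$ values left after fixing the two extremes are split into halves of $401$, and the names `fill`, `start`, and `target_mean_sq` are made up for the example. It builds the starting list and shows how far its spread is from the target; the iterative adjustment itself is not implemented.

```python
import statistics

# Figures quoted in the question.
total = 13_776_663
y_mean = 17_135
y_sd = 139_147
lowest, median, highest = 19, 1_668, 2_627_319

# The ratio is not an exact integer (the statistics are rounded), so round it.
n_exact = total / y_mean
N = round(n_exact)

# Target mean of squares, from s^2 = N/(N-1) * (mean(y^2) - mean(y)^2).
target_mean_sq = (N - 1) / N * y_sd**2 + y_mean**2

# Starting list: the two extremes, half of the rest on the median, and the
# other half on one value chosen so that the sum equals y_mean * N.
half = (N - 2) // 2
fill = (y_mean * N - lowest - highest - half * median) / half
start = [lowest] + [median] * half + [fill] * half + [highest]

mean_sq = sum(v * v for v in start) / N
print("N =", N, "fill value =", fill)
print("mean =", statistics.fmean(start), "sd =", statistics.stdev(start))
print("mean of squares / target =", mean_sq / target_mean_sq)
```

The mean already matches, while the standard deviation of this starting list is below the target, which is why values then have to be pushed outward in both directions.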

0
On

So definitely every number is positive, because the lowest is 19.

This seems tractable to me, assuming it's possible. My recommendation is to start with an arbitrary list that satisfies the listed conditions on the minimum, the percentiles, and the maximum. These can be thought of as fixed "milestones". Then move the other numbers around until you also satisfy the mean and standard deviation.

By choosing which elements to move (i.e. the largest elements, the smallest elements, or ones in the middle) and whether to move them up or down, you can increase or decrease the mean and the standard deviation as necessary. With some thought (or some experimentation), you'll be able to figure out what to do from here.
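A small experiment makes the effect of each kind of move concrete. The sketch below is illustrative only: the seven-element list is just the order statistics from the question, not the real data, and `moved` and `summary` are hypothetical helper names. The underlying rule is that moving an element away from the mean raises the standard deviation, while moving it toward the mean lowers it.

```python
import statistics

def moved(values, index, delta):
    """Return a copy of values with one element shifted by delta."""
    out = list(values)
    out[index] += delta
    return out

def summary(values):
    """Return (mean, sample standard deviation) of values."""
    return statistics.mean(values), statistics.stdev(values)

# Illustrative list only: the seven order statistics quoted in the question.
base = [19, 336, 800, 1668, 5050, 30295, 2627319]

base_mean, base_sd = summary(base)
up_mean, up_sd = summary(moved(base, -1, 1000))    # largest element up
down_mean, down_sd = summary(moved(base, 0, -10))  # smallest element down
mid_mean, mid_sd = summary(moved(base, 3, 1000))   # middle element up

print("largest up:    mean rises,", "sd rises" if up_sd > base_sd else "sd falls")
print("smallest down: mean falls,", "sd rises" if down_sd > base_sd else "sd falls")
print("middle up:     mean rises,", "sd rises" if mid_sd > base_sd else "sd falls")
```

Combining such moves lets the mean and the standard deviation be adjusted separately, for example by pairing an upward move with a downward one so that the mean stays put.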

0
On

Ok, I managed to find an answer to my question. I wanted to do it numerically and I used Python. Here is the code:

import random
import statistics as stat

from scipy.optimize import root

random.seed(210)

N = 804
y_mean = 17_135
y_sd = 139_147
median = 1668
lowest = 19
highest_95 = 30295

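# Candidate values below and above the median (upper bounds are exclusive).
l_nums = []
rango1 = range(lowest, median)
rango2 = range(median, highest_95)

# Half of the values fall below the median.
for _ in range(N // 2):
    numero = random.choice(rango1)
    l_nums.append(numero)

# The other half, minus three slots kept for the maximum and the two solved values.
for _ in range(N // 2 - 3):
    numero = random.choice(rango2)
    l_nums.append(numero)

# The known maximum.
l_nums.append(2627319)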

print(len(l_nums))    
print(stat.mean(l_nums), stat.stdev(l_nums))

#sys.exit('!')

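def equations(x):
    # Residuals of the mean and variance constraints for the two unknowns x[0], x[1].
    a = sum(l_nums)                                # sum of the values already chosen
    b = sum((v - y_mean)**2 for v in l_nums)       # their squared deviations from the mean

    f = [a + x[0] + x[1] - y_mean * N,
         b + (x[0] - y_mean)**2 + (x[1] - y_mean)**2 - y_sd**2 * (N - 1)]

    return f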

x_sol = root(equations, [5e8, 5e8], method='lm')

#print(x_sol)
print(x_sol.fun)
print(x_sol.x)

l_nums.extend(x_sol.x)
print(len(l_nums))    
print(stat.mean(l_nums), stat.stdev(l_nums))

Let me explain the code. First, find $N$; in this case $N = 804$. Create two ranges of numbers, one from $19$ up to the median and the other from the median up to $30295$. In Python:

rango1 = range(lowest, median)
rango2 = range(median, highest_95)

From rango1, draw $N/2$ numbers at random and put them in a list. Then, from rango2, draw $N/2 - 3$ numbers at random and append them to that list. The list now holds $801$ numbers. The highest number, $2\,627\,319$, is known, so add it:

l_nums.append(2627319)

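That leaves two numbers, $x$ and $y$. With $a$ the sum of the $802$ numbers already in the list and $b$ the sum of their squared deviations from $\bar y$, they must satisfy two equations: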

$$\frac{x+y+a}{N}=\bar y,$$

$$\dfrac{(x-\bar y)^2+(y-\bar y)^2+ b}{N-1}=s^2.$$

This system is solved with SciPy. I added the line random.seed(210) to make the results reproducible, although the exact values may still depend on the operating system and the computer. Without that line, the results change from run to run but stay close.
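As an aside, the two equations can also be solved in closed form, without a numerical solver. Writing $u = x - \bar y$ and $v = y - \bar y$, the system gives $u + v = p$ and $u^2 + v^2 = q$, so $(u - v)^2 = 2q - p^2$. The sketch below is an alternative to the SciPy call, not the code used above; the function name `solve_last_two` is made up for this illustration. It does not guarantee that both values come out positive, so that still has to be checked on the result.

```python
import math

def solve_last_two(partial, n, mean, sd):
    """Return the two values that complete `partial` to n items with the
    requested mean and sample standard deviation (N - 1 denominator)."""
    a = sum(partial)                            # sum of the values already chosen
    b = sum((v - mean) ** 2 for v in partial)   # their squared deviations
    p = n * mean - a - 2 * mean                 # (x - mean) + (y - mean)
    q = (n - 1) * sd**2 - b                     # (x - mean)^2 + (y - mean)^2
    disc = 2 * q - p * p                        # equals ((x - mean) - (y - mean))^2
    if disc < 0:
        raise ValueError("no real solution for this partial list")
    root = math.sqrt(disc)
    return mean + (p - root) / 2, mean + (p + root) / 2

# Toy check: [1, 2, 3, 4, 5] has mean 3 and sample variance 2.5.
x, y = solve_last_two([2, 3, 4], n=5, mean=3, sd=math.sqrt(2.5))
print(x, y)
```

In the toy check the function recovers the two missing values of the list $1, 2, 3, 4, 5$ up to floating-point rounding. A negative discriminant means the partial list already has too much or too little spread for two extra values to fix.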