Segment a data set into 3 groups with the lowest possible, roughly equal standard deviations


Let's say I have 9000 customers, each with a different number of orders. I would like to segment them into 3 consistent groups so that I can run a different marketing campaign for each group.

So I need to split them into 3 groups whose standard deviations are close (similar) to one another.

The groups need not be the same size: one might have 1000 customers, another 500, and the third 7500. But I would like to end up with even (similar) audiences.

What is the best way of doing this?


There are 1 best solutions below


An easy way to get three groups of about the same size and with about the same mean and standard deviation is as follows: (1) Sort the data from smallest to largest, (2) Take observations 1, 4, 7, 10, etc. for Group 1, observations 2, 5, 8, 11, etc. for Group 2, and observations 3, 6, 9, 12, etc. for Group 3.

Here is an example, using $n = 90$ to illustrate. (This is from R statistical software. Numbers in brackets are indexes of the first number in each row of output.)

Simulate 90 observations:

x = round(rnorm(90, 100, 15), 1); x
 [1] 103.6  97.0  84.5 107.2  88.1 120.1  92.0 102.2  97.8 101.6
[11]  79.7 127.0 124.6  95.3  96.8 102.6 101.4  86.8  83.3 136.2
[21] 117.8  90.9 111.9  76.9 109.1 110.6 117.3 118.1 124.4  82.9
[31]  89.8 114.3 114.2 116.3 105.2  96.1 120.5  97.3 100.4 124.7
[41]  89.6 109.5 105.1 106.4  92.1  98.2 110.0 120.3  88.4 109.2
[51]  77.6 101.4  78.9  80.2  86.0  78.1  74.4  73.9  93.3  82.2
[61] 116.2  92.3 129.9  81.7  89.3  83.5  95.5 126.4 127.1  95.1
[71]  80.3  83.3  89.7  84.3  91.1  82.6  96.2 110.3  86.4  97.5
[81]  99.2  77.5 109.1  79.6  80.9  92.2 116.8 121.3  84.1 135.9

The overall sample mean and standard deviation are:

mean(x);  sd(x)
[1] 99.51778
[1] 15.97716

Sort observations:

x0 = sort(x); x0
 [1]  73.9  74.4  76.9  77.5  77.6  78.1  78.9  79.6  79.7  80.2
[11]  80.3  80.9  81.7  82.2  82.6  82.9  83.3  83.3  83.5  84.1
[21]  84.3  84.5  86.0  86.4  86.8  88.1  88.4  89.3  89.6  89.7
[31]  89.8  90.9  91.1  92.0  92.1  92.2  92.3  93.3  95.1  95.3
[41]  95.5  96.1  96.2  96.8  97.0  97.3  97.5  97.8  98.2  99.2
[51] 100.4 101.4 101.4 101.6 102.2 102.6 103.6 105.1 105.2 106.4
[61] 107.2 109.1 109.1 109.2 109.5 110.0 110.3 110.6 111.9 114.2
[71] 114.3 116.2 116.3 116.8 117.3 117.8 118.1 120.1 120.3 120.5
[81] 121.3 124.4 124.6 124.7 126.4 127.0 127.1 129.9 135.9 136.2

Separate into three groups, each with mean about 100 and standard deviation about 16.

x1 = x0[seq(1, 88, by=3)];  x1;  mean(x1);  sd(x1)
 [1]  73.9  77.5  78.9  80.2  81.7  82.9  83.5  84.5  86.8  89.3
[11]  89.8  92.0  92.3  95.3  96.2  97.3  98.2 101.4 102.2 105.1
[21] 107.2 109.2 110.3 114.2 116.3 117.8 120.3 124.4 126.4 129.9
[1] 98.83333
[1] 15.91439

x2 = x0[seq(2, 89, by=3)];  x2;  mean(x2);  sd(x2)
 [1]  74.4  77.6  79.6  80.3  82.2  83.3  84.1  86.0  88.1  89.6
[11]  90.9  92.1  93.3  95.5  96.8  97.5  99.2 101.4 102.6 105.2
[21] 109.1 109.5 110.6 114.3 116.8 118.1 120.5 124.6 127.0 135.9
[1] 99.53667
[1] 16.24751

x3 = x0[seq(3, 90, by=3)];  x3;  mean(x3);  sd(x3)
 [1]  76.9  78.1  79.7  80.9  82.6  83.3  84.3  86.4  88.4  89.7
[11]  91.1  92.2  95.1  96.1  97.0  97.8 100.4 101.6 103.6 106.4
[21] 109.1 110.0 111.9 116.2 117.3 120.1 121.3 124.7 127.1 136.2
[1] 100.1833
[1] 16.2856
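
For the 9000 customers in the question, the same dealing scheme can be applied without reordering the data, by ranking the order counts and assigning groups from the ranks. The following is only a sketch under stated assumptions: the vector `orders` is simulated Poisson data standing in for the real order counts, and the names `orders`, `k`, and `group` are illustrative, not from the question.

```r
# Hypothetical data: 9000 customers with simulated order counts
# (an assumption; replace with the real counts).
set.seed(1)
orders <- rpois(9000, lambda = 5)

# Rank customers by order count (ties broken by position), then deal
# ranks 1, 4, 7, ... to group 1; 2, 5, 8, ... to group 2; and so on.
k <- 3
r <- rank(orders, ties.method = "first")
group <- (r - 1) %% k + 1

# Per-group size, mean, and standard deviation
sizes <- tapply(orders, group, length)
means <- tapply(orders, group, mean)
sds   <- tapply(orders, group, sd)
print(cbind(size = sizes, mean = means, sd = sds))
```

Because `group` stays in the original customer order, it can be attached directly as a column next to the customer IDs.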

Addendum: The method above has a slight bias: because every sorted triple sends its smallest value to Group 1 and its largest to Group 3, the sample means increase from Group 1 to Group 3. The effect is less pronounced for larger sample sizes. Depending on the population sampled, the method may also produce a slightly smaller standard deviation for Group 2.
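
The size of this bias can be checked by simulation. The sketch below assumes the same normal population as the example (mean 100, SD 15); the helper `mean_gap`, the seed, and the number of replications are illustrative choices, and `n` is assumed to be divisible by 3.

```r
set.seed(2024)

# Difference between the Group 3 and Group 1 means for one simulated
# sample of size n, using the sorted every-third-value assignment.
mean_gap <- function(n) {
  x0 <- sort(rnorm(n, 100, 15))
  mean(x0[seq(3, n, by = 3)]) - mean(x0[seq(1, n, by = 3)])
}

gaps_90   <- replicate(500, mean_gap(90))
gaps_9000 <- replicate(500, mean_gap(9000))

# Average gap at each sample size
print(c(n90 = mean(gaps_90), n9000 = mean(gaps_9000)))
```

The gap is never negative, since each Group 3 value is at least as large as the Group 1 value from the same triple, and it should shrink as `n` grows.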

Choosing the groups at random (without replacement) eliminates the bias, but may yield larger (random) variations among the group sample means and standard deviations. To implement this method, I randomly permuted the indices 1 to 90 and assigned each observation to a group according to its permuted index modulo 3. (In R, %% is the modulo operator, so ip%%3==0 picks the observations whose permuted index is divisible by 3, giving 30 observations per group.)

i = 1:90
ip = sample(i)                        # randomly permute indexes (no seed set, so results vary by run)
y1 = x0[ip%%3==0];  mean(y1);  sd(y1)
[1] 100.4333
[1] 16.29746
y2 = x0[ip%%3==1];  mean(y2);  sd(y2)
[1] 96.94667
[1] 13.51152
y3 = x0[ip%%3==2];  mean(y3);  sd(y3)
[1] 101.1733
[1] 18.01955

For example, here is Group 3 according to this method:

y3
 [1]  74.4  77.6  78.9  79.6  80.2  80.9  82.9  83.5  84.5  86.4
[11]  90.9  91.1  95.5 100.4 101.4 101.4 102.2 106.4 110.0 110.3
[21] 114.3 116.2 116.3 117.8 118.1 120.1 120.5 121.3 135.9 136.2
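
The trade-off between the two methods can be illustrated by comparing how far apart the three group standard deviations end up under each. This is a rough sketch, assuming the same simulated normal population as above; the helpers `sd_spread` and `one_run`, the seed, and the replication count are hypothetical choices made for illustration.

```r
set.seed(99)

# Range (max minus min) of the group standard deviations
sd_spread <- function(groups) diff(range(sapply(groups, sd)))

# One simulated sample, split both ways into three groups of n/3
one_run <- function(n = 90) {
  x0 <- sort(rnorm(n, 100, 15))
  labels <- rep(1:3, length.out = n)       # 1, 2, 3, 1, 2, 3, ...
  systematic <- split(x0, labels)          # every third sorted value
  random     <- split(x0, sample(labels))  # same sizes, random membership
  c(systematic = sd_spread(systematic), random = sd_spread(random))
}

res <- replicate(1000, one_run())
avg_spread <- rowMeans(res)
print(avg_spread)
```

In runs like this, the systematic split should give group standard deviations that lie much closer together, which matches the one-off results shown above.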