I have asked a question about the number of unique integers in randomly generated arrays. Suppose we have $10^6$ randomly generated numbers; we should expect about $\sim 6 \times 10^5$ unique numbers. I wrote a custom sort algorithm based on this assumption.
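(As a quick sanity check of that figure, separate from the sorting code: drawing $n$ values uniformly at random from a range of size $n$ leaves an expected unique fraction of $1 - (1 - 1/n)^n \approx 1 - 1/e \approx 0.632$.)

```python
import random

n = 10**6
draws = [random.randrange(n) for _ in range(n)]
unique_fraction = len(set(draws)) / n
# Expected fraction of unique values: 1 - (1 - 1/n)**n,
# which tends to 1 - 1/e ≈ 0.632 for large n
print(unique_fraction)
```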
Pseudocode:

    P := 0
    RecursiveSort(Array A):
        If length of A <= 1:
            Return A
        Find maximum and minimum of A
        Create Array B of length of A         // each B[K] starts as an empty list
        P := P + length of A                  // for evaluation purposes
        For each number I in A:
            S := (I - min) / (max - min)      // rescale I into [0, 1]
            K := round(S * (length of A - 1)) // make K a valid index into B
            Put I in B[K]                     // each B[K] may hold several numbers
        Create Array Result of length 0
        For each bucket L in B:
            For each number I in RecursiveSort(L):
                Put I in Result
        Return Result
After several runs, I noticed that P / length of A was approximately 1.73 for the uniform distribution and 2.19 for the normal distribution. Why is that?
For the uniform distribution, the assumption implies that each recursion level shrinks the array size by a factor of $e$, so shouldn't P / length of A be approximately 1.58 ($1 + e^{-1} + e^{-2} + e^{-3} + \dots + e^{-n}$)?
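(A quick numeric check of that limit, not part of the original code: the series is geometric, so its sum is $1/(1 - e^{-1})$.)

```python
import math

# Partial sum of the geometric series 1 + e^-1 + e^-2 + ...;
# its limit is 1 / (1 - 1/e) ≈ 1.582, matching the 1.58 figure above
total = sum(math.exp(-k) for k in range(50))
print(total)
```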
Here is the actual Python code:

    import numpy as np
    import random

    A = []
    lim = 1000000

    # Generating a normally distributed array.
    # a_normal = np.random.normal(lim * 1000 / 2, lim * 1000 / 6, lim)
    # a_normal = np.random.normal(lim * 1000 / 6, lim * 1000 / 4, lim)
    # a_normal = np.random.normal(0, lim, lim)
    # A = a_normal.tolist()

    for _ in range(lim):
        A.append(random.randrange(lim * 100, lim * 1000 - 1))

    P = 0

    def custom_sort(A):
        if len(A) <= 1:
            return A
        mi = min(A)
        ma = max(A)
        global P
        P += len(A)
        if ma == mi:  # all numbers are the same
            return A
        B = [[] for _ in range(len(A))]
        for I in A:
            K = int((I - mi) / (ma - mi) * (len(A) - 1))
            B[K].append(I)
        result = []
        for L in B:
            for I in custom_sort(L):
                result.append(I)
        return result

    custom_sort(A)
    print(P / len(A))
This is a bucket sort with $k$ (the number of buckets) equal to the input length. It runs in linear time if the input data is uniformly distributed, since all the buckets are small after the first pass.
You can optimize a bucket sort for any input distribution by using that distribution's CDF as the bucketing function; if the input matches the expected distribution, then the CDF values will be uniformly distributed between 0 and 1.
So, for example, if you expect the input data to be normally distributed, you could start by computing the mean and variance of the data and then bucket using a normal CDF with those parameters.
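A minimal sketch of that idea (my own illustration, not the asker's code): estimate the mean and standard deviation from the data, map each value through the normal CDF (computed via `math.erf`) to get an approximately uniform key in [0, 1), and bucket on that key. Here the buckets are finished with a plain sort rather than recursion, since they are expected to be small.

```python
import math


def normal_cdf(x, mu, sigma):
    # Phi((x - mu) / sigma), computed with the error function
    return 0.5 * (1.0 + math.erf((x - mu) / (sigma * math.sqrt(2.0))))


def cdf_bucket_sort(A):
    if len(A) <= 1:
        return A
    # Estimate the distribution's parameters from the data itself
    mu = sum(A) / len(A)
    sigma = math.sqrt(sum((x - mu) ** 2 for x in A) / len(A))
    if sigma == 0:  # all values identical
        return list(A)
    n = len(A)
    B = [[] for _ in range(n)]
    for x in A:
        # CDF value is ~uniform in [0, 1) if the data is really normal;
        # clamp so a CDF of exactly 1.0 stays in range
        k = min(int(normal_cdf(x, mu, sigma) * n), n - 1)
        B[k].append(x)
    result = []
    for bucket in B:
        bucket.sort()  # buckets are small on average, any sort is fine
        result.extend(bucket)
    return result
```

If the data really is normal, each bucket receives O(1) elements on average, so the total work stays close to linear, just as in the uniform case above.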