Normal Distribution parameters with only 30% of the values


For a statistics project, I am seeking a method to determine the most precise parameters for a normal distribution based on a dataset for which I have the following information:

  • The total number of rows in my dataset
  • Only the first n% of the rows of my dataset
  • Only the last n% of the rows of my dataset

I am using Python to address this issue, and I am wondering whether there is a library similar to sklearn that offers a normal distribution model. Below is the Python code I have developed so far:
import numpy as np
import pandas
from scipy.stats import norm
from sqlalchemy import create_engine
import matplotlib.pyplot as plt
from sklearn.ensemble import IsolationForest

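# Placeholder connection string; replace it with your own database URL.
sqlEngine = create_engine('You dont need this one :)')

cnx = sqlEngine.connect()

# Rows are sorted by price, so head() and tail() return the cheapest and the most expensive cars.
df_total = pandas.read_sql("SELECT prix, kilometrage FROM liens_leparking WHERE (marque = 'RENAULT' AND pays = 'FRANCE' AND annee >= 2015 AND gamme REGEXP 'CLIO') ORDER BY prix ASC;", con=cnx)
# Keep only the first and last 3500 rows to simulate the partial dataset.
df_lower = df_total.head(3500)
df_upper = df_total.tail(3500)
df = pandas.concat([df_lower, df_upper], ignore_index=True)
# Drop roughly 2% of the rows, flagged as outliers on (prix, kilometrage).
clf = IsolationForest(contamination=0.02)
outliers = clf.fit_predict(df)
df = df[outliers != -1]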

# Maximum-likelihood fit of a normal distribution to the retained prices.
mean, std = norm.fit(df['prix'])
plt.hist(df['prix'], bins=40, density=True, label='Observed Data')

distribution_normale = norm(loc=mean, scale=std)

# Evaluate the fitted density over the observed price range.
xmin = df['prix'].min()
xmax = df['prix'].max()
print(xmin, xmax)
x = np.linspace(xmin, xmax, 100)
p = distribution_normale.pdf(x)
plt.plot(x, p, 'k', linewidth=2, label='Normal Distribution')

plt.title("Fit results: mean = %.2f,  std = %.2f" % (mean, std))
plt.xlabel('Price')
plt.ylabel('Density')
# Labels are attached to each artist above, so the legend cannot mix them up.
plt.legend()
plt.show()

Output from plt.show(): link to the image

As you can see, the fitted parameters are not good yet. Any help is welcome!
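One direction that might be worth trying: calling norm.fit on the two tails alone treats them as if they were the whole sample, which inflates the standard deviation because the middle of the distribution is missing. Since the total row count is known, the hidden rows can instead be treated as censored, meaning each one contributes only the probability of falling between the largest value of the lower block and the smallest value of the upper block. The sketch below maximizes that likelihood with scipy.optimize.minimize. It assumes the data really are roughly normal and that the two blocks come from a sample sorted by the fitted variable (as with ORDER BY prix). The function name fit_normal_from_tails and the synthetic numbers are made up for illustration and are not part of scipy or sklearn.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm


def fit_normal_from_tails(lower, upper, n_total):
    """Estimate (mean, std) when only the lowest and highest rows of a
    sorted sample of size n_total are observed (hypothetical helper)."""
    lower = np.asarray(lower, dtype=float)
    upper = np.asarray(upper, dtype=float)
    observed = np.concatenate([lower, upper])
    n_hidden = n_total - observed.size   # rows known to exist but not seen
    a, b = lower.max(), upper.min()      # hidden rows are assumed to lie in [a, b]

    # Standardize so the optimizer works with well-scaled numbers.
    center, scale = observed.mean(), observed.std()
    z = (observed - center) / scale
    za, zb = (a - center) / scale, (b - center) / scale

    def mean_neg_log_likelihood(params):
        mu, log_sigma = params
        sigma = np.exp(log_sigma)  # optimizing log(sigma) keeps sigma positive
        # Observed rows contribute their density.
        ll_observed = norm.logpdf(z, loc=mu, scale=sigma).sum()
        # Each hidden row contributes the probability of landing in [a, b].
        p_middle = norm.cdf(zb, loc=mu, scale=sigma) - norm.cdf(za, loc=mu, scale=sigma)
        ll_hidden = n_hidden * np.log(max(p_middle, 1e-300))
        return -(ll_observed + ll_hidden) / n_total

    result = minimize(mean_neg_log_likelihood, x0=[0.0, 0.0], method="Nelder-Mead")
    mu_z, sigma_z = result.x[0], np.exp(result.x[1])
    # Map the estimates back to the original units.
    return center + scale * mu_z, scale * sigma_z


# Synthetic check: keep the lowest and highest 15% of a known normal sample.
rng = np.random.default_rng(0)
sample = np.sort(rng.normal(12000.0, 3000.0, size=20000))
k = 3000
naive_mean, naive_std = norm.fit(np.concatenate([sample[:k], sample[-k:]]))
mean_hat, std_hat = fit_normal_from_tails(sample[:k], sample[-k:], sample.size)
print("naive fit on tails:", naive_mean, naive_std)
print("censored fit:      ", mean_hat, std_hat)
```

With the real data, the call would presumably look like fit_normal_from_tails(df_lower['prix'], df_upper['prix'], len(df_total)). If prices are strongly skewed, a normal model may still fit poorly whatever estimator is used.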