I have read a couple of posts and did not see this exact interpretation. I apologize in advance if this is not the right location.
Purpose: I am preparing a paper on the distribution of litter densities along the shoreline in freshwater environments. The data is collected by hand, and the individual pieces are classified and counted by volunteers. These programs exist in many countries, and there are fairly large data sets.
The units are expressed as pieces of trash per meter (or foot) of shoreline.
Assumptions:
- The data is collected in the same manner
- The volunteers have the same motivations
- There is no undercounting or overcounting
- Accuracy is basically the same across the spectrum
- The math is correct
The graph below shows the distributions of two data sets:
- MCBP: regional results for Lake Geneva (Switzerland), n=100 samples
- SLR: results from 'The Swiss Litter Report', n=365 samples
The following code was used to calculate the distributions and present the graphs from a DataFrame in pandas/Python 3.6:
    import numpy as np
    from scipy import stats

    df['Density'] = df['Total'] / df['Length']
    # The data is skewed, so take logs to get it close to normal
    df['Logs'] = df['Density'].apply(np.log)
    mu, sigma = stats.norm.fit(df['Logs'])
    # Repeat for df2 to get the second curve
    # Build histograms for the two data sets
    # Plot the two distributions, where x = df['Logs']
    # and y = stats.norm.pdf(x, loc=mu, scale=sigma)
The resulting two distributions
mu for the SLR distribution is 0.1564617, which is equal to the 5th percentile of the MCBP distribution.
I am interpreting this as meaning:
- There is a 5% probability that a sample from MCBP will be less than the average from SLR.
- There is a 95% probability that a sample taken from the MCBP region will be greater than the national average
- In general I can expect litter densities to be greater in the MCBP region than in the SLR region
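The first two statements can be checked directly against a fitted normal curve using its cumulative distribution function. The following is a minimal sketch, not the code used for the paper: `prob_sample_below` is a hypothetical helper, and `mu_mcbp` and `sigma_mcbp` are placeholder values rather than the fitted MCBP parameters. Everything is on the log-density scale, as in the fit above.

```python
from scipy import stats

def prob_sample_below(threshold, mu, sigma):
    """P(X < threshold) for X ~ Normal(mu, sigma), on the log-density scale."""
    return stats.norm.cdf(threshold, loc=mu, scale=sigma)

# Hypothetical placeholder parameters, NOT the fitted MCBP values.
mu_mcbp, sigma_mcbp = 1.0, 0.5

# Take the 5th percentile of the placeholder distribution as the threshold;
# in the real analysis this role is played by the SLR mu.
threshold = stats.norm.ppf(0.05, loc=mu_mcbp, scale=sigma_mcbp)

p_below = prob_sample_below(threshold, mu_mcbp, sigma_mcbp)
p_above = 1.0 - p_below
print(p_below, p_above)
```

With the real fitted values, passing the SLR mu as `threshold` would give the probability that an MCBP sample falls below it under the fitted model.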
Is this interpretation correct? (It does correlate with observations)
Thanks for your help
Your interpretation is correct as long as your assumption that the data is log-normally distributed is correct and your estimated parameters for those distributions are correct.
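One way to probe the log-normal assumption is to run a normality test on the logged densities. The sketch below uses the Shapiro-Wilk test from SciPy as one possible choice; `check_log_normality` is a hypothetical helper, the data is synthetic, and dropping zero densities before taking logs is an assumption you would need to justify for the real data.

```python
import numpy as np
from scipy import stats

def check_log_normality(density):
    """Shapiro-Wilk test on log(density); returns (statistic, p_value)."""
    density = np.asarray(density, dtype=float)
    # Zero densities have no finite log, so they are dropped here (an assumption).
    logs = np.log(density[density > 0])
    statistic, p_value = stats.shapiro(logs)
    return statistic, p_value

# Synthetic stand-in for df['Density']; NOT the survey data.
rng = np.random.default_rng(0)
synthetic_density = rng.lognormal(mean=0.15, sigma=0.5, size=365)

statistic, p_value = check_log_normality(synthetic_density)
print(statistic, p_value)
```

A small p-value suggests the logs deviate from a normal distribution; a large one does not prove normality, so a Q-Q plot of the logs is a useful companion check.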
Now, the accuracy of your estimated parameters depends on the number of data points you have. You can do a t-test to answer your third point with a given level of confidence.
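A minimal sketch of such a test, assuming Welch's unequal-variance t-test on the logged densities is acceptable for your data. `welch_one_sided` is a hypothetical helper and the two samples are synthetic, with only the sample sizes (100 and 365) taken from the question. The one-sided p-value is derived from the two-sided one so that the code does not depend on a recent SciPy version.

```python
import numpy as np
from scipy import stats

def welch_one_sided(logs_a, logs_b):
    """Welch's t-test of H1: mean(logs_a) > mean(logs_b).

    Returns (t statistic, one-sided p-value).
    """
    t_stat, p_two_sided = stats.ttest_ind(logs_a, logs_b, equal_var=False)
    # Convert the two-sided p-value to the one-sided "greater" alternative.
    p_one_sided = p_two_sided / 2 if t_stat > 0 else 1 - p_two_sided / 2
    return t_stat, p_one_sided

# Synthetic log densities with made-up parameters; NOT the survey data.
rng = np.random.default_rng(1)
logs_mcbp = rng.normal(loc=1.0, scale=0.5, size=100)
logs_slr = rng.normal(loc=0.15, scale=0.5, size=365)

t_stat, p_one_sided = welch_one_sided(logs_mcbp, logs_slr)
print(t_stat, p_one_sided)
```

Because the test is run on logs, it compares mean log densities, which corresponds to comparing geometric means of the raw densities rather than arithmetic means.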