I have an experiment that shoots particles at a wall. It hits some regions with a higher probability than others. I don't know what the underlying probability density function is. I assume it's some smooth curve and I would like to know its maximum value.
I can get an estimate by firing lots of particles and recording where they land. Then make a histogram of the data and calculate the maximum value from that. The problem is I don't know what the optimum number of bins to use for the histogram is. Additionally, I don't know how accurate my estimate is. Do you know if there is a formula or method I can use to calculate the optimum number of bins for a given number of samples? Do you know if we can approximate error bars for our estimate?
I have attached some Python code to help explain what I am doing. In this example, the underlying probability density function is the normal distribution, with mean, $\mu=0$, and variance $\sigma^2=1$. In this example, I know the exact solution is $1/\sqrt{2\pi\sigma^2}$, however, in general, I don't know the formula for the probability density function so we need to estimate its maximum.
import numpy as np
import matplotlib.pyplot as plt
mu = 0
sigma = 1
len_particles_array = 5
num_particles_array = 10**np.arange(1, len_particles_array + 1)
len_bin_array = 16
num_bins_array = 2**np.arange(len_bin_array)
max_number_density_array = np.zeros((len_particles_array, len_bin_array))
analytic_solution = 1 / np.sqrt(2 * np.pi * sigma**2)
for i, num_particles in enumerate(num_particles_array):
particle_positions = np.random.normal(mu, sigma, num_particles)
for j, num_bins in enumerate(num_bins_array):
number_density, bins = np.histogram(particle_positions, num_bins, density=True)
max_number_density_array[i, j] = np.max(number_density)
fig = plt.figure()
ax = fig.add_subplot(111)
for i, num_particles in enumerate(num_particles_array):
ax.loglog(num_bins_array, max_number_density_array[i, :], \
label = '# particles = ' + str(num_particles))
xlims = ax.get_xlim()
ax.loglog(xlims, [analytic_solution, analytic_solution], 'k--', \
label = r'Exact solution = $1\ /\ \sqrt{2\pi\sigma^2}$')
ax.set_xlim(xlims)
ax.set_xlabel('Number of bins')
ax.set_ylabel('Estimate for the maximum of the probability density function')
ax.legend()
plt.savefig('Estimate for the maximum of the probability density function.png')
plt.show()
Here is the outputted figure. It shows the estimate is dependent on the number of bins I use for the histogram and the number of sample particles I use.

To be clear, it seems you don't want the mode, which is the $x$-value at which the density $f(x)$ reaches its maximum; you want $\max_{x \in R}f(x),$ where $R$ is the real line. Using the standard normal distribution as an example, you want $\varphi(0) = 1/\sqrt{2\pi} = 0.3989 \approx 0.4,$ where $\varphi$ denotes the standard normal density function.
In R software, $\varphi$ is
dnorm; default parameters are $\mu=0, \sigma=1:$If, you have the density function, you can often use methods of calculus to find the maximum value of a a function.
If you have data generated from a distribution and do not know the density function of the distribution, you might come close by trying histograms, as you suggest. That will work best if you have a large sample. A difficulty is that choice of bins (intervals) for a histogram is arbitrary, so different histograms will typically have various tallest bars of various heights.
Let's look at a sample of size $n = 1000$ from the distribution $\mathsf{Norm}(\mu=0, \sigma=2),$ for which the desired max is known to be $0.1995,$ to four places.
Specifically, if you have data (1000 observations), but you don't know from what continuous distribution they were sampled, then you might look at histograms similar to the ones shown below.
I have cheated by plotting the density curve, which we're pretending is unknown, on each histogram. You can see that some histograms work better than others for guessing the maximum height of the density curve. [R code (tediously repetitive) for the figure is shown at the end.]
However, kernel density estimators (KDEs) provide a somewhat more useful path. A KDE is the density of a mixture of distributions from a suitable family, that approximates the population density using a sample from that distribution. (For more on KDEs, see Wikipedia. For R documentation on
density, see this. Also, there are some pages on this site that discuss the convergence of KDEs to their target PDFs, with varying degrees of formality; typeKDEafter the 'magnifying glass' at the top of the page to get a list. (All give insights into KDEs, but I don't see that any deal directly with your question of finding the maximum.)In R, a KDE consists of 512 $(x,y)$ pairs that can be connected by short line segments to show an estimate of the desired density curve. There are many possible variations of the KDEs in R, but I have chosen to use the default KDE, which I have found works very well in most cases. Here is text output from the
densityprocedure in R.Notice that the maximum height (y-coordinate) of the KDE) is $0.1951,$ which is very close to the actual maximum height $0.1995$ of the density function of $\mathsf{Norm}(0, 2).$
Here is the default histogram of the values in
x, along with this actual density function (black) along with its KDE (red) based on my sample of size $n = 1000.$Notes: (1) If I wanted to know the 'mode' (the x-value at which the y-value is maximized, I could do a grid search for it, as below. The result is that the mode of the KDE is a $0.209 \approx 0.$
(2) Comparisons: (a) If we have $n=1000$ observations
xknown to have been randomly sampled from some normal distribution, then we should be able to estimate the maximum height by usingdnormin R, with the sample mean $\bar X$ estimating $\mu$ and the sample SD $S$ estimating $\sigma.$ For the samplexabove, this method gives the maximum height as $0.1957.$(b) This is not much different from the result $0.1951$ above from the KDE. Above, the KDE method uses normal data, but does not explicitly assume normality.
(c) Also, from above we know that a histogram with about 20 bins (lower left in first figure) has its tallest bar near the max. This information can be displayed by using densities from a non-printing histogram (argument
plot=F); the result is $2.0.$ This method does not assume normality, but may require some trial and error with histograms that have various numbers of bins. (In a real application, you wouldn't have the actual density curve in the background to guide you.)