How to derive the probability density function (PDF) of a continuous random variable from a set of data?


I am interested in deriving an expression for the probability density function (PDF) of a continuous random variable from a given set of data. To explain further, suppose we have data on the time spent by visitors on a web page over a 24-hour period. At certain hours, say during the busy hours of the day, the time spent on the web page is short. However, in the afternoon the time spent is long. I would like to derive an expression for the PDF of the continuous random variable X representing the time spent by a visitor, such as

$$ f_X(x)= \begin{cases} 24x-x^2, & x > 0\\ 0, & \text{otherwise.} \end{cases} $$

This is only an assumed PDF. I have searched but did not find a suitable answer to this question. Most books on probability teach how to compute probabilities from a given PDF, among many other things. However, the PDF is always given or assumed. So, my questions are:

  1. Do we always assume a PDF, or try to match a suitable one from the set of popular distributions (such as Gaussian, exponential, log-normal and so on), for a given set of data? If yes, is there any standard way to do this?

  2. Is it possible to derive a mathematical equation for the PDF of the random variable from a given set of sample data? If yes, how could this be done? Is there any branch of statistics and probability theory dealing with this?

I would much appreciate any answers to these questions. Pointers to any resources or books or chapters will also be helpful.

Thanks in advance for help.


There are 2 best solutions below

2
On BEST ANSWER

Do we always assume or try to map a suitable PDF from the set of popular distributions?

No, the form of the pdf depends on the (real) situation. Your pdf should meet two requirements:

  • It must be defined between 0 and 24.
  • The pdf has a maximum at noon.

Your pdf meets the second requirement: if $x=12$ denotes noon, then $f(x)$ has its maximum at noon. The first requirement can be imposed by restricting the support to $0<x\leq 24$. In addition, a pdf must satisfy $\int_{-\infty}^{\infty} f(x) \, dx =1$. To fulfil this condition, we multiply the function by a constant $c$ and then determine the value of $c$:

$$c\cdot \int_0^{24} \left(24x-x^2\right) \, dx=1$$

The integral equals $12\cdot 24^2-\frac{24^3}{3}=2304$, so $c=\frac1{2304}$. Thus one possible pdf is

$$f_X(x)=\begin{cases}\frac1{2304}\cdot \left( 24x-x^2\right), \ 0<x\leq 24 \\ 0, \ \text{elsewhere} \end{cases} $$

Other suitable pdfs are possible.
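As a quick check of the normalizing constant, here is a minimal Python sketch using only the standard library. The helper names `antiderivative` and `pdf` are made up for this illustration; the script integrates $24x-x^2$ over $[0, 24]$ exactly and then confirms numerically that the scaled function integrates to 1.

```python
from fractions import Fraction


def antiderivative(x):
    """Antiderivative of 24x - x^2, i.e. 12x^2 - x^3/3, in exact arithmetic."""
    x = Fraction(x)
    return 12 * x**2 - x**3 / 3


# Exact area under 24x - x^2 on [0, 24], and the constant c that normalizes it.
area = antiderivative(24) - antiderivative(0)
c = 1 / area


def pdf(x):
    """Normalized density c * (24x - x^2) on (0, 24], zero elsewhere."""
    return float(c) * (24 * x - x * x) if 0 < x <= 24 else 0.0


# Independent numerical check with a midpoint Riemann sum.
n = 100_000
h = 24 / n
total = sum(pdf((i + 0.5) * h) for i in range(n)) * h

print("area =", area)
print("c =", c)
print("numerical integral of the pdf =", total)
```

The exact step reproduces $c=\frac1{2304}$, and the Riemann sum is only a sanity check that does not rely on the antiderivative.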

3
On

Question 2) concerns one of the basic fields of investigation in statistics, in particular sampling and distribution fitting.

Concerning question 1), there is such a plethora of distributions, derived from a wide range of theoretical and applied scenarios, that it is highly "un-probable" that you would need a new one.

And in fact your parabolic PDF, which should actually read $$ PDF \propto {x \over {24}}\left( {1 - {x \over {24}}} \right) \propto \xi \left( {1 - \xi } \right) $$ with $\xi = x/24$, is just a particular case of the Beta distribution, namely $\mathrm{Beta}(2,2)$.
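To make "distribution fitting" concrete, here is a minimal sketch of one simple approach: a method-of-moments fit of a Beta distribution to data rescaled from $(0, 24)$ to $(0, 1)$. Everything here is an assumption for illustration. The data are synthetic (drawn from $24\cdot\mathrm{Beta}(2,2)$ as a stand-in for real observations), the helper `fit_beta_moments` is a hypothetical name, and method of moments is only one of several estimation techniques (maximum likelihood is another common choice).

```python
import random
import statistics

random.seed(0)

# ASSUMPTION: synthetic stand-in for real observations, values in (0, 24)
# drawn from 24 * Beta(2, 2). Replace with the measured data in practice.
samples = [24 * random.betavariate(2, 2) for _ in range(20_000)]


def fit_beta_moments(data, upper=24.0):
    """Estimate Beta(alpha, beta) parameters by matching mean and variance.

    The data are first rescaled to (0, 1) via xi = x / upper.
    """
    xi = [x / upper for x in data]
    m = statistics.mean(xi)
    v = statistics.variance(xi)
    # For Beta(a, b): mean = a / (a + b) and var = m(1 - m) / (a + b + 1),
    # so a + b = m(1 - m) / var - 1.
    common = m * (1 - m) / v - 1
    return m * common, (1 - m) * common


alpha_hat, beta_hat = fit_beta_moments(samples)
print("estimated alpha:", alpha_hat)
print("estimated beta:", beta_hat)
```

With data that really follow the parabolic density, both estimates should come out close to 2. With real measurements, the fitted values indicate which member of the Beta family matches the sample mean and variance, and a histogram or goodness-of-fit test can then be used to judge whether that family is adequate at all.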