Creating Custom Random Number Generator

My statistics are rusty, but here's what I'm trying to do. I'm creating an application around football and have this distribution of rushing yards per attempt.

http://farm3.static.flickr.com/2580/3813734078_7801aab534_o.png

I'm just looking at the NFL average in the chart. Basically, I want to create a random number generator that gives me a number along that distribution. I'll want to do the same for passing too. Since this is football, I'd like to generate from roughly -10 to 100.

I reverse engineered the plot in Excel, but my memory is absolutely failing on how I can create or fit a known distribution to this plot. I left my math skills behind in college and have just been doing general programming since.

There are 2 best solutions below

1
On BEST ANSWER

You don't need a known distribution, just a computable function f that approximates the curve closely enough. You can get fancier, but the simplest approach is rejection sampling: generate pairs of uniform random variables using an acceptable PRNG, scaled so that one lies on your [-10, 100] range and the other runs from 0 to the maximum value of your approximation function. Call the first one x and the second y. If f(x) > y, accept x as the next value; otherwise discard both x and y and generate a new pair. The distribution of accepted x values will follow the curve.
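A minimal Python sketch of that accept/reject loop follows. The curve `f` here is a made-up stand-in (a lopsided bump peaking near 3 yards), not a fit to the chart in the question; `LO`, `HI`, `F_MAX` and `sample` are illustrative names.

```python
import math
import random

LO, HI = -10.0, 100.0   # range requested in the question
MODE = 3.0              # assumed peak location, purely illustrative

def f(x):
    """Hypothetical unnormalized curve: narrow on the left, long tail on the right."""
    scale = 3.0 if x < MODE else 8.0
    return math.exp(-abs(x - MODE) / scale)

F_MAX = 1.0  # maximum of f on [LO, HI], reached at x = MODE

def sample(rng=random):
    """Draw one value whose distribution follows f, by rejection sampling."""
    while True:
        x = rng.uniform(LO, HI)       # candidate on the yardage scale
        y = rng.uniform(0.0, F_MAX)   # height between 0 and the maximum of f
        if y < f(x):                  # the point lies under the curve: accept
            return x
        # otherwise discard both x and y and try again

random.seed(0)
yards = [sample() for _ in range(5)]
```

Since f only needs to be proportional to the target density, it does not have to integrate to one. With this particular f roughly one candidate in ten is accepted, which is the inefficiency that long thin tails cause.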

Efficiency can be poor if f has long thin tails. Then you need to generate x from a non-uniform distribution g that approximates f but dominates it, and y on [0,1), accepting when y < f(x)/g(x). But that's just for efficiency; the method works as described above.
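As a rough illustration of the envelope variant, the sketch below proposes x from a two-sided exponential g that sits on or above a made-up f everywhere, then accepts with probability f(x)/g(x). Both `f` and `g`, and the helper names, are assumptions chosen for the example rather than anything fitted to the real data.

```python
import math
import random

LO, HI = -10.0, 100.0
MODE = 3.0  # assumed peak location, purely illustrative

def f(x):
    """Hypothetical unnormalized target curve."""
    scale = 3.0 if x < MODE else 8.0
    return math.exp(-abs(x - MODE) / scale)

def g(x):
    """Envelope: a symmetric two-sided exponential with g(x) >= f(x) everywhere."""
    return math.exp(-abs(x - MODE) / 8.0)

def draw_from_g(rng):
    """Sample from g restricted to [LO, HI] by redrawing out-of-range values."""
    while True:
        d = rng.expovariate(1.0 / 8.0)  # exponential with mean 8
        x = MODE + d if rng.random() < 0.5 else MODE - d
        if LO <= x <= HI:
            return x

def sample(rng=random):
    """Rejection sampling with a non-uniform proposal."""
    while True:
        x = draw_from_g(rng)
        y = rng.random()          # uniform on [0, 1)
        if y < f(x) / g(x):       # ratio is at most 1 because g dominates f
            return x

random.seed(1)
yards = [sample() for _ in range(5)]
```

Here every candidate to the right of the peak is accepted, since f and g coincide there, so far fewer draws are wasted than with uniform proposals.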

0
On

This answer assumes that you have a collection of samples representing the gains in yards.

Use kernel density estimation with a Gaussian kernel to approximate the actual probability density function for the gains.

For the range of values that you want to generate $(-10, 100)$, go through and evaluate the estimate at each value to get its density, and store that result in an array location corresponding to the generated value. You can make this grid as coarse or as fine-grained as needed.

Create a new zero-valued array and, starting at the lowest value, accumulate the values from the previous array ($cdf[i] = cdf[i-1] + pdf[i]$) at each index in order to produce an estimate of the cumulative distribution function.

(If the final value of the cdf does not equal one, you'll need to normalize the cdf array by going back through and dividing each entry by the last, so that the final entry equals one.)
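The steps so far might look like the following in plain Python. The `samples` list is invented for illustration, the bandwidth of 2.0 is an arbitrary guess rather than a tuned value, and the grid uses one-yard steps.

```python
import math

def gaussian_kde(samples, bandwidth):
    """Return a function that evaluates a Gaussian kernel density estimate."""
    norm = 1.0 / (len(samples) * bandwidth * math.sqrt(2.0 * math.pi))
    def pdf(x):
        return norm * sum(
            math.exp(-0.5 * ((x - s) / bandwidth) ** 2) for s in samples
        )
    return pdf

# Made-up gains in yards; real data would come from play-by-play records.
samples = [-2, 0, 1, 2, 2, 3, 3, 4, 4, 5, 6, 8, 12, 25]
pdf = gaussian_kde(samples, bandwidth=2.0)

grid = list(range(-10, 101))          # one entry per yard from -10 to 100
pdf_values = [pdf(x) for x in grid]   # density estimate at each grid value

# Running sum: cdf[i] = cdf[i-1] + pdf[i]
cdf = []
running = 0.0
for p in pdf_values:
    running += p
    cdf.append(running)

# Normalize so the last entry is exactly one.
total = cdf[-1]
cdf = [c / total for c in cdf]
```

A library routine such as SciPy's `gaussian_kde` could replace the hand-written estimator; the standard-library version is used here only to keep the sketch self-contained.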

As Scaramouche pointed out in the comments to your question, you can sample a value from the unit interval uniformly using a pseudo-random number generator in the programming language of your choosing.

From that value, you can then search the cumulative distribution array for the first entry that is greater than or equal to it, and use the corresponding array index to calculate the value you want to report back between $(-10, 100)$.
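One way to do that lookup is a binary search for the first cdf entry at or above the uniform draw. The small `grid` and `cdf` tables below are hypothetical placeholders standing in for the arrays built from the density estimate.

```python
import bisect
import random

# Hypothetical tables: grid[i] is the yardage whose cumulative probability is cdf[i].
grid = [-10, 0, 5, 20, 100]
cdf = [0.05, 0.35, 0.80, 0.97, 1.0]

def sample(rng=random):
    """Inverse-transform sampling over a tabulated, normalized cdf."""
    u = rng.random()                      # uniform on [0, 1)
    i = bisect.bisect_left(cdf, u)        # first index with cdf[i] >= u
    return grid[min(i, len(grid) - 1)]    # guard against rounding at the top end

random.seed(0)
draws = [sample() for _ in range(5000)]
share_of_5 = draws.count(5) / len(draws)  # should be near 0.80 - 0.35 = 0.45
```

Each grid value is returned with probability equal to the jump in the cdf at its index, so a finer grid gives a closer match to the estimated curve.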