Correlation, Independence, and Useful Data Points in a Set


I'll try to be as specific as possible here. This is a problem I'm trying to solve at work. There are two questions:

Question 1: How can I show that a quantity stays within a specified range 95% of the time, with 99.99% confidence, using a discrete and dependent data set?

Question 2: How many useful data points can I extract from a dependent data set for estimating PDFs?

Setup:

I have GPS data from flight tests. I can compute the autocorrelation between data points and identify points that are uncorrelated. Can I use these points to do the statistics in Question 1?

One data point per flight test guarantees independence, which is the best case, but it costs more money. I want to get as many useful data points per flight test as possible.

Any ideas?

Answer:
Suppose the population distribution is $\mathsf{Gamma}(\mathrm{shape}=5, \mathrm{rate}=1/10),$ which might arise as the sum of five independent exponential waiting times, each with rate $1/10$ (mean $10$).

A simple and traditional way to get an idea of the density function from data is to make a histogram. A more modern and sophisticated way is to use a 'kernel density estimator' (KDE) of the data. (You can google that if you're interested.)

For samples of sizes $n = 50, 500,$ and $5000$ from this distribution, I'll show a histogram of the data and a KDE of the data (red curve) along with the true density of $\mathsf{Gamma}(5, 1/10)$ in black. As you can see, larger samples tend to give better approximations of the density function. [In practice, I suppose the true density function might not be known.]

[Figure: histograms of the three samples with KDE (red) and the true Gamma(5, 1/10) density (black)]

Here is the R code that produced the figure, in case it is of any use.

par(mfrow=c(3,1))                      # stack three panels vertically
x1 = rgamma(50, 5, 1/10)               # sample of 50 from Gamma(5, 1/10)
hist(x1, prob=TRUE, ylim=c(0,.02), col="skyblue2", main="Sample of 50")
  lines(density(x1), col="red", lwd=2)            # kernel density estimate
  curve(dgamma(x, 5, 1/10), add=TRUE, lwd=2)      # true density

x2 = rgamma(500, 5, 1/10)
hist(x2, prob=TRUE, ylim=c(0,.02), col="skyblue2", main="Sample of 500")
  lines(density(x2), col="red", lwd=2)
  curve(dgamma(x, 5, 1/10), add=TRUE, lwd=2)

x3 = rgamma(5000, 5, 1/10)
hist(x3, prob=TRUE, ylim=c(0,.02), col="skyblue2", main="Sample of 5000")
  lines(density(x3), col="red", lwd=2)
  curve(dgamma(x, 5, 1/10), add=TRUE, lwd=2)
par(mfrow=c(1,1))                      # reset plotting layout

Addendum on ACF: Roughly speaking, the autocorrelation at lag $g=10$ of a sequence $W_i,$ with $i = 1, \dots, 100,$ is the sample correlation of $(W_1, W_2, \dots, W_{90})$ and $(W_{11}, W_{12}, \dots, W_{100}).$ However, in computing this correlation, the sample mean and variance of all 100 observations are used. The ACF of $W_i$ consists of the autocorrelations at lags $g = 0$ (always 1), $g = 1, 2, 3, \dots.$ You can google 'autocorrelation' for details.
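As a sanity check on that definition, here is a brief sketch (the AR(1) example series is my own, not from the question) comparing the hand computation with R's acf():

```r
set.seed(101)
w = as.numeric(arima.sim(list(ar = 0.8), n = 100))  # example series
g = 10;  n = length(w);  wbar = mean(w)
# lag-g autocorrelation: pair W_i with W_{i+g}, but use the mean
# and variance of all n observations
r.manual = sum((w[1:(n-g)] - wbar) * (w[(g+1):n] - wbar)) /
           sum((w - wbar)^2)
r.acf = acf(w, lag.max = g, plot = FALSE)$acf[g + 1]
all.equal(r.manual, r.acf)   # TRUE
```

The two values agree because acf() uses exactly this estimator, dividing by the full-sample sum of squares rather than computing a separate mean and variance for each shifted subsequence.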

The program below simulates a Markov chain $W_i$ over $m = 10{,}000$ steps and makes an ACF plot of the series $W_1, \dots, W_{10000}$ for various lags. For simplicity, the $W_i$ take only the values 0 and 1; the state space of the chain is $\{0,1\}.$ [This is a simulated weather process, with sun=0 and rain=1, during the rainy season in a Mediterranean climate: e.g., Tel Aviv, San Francisco, Santiago.]

It seems that autocorrelations decay to insignificance after about lag 10. [Knowing that it rains today has essentially no predictive value for rain ten days from now.] Then the second ACF plot, of the thinned data $W_1, W_{11}, \dots, W_{9991},$ shows essentially no autocorrelation (beyond $g=0$). Autocorrelations within the dashed blue horizontal bands are taken as insignificant.

m = 10000;  w = numeric(m)          # simulate m days of weather
alpha = 0.1;  beta = 0.2            # P(rain after sun), P(sun after rain)
w[1] = 0                            # start with a sunny day (0=sun, 1=rain)
for (i in 2:m)  {
   if (w[i-1]==0)  w[i] = rbinom(1, 1, alpha)      # sun yesterday
   else            w[i] = rbinom(1, 1, 1 - beta)   # rain yesterday
   }
par(mfrow=c(1,2))
  acf(w)                            # autocorrelations of the full series
  j = seq(1, m, by=10)              # keep every 10th observation
  thin = w[j]
  acf(thin)                         # thinned series: ~no autocorrelation
par(mfrow=c(1,1))
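On Question 2, one common approximate way to count 'useful' points in a correlated series is the effective sample size, $n_{\mathrm{eff}} \approx m / (1 + 2\sum_k \rho_k),$ summing autocorrelations over lags until they fall below significance. A hedged sketch using the same simulated chain (the truncation rule and seed are my own choices, not from the original):

```r
set.seed(42)
m = 10000;  w = numeric(m)
alpha = 0.1;  beta = 0.2
w[1] = 0
for (i in 2:m)  {
   if (w[i-1]==0)  w[i] = rbinom(1, 1, alpha)
   else            w[i] = rbinom(1, 1, 1 - beta)
   }
r = as.numeric(acf(w, lag.max = 30, plot = FALSE)$acf)[-1]  # lags 1..30
K = which(r < 2/sqrt(m))[1]        # first lag inside the significance band
n.eff = m / (1 + 2 * sum(r[1:K])) # effective number of independent points
```

For this chain the lag-$k$ autocorrelation is $(1-\alpha-\beta)^k = 0.7^k$ at stationarity, so the theoretical value is about $m(1-0.7)/(1+0.7) \approx 1765$: the 10,000 correlated observations carry roughly as much information as 1,800 independent ones.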

[Figure: ACF of the full series (left) and of the thinned series (right)]
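Back to Question 1: once thinning gives you approximately independent points, one standard distribution-free route is a Wilks tolerance interval. The coverage of $[\min, \max]$ of $n$ i.i.d. observations has a $\mathsf{Beta}(n-1, 2)$ distribution, so you can solve for the smallest $n$ giving the required confidence. (This sketch is my addition, not part of the original answer, and it assumes the thinned points really are independent.)

```r
p = 0.95;  conf = 0.9999    # 95% coverage, 99.99% confidence
# coverage of [min, max] of n iid observations ~ Beta(n-1, 2)
n = 2
while (1 - pbeta(p, n - 1, 2) < conf)  n = n + 1
n                            # smallest n of independent points: 230
```

So with about 230 effectively independent points, the observed range $[\min, \max]$ contains at least 95% of the population with 99.99% confidence, with no distributional assumptions beyond independence.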