Probability of observing a false correlation and confidence limits


187 Views Asked by Bumbble Comm At 27 Mar 2026 - 2:11

In oil and gas exploration/development it is common to use acoustic impedance derived from reflection seismic surveys to predict the porosity measured in wells drilled in the reservoir.

I often use tables such as the one below (from a paper) to test for spuriousness of correlation:

R \ n   5       10      15      20      25      35      50      75      100
0.1     0.87    0.78    0.72    0.67    0.63    0.57    0.49    0.39    0.32
0.2     0.75    0.58    0.47    0.40    0.34    0.25    0.16    0.09    0.05
0.3     0.62    0.40    0.28    0.20    0.15    0.08    0.03    0.00    0.00
0.4     0.50    0.25    0.14    0.08    0.05    0.02    0.00    0.00    0.00
0.5     0.39    0.14    0.06    0.02    0.01    0.00    0.00    0.00    0.00
0.6     0.28    0.07    0.02    0.01    0.00    0.00    0.00    0.00    0.00
0.7     0.19    0.02    0.00    0.00    0.00    0.00    0.00    0.00    0.00
0.8     0.10    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00
0.9     0.04    0.00    0.00    0.00    0.00    0.00    0.00    0.00    0.00

where the values in the table give the probability that the absolute value of the sample correlation coefficient, r, is at least some constant R when the true correlation (ρ) is zero; in other words, the probability of a spurious correlation.

These values are calculated with the expression (in both papers):

$$p = \Pr(|r| \ge R) = \Pr\left(|t| \ge \frac{R\sqrt{n-2}}{\sqrt{1-R^2}}\right)$$

where n is the sample size, i.e. the number of locations (wells) where both the reservoir property (porosity) and the seismic attribute (acoustic impedance) are available, and t follows a Student's t distribution with n-2 degrees of freedom.
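
As a rough check of this expression, here is a minimal Python sketch that evaluates the two-sided tail probability using only the standard library. The function names `t_pdf` and `p_spurious` are my own, not from either paper, and the t distribution is integrated numerically with Simpson's rule, which is an assumption of this sketch rather than the papers' method.

```python
import math

def t_pdf(t, df):
    """Density of Student's t distribution with df degrees of freedom."""
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))
    return math.exp(log_norm - (df + 1) / 2 * math.log1p(t * t / df))

def p_spurious(R, n, steps=2000):
    """Pr(|r| >= R) for n samples when the true correlation is zero.

    Hypothetical helper: converts R to t0 = R*sqrt(n-2)/sqrt(1-R^2) and
    integrates the t density on [0, t0] with Simpson's rule.
    Requires 0 <= R < 1, n > 2 and an even number of steps.
    """
    df = n - 2
    t0 = R * math.sqrt(df) / math.sqrt(1 - R * R)
    h = t0 / steps
    total = t_pdf(0.0, df) + t_pdf(t0, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_mass = total * h / 3      # Pr(0 <= t <= t0)
    return 1 - 2 * central_mass       # both tails beyond +/- t0

# Table entry for R = 0.7, n = 5 (listed as 0.19)
print(round(p_spurious(0.7, 5), 2))
```

With a statistics library available, the same tail probability could be read from its t distribution directly; the numerical integration here only keeps the sketch dependency-free.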

For the columns in this table n is, respectively:

5   10  15  20  25  35  50  75  100

For the rows, R (the magnitude of the spurious sample correlation) is, respectively:

0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9

This table is used to assess the chance that an observed sample correlation, r, is spurious, that is, that the seismic attribute is actually uncorrelated with the reservoir property being predicted. Quoting again:

For example, given 5 wells, and an r = 0.7, there is a 19% probability that the correlation is false.

I've used this method for years and (I thought) I understood the theory and application well.

However, I recently read this statement in a paper that expands on the original paper where the table was published:

...however there is another aspect of the correlation coefficient that should be considered — the confidence limits of the true correlation coefficient. For this example, the 95% confidence limits are from a minimum r of -0.48 (P97.5) and a maximum r of 0.98 (P2.5). Because the minimum r is negative, we cannot say with confidence that there is any correlation and we should reject this attribute as a predictor. Considering one seismic attribute and a sample correlation of 0.7, we need 9 samples before the minimum r is positive, but its value is only 0.07, with a 4% chance that the correlation is false.

This is my question: where is this coming from? Neither the original paper nor the recent one published the data at each well, just the tables, so how can the author of the latter estimate the 95% confidence limits?

All I could think of is bootstrapping the 95% confidence interval around the mean r ... except that even for that they'd need at least the one sample (the 7 wells) to get the mean.

Is there any other way to get at that just using the values in the table?


Bumbble Comm On 28 Aug 2014 - 4:55

The sample size and the sample correlation are enough information to get a confidence distribution for the population correlation PROVIDED that the population has a bivariate normal distribution. In order to assess that, you'd need more than just those two statistics; preferably you'd want to look at all of the data. With only seven data points, you'd have to see something pretty extreme before you'd reject a hypothesis of bivariate normality. I don't actually know what the state of the art is in testing for bivariate normality.

The $4\%$ chance that the correlation is false probably really means that you would have to allow false positives at least $4\%$ of the time to say that this result is significant.

If you post this question to stats.stackexchange.com, someone might be able to say more than this.

**Bumbble Comm** · Accepted Answer

I found an explanation with a worked example on this site: http://www.tc3.edu/instruct/sbrown/stat/correl.htm

They even have an Excel spreadsheet.

Here's the worked example:

A sample of 25 points shows a linear correlation coefficient of 0.84. What is the 95% confidence interval for the correlation coefficient in the population? (Again, to keep things simple I’m giving you the sample statistics instead of the raw data, and we’ll assume that the requirements are met. But in real life, always check the requirements before computing a confidence interval.)

The solution is a wild ride; hang on!

(a) From 1−α = 0.95, find α/2 = 0.025. Use a table, use TI-83/84/89 invNorm, or use Excel NORMSINV( ) to find that the 95% confidence interval is bounded by z = ±1.96.

(b) That critical z of ±1.96 bounds the confidence interval in the standard normal distribution with σ=1; for this one you must multiply by the standard deviation of the Fisher Z, which is 1/√(n−3). For n = 25 points that is σ = 1/√(25−3) = 0.213. Multiplying by the 1.96 from (a) gives E = 0.418. E is the error of the estimate, which is half the width of the confidence interval for Fisher’s transformed Z.

(c) Now use equation 3 and r = 0.84 to compute Z = 1.221. This is the Fisher Z for this particular sample. Using the result from (b), the confidence interval for the transformed Z is 1.221 ± 0.418, which is 0.803 to 1.639.

(d) Plug those Fisher-Z endpoints into equation 4. Z = 0.803 yields ρ = 0.666, and Z = 1.639 yields ρ = 0.927.

Conclusion: If 25 points have a linear correlation coefficient of 0.84, then you’re 95% confident that the population’s linear correlation coefficient is between 0.666 and 0.927.

Remark: The sample statistic 0.84 is not at the middle of the confidence interval, because the sample r values have a skewed distribution around the population correlation coefficient ρ.
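
The "equation 3" and "equation 4" cited in steps (c) and (d) belong to the linked site and are not reproduced here. They are presumably the Fisher transformation and its inverse, which agree with the numbers in the example (artanh(0.84) ≈ 1.221, tanh(0.803) ≈ 0.666). Under that assumption, the whole procedure can be written as:

$$Z = \operatorname{artanh}(r) = \tfrac{1}{2}\ln\frac{1+r}{1-r}, \qquad \sigma_Z = \frac{1}{\sqrt{n-3}}, \qquad E = z_{\alpha/2}\,\sigma_Z$$

$$\rho_{\text{lower}} = \tanh(Z - E), \qquad \rho_{\text{upper}} = \tanh(Z + E)$$

Here $z_{\alpha/2}$ is the standard normal critical value (1.96 for 95%). Because tanh is nonlinear, an interval that is symmetric in Z becomes asymmetric in r, which is the skew mentioned in the remark.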

So using their spreadsheet, I get the following for my example of n=5 and r=0.7:

conf level  0.95    
n   5   
r*  0.7 
        
critical z  1.960   
std dev of Z    0.707   
E   1.386   
Z* from r*  0.867   
Z conf int  -0.519  2.253
r conf int  -0.477  0.978

So the paper was correct in giving an interval for r of -0.48 (P97.5) to 0.98 (P2.5).

Also, at 8 wells the minimum r is still negative:

conf level  0.95    
n   8   
r*  0.7 
r conf int  -0.009  0.941

and at 9 it becomes positive, but only 0.067:

conf level  0.95    
n   9   
r*  0.7 
r conf int  0.067   0.931

so correct again.
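
For anyone without the spreadsheet, the same numbers can be reproduced with a few lines of Python. This is only a sketch of the Fisher Z procedure described above, not the site's spreadsheet; the function name `fisher_ci` is my own, and like the worked example it assumes the bivariate normality requirement is met.

```python
import math
from statistics import NormalDist

def fisher_ci(r, n, conf_level=0.95):
    """Approximate confidence interval for the population correlation.

    Hypothetical helper: uses the Fisher Z transform with standard
    deviation 1/sqrt(n-3), so it needs n > 3 and -1 < r < 1.
    """
    z_crit = NormalDist().inv_cdf(1 - (1 - conf_level) / 2)  # about 1.96 for 95%
    half_width = z_crit / math.sqrt(n - 3)                    # E in the example
    z = math.atanh(r)                                         # Fisher Z of the sample r
    return math.tanh(z - half_width), math.tanh(z + half_width)

# Sample correlation of 0.7 with 5, 8 and 9 wells
for n in (5, 8, 9):
    lower, upper = fisher_ci(0.7, n)
    print(n, round(lower, 3), round(upper, 3))
```

The lower limit first becomes positive at n = 9, matching the statement quoted from the paper.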
