Increase correlation of whole set part by part

Question

72 Views Asked by Bumbble Comm At 27 Mar 2026 - 3:07

I have a data set of $n$ elements (pixels) with three attributes $A$, $B$ and $C$. Attribute $A$ is strongly correlated with both $B$ and $C$, and the correlation between $A$ and $B$ is stronger than the one between $A$ and $C$. Each element has all attributes (no NAs) and attributes are independent. I'm trying to modify the values of attribute $C$ in order to boost its correlation with $A$ and achieve $cor(A,C)>cor(A,B)$.

My idea is to randomly split the data set into $k$ non-overlapping parts and to calculate $cor(A,B)_k$ and $cor(A,C)_k$ for each part. Then I modify the values of attribute $C_k$ until $cor(A,C)_k>cor(A,B)_k$. My hypothesis is that if I independently achieve $cor(A,C)_k>cor(A,B)_k$ for each part $k$, then the correlation on the whole set will also be greater, $cor(A,C)>cor(A,B)$. I don't want to tamper with the way correlation is calculated in order to boost it artificially; I just use it as a measure of how well I have modeled attributes $B$ and $C$. My problem is that my statistics background is poor, and I'm not sure whether this is possible or whether I should impose some additional conditions on each part $k$ in order to achieve a greater correlation over the whole data set.
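One way to see what the part-wise condition leaves uncontrolled is the standard decomposition of the pooled covariance into a within-part and a between-part term. The notation here is introduced only for illustration: $n_k$ is the size of part $k$, $w_k = n_k/n$, $\bar A_k$ and $\bar C_k$ are the part means, $\bar A$ and $\bar C$ the overall means, and covariances are taken with divisors $n_k$ and $n$.

$$\operatorname{cov}(A,C) \;=\; \sum_k w_k \operatorname{cov}_k(A,C) \;+\; \sum_k w_k\,(\bar A_k-\bar A)(\bar C_k-\bar C)$$

The variances of $A$ and $C$ decompose in the same way (set $C=A$, or $A=C$, in the formula). The part-wise correlations $cor(A,C)_k$ only involve the first sum. If modifying $C_k$ part by part changes the part means $\bar C_k$ so that they no longer follow $\bar A_k$, the second sum can shrink or change sign, and the pooled correlation can fall even though every $cor(A,C)_k$ has risen.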

I'm working with some rasters and I have successfully programmed this in R, but since I'm getting bad results I'm not sure whether I made a mistake while coding or whether my starting premise is wrong.

EDIT 2

For those who are more into remote sensing, here is the explanation of $A$, $B$ and $C$ attributes and idea behind proposed method:

Attribute $A$ is surface soil moisture (SSM), and attributes $B$ and $C$ are temperature vegetation dryness indexes (TVDI) calculated in two different ways. Without getting into details, SSM and TVDI are calculated from different remote sensing sources, but there is a strong correlation between the two.

TVDI is usually determined globally, using the same parameters for the whole area of interest (attribute $B$). But my point is that if the area of interest is very large (like a whole continent or the whole world), it might be better to divide the area into smaller parts, calculate TVDI locally for each part, and then merge it all back into one raster.

I'm trying to develop a new method (attribute $C$) where TVDI is calculated with local parameters, using a moving kernel (for example, a kernel with a center of 100x100 pixels and a buffer of 50 pixels, similar to focal raster operations). So when the kernel is positioned at some part of the raster matrix, I use all data that falls inside the kernel+buffer (in the example that is 150x150 pixels) and I calculate TVDI just for the central part of the kernel. The second part of the idea is to increase the size of the buffer gradually (after enough iterations it covers the whole area of interest) and to use the correlation as a measure of which buffer size is best. In the meantime I consulted some of my colleagues, and correlation may not be the best way of determining the best buffer size, but I didn't get any better results using RMSE either.
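The kernel-plus-buffer tiling described above could be indexed roughly as follows. This is only a sketch: `tile_windows` and its column names are hypothetical, and it assumes that the buffer is the total extra width split evenly between both sides, which is the reading that turns a 100x100 center with a buffer of 50 into a 150x150 window. The TVDI calculation itself is not shown.

```r
# Hypothetical helper (not from the original post): for each center tile,
# return its row/column range and the range of the surrounding window.
# Assumption: `buffer` is the TOTAL extra width, split evenly on both sides,
# so center = 100 and buffer = 50 give a 150 x 150 window.
tile_windows <- function(n_row, n_col, center = 100, buffer = 50) {
  pad <- buffer %/% 2
  grid <- expand.grid(r0 = seq(1, n_row, by = center),
                      c0 = seq(1, n_col, by = center))
  r1 <- pmin(grid$r0 + center - 1, n_row)   # last row of the center tile
  c1 <- pmin(grid$c0 + center - 1, n_col)   # last column of the center tile
  data.frame(
    r0 = grid$r0, r1 = r1, c0 = grid$c0, c1 = c1,
    # window = center tile plus padding, clipped to the raster edges
    wr0 = pmax(grid$r0 - pad, 1), wr1 = pmin(r1 + pad, n_row),
    wc0 = pmax(grid$c0 - pad, 1), wc1 = pmin(c1 + pad, n_col)
  )
}

w <- tile_windows(n_row = 300, n_col = 300)
interior <- w[w$r0 == 101 & w$c0 == 101, ]
interior$wr1 - interior$wr0 + 1   # window height of an interior tile
```

Windows that touch the raster edge are clipped and therefore smaller than interior ones. Once `buffer` is large enough, every window covers the whole raster, which corresponds to the global calculation.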

Whatever I do, the measurement of how well I have modeled both global and local TVDI is correlation calculated for the whole area of interest. Basically, I'm calculating correlation of corresponding pixels of SSM raster, global TVDI and the local TVDI which represents all center parts of kernels merged back to one large raster.

So what I'm getting is that, independently, each 100x100 pixel center of the locally calculated TVDI has a greater correlation with SSM than the corresponding 100x100 part of the global TVDI, but when I merge all local TVDI centers into one large raster and then calculate the correlation, it is much smaller than the one for the global TVDI.

I don't understand how it is possible that the correlation is stronger part by part, but gets lower when I merge all parts into one.
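The effect can be reproduced with synthetic data. The sketch below is an assumption-laden toy, not the actual rasters: the parts differ in their mean level of `A`, `B` follows `A` everywhere with a lot of noise, and `C` follows `A` very closely inside each part but carries no information about the differences between parts. All names and numbers are made up for illustration.

```r
set.seed(1)
k <- 20; m <- 500                          # 20 parts of 500 pixels each
part <- rep(seq_len(k), each = m)

mu_A <- seq(-5, 5, length.out = k)         # each part has its own mean level of A
A <- mu_A[part] + rnorm(k * m)
B <- A + rnorm(k * m, sd = 1.5)            # "global" index: noisy, but tracks the level of A
C <- (A - mu_A[part]) + rnorm(k * m, sd = 0.5)  # "local" index: precise within a part,
                                                # blind to differences between parts

idx <- split(seq_along(A), part)
within_AB <- sapply(idx, function(i) cor(A[i], B[i]))
within_AC <- sapply(idx, function(i) cor(A[i], C[i]))

all(within_AC > within_AB)                 # does C win in every single part?
pooled_AB <- cor(A, B)
pooled_AC <- cor(A, C)
c(pooled_AB = pooled_AB, pooled_AC = pooled_AC)
```

In this construction `C` beats `B` inside every part, yet the pooled `cor(A, C)` is far below the pooled `cor(A, B)`, because most of the variation of `A` over the whole set lies between the parts and only `B` captures it.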

This is the basic problem. I need someone to explain to me whether this is statistically sound, and what I need to achieve for each part so that in the end I get an increase of correlation over the whole set.

Thanks in advance!

Original Q&A

There are 1 best solutions below

**Bumbble Comm** · Answer 1 · 2019-10-05 20:23:59

I have two objections to your proposed procedure. First, if A and C are barely correlated, I don't see a legitimate way to find subsets that behave as you wish. Second, I don't see why you would expect A and C, 'micromanaged' in this way, to reliably show a higher correlation in future studies (by you or others).

There are various ways to 'tamper' with data to give the (usually false) impression of a higher than natural correlation. Consider independent data vectors $A$ and $C$ below.

Initially, variables $A$ and $C$ happen to have a slight (statistically insignificant) negative correlation. Then I throw away almost half of the observations, namely the ones that are relatively far from the 45-degree line, and get a much higher correlation $(r \approx 0.67)$ between the remaining points $A_1$ and $C_1.$ (You may think I'm joking, but I once refereed a paper that tried this. I was suspicious because the edges of the submitted data cloud looked 'too flat'.)

This method will work most of the time. Maybe I misunderstood something and your method is OK. But can you explain how it is OK and mine is not?

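    par(mfrow=c(1,2));  set.seed(2019)
    A = rnorm(200, 100, 15); C = rnorm(200, 100, 15)   # independent by construction
    cor(A,C)
    ## [1] -0.02522233
    plot(A,C, pch=20)
    abline(15,1, col="red"); abline(-15,1, col="red")  # band of half-width 15 around the 45-degree line
    cond = (C>=A-15 & C<=A+15)                         # TRUE for points inside the band
    mean(cond)                                         # fraction of points kept
    ## [1] 0.53

    A1 = A[cond]; C1 = C[cond]                         # discard everything outside the band
    cor(A1, C1)
    ## [1] 0.6702914
    plot(A1, C1, pch=20)
    par(mfrow=c(1,1))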
