I need to make sure that this function is a metric:
$d(i,j) = 1-\textrm{corr}(i,j)$
where $\textrm{corr}(x,y)$ is the Pearson correlation coefficient, which takes values in $[-1,1]$. With this scaling I can satisfy the first three properties of a metric:
- $d(i,j)\geq 0$
- $d(i,i)=0$
- $d(i,j)=d(j,i)$
but I'm not sure how to check the triangle inequality:
$d(i,k) \leq d(i,j)+d(j,k)$.
Should I employ some property of correlation coefficients?
For $d$ to be a metric, $d(i,j)=0$ must imply $i=j$. But if $i$ and $j$ are positively linearly related ($j=ai+b$ with $a>0$), then $\textrm{corr}(i,j)=1$ and hence $d(i,j)=0$ even though $i\neq j$. So first you have to partition the space of random variables into disjoint equivalence classes, where the equivalence relation is: $x\sim y$ if $x$ and $y$ are positively linearly related. Then define the function $d([i],[j])=1-\textrm{corr}(i,j)$, where $[k]$ is the equivalence class containing $k$. One can check that this function is well-defined, i.e. its value does not depend on which representatives of the classes are chosen.
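A quick numerical sketch of why the equivalence classes are needed: Pearson correlation is invariant under positive linear transformations, so $d$ vanishes on distinct but linearly related variables. (The particular transformation $y = 2x + 3$ below is just an illustrative choice.)

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=100)
y = 2.0 * x + 3.0  # positively linearly related to x, but y != x

# d(x, y) = 1 - corr(x, y) is (numerically) zero even though x != y,
# so d fails the identity-of-indiscernibles axiom on raw variables.
d = 1.0 - np.corrcoef(x, y)[0, 1]
print(d)  # ~0 up to floating-point error
```

This is exactly why the distance must be defined on equivalence classes $[x]$ rather than on the variables themselves.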
Now consider the regression of $i$ on $j$ and of $j$ on $k$, with regression equations $$i=a_{i,j}+\frac{r_{i,j}s_{i}}{s_{j}}j+\epsilon_{i,j}$$ $$j=a_{j,k}+\frac{r_{j,k}s_{j}}{s_{k}}k+\epsilon_{j,k}$$ where $r_{m,n}$ is the correlation between the variables $m,n$ and $s_l$ is the standard deviation of the variable $l$. If the residual $\epsilon_{i,j}$ is uncorrelated with $k$ (so that substituting the second equation into the first recovers the regression of $i$ on $k$), then $$\frac{r_{i,k}s_i}{s_k}=\frac{r_{i,j}s_i}{s_j}\cdot\frac{r_{j,k}s_j}{s_k},$$ i.e. $r_{i,k}=r_{i,j}r_{j,k}$. Hence, if all three correlations are nonnegative, $$\textrm{corr}(i,k)^2\leq \textrm{corr}(i,k)=\textrm{corr}(i,j)\,\textrm{corr}(j,k)\leq\frac{\textrm{corr}(i,j)^2+\textrm{corr}(j,k)^2}{2}\leq\frac{\textrm{corr}(i,j)+\textrm{corr}(j,k)}{2},$$ using AM-GM in the middle and $t^2\leq t$ for $t\in[0,1]$ at the ends. But the triangle inequality $d(i,k)\leq d(i,j)+d(j,k)$ is equivalent to $\textrm{corr}(i,j)+\textrm{corr}(j,k)\leq 1+\textrm{corr}(i,k)$, which requires a lower bound on $\textrm{corr}(i,k)$, while the chain above only gives an upper bound. So nothing can be concluded about the triangle inequality from this.
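One can also probe the triangle inequality directly on concrete samples. The three zero-mean vectors below are my own choice (not part of the argument above), picked so the pairwise correlations come out to $1/2$, $1/2$ and $-1/2$:

```python
import numpy as np

def d(u, v):
    # candidate distance: 1 minus the Pearson correlation
    return 1.0 - np.corrcoef(u, v)[0, 1]

# Three zero-mean samples behaving like vectors at 60/60/120 degrees.
i = np.array([ 1.0, 0.0, -1.0])
j = np.array([ 0.0, 1.0, -1.0])
k = np.array([-1.0, 1.0,  0.0])

# d(i,j) ≈ 0.5, d(j,k) ≈ 0.5, but d(i,k) ≈ 1.5 > d(i,j) + d(j,k)
print(d(i, j), d(j, k), d(i, k))
print(d(i, k) <= d(i, j) + d(j, k))
```

So $1-\textrm{corr}$ can indeed violate the triangle inequality, which is consistent with the inequality chain above not being able to establish it.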