I have two variables $X$ and $Y$ given as a list of pairs $(x, y)$, and I want to see whether there is a relationship between the two variables. I can do so by computing the correlation coefficient.
However, I found that by selecting an arbitrary subset of the data (e.g. $\{(x, y) \mid x > k\}$), I can get a higher correlation coefficient and thus a seemingly stronger result. Is doing so mathematically sound? Put simply, I have no a priori reason to believe that certain data points are "more important" than others.
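To make the effect concrete, here is a toy sketch (the numbers are invented purely for illustration, and `pearson_r` is a hand-rolled helper, not any library's function):

```python
import math

def pearson_r(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in points))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in points))
    return cov / (sx * sy)

data = [(1, 5), (2, 0), (3, 3), (4, 4), (5, 5), (6, 6)]
subset = [(x, y) for x, y in data if x > 2]  # keep only x > k, with k = 2

print(round(pearson_r(data), 4))    # 0.5253 on the full data
print(round(pearson_r(subset), 4))  # 1.0 on the subset: a perfect line
```

The first two points are noisy, so throwing them away makes the remaining points look perfectly linear, even though nothing about the data says they should be discarded.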
That sounds like data manipulation to me. If we only look at some portion of the data, the correlation is stronger? How do we justify ignoring the rest of the data?
Imagine that we have three points $(1,1)$, $(2,4)$, and $(3,9)$ and we try to fit them to the linear equation $y=ax+b$, arriving at a very nice correlation coefficient of $0.9897$. If we remove any one point, we arrive at a correlation coefficient of $1$, because any two points lie on some line. This is indeed "mathematically justified", and it is not surprising that removing data can produce a better fit. But if the goal is to find a model that fits the data, it would be better to report that we can only achieve a correlation coefficient of $0.9897$ when fitting a linear equation to the full data set. Maybe (just maybe) the data fits some other model more nicely.
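You can check these numbers yourself with a few lines of Python (`pearson_r` here is just an illustrative implementation of the Pearson correlation, not a particular library's API):

```python
import math

def pearson_r(points):
    """Pearson correlation coefficient of a list of (x, y) pairs."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    cov = sum((x - mx) * (y - my) for x, y in points)
    sx = math.sqrt(sum((x - mx) ** 2 for x, _ in points))
    sy = math.sqrt(sum((y - my) ** 2 for _, y in points))
    return cov / (sx * sy)

points = [(1, 1), (2, 4), (3, 9)]
print(round(pearson_r(points), 4))  # 0.9897 with all three points

# Dropping any single point leaves two points, which always lie on a line:
for i in range(3):
    reduced = points[:i] + points[i + 1:]
    print(round(pearson_r(reduced), 4))  # 1.0 every time
```

The loop makes the point explicit: no matter which point you delete, the "fit" becomes perfect, which says nothing about the underlying relationship (here the data is exactly $y = x^2$).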
Is it mathematically justified? I guess... Is it a good way to arrive at a mathematical model that fits your data? I doubt it.