Number of distinct scatterplots from $p$ variables in a data set


Consider the following quote from the text An Introduction to Statistical Learning:

In practice, we often encounter data sets that contain many more than two variables. In this case, we cannot easily plot the observations. For instance, if there are p variables in our data set, then p(p − 1)/2 distinct scatterplots can be made, and visual inspection is simply not a viable way to identify clusters.

What exactly do the authors mean when they say that $\frac{p(p-1)}{2}$ distinct scatterplots can be made? The quote does not refer to any specific data or example, so this is the only context.

I understand that this question would be a better post for the Cross Validated Stack Exchange; however, this site is more popular and more active, so I thought I would post it here. Nevertheless, it is still math.

Thanks in advance!


There are 2 best solutions below

On BEST ANSWER

$p(p-1)/2$ is the number of ways you can choose two features (without respect to order) from $p$ variables.

So if you had four variables $(a,b,c,d)$ you could have $(4 \times 3)/2 = 6$ scatterplots: $a-b$, $a-c$, $a-d$, $b-c$, $b-d$ and $c-d$. Choosing the variables in the reverse order gives the same scatterplot: $a-c$ is the same as $c-a$.
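As a quick check of the count above, here is a minimal Python sketch that lists the unordered pairs for the four variables. The variable names are just the labels from the example, and `itertools.combinations` is used as one convenient way to enumerate the pairs, not the only one.

```python
from itertools import combinations
from math import comb

# The four variable labels from the example above.
variables = ["a", "b", "c", "d"]

# Each unordered pair of distinct variables corresponds to one scatterplot.
pairs = list(combinations(variables, 2))

for x, y in pairs:
    print(f"{x}-{y}")

# The number of pairs agrees with p(p - 1)/2 and with the binomial coefficient.
p = len(variables)
assert len(pairs) == p * (p - 1) // 2 == comb(p, 2)
```

Because `combinations` never emits both `("a", "c")` and `("c", "a")`, each scatterplot is counted once.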

On

The number of 2-D plots that can be made from $p$-variate data is ${p \choose 2} = \frac{p(p-1)}{2},$ the number of ways to choose two variables from among $p$ without regard to order.
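For readers who want the intermediate step, the identity follows from the factorial form of the binomial coefficient: there are $p$ choices for the first variable and $p-1$ for the second, and dividing by $2$ removes the double counting of each pair.

$${p \choose 2} = \frac{p!}{2!\,(p-2)!} = \frac{p(p-1)}{2}, \qquad \text{for example } {4 \choose 2} = \frac{4 \cdot 3}{2} = 6.$$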

Some software packages make 'matrix plots', which show a 'matrix' of 2-D scatterplots. Output from the function pairs in R statistical software shows matrix plots for Fisher's famous iris data, which has $p = 4$ measurements on each specimen. The second plot in the link shows the ${4 \choose 2} = 6$ possible plots involving two different variables. Because some people like to see the plots with $x$ and $y$ variables interchanged, the first plot shows all $4 \times 3 = 12$ ordered pairs.
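The difference between the 12-panel and 6-panel layouts can be sketched in Python by counting ordered versus unordered pairs of variable indices. The function name `panel_counts` is hypothetical and is introduced only for this illustration; it does not reproduce what R's pairs does internally.

```python
from itertools import combinations, permutations


def panel_counts(p):
    """Return (ordered, unordered) counts of scatterplot panels for p variables."""
    # Ordered pairs: x-vs-y and y-vs-x are treated as different panels.
    ordered = sum(1 for _ in permutations(range(p), 2))
    # Unordered pairs: each pair of variables is plotted once.
    unordered = sum(1 for _ in combinations(range(p), 2))
    return ordered, unordered


ordered, unordered = panel_counts(4)
print(ordered, unordered)
```

For $p = 4$ this gives 12 ordered and 6 unordered panels, matching the two layouts described above.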