Looking for anomalies in the relationship between two variables


I was given a data set that contains basic information about individuals. I load it in R as follows:

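library(foreign)
data=read.spss('compudat.sav')     # returns a list by default
compudat=as.data.frame(data)       # convert the list to a data frame
attach(compudat)                   # attach the data frame rather than the raw list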

I'm told to check whether there is any anomaly in the relationship between the weight and the height of the individuals: extract the two variables, order them by the first and then by the second, group them into a matrix, and then examine the data for anomalies.

bivar = compudat[,c("PESO","ALTURA")] # selecting weight and height
bivar[with(bivar,order(PESO)),]       # order by weight
bivar[with(bivar,order(ALTURA)),]     # order by height

I don't have a clear idea of what I should be looking for to detect these anomalies.

Any ideas?

Best answer

One standard way to detect anomalies is to check whether any weight or height, taken by itself, lies so far out in the tails that a value at least that extreme would occur with probability less than $0.05$, assuming a normal distribution with the same mean and standard deviation as your data. The problem with this approach is that it completely ignores the bivariate nature of your data.
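As a rough sketch of this univariate check, the snippet below computes a two-sided tail probability for each value under a fitted normal distribution. The data are simulated stand-ins (the vectors `altura` and `peso` and the helper `tail_prob` are hypothetical, not taken from `compudat.sav`); with the real data you would pass `compudat$ALTURA` and `compudat$PESO` instead.

```r
# Hypothetical stand-in data; replace with compudat$ALTURA and compudat$PESO.
set.seed(1)
altura <- c(rnorm(99, mean = 170, sd = 8), 215)  # last height is a planted outlier
peso   <- c(rnorm(99, mean = 70, sd = 10), 72)

# Two-sided tail probability of each value under a normal distribution
# with the same mean and standard deviation as the sample.
tail_prob <- function(x) {
  z <- (x - mean(x)) / sd(x)
  2 * pnorm(-abs(z))
}

# Indices of observations whose tail probability is below 0.05
flag_altura <- which(tail_prob(altura) < 0.05)
flag_peso   <- which(tail_prob(peso) < 0.05)
```

Keep in mind that with a $0.05$ cutoff roughly one in twenty perfectly ordinary observations gets flagged as well, so the flagged rows are candidates to inspect rather than confirmed anomalies.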

So, another way of doing this is to normalize the heights and weights (feature normalization), compute the statistic $d=\sqrt{h^2+w^2},$ where $h$ and $w$ are the normalized height and weight, and perform the same analysis on this new statistic. This method has its drawbacks as well. You can choose which statistic to compute: maybe you want the $L_1$ distance instead of the Euclidean $L_2$ distance I just used. To decide this, you simply must know your data and what it means.
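A minimal sketch of the distance statistic, again on simulated stand-in data (`altura`, `peso`, and the 95% quantile cutoff are assumptions made for illustration, not part of the original data set):

```r
# Hypothetical stand-in data; replace with compudat$ALTURA and compudat$PESO.
set.seed(2)
altura <- c(rnorm(99, mean = 170, sd = 8), 200)   # planted outlier in both variables
peso   <- c(rnorm(99, mean = 70, sd = 10), 115)

# Feature normalization: subtract the mean and divide by the standard deviation
h <- as.numeric(scale(altura))
w <- as.numeric(scale(peso))

d_l2 <- sqrt(h^2 + w^2)    # Euclidean (L2) distance from the center
d_l1 <- abs(h) + abs(w)    # L1 alternative

# Flag the points whose L2 distance is in the top 5% (cutoff is an assumption)
flagged <- which(d_l2 > quantile(d_l2, 0.95))

# Observations ordered from most to least distant
ranking <- order(d_l2, decreasing = TRUE)
```

Swapping `d_l2` for `d_l1` in the last two statements applies the same check to the $L_1$ distance.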

But even this method isn't going to catch some anomalies. Maybe what's "usual" in this data set is for a point to lie near a line; that is, there's a decent correlation between the two variables (surely this is plausible for height and weight). In that case, you might want to use the distance from the best-fit line as your statistic.
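One way to sketch the best-fit-line idea is with `lm()` and standardized residuals. Note the assumptions: the data are simulated, the residuals measure vertical distance from the regression line of weight on height (not perpendicular distance), and the cutoff of 3 is an arbitrary illustrative choice.

```r
# Hypothetical stand-in data with a linear height-weight relationship.
set.seed(3)
altura <- rnorm(100, mean = 170, sd = 8)
peso   <- 70 + 0.9 * (altura - 170) + rnorm(100, mean = 0, sd = 5)

# Planted anomaly: tall but light, so the pair is unusual
# even though neither value is far from its own mean.
altura[100] <- 182
peso[100]   <- 58

# Least-squares line of weight on height
fit <- lm(peso ~ altura)

# Standardized residuals: vertical distance from the line, scaled by its
# estimated standard deviation
res <- rstandard(fit)

# Flag points far from the line (the cutoff of 3 is an assumption)
flagged <- which(abs(res) > 3)
```

A scatter plot such as `plot(altura, peso); abline(fit)` makes the same points easy to spot by eye.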