Divide by standard deviation and subtract mean or divide by max?


I am trying to do feature scaling to normalize features across different samples. For example, one sample has a feature in the range $[0,10]$, while another sample has the same feature in the range $[0,100]$.

The book that I am reading suggests subtracting the mean and then dividing by the standard deviation. But why don't we just divide each feature by its maximum value? So we divide the first sample's features by $10$ and the second's by $100$. Why prefer the former over the latter?

Accepted answer:

The purpose of normalization is to place all features on the same scale.

  • Subtracting feature mean and dividing by feature standard deviation will lead to all features having mean 0 and standard deviation 1. This is known as "$z$-scaling".

  • If all your features range from 0 to some maximum value, then dividing by the feature max will lead to all features ranging from 0 to 1. This is often dubbed "max-scaling". (If a feature's values range from $a$ to $b$, you could achieve the same by subtracting $a$ and then dividing by $b-a$.)
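The two scalings above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production implementation; the helper names `z_scale` and `max_scale` are made up for this example:

```python
import numpy as np

def z_scale(x):
    # subtract the feature mean, divide by the feature standard
    # deviation -> result has mean 0 and standard deviation 1
    return (x - x.mean()) / x.std()

def max_scale(x):
    # divide by the feature maximum -> result lies in [0, 1]
    # (assumes the feature's values are non-negative)
    return x / x.max()

x = np.array([0.0, 2.0, 5.0, 10.0])  # feature observed on a [0, 10] scale
y = 10 * x                           # same feature on a [0, 100] scale

# both scalings map the two samples onto identical values
print(max_scale(x))  # [0.  0.2 0.5 1. ]
print(max_scale(y))  # [0.  0.2 0.5 1. ]
print(z_scale(x))    # identical to z_scale(y), centered at 0
```

Note that both are affine transformations; they differ only in where the result is centered (0 for z-scaling) and how the spread is measured (standard deviation versus maximum).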

Both of these strategies can be considered normalization, but they embody different notions of what normalization means. The former ($z$-scaling) ensures that each feature is centered at zero with the same spread; the latter (max-scaling) ensures that each feature is bounded between 0 and 1. Max-scaling works poorly when a feature has a skewed distribution, because the values bunch up at one end of the $[0,1]$ interval, so you haven't really achieved a consistent centering of your features. Similarly, if a feature has outliers at both ends, max-scaling compresses the bulk of your data into a tiny range, so you've failed to achieve a consistent spread. In the absence of skew or outliers, max-scaling is fine.
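The outlier problem is easy to demonstrate: one large value dominates the maximum, so after max-scaling the remaining observations are squeezed into a sliver near zero. A small made-up example:

```python
import numpy as np

# a skewed feature with one gross outlier
x = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 100.0])

max_scaled = x / x.max()
z_scaled = (x - x.mean()) / x.std()

# under max-scaling, everything except the outlier lands at or
# below 0.04 -- the bulk of the data occupies 4% of the [0, 1] range
print(max_scaled)

# z-scaling is also distorted by the outlier (it inflates the mean
# and standard deviation), but the result is at least centered
print(z_scaled)
```

Since both scalings are linear, neither undoes the compression on its own; that is why the answer recommends handling gross outliers before normalizing at all.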

On balance, if you want to apply a uniform normalization strategy to your features, the $z$-scaling approach produces more consistent results. (But it won't solve every problem with your data. As part of careful data cleaning, you should handle gross outliers before doing any normalization, and consider transforming your features to reduce skewness.)
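The suggested order of operations (outliers first, then skew, then scaling) could be sketched as follows. The function name, the percentile-clipping choice, and the log transform are all assumptions for illustration, not a prescription from the answer:

```python
import numpy as np

def clean_and_scale(x, upper_pct=99.0):
    """Illustrative pipeline: clip gross outliers, reduce skew, z-scale."""
    # 1. handle gross outliers: clip at a chosen upper percentile
    #    (winsorizing at the 99th percentile is one common heuristic)
    upper = np.percentile(x, upper_pct)
    x = np.minimum(x, upper)
    # 2. reduce right skew with a log transform
    #    (log1p assumes non-negative values)
    x = np.log1p(x)
    # 3. finally, z-scale: mean 0, standard deviation 1
    return (x - x.mean()) / x.std()

raw = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 1000.0])
scaled = clean_and_scale(raw)
```

After these steps the feature is centered with unit spread, and the single extreme value no longer dominates the scale.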