Comparing statistical features extracted from time series using correct distance metrics


I would like to cluster 400 car rental demand time series (small positive valued) based on the following 7 statistical features:

entropy, number of mean crossings, 95th percentile, root mean squared, variance, kurtosis, max value

These features were shortlisted from around 20 statistical features after removing the highly correlated statistical features.
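For concreteness, the 7 features can be computed per series along these lines. This is a minimal sketch assuming numpy/scipy; in particular, the histogram-based Shannon entropy is only one of several possible entropy definitions (the question does not specify which one is used), and the function name is illustrative.

```python
import numpy as np
from scipy.stats import kurtosis

def extract_features(series):
    """Compute the 7 statistical features for one time series.

    Entropy here is a histogram-based Shannon entropy -- an assumption,
    since the original post does not state which entropy is used.
    """
    series = np.asarray(series, dtype=float)

    # Histogram-based Shannon entropy (bins=10 is arbitrary)
    counts, _ = np.histogram(series, bins=10)
    probs = counts / counts.sum()
    probs = probs[probs > 0]
    entropy = -np.sum(probs * np.log2(probs))

    # Count sign changes of the mean-centered series
    mean_crossings = np.sum(np.diff(np.sign(series - series.mean())) != 0)

    return np.array([
        entropy,
        mean_crossings,
        np.percentile(series, 95),
        np.sqrt(np.mean(series ** 2)),  # root mean squared
        np.var(series),
        kurtosis(series),
        series.max(),
    ])
```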

What is the correct distance metric to compare two of these statistical feature vectors? Currently, I standardize each feature (mean = 0, variance = 1) so that the features are comparable, and then use the Euclidean distance between the standardized feature vectors.
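The approach described above can be sketched as follows (a minimal numpy version; the function name is my own, and the resulting matrix could be passed to any clustering method that accepts precomputed distances):

```python
import numpy as np

def standardized_distances(features):
    """features: array of shape (n_series, n_features).

    Z-score each feature column (mean 0, variance 1), then return
    the (n_series, n_series) matrix of pairwise Euclidean distances.
    """
    z = (features - features.mean(axis=0)) / features.std(axis=0)
    # Broadcast to all pairwise differences, then reduce to distances
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))
```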

Is this a reasonable approach?

Are there better distance metrics for these statistical features?


1 Answer


When it comes to clustering, the choice of a good (or the best) distance function is seldom driven by mathematical considerations alone:
Depending on the use case and the goal of the clustering, different criteria may define when two samples are close to each other. For time series, it may matter more whether they follow the same trends (going up and down at the same times), or whether their values are close to each other, or a delay may not matter as long as the shape is similar, ...
Without this domain knowledge, it is hard to distinguish a good distance from a bad one.

But: if you have decided that your 7 features are the important factors and that they are equally important for defining the distance, your approach is a good start.

Improvements:

  • Mean and variance are not very robust to outliers. If outliers might be a problem, you could filter them out before computing the standardization parameters (mean, variance).
  • Currently, you assume that each feature is equally important and you do not account for correlations. With highly correlated features (the 95th percentile and the max are likely candidates), the underlying information would be over-represented in the distance function. The Mahalanobis distance addresses this: it takes not only the variance but also the covariance into account.
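A sketch of the Mahalanobis alternative, assuming numpy and that the feature covariance matrix estimated from the 400 series is well-conditioned (with 400 samples and 7 features it normally is; the function name is illustrative):

```python
import numpy as np

def mahalanobis_matrix(features):
    """Pairwise Mahalanobis distances between rows of features
    (shape (n_series, n_features)).

    The inverse covariance down-weights directions in which features
    are correlated (e.g. 95th percentile and max), so their shared
    information is not counted twice.
    """
    cov = np.cov(features, rowvar=False)
    cov_inv = np.linalg.pinv(cov)  # pseudo-inverse guards against near-singular cov
    diff = features[:, None, :] - features[None, :, :]
    # Quadratic form diff @ cov_inv @ diff for every pair (i, j)
    q = np.einsum('ijk,kl,ijl->ij', diff, cov_inv, diff)
    return np.sqrt(np.maximum(q, 0.0))  # clip tiny negative rounding errors
```

Note that for features that are already decorrelated and standardized, this reduces to the Euclidean distance on the standardized vectors, so it is a strict generalization of the original approach.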