I would like to cluster 400 car rental demand time series (small positive valued) based on the following 7 statistical features:
entropy, number of mean crossings, 95th percentile, root mean square, variance, kurtosis, max value
These features were shortlisted from around 20 statistical features after removing the highly correlated statistical features.
What is the correct distance metric to compare two of these statistical feature vectors? Currently, I standardize each feature (mean=0, var=1) so that they are comparable. Then I use the Euclidean distance to compute distances between the standardized feature vectors.
Is this a reasonable approach?
Are there better distance metrics for these statistical features?
When it comes to clustering, the choice of a good (or the best) distance function is seldom driven by mathematical considerations:
Depending on the use case and the goal of the clustering, different criteria might be used to define when two samples are close to each other. In the case of time series, it might matter most that they follow the same trends (going up and down at the same times), or that their values are close to each other, or a delay might not matter as long as the shape is similar, ...
Without this domain knowledge, it is hard to distinguish a good from a bad distance.
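To illustrate how much the notion of "close" depends on the goal, here is a minimal sketch comparing two distances on made-up series. The series, the function names, and the choice of correlation distance as the "same trend" measure are all assumptions for illustration, not something taken from the question.

```python
import numpy as np

def euclidean_distance(a, b):
    # Small when the values themselves are close.
    return float(np.linalg.norm(a - b))

def correlation_distance(a, b):
    # 1 - Pearson correlation: small when the series move up and down together.
    return float(1.0 - np.corrcoef(a, b)[0, 1])

# Made-up hourly demand curves (illustrative only).
t = np.arange(24)
base = 5 + 3 * np.sin(2 * np.pi * t / 24)
shifted_level = base + 10                            # same shape, higher level
opposite_trend = 5 - 3 * np.sin(2 * np.pi * t / 24)  # same level, inverted shape

for name, other in [("shifted_level", shifted_level),
                    ("opposite_trend", opposite_trend)]:
    print(name,
          "euclidean:", round(euclidean_distance(base, other), 2),
          "correlation:", round(correlation_distance(base, other), 2))
```

With these series, the Euclidean distance ranks the inverted curve as the closer one, while the correlation distance ranks the level-shifted curve as the closer one. Neither ranking is wrong; which one is useful depends on what the clusters are meant to capture.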
But: If you have decided that your 7 features are the important factors and that they are equally important for defining the distance, your approach is a good start.
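For reference, the approach described in the question (standardize each feature, then take Euclidean distances) could look roughly like the sketch below. The feature matrix here is random placeholder data standing in for the real 400 x 7 features, and the helper names `standardize` and `pairwise_euclidean` are hypothetical.

```python
import numpy as np

# Placeholder for the real data: 400 series x 7 statistical features.
rng = np.random.default_rng(0)
features = rng.gamma(shape=2.0, scale=1.5, size=(400, 7))

def standardize(x):
    # Scale every feature (column) to mean 0 and variance 1.
    mean = x.mean(axis=0)
    std = x.std(axis=0)
    std[std == 0] = 1.0  # leave constant features unscaled to avoid dividing by zero
    return (x - mean) / std

def pairwise_euclidean(z):
    # Full (n x n) matrix of Euclidean distances between rows of z.
    diff = z[:, None, :] - z[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

z = standardize(features)
dist = pairwise_euclidean(z)
print(dist.shape)
```

Because every feature is scaled to unit variance, each of the 7 features contributes with equal weight to the distance, which is exactly the "equally important" assumption mentioned above.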
Improvements: