Generation of unique normalized and weight value from a set with different number of elements

I am doing research in data mining and my math skills are weak, so I need help from people who understand this subject better than I do.

To derive a formula for my research, I need to define some steps. For this, I need to take a group of elements, specifically three, and return a single normalized value.

For example:

f1 = (105, 312, 75). These elements can range from zero to values greater than one million.

I tried using fnorm = (X - min(X)) / (max(X) - min(X)) and then took the arithmetic mean over the group of elements, but the result does not represent what I want.

Comparing two sets of these elements, for example:

f1 = (105, 312, 75)

f2 = (22, 23, 1)

When I apply fnorm and take the arithmetic mean of the values, I get f1 = 0.3755 and f2 = 0.65.

These results do not work for me: even though they are normalized, the elements of f1 are greater than those of f2, so I need f1 to be more representative and carry greater weight than f2. Another point is that a value of 0 within the group must not zero out the whole result, because the other, non-zero elements are still relevant to the outcome.

I hope I explained this clearly, but if anything is confusing, please let me know. Thanks.

There are 1 best solutions below

You won't get your desired result if you normalize a vector with respect to itself. Consider, for example, the vectors \begin{align} f_1&= \begin{bmatrix} a&b&c \end{bmatrix},\\ f_2&= \begin{bmatrix} a+100&b+100&c+100 \end{bmatrix}. \end{align} Min-max normalization is unchanged when the same constant is added to every element, so these two vectors get the same score, whereas you would want the second to score higher.
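The point can be checked numerically. The following is a minimal sketch, assuming the asker's procedure is "min-max normalize the vector against itself, then average"; the function name `minmax_mean` is made up for illustration.

```python
def minmax_mean(x):
    """Min-max normalize a vector against itself, then average the result."""
    lo, hi = min(x), max(x)
    if hi == lo:
        # All elements equal: the min-max formula would divide by zero.
        raise ValueError("min-max normalization is undefined for a constant vector")
    return sum((v - lo) / (hi - lo) for v in x) / len(x)


f1 = (105, 312, 75)
f2 = (22, 23, 1)
shifted = tuple(v + 100 for v in f1)  # f1 with 100 added to every element

print(round(minmax_mean(f1), 4))  # 0.3755
print(round(minmax_mean(f2), 4))  # 0.6515
# The shifted vector scores the same as f1, even though its values are larger.
print(abs(minmax_mean(f1) - minmax_mean(shifted)) < 1e-12)  # True
```

The score depends only on how the elements are spread between the vector's own minimum and maximum, not on how large they are.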

The way to correct this is to normalize with respect to the vector with the highest possible values. For any vector $\mathbf x = [x_1,x_2,x_3]$, define the average as $$\text{avg}(\mathbf x) = \frac13\sum\limits_{k=1}^3 x_k, $$

i.e. simply the arithmetic mean. This makes sense, as you say, when each element in the vector has the same weight. Otherwise use a weighted mean.
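If the elements should not count equally, a weighted mean could look like the sketch below. The helper `weighted_avg` and the weights `(2, 1, 1)` are purely illustrative assumptions; the question does not specify any weights.

```python
def weighted_avg(x, w):
    """Weighted mean of x with non-negative weights w (same length as x)."""
    if len(x) != len(w):
        raise ValueError("x and w must have the same length")
    total = sum(w)
    if total <= 0:
        raise ValueError("weights must sum to a positive number")
    return sum(wi * xi for wi, xi in zip(w, x)) / total


f1 = (105, 312, 75)
print(weighted_avg(f1, (1, 1, 1)))  # 164.0, the plain arithmetic mean
print(weighted_avg(f1, (2, 1, 1)))  # 149.25, first element counts double
```

With equal weights this reduces to $\text{avg}(\mathbf x)$ as defined above.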

Consider now a dataset of many vectors $\mathbf x^{(i)}$, where the superscript refers to the vector number $i$, e.g. $\mathbf x^{(5)}$ is the fifth vector in your database. Then you have to normalize with respect to the one with the highest average, i.e. you first want to find $$M = \max\limits_i\big\{\text{avg}\big(\mathbf x^{(i)}\big)\big\}.$$

Finding the maximum is not difficult and does not require solving an optimization problem. It boils down to calculating $\text{avg}(\mathbf x^{(i)})$ for all $i$ and choosing the largest.

Now we can calculate the normalized value of each vector. It is given by $$\|\mathbf x^{(i)}\| = \frac{\text{avg}\big(\mathbf x^{(i)}\big)}{M}.$$
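Here is a small sketch of this procedure applied to the two vectors from the question. The function names `avg` and `normalized_scores` are chosen for illustration, and the sketch assumes all elements are non-negative, as stated in the question.

```python
def avg(x):
    """Arithmetic mean of a vector."""
    return sum(x) / len(x)


def normalized_scores(vectors):
    """Divide each vector's mean by M, the largest mean in the dataset."""
    means = [avg(x) for x in vectors]
    M = max(means)
    if M == 0:
        raise ValueError("all vectors are zero; there is nothing to normalize by")
    return [m / M for m in means]


data = [(105, 312, 75), (22, 23, 1)]  # f1 and f2 from the question
scores = normalized_scores(data)
print(scores[0])            # 1.0, f1 has the largest mean (164)
print(round(scores[1], 4))  # 0.0935, f2 has mean 46/3
```

Unlike the within-vector min-max approach, f1 now scores higher than f2. A zero element lowers a vector's mean but does not force its score to zero.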

Remarks. In this case you will always have the value of $1$ for the vector with the largest mean.

Furthermore, this method may require you to recalculate the normalized value of every vector when you get new data, namely whenever a new vector has a larger mean than the previous largest. This can be avoided if you know a priori a maximum possible value of the arithmetic mean, in which case you use that value instead of $M$ (and then only the maximum possible vector would have the normalized value of $1$).
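The fixed-bound variant might be sketched as follows. The bound `MAX_MEAN = 2_000_000` is a hypothetical value picked only for this example; the question says elements can exceed one million but gives no actual upper limit.

```python
MAX_MEAN = 2_000_000  # hypothetical a priori upper bound on the arithmetic mean


def score_with_bound(x, bound=MAX_MEAN):
    """Normalize a vector's mean by a fixed, known-in-advance bound."""
    mean = sum(x) / len(x)
    if mean > bound:
        raise ValueError("mean exceeds the assumed bound; the bound is too small")
    return mean / bound


print(score_with_bound((0, 1_200_000, 1_800_000)))  # 0.5, the zero does not zero the score
print(score_with_bound((105, 312, 75)) > score_with_bound((22, 23, 1)))  # True
```

Because the bound never changes, existing scores stay valid when new vectors arrive, at the cost of small vectors receiving very small scores.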