Normalizing traits/statistics based on trait count for each trait type


I am building an app that calculates rarity scores for traits. The rarity score is defined as:

Rarity score = 1 / (chance of occurrence)

Let's say I have a trait that has a 10% chance of occurring.

The rarity score for this trait will be:

1 / (10%) = 1 / 0.10 = 10

This is the score without trait normalization.
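As a minimal sketch of the un-normalized score, assuming the chance is passed as a fraction between 0 and 1 (the function name `rarity_score` is only illustrative):

```python
def rarity_score(chance: float) -> float:
    """Un-normalized rarity score: 1 / chance, with chance as a fraction in (0, 1]."""
    if not 0 < chance <= 1:
        raise ValueError("chance must be a fraction in (0, 1]")
    return 1 / chance


# A trait with a 10% chance of occurring.
print(rarity_score(0.10))
```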

What I am trying to find out is how the process of trait normalization (or rarity normalization) is done.

From my research, normalization takes into account the number of traits in a specific trait type.

Let's say we have two trait types:

Trait_Type: Hair-Color
Value: Green, Chance: 1%, Score: 100
Value: Blue, Chance: 99%, Score: ~1

Trait_Type: Shirt-Color
100 traits all having 1% chance of occurrence.

With the rarity formula above, every shirt color gets a score of 100, the same as the green hair color.
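The problem can be reproduced with a short sketch. The shirt value names (`Shirt-1` to `Shirt-100`) are placeholders I made up for illustration; only the chances come from the example above.

```python
import math

# Chances as fractions; the shirt value names are hypothetical placeholders.
hair_color = {"Green": 0.01, "Blue": 0.99}
shirt_color = {f"Shirt-{i}": 0.01 for i in range(1, 101)}


def vanilla_scores(traits: dict) -> dict:
    """Map each trait value to its un-normalized score, 1 / chance."""
    return {value: 1 / chance for value, chance in traits.items()}


hair_scores = vanilla_scores(hair_color)
shirt_scores = vanilla_scores(shirt_color)

# Every shirt color ends up with the same score as the rare green hair.
same_as_green = all(
    math.isclose(score, hair_scores["Green"]) for score in shirt_scores.values()
)
print(same_as_green)
```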

This is misleading: a trait type with 100 traits (or many traits in general) will naturally have a low percentage for each trait, which gives every one of them a high score.

In reality, no single shirt color is worth much, because they all have the same 1% chance of occurring.

The green hair color, on the other hand, really is valuable.

My goal is to capture this difference by taking the trait count of each trait_type into account, so that when the traits are scored, green hair ranks far higher than any shirt color.

The information I know is:

The chance of a trait occurring.
Its rarity score.
All the data about trait counts (number of trait types, number of traits in each trait type, etc.).

The farthest I got is:

Vanilla_score = 1 / (%Chance of trait happening)
Normalized_score = (Vanilla_score * Avg number of traits per trait_type) / traits in category

This does not produce an accurate enough score.

Take a trait_type called Flair:

Value: hijab
Avg Trait_count per category: 13.1875
Trait_category_count: 16
Trait_count_for_flair_category: 40

The trait has a roughly 0.44% chance of occurring. The vanilla formula gives it a score of 243.87.

With my method, the normalized score is 80.4.
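For reference, here is my attempted formula as a small sketch, using the Flair numbers above (the function and parameter names are just mine for illustration). This is only the attempt described in this question, not the formula the target site uses.

```python
def normalized_score(vanilla: float, avg_traits_per_type: float, traits_in_type: int) -> float:
    """Attempted normalization: scale the vanilla score by (average trait count / this type's trait count)."""
    return vanilla * avg_traits_per_type / traits_in_type


# Flair / hijab: vanilla score 243.87, average of 13.1875 traits per type, 40 traits in Flair.
score = normalized_score(243.87, 13.1875, 40)
print(round(score, 1))  # 80.4
```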

On the site I want to replicate, the score is 35.87.

What other calculations could take the number of traits per trait_category into account?

(If any data is missing, let me know and I will add it.)

Reference links:

Trait Normalization (at the end of website)

Explanation about current used formula