I'm trying to understand how various aspects of a movie contribute to its gross revenue. I want to rank a movie's attributes by how strongly they determine revenue, so that the attributes with the strongest influence are ranked highest.
Let $A_1,\ldots,A_n$ be a list of attributes of a movie and let the possible values of $A_i$ be $a_{i1},a_{i2},\ldots$. Many of these attributes (like primary genre) are categorical and some of them (like rating) are continuous.
Approach 1: Consider $A_1$: I can form groups of movies having the same value of $A_1$, e.g. all movies with $A_1=a_{12}$ form a group. The other attributes in a group can vary freely. I can then calculate the mean of the revenues of all movies within a group, and then take the variance of means of all groups.
This gives me the "variation in average revenue as we change $A_1$ values". If this variation is high, changing $A_1$ substantially changes the average revenue, so $A_1$ should be a highly ranked attribute.
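To make Approach 1 concrete, here is a minimal sketch of what I have in mind, in plain Python so that nothing beyond the standard library is assumed. The function name `variance_of_group_means`, the attribute names `genre` and `rating_band`, and the toy revenues are all made up for illustration; they are not real data.

```python
from collections import defaultdict
from statistics import mean, pvariance


def variance_of_group_means(movies, attribute, target="revenue"):
    """Group movies by `attribute`, average `target` per group,
    and return the (population) variance of those group means."""
    groups = defaultdict(list)
    for movie in movies:
        groups[movie[attribute]].append(movie[target])
    group_means = [mean(values) for values in groups.values()]
    return pvariance(group_means)


# Hypothetical toy data: two categorical attributes and a revenue figure.
movies = [
    {"genre": "action", "rating_band": "high", "revenue": 100.0},
    {"genre": "action", "rating_band": "low", "revenue": 80.0},
    {"genre": "drama", "rating_band": "high", "revenue": 30.0},
    {"genre": "drama", "rating_band": "low", "revenue": 10.0},
]

# Score every attribute, then rank from highest to lowest score.
scores = {a: variance_of_group_means(movies, a) for a in ["genre", "rating_band"]}
ranking = sorted(scores, key=scores.get, reverse=True)
print(scores, ranking)
```

Note that this unweighted version gives a group of 2 movies the same weight as a group of 2,000, which is part of why I ask about minimum group size below.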
Approach 2: Again consider $A_1$: fix the values of all other attributes $A_2,\ldots,A_n$ and look at movies that share the same values of $A_2,\ldots,A_n$ but differ in $A_1$. Compute the variance of the revenues of such movies and call it the "$A_1$ variance". The attributes $A_i$ with the highest "$A_i$ variance" are ranked higher.
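A rough sketch of Approach 2, under some assumptions of mine: I have not decided how to combine the variances from different combinations of $A_2,\ldots,A_n$, so the code simply averages them, and it skips any combination in which $A_1$ takes only one value. The names `conditional_variance`, `genre` and `rating_band` and the toy revenues are hypothetical.

```python
from collections import defaultdict
from statistics import mean, pvariance


def conditional_variance(movies, attribute, attributes, target="revenue"):
    """Average within-stratum variance of `target`, where a stratum is the
    set of movies that agree on every attribute except `attribute`."""
    others = [a for a in attributes if a != attribute]
    strata = defaultdict(list)
    for movie in movies:
        key = tuple(movie[a] for a in others)
        strata[key].append(movie)

    variances = []
    for members in strata.values():
        # A stratum is only informative if `attribute` actually varies in it.
        if len({m[attribute] for m in members}) < 2:
            continue
        variances.append(pvariance([m[target] for m in members]))

    # Assumption: combine strata with a simple unweighted average.
    return mean(variances) if variances else None


# Hypothetical toy data; rating_band stands in for a binned continuous rating.
movies = [
    {"genre": "action", "rating_band": "high", "revenue": 100.0},
    {"genre": "action", "rating_band": "low", "revenue": 80.0},
    {"genre": "drama", "rating_band": "high", "revenue": 30.0},
    {"genre": "drama", "rating_band": "low", "revenue": 10.0},
]

attributes = ["genre", "rating_band"]
a_variance = {a: conditional_variance(movies, a, attributes) for a in attributes}
ranking = sorted(a_variance, key=a_variance.get, reverse=True)
print(a_variance, ranking)
```

I use a binned `rating_band` here because exact matching on a continuous value would leave almost every stratum with a single movie.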
Approach 3: Train some ML model (not sure which one) with revenue as target variable and attributes as features. Then look at feature importances to get attribute importances.
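For Approach 3, one possible version is sketched below. The choice of a random forest and of permutation importance is only an assumption on my part, not something I have settled on, and the data is synthetic: `genre_code`, `rating` and `noise_attr` are hypothetical features, with revenue constructed to depend strongly on the first, weakly on the second and not at all on the third. Coding a categorical genre as an integer is also a simplification.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.inspection import permutation_importance
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
n = 400

# Synthetic stand-in for real movie data (all values are made up).
genre_code = rng.integers(0, 3, size=n)   # categorical attribute, integer-coded
rating = rng.uniform(1, 10, size=n)       # continuous attribute
noise_attr = rng.normal(size=n)           # attribute unrelated to revenue
revenue = 50.0 * genre_code + 2.0 * rating + rng.normal(scale=1.0, size=n)

feature_names = ["genre_code", "rating", "noise_attr"]
X = np.column_stack([genre_code, rating, noise_attr])

# Hold out data so importances are measured on movies the model has not seen.
X_train, X_test, y_train, y_test = train_test_split(X, revenue, random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_train, y_train)

# Permutation importance: drop in score when one feature's values are shuffled.
result = permutation_importance(model, X_test, y_test, n_repeats=5, random_state=0)
order = np.argsort(result.importances_mean)[::-1]
ranking = [feature_names[i] for i in order]
print(ranking)
```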
A few queries:
- What assumptions do I need to check for approach 1? e.g. minimum size of a group, distribution of other attribute values within a group, etc.
- Are there any potential flaws or gotchas in approaches 1 and 2?
- Which of the three approaches would you prefer?