I'm hoping to get some feedback on the most appropriate way to approach this. I have a df that contains revenue data and various related variables, and I want to determine which of those variables predict revenue. The variables are a mix of binary and non-binary, though.
I'll show an example df below and talk through my thinking:
import pandas as pd
d = ({
'Date' : ['01/01/18','01/01/18','01/01/18','01/01/18','02/01/18','02/01/18','02/01/18','02/01/18'],
'Country' : ['US','US','US','MX','US','US','MX','MX'],
'State' : ['CA','AZ','FL','BC','CA','CA','BC','BC'],
'Town' : ['LA','PO','MI','TJ','LA','SF','EN','TJ'],
'Occurences' : [1,5,3,4,2,5,10,2],
'Time Started' : ['12:03:00 PM','02:17:00 AM','13:20:00 PM','01:25:00 AM','08:30:00 AM','12:31:00 AM','08:35:00 AM','02:45:00 AM'],
'Medium' : [1,2,1,2,1,1,1,2],
'Revenue' : [100000,40000,500000,8000,10000,300000,80000,1000],
})
df = pd.DataFrame(data=d)
Out:
       Date  Country  State  Town  Occurences  Time Started  Medium  Revenue
0  01/01/18       US     CA    LA           1   12:03:00 PM       1   100000
1  01/01/18       US     AZ    PO           5   02:17:00 AM       2    40000
2  01/01/18       US     FL    MI           3   13:20:00 PM       1   500000
3  01/01/18       MX     BC    TJ           4   01:25:00 AM       2     8000
4  02/01/18       US     CA    LA           2   08:30:00 AM       1    10000
5  02/01/18       US     CA    SF           5   12:31:00 AM       1   300000
6  02/01/18       MX     BC    EN          10   08:35:00 AM       1    80000
7  02/01/18       MX     BC    TJ           2   02:45:00 AM       2     1000
The specific variables I expect to influence revenue are Medium, Time Started, and Occurrences. I also have location groupings that could be used: Country, State, and Town.
Would a multiple linear regression be appropriate here? Should I standardise the independent variables somehow? Medium will always be either 1 or 2, but should I group Time Started and Occurrences into bins? Times will fall within a 20-hour window (8 AM to 4 AM), while occurrences will range from 1 to 10. Should these variables be converted to dummy variables?
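To make the question concrete, here is a minimal sketch of the kind of encoding I have in mind, not something I'm claiming is the right approach. The column names `Hours Since 8AM`, `Medium_2`, and `Time Block`, and the bin edges, are my own placeholders. I've written the sample value '13:20:00 PM' as '01:20:00 PM' so it parses as a 12-hour time, and I've used NumPy's least-squares solver as a stand-in for a proper regression library. With only 8 rows the coefficients mean nothing; it just shows the shape of the design matrix.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Occurences': [1, 5, 3, 4, 2, 5, 10, 2],
    # Assumption: '13:20:00 PM' from the sample is entered as '01:20:00 PM'
    'Time Started': ['12:03:00 PM', '02:17:00 AM', '01:20:00 PM', '01:25:00 AM',
                     '08:30:00 AM', '12:31:00 AM', '08:35:00 AM', '02:45:00 AM'],
    'Medium': [1, 2, 1, 2, 1, 1, 1, 2],
    'Revenue': [100000, 40000, 500000, 8000, 10000, 300000, 80000, 1000],
})

# Option A: keep time continuous, as hours elapsed since 08:00
# (the modulo wraps times after midnight into the 8 AM - 4 AM window)
t = pd.to_datetime(df['Time Started'], format='%I:%M:%S %p')
df['Hours Since 8AM'] = ((t.dt.hour - 8) % 24) + t.dt.minute / 60

# Option B: group time into blocks and dummy-code them (hypothetical bin edges)
df['Time Block'] = pd.cut(df['Hours Since 8AM'], bins=[0, 8, 16, 20],
                          labels=['day', 'evening', 'late'], include_lowest=True)
time_dummies = pd.get_dummies(df['Time Block'], prefix='Time', drop_first=True)

# Medium only takes the values 1 or 2, so a single 0/1 dummy is enough
df['Medium_2'] = (df['Medium'] == 2).astype(int)

# Standardise the continuous predictors (z-scores)
num_cols = ['Occurences', 'Hours Since 8AM']
X_num = (df[num_cols] - df[num_cols].mean()) / df[num_cols].std()

# Design matrix for option A: intercept, standardised numerics, Medium dummy
X = np.column_stack([np.ones(len(df)), X_num.to_numpy(), df['Medium_2'].to_numpy()])
y = df['Revenue'].to_numpy(dtype=float)

# Ordinary least squares fit
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
coefs = pd.Series(coef, index=['Intercept', *num_cols, 'Medium_2'])
print(coefs)
```

Option B would swap `Hours Since 8AM` for the columns in `time_dummies`. My question is essentially which of these two treatments (continuous and standardised, or binned and dummy-coded) makes more sense for Time Started and Occurrences.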