Multiple regression with binary and non-binary variables


I'm hoping to get some feedback on the most appropriate method for this problem. I have a df that contains revenue data and several related variables, and I want to determine which of those variables predict revenue. The variables are a mix of binary and non-binary, though.

I'll display an example df below and talk through my thinking:

import pandas as pd

d = {
    'Date': ['01/01/18', '01/01/18', '01/01/18', '01/01/18', '02/01/18', '02/01/18', '02/01/18', '02/01/18'],
    'Country': ['US', 'US', 'US', 'MX', 'US', 'US', 'MX', 'MX'],
    'State': ['CA', 'AZ', 'FL', 'BC', 'CA', 'CA', 'BC', 'BC'],
    'Town': ['LA', 'PO', 'MI', 'TJ', 'LA', 'SF', 'EN', 'TJ'],
    'Occurences': [1, 5, 3, 4, 2, 5, 10, 2],
    'Time Started': ['12:03:00 PM', '02:17:00 AM', '13:20:00 PM', '01:25:00 AM', '08:30:00 AM', '12:31:00 AM', '08:35:00 AM', '02:45:00 AM'],
    'Medium': [1, 2, 1, 2, 1, 1, 1, 2],
    'Revenue': [100000, 40000, 500000, 8000, 10000, 300000, 80000, 1000],
}

df = pd.DataFrame(data=d)

Out:

       Date Country State Town  Occurences Time Started  Medium  Revenue
0  01/01/18      US    CA   LA           1  12:03:00 PM       1   100000
1  01/01/18      US    AZ   PO           5  02:17:00 AM       2    40000
2  01/01/18      US    FL   MI           3  13:20:00 PM       1   500000
3  01/01/18      MX    BC   TJ           4  01:25:00 AM       2     8000
4  02/01/18      US    CA   LA           2  08:30:00 AM       1    10000
5  02/01/18      US    CA   SF           5  12:31:00 AM       1   300000
6  02/01/18      MX    BC   EN          10  08:35:00 AM       1    80000
7  02/01/18      MX    BC   TJ           2  02:45:00 AM       2     1000

The variables I expect to influence revenue are Medium, Time Started, and Occurrences. I also have location groupings that could be used, such as Country, State, and Town.
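To make the binary and grouping variables concrete, here is a minimal sketch of how I imagine turning them into 0/1 columns with `pd.get_dummies`. It only uses the Country and Medium values from the example above, and dropping the first level of each variable is my own assumption about how the baseline category should be handled, not something I'm sure is right.

```python
import pandas as pd

# Two of the categorical columns from the example df above.
df = pd.DataFrame({
    'Country': ['US', 'US', 'US', 'MX', 'US', 'US', 'MX', 'MX'],
    'Medium': [1, 2, 1, 2, 1, 1, 1, 2],
})

# One 0/1 column per category level. drop_first=True removes one level per
# variable (the baseline) so the dummies are not collinear with an intercept.
dummies = pd.get_dummies(df, columns=['Country', 'Medium'],
                         drop_first=True, dtype=int)
print(dummies)
```

With this encoding, MX and Medium 1 become the baselines, and the remaining columns are `Country_US` and `Medium_2`. State and Town could be encoded the same way, although with this few rows that would create more columns than the data can support.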

Would a multiple linear regression be appropriate here? Should I standardise the independent variables somehow? Medium will always be either 1 or 2, so it is effectively binary. But should I group Time Started and Occurrences into bins? Times will fall within a 20-hour window (8 AM to 4 AM), while occurrences will fall between 1 and 10. Should these variables be converted to dummy variables?
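For reference, this is a rough sketch of the alternative I'm weighing against binning: keep Time Started and Occurences numeric, standardise them, code Medium as a single 0/1 dummy, and fit ordinary least squares. The helper `hours_since_8am` and the column names `HoursSince8AM` and `Medium_2` are names I made up for illustration. The sketch also assumes the odd value '13:20:00 PM' means 1:20 PM, and it uses `np.linalg.lstsq` only to keep the example dependency-light.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'Occurences': [1, 5, 3, 4, 2, 5, 10, 2],
    'Time Started': ['12:03:00 PM', '02:17:00 AM', '13:20:00 PM', '01:25:00 AM',
                     '08:30:00 AM', '12:31:00 AM', '08:35:00 AM', '02:45:00 AM'],
    'Medium': [1, 2, 1, 2, 1, 1, 1, 2],
    'Revenue': [100000, 40000, 500000, 8000, 10000, 300000, 80000, 1000],
})

def hours_since_8am(t):
    """Hypothetical helper: hours elapsed since 8 AM, so the 8 AM - 4 AM
    window maps onto 0-20 without a jump at midnight."""
    clock, meridiem = t.split()
    h, m, s = (int(x) for x in clock.split(':'))
    # Assumption: an hour of 13 with 'PM' is read as 1 PM.
    h = h % 12 + (12 if meridiem == 'PM' else 0)
    return (h + m / 60 + s / 3600 - 8) % 24

df['HoursSince8AM'] = df['Time Started'].map(hours_since_8am)
df['Medium_2'] = (df['Medium'] == 2).astype(int)  # 1 if Medium is 2, else 0

# Standardise (z-score) the continuous predictors; the dummy stays 0/1.
cont = ['Occurences', 'HoursSince8AM']
Z = (df[cont] - df[cont].mean()) / df[cont].std()

# Design matrix: intercept, standardised predictors, Medium dummy.
X = np.column_stack([np.ones(len(df)), Z.to_numpy(), df['Medium_2'].to_numpy()])
y = df['Revenue'].to_numpy(dtype=float)

# Ordinary least squares fit.
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
names = ['Intercept'] + cont + ['Medium_2']
print(pd.Series(coef, index=names))
```

Standardising here only changes the scale of the coefficients (revenue change per standard deviation rather than per unit), not the fit itself. With eight rows the estimates are obviously not meaningful; the sketch is only meant to show the encoding I'm asking about.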