Can I consider shooting% as an independent variable?


First time poster in the math section (a few posts in the stats section), and I am looking for clarification on a query about a variable. I enjoy sports, and where I can (I don't have much background knowledge) I like to put a mathematical answer to a problem so that I can form an informed opinion on an event.

My current interest is to try and apply linear weights through regression analysis to goals scored in the English Premiership (and to then be able to apply the weights to making projections based on current data using Monte Carlo style analysis).

I have pulled three complete years of data from ESPN (to maintain consistency in the data source) and then broken them down. I am treating goals scored as the dependent variable and was treating all the other data collated as independent (e.g. shots, shots on goal, etc.). After breaking down the data, I found that I could get my best $R^2$ value (in excess of 0.9) by adding a shooting percentage variable (i.e. goals scored / shots on goal * 100, to get say 36 instead of 0.36).

My question is: can I treat shooting% as an independent variable, given that it is essentially a function of goals scored and shots on goal? Ideally I would like to treat it as independent, as it gives me a cleaner result based on the data that I can obtain, and I do feel it reflects the accuracy of the person scoring, but my gut feeling is that it is dependent on the other two variables. I would be grateful for an opinion on this so that I can go away and try to obtain more data if my logic isn't suitable.

Many thanks,


There are 2 best solutions below

1
On

Why don't you make a scatterplot of two different variables at a time and see if there's any correlation? You have several variables:

  • goals scored
  • shooting percentage
  • shots
  • shots on goal

and you might consider these variables conditional on other events in the game or characteristics of players.

Each scatter plot will show one pair of variables, and it will then be your job to decide if there's a relationship (linear, quadratic, etc.)
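As a rough numerical companion to eyeballing the plots, here is a minimal sketch that computes the Pearson correlation for every pair of the variables listed above. The figures are made up purely for illustration (they are not real ESPN data), and the variable names and the `pearson` helper are my own. Bear in mind that a correlation coefficient only picks up linear relationships, so it does not replace looking at the plots.

```python
from itertools import combinations
from math import sqrt

# Made-up per-team season totals, purely for illustration (not real ESPN data).
data = {
    "goals": [68, 55, 45, 72, 38, 50],
    "shots": [520, 470, 430, 560, 390, 450],
    "shots_on_goal": [190, 160, 140, 205, 120, 150],
}
# Shooting percentage as defined in the question: goals / shots on goal * 100.
data["shooting_pct"] = [
    100 * g / s for g, s in zip(data["goals"], data["shots_on_goal"])
]


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


# One coefficient per unordered pair of variables.
pairwise = {(a, b): pearson(data[a], data[b]) for a, b in combinations(data, 2)}
for (a, b), r in pairwise.items():
    print(f"{a:>14} vs {b:<14} r = {r:+.3f}")
```

Pairs with a coefficient near +1 or -1 are the ones whose scatter plots should look close to a straight line.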

2
On

What you are calling dependent and independent variables are better referred to as your target and predictor variables. This makes clear what the relationship is supposed to be between them - you use the predictors to predict the target. The words dependent and independent have particular meanings in mathematics (as alluded to in Henning's comment) and it introduces unnecessary confusion to overload them too much.

Now, if you want to make forecasts then clearly you can't use the percentage of shots on target as an indicator, because you don't know the percentage of shots on target until the game is over! You might consider including a particular team's past history of shots on target as a regressor (e.g. their percentage of shots on target in the last twenty games they played) but you can't use the value from the match whose score you are trying to predict.
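To make the "past history" idea concrete, here is a minimal sketch of a lagged regressor: for each match it computes the team's shooting percentage over the preceding matches only, so the value is known before kick-off. I use the question's definition (goals / shots on goal * 100); the function name, the `window` parameter and the match figures are all made up for illustration.

```python
def lagged_shooting_pct(goals, shots_on_goal, window=20):
    """For each match i, shooting % over up to `window` matches before i.

    Match i itself is never included. Returns None where there is no
    usable history (no earlier matches, or no shots on goal in the window).
    """
    feature = []
    for i in range(len(goals)):
        lo = max(0, i - window)
        g = sum(goals[lo:i])          # goals in the earlier matches only
        s = sum(shots_on_goal[lo:i])  # shots on goal in the earlier matches only
        feature.append(100 * g / s if s > 0 else None)
    return feature


# Made-up match-by-match figures for one team, in chronological order.
goals = [2, 0, 1, 3, 1]
shots_on_goal = [5, 4, 2, 6, 3]

# A short window of 3 is used here only because the toy series is short.
history = lagged_shooting_pct(goals, shots_on_goal, window=3)
print(history)
```

The first entry is `None` because there is no earlier match to draw on; in practice you would either drop those rows or fill them from the previous season.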

If you just want an explanatory model rather than a predictive one, then you could use the percentage of shots on target. However, this is a bit dubious, because of the exact relationship that exists between three of the variables:

$$\textrm{Goals Scored} = \textrm{Attempts on Goal} \times \textrm{Percentage of Shots on Target}$$

In some sense, you already understand why there is a relationship between Goals Scored and Percentage of Shots on Target - it's given by this formula! Indeed, if you included an interaction term between Attempts on Goal and Percentage of Shots on Target, then your regression would pick this out as the only significant predictor.
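Here is a small sketch of why the interaction term soaks up everything. I take the percentage to be goals divided by shots on goal, as defined in the question, expressed as a fraction rather than multiplied by 100. The figures are made up and `simple_ols` is my own helper, not a library routine. Because the product of the two predictors reproduces goals scored by construction, a regression of goals on that product fits essentially perfectly.

```python
# Made-up season totals, purely for illustration (not real ESPN data).
shots_on_goal = [190, 160, 140, 205, 120, 150]
goals = [68, 55, 45, 72, 38, 50]

# Shooting percentage as a fraction: goals / shots on goal.
shooting_frac = [g / s for g, s in zip(goals, shots_on_goal)]

# The interaction term is just the product of the two predictors.
product = [s * p for s, p in zip(shots_on_goal, shooting_frac)]


def simple_ols(x, y):
    """Least-squares fit y = intercept + slope * x; returns (slope, intercept, R^2)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((a - mx) ** 2 for a in x)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    intercept = my - slope * mx
    ss_res = sum((b - (intercept + slope * a)) ** 2 for a, b in zip(x, y))
    ss_tot = sum((b - my) ** 2 for b in y)
    return slope, intercept, 1 - ss_res / ss_tot


slope, intercept, r2 = simple_ols(product, goals)
print(f"slope = {slope:.6f}, intercept = {intercept:.6f}, R^2 = {r2:.6f}")
```

The slope comes out at 1, the intercept at 0 and $R^2$ at 1 (up to floating-point rounding), which tells you nothing you did not already build into the variables.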

For these reasons I recommend that you don't include the percentage of shots on target in your model.