I have been trying to formulate a model of soccer matches to help me predict the outcomes. The model I'm trying to formulate involves using Poisson regression to assign attack and defence ratings to different teams.
Let's say that I have a set of results like this:
A v B 2 0
B v C 2 1
A v C 1 1
I'm trying to fit the attack and defence ratings in a vector B such that Y = exp(X*B), where X is a matrix representing the results of the games. Each game contributes two rows: one for the home side's goals and one for the away side's goals.
The vector B is of the form:
B = [A_attack,A_defence,B_attack,B_defence,C_attack,C_defence]
From the above table of results the matrix X must look like this:
[1,0,0,-1,0,0]
[0,-1,1,0,0,0]
[0,0,1,0,0,-1]
[0,0,0,-1,1,0]
[1,0,0,0,0,-1]
[0,-1,0,0,1,0]
Finally, Y holds the goals scored, one entry per row of X (home goals, then away goals, for each match). In this case Y = [2,0,2,1,1,1].
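For reference, here is a minimal sketch of how X and Y could be built from a list of results with NumPy. The `results` tuple format, the `teams` ordering and the helper name `build_design` are assumptions made for illustration, not part of the code I'm actually running.

```python
import numpy as np

# Assumed input format: (home_team, away_team, home_goals, away_goals).
results = [("A", "B", 2, 0), ("B", "C", 2, 1), ("A", "C", 1, 1)]
teams = ["A", "B", "C"]


def build_design(results, teams):
    """Return (X, Y) with two rows per match: home goals, then away goals."""
    index = {team: i for i, team in enumerate(teams)}
    X = np.zeros((2 * len(results), 2 * len(teams)))
    Y = np.zeros(2 * len(results))
    for k, (home, away, home_goals, away_goals) in enumerate(results):
        # Column 2*i is team i's attack, column 2*i + 1 is its defence.
        X[2 * k, 2 * index[home]] = 1.0            # home attack
        X[2 * k, 2 * index[away] + 1] = -1.0       # away defence
        X[2 * k + 1, 2 * index[away]] = 1.0        # away attack
        X[2 * k + 1, 2 * index[home] + 1] = -1.0   # home defence
        Y[2 * k], Y[2 * k + 1] = home_goals, away_goals
    return X, Y


X, Y = build_design(results, teams)
print(X)
print(Y)
```

This reproduces the matrix and goal vector written out above.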
Now I've been using statsmodels, which is a Python package for doing this kind of thing and I'm running into problems.
In case anyone is familiar with statsmodels the calls I'm using are:
import statsmodels.api as sm
res = sm.GLM(Y, X, family=sm.families.Poisson()).fit(method='bfgs')
Here X and Y are a NumPy matrix and array respectively, as defined above.
The code often fails to converge. There are 20 teams in the Premier League, so I need to fit 40 ratings. When the number of rows exceeds ~50, the convergence problems present themselves. For example, I often see a "Floating point exception: 8" message, which I believe means there has been a divide-by-zero error.
When the method does converge, the values are often nonsensical, giving negative expected goals in a game.
What I would like to know is: is my modelling mathematically sound? Is there any way I could tweak the model to make it converge?
The problem here is that your model's parameters cannot be identified. Each row of X*B is one team's attack rating minus another team's defence rating, so adding the same constant to every attack and defence rating leaves every row, and hence the likelihood, unchanged. You can fix this degree of freedom by, e.g., setting the defence rating of team $C$ to zero.
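A quick way to see this numerically is to check the rank of the design matrix and compare predictions before and after a constant shift. This is only a sketch using NumPy; the random ratings and the shift of 3.7 are arbitrary values chosen for illustration.

```python
import numpy as np

# Design matrix from the question; columns are
# [A_attack, A_defence, B_attack, B_defence, C_attack, C_defence].
X = np.array([
    [1, 0, 0, -1, 0, 0],
    [0, -1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, -1],
    [0, 0, 0, -1, 1, 0],
    [1, 0, 0, 0, 0, -1],
    [0, -1, 0, 0, 1, 0],
], dtype=float)

# The rank is lower than the number of columns, so X is rank deficient.
rank = np.linalg.matrix_rank(X)

# Arbitrary ratings, then the same ratings shifted by a constant.
rng = np.random.default_rng(0)
beta = rng.normal(size=X.shape[1])
shifted = beta + 3.7

# Both rating vectors give identical expected goals for every row.
same_predictions = np.allclose(np.exp(X @ beta), np.exp(X @ shifted))
print(rank, X.shape[1], same_predictions)
```

The rank comes out one short of the number of columns, which is exactly the single degree of freedom described above.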
Try to estimate the ratings with a matrix that assumes C_defence = 0, i.e. X with its last column (the C_defence column) removed. You should be able to find the ratings now, as everything is relative to team $C$'s defensive rating.
Note that a better solution may be to impose $L_1$ or $L_2$ regularization on the model parameters. This will also enable parameter identification. Moreover, it is especially useful when modelling football data, which are quite noisy.
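To make both options concrete, here is a rough sketch on the three-match example. It uses a small hand-written Newton solver in NumPy rather than statsmodels, so that the penalty is explicit; the function name `fit_poisson`, the penalty weight `l2=0.1` and the tolerances are assumptions for illustration, not tuned values. The reduced matrix `X[:, :-1]` could equally be passed to the `sm.GLM` call from the question.

```python
import numpy as np

X = np.array([
    [1, 0, 0, -1, 0, 0],
    [0, -1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, -1],
    [0, 0, 0, -1, 1, 0],
    [1, 0, 0, 0, 0, -1],
    [0, -1, 0, 0, 1, 0],
], dtype=float)
Y = np.array([2, 0, 2, 1, 1, 1], dtype=float)


def fit_poisson(X, Y, l2=0.0, n_iter=100, tol=1e-10):
    """Maximise the (optionally L2-penalised) Poisson log-likelihood, log link."""

    def objective(b):
        # Log-likelihood up to the constant -sum(log(y!)), minus the ridge term.
        eta = X @ b
        return Y @ eta - np.exp(eta).sum() - 0.5 * l2 * (b @ b)

    beta = np.zeros(X.shape[1])
    for _ in range(n_iter):
        mu = np.exp(X @ beta)
        grad = X.T @ (Y - mu) - l2 * beta
        # Negative Hessian of the objective (information matrix plus ridge).
        info = X.T @ (X * mu[:, None]) + l2 * np.eye(X.shape[1])
        step = np.linalg.solve(info, grad)
        if np.max(np.abs(step)) < tol:
            break
        # Halve the step while it would make the objective worse.
        t = 1.0
        while objective(beta + t * step) < objective(beta) - 1e-12 and t > 1e-6:
            t *= 0.5
        beta = beta + t * step
    return beta


# Option 1: fix C_defence = 0 by dropping the last column of X.
beta_fixed = np.append(fit_poisson(X[:, :-1], Y), 0.0)

# Option 2: keep all six columns and add a small ridge penalty.
beta_ridge = fit_poisson(X, Y, l2=0.1)

expected_goals = np.exp(X @ beta_fixed)  # positive by construction (log link)
print(np.round(beta_fixed, 3))
print(np.round(beta_ridge, 3))
```

With the ridge penalty, the fitted ratings sum to zero, because the penalty selects the smallest-norm member of each family of shifted solutions. The two options therefore pin down the free constant in different ways.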
Finally, you may want to explicitly introduce an intercept and, if applicable, a home-advantage parameter in your model.
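As a sketch of what that design matrix could look like, the snippet below prepends an intercept column and a home indicator to X. It assumes, as in the question, that rows alternate between home goals and away goals. One thing to watch: once an intercept is present, the attack columns sum to the intercept column and the defence columns sum to its negative, so two reference ratings have to be fixed rather than one. Dropping team C's columns here is an arbitrary choice.

```python
import numpy as np

X = np.array([
    [1, 0, 0, -1, 0, 0],
    [0, -1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, -1],
    [0, 0, 0, -1, 1, 0],
    [1, 0, 0, 0, 0, -1],
    [0, -1, 0, 0, 1, 0],
], dtype=float)

n_rows = X.shape[0]
intercept = np.ones((n_rows, 1))
# Even-numbered rows are the home side's goals, so they get the home indicator.
home = (np.arange(n_rows) % 2 == 0).astype(float).reshape(-1, 1)

# Columns: [intercept, home, A_attack, A_defence, ..., C_attack, C_defence].
X_full = np.hstack([intercept, home, X])

# Use team C as the reference by dropping C_attack and C_defence.
X_ident = X_full[:, :-2]

print(np.linalg.matrix_rank(X_full), X_full.shape[1])
print(np.linalg.matrix_rank(X_ident), X_ident.shape[1])
```

On the three-match example this leaves six parameters for six observations, so it is only meant to show the layout; fitting it makes sense on a full season of results.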