Sunday, September 23, 2012

Football model; plots and usage

After reading data, making a predictions display and building a football data model it is time to put this to validate a bit more (regression plots) and put to usage. It appears that the regression plots in the car package were not very informative for this model, except to find unexpected results. The predictions are not good at all.

Roda JC-FC Utrecht

First of all, when I started this post on Friday, Roda JC played against FC Utrecht, 0-1. FC Utrecht goes sub top. Frankly, I am not very surprised. While I would expect Roda JC to win this game (p=0.46), chance for FC Utrecht to win is 0.33, so that is not so strange but is disappointing for my model.
fbpredict(model3,"Roda JC","FC Utrecht")
$details
Roda JC in rows against FC Utrecht in columns 
  0      1      2      3      4      5      6      7      8      9     
0 0.0241 0.0489 0.0494 0.0334 0.0169 0.0068 0.0023 0.0007 0.0002 0     
1 0.0410 0.0830 0.0841 0.0567 0.0287 0.0116 0.0039 0.0011 0.0003 0.0001
2 0.0349 0.0706 0.0714 0.0482 0.0244 0.0099 0.0033 0.0010 0.0002 0.0001
3 0.0198 0.0400 0.0405 0.0273 0.0138 0.0056 0.0019 0.0005 0.0001 0     
4 0.0084 0.0170 0.0172 0.0116 0.0059 0.0024 0.0008 0.0002 0.0001 0     
5 0.0029 0.0058 0.0058 0.0039 0.0020 0.0008 0.0003 0.0001 0      0     
6 0.0008 0.0016 0.0017 0.0011 0.0006 0.0002 0.0001 0      0      0     
7 0.0002 0.0004 0.0004 0.0003 0.0001 0.0001 0      0      0      0     
8 0      0.0001 0.0001 0.0001 0      0      0      0      0      0     
9 0      0      0      0      0      0      0      0      0      0     

$`summary chances`
   Roda JC      equal FC Utrecht 
 0.4580659  0.2126926  0.3291782 

Diagnostic plots

But, I should have looked at some of the diagnostic plots first. Luckily between the stats and the car package we have a nice collection of tools.
library(car)
infIndexPlot(model3)
A number of things are interesting. Just under index 100 is the most influential observation. We can track this to row 97. This represents Groningen making six ! goals against Feyenoord. That's the best FC Groningen ever did against Feyenoord. It was also found to be a remarkable result at that time, so if this were not influential or outlyingI should be worried.
outlierTest(model3)
No Studentized residuals with Bonferonni p < 0.05
Largest |rstudent|:
   rstudent unadjusted p-value Bonferonni p
97 3.634049         0.00027901      0.17075
StartData[97,]
    OffenseClub DefenseClub Goals OffThuis
97 FC Groningen   Feyenoord     6        1
A second 'feature'in the plots is the higher hat value in the first part of the series. This has to do with the variable OffThuis. Removing OffThuis makes the effect go away. Note that the diagonal of the hat matrix is everywhere the same (within numerical accuracy):
mm <- model.matrix(model3)
table(diag(mm %*% solve(t(mm) %*%  mm) %*% t(mm)))
0.0588235294117647 0.0588235294117648 0.0588235294117649 
               179                427                  6 
What would be different are the weights of the data. In the non-gaussian GLM framework these are dependent on the expected values. The slightly higher expected values when the attacking club plays home make the hat matrix in general a bit higher. Interpreting these plots is clearly more difficult in the non linear regression case.

ResidualPlot

Residual plot does not reveal much more than the previous plot. Record 97 has a relatively high residual. The advantage of this plot is that the outliers are more clearly numbered. In the sub-plot bottom right the bunching of the residuals in lines is because the observed data is non-continuous; from bottom to top these are the values for 0 to 7 goals. No goals gives a negative Pearson residual. No surprise there.   
residualPlots(model3)

Predicting

The teams have been playing in the weekend. The model can be used to predict the outcomes. To start, here are the results. The data in gray represent two teams new in Eredivisie in the season, so I got no predictions.
Roda JC      - Utrecht      0-1
PEC Zwolle   - Groningen    1-2
RKC Waalwijk - VVV          1-1
Vitesse      - Heracles     1-1
NEC          - Willem II    0-0
ADO Den Haag - Ajax         1-1
Twente       - Heerenveen   1-0
NAC Breda    - AZ           2-1
PSV          - Feyenoord    3-0
These are the expectations:
         club1           club2      win1     equal      win2
1      Roda JC      FC Utrecht 0.4580659 0.2126926 0.3291782
2 RKC Waalwijk       VVV-Venlo 0.6076020 0.2180298 0.1743364
3      Vitesse Heracles Almelo 0.5723334 0.2275537 0.2000907
4 ADO Den Haag            Ajax 0.1037534 0.1511710 0.7446496
5    FC Twente   SC Heerenveen 0.6605607 0.1558135 0.1822923
6    NAC Breda              AZ 0.2539698 0.2627759 0.4832506
7          PSV       Feyenoord 0.5055082 0.2147899 0.2796468
Needless to say, I am not impressed. Choosing the most probable outcome as result, I have two correct Twente - Heerenveen and PSV - Feyenoord.
Last week flo2speak commented that (s)he tried to do the same in German football and had 30% correct. I am in the same ballpark. Maybe I should try the ordinal regression too. On top of that; I never predict ties, this weekend had four games tied, and season 2011-2012 had 20% of the games tied. Clearly improvement is needed.

Prediction code

To predict for a weekend I made a wrapper around fbpredict in which a data frame can be used. Not the most efficient code, but for this size of calculation that is not a problem. Promoted teams which are lacking data are also removed. 
topred <- read.table(textConnection("
'Roda JC'        'FC Utrecht'
'PEC Zwolle'     'FC Groningen'
'RKC Waalwijk'   'VVV-Venlo'
'Vitesse'        'Heracles Almelo'
'NEC'            'Willem II'
'ADO Den Haag'   'Ajax'
'FC Twente'      'SC Heerenveen'
'NAC Breda'      'AZ'
'PSV'            'Feyenoord'"
),col.names=c('Off','Def'))

morepred <- function(topred) {
  topred <- topred[topred[,1] %in% levels(model3$data$OffenseClub) & 
                   topred[,2] %in% levels(model3$data$OffenseClub) ,]
  ap <- lapply(1:nrow(topred),function(irow) {
        fbp <- fbpredict(model3,as.character(topred[irow,1]),
                                as.character(topred[irow,2]))
        sec2 <- fbp[[2]]
        mydf <- data.frame(club1=topred[irow,1],
                           club2=topred[irow,2],
                           win1=sec2[1],
                           equal=sec2[2],
                           win2=sec2[3])
      })
  dc <- do.call(rbind,ap)
  rownames(dc) <- 1:nrow(dc)
  dc
}
morepred(topred)



3 comments:

  1. I tried to predict the results using only the data from the present year, and got four correct (FC Utrecht, FC Groningen, FC Twente, PSV). Maybe it is an idea to combine both the data of last year and of the current season and give a weight to the current year of, for example, 2. In this way there is still enough information in the model but it also accounts for more recent results...

    ReplyDelete
  2. Absolutely. I was thinking along such lines. A weight scheme could be devised using 2010-2011 data to predict 2011-2012, thereby separating determination of training weights and predictions.

    ReplyDelete
  3. Hi,
    I got this error. How can I fixed it? Thanks
    Error in UseMethod("morepred", mymodel) :
    no applicable method for 'morepred' applied to an object of class "clm"

    ReplyDelete