Sunday, August 23, 2015

Predicting Titanic deaths on Kaggle IV: random forest revisited

On July 19th I used randomForest to predict the deaths on Titanic in the Kaggle competition. Subsequently I found that both bagging and boosting gave better predictions than randomForest. This I found somewhat unsatisfactory, hence I am now revisiting randomForest. To my disappointment this does not result in predictions as good as bagging and boosting.
Note that all code is at the bottom of the post

Data

Data has not changed very much.

Age

Since ipred package has a nice function for obtaining error using cross-validation, getting better predictions for Age when not in the data is the first adaptation. The model parameters to be optimized are mtry and nodesize. The plot shows that mtry=5 and nodesize=4 should give the best predictions.

Using these settings, the following predicted vs observed ages are obtained. I am not really impressed.

Survival model 1

Model building 1

Having complete data, the next step is using cross-validation to select nodesize and mtry for the survival model. The following predictive capability was observed. Note that the error in these models is a bit larger than observed previously with bagging and boosting. However, observing that, does not suggest a remedy. It was chosen to use nodesize=3 mtry=7.

Evaluation Model 1

There are a number of ways to have randomForest give predictions. One can just ask for the categories, or the probability of a category. At this point I am looking at those probabilities, since I think the model might be improved. For this improvement, I do need to understand what is happening. Using the model, the following out of bag probabilities per category are found (pp[,1] is the probability of category 0). This is not ideal. Ideally most of the probabilities are close to 0 and 1. But here there are quite a number where this is not the case. Especially category 1 is not easily found and quite a few of the category 1 are seen as category 0. Hence the question becomes if it is possible to get better defined categories. As a first step, I will try to optimize the point where the cut is made between the two categories.

The plot below shows the number of correct predictions as function of the cut off point. It shows that the whole center region is a possible cut off, except near 0.4. The value 0.5 is not optimal. 

Examination of cut of point

After making this plot I wondered if this shape would be same for other settings of nodesize and mtry.  Since I have a distinct feeling it is all dependent on the luck of the draw, it is repeated a number of times for each setting. Based on this I have chosen that a cut off of 0.55 is appropriate for a a wider range of settings. The best out of box predictions seem to happen with a higher value for mtry and a low value for nodesize. Thinking back on the density plot, it would seem that high nodesize and low mtry has low probabilities in the center region. However, the price for that is quite some errors in out of bag predictions.

Survival Model 2

Model Building 2

Using the cut off of 0.55, again cross validation to select model parameters mtry and nodesize. Again each setting is tried a few times to get an idea of variability of prediction quality. Based on these settings I have chosen nodesize=6 and mtry=6.

Submission

Your submission scored 0.75. Not really as much as I had hoped for.

Code

Note that the code has been reformatted and cleaned after pasting in the blogging application. This should not have caused any coding errors.
# preparation and data reading section
library(randomForest)
library(lattice)
# has cross validation
library(ipred)


# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 35
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100))
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
tt$A <- factor(grepl('A',tt$Cabin))
tt$B <- factor(grepl('B',tt$Cabin))
tt$C <- factor(grepl('C',tt$Cabin))
tt$D <- factor(grepl('D',tt$Cabin))
tt$E <- factor(grepl('E',tt$Cabin))
tt$F <- factor(grepl('F',tt$Cabin))
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$PC <- factor(grepl('PC',tt$Ticket))
tt$STON <- factor(grepl('STON',tt$Ticket))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)
#end of preparation and data reading

# age section
# get an age without missings
forage <- tt[!is.na(tt$Age) & tt$status=='train',names(tt) %in% 
     c('Age','Sex','Pclass','SibSP',
        'Parch','Fare','Title','Embarked','A','B','C','D','E','F',
        'ncabin','PC','STON','oe')]

totest <- expand.grid(mtry=4:7,nodesize=3:6)

la <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Age ~.,
          mtry=totest$mtry[ii],
          nodesize=totest$nodesize[ii],
          model=randomForest,
          data=forage)
      cc <- c(mtry=totest$mtry[ii],nodesize=totest$nodesize[ii],error=ee$error)
      print(cc)
      cc
    })

sla <- do.call(rbind,la)
sla <- as.data.frame(sla)
xyplot(error ~ mtry, groups= nodesize, data=sla,auto.key=TRUE,type='l')
# chosen 5,4
rfa1 <- randomForest(Age ~ .,
    data=forage,
    ntree=1000,
    mtry=5,
    nodesize=4)
plot(tt$Age,predict(rfa1,tt))
abline(a=0,b=1,col='red')
tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(rfa1,tt[is.na(tt$AGE),])
tt$age <- cut(tt$AGE,c(0,2,5,9,12,15,21,55,65,100))
# end of age section

#final data section
train <- tt[tt$status=='train',]
test <- tt[tt$status=='test',]
#end of final data section

#model selection 1
forSurf <- train[,names(train) %in% 
     c('Survived','age','AGE','Sex','Pclass','SibSP',
        'Parch','Fare','Title','Embarked','A','B','C','D','E','F',
        'ncabin','PC','STON','oe')]

# rfx <- randomForest(Survived ~.,data=forSurf)
totest <- expand.grid(mtry=6:9,nodesize=3:7)

la <- lapply(1:nrow(totest),function(ii) {
      ee <-    errorest(Survived ~.,
          mtry=totest$mtry[ii],
          nodesize=totest$nodesize[ii],
          model=randomForest,
          data=forSurf,
          ntree=1000,
          est.para=control.errorest(k=20)
      )
      cc <- c(mtry=totest$mtry[ii],
         nodesize=totest$nodesize[ii],
         sampsize=totest$sampsize[ii],
         error=ee$error)
      print(cc)
      cc
    })
sla <- do.call(rbind,la)
sla <- as.data.frame(sla)
xyplot(error ~ mtry, groups= nodesize, data=sla,auto.key=TRUE,type='l')
#end of model selection 1

#model evaluation section 1a
rfx <- randomForest(Survived ~.,data=forSurf,nodesize=3,mtry=7,ntree=1000)
pp <- predict(rfx,type='prob')
densityplot(~ pp[,1] | forSurf$Survived,adj=.3)
cuts <- seq(.20,.7,.001) 
plot(y=sapply(cuts,function(cc){
          decide=factor(as.numeric(pp[,1]<cc))
          sum(decide==forSurf$Survived)
        }),
    x=cuts)
#end  of model evaluation section 1a


cuts <- seq(.25,.65,.001) 
# model evaluation 1b
eval2 <- expand.grid(nodesize=seq(4,100,8),mtry=seq(2,8,2),count=1:10)
sach <- lapply( 1:nrow(eval2),function(i) {
      rfx <- randomForest(Survived ~.,
          data=forSurf,
          nodesize=eval2$nodesize[i],
          mtry=eval2$mtry[i],
          ntree=1000)
      pp <- predict(rfx,type='prob')
      nerr=sapply(cuts,function(cc){
            decide=factor(as.numeric(pp[,1]<cc))
            sum(decide==forSurf$Survived)})
      data.frame(
          nerr=nerr,
          cuts=cuts,
          mtry=eval2$mtry[i],
          nodesize=eval2$nodesize[i],
          i=rep(i,length(cuts)))
    })
sach <- do.call(rbind,sach)
xyplot(nerr ~ cuts | nodesize + mtry ,group=i, data=sach,auto.key=FALSE,type='l')
##############
# # chose cuts at .55
##############

#biased prediction
twpred <- function(object,newdata=NULL) {
  preds <- predict(object,newdata,type='prob')
  factor(as.numeric(preds[,1]<0.55),levels=c('0','1'))
}
totest2 <- expand.grid(mtry=seq(2,8,2),nodesize=seq(2,30,4),count=1:10)

la2 <- lapply(1:nrow(totest2),function(ii) {
      ee <-    errorest(Survived ~.,
          mtry=totest2$mtry[ii],
          nodesize=totest2$nodesize[ii],
          model=randomForest,
          data=forSurf,
          ntree=500,
          predict=twpred,
          est.para=control.errorest(k=10)
      )
      cc <- c(mtry=totest2$mtry[ii],
         nodesize=totest2$nodesize[ii],
         i=totest2$count[ii],
         error=ee$error)
      print(cc)
      cc
    })
sla2 <- do.call(rbind,la2)
sla2 <- as.data.frame(sla2)
xyplot(error ~ factor(mtry) | factor(nodesize),
    groups= i, data=sla2,auto.key=FALSE,type='l')
##
#let select mtry=6, nodesize=6
rf2 <-randomForest(Survived ~ .,
    data=forSurf,
    replace=TRUE,
    ntree=2000,
    nodesize=6,
    mtry=6)  

pp <- predict(rf2,test)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='rf.16.aug.csv',
    row.names=FALSE,
    quote=FALSE)

# get a result
# Your submission scored 0.75598

Sunday, August 9, 2015

Predicting Titanic deaths on Kaggle III: Bagging

This is the third post on prediction the deaths. The first one used randomforest, the second boosting (gbm). The aim of the third post was to use bagging. In contrast to the former posts I abandoned dplyr in this post. It gave some now you see now you don't errors.

Data

The data is supposed to be the same as previous.
library(ipred)
library(rpart)
library(lattice)

# read and combine
train <- read.csv('train.csv')
train$status <- 'train'
test  <- read.csv('test.csv')
test$status <- 'test'
test$Survived <- NA
tt <- rbind(test,train)

# generate variables
tt$Pclass <- factor(tt$Pclass)
tt$Survived <- factor(tt$Survived)
tt$age <- tt$Age
tt$age[is.na(tt$age)] <- 35
tt$age <- cut(tt$age,c(0,2,5,9,12,15,21,55,65,100))
tt$Title <- sapply(tt$Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2])
tt$Title <- gsub(' ','',tt$Title)
tt$Title[tt$Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major')] <- 'Mr'
tt$Title[tt$Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona')] <- 'Miss'
tt$Title <- factor(tt$Title)
tt$A <- factor(grepl('A',tt$Cabin))
tt$B <- factor(grepl('B',tt$Cabin))
tt$C <- factor(grepl('C',tt$Cabin))
tt$D <- factor(grepl('D',tt$Cabin))
tt$E <- factor(grepl('E',tt$Cabin))
tt$F <- factor(grepl('F',tt$Cabin))
tt$ncabin <- nchar(as.character(tt$Cabin))
tt$PC <- factor(grepl('PC',tt$Ticket))
tt$STON <- factor(grepl('STON',tt$Ticket))
tt$cn <- as.numeric(gsub('[[:space:][:alpha:]]','',tt$Cabin))
tt$oe <- factor(ifelse(!is.na(tt$cn),tt$cn%%2,-1))
tt$Fare[is.na(tt$Fare)]<- median(tt$Fare,na.rm=TRUE)

Age

The first step is again to predict the missing ages. Even though we have I have all data available in one data.frame, I still think the correct approach is to create the age model using only the training data. Note that I am not too impressed with the age model. Perhaps this should also be optimized.
forage <- tt[!is.na(tt$Age) & tt$status=='train',names(tt) %in% 
   c('Age','Sex','Pclass','SibSP',
   'Parch','Fare','Title','Embarked','A','B','C','D','E','F',
   'ncabin','PC','STON','oe')]

ipbag1 <- bagging(Age ~.,data=forage)
ipbag1
Bagging regression trees with 25 bootstrap replications 

Call: bagging.data.frame(formula = Age ~ ., data = forage)
plot(tt$Age~predict(ipbag1,tt))
tt$AGE <- tt$Age
tt$AGE[is.na(tt$AGE)] <- predict(ipbag1,tt[is.na(tt$AGE),])

Selecting the survival model

ipred, the package in which bagging resides, comes with a nice general purpose cross validation utility. In the end, I decided the two parameters to be optimized are ns; the size of the bags and minsplit: the minimum number of observations that must exist in a node in order for a split to be attempted. Nbagg, the number of bootstrap evaluations, just needs to be big enough. Regarding nbagg, I have the feeling that this particular problem, with relatively few records, it may be needed to have relatively high nbagg in order to have reproducible models.
di1 <- subset(titanic,select=c(
    age,SibSp,Parch,Fare,Sex,Pclass,
        Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived))
dso <- expand.grid(ns=seq(100,300,25),nbagg=c(500),minsplit=1:6)
la <- lapply(1:nrow(dso),function(ii) {
   ee <-    errorest(Survived ~ .,
    ns=dso$ns[ii],
    control=rpart.control(minsplit=dso$minsplit[ii], cp=0, 
       xval=0,maxsurrogate=0),
    nbagg=dso$nbagg[ii],
    model=bagging,
    data=di1,
    est.para=control.errorest(k=20)
    )
    cc <- c(ns=dso$ns[ii],minsplit=dso$minsplit[ii],nbagg=dso$nbagg[ii],error=ee$error)
    print(cc)
    cc
  })
las <- do.call(rbind,la) 
las <- as.data.frame(las)
xyplot(error ~ ns, groups= minsplit, data=las,auto.key=TRUE,type='l')

Predictions

Based on the plot I have chosen for ns=275 and minsplit=5. But, to be honest, in a previous run I had chosen ns=150 and minsplit=2. Obviously from this plot a silly choice. But, given the high variability in this plot between parameters which are relatively similar and the totally different result, I actually think there is relatively much noise in the validation. Thus what is actually seen is that there is relatively little difference between the settings.
Having said that, these new settings got me just over 0.8 in the Kaggle score, while the previous settings were just below.
 bagmod <- bagging(Survived ~.,ns=275,nbagg=500,
    control=rpart.control(minsplit=5, cp=0, xval=0,maxsurrogate=0),
    data=di1)

pp <- predict(bagmod,test)

out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='bag8aug.csv',
    row.names=FALSE,
    quote=FALSE)

Sunday, July 26, 2015

Predicting Titanic deaths on Kaggle II: gbm

Following my previous post I have decided to try and use a different method: generalized boosted regression models (gbm). I have read the background in Elements of Statistical Learning and arthur charpentier's nice post on it. This data is a nice occasion to get my hands dirty.

Data 

Data as before. However, I have added some more variables. In addition, during the analysis it appeared that gbm does not like to have logical variables in the x-variables. One missing value of Fare in the test set gets the median value in order to avoid having missing values in the data. I must say I like using dplyr for this data handing. It allows me to use the same code for training and test with minimum effort.
library(dplyr)
library(gbm)

set.seed(4321)
titanic <- read.csv('train.csv') %>%
    mutate(.,Pclass=factor(Pclass),
        Survived=factor(Survived),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        Title=sapply(Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2]),
        Title=gsub(' ','',Title),
        Title =ifelse(Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major'),'Mr',Title),
        Title =ifelse(Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona'),'Miss',Title),
        Title = factor(Title),
        A=factor(grepl('A',Cabin)),
        B=factor(grepl('B',Cabin)),
        C=factor(grepl('C',Cabin)),
        D=factor(grepl('D',Cabin)),
        E=factor(grepl('E',Cabin)),
        F=factor(grepl('F',Cabin)),
        ncabin=nchar(as.character(Cabin)),
        PC=factor(grepl('PC',Ticket)),
        STON=factor(grepl('STON',Ticket)),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
        train = sample(c(TRUE,FALSE),
            size=891,
            replace=TRUE, 
            prob=c(.9,.1)   ) )

test <- read.csv('test.csv') %>%
    mutate(.,
        Embarked=factor(Embarked,levels=levels(titanic$Embarked)),
        Pclass=factor(Pclass),
  #      Survived=factor(Survived),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        Title=sapply(Name,function(x) strsplit(as.character(x),'[.,]')[[1]][2]),
        Title=gsub(' ','',Title),
        Title =ifelse(Title %in% c('Capt','Col','Don','Sir','Jonkheer','Major'),'Mr',Title),
        Title =ifelse(Title %in% c('Lady','Ms','theCountess','Mlle','Mme','Ms','Dona'),'Miss',Title),
        Title = factor(Title),
        A=factor(grepl('A',Cabin)),
        B=factor(grepl('B',Cabin)),
        C=factor(grepl('C',Cabin)),
        D=factor(grepl('D',Cabin)),
        E=factor(grepl('E',Cabin)),
        F=factor(grepl('F',Cabin)),
        ncabin=nchar(as.character(Cabin)),
        PC=factor(grepl('PC',Ticket)),
        STON=factor(grepl('STON',Ticket)),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1))
)
test$Fare[is.na(test$Fare)]<- median(titanic$Fare)

Age

Age has missing values, thus the first task is to fill those. Since gbm is the method used for the main analysis, I will be used it for age too. This has the added advantage that I can exercise with both a numerical and a categorical variable as response.
One of the things with boosting is that it opens itself to over fitting. Boosting consists of adding trees which are structured to improve fit. At some point the trees will just start boost the noise rather than the structure. Gbm comes with a cross validation (cv) option, which is the preferred way to get the predictive qualities of models, and cv is used to determine the optimum number of trees. But, there is catch, it throws an error if there are variables in the data.frame which are not used in the model. Hence in the code below first the data is selected, and subsequently the model run.
The model parameters, interaction.depth=4, shrinkage=0.0005 come from a bit of experimenting. n.trees has to be high enough that it is clear the optimum number of trees is lower than the number estimated. It seems n.cores=2 works under both windows and linux.
forage <- filter(titanic,!is.na(titanic$Age)) %>%
     select(.,Age,SibSp,Parch,Fare,Sex,Pclass,Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe)

rfa1 <- gbm(Age ~ ., 
    data=forage,
    interaction.depth=4,
    cv.folds=10,
    n.trees=8000,
    shrinkage=0.0005,
    n.cores=2)

gbm.perf(rfa1)
Using cv method...
[1] 6824
It is time here to confess that I have been working on this over several sessions. It appears that when I created the code, 7118 trees were optimum, while I stored that data for a session with 6824 trees. Thus is the way of these methods, unlike traditional statistical methods, they have a different result any time. But, as can be seen from the plot, the difference should be minimal.
titanic$AGE<- titanic$Age
titanic$AGE[is.na(titanic$AGE)] <- predict(rfa1,titanic,n.trees=7118)[is.na(titanic$Age)]
test$AGE<- test$Age
test$AGE[is.na(test$AGE)] <- predict(rfa1,test,n.trees=7118)[is.na(test$Age)]

Survival

During the calculations I learned that the response should be a float containing 0 and 1. With two categories there are various distributions to be used: bernoulli, huberized and adaboost. Using the 10% test data I had set apart, it seemed adaboost worked best for these data. 
gb1 <- filter(titanic,train) %>%
    select(.,age,SibSp,Parch,Fare,Sex,Pclass,
        Title,Embarked,A,B,C,D,E,F,ncabin,PC,STON,oe,AGE,Survived)%>%
      mutate(Survived=c(0,1)[Survived]) # not integer or factor but float
#table(gb1$Survived)
gb1m <-      gbm(Survived ~ .,
         cv.folds=11,
         n.cores=2,
         interaction.depth=5,
         shrinkage = 0.0005,
         distribution='adaboost',
    data=gb1,
    n.trees=10000)
gbm.perf(gb1m)
Using cv method...
[1] 6355

In my code I have used 6000 trees. 
One thing about gbm is that it does not respond with categories. It is a proportion answers for either category.
preds <- predict(gb1m,titanic,
    n.trees=6000,type='response')
density(preds) %>% plot
Thus there is a need or opportunity to determine the cut off point. For this my test set comes in somewhat handy.
preds2<- preds[!titanic$train]
target <- c(0,1)[titanic$Survived[!titanic$train]]
sapply(seq(.3,.7,.01),function(step)
  c(step,sum(ifelse(preds2<step,0,1)!=target)))
     [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9] [,10] [,11] [,12]
[1,]  0.3  0.31  0.32  0.33  0.34  0.35  0.36  0.37  0.38  0.39   0.4  0.41
[2,] 17.0 17.00 17.00 17.00 17.00 17.00 17.00 17.00 18.00 17.00  16.0 16.00
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20] [,21] [,22] [,23] [,24]
[1,]  0.42  0.43  0.44  0.45  0.46  0.47  0.48  0.49   0.5  0.51  0.52  0.53
[2,] 16.00 16.00 15.00 14.00 14.00 14.00 13.00 15.00  15.0 15.00 16.00 16.00
     [,25] [,26] [,27] [,28] [,29] [,30] [,31] [,32] [,33] [,34] [,35] [,36]
[1,]  0.54  0.55  0.56  0.57  0.58  0.59   0.6  0.61  0.62  0.63  0.64  0.65
[2,] 16.00 17.00 17.00 17.00 17.00 18.00  18.0 18.00 18.00 19.00 20.00 20.00
     [,37] [,38] [,39] [,40] [,41]
[1,]  0.66  0.67  0.68  0.69   0.7
[2,] 20.00 21.00 21.00 21.00  21.0
It is a bit messy output, but at 0.48 the least errors are found.

Predictions

This is fairly straightforward. I am not unhappy to report an improvement, bringing me from tail to middle region of the peloton.
pp <- predict(gb1m,test,n.trees=6000,type='response')
pp <- ifelse(pp<0.48,0,1)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='gbm.csv',
    row.names=FALSE,
    quote=FALSE)

Sunday, July 19, 2015

Predicting Titanic deaths on Kaggle

Kaggle has a competition to predict who will die on the famous Titanic 'Machine Learning from Disaster''. It is placed as knowledge competition. Just up there to learn. I am late to the party, it has been been for 1 1/2 year, to end by end 2015. It is a small data set, hence interesting to learn from. It is also a competition with a number of entries which have perfect predictions.
Just for fun, I have been trying to see what I would achieve with simple attempt with randomforest. For those in the competition, this randomforest got me 0.74163, placing me 2781 out of 3064 entries. An acceptable spot, since this is using off the shelf approach. An improvement may follow in a subsequent post.

Data

Data downloaded from Kaggle. It is real world data, hence has the odd missing (in passenger age) and a number of columns with messy data, which might be employed to create additional variables. For the purpose of validation about 90% of the data gets flagged to be training set. test will be the test, set, results of which to be passed back to Kaggle.
  PassengerId Survived Pclass                                                Name
1           1        0      3                             Braund, Mr. Owen Harris
2           2        1      1 Cumings, Mrs. John Bradley (Florence Briggs Thayer)
3           3        1      3                              Heikkinen, Miss. Laina
4           4        1      1        Futrelle, Mrs. Jacques Heath (Lily May Peel)
5           5        0      3                            Allen, Mr. William Henry
6           6        0      3                                    Moran, Mr. James
     Sex Age SibSp Parch           Ticket    Fare Cabin Embarked
1   male  22     1     0        A/5 21171  7.2500              S
2 female  38     1     0         PC 17599 71.2833   C85        C
3 female  26     0     0 STON/O2. 3101282  7.9250              S
4 female  35     1     0           113803 53.1000  C123        S
5   male  35     0     0           373450  8.0500              S
6   male  NA     0     0           330877  8.4583              Q
library(dplyr)
library(randomForest)
library(lattice)
options(width=85)
head(read.csv('train.csv'))

titanic <- read.csv('train.csv') %>%
    mutate(.,Pclass=factor(Pclass),
        Survived=factor(Survived),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        A=grepl('A',Cabin),
        B=grepl('B',Cabin),
        C=grepl('C',Cabin),
        D=grepl('D',Cabin),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
        train = sample(c(TRUE,FALSE),
            size=891,
            replace=TRUE, 
            prob=c(.9,.1)   ) )
test <- read.csv('test.csv') %>%
    mutate(.,Pclass=factor(Pclass),
        age=ifelse(is.na(Age),35,Age),
        age = cut(age,c(0,2,5,9,12,15,21,55,65,100)),
        A=grepl('A',Cabin),
        B=grepl('B',Cabin),
        C=grepl('C',Cabin),
        D=grepl('D',Cabin),
        cn = as.numeric(gsub('[[:space:][:alpha:]]','',Cabin)),
        oe=factor(ifelse(!is.na(cn),cn%%2,-1)),
        Embarked=factor(Embarked,levels=levels(titanic$Embarked))
        )
test$Fare[is.na(test$Fare)]<- median(titanic$Fare)
Age has missing values, hence the first step is to fill those in. In the code above, an age factor has been made, where missings are imputed the largest category. 

Model building

A simple prediction using randomForest.
rf1 <- randomForest(Survived ~ 
        Sex+Pclass + SibSp +
        Parch + Fare + 
        Embarked + age +
        A+B+C+D +oe,
    data=titanic,
    subset=train,
    replace=FALSE,
    ntree=1000)
Call:
 randomForest(formula = Survived ~ Sex + Pclass + SibSp + Parch +      Fare + Embarked + age + A + B + C + D + oe, data = titanic,      replace = FALSE, ntree = 1000, subset = train) 
               Type of random forest: classification
                     Number of trees: 1000
No. of variables tried at each split: 3

        OOB estimate of  error rate: 16.94%
Confusion matrix:
    0   1 class.error
0 454  40  0.08097166
1  95 208  0.31353135

This shows some bias in the predictions. Class one gets predicted as class 0 far too often. Hence I will optimize not only the normal variables nodesize (Minimum size of terminal nodes) and mtry (Number of variables randomly sampled as candidates at each split) but also classwt (Priors of the classes).

titanic$pred <- predict(rf1,titanic)
with(titanic[!titanic$train,],sum(pred!=Survived)/length(pred))

mygrid <- expand.grid(nodesize=c(2,4,6),
    mtry=2:5,
    wt=seq(.5,.7,.05))
sa <- sapply(1:nrow(mygrid), function(i) {
    rfx <- randomForest(Survived ~ 
        Sex+Pclass + SibSp +
        Parch + Fare + 
        Embarked + age +
        A+B+C+D +oe,
    data=titanic,
    subset=train,
    replace=TRUE,
    ntree=4000,
    nodesize=mygrid$nodesize[i],
    mtry=mygrid$mtry[i],
    classwt=c(1-mygrid$wt[i],mygrid$wt[i]))  
    preds <- predict(rfx,titanic[!titanic$train,])
    nwrong <- sum(preds!=titanic$Survived[!titanic$train])
    c(nodesize=mygrid$nodesize[i],mtry=mygrid$mtry[i],wt=mygrid$wt[i],pw=nwrong/length(preds))
    })
tsa <- as.data.frame(t(sa))
xyplot(pw ~ wt | mtry,group=factor(nodesize), data=tsa,auto.key=TRUE,type='l')

What is less visible from this plot is the amount of variation I had in the results. One prediction better or worse really makes a difference in the figure. This is the reason I have increased the number of trees in the model.

Final predictions

Using the best settings from above, gets you to the bottom of the ranking. The script makes the model, writes predictions in the format required by kaggle.
rf2 <- randomForest(Survived ~ 
        Sex+Pclass + SibSp +
        Parch + Fare + 
        Embarked + age +
        A+B+C+D +oe,
    data=titanic,
    replace=TRUE,
    ntree=5000,
    nodesize=4,
    mtry=3,
    classwt=c(1-.6,.6))  

pp <- predict(rf2,test)
out <- data.frame(
    PassengerId=test$PassengerId,
    Survived=pp,row.names=NULL)
write.csv(x=out,
    file='rf1.csv',
    row.names=FALSE,
    quote=FALSE)

Sunday, July 5, 2015

More on causes of death in Netherlands over the years

Last week I had a post 'Deaths in the Netherlands by cause and age'. During creation of that post I made one plot which I had not shown. It shows something odd. There is a vertical striping. Hence mortality varies by year across age.
To examine this phenomenon further here is a plot of some underlying causes. I would say the striping is present for at least three categories; "Diseases of the circulatory system", "Diseases of the respiratory organs" and "Sympt., Abnormal clinical Observations". This is odd, since these do not seem to be contagious. I suspect therefore that something like harsh weather (heat or cold) makes life more difficult, but does not get to be the final cause in the administration.
In addition there is something which I did not realize before regarding "Mental and behavioral disorders". They are age related. But it also seems that somewhere in the nineties of last century they became acceptable to register. And suddenly they are present, across several age categories.
This plot, same data, differently organized, shows that the years with these causes are similar, especially "Diseases of the circulatory system" and "Diseases of the respiratory organs"

Can it statistically be seen?

It is very nice that I can see that, but how about measuring it? Hence for age 90 to 95, after detrending, correlation between the two most visually correlated causes of death.
Pearson's product-moment correlation

data:  xx$`Diseases of the respiratory organs` and xx$`Diseases of the circulatory system`
t = 2.4997, df = 62, p-value = 0.01509
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.06133681 0.51042863
sample estimates:
     cor 
0.302584 

Code

library(dplyr)
library(ggplot2)
txtlines <- readLines('Overledenen__doodsoo_170615161506.csv')
txtlines <- grep('Centraal',txtlines,value=TRUE,invert=TRUE) 
#txtlines[1:5]
#cat(txtlines[4])
r1 <- read.csv(sep=';',header=FALSE,
        col.names=c('Causes','Causes2','Age','year','aantal','count'),
        na.strings='-',text=txtlines[3:length(txtlines)]) %>%
    select(.,-aantal,-Causes2)
transcauses <- c(
    "Infectious and parasitic diseases",
    "Diseases of skin and subcutaneous",
    "Diseases musculoskeletal system and connective ",
    "Diseases of the genitourinary system",
    "Pregnancy, childbirth",
    "Conditions of perinatal period",
    "Congenital abnormalities",
    "Sympt., Abnormal clinical Observations",
    "External causes of death",
    "Neoplasms",
    "Illness of blood, blood-forming organs",
    "Endocrine, nutritional, metabolic illness",
    "Mental and behavioral disorders",
    "Diseases of the nervous system and sense organs",
    "Diseases of the circulatory system",
    "Diseases of the respiratory organs",
    "Diseases of the digestive organs",
    "Population",
    "Total all causes of death")
#cc <- 
cbind(transcauses,levels(r1$Causes))
options(width=100)
levels(r1$Causes) <- transcauses
levels(r1$Age) <- 
    gsub('jaar','year',levels(r1$Age)) %>%
    gsub('tot','to',.) %>%
    gsub('of ouder','+',.)
r1 <- mutate(r1,age=as.numeric(sub(' .*$','',Age)))

Deaths <- filter(r1,Causes=='Total all causes of death') %>%
    mutate(.,Total=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[18]) %>%
    mutate(.,Population=count,
        Percentage=100*Total/Population,
        year = as.numeric(gsub('*','',year,fixed=TRUE))) %>%
    select(.,-Causes,-count)

png('deathall.png')
v <- ggplot(Deaths[Deaths$age>60,], aes( year,Age, fill = Percentage))
v + geom_raster() +
    scale_fill_gradientn (
        colours=c('white','black'))+
    theme(legend.position="bottom")+
    ggtitle('Total all causes of death')
dev.off()

v3 <- filter(r1,Causes %in% transcauses[18],
        age>65) %>%
    mutate(.,Population=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[c(8,15,9,10,13,16)]) %>%
    mutate(.,Total=count,
        Percentage=100*Total/Population,
        year = as.numeric(gsub('*','',year,fixed=TRUE))) %>%
    select(.,-count)
png('bycause.png')
 ggplot(v3, aes( year,Age, fill = Percentage))+
  geom_raster() +
    scale_fill_gradientn (
        colours=c('white','black'))+
    theme(legend.position="bottom")+
    facet_wrap( ~ Causes,nrow=3)
dev.off()

png('byage.png')
ggplot(v3[v3$age>75,], aes( year,Causes, fill = Percentage))+
    geom_raster() +
    scale_fill_gradientn (
        colours=c('white','black'))+
    theme(legend.position="bottom")+
    facet_wrap( ~ Age,nrow=3)
dev.off()

xx <- filter(r1,Causes %in% transcauses[18],
        age==90) %>%
    mutate(.,Population=count) %>%
    select(.,-count,-Causes) %>%
    merge(.,r1) %>%
    filter(.,Causes %in% transcauses[c(8,15,9,16)]) %>%
    mutate(.,Total=count,
        Percentage=100*Total/Population,
        year = as.numeric(gsub('*','',year,fixed=TRUE)),
        Causes=factor(Causes)) %>%
    select(.,-count,-Age,-age,-Population,-Total) %>%
   reshape(.,direction='wide',timevar='Causes',idvar='year')

names(xx) <- gsub('Percentage.','',names(xx))
for (i in 2:ncol(xx)) xx[,i]<- xx[,i] - predict(loess(xx[,i] ~ year,data=xx))

cor.test(xx$`Diseases of the respiratory organs`,
    xx$`Diseases of the circulatory system`)