Sunday, March 22, 2015

Launch into space

I saw a link to a list with all rocket launches into space the other day. This post contains some plots concerning day of launch made from that.

Data

Data is a fixed format file with eleven columns. Reading fixed format is not very difficult, however, it requires a bit of preparation. In this case, the first row contains a description of the columns, the second a number of '#'signs to denote the start of the columns. I found it most logical to first get the column widths, use these to create the variable names and then read the data. There is an interesting detail, the first two rows have '#' as first character, read.fwf has this as default comment character and skips them.
library(dplyr)
library(ggplot2)
library(lme4)
r1 <- readLines('launchlog.txt')
colwidth  <- gregexpr('#',r1[2])[[1]] %>%
    c(.,max(nchar(r1))) %>%
    diff(.)

cols <- read.fwf(textConnection(r1[1]),
        widths=colwidth,
        header=FALSE,
        comment.char='@') %>%
    unlist(.) %>%
    make.names(.) %>%
    gsub('\\.+$','',.) # remove extra . at end of string

r2 <- read.fwf('launchlog.txt',
    widths=colwidth,
    col.names=cols) 
Some launches contain more than one pay load. These are represented by rows containing just pay load information. These are removed by removing all records without a success value. In addition, a date variable is created using the launch date. Extra spaces, a consequence of the fixed input format, are removed from the suc variable.
r3 <- filter(r2,!is.na(Suc))
Sys.setlocale(category = "LC_TIME", locale = "C")
r3$Launch.Date..UTC[1:3]
r3$Date <- as.Date(r3$Launch.Date..UTC,format='%Y %b %d')
levels(r3$Suc) <- gsub(' *','',levels(r3$Suc))

Plots

Preparation

Since I wanted proportion of days with a launch, I created a few extra data frames. alldays contains all dates. monthorder and wdorder provide additional variables intended to arrange the months and week days in a reasonable order.
alldays <- data.frame(
    Date=seq(
        min(r3$Date),
        max(r3$Date) ,
        by=1 ))

monthorder <- alldays %>%
    mutate(.,
        month=months(Date),
        mn=format(Date,'%m')) %>%
    select(.,month,mn) %>%
    unique(.) %>%
    arrange(.,mn)

wdorder <- alldays %>%
    mutate(.,
        wd=weekdays(Date),
        wdn=format(Date,'%u')) %>%
    select(.,wd,wdn) %>%    
    unique(.) %>%
    arrange(.,wdn)

By month

January seems to get least launches. On the other hand December most. I wonder if that is publicity seeking in December of the well know January 1st deadline.

By weekdays

As expected, middle of the week gets most launches. The lower value for Monday than Saturday suggests the day before a launch is just as busy as the day of launch itself.
numlawd <-    mutate(r3,
        wd =factor(weekdays(Date),levels=wdorder$wd)) %>%
    xtabs( ~ wd,.) %>%
    as.data.frame(.) %>%
    rename(.,Occur=Freq)
numwd <-    mutate(alldays,
        wd =factor(weekdays(Date),levels=wdorder$wd)) %>%
    xtabs( ~ wd,.) %>%
    as.data.frame(.) %>%
    rename(.,NM=Freq)
propbwd <- merge(numlawd,numwd) %>%
    mutate(prop=Occur/NM)
qplot(y=prop,x= wd,data=propbwd) +
    ylab('Proportion') +
    xlab('Day of Week') +
    ggtitle('Proportion Week Days with Launch') +
    coord_flip()

By year

For years I have made a split between success and fail. It is also the most interesting of plots. Proportion launches per day increases quickly in the first years, as does the chance of a success. By mid eighties the launches decrease, only to pick up in the new millennium. The end of the space race is not visible in the number of launches, but seems to coincide with less failed launches. 
numlayr <- mutate(r3,year =factor(format(Date,'%Y'))) %>%
    xtabs( ~ year + Suc,.) %>%
    as.data.frame(.) %>%
    rename(.,Occur=Freq)
numyr <-    mutate(alldays,year =factor(format(Date,'%Y'))) %>%
    xtabs( ~ year,.) %>%
    as.data.frame(.) %>%
    rename(.,NM=Freq)
propyr <- merge(numlayr,numyr) %>%
    mutate(prop=Occur/NM,
        Year=as.numeric(levels(year))[year])
qplot(y=prop,x= Year,colour=Suc,data=propyr) +
    ylab('Proportion') +
    xlab('Year') +
    ggtitle('Proportion Days with Launch') +
    scale_colour_discrete(name='Success')

Most busy day?

Once there were four launches on a day.
r3 %>% xtabs(~factor(Date,levels=as.character(alldays$Date)),.) %>%
    as.data.frame(.) %>%
    rename(.,Nlaunch=Freq) %>%
    xtabs(~Nlaunch,.) %>%
    as.data.frame(.) 
  Nlaunch  Freq
1       0 16154
2       1  4224
3       2   562
4       3    44
5       4     1

Sunday, March 15, 2015

Medicines under evaluation

While browsing through the internet I ran into the medicines under evaluation page of the European Medicines Agency. There was a .pdf for each month and I thought it would be interesting to look if I could analyze those data. Each .pdf contains various sections of medicine, I have been focusing on the 'Non-orphan medicinal products'. The information given is the international non-proprietary name, the area. In addition, an entry in bold is new. The .pdf I looked it contained not too long a list, with about 40 or 50 entries and a few entries in bold. Hence I had the feeling that about a years worth data would see many medicines entering and leaving the list. I decided to take data starting January 2014, hence have 14 months of data.

Importing data

Reading from .pdf is difficult, this proved no exception. Rather than typing all data I pulled tabula 'Tabula is a tool for liberating data tables locked inside PDF files' from the internet. I marked all tables and converted them to spreadsheets. Unfortunately tabula did not understand that two text lines within a cell of the .pdf means it is one string. Quite some post processing was done to rectify this. Still, it is probably more easy than typing all data. Finally, I ended up with 14 .csv files. These could be imported into R. As can be seen, some post processing was needed, there was some inconsistency in spacing usage. The empty lines are a remainder of my merging texts into cells. I thought it more convenient to strip these in R than in each spreadsheet. The superscript 1 refers to a footnote which is indicated in some .pdf which got transformed into its own cell. Again, I decided to do that part in R. Not seen is some caps usage, this was resolved by editing the spreadsheets.
library(dplyr)
library(survival)
csvs <- dir(pattern='.csv')

step1 <- lapply(csvs,function(csv) {
          print(csv)
          r1 <- readLines(csv) 
          r1 <- r1[r1!=',']  # empty lines
          r1<-  r1[r1!=',1'] # lines with superscript 1
          r1 %<>% read.csv(text=., skip=1,col.names=c('Name','Area'),header=FALSE) %>%
              mutate(.,
                  Name=gsub("([[:space:]]+$)|(^[[:space:]]+)", "", Name),
                  Name=gsub(' +',' ',Name),
                  Name=gsub("/",' / ',Name),
                  Name=gsub("-",' - ',Name),
                  Name=gsub(' +',' ',Name),
                  Area=gsub(' *- *','-',Area),
                  Area=gsub(' +',' ',Area),
                  Area=gsub("([[:space:]]+$)|(^[[:space:]]+)", "", Area),
                  csvs=csv)
        }) %>%
    do.call(rbind,.) 
Since this now contains name of mother file rather than actual month, this is added.
csvsdf <- data.frame(csvs=csvs) %>%
    mutate(.,months =factor(tolower(substr(csvs,1,3)),
            levels=c('jan','feb','mrt','apr','may','jun',
                     'jul','aug','sep','oct','nov','dec')),
        year=as.numeric(substr(csvs,5,8)),
        monthno=12*year+c(1:12)[months],
        monthno=monthno-min(monthno)+1) %>%
    arrange(.,monthno)
step2 <- merge(step1,csvsdf) %>%
    mutate(.,Name=factor(Name),
        Area=factor(Area)) %>%
    arrange(.,Area,Name,monthno)

Areas of medicine

There are 99 medicines, distributed over 34 areas. The most frequent areas are:
xtabs(~ Name + Area,step2) %>%
    as.data.frame(.) %>%
    filter(.,Freq!=0) %>%
    xtabs(~ Area,.) %>%
    as.data.frame(.) %>%
    arrange(.,-Freq) %>%
    filter(.,Freq>4)
                                       Area Freq
1                  Antineoplastic medicines   11
2               Antivirals for systemic use   11
3                Medicines used in diabetes    7
4 Medicines for obstructive airway diseases    6
5                          Antihemorrhagics    5
6                  Antithrombotic medicines    5
7               Other therapeutic medicines    5

Duration

The intention was to run a Cox proportional hazards model. Hence I added an extra row for each medicine where I know both the beginning and the end month off.
terminated <- group_by(step2,Name,Area) %>%
    summarise(.,
        time=n()+1,
        event=!(min(monthno)==1 | max(monthno)==14) ) %>%
    filter(.,event) %>%
    select(.,Name,Area,time,event)
living <- mutate(step2,
        event=FALSE,
        time=monthno-ave(monthno,Name,FUN=min)+1) %>%
    select(.,Name,Area,time,event)
both <- rbind(terminated,living) %>%
    as.data.frame(.) %>%
    arrange(.,Area,Name,time)
Unfortunately Coxph gave warning messages, I do not trust the results sufficiently.
#coxph(Surv(time=time,event=event) ~ Name ,data=both)
Looking at the data it self, it seems I have only very sparse data which are not censored:
step2 %>%
    group_by(.,Name,Area) %>%
    summarise(.,
        time=n(),
        event=!(min(monthno)==1 | max(monthno)==14) ) %>%
    mutate(.,time=time+as.numeric(event) ) %>%
    select(.,Name,Area,time,event) %>%
    xtabs(~ time + event,.)
time FALSE TRUE
  1      8    0
  2     14    1
  3      8    3
  4      2    0
  5      9    1
  6      5    2
  7      7    4
  8      1    1
  9      7    0
  10     2    0
  11     9    1
  12     5    2
  13     3    1
  14     3    0

Sunday, March 8, 2015

SAS PROC MCMC example in R: Nonlinear Poisson Regression Multilevel Random-Effects Model

I am slowly working my way through the PROC MCMCexamples. Regarding these data, the SAS manual says: 'This example uses the pump failure data of Gaver and O’Muircheartaigh (1987) to illustrate how to fit a multilevel random-effects model with PROC MCMC. The number of failures and the time of operation are recorded for 10 pumps. Each of the pumps is classified into one of two groups that correspond to either continuous or intermittent operation'.
Regarding this example, while I was able to reproduce it well in JAGS and STAN, I was not able to get the LaplacesDemon perfect.

Previous post in the series PROC MCMC examples programmed in R were: example 1: sampling, example 2: Box Cox transformation, example 5: Poisson regression, example 6: Non-Linear Poisson Regression and example 6: Logistic Regression Random-Effects Model.

Data

Data read in as below
observed <-
'5  94.320 1  1 15.720 2    5 62.880 1
14 125.760 1  3  5.240 2   19 31.440 1
 1   1.048 2  1  1.048 2    4  2.096 2
22  10.480 2'
observed <- scan(text=gsub('[[:space:]]+',' ',observed),
    what=list(y=integer(),t=double(),group=integer()),
    sep=' '
)
observed <- as.data.frame(observed)
observed$pump <- factor(1:nrow(observed))
observed$group <- factor(observed$group)
observed$logtstd <- log(observed$t)
observed$logtstd <- observed$logtstd-mean(observed$logtstd)

Model

This example has the following model in SAS:
proc mcmc data=pump outpost=postout_c seed=248601 nmc=10000
   plots=trace diag=none;
   ods select tracepanel postsumint;
   array u[2] alpha beta;
   array mu[2] (0 0);
   parms s2 1;
   prior s2 ~ igamma(0.01, scale=0.01);
   random u ~ mvnar(mu, sd=1e6, rho=0) subject=group monitor=(u);
   w = alpha + beta * logtstd;
   random llambda ~ normal(w, var = s2) subject=pump monitor=(random(1));
   lambda = exp(llambda);
   model y ~ poisson(lambda);
run;
The most interesting bit is the llambda distribution statement. Somehow, it contributes to the model fit, and it fills llambda with values. In JAGS a similar construction can be made, but not in LaplacesDemon.

JAGS solution

This is a fairly straightforward transformation of the SAS model.
library(R2jags)
jagsdata <- list(
    N=nrow(observed),
    group=c(1,2)[observed$group],
    logtstd=observed$logtstd,
    y=observed$y)
jagsm <- function() {
  for(i in 1:2) {
    alpha[i] ~ dnorm(0.0,1.0E-3)
    beta[i]~dnorm(0.0,1.0e-3)
  }
  for(i in 1:N) {
    w[i] <- alpha[group[i]]+beta[group[i]]*logtstd[i]
    llambda[i] ~ dnorm(w[i],tau)
    lambda[i] <- exp(llambda[i])
    y[i] ~ dpois(lambda[i])
  }
  tau ~ dgamma(.01,0.01)
  s2 <- 1/tau
  ll8 <- llambda[8]
}
params <- c('alpha','beta','s2','ll8')
myj <-jags(model=jagsm,
    data = jagsdata,
    parameters=params)
myj

Inference for Bugs model at "...model9ef72aab174.txt", fit using jags,
 3 chains, each with 2000 iterations (first 1000 discarded)
 n.sims = 3000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
alpha[1]   2.958   2.482 -2.154  1.548  2.945  4.397  7.878 1.004   640
alpha[2]   1.618   0.887 -0.240  1.130  1.686  2.159  3.223 1.001  2800
beta[1]   -0.444   1.322 -3.083 -1.217 -0.422  0.306  2.241 1.004   600
beta[2]    0.581   0.575 -0.640  0.241  0.604  0.940  1.639 1.003   910
ll8       -0.077   0.835 -1.919 -0.559  0.021  0.503  1.298 1.003  1400
s2         1.792   1.950  0.289  0.717  1.232  2.182  6.463 1.002  3000
deviance  43.132   4.266 36.715 39.992 42.425 45.679 53.082 1.001  3000

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 9.1 and DIC = 52.2
DIC is an estimate of expected predictive error (lower deviance is better).

Jags Solution 2

If I want to replace the llambda distribution by its contributing parts, the model becomes as below. There is now an explicit parameter for the pumps, which shows how much each pump deviates from the hyperdistribution.
jagsm <- function() {
  for(i in 1:2) {
    alpha[i] ~ dnorm(0.0,1.0E-3)
    beta[i]~dnorm(0.0,1.0e-3)
  }
  for(i in 1:N) {
    w[i] <- alpha[group[i]]+beta[group[i]]*logtstd[i]
    pumps[i]  ~ dnorm(0,tau)
    llambda[i] <- pumps[i]+w[i]

    lambda[i] <- exp(llambda[i])
    y[i] ~ dpois(lambda[i])
  }
  tau ~ dgamma(.01,0.01)
  s2 <- 1/tau
  ll8 <- llambda[8]
}

Inference for Bugs model at "/tmp/RtmpdTDOhm/model9ef186d53b5.txt", fit using jags,
 3 chains, each with 2000 iterations (first 1000 discarded)
 n.sims = 3000 iterations saved
         mu.vect sd.vect   2.5%    25%    50%    75%  97.5%  Rhat n.eff
alpha[1]   2.948   2.331 -1.735  1.574  2.930  4.292  7.846 1.004   830
alpha[2]   1.610   0.858 -0.235  1.138  1.658  2.151  3.126 1.007   590
beta[1]   -0.450   1.252 -3.055 -1.153 -0.432  0.284  2.056 1.004   960
beta[2]    0.568   0.570 -0.590  0.232  0.562  0.901  1.670 1.004  1800
ll8       -0.048   0.833 -1.849 -0.571  0.031  0.542  1.405 1.002  1100
s2         1.692   1.796  0.252  0.719  1.173  1.980  6.400 1.010   230
deviance  43.227   5.903 36.514 39.652 42.404 45.584 53.990 1.004   810

For each parameter, n.eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor (at convergence, Rhat=1).

DIC info (using the rule, pD = var(deviance)/2)
pD = 17.4 and DIC = 60.6
DIC is an estimate of expected predictive error (lower deviance is better).

STAN

In STAN it is fairly simple to program the explicit pump effect. However, the code is slightly longer.
stanData <- list(
    N=nrow(observed),
    group=c(1,2)[observed$group],
    logtstd=observed$logtstd,
    y=observed$y)

library(rstan)
smodel <- '
    data {
    int <lower=1> N;
    int <lower=1,upper=2> group[N];
    int  y[N];
    real logtstd[N];
    }
    parameters {
    real Beta[2];
    real Alpha[2];
    real <lower=0> s2;
    vector[N] pumps;
    }
    transformed parameters {
    vector[N] w;
    vector[N] llambda;
    vector[N] lambda;
    for (i in 1:N) {
    w[i] <- Alpha[group[i]]+Beta[group[i]]*logtstd[i];
    }
    llambda <- w+pumps;
    lambda <- exp(llambda);
    }
    model {
    y ~ poisson(lambda);
    s2 ~ inv_gamma(.01,.01);
    pumps ~ normal(0,sqrt(s2));
    Alpha ~ normal(0,1000);
    Beta ~ normal(0,1000);
    }
    generated quantities {
       real ll8;
       ll8 <- llambda[8];
    }
    '
fstan <- stan(model_code = smodel,
    data = stanData,
    pars=c('Beta','Alpha','s2','ll8'))
fstan

Inference for Stan model: smodel.
4 chains, each with iter=2000; warmup=1000; thin=1;
post-warmup draws per chain=1000, total post-warmup draws=4000.

          mean se_mean   sd  2.5%   25%    50%    75%  97.5% n_eff Rhat
Beta[1]  -0.40    0.04 1.29 -3.08 -1.12  -0.41   0.36   2.19  1317 1.00
Beta[2]   0.58    0.02 0.60 -0.69  0.23   0.60   0.97   1.72  1365 1.00
Alpha[1]  2.89    0.06 2.41 -1.93  1.48   2.89   4.26   7.90  1388 1.00
Alpha[2]  1.60    0.03 0.92 -0.39  1.09   1.66   2.18   3.28  1211 1.00
s2        1.79    0.07 1.95  0.31  0.76   1.24   2.12   6.27   710 1.01
ll8      -0.12    0.02 0.88 -2.10 -0.64  -0.01   0.52   1.30  3164 1.00
lp__     99.76    0.13 3.57 91.68 97.55 100.09 102.35 105.74   705 1.00

Samples were drawn using NUTS(diag_e) at Sat Mar  7 19:19:20 2015.
For each parameter, n_eff is a crude measure of effective sample size,
and Rhat is the potential scale reduction factor on split chains (at
convergence, Rhat=1).

LaplacesDemon

This was a problem. Given the code above, it should be straightforward. However, it seems not to give decent parameter values. The code below does use a normal distribution for sqrt(s2). I must say that the sensitivity for various priors of s2 in this model is somewhat worrisome.
library('LaplacesDemon')
mon.names <- "LP"
parm.names <- c(paste(rep(c('alpha','beta'),each=2),c(1,2,1,2),sep=''),
    's2',
    paste('pump',1:10,sep=''))
PGF <- function(Data) {
   c(runif(5,0,2),
       rnorm(10,0,1))
}

MyData <- list(mon.names=mon.names,
    parm.names=parm.names,
    PGF=PGF,
    group=c(1,2)[observed$group],
    logtstd=observed$logtstd,
    y=observed$y)
Model <- function(parm, Data)
{
  beta=parm[c(3,4)]
  alpha=parm[c(1,2)]
  s2=parm[5]
  pumps <- parm[6:15]
  w <- alpha[Data$group]+beta[Data$group]*Data$logtstd
  lambda  <- exp(w+pumps)
     if(s2<0 ) {
     LL <- -Inf
     LP <- -Inf
   } else {
   sd=sqrt(s2)
    LL <- sum(dpois(Data$y,lambda,log=TRUE)) +
        sum(dnorm(pumps,0,sd,log=TRUE))
    prior <- sum(dnorm(parm[1:4],0,1e6,log=TRUE)) +
        dnorm(sd,0,10,log=TRUE)
    LP=LL+prior 
  }
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
      yhat=lambda,
      parm=parm)
  return(Modelout)
}
Initial.Values <- GIV(Model, MyData, PGF=TRUE)
Model(Initial.Values,MyData)

LA <- LaplaceApproximation(Model,
    parm=Initial.Values,
    Data=MyData,
    Method='BFGS',
    Iterations=2000,
    Stop.Tolerance=.001)

#LA$Covar       # covariance
Fit <- LaplacesDemon(Model,
    Data=MyData,
    Covar=LA$Covar,
    Algorithm='IM',
    Specs=list(mu=LA$Summary2[1:15,1]),
    Initial.Values = LA$Summary2[1:15,1],
    Iterations=40000,Status=1000,Thinning=40
)
Fit

Call:
LaplacesDemon(Model = Model, Data = MyData, Initial.Values = LA$Summary2[1:15,
    1], Covar = LA$Covar, Iterations = 40000, Status = 1000,
    Thinning = 40, Algorithm = "IM", Specs = list(mu = LA$Summary2[1:15,
        1]))

Acceptance Rate: 0.04538
Algorithm: Independence Metropolis
Covariance Matrix: (NOT SHOWN HERE; diagonal shown instead)
    alpha1     alpha2      beta1      beta2         s2      pump1      pump2
0.35685994 0.04768758 0.10016722 0.03220140 0.01307985 0.05277034 0.07376908
     pump3      pump4      pump5      pump6      pump7      pump8      pump9
0.04553773 0.06197309 0.05189976 0.08482326 0.06433302 0.07132478 0.06423315
    pump10
0.06345573

Covariance (Diagonal) History: (NOT SHOWN HERE)
Deviance Information Criterion (DIC):
        All Stationary
Dbar 57.754     58.036
pD    2.056      1.696
DIC  59.811     59.733
Initial Values:
     alpha1      alpha2       beta1       beta2          s2       pump1
 3.02746793  1.84533490 -0.40853043  0.64195643  0.45668880 -0.41735869
      pump2       pump3       pump4       pump5       pump6       pump7
-0.97307147 -0.59334362  0.47363733 -0.19075343  0.23899690 -0.13979169
      pump8       pump9      pump10
-0.05190615  0.41583871  1.18566087

Iterations: 40000
Log(Marginal Likelihood): -24.60551
Minutes of run-time: 0.19
Model: (NOT SHOWN HERE)
Monitor: (NOT SHOWN HERE)
Parameters (Number of): 15
Posterior1: (NOT SHOWN HERE)
Posterior2: (NOT SHOWN HERE)
Recommended Burn-In of Thinned Samples: 100
Recommended Burn-In of Un-thinned Samples: 4000
Recommended Thinning: 160
Specs: (NOT SHOWN HERE)
Status is displayed every 1000 iterations
Summary1: (SHOWN BELOW)
Summary2: (SHOWN BELOW)
Thinned Samples: 1000
Thinning: 40


Summary of All Samples
                 Mean        SD        MCSE       ESS           LB       Median
alpha1     3.02902055 0.5976765 0.033875624 465.42785   1.94620166   3.02746793
alpha2     1.85405458 0.2184840 0.015060612 349.50295   1.39240432   1.84803324
beta1     -0.41890176 0.3166503 0.018997023 481.52841  -1.05188183  -0.40853043
beta2      0.63293644 0.1795371 0.011348265 436.71962   0.28758602   0.63737896
s2         0.43066261 0.1144214 0.007302261 274.64302   0.22155991   0.43348687
pump1     -0.37243690 0.2298285 0.014321032 391.96002  -0.82709530  -0.38552546
pump2     -0.93920883 0.2717384 0.016885114 404.84643  -1.48440399  -0.95172632
pump3     -0.54940028 0.2134980 0.013956405 447.00328  -0.94402826  -0.57351952
pump4      0.48362669 0.2490683 0.012248080 522.99249  -0.01000839   0.47363733
pump5     -0.18414709 0.2279291 0.013777023 446.75275  -0.63917122  -0.19075343
pump6      0.25575145 0.2913896 0.018120358 443.85113  -0.33560223   0.23899690
pump7     -0.11427184 0.2537652 0.012513772 614.87586  -0.60451815  -0.10665053
pump8     -0.06221604 0.2672004 0.015234186 433.78090  -0.59796589  -0.05190615
pump9      0.38098070 0.2535670 0.015870761 514.86989  -0.15911780   0.41583871
pump10     1.16241102 0.2520292 0.016987945 337.82320   0.70150436   1.18413142
Deviance  57.75419036 2.0279787 0.277358759  64.22020  54.20970306  57.65028176
LP       -91.03856848 1.0140989 0.138688445  64.23136 -93.07403549 -90.98684715
                   UB
alpha1     4.21405063
alpha2     2.28149310
beta1      0.17711412
beta2      1.00444603
s2         0.68595978
pump1      0.05950237
pump2     -0.35260941
pump3     -0.10527832
pump4      1.00662909
pump5      0.30802060
pump6      0.78622765
pump7      0.36687146
pump8      0.50955936
pump9      0.86326767
pump10     1.69605148
Deviance  61.82358222
LP       -89.26645496


Summary of Stationary Samples
                 Mean        SD        MCSE      ESS          LB       Median
alpha1     3.03336875 0.6197257 0.037684775 419.6243   1.9462017   3.02051199
alpha2     1.85320726 0.2265947 0.016535659 309.2038   1.3924043   1.86648271
beta1     -0.42224878 0.3281491 0.021042445 434.6613  -1.0589187  -0.39535607
beta2      0.62932660 0.1858990 0.012220603 393.5420   0.2875860   0.61842982
s2         0.42761565 0.1194602 0.007947874 330.2853   0.2074068   0.42158060
pump1     -0.36915544 0.2395712 0.015756385 348.6680  -0.8440599  -0.35070290
pump2     -0.93369795 0.2832837 0.018533606 362.1901  -1.4896206  -0.93595868
pump3     -0.54539147 0.2221992 0.015225595 399.3457  -0.9580968  -0.54747889
pump4      0.48469802 0.2601885 0.013553884 516.0653  -0.0511650   0.47031679
pump5     -0.18647603 0.2332461 0.014356623 418.1546  -0.6435100  -0.19870380
pump6      0.25620372 0.3030826 0.020288966 398.5128  -0.3356022   0.25472223
pump7     -0.11024675 0.2634318 0.013481854 555.4753  -0.6184502  -0.09874182
pump8     -0.06356307 0.2750350 0.017000257 387.8663  -0.6158914  -0.08068829
pump9      0.37664857 0.2636891 0.017295769 464.9818  -0.1592315   0.39456739
pump10     1.16067466 0.2622398 0.018642621 303.4919   0.6766951   1.16507221
Deviance  58.03623819 1.8418930 0.200950651 146.1974  54.5450513  57.89738441
LP       -91.17957717 0.9211010 0.138688445 146.1838 -93.1095221 -91.10993804
                   UB
alpha1     4.24847266
alpha2     2.28149310
beta1      0.18123920
beta2      1.01535987
s2         0.68595978
pump1      0.05950237
pump2     -0.35260941
pump3     -0.09900105
pump4      1.01182177
pump5      0.30716549
pump6      0.81374718
pump7      0.36734086
pump8      0.51642505
pump9      0.87133356
pump10     1.69605148
Deviance  61.89407135
LP       -89.43333665

Sunday, March 1, 2015

Should I use premium Diesel? Setup

Since I drive quite a lot, I have some interest in getting the most km out every Euro spent on fuel. One thing to change is the fuel. The oil companies have a premium fuel, which is supposed to be better for both engine and fuel consumption. On the other hand, it is easy to find counter claims which say it is not beneficial for fuel consumption. In that case the extra costs would be a waste. In this post I am creating a setup to check the claims.

Current data

The current information which I have are fuel consumption of last year. I have taken a set of those data from March and April of 2014.
library(MASS)
r1 <- read.table(text='l km
28.25 710.6
22.93 690.4
28.51 760.5
23.22 697.9
31.52 871.2
24.68 689.6
30.85 826.9
23.04 699
29.96 845.3
30.16 894.7
25.71 696
23.6 669.8
28.57 739
27.23 727.4
18.31 499.9
24.28 689.5',header=TRUE)
r1$usage=100*r1$l/r1$km
plot(density(r1$usage),
    main='Observed normal diesel usage',
    xlab='l/100 km')

The data are from a distribution with a mean around 3.6 l/100 km.
fitdistr(r1$usage,'normal')
      mean          sd    
  3.59517971   0.19314598 
 (0.04828649) (0.03414371)

Approach

Analysis will be a hypothesis test and an estimate of premium diesel usage.
The assumptions which I will make are similar driving patterns and weather as last year. I think that should be possible, given my driving style. A cross check may be made, especially regarding obtaining similar speed. Data with serious traffic jams may be discarded in the analysis.
A check for outliers is not planned. However, obviously faulty data will be corrected or removed from the data. No intermediate analysis is planned, unless data seems to be pointing a marked increase of fuel usage.

Power for hypothesis test

The advice price levels of premium and standard diesel are 1.433 and 1.363 Euro/liter according to the internet. This is about 5% price increase. It should be noted that prices at the pump vary wildly from these values, especially non-brand non-manned fuel stations may be significantly cheaper. Last year's data was from such non brand fuel. Competition can force the price of both standard and premium fuel down a bit. I will take the 5% price increase as target for finding value for premium diesel. Given significance level of 10% and power of 90%, I come at 17 samples for each group. This means I will have to take a bit more data from last year, which is not a problem. The choice of alpha and beta reflect that I find both kind of errors equally bad.
power.t.test(delta=3.6*.05,
    sd=0.2,
    sig.level=.1,
    power=.9,
    alternative='one.sided')
     Two-sample t test power calculation 

              n = 16.66118
          delta = 0.18
             sd = 0.2
      sig.level = 0.1
          power = 0.9
    alternative = one.sided

NOTE: n is number in *each* group

Estimating usage of premium diesel

Besides a significance test, I desire an estimate of usage. This manner I can extrapolate the data to other scenarios. I will use a Bayesian analysis to obtain these estimates. The prior to be used is a mixture of three believes. Either it does not make a difference, or there is indeed a 5% gain to be made or something else entirely. This latter is an uninformed prior between 3 and 4 l/km. The combined density is plotted below.
usage <- seq(3.2,3.8,.01)
dens <- (dnorm(usage,3.6,.05)+
      dnorm(usage,3.6/1.05,.05)+
      dnorm(usage,(3.6+3.6/1.05)/2,.15))/3
plot(x=usage,y=dens,type='l',
    ylim=c(0,4),
    ylab='density',
    xlab='l/100 km',
    main='prior')

Sunday, February 22, 2015

Priors on odds and probability of success

In Bayesian Approaches to Clinical Trials and Health-Care Evaluation (David J. Spiegelhalter, Keith R. Abrams, Jonathan P. Myles) they mention that an non informative prior should be uniform over the range of interest. They combine this with the desire that Odds ratio has the same prior as 1/Odds ratio and from this they conclude log Odds ratio should be uniform distributed. They write it is equivalent to Beta(0,0), hence not a proper distribution, but use it anyway. In this post looks at the prior both at log odds scale and probability scale.
As a side, this exercise forced me to look at the Jacobian, I needed it to transform a density to the other scale. For this yacas came in handy, it has been a while since I differentiated by hand.

Analysis of prior

In general, I will try to give both the distribution according to a MCMC sampler (black lines) and an analytical solution (green line). The MCMC sampler used is LaplacesDemon, since it is fully in R it allows reusing analytical code and easy adding of the Jacobian.
As a side, in many of the plots I made the smoothing bandwidth a bit smaller, so some of the details could be captured.
Four priors are used. Probability of success Beta(0.5,0.5) and Beta(1,1). Log odds uniform on -10 to 10 and normal(0,10).
I think LaplacesDemon with the standard algorithm has some difficulty with the beta(0.5,0.5) distribution.



Using Data

After observing data, one success and five failure again same prior. For the priors on log odds there is no exact formulas to obtain the exact likelihood in the book. I used a formula for normal approximation of log odds ratio (2.30) and simplified from there. I am actually surprised how close that is to the MCMC sampler.



Code

# general code
library(magrittr)
library('LaplacesDemon')

#conversion and Jacobian
prob2Lo <- function(prob) log(prob/(1-prob))
Lo2prob <- function(Lo) exp(Lo)/(1+exp(Lo))
Jprob2Lo <- function(prob) prob*(1-prob)
JLo2prob <- function(Lo) (exp(Lo)+1)^2/exp(Lo)

# wrapper for Jacobian
dprob2dLo <- function(Lo,dprob,log=FALSE) {
  prob <- Lo2prob(Lo)
  if (log) {dprob(prob)+log(Jprob2Lo(prob))
  } else dprob(prob)*Jprob2Lo(prob)
}

dLo2dprob <- function(prob,dLo,log=FALSE) {
  Lo <- prob2Lo(prob)
  if (log) {dLo(Lo)+log(JLo2prob(Lo))
  } else dLo(Lo)*JLo2prob(Lo)
}

# helper function
densplot <- function (x,adjust=1,...) {
  density(x,adjust) %>%
      plot(.,...)
}

mon.names <- "LP"
parm.names <- c('prob')
MyData <- list(mon.names=mon.names,
    parm.names=parm.names)
N <-1
Model <- function(parm, Data)
{
  LL <- 1
  yhat <- 1
  LP <- dbeta(parm,.5,.5,log=TRUE)+LL
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LL,
      yhat=1, parm=parm)
  return(Modelout)
}
Initial.Values <- c(0.5)
Fit <- LaplacesDemon(Model, Data=MyData,
    Initial.Values,
    Iteration=1e5,Status=1e4)
png('B55.png')
par(mfrow=c(1,2))
densplot(prob2Lo(Fit$Posterior2),
    adjust=.2,
    main='Prior: Probability ~ Beta .5 .5',
    xlab='Log odds')
Lo <- seq(-10,10,.1)
dLo <- dprob2dLo(Lo,dprob=function(prob)
      dbeta(prob,1,1))
lines(x=Lo,y=dLo,col='green')
##
densplot(Fit$Posterior2,
    adjust=.01,
    main='Prior: Probability ~ Beta .5 .5',
    xlab='Probability Success')
prob <- seq(0.01,.99,.01)
dprob <- dbeta(prob,.5,.5)
lines(x=prob,y=dprob,col='green')
dev.off()
###
Model <- function(parm, Data)
{
  LL <- 1
  yhat <- 1
  LP <- dbeta(parm,1,1,log=TRUE)+LL
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LL,
      yhat=1, parm=parm)
  return(Modelout)
}
Initial.Values <- c(0.5)
Fit <- LaplacesDemon(Model,
    Data=MyData,
    Initial.Values,
    Iteration=1e5,Status=1e4)
png('B11.png')
par(mfrow=c(1,2))
densplot(prob2Lo(Fit$Posterior2),
    adjust=.1,
    main='Prior: Prob ~ Beta 1 1',
    xlab='Log odds')
Lo <- seq(-10,10,.1)
dLo <- dprob2dLo(Lo,dprob=function(prob)
      dbeta(prob,1,1))
lines(x=Lo,y=dLo,col='green')
densplot(Fit$Posterior2,adj=.01,
    main='Prior: Prob ~ Beta 1 1',
    xlab='Probability Success')
prob <- seq(0.01,.99,.01)
dprob <- dbeta(prob,1,1)
lines(x=prob,y=dprob,col='green')
dev.off()
###
# prior on log odds
##
Model <- function(parm, Data)
{
  LL <- 1
  yhat <- 1
  LP <- dunif(parm,-10,10,log=TRUE)+LL
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LL,
      yhat=1, parm=parm)
  return(Modelout)
}
Initial.Values <- c(0)
Fit <- LaplacesDemon(Model,
    Data=MyData,
    Initial.Values,
    Iteration=1e5,
    Status=1e4)
png('u1010.png')
par(mfrow=c(1,2))
densplot(Fit$Posterior2,
    adjust=.1,
    main='Prior: Log Odds ~ Unif -10 10',
    xlab='Log odds')
Lo <- seq(-10.1,10.1,.1)
dLo <- dunif(Lo,-10,10)
lines(x=Lo,y=dLo,col='green')
densplot(Lo2prob(Fit$Posterior2),
    adjust=.01,
    main='Prior: Log Odds ~ Unif -10 10',
    xlab='Probability Success')
prob <- seq(0.01,.99,.01)
dprob <- dLo2dprob(prob,dLo=function(Lo)
      dunif(Lo,-10,10))
lines(x=prob,y=dprob,col='green')
dev.off()
#
Model <- function(parm, Data)
{
  LL <- 1
  yhat <- 1
  LP <- dnorm(parm,0,10,log=TRUE)+LL
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LL,
      yhat=1, parm=parm)
  return(Modelout)
}
Initial.Values <- c(0)
Fit <- LaplacesDemon(Model,
    Data=MyData,
    Initial.Values,
    Iteration=1e5,
    Status=1e4)
png('N010.png')
par(mfrow=c(1,2))
densplot(Fit$Posterior2,
    adjust=.1,
    main='Prior: Log Odds ~ norm 0,10',
    xlab='Log odds')
Lo <- seq(-20.1,20.1,.1)
dLo <- dnorm(Lo,0,10)
lines(x=Lo,y=dLo,col='green')
##
densplot(Lo2prob(Fit$Posterior2),
    adjust=.01,
    main='Prior: Log Odds ~ norm 0,10',
    xlab='Probability Success')
prob <- seq(0.01,.99,.01)
dprob <- dLo2dprob(prob,dLo=function(Lo) dnorm(Lo,0,10))
lines(x=prob,y=dprob,col='green')
dev.off()

#############
# with data
#############
# 1 out of 6
parm.names <- c('prob')
MyData <- list(mon.names=mon.names,
    parm.names=parm.names,
    n=6,s=1)

Model <- function(parm, Data)
{
  prob <- parm
  LL <- dbinom(prob,x=Data$s,size=Data$n,log=TRUE)
  LP <- dbeta(prob,.5,.5,log=TRUE)+LL
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LL,
      yhat=1, parm=parm)
  return(Modelout)
}
Initial.Values <- c(.5)
Fit <- LaplacesDemon(Model, Data=MyData,
    Initial.Values,
    Iteration=1e5,Status=1e4)
png('db55.png')
par(mfrow=c(1,2))
densplot(prob2Lo(Fit$Posterior2),
    adjust=.1,
    main='Prior: Probability ~ Beta .5 .5',
    xlab='Log odds')
Lo <- seq(-10,10,.1)
dLo <- dprob2dLo(Lo,dprob=function(prob)
      dbeta(prob,1.5,5.5))
lines(x=Lo,y=dLo,col='green')
##
densplot(Fit$Posterior2,
    adjust=.01,
    main='Prior Probability ~ Beta .5 .5',
    xlab='Probability Success',
    xlim=c(0,1))
prob <- seq(0.01,.99,.01)
dprob <- dbeta(prob,1.5,5.5)
lines(x=prob,y=dprob,col='green')
dev.off()
# 1 1
parm.names <- c('prob')
MyData <- list(mon.names=mon.names,
    parm.names=parm.names,n=6,s=1)

Model <- function(parm, Data)
{
  prob <- parm
  LL <- dbinom(prob,x=Data$s,size=Data$n,log=TRUE)
  LP <- dbeta(prob,1,1,log=TRUE)+LL
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LL,
      yhat=1, parm=parm)
  return(Modelout)
}
Initial.Values <- c(.5)
Fit <- LaplacesDemon(Model, Data=MyData,
    Initial.Values,
    Iteration=1e5,Status=1e4)
png('db11.png')
par(mfrow=c(1,2))
densplot(prob2Lo(Fit$Posterior2),
    adjust=.1,
    main='Prior: Probability ~ Beta 1 1',
    xlab='Log odds')
Lo <- seq(-10,10,.1)
dLo <- dprob2dLo(Lo,dprob=function(prob) dbeta(prob,2,6))
lines(x=Lo,y=dLo,col='green')
densplot(Fit$Posterior2,
    adjust=.01,
    main='Prior Probability ~ Beta 1 1',
    xlab='Probability Success',
    xlim=c(0,1))
prob <- seq(0.01,.99,.01)
dprob <- dbeta(prob,2,6)
lines(x=prob,y=dprob,col='green')
dev.off()
##u -10 10 #########
mon.names <- "LP"
parm.names <- c('Lo')
MyData <- list(mon.names=mon.names,
    parm.names=parm.names,
    n=6,s=1)

Model <- function(parm, Data)
{
  Lo <- parm
  LL <-dprob2dLo(Lo,
      dprob=function(prob)
        dbinom(prob,
            x=Data$s,
            size=Data$n ,
            log=TRUE),
      log=TRUE)
  yhat <- Lo
  LP <- dunif(Lo,-10,10,log=TRUE)+LL
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LL,
      yhat=1, parm=parm)
  return(Modelout)
}
Initial.Values <- c(.1)
Fit <- LaplacesDemon(Model,
    Data=MyData,
    Initial.Values,
    Iteration=1e5,
    Status=1e4)
png('du1010.png')
par(mfrow=c(1,2))
densplot(Fit$Posterior2,
    adjust=.1,
    main='Prior: Log Odds ~ Unif -10 10',
    xlab='Log odds')
Lo <- seq(-20.1,20.1,.1)
dLo <- dnorm(Lo,log(1.5/5.5),sqrt(1/1.5+1/5.5))
lines(x=Lo,y=dLo,col='green')
densplot(Lo2prob(Fit$Posterior2),
    adjust=.01,
    main='Prior: Log Odds ~ Unif -10 10',
    xlab='Probability Success')
prob <- seq(0.01,.99,.01)
dprob <- dLo2dprob(prob,dLo=function(Lo) dnorm(Lo,log(1.5/5.5),sqrt(1/1.5+1/5.5)))
lines(x=prob,y=dprob,col='green')
dev.off()
## N 0 10
parm.names <- c('Lo')
MyData <- list(mon.names=mon.names,
    parm.names=parm.names,
    n=6,s=1)

Model <- function(parm, Data)
{
  Lo <- parm
  LL <-dprob2dLo(Lo,
      dprob=function(prob)
        dbinom(prob,
            x=Data$s,
            size=Data$n,
            log=TRUE),
      log=TRUE)
  yhat <- Lo
  LP <- dnorm(Lo,0,10,log=TRUE)+LL
  Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LL,
      yhat=1, parm=parm)
  return(Modelout)
}
Initial.Values <- c(.1)
Fit <- LaplacesDemon(Model,
    Data=MyData,
    Initial.Values,
    Iteration=1e5,
    Status=1e4)
png('dn010.png')
par(mfrow=c(1,2))

t1 <- 0
t2 <- log(1.5/5.5)
var1 <- 100
var2 <- 1/1.5+1/5.5
tt <- (t1/var1+t2/var2)/(1/var1+1/var2)
vv <- 1/(1/var1+1/var2)
densplot(Fit$Posterior2,
    adjust=.1,
    main='Prior: Log Odds ~ norm 0,10',
    xlab='Log odds')
Lo <- seq(-20.1,20.1,.1)
dLo <- dnorm(Lo,tt,sqrt(vv))
lines(x=Lo,y=dLo,col='green')
densplot(Lo2prob(Fit$Posterior2),
    adjust=.01,
    main='Prior: Log Odds ~ norm 0,10',
    xlab='Probability Success')
prob <- seq(0.01,.99,.01)
dprob <- dLo2dprob(prob,dLo=function(Lo) dnorm(Lo,tt,sqrt(vv)))
lines(x=prob,y=dprob,col='green')
dev.off()