Sunday, December 21, 2014

Merry Christmas

Based on The DO loop, since I wanted a fractal Christmas tree and there is no point in inventing what has been made already. Besides, this is not the first time this year that I used R to do what has been done in SAS.

Code

# http://blogs.sas.com/content/iml/2012/12/14/a-fractal-christmas-tree/
# Each row is a 2x2 linear transformation 
# Christmas tree 
L <-  matrix(
    c(0.03,  0,     0  ,  0.1,
        0.85,  0.00,  0.00, 0.85,
        0.8,   0.00,  0.00, 0.8,
        0.2,  -0.08,  0.15, 0.22,
        -0.2,   0.08,  0.15, 0.22,
        0.25, -0.1,   0.12, 0.25,
        -0.2,   0.1,   0.12, 0.2),
    nrow=4)
# ... and each row is a translation vector
B <- matrix(
    c(0, 0,
        0, 1.5,
        0, 1.5,
        0, 0.85,
        0, 0.85,
        0, 0.3,
        0, 0.4),
    nrow=2)

prob = c(0.02, 0.6,.08, 0.07, 0.07, 0.07, 0.07)

# Iterate the discrete stochastic map 
N = 1e5 #5  #   number of iterations 
x = matrix(NA,nrow=2,ncol=N)
x[,1] = c(0,2)   # initial point
k <- sample(1:7,N,prob,replace=TRUE) # values 1-7 

for (i in 2:N) 
  x[,i] = crossprod(matrix(L[,k[i]],nrow=2),x[,i-1]) + B[,k[i]] # iterate 

# Plot the iteration history 
png('card.png')
par(bg='darkblue',mar=rep(0,4))    
plot(x=x[1,],y=x[2,],
    col=grep('green',colors(),value=TRUE),
    axes=FALSE,
    cex=.1,
    xlab='',
    ylab='' )#,pch='.')

bals <- sample(N,20)
points(x=x[1,bals],y=x[2,bals]-.1,
    col=c('red','blue','yellow','orange'),
    cex=2,
    pch=19
)
text(x=-.7,y=8,
    labels='Merry',
    adj=c(.5,.5),
    srt=45,
    vfont=c('script','plain'),
    cex=3,
    col='gold'
)
text(x=0.7,y=8,
    labels='Christmas',
    adj=c(.5,.5),
    srt=-45,
    vfont=c('script','plain'),
    cex=3,
    col='gold'
)

Sunday, December 14, 2014

Monthly Weather in Netherlands

When I downloaded the KNMI meteorological data, the intention was to do something which takes more than just the computers memory. While it is clearly not big data, at the very least 100 years of daily data is not small either. So I took along a load of extra variables to see what trouble I would run into. I did not run out of memory, but did make some figures.

Data

Data are acquired from KNMI. They have various sets of data, this page has a selection form which leads to the data used today. The data comes with a header explaining details, unfortunately in Dutch.

Plots

Just about everybody knows days are shorter in winter. What I never realized, even within that shorter day, we get less daylight. The short days are often so clouded, we don't get sun, meanwhile, in summer the sun does shine a bigger part of the daylight period.


In real hours sunshine this results in the following plot. December clearly has the darkest hours.
What we do get, not surprising since Dutch weather is fairly similar to English weather, is rain. Not continuous rain, most of the time it is dry, but still, autumn and winter do have days where it does not seem to stop. Autumn has the bad reputation for rain, but this plot makes winter look particular bad.
All this rain gives a load of humidity. This humidity in its turn, gives rise to a weather we name 'waterkoud'. It is above zero C but still quite cold outside. The humidity makes for air with a high heat capacity, hence one cools down quickly. Temperatures below zero can make for much nicer weather, but that can hamper traffic quite a lot. Most of the time it just doesn't freeze.
Code

library(plyr)
library(dplyr)
library(ggplot2)
r1 <- read.csv('KNMI_20141115.edited.txt')
Sys.setlocale(category = "LC_TIME", locale = "C")
r2 <- mutate(r1,
    date = as.Date(format(YYYYMMDD),'%Y%m%d'),
    month =factor(months(date,abbreviate=TRUE),
        levels=months(as.Date(
                paste('2014',
                    formatC(1:12,digits=2,width=2,flag='0'),
                    '01',sep='-')),
            abbreviate=TRUE)),
    yearf=factor(format(date,'%Y')),
    yearn=as.numeric(substr(YYYYMMDD,1,4)),
    day=format(date,'%e'))

g1 <- ggplot(r2,aes(x=month,y=SP))
g1 + geom_violin() +
        ylab('% of longest possible sunshine')

g1 <- ggplot(r2,aes(x=month,y=SQ/10))
g1 + geom_violin()  +
        ylab('Sunshine duration (h)')

g1 <- ggplot(r2,aes(x=month,y=DR/10))
g1 + geom_violin() +
        scale_y_continuous('Precipitation Duration (h)',
                breaks=c(0,6,12,18,24))

g1 <- ggplot(r2,aes(x=month,y=UG))
g1 + geom_violin() +
        ylab('Relative Humidity (%)')
 

g1 <- ggplot(r2,aes(x=month,y=TG/10))
g1 + geom_violin() +
        ylab('Temperature (C)')

Saturday, December 6, 2014

SAS PROC MCMC in R: Nonlinear Poisson Regression Models

In exercise 61.1 the problem is that the model has bad mixing. In the SAS manual the mixing is demonstrated after which a modified distribution is used to fix the model.
In this post the same problem is tackled in R; MCMCpack, RJags, RStan and LaplaceDemon. MCMCpack has quite some mixing problems, RStan seems to do best.

Data

To quote the SAS manual "This example shows how to analyze count data for calls to a technical support help line in the weeks immediately following a product release. (...) You can model the number of daily calls as a Poisson random variable, with the average number of calls modeled as a nonlinear function of the number of weeks that have elapsed since the product’s release. (...) During the first several weeks after a new product is released, the number of questions that technical support receives concerning the product increases in a sigmoidal fashion. The expression for the mean value in the classic Poisson regression involves the log link. There is some theoretical justification for this link, but with MCMC methodologies, you are not constrained to exploring only models that are computationally convenient. The number of calls to technical support tapers off after the initial release, so in this example you can use a logistic-type function to model the mean number of calls received weekly for the time period immediately following the initial release."
observed <- scan(text='
1 0 1 2 2 2 2 1 3 1 3 3
4 5 4 8 5 5 5 9 6 17 6 9
7 24 7 16 8 23 8 27',
    what=list(integer(),integer()),
    sep=' ',
)
names(observed) <- c('weeks','calls')
observed <- as.data.frame(observed)

Analysis

MCMCpack

The MCMCpack solution is derived from LaplacesDemon solution below. It is placed as first because it shows some of the problems with the mixing. As a change from LaplacesDemon, gamma could get negative, for which I made a -Inf likelihood.
library(MCMCpack)
MCMCfun <- function(parm) {
    names(parm) <- c('alpha','beta','gamma')
    if (parm['gamma']<0) return(-Inf)
    mu <-parm['gamma']*
        LaplacesDemon::invlogit(parm['alpha']+parm['beta']*observed$weeks)
    LL <- sum(dpois(observed$calls,mu,log=TRUE))
    LP <- LL+ dgamma(parm['gamma'],shape=3.4,scale=12,log=TRUE) +
        dnorm(parm['alpha'],-5,sd=.25,log=TRUE) +
        dnorm(parm['beta'],0.75,.5,log=TRUE)
    return(LP)
}
mcmcout <- MCMCmetrop1R(MCMCfun,
    c(alpha=-4,beta=1,gamma=2))

The figures show bad mixing. Parameters, especially Beta and Gamma, get stuck. There is quite some autocorrelation.
plot(mcmcout)
acf(mcmcout)
The cause is a nasty correlation between Beta and Gamma
pairs(as.data.frame(mcmcout))

LaplacesDemon

I chose an adaptive algorithm for LaplacesDemon. For the specs, the numbers are derived from the standard deviation of a previous run. Stepsize keeps reasonably constant through the latter part of run. The samples look much better than MCMCpack, although the mixing is not ideal either.
At a later stage I tried this same analysis with reflective Slice Sampler, however, that did was quite a bit more difficult and the end result was not better than this.
library('LaplacesDemon')
mon.names <- "LP"
parm.names <- c('alpha','beta','gamma')
PGF <- function(Data) c(rnorm(3,0,1))
N <-1
MyData <- list(mon.names=mon.names,
    parm.names=parm.names,
    PGF=PGF,
    calls=observed$calls,
    weeks=observed$weeks)
Model <- function(parm, Data)
{
    mu <-parm['gamma']*
        invlogit(parm['alpha']+parm['beta']*Data$weeks)
    LL <- sum(dpois(Data$calls,mu,log=TRUE))
    LP <- LL+ dgamma(parm['gamma'],shape=3.4,scale=12,log=TRUE) +
        dnorm(parm['alpha'],-5,sd=.25,log=TRUE) +
        dnorm(parm['beta'],0.75,.5,log=TRUE)
    Modelout <- list(LP=LP, Dev=-2*LL, Monitor=LP,
        yhat=mu,
        parm=parm)
    return(Modelout)
}

Initial.Values <- c(alpha=-4,beta=1,gamma=2) #GIV(Model, MyData, PGF=TRUE)
Fit1 <- LaplacesDemon(Model,
    Data=MyData,
    Initial.Values = Initial.Values,
    Algorithm = "AHMC",
    Specs = list(epsilon = c(.23,.2,13.5)/4,
        L = 2, Periodicity = 10),
    Iterations=40000,Status=2000   
)
BurnIn <- Fit1$Rec.BurnIn.Thinned
plot(Fit1, BurnIn, MyData, PDF=FALSE)

Jags

I do not think using one chain is advisable, especially since Jags makes more chains so easy. But in the spirit of this analysis I am using one. Coda plots are used since they are a bit more compact.
library(R2jags)
datain <- list(
    calls=observed$calls,
    weeks=observed$weeks,
    n=nrow(observed))
parameters <- c('alpha','beta','gamma')

jmodel <- function() {
    for (i in 1:n) {       
        mu[i] <- gamma*ilogit(alpha+beta*weeks[i])
        calls[i] ~ dpois(mu[i])
    }
    alpha ~ dnorm(-5,1/(.25*.25))
    gamma ~ dgamma(3.4,1/12)
    beta ~ dnorm(.75,1/(.5*.5))
}

jj <- jags(model.file=jmodel,
    data=datain,
    parameters=parameters,
    n.chains=1,
    inits=list(list(alpha=-4,beta=1,gamma=2))
    )

cc <- as.mcmc(jj$BUGSoutput$sims.array[,1,])

plot(cc)
acfplot(cc)

Stan

Stan probably does best handling this typical distribution. Again, one chain is only in the context of this posting.
library(rstan)
smodel <- '
    data {
    int <lower=1> n;
    int calls[n];
    real  weeks[n];
    }
    parameters {
    real Alpha;
    real Beta;
    real Gamma;
    }
    transformed parameters {
    vector[n] mu;
    for (i in 1:n) {
       mu[i] <- Gamma*inv_logit(Alpha+Beta*weeks[i]);
    }
    }
    model {
    calls ~ poisson(mu);
    Gamma ~ gamma(3.4,1./12.);
    Beta ~ normal(.75,.5);
    Alpha ~ normal(-5,.25);
    }
    '
fstan <- stan(model_code = smodel,
    data = datain,
    pars=c('Alpha','Beta','Gamma'),
    chains=1,
    init=list(list(Alpha=-4,Beta=1,Gamma=2)))

traceplot(fstan,inc_warmup=FALSE)

smc <- as.mcmc(as.matrix(fstan))
acfplot(smc)






Sunday, November 30, 2014

Change in temperature in Netherlands over the last century

I read a post 'race for the warmest year' at sargasso.nl. They used a plot, originating from Ed Hawkins to see how 2014 progressed to be warmest year. Obviously I wanted to make the same plot using R. In addition, I wondered which parts of the year had most increased in temperature.

Data

Similar to last week, data are acquired from KNMI. They have various sets of data, this page has a selection form which leads to the data used today. The data comes with a header explaining details, unfortunately in Dutch. Of relevance for this post is TG, average and minimum temperature in 0.1 C. Station used is de Bilt, where they got most data. Prior to reading the data into R, the explanatory text header was removed.
The data input is completed by converting YYYYMMDD to date, year, month and dayno variables. Prior to analysis for simplicity a leap day was removed. I chose merge() rather than a dplyr merge function since the latter releveled my moth factor. The data.frame mylegend is used later on to label the first of the month.
library(plyr)
library(dplyr)
library(ggplot2)
library(locfit)
library(plotrix)
r1 <- read.csv('KNMI_20141115.edited.txt')
Sys.setlocale(category = "LC_TIME", locale = "C")
r2 <- mutate(r1,
    date = as.Date(format(YYYYMMDD),'%Y%m%d'),
    month =factor(months(date,abbreviate=TRUE),
        levels=months(as.Date(
                paste('2014',
                    formatC(1:12,digits=2,width=2,flag='0'),
                    '01',sep='-')),
            abbreviate=TRUE)),
    yearf=factor(format(date,'%Y')),
    yearn=as.numeric(substr(YYYYMMDD,1,4)),
    day=format(date,'%e'))

days <- filter(r2,yearn==1901) %>%
    mutate(.,dayno=format(date,'%j') ) %>%
    select(.,month,day,dayno)

r3 <- merge(r2,days,all=TRUE) %>%
    filter(.,!grepl('0229',YYYYMMDD)) %>%
    mutate(.,daynon=as.numeric(dayno))

mylegend <- filter(days,day==' 1') %>%
    mutate(.,daynon=as.numeric(dayno))

Plots

Cumulative average by year

Each line is a separate year. For this plot is is needed to have year a factor. Unfortunately, I was not able to get a colourbar legend for colors, that required a continuous year variable. Green is beginning last century, pinkish is recent, the fat black line is 2014.

r4 <- group_by(r3,yearf) %>%
    mutate(.,cmtemp = cummean(TG/10))

g1 <- ggplot(r4,aes(x=daynon,y=cmtemp,
        col=yearf))
g1 + geom_line(alpha=.4,show_guide=FALSE)  +
    scale_x_continuous('Day',
        breaks=mylegend$daynon,
        labels=mylegend$month,
        expand=c(0,0)) +
    scale_y_continuous('Temperature (C)')  +
    geom_line(data=r4[r4$yearf=='2014',],
        aes(x=daynon,y=cmtemp),
        col='black',
        size=2)

2014 with average of 30 years

To get a better idea how 2014 compares to previous years, the average of 30 years has been added. We had warm year, except for August, which suggested an early spring. In hindsight, second half of August had colder days than beginning April or end October.


r3$Period <- cut(r3$yearn,c(seq(1900,2013,30),2013,2014),
    labels=c('1901-1930','1931-1960',
        '1961-1990','1991-2013','2014'))
g1 <- ggplot(r3[r3$yearn<2014,],aes(x=daynon,y=TG/10,col=Period))
g1 + geom_smooth(span=.15,method='loess',size=1.5)  +
    scale_x_continuous('Day',
        breaks=mylegend$daynon,
        labels=mylegend$month,
        expand=c(0,0)) +
    geom_line(#aes(x=daynon,y=TG/10),
        data=r3[r3$yearn==2014,]) +
    scale_y_continuous('Temperature (C)')

Change by year

Finally, a plot showing how temperature changed within the years. To obtain this plot, I needed a day corrected base temperature. The baseline temperature is smoothed over days for years 1901 to 1924. The baseline was used to get a corrected baseline, which was subsequently smoothed over years and days.
Smoothers have edge effects, to remove these from the visual part, January and December have been added as extra to the data. Hence within the year there are only minimal edge effects.
The plot shows that middle last century, some parts of the year actually had a drop in temperature. In contrast, November has gradually been getting warmer since middle last century. The new century has seen quite an increase. 
myyears <- r3[r3$yearn<1925,]
m13 <- filter(myyears,daynon<30) %>%
    mutate(.,daynon=daynon+365)
m0 <- filter(myyears,daynon>335) %>%
    mutate(.,daynon=daynon-365)
myyears <- rbind_list(m0,myyears,m13)

nn <- .2
mymod <- locfit(TG ~ lp(daynon,nn=nn),
    data=myyears)
topred <- data.frame(daynon=1:365)
topred$pp <- predict(mymod,topred)
#plot(pp~ daynon,data=topred)

r5 <- merge(r3,topred) %>%
    mutate(.,tdiff=(TG-pp)/10) %>%
    select(.,tdiff,daynon,yearn)
m13 <- filter(r5,daynon<30) %>%
    mutate(.,daynon=daynon+365,
        yearn=yearn-1)
m0 <- filter(r5,daynon>335) %>%
    mutate(.,daynon=daynon-365,
        yearn=yearn+1)
r6 <- rbind_list(m0,r5,m13)

topred <- expand.grid(
    daynon=seq(1:365),
    yearn=1901:2014)
topred$pp2 <- locfit(
        tdiff ~ lp(yearn,daynon,nn=nn),
        data=r6) %>%
    predict(.,topred)
#topred <- arrange(topred,daynon,yearn)

myz <- matrix(topred$pp2,ncol=365)
zmin <- floor(min(topred$pp2)*10)/10
zmax <- ceiling(max(topred$pp2)*10)/10
myseq <- seq(zmin,zmax,.1)
par(mar=c(5,4,4,6))
image(myz,useRaster=TRUE,
    axes=FALSE,frame.plot=TRUE,
    col=colorRampPalette(c('blue','red'))(length(myseq)-1),
    breaks=myseq)
axis((seq(10,114,by=10)-1)/113,labels=seq(1910,2010,by=10),side=1)
axis((mylegend$daynon-1)/365,labels=mylegend$month,side=2)
color.legend(1.1,0,1.2,1,legend=c(zmin,zmax),gradient='y',
    rect.col=colorRampPalette(c('blue','red'))(length(myseq)-1))

Sunday, November 23, 2014

When should I change to snow tires in Netherlands

The Royal Netherlands Meteorological Institute has weather information by day for a number of Dutch stations. In this post I want to use those data for a practical problem: when should I switch to winter tires? (or is that snow tires? In any case nails or metal studs are forbidden in Netherlands). Netherlands does not have laws prescribing winter tires, it does not get bad enough, though there are some regulations in neighboring countries, which is of relevance for people driving to winter sports. For others, it is up to themselves.

Data

Data are acquired from KNMI. They have various sets of data, this page has a selection form which leads to the data used today. The data comes with a header explaining details, unfortunately in Dutch. Of relevance for this post are TG and TN, average and minimum temperature in 0.1 C. Station used is de Bilt, where they got most data.

Analysis 

It seems under 7 C  there is advantage to using winter tires. Temperature varies within the day, but in general the coldest part should be just before dawn which is morning rush hour. Warmest part around noon, with evening rush hour already cooling down. Based on this I decided that days with an average temperature of 7 C or less, or a minimum temperature of 0 C or less benefit from winter tires. In addition I chose the period 1980 to 2013.

Result

It seems October is way too early to switch. Second half of November is an appropriate time.


Code

library(plyr)
library(dplyr)
library(ggplot2)
# data prepared without text heading
r1 <- read.csv('KNMI_20141115.edited.txt')
Sys.setlocale(category = "LC_TIME", locale = "C") # Not NL months
r2 <- mutate(r1,
    date = as.Date(format(YYYYMMDD),'%Y%m%d'),
    month =factor(months(date,abbreviate=TRUE),
        levels=months(as.Date(
                paste('2014',
                    formatC(1:12,digits=2,width=2,flag='0'),
                    '01',sep='-')),
            abbreviate=TRUE)),
    yearf=factor(format(date,'%Y')),
    yearn=as.numeric(substr(YYYYMMDD,1,4)),
    day=format(date,'%e'))
# days number 1 to 365, using numbers from 1901

days <- filter(r2,yearn==1901) %>%
    mutate(.,dayno=format(date,'%j') ) %>%
    select(.,month,day,dayno)

# select correct years and Months, remove leap day to sync day numbers
# and flag selected days
r3 <- merge(r2,days,all=TRUE) %>%
    filter(.,!grepl('0229',YYYYMMDD)) %>%
    mutate(.,daynon=as.numeric(dayno)) %>%
    filter(.,yearn>1980 & yearn<2014 & month %in% c('Oct','Nov','Dec')) %>%
    mutate(.,wt = TN<0 | TG<70 ) %>%
    mutate(.,month=factor(month,levels=c('Oct','Nov','Dec')))

# count the days
r4 <- as.data.frame(xtabs(wt ~ dayno,data=r3)/length(unique(r3$yearn)))
r4$daynon <- as.numeric(as.character(r4$dayno))

# plot

mylegend <- select(r3,day,month,daynon,dayno) %>%
    unique(.) %>%
    filter(.,day ==' 1')
g1 <- ggplot(r4,aes(x=daynon,y=Freq))
g1 + geom_smooth()  +
    geom_point() +
    scale_x_continuous('Date',breaks=mylegend$daynon,labels=mylegend$month) +
    ylab('Proportion wintery')