Sunday, July 28, 2013

A JAGS calculation on pattern of rain January 1906-1915 against January 2004-2013

Two weeks ago I showed rain data from six stations in the Netherlands from 1906 till now. Last week I showed that the frequency of days with and without rain differed between December 1906-1915 and December 2003-2012. This week I am modelling the same data as t-distributed with a left truncation at 0 mm rain. This is very easy to set up in JAGS. The posterior distribution suggests the mean amount of rain moved about 1 mm upwards and the standard deviation became about half a mm smaller, but both 95% intervals include 0.

The data

As described before, the data are from the KNMI, the Royal Netherlands Meteorological Institute. The actual page for the data is http://www.knmi.nl/klimatologie/monv/reeksen/select_rr.html. The data selected were from 6 locations: De Bilt, Groningen, Heerde, Kerkwerve, Ter Apel and Winterswijk (Sibinkweg). The blog starts with data entered in a data.frame named all, which was derived two weeks ago. The data selected are again every fifth day in January 1906-1915 against January 2004-2013.

The model

The model is almost off the shelf: truncation via dinterval, and two t-distributions with different means and standard deviations. The priors are taken from Data Analysis Using Regression and Multilevel/Hierarchical Models by Gelman and Hill.
sel1 <- all[(all$year <1916 | all$year>2003) 
        & all$Month=='Jan' 
        & all$location=='DE-BILT'
        & all$day %in% seq(1,31,by=5),]
sel1$decade <- factor(c('1906-1915','2004-2013')[1+(sel1$year>1950)]) 

datain <- list(
    # 0 = dry day (RD==0), treated as left-censored; 1 = observed rain
    isCensored = 1-as.numeric(sel1$RD==0),
    N = nrow(sel1),
    # rain in mm; censored (dry) days become NA and are imputed by JAGS
    RD = ifelse(sel1$RD==0,NA,sel1$RD/10),
    period = 1+(sel1$year>1950)
)

model1 <- function() {
  for ( i in 1:N ) {
    # dinterval(RD,0) is 0 when RD <= 0: dry days get an imputed value below 0
    isCensored[i] ~ dinterval( RD[i] , 0 )
    RD[i] ~ dt( mu[period[i]] , tau[period[i]] , nu ) 
  }
  for (i in 1:2) {
    tau[i] <- 1/pow(sigma[i],2)
    sigma[i] ~ dlnorm(sigmamu,sigmatau)
    mu[i] ~ dnorm(mumu,mutau)
  }
  # weakly informative hyperpriors, following Gelman and Hill
  mumu ~ dnorm(0,1e-6)
  mutau <- 1/pow(musigma,2)
  musigma ~ dunif(0,100)
  sigmamu ~ dnorm(0,1e-6)
  sigmatau <- 1/pow(sigmasigma,2)
  sigmasigma ~ dunif(1,100)
  # derived quantities: differences between the two decades
  mudif <- mu[2]-mu[1]
  sigmadif <- sigma[2]-sigma[1]
  # uniform prior on 1/nu; the upper bound 0.5 keeps nu above 2
  nu <- 1/nuinv 
  nuinv ~ dunif(0,.5)
}
params <- c('mu','sigma','mudif','sigmadif','nu')
inits <- function() {
  list(mumu = rnorm(1,0),
      musigma = runif(1,1,2),
      sigmamu = rnorm(1,0),
      sigmasigma = runif(1,1,2),
      nuinv = runif(1,0,.5))
}

library(R2jags)
jagsfit <- jags(datain, model=model1, inits=inits, 
    parameters=params,progress.bar="gui",
    n.chains=1,n.iter=10000,n.burnin=3000,n.thin=3)
jagsfit
Inference for Bugs model at "C:/Users/.../Local/Temp/Rtmp4ydV3G/modele0c40226a9c.txt", fit using jags,
 1 chains, each with 10000 iterations (first 3000 discarded), n.thin = 3
 n.sims = 2334 iterations saved
         mu.vect sd.vect    2.5%     25%     50%     75%   97.5%
mu[1]      0.026   0.480  -0.925  -0.284   0.042   0.332   0.951
mu[2]      0.917   0.361   0.250   0.665   0.900   1.152   1.658
mudif      0.891   0.615  -0.231   0.449   0.880   1.315   2.109
nu         2.331   0.348   2.006   2.085   2.212   2.465   3.290
sigma[1]   2.980   0.597   2.018   2.555   2.913   3.327   4.361
sigma[2]   2.309   0.395   1.642   2.024   2.274   2.552   3.197
sigmadif  -0.671   0.689  -2.108  -1.098  -0.622  -0.214   0.588
deviance 509.412   9.431 492.173 502.886 508.990 515.454 528.740

DIC info (using the rule, pD = var(deviance)/2)
pD = 44.5 and DIC = 553.9
DIC is an estimate of expected predictive error (lower deviance is better).
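
The pD rule quoted above can be checked directly against the saved draws, and the same draws give a direct posterior probability that the mean increased. A sketch; BUGSoutput$sims.list is where R2jags keeps the saved samples.
# half the variance of the saved deviance draws reproduces pD
var(jagsfit$BUGSoutput$sims.list$deviance)/2
# posterior probability that mean rain increased between the decades
mean(jagsfit$BUGSoutput$sims.list$mudif > 0)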

Interpretation

The mean in the first decade is very close to 0, with an increase to about 1 mm in the last decade. The difference is just 0.9 mm, a very small increase, and 0 is just within the posterior 95% interval.
The standard deviation decreases from the first to the last decade. It is almost as if the rain becomes more predictable. Again, the 95% posterior interval of the difference includes 0.
The degrees of freedom of the t-distribution is between two and three, so the distribution is quite heavy tailed. Hence, while almost half of the days have no rain, half of the days have a little rain (a few mm, certainly not more than 6 mm) and a few days have much more rain.
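
To make the heavy tail concrete: plugging the posterior medians for the last decade (mu about 0.9, sigma about 2.3, nu about 2.2) into the left-truncated t gives roughly a one-in-ten chance of more than 6 mm on a day that has rain. A rough sketch of that calculation:
mu <- 0.9; sigma <- 2.3; nu <- 2.2
p0 <- pt((0-mu)/sigma, df=nu)   # mass cut off below the truncation point
p6 <- pt((6-mu)/sigma, df=nu)
(1-p6)/(1-p0)                   # P(rain > 6 mm | rain > 0 mm)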

Sunday, July 21, 2013

Did the pattern of rain change in the last 100 years?

Last week I showed rain data from six stations in the Netherlands from 1906 till now. The obvious next question is: did it change? A surprisingly difficult question. The data are not normally distributed, and they are correlated in time and between locations.

The data

As described before, the data are from the KNMI, the Royal Netherlands Meteorological Institute. The actual page for the data is http://www.knmi.nl/klimatologie/monv/reeksen/select_rr.html. The data selected were from 6 locations: De Bilt, Groningen, Heerde, Kerkwerve, Ter Apel and Winterswijk (Sibinkweg). The blog starts with data entered in a data.frame named all which was derived last week.

The problems with the data

The data are certainly not normal; most days have no rain. The data are also location correlated. It is improbable to have beautiful weather in one location and loads of rain 200 km away, possibly with the exception of summer, when thunderstorms may be highly local. It is also uncommon to have nice weather one day and loads of rain the next. For those not living here, examine this plot of the 1906 data.
library(ggplot2)
Y1906 <- all[all$year==1906,]
c1 <- ggplot(Y1906, aes(x=day, y=RD,col=location,linetype=location))
c1 + geom_line() + 
    facet_wrap( ~ Month) +
    theme(legend.position = "bottom") +
    labs(col = " ",linetype=' ')

Analysis

There are many ways to analyze the data for differences. In this case I wanted to compare the first and last 10 years of data. A non-parametric method is preferred because of the non-normality. That leaves all the other problems: correlations between observations. To avoid these I tried to eliminate data. Just one location avoids correlation between locations. Data from every fifth day avoids time correlation. De Bilt is chosen as location because it is in the centre of the country and home of the KNMI. Finally, only January, because using a single month seems most opportune. It seems quite crude when you look at it.
sel1 <- all[(all$year <1916 | all$year>2003) 
        & all$Month=='Jan' 
        & all$location=='DE-BILT'
        & all$day %in% seq(1,31,by=5),]
sel1$decade <- factor(c('1906-1915','2004-2013')[1+(sel1$year>1950)]) 
Package coin has a oneway_test function which seems suitable. Setting distribution to 'approximate' makes it do an actual Monte Carlo resampling, and the result is no difference.
library(coin)
oneway_test(RD ~ decade,data=sel1,distribution='approximate')
Approximative 2-Sample Permutation Test

data:  RD by decade (1906-1915, 2004-2013)
Z = -0.6353, p-value = 0.535

alternative hypothesis: true mu is not equal to 0
oneway_test tests for location; the Kolmogorov-Smirnov test detects general differences between distributions. The assumption is that the data are continuous, and I am not so sure that holds. Result: a p-value close to 0.05.
ks.test(sel1$RD[sel1$decade=='1906-1915'],
    sel1$RD[sel1$decade=='2004-2013'])
        Two-sample Kolmogorov-Smirnov test

data:  sel1$RD[sel1$decade == "1906-1915"] and sel1$RD[sel1$decade == "2004-2013"]
D = 0.2286, p-value = 0.05161
alternative hypothesis: two-sided
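
The many zeroes mean ties, so this p-value is only approximate. A permutation version of the KS statistic side-steps the continuity assumption; a sketch (suppressWarnings silences the ties warning on each resample):
D.obs <- suppressWarnings(ks.test(sel1$RD[sel1$decade=='1906-1915'],
    sel1$RD[sel1$decade=='2004-2013'])$statistic)
D.perm <- replicate(2000, {
  dec <- sample(sel1$decade)   # reshuffle the decade labels
  suppressWarnings(ks.test(sel1$RD[dec=='1906-1915'],
      sel1$RD[dec=='2004-2013'])$statistic)
})
mean(D.perm >= D.obs)          # permutation p-value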

The question is why there are differences. After plotting, it seems the number of days without rain changed. The latter shows a significant difference.
c1 <- ggplot(sel1, aes(x=RD/10,col=decade))
c1 + geom_density() + xlab('mm rain/day')
xtabs(~ decade + factor(RD==0),data=sel1)
           factor(RD == 0)
decade      FALSE TRUE
  1906-1915    37   33
  2004-2013    53   17
prop.test(xtabs(~ decade + factor(RD==0), data=sel1))

2-sample test for equality of proportions with continuity correction

data:  xtabs(~decade + factor(RD == 0), data = sel1)
X-squared = 7, df = 1, p-value = 0.008151
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.3970179 -0.0601250
sample estimates:
   prop 1    prop 2 
0.5285714 0.7571429 
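
As an extra check (not part of the original analysis), Fisher's exact test on the same table gives a similar result:
fisher.test(xtabs(~ decade + factor(RD==0), data=sel1))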

Check with December data

Since January was used to select the analysis method, ideally it should not also deliver the p-value. There is enough data; December turns out to show a p-value of 0.005.
sel2 <- all[(all$year <1916 | all$year>2002) 
        & all$Month=='Dec' 
        & all$location=='DE-BILT'
        & all$day %in% seq(1,31,by=5),]
sel2$decade <- factor(c('1906-1915','2003-2012')[1+(sel2$year>1950)]) 
prop.test(xtabs(~ decade + factor(RD==0), data=sel2))

2-sample test for equality of proportions with continuity correction

data:  xtabs(~decade + factor(RD == 0), data = sel2)
X-squared = 7.7712, df = 1, p-value = 0.005309
alternative hypothesis: two.sided
95 percent confidence interval:
 -0.36463362 -0.06393781
sample estimates:
   prop 1    prop 2 
0.6571429 0.8714286 

Sunday, July 14, 2013

Rain in Netherlands during the past 100 years

Climate has my interest. But discussions on climate change seem to be focused on temperature. In real life, we look at temperature, rain, sunshine and wind. I was therefore happy to find a load of rain data at the Royal Netherlands Meteorological Institute (KNMI). In this post I plot some of the data.

Data

The actual page for the data is http://www.knmi.nl/klimatologie/monv/reeksen/select_rr.html. The data selected were from 6 locations: De Bilt, Groningen, Heerde, Kerkwerve, Ter Apel and Winterswijk (Sibinkweg). The reason for these specific stations is the length of the series; they start January 1906 and are still continuing. The header of each file reads (retaining only English and two lines of data):
THESE DATA CAN BE USED FREELY PROVIDED THAT THE FOLLOWING SOURCE IS ACKNOWLEDGED:
ROYAL NETHERLANDS METEOROLOGICAL INSTITUTE

STN      = stationnumber
YYYYMMDD = date (YYYY=year MM=month DD=day)
RD       = daily precipitation amount in 0.1 mm over the period 08.00 preceding day - 08.00 UTC present day
SX       = code for the snow cover at 08.00 UTC:

           code                           snow cover
           1                                    1 cm
           ...                                   ...
           996                                996 cm

           997 broken snow cover < 1 cm
           998 broken snow cover >=1 cm
           999 snow dunes

5 spaces represents a missing value

STN,YYYYMMDD,   RD,   SX,
745,19060101,    0,     ,
It is a bit inconvenient to have a comma at the end of the data lines; R expects more data after a comma. So as manual pre-processing I removed the header up to the actual column headers and added dummy as a final variable. The net effect is that the data can be read via read.csv.
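The manual step could also be scripted; a sketch (readKNMI is my own name for it, and it assumes the raw files still contain the original header):
readKNMI <- function(f) {
  txt <- readLines(f)
  # locate the column header line, then read the data below it;
  # the trailing comma becomes the empty 'dummy' column
  hdr <- grep('^STN,YYYYMMDD', txt)[1]
  read.csv(f, skip=hdr, header=FALSE,
      col.names=c('STN','YYYYMMDD','RD','SX','dummy'))
}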
files <- dir(pattern='.csv$')
# extract link station number and location name 
splits <- do.call(rbind,strsplit(files,'_|\\.'))
splits <- data.frame(STN=splits[,3],location=splits[,2])
#read and combine data
all <- lapply(files,read.csv)
all <- do.call(rbind,all)
all <- merge(all,splits) 
all$Date <- as.Date(as.character(all$YYYYMMDD),format='%Y%m%d')
As in many countries we have seasons, so I want to examine the data by month (in the proper order). For the blog, I want the months to be in English. (The formats such as %b and %d are documented in strptime; I always have difficulty finding them, so I am writing that down too.)
Sys.setlocale(category = "LC_TIME", locale = "C")
all$Month <-  format(all$Date,'%b')
all$Month <- factor(all$Month,unique(all$Month))
all$Day <-  format(all$Date,'%d')
all$day <- as.numeric(all$Day)
all$Year <- format(all$Date,'%Y')
all$year <- as.numeric(all$Year)

Plotting the data

There are 39263 time points, which is too much for one plot. So I aggregated the data over locations and days, per month and year. Four aggregate variables are used: mean and standard deviation of the amount of rain, the 75% quantile and the percentage of days with rain. There were seven records without data (April 1952, at Heerde), hence na.rm=TRUE. For the smoother a 0.5 level was used rather than 0.95; by lowering the level it was possible to add the fourth aggregate variable.
rm <- aggregate(all$RD,list(Month=all$Month,
      Year=all$Year,
      year=all$year),
    function(x) mean(x,na.rm=TRUE))
rm$stat <- 'mean (0.1 mm)'
rpropd <- aggregate(all$RD,list(Month=all$Month,
      Year=all$Year,
      year=all$year),
    function(x) 100*mean(x>0,na.rm=TRUE))
rpropd$stat <- '% days'
sd <- aggregate(all$RD,list(Month=all$Month,
      Year=all$Year,
      year=all$year),
    function(x) sd(x,na.rm=TRUE))
sd$stat <- 'sd (0.1 mm)'


qu75 <- aggregate(all$RD,list(Month=all$Month,
      Year=all$Year,
      year=all$year),
    function(x) quantile(x,0.75,na.rm=TRUE))
qu75$stat <- '75 % quantile (0.1 mm)'
summ <- rbind(rm,sd,rpropd,qu75)
c1 <- ggplot(summ, aes(x=year, y=x,col=stat))
c1 + stat_smooth(method='loess',level=0.5) +
   facet_wrap( ~ Month) +
   theme(legend.position = "bottom") +
   labs(col = "Statistic") +
   geom_vline(xintercept = 1956,col='dark grey')

I never knew the weather was this depressing in winter: not only is there little daylight (sunrise 8:47, sunset 16:33 on Christmas) but there is a 60% chance of rain on any day in the three dark months (November, December, January). Luckily first impressions are a bit overdone; a day with rain is not necessarily a day with rain while you are going somewhere. An amount of 20 (in units of 0.1 mm) equates to 2 mm per day (or 3 on the rainy days), which is not a lot. 
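Those numbers can be checked directly against the raw data; a quick sketch using the all data.frame:
dark <- all$Month %in% c('Nov','Dec','Jan')
mean(all$RD[dark] > 0, na.rm=TRUE)               # chance of a wet day, ~0.6
mean(all$RD[dark], na.rm=TRUE)/10                # mean mm per day
mean(all$RD[dark & all$RD > 0], na.rm=TRUE)/10   # mean mm on wet days only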
The question of climate change implies there is a standard. The beginning of the last century, is that early enough? On the other hand, in these 100 years CO2 levels increased a lot, and in Europe and the US smog is mostly a thing of the past (the UK Clean Air Act is from 1956, the vertical line, but more recently smog increased in South East Asia). There are quite a lot of shifts mid 20th century; can they be explained by smog? In recent years it seems rain is increasing in summer and decreasing in spring (although 2013 had a terrible March).

Plot with quantiles

To get a slightly better handle on the distributions I made a plot with quantiles. Rain gets more extreme in July and August. Other months show less change but are increasing, possibly with decreasing rain in spring.
qu <- lapply(seq(.5,.9,.1), function(qq) {
      aa <- aggregate(all$RD,list(Month=all$Month,
          Year=all$Year,
          year=all$year),
        function(x) quantile(x,qq,na.rm=TRUE))
      aa$stat <- paste(round(100*qq),'%')
      aa
    })

sumq <- do.call(rbind,qu)
c1 <- ggplot(sumq, aes(x=year, y=x/10,col=stat))
c1 + stat_smooth(method='loess',level=.5) + 
    facet_wrap( ~ Month) +
    theme(legend.position = "bottom") +
    labs(col = "Quantile") + ylab('rain per day (mm)') +
    geom_vline(xintercept = c(1956),col='dark grey')

Location

I did not think location would make a difference in such a small country. Indeed it is all so close in the figure that I decided not to use the confidence bands. For those interested: Heerde is near Zwolle, in the centre of the Netherlands, while Kerkwerve is near Zierikzee (Zeeland), in the southwest, and it appears to be the better part of the Netherlands for a summer vacation.
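The data frame rml used below was not shown above; presumably it is the per-location analogue of rm, reconstructed here as a sketch:
rml <- aggregate(all$RD,list(Month=all$Month,
      Year=all$Year,
      year=all$year,
      location=all$location),
    function(x) mean(x,na.rm=TRUE))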
c1 <- ggplot(rml, aes(x=year, y=x,col=location,linetype=location))
c1 + stat_smooth(method='loess',se=FALSE) + 
    facet_wrap( ~ Month) +
    theme(legend.position = "bottom") +
    labs(col = " ",linetype=' ') +
    geom_vline(xintercept = c(1956),col='dark grey')

Sunday, July 7, 2013

change in weight of cars plot

Based on last week's faster algorithm I wanted to finish with the car weights. Unfortunately, a fail again. By now it is a fail of my own making; it needs a bit more dedication and grunt than I am willing and able to give for this blog. This week I added some extra functions around the existing functions so I could harvest final results pretty easily. But the results seemed a bit odd in places, so I ran the same again, and the second time around they were a bit different. Nothing that seems unsolvable with a bit more attention: manually checking whether a stable sampler has been obtained, maybe tuning the number of normal distributions which make up the combined distribution.
Having said the negative, the pictures can be interpreted. The car market has changed from a double peaked distribution to a three peak distribution. Sales in recent years are mostly in the lowest weight category. The weight of cars in this category is slowly increasing though.

Data

Data were obtained as described here.  I made an additional plot of the weights. These are the raw data weighted by the width of the categories.
weight4 <- weight2[weight2$RefYear == weight2$BuildYear+1,]
weight4$lower <- lweightcats[weight4$Onderwerpen_2]
weight4$upper <- uweightcats[weight4$Onderwerpen_2]
datashow <- expand.grid(year=2000:2012,
    weight=seq(500,2000,by=20))
datashow$y <-0
for (ii in 1:nrow(weight4)) {
  datashow$y[datashow$weight>=weight4$lower[ii] 
           & datashow$weight<=weight4$upper[ii] 
           & datashow$year==weight4$BuildYear[ii]] <-
    weight4$Waarde[ii]*100/(weight4$upper[ii]-weight4$lower[ii])         
}
library(lattice)
levelplot(y ~ weight+ year , data=datashow,   col.regions=
        colorRampPalette(c('white','yellow','green','blue','purple','red'))
)
xyplot(y ~ weight | factor(year) , data=datashow,type='l')

Modelling

I worked on the previous code to obtain something which was very short and wrapped in a lapply. This means I had to trust the outcome, which I thought was reasonable. 
la <- lapply(2000:2012,function(x) {
      weight <- makdata(x)
      # version6() looks up weightlims in the global environment, hence <<-
      weightlims <<- c(0,weight$upper)
      update <- updater(weight)
      dist <- update2line(weight,update)
      dist$year <- x
      dist
    }
 ) 
rbla <- do.call(rbind,la)
levelplot(y ~ x+ year , data=rbla[rbla$x %in% seq(700,1600,10),],
   col.regions=
  colorRampPalette(c('white','yellow','green','blue','purple','red'))
)
The nice thing about this plot is that it clearly shows three peaks within each year and a slow upward drift of the peak locations over the years. 2009 looks a bit odd though. The lowest category existed perhaps from 2003, but really took off in 2010. In 2009 it also seems the heaviest cars got less popular, but this seems to be reverting slightly in 2011.

model flaws

So I did the calculations again, resulting in rbla2, because I found 2009 odd compared to the rest. For comparison the next plot:
rbla$sim <- 1
rbla2$sim <- 2
rblaall <- rbind(rbla,rbla2)
xyplot(y ~ x | factor(year), group=sim, type='l',
    data=rblaall[rblaall$x %in% seq(700,1600,10),])
A number of panels have clearly separated lines, indicating different results between the two runs. It seems the results of 2000, 2002 and 2009 are different, hence they need a bit more attention.
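A quick way to quantify that disagreement per year (a sketch using the rblaall data from above; the x grids of the two runs are identical, so the lines align by position):
sapply(split(rblaall, rblaall$year), function(d)
    max(abs(d$y[d$sim==1] - d$y[d$sim==2])))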

R-code

weight1 <- read.csv2('Motorvoertuigen__per_010613140907.csv',na.strings='-')
weight2 <- weight1[!is.na(weight1$Waarde),]
weight2$BuildYear <- as.numeric(sub('Bouwjaar ','',as.character(weight2$Bouwjaren)))
weight2$RefYear <- as.numeric(sub(', 1 januari','',as.character(weight2$Peildatum)))
weightcats <- levels(weight2$Onderwerpen_2)
weightcats <- gsub('en meer','and more',weightcats)
levels(weight2$Onderwerpen_2) <- weightcats 
lweightcats <- as.numeric(gsub('( |-).*$','',weightcats))
uweightcats <- as.numeric(gsub('(^[0-9]+-)|([ [:alpha:]]+$)','',weightcats))
uweightcats[uweightcats==2451] <- 3500

nspacejump1 <- function(x,i,sd) {
  xold <- log(x[i]/(1-x[i])) + rnorm(1,0,sd)
  xnew <- 1/(1+exp(-xold))
  x <- x/sum(x[-i])*(1-xnew)
  x[i] <- xnew
  x
}
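
# Sanity check (a sketch): nspacejump1 proposes a new value for weight i on
# the logit scale and rescales the other weights, so the vector must stay
# positive and keep summing to 1.
w <- rep(.25,4)
for (k in 1:100) w <- nspacejump1(w, sample(4,1), .1)
sum(w)   # stays 1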

curldev5 <- function(counts,curmodel,pemodel) {
  # log prior for the component standard deviations
  dsd <- sum(dnorm(curmodel$sd,300,300,log=TRUE))
  # expected proportion per weight category under the mixture
  prop.exp <- pemodel  %*% curmodel$w
  prop.exp <-  prop.exp/sum(prop.exp)
  # multinomial log-likelihood written out; counts$lgam caches lgamma(Waarde+1)
  dsd + lgamma(sum(counts$Waarde) + 1) + sum(counts$Waarde * log(prop.exp) - counts$lgam)
  #dmultinom(counts$Waarde,prob=prop.exp,log=TRUE)
}

version6 <- function(weight3,curmodel,niter) {
  # probability mass per weight category for each mixture component
  pemodel <- matrix(0,nrow=22,ncol=nrow(curmodel))
  pemodel2 <- pemodel
  for (i in 1:nrow(curmodel)) {
    pn <- pnorm(weightlims,curmodel$mean[i],curmodel$sd[i])
    pemodel[,i] <- (pn[-1]-pn[-23])
  }
  cnow <- curldev5(weight3,curmodel,pemodel)
  ndist <- nrow(curmodel)
  result <- matrix(nrow=ndist*niter,ncol=3*ndist+1)
  index <- 1
  for (iter in 1:niter) {
    # Metropolis step per mixture component: perturb mean, sd and weight
    for (dist in 1:ndist) {
      nowmodel <- curmodel
      nowpe <- pemodel
      result[index,] <- c(curmodel$mean,curmodel$sd,curmodel$w,cnow)
      curmodel$mean[dist] <- curmodel$mean[dist] + rnorm(1,0,1)
      curmodel$sd[dist] <- curmodel$sd[dist] * runif(1,1/1.01,1.01)
      curmodel$w <- nspacejump1(curmodel$w,dist,.1)
      pn <- pnorm(weightlims,curmodel$mean[dist],curmodel$sd[dist])
      pemodel[,dist] <- (pn[-1]-pn[-23])
      cnew <- curldev5(weight3,curmodel,pemodel)
# cat(c(cnow,cnew,'\n'))  
      # accept the proposal, or restore the previous state
      if (exp(cnew-cnow) > runif(1) ) cnow <- cnew
      else {curmodel <- nowmodel
        pemodel <- nowpe }
      index <- index + 1
    } 
  }
  return(list(result=result,curmodel=curmodel))
}
# starting values: a mixture of five normal components (mean, sd, weight)
curmodel5 <- data.frame(
    mean=c( 892.0141263, 1.417520e+03, 1306.8248062, 1.939645e+03,1000),
    sd  =c(  99.5660288, 2.129247e+02,  194.2452168, 1.684932e+02,100),
    w   =c(   0.2252041, 4.384217e-02,    0.6125014, 1.845233e-02,.1))


makdata <- function(xxxx) {
  weightxxxx <- weight2[weight2$RefYear==xxxx+1 & weight2$BuildYear==xxxx,c(2,6)]
  weightxxxx$lower <- lweightcats[weightxxxx$Onderwerpen_2]
  weightxxxx$upper <- uweightcats[weightxxxx$Onderwerpen_2]
  weightxxxx$lgam <- lgamma(weightxxxx$Waarde + 1)
  weightxxxx
}

updater <- function(weight,nburnin=200000,nsample=10000) {
  # burn-in, then sample; the trace of the log posterior is a rough convergence check
  update  <- version6(weight,curmodel5,nburnin)
  update <- version6(weight,update$curmodel,nsample)
  plot(update$result[,ncol(update$result)],type='l')
  return(update)
}

update2line <- function(weight,update) {
  # area of the observed histogram, used to scale the fitted density
  integral <- sum((weight$upper-weight$lower)*weight$Waarde)
  x <- seq(0,3500,10)
  ysum <- rep(0,length(x))
  # average the mixture density over every 100th saved sample
  for(i in seq(1,10000,100)) {
    ndist <- nrow(update$curmodel)
    y <- rep(0,length(x))
    newmodel <- data.frame(mean = update$result[i,1:ndist],
        sd=update$result[i,ndist+(1:ndist)],
        w=update$result[i,2*ndist+(1:ndist)])
    for (ir in 1:nrow(newmodel))
      y <- y +  newmodel$w[ir]*dnorm(x,newmodel$mean[ir],newmodel$sd[ir])
    ysum <- ysum+y
  }
  data.frame(x=x,y=integral*ysum/sum(y))
}