Tuesday, February 2, 2016

Unemployment in Europe

For a couple of years now I have made plots of unemployment and its change over the years. At first this was a bigger and more complex piece of code. As things have progressed, the code can now be pretty concise: there are plenty of packages to do the heavy lifting. So this year I tried to make the code easy to read and reasonably documented.


Data is from Eurostat. Since we have the joy of the eurostat package, suffice it to say this is dataset une_rt_m. Since the get_eurostat function gave me codes for things such as country and gender, the first step is to use a dictionary to decode these. Subsequently, the country names are sanitized a bit and the data is selected.

library(eurostat) # get_eurostat() and get_eurostat_dic()
library(dplyr)
library(ggplot2)
library(scales) # to access breaks/formatting functions

r1 <- get_eurostat('une_rt_m') %>%
    mutate(.,geo=as.character(geo)) # character preferred for merge
r2 <- get_eurostat_dic('geo') %>%  # V1 holds the code, V2 the label
    rename(.,geo=V1) %>%
    mutate(.,
# part of country name within braces removed        
        country=gsub(' \\(.*\\)$','',V2),
        country=gsub(' $','',country),
        country=ifelse(geo=='EA19',paste(country,'(19)'),country)) %>%
    select(.,geo,country) %>%
    right_join(.,r1) %>%
# keep only total, drop sexes
    filter(.,sex=='T') %>%
# filter out old Euro area and keep only EU28, EA19    
    filter(.,!grepl('EA..',geo) | geo=='EA19') %>% 
    filter(.,!(geo %in% c('EU15','EU25','EU27')) ) %>%         
# SA is seasonally adjusted    
    filter(.,s_adj=='SA') %>% 
    mutate(.,country=factor(country))


To make the plots I want smoothed data. Ggplot can do this, but I prefer the same smoothing for all curves, hence it is done before entering ggplot. There are rather many countries, so the number is reduced to 36, displayed in three plots of 3*4 panels, for countries with low, middle and high maximum unemployment respectively. The smoother is applied twice: once for the smoothed data, once for its first derivative. The derivative is forced to be more smooth, to avoid extreme fluctuations.
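Since locpoly does the heavy lifting below, here is a self-contained toy illustration of pre-smoothing a series and its first derivative; the data, bandwidths and grid size here are arbitrary choices of mine, not the values used for the unemployment series:

```r
library(KernSmooth)

set.seed(1)
x <- 1:200
y <- sin(x/30) + rnorm(200, sd=0.1)

# smoothed series: locpoly returns a list with x (grid) and y (estimate)
sm <- locpoly(x=x, y=y, bandwidth=10, gridsize=200)
# first derivative, with a larger bandwidth to keep it smooth
d1 <- locpoly(x=x, y=y, drv=1, bandwidth=25, gridsize=200)

length(sm$y)  # 200 points on the evaluation grid
```

The point of the larger bandwidth for drv=1 is exactly what is described above: derivatives amplify noise, so they need heavier smoothing.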

library(KernSmooth) # locpoly()
library(plyr)       # ddply()
# add 3 categories for the 3 3*4 displays
r3 <- aggregate(r2$values,by=list(geo=r2$geo),FUN=max,na.rm=TRUE) %>%
    mutate(.,class=cut(x,quantile(x,seq(0,3)/3),
            include.lowest=TRUE,
            labels=c('low','middle','high'))) %>%
    select(.,-x) %>% # maxima not needed any more
    right_join(.,r2)

#locpoly to make smooth same for all countries
# bandwidths (in days) chosen by eye
Perc <- ddply(.data=r3,.variables=.(age,geo), 
    function(piece,...) {
      piece <- piece[!is.na(piece$values),]
      lp <- locpoly(x=as.numeric(piece$time),y=piece$values,
          bandwidth=90,gridsize=nrow(piece))
      sdf <- data.frame(Date=as.Date(lp$x,origin='1970-01-01'),
          sPerc=lp$y,
          class=piece$class[1],
          country=piece$country[1])
      sdf
    })

# locpoly for derivative too, forced more smooth
dPerc <- ddply(.data=r3,.variables=.(age,geo), 
    function(piece,...) {
      piece <- piece[!is.na(piece$values),]
      lp <- locpoly(x=as.numeric(piece$time),y=piece$values,
          drv=1,bandwidth=365,gridsize=nrow(piece))
      sdf <- data.frame(Date=as.Date(lp$x,origin='1970-01-01'),
          dPerc=lp$y,
          class=piece$class[1],
          country=piece$country[1])
      sdf
    })
The plots are processed by subsection.
for (i in c('low','middle','high')) {
  g <- filter(Perc,class==i) %>%
      ggplot(.,
          aes(x=Date,y=sPerc,colour=age)) +
      facet_wrap( ~ country, drop=TRUE) +
      geom_line()  +
      theme(legend.position = "bottom")+
      ylab('% Unemployment') + xlab('Year') +
      scale_x_date(breaks = date_breaks("5 years"),
          labels = date_format("%y")) 
  print(g)
}
for (i in c('low','middle','high')) {
  g <- filter(dPerc,class==i) %>%
      ggplot(.,
          aes(x=Date,y=dPerc,colour=age)) +
      facet_wrap( ~ country, drop=TRUE) +
      geom_line()  +
      theme(legend.position = "bottom")+
      ylab('Change in % Unemployment') + xlab('Year')+
      scale_x_date(breaks = date_breaks("5 years"),
          labels = date_format("%y"))
  print(g)
}


In general, things are improving, which is good news, though there is still a way to go. As always, Eurostat has a nice document by people who are certainly more knowledgeable than me on this topic.

Average unemployment

First derivative

Sunday, January 17, 2016

A simple ANOVA

I was browsing Davies' Design and Analysis of Industrial Experiments (second edition, 1967), published for ICI in times when industry did that kind of thing. It is quite an applied book. On page 107 there is an example where the variance of a process is estimated.


Data is from nine batches, from each of which three samples were selected (A, B and C), with a duplicate measurement on each sample. I am not sure about the copyright of these data, so I will not reprint them here. The problem is to determine the measurement and sampling error in a chemical process.
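The measurements themselves are not reprinted, but the layout is easy to reproduce; a sketch of the design skeleton, with my own hypothetical factor codings:

```r
# 9 batches x 3 samples x 2 duplicate measurements = 54 observations
r4_design <- expand.grid(
    rep=1:2,
    Sample=c('A','B','C'),
    batch=factor(1:9))
nrow(r4_design)  # 54
```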


At the time of writing the book, the only approach was to do a classical ANOVA and calculate the estimates from there.
aov(x~ batch + batch:Sample,data=r4) %>%
    anova
Analysis of Variance Table

Response: x
             Df Sum Sq Mean Sq  F value  Pr(>F)    
batch         8 792.88  99.110 132.6710 < 2e-16 ***
batch:Sample 18  25.30   1.406   1.8818 0.06675 .  
Residuals    27  20.17   0.747                     
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
In this case the residual variation is 0.75. The batch:Sample mean square is, due to the design, twice the sampling variance plus the residual variance. Hence the sampling variance is estimated as 0.33. How lucky we are to have tools (lme4) which can do this estimation directly. In this case, as it was a well designed experiment, these estimates are the same as those from the ANOVA.
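The arithmetic behind that 0.33 can be spelled out from the mean squares in the ANOVA table above:

```r
# variance components from the ANOVA mean squares
ms_resid  <- 0.747  # Residuals mean square
ms_sample <- 1.406  # batch:Sample mean square
# duplicate measurements imply MS(batch:Sample) = 2*sigma2_sample + sigma2_residual
sigma2_sample <- (ms_sample - ms_resid)/2
round(sigma2_sample, 2)  # 0.33
```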
library(lme4)
l1 <- lmer(x ~1+ (1 | batch) + (1|batch:Sample) ,data=r4 )
summary(l1)

Linear mixed model fit by REML ['lmerMod']
Formula: x ~ 1 + (1 | batch) + (1 | batch:Sample)
   Data: r4

REML criterion at convergence: 189.4

Scaled residuals: 
     Min       1Q   Median       3Q      Max 
-1.64833 -0.50283 -0.06649  0.55039  1.57801 

Random effects:
 Groups       Name        Variance Std.Dev.
 batch:Sample (Intercept)  0.3294  0.5739  
 batch        (Intercept) 16.2841  4.0354  
 Residual                  0.7470  0.8643  
Number of obs: 54, groups: batch:Sample, 27; batch, 9

Fixed effects:
            Estimate Std. Error t value
(Intercept)   47.148      1.355    34.8
A next step is confidence intervals around the estimates. Davies uses limits from a Chi-squared distribution for the residual variation, leading to a 90% interval of 0.505 to 1.25. In contrast, lme4 has two estimators: profile (computing a likelihood profile and finding the appropriate cutoffs based on the likelihood ratio test) and bootstrap (parametric bootstrapping, with confidence intervals computed from the bootstrap distribution according to boot.type). Each of these takes one or a few seconds on my laptop, which would not have been feasible in the pre-computer age. The estimates are different, and to my surprise more narrow:
confint(l1,level=0.9)
Computing profile confidence intervals ...
                   5 %       95 %
.sig01       0.0000000  0.9623748
.sig02       2.6742109  5.9597328
.sigma       0.7017849  1.1007261
(Intercept) 44.8789739 49.4173227

confint(l1,method='boot',level=0.9)
Computing bootstrap confidence intervals ...
                                  5 %       95 %
sd_(Intercept)|batch:Sample  0.000000  0.8880414
sd_(Intercept)|batch         2.203608  5.7998348
sigma                        0.664149  1.0430984
(Intercept)                 45.140652 49.4931109
Davies continues by estimating the ratio of sampling to residual variation, which was the best available approach at that time. This I won't repeat.

Sunday, January 3, 2016

A plot of 'Who works at home'

I ran across this post containing displays on who works from home. I must say it looks great and is interactive, but it did not help me understand the data. So I created this post to display the same data with a boring plot which might help me. For those really interested in this topic, census.gov created a .pdf which contains a full report with much more information than given here.


Data is from census.gov. I have taken the first spreadsheet. It is one of those spreadsheets with counts and percentages and empty lines to display categories. Very nice for checking some numbers, horrible to process. So, a bit of code to extract the numbers.
library(gdata) # read.xls()
r1 <- read.xls('2010_Table_1.xls',stringsAsFactors=FALSE)
# throw out percentages
r2 <- r1[,r1[4,]!='Percent']
# put all column names in one row
r2$X.6[2] <- r2$X.6[3]
r2$X.8[2] <- r2$X.8[3]
# select part with data
r3 <- r2[2:61,c(1,3,5,6)]
names(r3)[1] <- r3[1,1]
r4 <-r3[c(-1:-3),]
#eliminate one row with mean income. 
r4 <- r4[-grep('$',r4[,2],fixed=TRUE),]
#reshape in long form; columns 2:4 hold the three worker classes
# (the 'times' labels are an assumption reconstructed from later code)
r5 <- reshape(r4,
    varying=list(names(r4)[2:4]),
    v.names='count',
    timevar='class',
    times=c('Total Workers','Home Workers','Onsite Workers'),
    idvar='Characteristic',
    direction='long')
row.names(r5) <- 1:nrow(r5)
# remove ',' from numbers and make numerical values. 
# units are in 1000, so update that too
r5$count <- as.numeric(gsub(',','',r5$count))*1000
# clean up numbers used for footnotes
r5$class <- gsub('(1|2|3)','',r5$class)
#some upfront '.' removed.
r5$Characteristic <- gsub('^\\.+','',r5$Characteristic)
# create a factor
r5$Characteristic <- factor(r5$Characteristic,
   levels=rev(r5$Characteristic[r5$class=='Home Workers']))
# and create a higher level factor
r5$Mchar <- as.character(r5$Characteristic)
for (i in 1:nrow(r5)) r5$Mchar[i] <- 
   if(is.na(r5$count[i]) | r5$Mchar[i]=='Total') r5$Mchar[i] 
   else r5$Mchar[i-1]


The plot is made using old style graphics. I could not get either ggplot2 or lattice to provide the plot I wanted.
# prepare for axis labels
index <- subset(r5,r5$class=='Home Workers',c(Characteristic,Mchar))
index2 <- index[index$Characteristic!=index$Mchar | index$Characteristic=='Total',]
index3 <- index[index$Characteristic==index$Mchar & index$Characteristic!='Total',]

r6 <- merge(r5,index)
r6$class <- factor(r6$class)
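The prose above mentions old style graphics; a minimal self-contained sketch of such a plot (a horizontal dot plot on a log x-axis with category labels on the y-axis, using toy data standing in for r6) could look like:

```r
# toy stand-in for the census data (values made up purely for illustration)
r6toy <- data.frame(
    Characteristic=factor(c('Total','Male','Female'),
        levels=rev(c('Total','Male','Female'))),
    count=c(100,60,40),
    class=factor(c('Home Workers','Home Workers','Total Workers')))

plot(x=r6toy$count, y=as.numeric(r6toy$Characteristic),
    col=as.numeric(r6toy$class), pch=16,
    log='x',            # counts span orders of magnitude
    yaxt='n', xlab='Count', ylab='')
axis(2, at=seq_along(levels(r6toy$Characteristic)),
    labels=levels(r6toy$Characteristic), las=1)
```

The attraction of base graphics here is full control over the y-axis labels, which is exactly what ggplot2 refused to give (see below).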

Why did I not use ggplot2?

The ideal solution for ggplot2 might look something like this:
r7 <- r5[!is.na(r5$count),]
r7$Mchar <- factor(r7$Mchar,levels=unique(r7$Mchar))
ggplot(r7, # geom chosen as a plausible reconstruction
        aes(x=Characteristic,y=count,col=class)) +
    geom_point() +
    coord_flip() +
    facet_wrap(~Mchar,scales='free')
However, this throws an error:
Error in facet_render.wrap(plot$facet, panel, plot$coordinates, theme,  : 
  ggplot2 does not currently support free scales with a non-cartesian coord or coord_flip.
I also tried the approach described here: http://wresch.github.io/2014/05/22/aligning-ggplot2-graphs.html, but I think the content of widths has changed; I could not get that to be satisfactory.


tt <- as.data.frame(table(r7$Mchar))
tt$Freq[12] <- tt$Freq[12] +15

library(grid)   # unit()
library(gtable) # gtable_matrix()
la <- lapply(tt$Var1,function(x) {
      r8 <- r5[r5$Mchar==as.character(x) ,]
      r8 <- r8[ !is.na(r8$count),]
      ggplot(r8, # geom chosen as a plausible reconstruction
              aes(x=Characteristic,y=count,col=class)) +
          geom_point() +
          coord_flip()
    })

la[[12]] <- la[[12]] + 
    theme(plot.margin = unit(c(0.01, 0.1, 0.02, 0.1), "null"))
for (i in 1:11) la[[i]] <- la[[i]] +
    theme(axis.text.x = element_blank(),
          axis.title.x = element_blank(), 
          axis.ticks.x = element_blank(),
          plot.margin = unit(c(0.01, 0.1, 0.02, 0.1), "null"))

lag <- lapply(la,ggplotGrob)

# http://wresch.github.io/2014/05/22/aligning-ggplot2-graphs.html
lax <- lapply(lag,function(x) x$widths[2:3])
maxwidths <- do.call(grid::unit.pmax,lax)
for(i in 1:12) lag[[i]]$widths[2:3] <- as.list(maxwidths)

g <- gtable_matrix(name = "demo",
    grobs = matrix(lag, nrow = 12), 
    widths = unit(9, "null"),
    heights = unit(tt$Freq, "null"))


Saturday, December 12, 2015

Vacancies in the Netherlands

Over the last couple of years, each weekend I have been registering how many vacancies various websites claim to have. This post shows some of the observations one may draw from the plots.


Data is from general and more specialized websites. The first observations started in 2010. Since that time there have been a number of changes. In general, the most useful sites are uitzendbureau.nl, nuwerk.nl, nationalevacaturebank.nl, indeed.nl and jobbird.nl. These all cover the whole market.
Uitzendbureau (employment agency) caters mostly for short term solutions. When an employer wants an employee quickly, without much hassle, and who can be let go just as easily, an uitzendbureau is the port of call. The website covers many of these agencies; from what I have seen, I can confirm it includes the bigger ones. This makes the number of vacancies they have a marker for the general labour market. The other sites cover more the jobs at the employers themselves. I seem to remember that nationalevacaturebank (national vacancies bank) did a lot of advertising at the time. Nuwerk.nl is a subsite of nu.nl, a popular news site in the Netherlands. The others I added to the collection later.
In addition, there are VK banen, Intermediair and Stepstone. VK banen (Volkskrant) and Intermediair are the traditional sites for higher educated employees. Back when the internet was not a factor of importance, Intermediair came for free to higher educated people, starting just before graduation and ending at age 40 or 45. Thus it was traditionally one of the best places for vacancies. Volkskrant had a somewhat smaller number of vacancies, many of them in education. At some point within the data they merged their vacancy activities. Stepstone is an international entity which tries to get a foothold in the Dutch market.
Finally, BCFJobs is a website which caters for Bio (life sciences), Chemistry and Food jobs, from medium level upwards. It got into this list because it covers my personal background.

Data import

Over the years I have used several (spreadsheet) programs to store the data. Currently it sits in LibreOffice, so I will extract the data from an ODS file. read.ods results in a list (one entry per page in the spreadsheet) with text columns for each column in the page; I did not notice any option to directly employ the column names which were in row 1. Hence there is a bit of processing to obtain a data.frame.


General observations 

In the first figure there are uitzendbureaus, nuwerk and nationalevacaturebank. The line for nuwerk is terminated: at that point they switched to vacancies supplied by monsterboard, the numbers were totally different, and I stopped registering them.
Uitzendbureaus has quite fluctuating numbers. Especially in 2014, but to a lesser extent in other years, there is a spike just before the summer holidays. Much of the work which needs to be done when people go on vacation is supplied via uitzendbureaus. A valley at the end of each year is caused by the Christmas vacation. Many businesses close between Christmas and New Year, somewhat depending on the actual weekdays on which they fall. As a result there is a dip in both uitzendbureau and nationalevacaturebank.nl. A final observation, which I find difficult to explain, is that the beginning of almost every year shows some optimism: the number of vacancies increases. Then, after the summer vacation, things are more pessimistic. In the crisis years the second half of the year gave an actual decrease; post crisis it is a flat line. One possible explanation may be the end of the school year just before summer. This means that in summer and autumn there is a fresh batch of school leavers. Most of these may enter their names at the uitzendbureau; one is required to search for a job, and this is one of the ways to go about that. By the end of the year these will all have been placed. But I could think of other reasons too (e.g. head count low at end of year, optimism/pessimism due to seasons).
Regarding the crisis: starting mid 2011 and continuing till the end of 2012 there were fewer and fewer vacancies. 2013 is the year things stabilized, while in 2014 things started to get better. In terms of vacancies, we are getting back to the level of the beginning of the series, before the crisis hit hard.
Nuwerk was actually able to weather the crisis reasonably well. However, they subsequently lost a bit of market, and the associated uptake in vacancies in 2014 was not captured. I imagine this is part of the reason they started using monsterboard.

Extending the web sites used

In this extended plot there are a few extra sources, added for completeness' sake. In addition, since the number of vacancies covers a larger range, it is plotted on a logarithmic scale. Werk is the government website. On this website one is required to register the vacancies to which one has applied; insufficient activity on the job acquisition front will result in cuts in benefits. It also has vacancies. Unfortunately it is not a very good website, and quite often in the weekend it is down for maintenance. I gave up on retrying later in the weekend; fortunately I was never in the situation of having to use this website.
Jobbird did quite some advertising at some point. However, there are some odd spikes. I am not sure what causes them, but for me this is a reason to doubt whether their number of vacancies is a good indicator of what is happening on the jobs market.
Indeed registers the number of new vacancies. This is therefore even more fluctuating: every school vacation has a bigger or smaller dip. It is also one of the last websites I added. The large fluctuation especially was a big reason not to plot these data in the first figure.

Higher education

VK (Volkskrant) was traditionally the website where school teachers were recruited. As a consequence, somewhere end of April, beginning of May, a large number of vacancies appeared for the next school year. These are the peaks visible in the three years for which VK has data. A similar but much less pronounced peak is visible in 2013 and 2014. In the section above I stated that 2013 was the year things stabilized. That is not true in this plot: Intermediair showed a decrease in 2013. Stepstone was able to gain some market, which it subsequently lost in 2014. Here 2014 was the year things stabilized, and 2015 saw increases in jobs. What is happening at the end of 2015 is something new. Intermediair made a big jump in one week (2125 to 2645 from 12 to 19 September); stepstone followed a few weeks later with some jumps. This really feels like a change in the market.
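That Intermediair jump amounts to an increase of almost a quarter in a single week:

```r
# week-on-week change, 12 to 19 September 2015
(2645 - 2125)/2125 * 100  # about 24.5 percent
```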
In contrast, BCFJobs did fairly well through the crisis. However, during and after the summer vacation it did suffer a loss in vacancies. It is almost as if vacancies have been pulled from BCFJobs and placed on Intermediair and stepstone. As explained above, Intermediair is probably the best known place for higher educated people to start their job search; likewise it may be the best place for employers to go to when it gets less easy to find employees. As more and more people get employed again, this may be forcing the change. But that is speculation.


The number of jobs fluctuates, depending on vacations, the season and progress through the crisis. Regarding the crisis, 2013 was the year things stabilized and 2014 saw an increase in vacancies. For higher educated personnel, this change happened about a year later.


library(readODS) # read.ods()
library(dplyr)
library(ggplot2)
r1 <- read.ods('banen.aantal.clean.ods')
mynames <- sapply(1:11,function(i) r1[[1]][1,i])
mycols <- lapply(1:11,function(i) {
      if (i==1) as.Date(r1[[1]][-1,1],format='%b %e, %Y')
      else as.numeric(r1[[1]][-1,i])})
r2 <- as.data.frame(mycols)
names(r2) <- mynames
r3 <- mutate(r2,Date=as.Date(Date)) # ensure Date column is of class Date

# to long format; all columns except Date hold counts per website
l1 <- reshape(r3,
    varying=list(names(r3)[-1]),
    v.names='count',
    timevar='source',
    times=names(r3)[-1],
    idvar='Date',
    direction='long')

l1 %>% 
    filter(.,source %in% c('nationalevacaturebank.nl','uitzendbureaus',
            'nuwerk')) %>%
    ggplot(.,aes(y=count,x=Date,col=source)) +
    geom_line()

l1 %>% 
    filter(.,source %in% c('nationalevacaturebank.nl','uitzendbureaus',
            'nuwerk','werk','jobbird','indeed'), # extra source names as assumed in the sheet
        Date>as.Date('2012-01-01')) %>%
    ggplot(.,aes(y=count,x=Date,col=source)) +
    geom_line() +
    scale_y_log10() +
    labs(y='Count (log scale)')

l1 %>% 
    filter(.,source %in% c('VK','Intermediair','stepstone',
       'Intermediair_VK','BCFJobs')) %>%
    ggplot(.,aes(y=count,x=Date,col=source)) +
    geom_line()

Sunday, November 29, 2015

Wind in Netherlands II

Two weeks ago I plotted how wind measurements at the edge of the North Sea changed over the past century. This week the same dataset is used for hypothesis testing.


The most important thing to reiterate from the previous post is that the data are from KNMI and come with a comment: "These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for trend analysis. For climate change studies we refer to the homogenized series of monthly temperatures of De Bilt or the Central Netherlands Temperature"
Data reading has slightly changed, mostly because I needed different variables. In addition, for testing I wanted some categorical variables; these are month and year. For year I have chosen five chunks of 22 years; 22 was chosen since it seemed large enough and resulted in approximately equal sized chunks. Finally, for display purposes, wind direction was categorized in 8 directions according to the compass rose (North, North-East, East etc.).
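The five-chunk year categorization can be checked on its own; assuming the series covers 1906 to 2015 (the 110 years mentioned in the conclusion), cut() gives equal chunks of 22 years:

```r
years <- 1906:2015          # assumed range of the measurement series
yearf <- cut(years, breaks=5)
table(yearf)                # five chunks of 22 years each
```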

library(circular) # circular(), aov.circular()
library(dplyr)
library(ggplot2)
r1 <- readLines('etmgeg_235.txt')
# the last line starting with '#' is the column header; everything before is explanation
r2 <- r1[max(grep('^#',r1)):length(r1)]
explain <- r1[1:(max(grep('^#',r1))-1)]
r2 <- gsub('#','',r2)
r3 <- read.csv(text=r2)

r4 <- mutate(r3,
        # Vector mean wind direction in degrees 
        # (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
        rDDVEC=circular(DDVEC,units='degrees',template='geographics'),
        rFHVEC=FHVEC/10, # Vector mean windspeed (in 0.1 m/s)
        month=factor(months(as.Date(as.character(YYYYMMDD),format='%Y%m%d')),
            levels=month.name),
        yearf=cut(YYYYMMDD %/% 1e4,5), # five chunks of 22 years
        # compass rose in 8 categories; shift by 22.5 degrees so North wraps around
        DDVECf=cut((DDVEC+22.5) %% 360,breaks=0:8*45,
            labels=c('N','NE','E','SE','S','SW','W','NW'))
    )


The circular package comes with an aov.circular() function, which can do one-way analysis. Since I am a firm believer that wind direction varies with the seasons, the presence of a time effect (the five year categories) has been examined by month. To make the result compact, only p-values are displayed; they are all significant.
sapply(month.name,function(x) {
      aa <- filter(r4,month==x)
      bb <- aov.circular(aa$rDDVEC,aa$yearf,method='F.test')
      format.pval(bb$p.value,digits=4,eps=1e-5)
    }) %>% as.data.frame
January   4.633e-05
February    < 1e-05
March       < 1e-05
April       < 1e-05
May         < 1e-05
June        0.00121
July       0.000726
August    0.0001453
September   0.02316
October     < 1e-05
November  0.0001511
December   0.003236
The associated plot shows the frequency of directions by year and month. The advantage here is that the time axis is the x-axis, so changes are more easily visible.
ggplot(r4[complete.cases(r4),], aes(x=yearf))+
    geom_bar()+
    facet_grid(DDVECf ~ month)+
    ggtitle('Frequency of Wind Direction')

The other part of wind is strength. Two weeks ago I saw clear differences, but those may also be an effect of instrument or location changes. The test I am interested in here is therefore not the main effect of the year categories but rather the month by year interaction. In the interest of robustness I wanted to go nonparametric with this. However, since I did not find anything regarding two factor interactions in my second edition of Hollander and Wolfe, I googled for robust interaction. This gave a hit on rcompanion for the WRS2 package.
library(WRS2)
t2way(rFHVEC ~ yearf + month + yearf:month, 
    data = r4)
                value p.value
yearf       1063.0473   0.001
month        767.5687   0.001
yearf:month  169.4807   0.001


The data seem to show a change in wind measurements over these 110 years. This can be due to changes in the wind itself, in the measurement instrument, or in the instrument location. The statistical testing was chosen so as to counter some effects of the latter changes, hence it is plausible that the change is due to changes in the wind itself.