Saturday, December 12, 2015

Vacancies in the Netherlands

Over the last couple of years, each weekend I have been registering how many vacancies various websites claim to have. This post shows some of the observations one may draw from the plots.

Data

Data are from general and from more specialized websites. The first observations started in 2010. Since that time there have been a number of changes. In general, the most useful sites are uitzendbureau.nl, nuwerk.nl, nationalevacaturebank.nl, indeed.nl and jobbird.nl. These all cover the whole market.
Uitzendbureau (employment agency) caters mostly to short-term work. When an employer wants an employee quickly, without much hassle, and one who can be let go just as easily, an uitzendbureau is the port of call. The website covers many of these agencies; from what I have seen, I can confirm it includes the bigger ones. This makes the number of vacancies they list a marker of the general job market. The other three cover more of the jobs at the employers themselves. I seem to remember that nationalevacaturebank (national vacancy bank) did a lot of advertising at the time. Nuwerk.nl is a subsite of nu.nl, a popular news site in the Netherlands. The others I added to the collection later.
In addition, there are VK banen, Intermediair and Stepstone. VK banen (Volkskrant) and Intermediair are the traditional sites for higher educated employees. Back when the internet was not yet a factor of importance, Intermediair came for free to highly educated people, starting just before graduation and ending at age 40 or 45. Thus it was traditionally one of the best places for vacancies. Volkskrant had a somewhat smaller number of vacancies, many of them in education. At some point within the data they merged their vacancy activities. Stepstone is an international entity which is trying to get a foothold in the Dutch market.
Finally, BCFJobs is a website which caters to Bio (life sciences), Chemistry and Food jobs. It lists jobs from medium level upwards. It got onto this list because it covers my personal background.

Data import

Over the years I have used several (spreadsheet) programs to store the data. Currently it sits in LibreOffice, so I extract the data from an ODS file. read.ods returns a list with one entry per sheet in the spreadsheet, each containing text columns for every column in that sheet; I did not notice any option to directly use the column names that are in row 1. Hence a bit of processing is needed to obtain a data.frame (see the Code section at the end of this post).

Plots

General observations 

In the first figure there are uitzendbureaus, nuwerk and nationalevacaturebank. The line for nuwerk is terminated: at that point they switched to vacancies supplied by Monsterboard. The numbers were totally different, and I stopped registering them.
Uitzendbureaus has quite fluctuating numbers. Especially in 2014, but to a lesser extent in other years, there is a spike just before the summer holidays. Much of the work which needs to be done when people go on vacation is supplied via uitzendbureaus. A valley at the end of each year is caused by the Christmas vacation. Many businesses close between Christmas and New Year, somewhat depending on the actual weekdays on which these fall. As a result there is a dip in both uitzendbureau and nationalevacaturebank.nl. A final observation, which I find difficult to explain, is that the beginning of just about every year shows some optimism: the number of vacancies increases. Then, after the summer vacation, things are more pessimistic. In the crisis years the second half of the year gave an actual decrease; post crisis it is a flat line. One possible explanation may be the end of the school year just before summer. This means that in summer and autumn there is a fresh batch of school leavers. Most of these may be entering their names at the uitzendbureau; one is required to search for a job, and this is one of the ways to go about that. By the end of the year these will all have been placed. But I could think of other reasons too (e.g. head count kept low at the end of the year, optimism/pessimism due to the seasons).
Regarding the crisis: starting mid 2011 and continuing till the end of 2012 there are fewer and fewer vacancies. 2013 is the year things stabilized, while in 2014 things started to get better. In terms of vacancies, we are getting back to the level of the beginning of the series, before the crisis hit hard.
Nuwerk was actually able to weather the crisis reasonably well. However, they subsequently lost a bit of market, and the associated uptake in vacancies in 2014 was not captured. I imagine this is part of the reason they started using Monsterboard.

Extending the web sites used

In this extended plot there are a few extra sources. They are added for completeness' sake. In addition, since the number of vacancies covers a larger range, it is plotted on a logarithmic scale. Werk is the government website. On this website one is required to register which vacancies one has applied to. Insufficient activity on the job hunt will result in cuts in benefits. It also has vacancies. Unfortunately it is not a very good website and quite often in the weekend it is down for maintenance. I gave up on retrying later in the weekend; fortunately I was never in the situation of having to use this website.
Jobbird did quite some advertising at some point. However, there are some odd spikes. I am not sure what causes them, but for me this is a reason to doubt whether their number of vacancies is a good indicator of what is happening on the job market.
Indeed reports the number of new vacancies, which is therefore even more fluctuating. Every school vacation shows a bigger or smaller dip. It is also one of the last websites I added. The large fluctuations especially were a big reason not to plot these data in the first figure.

Higher education

VK (Volkskrant) was traditionally the website where school teachers were recruited. As a consequence, somewhere between the end of April and the beginning of May, a large number of vacancies appeared for the next school year. These are the peaks visible in the three years for which VK has data. A similar but much less pronounced peak is visible in 2013 and 2014. In the section above I stated that 2013 was the year things stabilized. That is not true in this plot: Intermediair showed a decrease in 2013. Stepstone was able to gain some market, which it subsequently lost in 2014. Here 2014 was the year things stabilized and 2015 saw increases in jobs. What is happening at the end of 2015 is something new. Intermediair made a big jump in one week (2125 to 2645 from 12 to 19 September), and Stepstone followed a few weeks later with some jumps. This really feels like a change in the market.
In contrast, BCFJobs did fairly well through the crisis. However, during and after the summer vacation it did lose vacancies. It is almost as if vacancies have been pulled from BCFJobs and placed on Intermediair and Stepstone. As explained above, Intermediair is probably the best known place for higher educated people to start their job search; likewise it may be the best place for employers to go to when it gets less easy to find employees. As more and more people get employed again, this may be forcing the change. But that is speculation.

Conclusion

The number of jobs fluctuates, depending on vacations, season and progress through the crisis. Regarding the crisis, 2013 was the year things stabilized and 2014 saw an increase in vacancies. For higher educated personnel, this change happened about a year later.

Code

library(readODS)
library(ggplot2)
library(dplyr)
r1 <- read.ods('banen.aantal.clean.ods')
mynames <- sapply(1:11,function(i) r1[[1]][1,i])
mycols <- lapply(1:11,function(i) {
      if (i==1) as.Date(r1[[1]][-1,1],format='%b %e, %Y')
      else as.numeric(r1[[1]][-1,i])})
r2 <- as.data.frame(mycols)
mynames
names(r2) <- mynames
r3 <- mutate(r2,
    Intermediair_VK=ifelse(!is.na(VK),NA,Intermediair),
    Intermediair=ifelse(is.na(VK),NA,Intermediair))

l1 <- reshape(r3,
    idvar='date',
    direction='long',
    varying=list(names(r3)[-1]),
    timevar='source',
    v.names='count',
    times=names(r3)[-1])

l1 %>% 
    filter(.,source %in% c('nationalevacaturebank.nl','uitzendbureaus',
            'nuwerk')) %>%
    ggplot(.,aes(y=count,x=Date,col=source)) +
    geom_line()+
    ylim(0,NA)+
    theme(legend.position="bottom")

l1 %>% 
    filter(.,source %in% c('nationalevacaturebank.nl','uitzendbureaus',
            'indeed','werk','jobbird','nuwerk'),
        Date>as.Date('2012-01-01')) %>%
    ggplot(.,aes(y=count,x=Date,col=source)) +
    geom_line()+
    scale_y_log10()+
    theme(legend.position="bottom")+
    labs(y='Count (log scale)')

l1 %>% 
    filter(.,source %in% c('VK','Intermediair','stepstone',
       'Intermediair_VK','BCFJobs')) %>%
    ggplot(.,aes(y=count,x=Date,col=source)) +
    geom_line()+
    ylim(0,NA)+
    theme(legend.position="bottom")


Sunday, November 29, 2015

Wind in Netherlands II

Two weeks ago I plotted how wind measurements on the edge of the North Sea changed in the past century. This week the same dataset is used for hypothesis testing.

Data

The most important thing to reiterate from the previous post is that the data are from KNMI and come with a comment: "These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for trend analysis. For climate change studies we refer to the homogenized series of monthly temperatures of De Bilt or the Central Netherlands Temperature"
Data reading has slightly changed, mostly because I needed different variables. In addition, for testing I wanted some categorical variables; these are month and year. For year I have chosen five chunks of 22 years; 22 was chosen since it seemed large enough and resulted in approximately equally sized chunks. Finally, for display purposes, wind direction was categorized into 8 directions according to the compass rose (North, North-East, East, etc.).
library(circular)
library(dplyr)
library(ggplot2)
library(WRS2)

r1 <- readLines('etmgeg_235.txt')
r2 <- r1[grep('^#',r1):length(r1)]
explain <- r1[1:(grep('^#',r1)-1)]
explain
r2 <- gsub('#','',r2)
r3 <- read.csv(text=r2)

r4 <- mutate(r3,
        Date=as.Date(format(YYYYMMDD),format='%Y%m%d'),
        year=floor(YYYYMMDD/1e4),
        rDDVEC=as.circular(DDVEC,units='degrees',template='geographics'),
        # Vector mean wind direction in degrees 
        # (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
        # compass rose in 8 sectors; the northern sector is split in two
        # pieces (0-15 and 330-361 degrees) and recombined in the next step
        DDVECf=as.character(cut(DDVEC,breaks=c(0,seq(15,330,45),361),
            labels=c('N','NE','E','SE','S','SW','W','NW','N2'))),
        DDVECf=ifelse(DDVECf=='N2','N',DDVECf),
        DDVECf=factor(DDVECf,levels=c('N','NE','E','SE','S','SW','W','NW')),
        rFHVEC=FHVEC/10, # Vector mean windspeed (in 0.1 m/s)
        yearf=cut(year,seq(1905,2015,22),labels=c('05','27','49','71','93')),
        month=factor(format(Date,'%B'),levels=month.name),
        tcat=interaction(month,yearf)
    ) %>%
    select(.,YYYYMMDD,Date,year,month,DDVEC,rDDVEC,DDVECf,rFHVEC,yearf,tcat)

Analysis

The circular package comes with an aov.circular() function, which can do a one-way analysis. Since I am a firm believer that wind direction varies with the seasons, the presence of a time effect (the five year categories) has been examined by month. To keep the result compact, only p-values are displayed; they are all significant.
sapply(month.name,function(x) {
      aa <- filter(r4,month==x)
      bb <- aov.circular(aa$rDDVEC,aa$yearf,method='F.test')
      format.pval(bb$p.value,digits=4,eps=1e-5)
    }) %>% as.data.frame
January   4.633e-05
February    < 1e-05
March       < 1e-05
April       < 1e-05
May         < 1e-05
June        0.00121
July       0.000726
August    0.0001453
September   0.02316
October     < 1e-05
November  0.0001511
December   0.003236
The associated plot shows the frequency of directions by year and month. The advantage here is that time is on the x-axis, so changes are more easily visible.
ggplot(r4[complete.cases(r4),], aes(x=yearf))+
    geom_histogram()+
    facet_grid(DDVECf ~ month)+
    ggtitle('Frequency of Wind Direction')

The other part of wind is strength. Two weeks ago I saw clear differences. However, those may also be an effect of instrument or location changes. The test I am interested in here is therefore not the main effect of the year categories but rather the month by year interaction. For robustness I wanted to go nonparametric with this. However, since I did not find anything regarding two-factor interactions in my second edition of Hollander and Wolfe, I googled for robust interaction. This gave a hit on rcompanion for the WRS2 package.
t2way(rFHVEC ~ yearf + month + yearf:month, 
    data = r4)
                value p.value
yearf       1063.0473   0.001
month        767.5687   0.001
yearf:month  169.4807   0.001

Conclusion

The data seem to show a change in wind measurements over these 110 years. This can be due to changes in the wind, in the measurement instrument or in the instrument location. The statistical testing was chosen so as to counter some effects of these changes, hence it is plausible that the change is due to changes in the wind itself.

Sunday, November 15, 2015

Wind in Netherlands

In climate change discussions, everybody talks about temperature. But weather is much more than that. There are at least rain and wind as directly experienced qualities, and air pressure as a measurable quantity. In the Netherlands, some observation stations have more than a century of daily data on these things. The data may be broken in the sense that equipment and location may have changed. To quote: "These time series are inhomogeneous because of station relocations and changes in observation techniques. As a result, these series are not suitable for trend analysis. For climate change studies we refer to the homogenized series of monthly temperatures of De Bilt or the Central Netherlands Temperature." Since I am not looking at temperature but at wind, I will keep to this station's data.

Data

Data are daily observations from KNMI. I have chosen station De Kooy. For those less familiar with Dutch geography, this is close to Den Helder, at the north-western tip of the Netherlands. That means pretty close to the North Sea, the Wadden Sea and Lake IJssel. Wind should be relatively unhindered there. For wind the daily observations contain:
DDVEC   Vector mean wind direction in degrees
        (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
FHVEC   Vector mean windspeed (in 0.1 m/s)
FG      Daily mean windspeed (in 0.1 m/s)
FHX     Maximum hourly mean windspeed (in 0.1 m/s)
FHXH    Hourly division in which FHX was measured
FHN     Minimum hourly mean windspeed (in 0.1 m/s)
FHNH    Hourly division in which FHN was measured
FXX     Maximum wind gust (in 0.1 m/s)
FXXH    Hourly division in which FXX was measured
The header of the downloaded data contains this and much more information. I am sure there are good reasons to give speeds in 0.1 m/s, but personally I find m/s easier.
The first two variables are 'vector means'. Obviously one cannot simply average directions. Luckily there is the circular package, which does understand directions.
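As a minimal check of why a plain average fails for directions (my addition, not from the original post): two almost northerly winds of 350 and 10 degrees average arithmetically to 180 degrees, i.e. due south, while the circular mean correctly points north.
library(circular)
# two nearly northerly directions
x <- as.circular(c(350, 10), units='degrees', template='geographics')
mean(x)          # circular mean: (approximately) 0 degrees, i.e. north
mean(c(350, 10)) # naive arithmetic mean: 180 degrees, i.e. south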
Thus the data reading script becomes:
r1 <- readLines('etmgeg_235.txt')
r2 <- r1[grep('^#',r1):length(r1)]
explain <- r1[1:(grep('^#',r1)-1)]
# explain
r2 <- gsub('#','',r2)
r3 <- read.csv(text=r2)
library(dplyr)
library(circular)
library(ggplot2) # needed for the plots below
methods(sd)
r4 <- mutate(r3,
    Date=as.Date(format(YYYYMMDD),format='%Y%m%d'),
    year=floor(YYYYMMDD/1e4),
    month=factor(format(Date,'%B'),levels=month.name),
    rDDVEC=as.circular(DDVEC,units='degrees',template='geographics'),
              # Vector mean wind direction in degrees 
              # (360=north, 90=east, 180=south, 270=west, 0=calm/variable)
    rFHVEC=FHVEC/10, # Vector mean windspeed (in 0.1 m/s)
    rFG=FG/10,   # Daily mean windspeed (in 0.1 m/s) 
    rFHX=FHX/10, # Maximum hourly mean windspeed (in 0.1 m/s)
    rFHN=FHN/10, # Minimum hourly mean windspeed (in 0.1 m/s)
    rFXX=FXX/10 # Maximum wind gust (in 0.1 m/s)
    ) %>%
    select(.,YYYYMMDD,Date,year,month,rDDVEC,rFHVEC,
        rFG,rFHX,rFHN,rFXX)

Plots

The plot of mean wind speed shows several effects. There is an equipment change just before the year 2000. At the beginning of the curve the values are lowest, while in the sixties there is a bit more wind, as was the case in the nineties. I wonder about that. Is that equipment? I can imagine that a hundred years ago the equipment was less sophisticated, giving such a change, but fifty or twenty years ago? Finally, close to the end of the war there are missing data.
ggplot(data=r4,aes(y=rFG,x=Date))+
    geom_smooth()+
    geom_point(alpha=.03) +
    ylab('Mean wind speed x (m/s)')+
    xlab('Year')

A second plot is by month. This shows somewhat different patterns. There is still most wind in the middle of the last century. However, September and October have the most wind just before 1950, while November and December have the most wind after 1950. Such a pattern cannot be attributed to changes in equipment. It would seem there was some kind of change in wind speeds around that time.
r5 <- group_by(r4,month,year) %>%
    summarise(.,mFG=mean(rFG),mFHX=max(rFHX),mFXX=max(rFXX))
ggplot(data=r5,aes(y=mFG,x=year)) +
    geom_smooth(method='loess') +
    geom_point(alpha=.5)+
    facet_wrap(~ month)
 

Wind direction

In the Netherlands there is a clear connection between wind and the rest of the weather. Most of the wind is from the SW (south-west; I will use N, E, S, W to abbreviate directions from here on). N, NW, W and SW winds pick up humidity from the North Sea and the Atlantic Ocean, which in turn brings rain. In winter, the SW wind also brings warmth; there will be no frost with W and SW winds. In contrast, N, NE and E winds bring cold. A winter wind from Siberia brings skating fever. In summer, nice and sunny weather is associated with S to E winds; the E wind in May is associated with nice spring weather. SE is by far the least common direction.
The circular package has both density and plot functions. Combining these gives the following directions for the oldest part of the data.
par(mfrow=c(3,4),mar=c(0,0,3,0))
lapply(month.name,function(x) {
      xx <- r4$rDDVEC[r4$year<1921 & r4$month==x]
      xx <- xx[!is.na(xx)]
      density(xx,bw=50)  %>% 
          plot(main=x,xlab='',ylab='',shrink=1.2)
      1
    })
title("1906-1920", line = -1, outer = TRUE)

I would be hard pressed to see significant differences between the old and recent data. The densities are slightly different, but not really impressive. Note the lack of E wind in summer, indicating that recent summers have not been very spectacular.
par(mfrow=c(3,4),mar=c(0,0,2,0))
lapply(month.name,function(x) {
      xx <- r4$rDDVEC[r4$year>=2000 & r4$month==x]
      xx <- xx[!is.na(xx)]
      density(xx,bw=50)  %>% 
          plot(main=x,xlab='',ylab='',shrink=1.2)
      1
    })
title("2000-now", line = -1, outer = TRUE)

Sunday, November 1, 2015

Vacancies in Europe

I like playing around with data from Eurostat. These days the tools to do so are just so easy. There are tools to pull the data directly from the database into R (the eurostat package). Process it a bit using dplyr and, before you know it, ggplot makes a plot.

Data

My starting point for examining data is the database page. From there I can browse for the correct table and view its contents. Having done that, I can take the name of the table and pull it into R. The name of the vacancy database I chose (Job vacancy statistics - quarterly data (from 2001 onwards), NACE Rev. 2) is jvs_q_nace2, hence with
library(eurostat)
library(dplyr)
library(ggplot2)
library(scales)
r1 <- get_eurostat('jvs_q_nace2')
I have all the packages needed and the data in R. One of the properties of the data is that everything is coded. Hence the next step is to merge the codes. The following code pulls the country codes and does a bit of post-processing on the names to make them a bit nicer. Subsequently, the various combinations of countries resulting from the expansion of the EU and the Euro area at various time points are removed. These data have the property that they are too abundant, so some data removal is needed. Finally, seasonally adjusted data are selected and all company sizes are used.
# add country names
r2 <- get_eurostat_dic('geo') %>%
    mutate(.,
        geo=V1,
        country=V2,
        country=gsub('\\(.*$','',country),
        country=gsub(' $','',country)) %>%
    select(.,geo,country) %>%
    merge(.,r1) %>%
# filter countries
    filter(.,
        !grepl('EA.*',geo),
        !grepl('EU.*',geo),
        s_adj=='SA'  ,# seas. adj.
        sizeclas!='Total') %>% # all company sizes
    mutate(.,country=factor(country)) %>%
    select(.,-geo,-s_adj,-sizeclas)
For the other variables it is more or less the same: get_eurostat_dic() pulls the coding and it can be merged. The text in nace is a bit long, so I shortened it.
r3 <- get_eurostat_dic('nace_r2') %>%
    rename(.,
        nace_r2=V1, # add NACE 
        nace=V2) %>%
    mutate(.,
        nace=substr(nace,1,110)) %>%
    merge(.,r2) %>%
    mutate(nace=factor(nace))

r4 <- get_eurostat_dic('indic_em') %>%
    rename(.,
        indic_em=V1) %>%
    merge(.,r3) %>%
    mutate(.,
        property=factor(V2)) %>%
    select(.,-V2)

Plots

Since the data are now prepared, the next step is to plot. There are actually far too many categories in nace and a selection to be displayed is needed. If you want to know what the different categories are, use
nace <- select(r4,nace_r2,nace) %>% unique() 
to display what each category represents. I chose to select a number of industry related categories. In addition, some countries have very limited data; they are eliminated.
filter(r4,
        property=='Job vacancy rate',
        nace_r2 %in% c('A-S','B-E','B-S','B-F'),
        !(country %in% 
              c('Croatia', 'Greece','Portugal',# limited years
                  'Switzerland')),             # limited classes
        time>as.Date('01-01-2006',format='%d-%m-%Y'),
        !is.na(values)) %>%
    mutate(.,country=factor(country)
        ,nace_r2=factor(nace_r2)) %>%
    
    ggplot(.,aes(x=time,y=values,color=nace)) +
    geom_line() +
    facet_wrap(  ~ country )+
    ylab('Job vacancy rate')+
    guides(color=guide_legend(ncol=1))+
    scale_x_date(labels=date_format("%y"))+
    xlab('Year')+
    theme(legend.position="bottom", legend.title=element_blank())
In the plot the enormous drops for Cyprus, the Czech Republic and Estonia are clearly visible. The Czech Republic is also rebounding quite steeply. The UK had a smaller drop in 2008, but is now at pre-crisis job vacancy rates. In fact, many countries show increases in the job vacancy rate.

Getting a different display is very easy. Below is the call to get the number of vacancies in education, information and communication, and research. Since the number of vacancies really depends on country size, a logarithmic scale is chosen. The countries displayed are slightly different; it appears not all countries have all data. But the trends are similar to the previous plot.

filter(r4,
        property=='Number of job vacancies',
        nace_r2 %in% c('J','M','M_N','P'),
        !(country %in% # limited years
              c('Croatia', 'Greece','Portugal','Sweden')),
        !is.na(values)) %>%
    mutate(.,country=factor(country)
        ,nace_r2=factor(nace_r2)) %>%
    
    ggplot(.,aes(x=time,y=values,color=nace )) +
    geom_line() +
    facet_wrap(  ~ country )+
    ylab('Number of job vacancies')+
    guides(color=guide_legend(ncol=1))+
    scale_x_date(labels=date_format("%y"))+
    xlab('Year')+
    scale_y_log10()+
    theme(legend.position="bottom", legend.title=element_blank())

Sunday, October 18, 2015

Trying to optimize

I wanted to try some more machine learning. On Kaggle there is a competition, How Much Did It Rain? II. This is quite a bit bigger data set than Titanic. To quote from Kaggle:
Rainfall is highly variable across space and time, making it notoriously tricky to measure. Rain gauges can be an effective measurement tool for a specific location, but it is impossible to have them everywhere. In order to have widespread coverage, data from weather radars is used to estimate rainfall nationwide. Unfortunately, these predictions never exactly match the measurements taken using rain gauges.

Data

On the data themselves:
To understand the data, you have to realize that there are multiple radar observations over the course of an hour, and only one gauge observation (the 'Expected'). That is why there are multiple rows with the same 'Id'.
I have downloaded the data and at this point I am just experimenting with them. It is quite a big data set: there are 9125329 rows in the training set. My idea was to compute 'something' per record and then combine the records of one hour to get a prediction. The 'something' is as yet undefined; the idea to combine by Id is supposed to be retained.
What became clear pretty quickly is that everything is slow with this amount of data. Hence for now I will use only 10% of the training data. For ease of access the data are sitting in an R data set.
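The post does not show how that R data set was made. A plausible one-time preparation step is sketched below; the file name 'train.csv' is my assumption (the usual Kaggle name), while the object name rawdata and the file aaa3.RData are taken from the code that follows.
# one-time preparation (assumed, not shown in the original post):
# read the Kaggle training csv once and save it for fast loading later
rawdata <- read.csv('train.csv')
save(rawdata, file='aaa3.RData')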
load('aaa3.RData')
###
# take 10% of data
rawdata <- rawdata[rawdata$Id < quantile(rawdata$Id,.1),]
# extract keys per hour: Id & Expected
# rawdata$Id!=c(0,rawdata$Id[1:(nrow(rawdata)-1)])
# is the R way to write the SAS code: by Id; if first.Id;
r1b <- rawdata[rawdata$Id!=c(0,rawdata$Id[1:(nrow(rawdata)-1)]),c(1,24)]
Id <- factor(rawdata$Id)

Model

To get an idea of the process I started with linear regression, but that is just a temporary approach. For the linear regression there are 22 parameters: one for each of the 21 observed values plus an intercept. The prediction per row follows from a simple matrix multiplication. The model, including estimation of the error in the fit, sits in a small R function. As preparation, a column of ones is added to the x data. The summary per Id can be done pretty quickly and easily via the group_by() and summarise() functions from dplyr.
Based on the current results I have decided that such a function will have to be transferred to C++ or similar in order to get a decent computation time. But that is for a future time; it has been quite some years since I programmed in C or Fortran, so I'll need a refresher first. Luckily edX has a course 'Introduction to C++' running right now.
r1m <- as.matrix(rawdata[,c(-1,-2,-24)])
rm(rawdata) # control memory usage
r1m <- cbind(rep(1,nrow(r1m)),r1m) # add column of 1
r1m[is.na(r1m) ] <- 0
betas <- rep(1,ncol(r1m))
#myerr calculates the mean prediction per Id and compares it with the Expected values
myerr <- function(betas) {
  pred <- data.frame(Id=Id,
          pred=as.numeric(tcrossprod(betas,r1m))) %>%
      group_by(.,Id) %>%
      summarise(.,m=mean(pred)) 
  sum(abs(pred$m-r1b$Expected))
}
#mmyerr is myerr for maximization
mmyerr <- function(betas) -myerr(betas) 

Parameter estimation & optimization

The problem has now been reduced to finding the parameters which give the lowest prediction error. Just throwing this into optim() did not lead to satisfactory results, so this post gives some experiments with alternative approaches. I played with several of them, and for this post ran everything again to get decent data. The table shows a quick summary.

package   function       time (s)   result    converged
stats     optim          771        1805459   No
stats     optim (BFGS)   729        1678736   Yes
stats     optim (CG)     5527       1678775   Yes
adagio    simpleEA       623        1722928   No
dfoptim   hjk            4289       1678734   Yes
GA        ga             589        1775910   NA
It appears that optim() with the default Nelder-Mead method did not function at all. In contrast, using the BFGS option gave quite an improvement. That would imply that fast differentiation could improve the result quite a bit. The quite decent result for simpleEA() is actually a bit of a disappointment: it started with a box and converged to the center of that box. Whatever I changed in the options, it would always end in the same center. hjk() works but is quite slow. Finally, ga() is just a bit too slow and has difficulty finding the actual optimum.
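To sketch the differentiation idea (my addition, not from the original post): optim() accepts an explicit gradient through its gr argument, and a numerical gradient can be taken from the numDeriv package. This only illustrates the mechanism; the sum of absolute errors is not differentiable everywhere, and whether this beats optim's own internal differencing depends entirely on how expensive myerr() is.
# sketch: hand BFGS a (numerical) gradient; myerr and betas as defined above
library(numDeriv)
myerr_gr <- function(b) grad(myerr, b)   # numerical gradient of the objective
ooptimBFGS2 <- optim(betas, myerr, gr=myerr_gr,
    method='BFGS', control=list(maxit=5000))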
> system.time(
+     ooptim <- optim(betas,
+         myerr,
+         control=list(maxit=5000)))
   user  system elapsed 
771.204   0.216 770.743 
> ooptim
$par
 [1]  5.91962946 -0.26798071 -1.19797656 -0.13297181  0.51015756  1.49454289
 [7]  0.75377772  0.47774932 -0.54917591 -1.36237144  9.32248883 -4.58509259
[13]  1.53973413 -1.28367152  1.57070010  2.80960763 -1.53639162  0.24043815
[19]  0.55385066  0.65754031  2.25820673 -0.09389256

$value
[1] 1805459

$counts
function gradient 
    5001       NA 

$convergence
[1] 1

$message
NULL

> system.time(
+     ooptimBFGS <- optim(betas,
+         myerr,
+         control=list(maxit=5000),
+         method='BFGS'))
   user  system elapsed 
730.253   0.226 729.741 
> ooptimBFGS
$par
 [1] -0.193572528  0.036277278  0.018915001  0.055201253 -0.005343143
 [6]  0.048512761  0.018770974 -0.067885397  0.006576984  0.009960527
[11]  0.354908920 -0.417610915  0.004300646 -0.313622718  0.013003956
[16]  0.030448388  0.004622354  0.026076960  0.022466472  0.015661149
[21]  0.093649287 -0.056203122

$value
[1] 1678736

$counts
function gradient 
     626       94 

$convergence
[1] 0

$message
NULL

> system.time(
+     ooptimCG <- optim(betas,
+         myerr,
+         control=list(maxit=5000),
+         method='CG'))
    user   system  elapsed 
5531.910    1.397 5527.053 
> ooptimCG
$par
 [1] -0.112575961  0.030875504  0.019111803  0.060134187 -0.009593463
 [6]  0.051076794  0.019737360 -0.077191287  0.016508372  0.003658111
[11] -0.071932138 -0.036570553 -0.051100594 -0.179869515 -0.010209693
[16]  0.066388807  0.005459227  0.039691325  0.020881573  0.020044547
[21]  0.068064112 -0.051825924

$value
[1] 1678775

$counts
function gradient 
    2283      772 

$convergence
[1] 0

$message
NULL

> library(adagio)
> system.time(
+     osimpleEA <- simpleEA(myerr,
+         lower=rep(-5,ncol(r1m)),
+         upper=rep(5,ncol(r1m))))
   user  system elapsed 
624.432   0.176 623.857 
> osimpleEA
$par
 [1] 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0

$val
[1] 1722928

$fun.calls
[1] 4060

$rel.scl
[1] 0.625

$rel.tol
[1] 0

> library(dfoptim)
> system.time(
+     ohjk <- hjk(betas,myerr)
+ )
    user   system  elapsed 
4291.935    2.036 4289.022 
> ohjk
$par
 [1] -0.188190460  0.035865784  0.020156860  0.056446075 -0.005027771
 [6]  0.046699524  0.018516541 -0.069412231  0.006889343  0.010589600
[11]  0.257308960 -0.412174225  0.065429688 -0.279411316  0.014663696
[16]  0.027637482  0.011859894  0.020298004  0.023769379  0.013179779
[21]  0.092247009 -0.056926727

$value
[1] 1678734

$convergence
[1] 0

$feval
[1] 28245

$niter
[1] 19

> library(GA)
Loading required package: foreach
foreach: simple, scalable parallel programming from Revolution Analytics
Use Revolution R for scalability, fault tolerance and more.
http://www.revolutionanalytics.com
Loading required package: iterators
Package 'GA' version 2.2
Type 'citation("GA")' for citing this R package in publications.
> system.time(
+     oga <- ga(type='real-valued',
+         fitness=mmyerr,
+         min=rep(-5,ncol(r1m)),
+         max=rep(5,ncol(r1m))))
Iter = 1  | Mean = -10979924  | Best = -2576076 
Iter = 2  | Mean = -7162806  | Best = -2576076 
Iter = 3  | Mean = -4932565  | Best = -2318262 
Iter = 4  | Mean = -4639204  | Best = -2237286 
Iter = 5  | Mean = -3638851  | Best = -2237286 
Iter = 6  | Mean = -3689674  | Best = -2068472 
Iter = 7  | Mean = -3226292  | Best = -1990599 
Iter = 8  | Mean = -2977074  | Best = -1876791 
Iter = 9  | Mean = -2775401  | Best = -1876791 
Iter = 10  | Mean = -2332140  | Best = -1865152 
Iter = 11  | Mean = -2437267  | Best = -1817242 
Iter = 12  | Mean = -2471635  | Best = -1799720 
Iter = 13  | Mean = -2168155  | Best = -1799720 
Iter = 14  | Mean = -2096314  | Best = -1799720 
Iter = 15  | Mean = -2253981  | Best = -1799720 
Iter = 16  | Mean = -2049947  | Best = -1799720 
Iter = 17  | Mean = -2089075  | Best = -1789813 
Iter = 18  | Mean = -2093853  | Best = -1789813 
Iter = 19  | Mean = -1976973  | Best = -1789813 
Iter = 20  | Mean = -1933447  | Best = -1789813 
Iter = 21  | Mean = -2000920  | Best = -1789813 
Iter = 22  | Mean = -1950218  | Best = -1788022 
Iter = 23  | Mean = -1886263  | Best = -1788022 
Iter = 24  | Mean = -1914422  | Best = -1788022 
Iter = 25  | Mean = -2024395  | Best = -1788022 
Iter = 26  | Mean = -2016097  | Best = -1788022 
Iter = 27  | Mean = -1947956  | Best = -1788022 
Iter = 28  | Mean = -1900507  | Best = -1787331 
Iter = 29  | Mean = -2066703  | Best = -1787331 
Iter = 30  | Mean = -1835251  | Best = -1787331 
Iter = 31  | Mean = -1942032  | Best = -1787331 
Iter = 32  | Mean = -2019920  | Best = -1787331 
Iter = 33  | Mean = -1891563  | Best = -1787331 
Iter = 34  | Mean = -1838287  | Best = -1787331 
Iter = 35  | Mean = -1925757  | Best = -1787331 
Iter = 36  | Mean = -1972783  | Best = -1787331 
Iter = 37  | Mean = -2053707  | Best = -1787331 
Iter = 38  | Mean = -1999298  | Best = -1787331 
Iter = 39  | Mean = -2057161  | Best = -1787331 
Iter = 40  | Mean = -2016452  | Best = -1787331 
Iter = 41  | Mean = -2106619  | Best = -1787331 
Iter = 42  | Mean = -1848483  | Best = -1787331 
Iter = 43  | Mean = -1952123  | Best = -1784611 
Iter = 44  | Mean = -1911939  | Best = -1784611 
Iter = 45  | Mean = -1867947  | Best = -1784611 
Iter = 46  | Mean = -1961794  | Best = -1784611 
Iter = 47  | Mean = -1878957  | Best = -1784611 
Iter = 48  | Mean = -2004529  | Best = -1784611 
Iter = 49  | Mean = -1819421  | Best = -1784611 
Iter = 50  | Mean = -1904183  | Best = -1784611 
Iter = 51  | Mean = -1870062  | Best = -1783617 
Iter = 52  | Mean = -1948733  | Best = -1782194 
Iter = 53  | Mean = -2023928  | Best = -1782194 
Iter = 54  | Mean = -2008359  | Best = -1782194 
Iter = 55  | Mean = -2022090  | Best = -1782194 
Iter = 56  | Mean = -2086763  | Best = -1782194 
Iter = 57  | Mean = -1959158  | Best = -1782194 
Iter = 58  | Mean = -1849578  | Best = -1782194 
Iter = 59  | Mean = -2119746  | Best = -1782194 
Iter = 60  | Mean = -2218175  | Best = -1780252 
Iter = 61  | Mean = -2096596  | Best = -1780252 
Iter = 62  | Mean = -1979876  | Best = -1779764 
Iter = 63  | Mean = -2172198  | Best = -1779187 
Iter = 64  | Mean = -1876837  | Best = -1779187 
Iter = 65  | Mean = -1880262  | Best = -1779187 
Iter = 66  | Mean = -1830083  | Best = -1779187 
Iter = 67  | Mean = -1908915  | Best = -1779187 
Iter = 68  | Mean = -1994229  | Best = -1779187 
Iter = 69  | Mean = -2090335  | Best = -1779187 
Iter = 70  | Mean = -2314393  | Best = -1779187 
Iter = 71  | Mean = -2072167  | Best = -1779187 
Iter = 72  | Mean = -1901014  | Best = -1779187 
Iter = 73  | Mean = -1805347  | Best = -1779187 
Iter = 74  | Mean = -1868138  | Best = -1779187 
Iter = 75  | Mean = -2054748  | Best = -1779187 
Iter = 76  | Mean = -1894803  | Best = -1779187 
Iter = 77  | Mean = -1789384  | Best = -1779187 
Iter = 78  | Mean = -1919701  | Best = -1779187 
Iter = 79  | Mean = -1885417  | Best = -1779187 
Iter = 80  | Mean = -1911716  | Best = -1779187 
Iter = 81  | Mean = -1951938  | Best = -1779187 
Iter = 82  | Mean = -1968106  | Best = -1779187 
Iter = 83  | Mean = -2118552  | Best = -1778650 
Iter = 84  | Mean = -1891108  | Best = -1778650 
Iter = 85  | Mean = -1967131  | Best = -1778650 
Iter = 86  | Mean = -1971338  | Best = -1778650 
Iter = 87  | Mean = -1798790  | Best = -1778508 
Iter = 88  | Mean = -1990970  | Best = -1778440 
Iter = 89  | Mean = -2103851  | Best = -1778440 
Iter = 90  | Mean = -2176462  | Best = -1778440 
Iter = 91  | Mean = -1852161  | Best = -1778440 
Iter = 92  | Mean = -1913977  | Best = -1778440 
Iter = 93  | Mean = -2110709  | Best = -1778440 
Iter = 94  | Mean = -1948281  | Best = -1778440 
Iter = 95  | Mean = -2221610  | Best = -1777905 
Iter = 96  | Mean = -2276691  | Best = -1775910 
Iter = 97  | Mean = -2141834  | Best = -1775910 
Iter = 98  | Mean = -1871241  | Best = -1775910 
Iter = 99  | Mean = -1829631  | Best = -1775910 
Iter = 100  | Mean = -1914998  | Best = -1775910 
   user  system elapsed 
590.256   0.418 589.855 
> summary(oga)
+-----------------------------------+
|         Genetic Algorithm         |
+-----------------------------------+

GA settings: 
Type                  =  real-valued 
Population size       =  50 
Number of generations =  100 
Elitism               =  2 
Crossover probability =  0.8 
Mutation probability  =  0.1 
Search domain 
    x1 x2 x3 x4 x5 x6 x7 x8 x9 x10 x11 x12 x13 x14 x15 x16 x17 x18 x19 x20 x21
Min -5 -5 -5 -5 -5 -5 -5 -5 -5  -5  -5  -5  -5  -5  -5  -5  -5  -5  -5  -5  -5
Max  5  5  5  5  5  5  5  5  5   5   5   5   5   5   5   5   5   5   5   5   5
    x22
Min  -5
Max   5

GA results: 
Iterations             = 100 
Fitness function value = -1775910 
Solution               = 
            x1        x2        x3        x4         x5        x6       x7
[1,] 0.2099863 0.3481724 0.3910016 -1.416855 -0.3929375 0.5143907 0.319798
           x8        x9        x10         x11       x12      x13        x14
[1,] 1.795368 -0.445552 -0.8136325 -0.01177531 -0.823179 0.744795 -0.7005772
           x15       x16       x17         x18        x19       x20       x21
[1,] 0.3549757 0.4885278 0.3535035 -0.04773755 -0.5155096 0.3308792 0.1921899
           x22
[1,] 0.3446834