Sunday, April 27, 2014

Project Tycho, Correlation between states

In this fourth post on Measles data I want to have a look at correlation between states. As described before, the data is from Project Tycho, which contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.

Data

I discovered an error in the previous code which made 1960 appear twice. Hence the updated script.
setwd('/home/kees/Documents/tycho/')
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR' , 'WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='STATE',
    direction='long')
r2$STATE=factor(r2$STATE)
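Since the reshape call is somewhat cryptic, here is the same wide-to-long pattern on a tiny invented data frame (names and values made up for illustration):
# toy wide data: one column per state, as in the Tycho file
w <- data.frame(YEAR=1928, WEEK=1:2, ALABAMA=c(10,20), ALASKA=c(1,2))
reshape(w,
    varying=c('ALABAMA','ALASKA'),
    v.names='Cases',
    idvar=c('YEAR','WEEK'),
    times=c('ALABAMA','ALASKA'),
    timevar='STATE',
    direction='long')
# gives one row per YEAR/WEEK/STATE combination with a single Cases column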

#################### population data
years <- dir(pattern='\\.txt$')
years

pop1 <-
    lapply(years,function(x) {
            rl <- readLines(x)
            print(x)
            sp <- grep('^U.S.',rl)
            st1 <- grep('^AL',rl)
            st2 <- grep('^WY',rl)
            rl1 <- rl[c(sp[1]-2,st1[1]:st2[1])]
            rl2 <- rl[c(sp[2]-2,st1[2]:st2[2])]
           
            read1 <- function(rlx) {
                rlx[1] <- paste('abb',rlx[1])
                rlx <- gsub(',','',rlx,fixed=TRUE)
                rt <- read.table(textConnection(rlx),header=TRUE)
                rt[,grep('census',names(rt),invert=TRUE)]
            }
            rr <- merge(read1(rl1),read1(rl2))
            ll <- reshape(rr,
                list(names(rr)[-1]),
                v.names='pop',
                timevar='YEAR',
                idvar='abb',
                times=as.integer(gsub('X','',names(rr)[-1])),
                direction='long')
        })
pop <- do.call(rbind,pop1)
pop <- pop[grep('19601',rownames(pop),invert=TRUE),]
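rbind made the duplicated rownames unique (the second AL.1960 apparently became AL.19601, and so on), which is what the grep above removes. A quick check, just as a sketch, that each state/year combination now occurs exactly once:
stopifnot(!any(duplicated(pop[,c('abb','YEAR')])))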

states <- rbind(
    data.frame(
        abb=datasets::state.abb,
        State=datasets::state.name),
    data.frame(abb='DC',
        State='District of Columbia'))
states$STATE=gsub(' ','.',toupper(states$State))

r3 <- merge(r2,states)
r4 <- merge(r3,pop)
r4$incidence <- r4$Cases/r4$pop

r5 <- subset(r4,r4$YEAR>1927,-STATE)
r6 <- r5[complete.cases(r5),]

New variable

In previous posts it became clear that there is, in general, a yearly cycle. However, the minimum of this cycle is in summer. This means that for yearly summaries it might be best not to use calendar years, but rather something that breaks during summer. My choice is week 37, which has the lowest mean incidence:
with(r6[r6$WEEK>30 & r6$WEEK<45,],
    aggregate(incidence,by=list(WEEK=WEEK),mean))

   WEEK           x
1    31 0.016757440
2    32 0.013776182
3    33 0.011313391
4    34 0.008783259
5    35 0.007348603
6    36 0.006843930
7    37 0.006528467
8    38 0.007078171
9    39 0.008652546
10   40 0.016784205
11   41 0.013392375
12   42 0.016158805
13   43 0.018391632
14   44 0.021788221
r6$cycle <- r6$YEAR + (r6$WEEK>37)
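A quick check of the bookkeeping, with a made-up year: week 37 still closes the old cycle, week 38 opens the new one.
1928 + (37 > 37)  # 1928: week 37 belongs to the old cycle
1928 + (38 > 37)  # 1929: week 38 starts the new cycle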

Plot

States over time

Since not all states have complete data, I decided to use only state-year combinations with more than 40 observations (weeks). As can be seen, there is some correlation between states, especially in 1945. If anything, the correlation gets weaker after 1955.
library(ggplot2)
ggplot(with(r6,aggregate(incidence,
                list(cycle=cycle,
                    State=State),
                function(x)
                    if(length(x)>40)
                        sum(x) else
                        NA)),
        aes(cycle, x,group=State)) +
    geom_line(size=.1) +
    ylab('Incidence registered Measles Cases Per Year') +
    theme(text=element_text(family='Arial')) +
    scale_y_log10()

Between states

I have seen too many examples of people rebuilding maps based on traveling times or distances. Now I want to do the same. A proper (Euclidean) distance between the states would make the variables the year/week combinations, which gives all kinds of scaling issues. Instead, I use correlation and transform it into something distance-like. ftime is just a helper variable, to make sure the reshape works correctly.
r6$ftime <- interaction(r6$YEAR,r6$WEEK)
xm <- reshape(r6,
    v.names='incidence',
    idvar='ftime',
    timevar='State',
    direction='wide',
    drop=c('abb','Cases','pop'))

xm2 <- xm[,grep('incidence',names(xm))]
cc <- cor(xm2,use='pairwise.complete.obs')
dimnames(cc) <- lapply(dimnames(cc),function(x) sub('incidence.','',x))
dd <- as.dist(1-cc/2)
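To see what this mapping does, a toy example with an invented correlation: perfectly correlated states end up at distance 0.5, perfectly anti-correlated ones at 1.5.
toy <- matrix(c(1,.8,.8,1), 2, 2, dimnames=list(c('A','B'),c('A','B')))
as.dist(1 - toy/2)  # A-B distance: 1 - .8/2 = 0.6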

The heatmap reveals the structure best.
heatmap(as.matrix(dd),distfun=as.dist,symm=TRUE)
The MDS plot is the nicest to look at. I will leave comparisons to the US map to those who actually know all these states' relative locations.
library(MASS)
mdsx <- isoMDS(dd)
par(mai=rep(0,4))
plot(mdsx$points,
    type = "n",
    axes=FALSE,
    xlim=c(-1,1),
    ylim=c(-1,1.1))
text(mdsx$points, labels = dimnames(cc)[[1]])



References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Monday, April 21, 2014

High incidence in Measles Data in Project Tycho

In this third post on Measles data I want to have a look at some high incidence occasions. As described before, the data is from Project Tycho, which contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.

Data

Data reading follows the posting Looking at Measles Data in Project Tycho, part II. In the plot there, some values over 10 seemed to be displayed, which converts to 10 cases per 1000 inhabitants in one week.
r6 <- r5[complete.cases(r5),]
r6[r6$incidence>10,]

      YEAR abb WEEK Cases   State pop incidence
49841 1939  MT   19  5860 Montana 555  10.55856
51076 1939  WY   17  3338 Wyoming 248  13.45968
51090 1939  WY   18  2509 Wyoming 248  10.11694
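The Montana value is easy to verify by hand, since the population is given in thousands of inhabitants:
5860/555  # Montana 1939, week 19: 10.55856 cases per 1000 inhabitants per week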

Indeed, in 1939 three values are over 10. I have always thought you could only catch measles once, so this suggests a number of years with hardly any measles must have preceded it.

Time

To have a decent plot I need a decent time variable.

Quick and dirty

My quick and dirty approach was to add a small fraction for weeks:
r6$time <- r6$YEAR+
        (r6$WEEK-.5)/with(r6,ave(WEEK,YEAR,State,FUN=max)+1)
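For example, a hypothetical week 17 in a year whose highest reported week is 52 lands at:
1939 + (17 - .5)/(52 + 1)  # about 1939.311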

Create a date using formatting

After reading the post Date formating in R I tried a different approach. According to the manual:
%U
Week of the year as decimal number (00–53) using Sunday as the first day 1 of the week (and typically with the first Sunday of the year as day 1 of week 1). The US convention.

Unfortunately for me that did not work out for reading data:
> as.Date('02 1929',format='%U %Y')
[1] "1929-04-20"

The 20th of April is the day I ran the code, not the second week of 1929. It seems %U is ignored in the R I compiled.

Reconstructing a date

I thought to start at the correct date and just add 7 for each week:
uu <- unique(data.frame(YEAR=r5$YEAR,WEEK=r5$WEEK))
uu <- uu[order(uu$YEAR,uu$WEEK),]
uu$Date <- as.Date('01-01-1928',
    '%d-%m-%Y')+seq(0,by=7,length.out=nrow(uu))+3
r7 <- merge(r6,uu)
tail(uu)

       YEAR WEEK       Date
112196 1970   47 1970-11-25
112197 1970   48 1970-12-02
112177 1970   49 1970-12-09
112183 1970   50 1970-12-16
112176 1970   51 1970-12-23
112191 1970   52 1970-12-30
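As a small check, the constructed series starts three days into 1928, on a Wednesday (the day name depends on your locale):
min(uu$Date)                # "1928-01-04"
format(min(uu$Date),'%A')   # "Wednesday" in an English locale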

Note that I cannot confirm the correct date. The second day of 1963 formats to week 0, which does not match my data. The year is correct, though.
format(as.Date('1963-01-02'),format='%d %b %Y week: %U')
[1] "02 Jan 1963 week: 00"

Plot

The plot is at this point easy.
library(ggplot2)
png('wm.png')
ggplot(r7[r7$State %in% c('Wyoming','Montana') &
                r7$YEAR<1945 &
                r7$YEAR>1930 &
                r7$incidence >0,],
        aes(Date, incidence,group=State,colour=State)) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    theme(text=element_text(family='Arial')) +
    geom_line() +
    scale_y_log10()
dev.off()

Indeed, the years before 1939 have a lower incidence of measles. What surprised me is that the first years after 1939 also have a lower incidence.

References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Extra

Completely unrelated, but if you are living in Amsterdam, the photo expo "Do you see me?" in Cafe Restaurant Nel might be worth a visit.


Sunday, April 13, 2014

Following open courseware

I love massive open online courses such as provided on Coursera and edX. So I enrolled in the Data Analysis for Genomics course on edX. I am not alone there as seen from this posting on FreshBiostats.

I was shocked when I took the Pre-Course R self-assessment, imagining this would be easy: click through some answers and done. But then I read questions like "Use match() and indexing..." and "How would you use split() to...". If I ever used match() and split(), it must have been ages ago. I had to use the help just to know what they did.
So I am wondering how many other basic R functions I have forgotten. I remember searching for quite some time because I forgot ave(); I must have forgotten that one 3 or 4 times. Same for Reduce().
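For my own future reference, minimal reminders of the ones I keep forgetting (toy inputs, invented for illustration):
# ave(): group summaries, aligned with the input vector
ave(c(1,2,3,4), c('a','a','b','b'), FUN=mean)  # 1.5 1.5 3.5 3.5
# Reduce(): fold a function over a vector or list
Reduce(`+`, 1:3)  # 6
# match(): positions of first matches in a table
match(c('b','a'), c('a','b','c'))  # 2 1
# split(): divide data according to a factor
split(1:4, c('a','b','a','b'))  # $a: 1 3, $b: 2 4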
I am almost certain I have reinvented the wheel once or twice, not knowing the function was in a package sitting right on my computer. Hence, even though it is boring, I find it a good thing to get down to basics again, but it would be nice to run these lectures at 1.5 speed.
Then the course suggests RStudio, where my preference is Eclipse with StatET. It is probably good to be out of my comfort zone, but I don't expect my taste to change.
Finally, programs come in a markup language (.Rmd files). I am curious to learn whether they manage to create added value out of that. Let week 2 begin.

Sunday, April 6, 2014

Looking at Measles Data in Project Tycho, part II

Continuing from last week, I will now look at incidence rates of measles in the US. To recap, Project Tycho contains data from all weekly notifiable disease reports for the United States dating back to 1888. These data are freely available to anybody interested.

Data

Tycho data

This follows the same pattern as last week.
r1 <- read.csv('MEASLES_Cases_1909-1982_20140323140631.csv',
    na.strings='-',
    skip=2)
r2 <- reshape(r1,
    varying=names(r1)[-c(1,2)],
    v.names='Cases',
    idvar=c('YEAR' , 'WEEK'),
    times=names(r1)[-c(1,2)],
    timevar='STATE',
    direction='long')
r2$STATE=factor(r2$STATE)

Population

The absolute counts are of less interest; what is relevant is the number per 1000 or 100 000 inhabitants, hence population data are needed. Luckily these can be found at census.gov. Less luckily, they sit in one text file per decade, and the decades have slightly different layouts. On top of that, rather than printing all columns next to each other, each file splits them into two sections. Based on my investigation last week, years 1920 to 1970 are retrieved. Note that the units are thousands of inhabitants.
years <- dir(pattern='\\.txt$')
pop1 <-
    lapply(years,function(x) {
            rl <- readLines(x)
            print(x)
            sp <- grep('^U.S.',rl)
            st1 <- grep('^AL',rl)
            st2 <- grep('^WY',rl)
            rl1 <- rl[c(sp[1]-2,st1[1]:st2[1])]
            rl2 <- rl[c(sp[2]-2,st1[2]:st2[2])]
           
            read1 <- function(rlx) {
                rlx[1] <- paste('abb',rlx[1])
                rlx <- gsub(',','',rlx,fixed=TRUE)
                rt <- read.table(textConnection(rlx),header=TRUE)
                rt[,grep('census',names(rt),invert=TRUE)]
            }
            rr <- merge(read1(rl1),read1(rl2))
            ll <- reshape(rr,
                list(names(rr)[-1]),
                v.names='pop',
                timevar='YEAR',
                idvar='abb',
                times=as.integer(gsub('X','',names(rr)[-1])),
                direction='long')
        })

[1] "st2029ts.txt"
[1] "st3039ts.txt"
[1] "st4049ts.txt"
[1] "st5060ts.txt"
[1] "st6070ts.txt"

pop <- do.call(rbind,pop1)
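A quick look at the combined table; the row count also shows up in the object listing at the end of this post:
dim(pop)  # 2592 rows, 3 columns (abb, YEAR, pop)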

Glue between datasets

It is not shown in the printout above, but the census data contains states as two-character abbreviations, while the Tycho data has states as uppercase text with spaces replaced by dots, because they started out as variable names. Some glue is needed. This can be done via the state data in the datasets package. DC is added manually, since it was not in the state data but is present in both source data sets. Incidence is Cases/pop and has as units cases per 1000 inhabitants per week. Variable STATE has served its purpose, so it is removed.
states <- rbind(
    data.frame(
        abb=datasets::state.abb,
        State=datasets::state.name),
    data.frame(abb='DC',
        State='District of Columbia'))
states$STATE=gsub(' ','.',toupper(states$State))

r3 <- merge(r2,states)
r4 <- merge(r3,pop)
r4$incidence <- r4$Cases/r4$pop
r5 <- subset(r4,r4$YEAR>1927,-STATE)
head(r5)

      YEAR abb WEEK Cases   State  pop   incidence
20434 1928  AL   44     6 Alabama 2640 0.002272727
20435 1928  AL   27    45 Alabama 2640 0.017045455
20436 1928  AL   47    22 Alabama 2640 0.008333333
20437 1928  AL   26   106 Alabama 2640 0.040151515
20438 1928  AL   30    66 Alabama 2640 0.025000000
20439 1928  AL   18   251 Alabama 2640 0.095075758
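As a small illustration of the glue, this is how a state name turns into the Tycho column label (invented input, same transformation as above):
gsub(' ','.',toupper('New York'))  # "NEW.YORK"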

Those who think R has a memory problem may want to delete some data here (r2, r3); see the end of this post. On the other hand, the objects are not that large.
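If you do want to clean up, a minimal sketch (note that the object listing at the end of this post assumes you did not):
rm(r2, r3)
gc()  # run the garbage collector; also reports memory use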

Plots

Data are now on comparable scales, so I tried boxplots. By 1966 it seems measles was under control.

library(ggplot2)
ggplot(with(r5,aggregate(incidence,list(Year=YEAR,State=State),
                function(x) mean(x,na.rm=TRUE))),
        aes(factor(Year), x)) +
    geom_boxplot(notch = TRUE) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    theme(text=element_text(family='Arial')) +
    scale_x_discrete(labels=
            ifelse(levels(factor(r5$YEAR)) %in%
                    seq(1920,1970,by=10),
                levels(factor(r5$YEAR)),
                ''))

The pattern within the years is still clearly visible: a slow increase from week 38 to week 20 of the next year, then a fast decrease in the remaining 18 weeks.

ggplot(r5[r5$YEAR<1963,],
        aes(factor(WEEK), incidence)) +
    ylab('Incidence registered Measles Cases per week per 1000') +
    ggtitle('Measles 1928-1962')+
    theme(text=element_text(family='Arial')) +
    geom_boxplot(notch = TRUE) +
    scale_x_discrete(labels=
            ifelse(levels(factor(r5$WEEK)) %in%
                    seq(5,55,by=5),
                levels(factor(r5$WEEK)),
                ''))    +
    scale_y_log10()

References

Willem G. van Panhuis, John Grefenstette, Su Yon Jung, Nian Shong Chok, Anne Cross, Heather Eng, Bruce Y Lee, Vladimir Zadorozhny, Shawn Brown, Derek Cummings, Donald S. Burke. Contagious Diseases in the United States from 1888 to the present. NEJM 2013; 369(22): 2152-2158.

Deleting data?

These are not the smallest objects:
#https://gist.github.com/jedifran/1447362
.ls.objects <- function (pos = 1, pattern, order.by,
    decreasing=FALSE, head=FALSE, n=5) {
    napply <- function(names, fn) sapply(names, function(x)
                fn(get(x, pos = pos)))
    names <- ls(pos = pos, pattern = pattern)
    obj.class <- napply(names, function(x) as.character(class(x))[1])
    obj.mode <- napply(names, mode)
    obj.type <- ifelse(is.na(obj.class), obj.mode, obj.class)
    obj.prettysize <- napply(names, function(x) {
            capture.output(print(object.size(x), units = "auto")) })
    obj.size <- napply(names, object.size)
    obj.dim <- t(napply(names, function(x)
                as.numeric(dim(x))[1:2]))
    vec <- is.na(obj.dim)[, 1] & (obj.type != "function")
    obj.dim[vec, 1] <- napply(names, length)[vec]
    out <- data.frame(obj.type, obj.size, obj.prettysize, obj.dim)
    names(out) <- c("Type", "Size", "PrettySize", "Rows", "Columns")
    if (!missing(order.by))
        out <- out[order(out[[order.by]], decreasing=decreasing), ]
    if (head)
        out <- head(out, n)
    out
}

# shorthand
lsos <- function(..., n=10) {
    .ls.objects(..., order.by="Size", decreasing=TRUE, head=TRUE, n=n)
}

lsos()

             Type     Size PrettySize   Rows Columns
r2     data.frame 20879368    19.9 Mb 231660       4
r4     data.frame  4880608     4.7 Mb 135233       8
r3     data.frame  4737864     4.5 Mb 196911       6
r5     data.frame  4140816     3.9 Mb 114800       7
r1     data.frame   965152   942.5 Kb   3861      62
pop1         list   204752     200 Kb      5      NA
pop    data.frame   182312     178 Kb   2592       3
states data.frame    11144    10.9 Kb     51       3
lsos     function     5400     5.3 Kb     NA      NA
years   character      368  368 bytes      5      NA


But how large is 20 Mb these days? I have 4 Gb of RAM. At the end of making this post, 1.5 Gb is used.
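For a single object the same number comes straight from object.size(), no helper needed:
print(object.size(r2), units='auto')  # 19.9 Mb for r2 here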