Sunday, August 17, 2014

Quicksort speed, just in time compiling and vectorizing

I was reading the Julia documentation the other day. They do speed comparisons with other languages, and obviously R does not come out very well. The R code for quicksort is here and I noticed it was not vectorized at all, so I wondered if it could be improved. A quick check on Wikipedia showed that the algorithm displayed there is the one used by the Julia team. By abandoning this structure and using the just-in-time compiler, some extra speed can be gained. An additional bonus is that the algorithm actually became simpler. It should be noted (Wikipedia) that 'Quicksort can be implemented with an in-place partitioning algorithm, so the entire sort can be done with only O(log n) additional space'. Such an implementation was used by the Julia team; however, profiling suggests that this limited additional space was not achieved with the R implementation.

Algorithm

To quote wikipedia again:
  • Pick an element, called a pivot, from the array.
  • Reorder the array so that all elements with values less than the pivot come before the pivot, while all elements with values greater than the pivot come after it (equal values can go either way). After this partitioning, the pivot is in its final position. This is called the partition operation.
  • Recursively apply the above steps to the sub-array of elements with smaller values and separately to the sub-array of elements with greater values.

Or, roughly, in R:
wfqs1 <- function(x) {
  if (length(x)<2) return(x)
  pivot <- x[sample(length(x),1)]
  c(wfqs1(x[x<pivot]),x[x==pivot],wfqs1(x[x>pivot]))
}
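As a quick sanity check (not in the original post), the result can be compared against base::sort() on random input:

# hypothetical check: wfqs1() should agree with base::sort()
set.seed(42)
x <- runif(10)
all.equal(wfqs1(x), sort(x))  # should return TRUE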
The implementation by the Julia team does the in-place reordering:
qsort = function(a) {
  qsort_kernel = function(lo, hi) {
    i = lo
    j = hi
    while (i < hi) {
      pivot = a[floor((lo+hi)/2)]
      while (i <= j) {
        while (a[i] < pivot) i = i + 1
        while (a[j] > pivot) j = j - 1
        if (i <= j) {
          t = a[i]
          a[i] <<- a[j]
          a[j] <<- t
          i = i + 1;
          j = j - 1;
        }
      }
      if (lo < j) qsort_kernel(lo, j)
      lo = i
      j = hi
    }
  }
  qsort_kernel(1, length(a))
  return(a)
}

Variations in implementation

After coding the first version, I tried to make faster variations. The first three versions are intended to use fewer comparisons than wfqs1(). The last one, wfqsx(), is intended to build a bit more vectorization into qsort().

wfqs2 <- function(x) {
  if (length(x)<2) return(x)
  ipivot <- sample(length(x),1)
  pivot <- x[ipivot]
  c(wfqs2(x[x<pivot]),pivot,wfqs2(x[-ipivot][x[-ipivot]>=pivot]))
}
wfqs3 <- function(x) {
  if (length(x)<2) return(x)
  ipivot <- sample(length(x),1)
  pivot <- x[ipivot]
  split <- x<pivot
  split2 <- !split
  split2[ipivot] <- FALSE
  c(wfqs3(x[split]),pivot,wfqs3(x[split2]))
}
wfqs4 <- function(x) {
  if (length(x)<2) return(x)
  split <- x<x[1]
  c(wfqs4(x[split]),x[1],wfqs4(x[!split][-1]))
}

wfqsx = function(a) {
  qsort_kernel = function(lo, hi) {
    if(lo>=hi) return()
    if(hi-lo==1) {
      # only two elements left: swap if needed
      if(a[lo]>a[hi]) {
        t <- a[lo]
        a[lo] <<- a[hi]
        a[hi] <<- t
      }
      return()
    }
    # a[lo] is the pivot; classify the rest in one vectorized comparison
    goUp <- a[(lo+1):hi]>a[lo]
    up <- which(goUp)              # offsets of elements greater than the pivot
    do <- which(!goUp)             # offsets of elements at most the pivot
    pivottarget <- lo+length(do)   # final position of the pivot
    # absolute indices of elements sitting on the wrong side of pivottarget
    up <- up[up<=length(do)]+lo
    do <- do[do>length(do)]+lo
    # swap the misplaced elements in one vectorized assignment
    t <- a[do]
    a[do] <<- a[up]
    a[up] <<- t
    # move the pivot into its final position
    t <- a[pivottarget]
    a[pivottarget] <<- a[lo]
    a[lo] <<- t
    qsort_kernel(lo,pivottarget-1)
    qsort_kernel(pivottarget+1, hi)
  }
  qsort_kernel(1, length(a))
  return(a)
}
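As a quick check (not part of the original benchmarks), all variants should reproduce base::sort() on random data:

# hypothetical correctness check for all implementations
set.seed(1)
x <- runif(1000)
stopifnot(
    identical(wfqs1(x), sort(x)),
    identical(wfqs2(x), sort(x)),
    identical(wfqs3(x), sort(x)),
    identical(wfqs4(x), sort(x)),
    identical(wfqsx(x), sort(x)),
    identical(qsort(x), sort(x)))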

Analysis

Speed without JIT

The first figure shows that sorting time is roughly proportional to the number of items to be sorted. The exception is base::sort(), where up to around 100 items there is no size effect. Even though it is not very pronounced on this scale, qsort() is slowest and wfqs4() is fastest after base::sort().

The second figure shows speed relative to base::sort(). It shows that qsort() takes approximately six times as much time as wfqs4().

Profiling

Profiling shows that most of the time is simply self time in qsort_kernel(); it is not spent in any function which Rprof() registers separately. Regarding memory usage, if mem.total is anything to go by, qsort() uses quite a bit more than wfqs4().

#####     qsort    #############################################
$by.self
               self.time self.pct total.time total.pct mem.total
"qsort_kernel"      6.72    86.15       7.80    100.00    1671.4
"<GC>"              0.68     8.72       0.68      8.72     112.2
"+"                 0.16     2.05       0.16      2.05      31.5
"<"                 0.12     1.54       0.12      1.54      21.5
"-"                 0.06     0.77       0.06      0.77       5.4
"floor"             0.04     0.51       0.04      0.51      10.2
"<="                0.02     0.26       0.02      0.26       5.3

$by.total
               total.time total.pct mem.total self.time self.pct
"qsort_kernel"       7.80    100.00    1671.4      6.72    86.15
"<Anonymous>"        7.80    100.00    1671.4      0.00     0.00
"eval"               7.80    100.00    1671.4      0.00     0.00
"qsort"              7.80    100.00    1671.4      0.00     0.00
"<GC>"               0.68      8.72     112.2      0.68     8.72
"+"                  0.16      2.05      31.5      0.16     2.05
"<"                  0.12      1.54      21.5      0.12     1.54
"-"                  0.06      0.77       5.4      0.06     0.77
"floor"              0.04      0.51      10.2      0.04     0.51
"<="                 0.02      0.26       5.3      0.02     0.26

$sample.interval
[1] 0.02

$sampling.time
[1] 7.8

#####    wfqs4    ###########################################
$by.self
            self.time self.pct total.time total.pct mem.total
"wfqs4"          0.98    75.38       1.30    100.00     258.0
"<"              0.10     7.69       0.12      9.23      26.2
"<GC>"           0.08     6.15       0.08      6.15      11.8
"c"              0.06     4.62       0.08      6.15      12.0
"!"              0.04     3.08       0.04      3.08       9.0
"-"              0.02     1.54       0.02      1.54       4.6
".External"      0.02     1.54       0.02      1.54       0.0

$by.total
              total.time total.pct mem.total self.time self.pct
"wfqs4"             1.30    100.00     258.0      0.98    75.38
"<Anonymous>"       1.30    100.00     258.0      0.00     0.00
"eval"              1.30    100.00     258.0      0.00     0.00
"<"                 0.12      9.23      26.2      0.10     7.69
"<GC>"              0.08      6.15      11.8      0.08     6.15
"c"                 0.08      6.15      12.0      0.06     4.62
"!"                 0.04      3.08       9.0      0.04     3.08
"-"                 0.02      1.54       4.6      0.02     1.54
".External"         0.02      1.54       0.0      0.02     1.54
"runif"             0.02      1.54       0.0      0.00     0.00

$sample.interval
[1] 0.02

$sampling.time
[1] 1.3

JIT compiling

The JIT is a great equalizer. Every implementation profits, with the exception of base::sort(). It is kind of obvious that base::sort() does not profit from the JIT; the part that does the work should already be machine code. Among the R implementations, the various versions are much closer to each other. Small refinements are apparently swamped by the JIT. Nevertheless, wfqs4() is still fastest after base::sort(). It takes half the time of qsort().
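For reference (not part of the original benchmarks), a single function can also be byte-compiled on its own with compiler::cmpfun() instead of enabling the global JIT; wfqs4c below is just a name chosen here for the compiled copy:

# minimal sketch: byte-compile one function rather than using enableJIT(3)
library(compiler)
library(microbenchmark)
rr <- runif(10000)
wfqs4c <- cmpfun(wfqs4)
microbenchmark(wfqs4(rr), wfqs4c(rr), times=10)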




Discussion

It cannot be stated enough: when programming in R, vectorize. Vectors make for compact code. Vectors are efficient; a vectorized calculation which does more actual computations can beat complex loops which avoid computations. The speed effect is reduced when the JIT is used. Enabling the JIT should therefore be the first step when an R algorithm is too slow. The second step is starting the profiler and looking for (vectorized) optimizations. When these two fail it is time to roll out the big guns: (inline) C code, a faster computer, or switching to a faster language altogether.
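To illustrate the 'big guns' option (a sketch that is not part of this post's benchmarks and assumes the Rcpp package), a few lines of inline C++ already give a compiled sort; std::sort() uses introsort, which is quicksort based:

# hypothetical inline C++ route via Rcpp, for illustration only
library(Rcpp)
cppFunction(includes='#include <algorithm>', code='
NumericVector qs_cpp(NumericVector x) {
  NumericVector y = clone(x);     // work on a copy, like the R versions
  std::sort(y.begin(), y.end());  // introsort: quicksort with fallbacks
  return y;
}')
qs_cpp(runif(5))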

Additional code

library(microbenchmark)
library(ggplot2)
la <- lapply(1:15,function(i) {
      n <- 10*2^i
      rr <- runif(n)
      mb <- microbenchmark(
          wfqs1(rr),
          wfqs2(rr),
          wfqs3(rr),
          wfqs4(rr),
          wfqsx(rr),
          sort(rr),
          qsort(rr),
          times=10)
      ag <- aggregate(mb$time,list(method=mb$expr),median)
      ag$x <- ag$x/1e6  # microbenchmark times are in nanoseconds; convert to ms
      ag$n <- n
      ag
    })
res <- do.call(rbind,la)
png('qs1.nc.png')
p <- ggplot(res, aes(y=x, x=n,col=method))
p +geom_line() +
    scale_y_log10('Median sorting time (ms)') +
    scale_x_log10() +
    labs(x='Number of items',title='No compiling' ) 
dev.off()

resref <- subset(res,res$method=='sort(rr)',-method)
names(resref)[names(resref)=='x'] <- 'tsort' 
res2 <- merge(res,resref)
res2$reltime <- res2$x/res2$tsort
png('qs2.nc.png')
p <- ggplot(res2, aes(y=reltime, x=n,col=method))
p +geom_line() +
    scale_x_log10() +
    labs(x='Number of items',
        y='Median relative sorting time',
        title='No compiling' ) 
dev.off()

Rprof("Rprofqs.out",memory.profiling = TRUE, gc.profiling = TRUE)
y = qsort(runif(100000))
Rprof(NULL)
Rprof("Rprofw4.out",memory.profiling = TRUE, gc.profiling = TRUE)
y = wfqs4(runif(100000))
Rprof(NULL)

summaryRprof(filename = "Rprofqs.out",memory='both')
summaryRprof(filename = "Rprofw4.out",memory='both')

require(compiler)
enableJIT(3)
la <- lapply(1:15,function(i) {
      n <- 10*2^i
      rr <- runif(n)
      mb <- microbenchmark(
          wfqs1(rr),
          wfqs2(rr),
          wfqs3(rr),
          wfqs4(rr),
          wfqsx(rr),
          sort(rr),
          qsort(rr),
          times=10)
      ag <- aggregate(mb$time,list(method=mb$expr),median)
      ag$x <- ag$x/1e6  # microbenchmark times are in nanoseconds; convert to ms
      ag$n <- n
      ag
    })
res <- do.call(rbind,la)
enableJIT(0)
png('qs1.jit3.png')
p <- ggplot(res, aes(y=x, x=n,col=method))
p +geom_line() +
    scale_y_log10('Median sorting time (ms)') +
    scale_x_log10() +
    labs(x='Number of items',title='Enable JIT 3' ) 
dev.off()

resref <- subset(res,res$method=='sort(rr)',-method)
names(resref)[names(resref)=='x'] <- 'tsort' 
res2 <- merge(res,resref)
res2$reltime <- res2$x/res2$tsort
png('qs2.jit3.png')
p <- ggplot(res2, aes(y=reltime, x=n,col=method))
p +geom_line() +
    scale_x_log10() +
    labs(x='Number of items',
        y='Median relative sorting time',
        title='Enable JIT 3' ) 
dev.off()

Sunday, August 10, 2014

Guns are cool - Regions

This was supposed to be a post in which General Social Survey (GSS) data were used to understand a bit more about the causes of differences between states. It was thus meant to give additional insight beyond my previous post, Guns are Cool - Differences between states. Unfortunately, that did not work out so well, and it ended up as a kind of investigation of regions.

Data

The data are the shootingtracker data; see the bottom of the post for the code.

Analysis

The GSS has data by region rather than by state, hence incidences by region were calculated. After aggregation the following results were obtained. From an analysis point of view, New England is somewhat of an outlier: it is so much lower than the rest that any model should cater to this feature. Since there are only nine regions, which is quite few for an analysis, and potentially tens of independent variables, I gave up on linking this incidence to the GSS.
         StateRegion shootings Population  Incidence
5        New England        12   14618806 0.05148008
4           Mountain        30   22881245 0.08222644
8 West North Central        30   20885710 0.09008280
9 West South Central        58   37883604 0.09601666
3    Middle Atlantic        68   41970716 0.10160906
6            Pacific        91   51373178 0.11108997
7     South Atlantic       110   61137198 0.11283843
2 East South Central        36   18716202 0.12062981
1 East North Central       101   46662180 0.13574575

Regions and days

I had been wondering if the weekday effect observed before would vary between states. But spreading the data over all states would thin the data too much and make plots unappealing; the nine regions are much better suited. While the Sunday/weekend effect is present in all regions, its size differs markedly. Sundays seem particularly bad in West North Central. Three regions, New England, West South Central and Mountain, have no shootings on one weekday.

Hierarchical model for estimates of incidence

In Guns are Cool - States I lamented that some states may have looked worse because the model pulled them too strongly. The regions may provide a solution for this: like goes with like. After some twiddling of the model and quite some tweaking of hyperpriors, a new plot of the states' incidence was made. It should be noted that I am not happy about the parameters a[] and b[]; their Rhat is 1.15 and their effective sample size is 20. Partly this is because they covary. The betamean parameter calculated from a[] and b[] looks much better. It should also be noted that during development an older version of the 2014 data was used, which made New England look better.
In the plot the states are a bit closer to each other than before, which is not what I expected. New England is near the bottom, as is to be expected; New Hampshire and Vermont hence look better. Illinois is now worst, since it has fewer states to pull its score downwards.
For completeness' sake it should be noted that this calculation was done with the Census Bureau regions. A look at Wikipedia provides more US region schemes; the Standard Federal Regions may provide a better split for these data. States in Region VIII (Colorado, Montana, North Dakota, South Dakota, Utah, Wyoming) and Region X (Alaska, Idaho, Oregon, Washington) then have lower posterior incidences. But since this division was tested because it looked more suitable, especially for the states in VIII and X, it has an unknown selection bias and hence its results are not presented.
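In formula form (as read from the JAGS code below), the model is a beta-binomial with truncated normal hyperpriors; the reported region-level quantity is the scaled beta mean:

\begin{aligned}
\text{count}_i &\sim \mathrm{Binomial}(\text{Population}_i,\; p_i)\\
p_i &\sim \mathrm{Beta}\!\left(a_{r(i)},\, b_{r(i)}\right)\\
a_r &\sim \mathrm{N}(aa,\, \sigma_{aa}^2)\,\mathrm{T}(0,\infty), \qquad b_r \sim \mathrm{N}(bb,\, \sigma_{bb}^2)\,\mathrm{T}(0,\infty)\\
\text{betamean}_r &= \text{scale}\cdot \frac{a_r}{a_r+b_r}
\end{aligned}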

Code

r13 <- readLines('raw13.txt')
r14 <- readLines('raw14.txt')
r1 <- c(r13,r14)
r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)  # drop bracketed reference markers
r3 <- gsub('^ *$','',r2)                # blank out whitespace-only lines
r4 <- r3[r3!='']                        # drop empty lines
r5 <- gsub('\\t$','',r4)                # strip trailing tabs
r6 <- gsub('\\t References$','',r5)     # strip the trailing 'References' column
r7 <- read.table(textConnection(r6),
    sep='\t',
    header=TRUE,
    stringsAsFactors=FALSE)
r7$Location[r7$Location=='Washington DC'] <-
    'WashingtonDC, DC'
r8 <- read.table(textConnection(as.character(r7$Location)),
    sep=',',
    col.names=c('Location','State'),
    stringsAsFactors=FALSE)
r8$State <- gsub(' ','',r8$State)
r8$State[r8$State=='Tennessee'] <- 'TN'
r8$State[r8$State=='Ohio'] <- 'OH'
r8$State[r8$State %in% c('Kansas','KA')] <- 'KS'
r8$State[r8$State=='Louisiana'] <- 'LA'
r8$State[r8$State=='Illinois'] <- 'IL'
r8$State <- toupper(r8$State)
table(r8$State)
r7$StateAbb <- r8$State
r7$Location <- r8$Location
r7 <- r7[! (r7$State %in% c( 'PUERTORICO','PR')),]
r7$Date <- gsub('/13$','/2013',r7$Date)
r7$date <- as.Date(r7$Date,format="%m/%d/%Y")

states <- data.frame(
    StateAbb=as.character(state.abb),
    StateRegion=state.division,
    State=as.character(state.name)
)
states <- rbind(states,data.frame(StateAbb='DC',
        State='District of Columbia',
        StateRegion= 'Middle Atlantic'))
# http://www.census.gov/popest/data/state/totals/2013/index.html
inhabitants <- read.csv('NST-EST2013-01.treated.csv')
#put it all together

states <- merge(states,inhabitants)
r9 <- merge(r7,states)
Sys.setlocale(category = "LC_TIME", locale = "C")
r9$weekday <- factor(format(r9$date,'%a'),
    levels=format(as.Date('07/07/2014',format="%m/%d/%Y")+0:6,'%a'))
alldays <- data.frame(
    date=seq(min(r9$date),max(r9$date),by=1))
alldays$weekday <- factor(format(alldays$date,'%a'),
    levels=levels(r9$weekday))

agg1 <- as.data.frame(xtabs(~StateRegion+weekday,data=r9))
names(agg1)[names(agg1)=='Freq'] <- 'shootings'
agg2 <- with(states,
    aggregate(Population,
        by=list(StateRegion=StateRegion),
        FUN=sum))
names(agg2)[names(agg2)=='x'] <- 'population'

agg12 <- merge(agg1,agg2,all=TRUE)
agg3 <- as.data.frame(table(alldays$weekday))
names(agg3) <- c('weekday','ndays')
agg123 <- merge(agg12,agg3)
agg123$incidence <- 100000*agg123$shootings/agg123$population*365/agg123$ndays

#########################

agg4 <- as.data.frame(xtabs(~StateRegion,data=r9))
names(agg4)[names(agg4)=='Freq'] <- 'shootings'
agg5 <- as.data.frame(xtabs(Population ~StateRegion,data=states))
names(agg5)[names(agg5)=='Freq'] <- 'Population'
agg45 <- merge(agg4,agg5)
agg45$Incidence <- 100000*365*
    agg45$shootings/agg45$Population/
    as.numeric((max(r7$date)-min(r7$date)))
agg45[order(agg45$Incidence),c(1:4)]
#########################

library(ggplot2)
agg123$Region <- agg123$StateRegion
levels(agg123$Region) <-  gsub(' ','\n',levels(agg123$Region))

ggplot(data=agg123,
        aes(y=incidence,x=weekday)) +
    geom_bar(stat='identity') +
    ylab('Incidence') +
    xlab('') +
    coord_flip() +
    facet_grid(~Region  )

########

r10 <- merge(as.data.frame(xtabs(~StateAbb,data=r9)),states,all=TRUE)
r10$Freq[is.na(r10$Freq)] <- 0
r10$Incidence <- r10$Freq*100000*365/r10$Population/
    as.numeric((max(r7$date)-min(r7$date)))

library(R2jags)
datain <- list(
    count=r10$Freq,
    Population = r10$Population,
    n=nrow(r10),
    nregion =nlevels(r10$StateRegion),
    Region=(1:nlevels(r10$StateRegion))[r10$StateRegion],
    scale=100000*365/
        as.numeric((max(r7$date)-min(r7$date))))

model1 <- function() {
  for (i in 1:n) {
    count[i] ~ dbin(p[i],Population[i])
    p[i] ~ dbeta(a[Region[i]],b[Region[i]])
    pp[i] <- p[i]*scale
  }
  for (i in 1:nregion) {
     a[i] ~ dnorm(aa,tauaa) %_% T(0,)
     b[i] ~ dnorm(bb,taubb) %_% T(0,)
     betamean[i] <- scale * a[i]/(a[i]+b[i])
   }
   tauaa <- pow(sigmaaa,-2)
   sigmaaa ~dunif(0,1e5)
   taubb <- pow(sigmabb,-2)
   sigmabb ~dunif(0,1e8)
   aa ~ dunif(0,1e5)
   bb ~ dunif(0,1e8) 
}

parameters <- c('a','b','betamean','pp','aa','bb','sigmaaa','sigmabb')
jagsfit <- jags(datain, 
    model=model1, 
    parameters=parameters,
    n.iter=250000,
    n.burnin=100000,
    n.chain=4,
    progress.bar="gui")
jagsfit
traceplot(jagsfit)
#dimnames(jagsfit$BUGSoutput$sims.matrix)[2]
samples <- jagsfit$BUGSoutput$sims.matrix[,31:81]
r10$pred <- apply(samples,2,mean)
r10$l05 <- apply(samples,2,function(x) quantile(x,.05))
r10$l95 <- apply(samples,2,function(x) quantile(x,.95))

r11 <- subset(r10,,c(State,l05,l95,pred,Incidence,StateRegion))
r11 <- r11[order(r11$pred),]
r11$State <- factor(r11$State,levels=r11$State)
p <- ggplot(r11, aes(y=pred, x=State,col=StateRegion))
limits <- aes(ymin =l05, ymax=l95)
dodge <- position_dodge(width=0.9)

p +geom_point(position=dodge, stat="identity") +
    geom_errorbar(limits, position=dodge, width=0.25) +
    theme(legend.position = "bottom") +
    geom_point(aes(y=Incidence,x=State),
        position=dodge, stat="identity",col='black') +
    ylab('Incidence') +
    guides(colour=guide_legend(title='Region',nrow=3)) +
    coord_flip()  


Sunday, August 3, 2014

Guns are Cool - Differences between states

Last week my blog showed that there are differences between states in the shootingtracker database. This week an attempt is made to understand why states differ. A number of variables were extracted from a few sources, among which gun laws, percentage of gun owners and a few more general variables. It appeared that among the examined variables, % high school or higher and % gun owners are the most important. The latter has a markedly non-linear relation; the worst level of gun ownership is around 30%.

Data

The shootingtracker data are the same as in the last post.
Gun laws were difficult to capture. Laws are full of exemptions and details, whereas statistics prefers variables with a few levels. Wikipedia has a table for each state, and the rules listed vary with each state. Luckily I also found a summary from Nick Leghorn containing the 2011 laws in a nice tabular format. Some additional coding has been done to make this more usable; the values were categorized as Yes, No and Other.
The percentage of gun owners is another variable that is difficult to find. A lucky find was on about.com, where 2012 data could be found.
As additional variables, States IN Profile was used to extract income per capita and education information. For this a full-dataset Excel sheet was downloaded, after which the data were reduced to a much smaller version, among other things by removing all but the latest year in each sheet, via editing in LibreOffice.
All in all, quite a number of choices were made to obtain the data. I feel quite certain that different choices would bring about different analysis results.

Analysis

Three analysis methods were used. Random forest, because it allows easy mixing of continuous variables and factors; I find it usually gives me a good idea which variables are of influence. Generalized linear regression, because it should be a default approach. A generalized additive model, because it can handle non-linearity in continuous variables without falling back on the standard polynomials; I do not really believe in a world with only linear relations, that is just too simple.

RandomForest

Random forest shows that the percentage with high school or more (PerHSplus) is the most important variable, followed by the percentage of gun owners (PercGunOwner). The gun laws are at the bottom of the list.

Generalized Linear Regression

Generalized linear regression again showed PercGunOwner and PerHSplus as the most important variables, though both have a p-value of approximately 0.09, which is not strong.
Analysis of Deviance Table (Type II tests)

Response: cbind(x, Population - x)
                 LR Chisq Df Pr(>Chisq)  
PermitPurchHandG  0.66719  1    0.41403  
PermitPurchRifle  0.07659  1    0.78198  
PermitPossess     1.32612  2    0.51527  
Register          0.13776  2    0.93344  
AssualtWBan       2.31216  2    0.31472  
WaitingPeriod     0.16775  2    0.91955  
Other             1.26820  1    0.26011  
PercGunOwner      2.80036  1    0.09424 .
income            1.29039  1    0.25598  
PerHSplus         2.77038  1    0.09602 .
PerBAplus         1.83276  1    0.17580  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Generalized Additive Model

The gam package allows some model selection, which I decided to use. Only the continuous variables were used, but then the factors, which encompass the gun laws, did not really show influence in the previous analyses. It appears both PercGunOwner and PerHSplus have a non-linear relation with shootings.
Call:
gam(formula = cbind(x, Population - x) ~ 
    s(PercGunOwner, df = 2) + 
    s(PerHSplus, df = 2), 
    family = binomial, data = r13, trace = FALSE)

Degrees of Freedom: 48 total; 43.99996 Residual
Residual Deviance: 85.09242 

Plot

The figure shows a marked curvature in gun ownership. Worst is low education combined with a medium level of gun ownership. Best is high education, where gun ownership does not matter very much.

Discussion

Getting the same variables three times is a good indication that, within the data used, these are the variables of relevance. However, it does not preclude that other variables are relevant.
It is odd that the gun laws had no influence. However, a number of explanations may be found. The variables may not express the laws' essential properties. Gun laws may not be effective, at least not over the range of laws within the USA. The relation may be different from 'laws reduce shootings', for instance 'shootings cause laws'. To make that more explicit: why would a lawmaker in Montana, with a high percentage of gun owners and no shootings, make a gun law? It won't win voters. On the other hand, the number of shootings in DC in last week's post indicates something needs to be done.
That a low level of education goes with more shootings makes some kind of sense. It might be a direct relation, but just as probably an indicator for a whole set of related variables.
The relation with ownership is interesting. Above some percentage of ownership, shootings decrease. Whether there is a common cultural factor influencing both, or a direct causation, cannot be stated based on these data. The relation 'fewer guns gives fewer shootings' below 25% ownership is at least a bit more intuitive.

Code

Reading the shootingtracker data has not changed since last week.
r13 <- readLines('raw13.txt')
r14 <- readLines('raw14.txt')
r1 <- c(r13,r14)
r2 <- gsub('\\[[a-zA-Z0-9]*\\]','',r1)
r3 <- gsub('^ *$','',r2)
r4 <- r3[r3!='']
r5 <- gsub('\\t$','',r4)
r6 <- gsub('\\t References$','',r5)
r7 <- read.table(textConnection(r6),
    sep='\t',
    header=TRUE,
    stringsAsFactors=FALSE)
r7$Location[r7$Location=='Washington DC'] <-
    'WashingtonDC, DC'
r8 <- read.table(textConnection(as.character(r7$Location)),
    sep=',',
    col.names=c('Location','State'),
    stringsAsFactors=FALSE)
r8$State <- gsub(' ','',r8$State)
r8$State[r8$State=='Tennessee'] <- 'TN'
r8$State[r8$State=='Ohio'] <- 'OH'
r8$State[r8$State %in% c('Kansas','KA')] <- 'KS'
r8$State[r8$State=='Louisiana'] <- 'LA'
r8$State[r8$State=='Illinois'] <- 'IL'
r8$State <- toupper(r8$State)
table(r8$State)
r7$StateAbb <- r8$State
r7$Location <- r8$Location
r7 <- r7[! (r7$State %in% c( 'PUERTORICO','PR')),]
r7$Date <- gsub('/13$','/2013',r7$Date)
r7$date <- as.Date(r7$Date,format="%m/%d/%Y")

r7$one <- 1
agg1 <- with(r7,
    aggregate(one,by=list(StateAbb=StateAbb),FUN=sum))
library(datasets)
states <- data.frame(StateAbb=as.character(state.abb),
    StateArea=state.area,
    State=as.character(state.name),
    StateRegion=state.region,
    Frost=state.x77[,'Frost']
)
# http://www.census.gov/popest/data/state/totals/2013/index.html
inhabitants <- read.csv('NST-EST2013-01.treated.csv')
#put it all together
states <- rbind(states,data.frame(StateAbb='DC',
        StateArea=68.3,
        State='District of Columbia',
       StateRegion= 'South',Frost=NA))
r9 <- merge(states,inhabitants)
r10 <- merge(r9,agg1,all=TRUE)
# No data in some states
r10$x[is.na(r10$x)] <- 0

r10$incidence <- 100000*r10$x/r10$Population*365/
        as.numeric((max(r7$date)-min(r7$date)))

Gun laws

library(XML)
rh <- readHTMLTable('http://nickleghorn.com/flaws2011.html',
    stringsAsFactors = FALSE)
laws <- rh[[1]]
names(laws) <- c('State',
    'PermitPurchHandG',
    'PermitPurchRifle',
    'PermitPossess',
    'Register',
    'AssualtWBan',
    'WaitingPeriod',
    'Other')
for (i in 2:(ncol(laws)-1)) {
  laws[!(laws[,i] %in% c('Yes','No')),i] <- 'Other'
  laws[,i] <- factor(laws[,i],levels=c('Yes','No','Other'))
}
laws[!(laws[,8] %in% c('None')),8] <- 'Other'
laws[,8] <- factor(laws[,8],levels=c('None','Other'))
r11 <- merge(r10,laws)

Gun ownership

# http://usliberals.about.com/od/Election2012Factors/a/Gun-Owners-As-Percentage-Of-Each-States-Population.htm
lines1 <- readLines('ownership.txt')
lines1 <- gsub('\\(.*\\)','',lines1)
lines1 <- gsub('^[0-9]*. ','',lines1)
lines1 <- gsub('% *$','',lines1)
lines1 <- gsub(' - ',',',lines1,fixed=TRUE)
r12 <- merge(r11,
    read.csv(textConnection(lines1),col.names=c('State','PercGunOwner')))

General/states in profile

library(xlsx)
sheet1 <- read.xlsx('statesInProfile_dataExtract.selected.xlsx',
    sheetName='Ed Attainment (ACS 2008)')
sheet1 <- sheet1[sheet1$state != 'US',]
sheet1x <- data.frame(StateAbb=sheet1$state,
    PerHSplus = as.numeric(as.character(sheet1$X._Receiving_HS_Diploma_or_more)),
    PerBAplus = as.numeric(as.character(sheet1$Total_Completing_BA_Degree_or_More)))
sheet2 <- read.xlsx('statesInProfile_dataExtract.selected.xlsx',
    sheetName='BEA Per Capita Income')
sheet2 <- sheet2[sheet2$stabb != 'US',]
sheet2x <- data.frame(StateAbb=factor(sheet2$stabb),
    income=as.numeric(as.character(sheet2$data)))
sheets <- merge(merge(sheet1x,sheet2x),sheet1x)
r13 <- merge(r12,sheets)

Analysis

Random forest

library(randomForest)
rf1 <- randomForest(incidence ~
        PermitPurchHandG + PermitPurchRifle +
        PermitPossess + Register +
        AssualtWBan + WaitingPeriod +
        Other +
        PercGunOwner +
        income + PerHSplus + PerBAplus,
    data=r13,
    importance= TRUE)
varImpPlot(rf1)

GLM

library(car)
g1 <- glm(cbind(x,Population-x) ~
        PermitPurchHandG + PermitPurchRifle +
        PermitPossess + Register +
        AssualtWBan + WaitingPeriod +
        Other +
        PercGunOwner +
        income + PerHSplus + PerBAplus,
    data=r13,
    family=binomial)
Anova(g1)

GAM

library(gam)
g.obj <- gam(cbind(x,Population-x) ~
        s(PercGunOwner,df=2) +
        s(income,df=2) + 
        s(PerHSplus,df=2) + 
        s(PerBAplus,df=2),
    data=r13,
    family=binomial)
g.step <- step.gam(g.obj,scope=list(
        "PercGunOwner"=~1  + PercGunOwner + s(PercGunOwner,df=2) + s(PercGunOwner,df=3),
        "income"      =~1  + income       + s(income,df=2)       + s(income,df=3),
        "PerHSplus"   =~1  + PerHSplus    + s(PerHSplus,df=2)    + s(PerHSplus,df=3),
        "PerBAplus"   =~1  + PerBAplus    + s(PerBAplus,df=2)    + s(PerBAplus,df=3))
        )

g.step

Plot

series <- function(x,n) seq(min(x),max(x),length.out=n)
topred <- expand.grid(
    PercGunOwner=series(r13$PercGunOwner,11),
    PerHSplus=series(r13$PerHSplus,12))
preds <- predict(g.step,topred,type='response')
preds <- preds*100000*365/
    as.numeric(max(r7$date)-min(r7$date))
png('contour1.png')
contour(x=series(r13$PercGunOwner,11),
    y=series(r13$PerHSplus,12),
    z=preds,
    xlab='% Gun Owner',
    ylab='% High School Plus')
text(x=r13$PercGunOwner,
    y=r13$PerHSplus,
    labels=paste(r13$StateAbb,format(r13$incidence,digits=1)),
    cex=.65)
dev.off()