### Data

The data is in a spreadsheet, one page per location. There is quite a big header on each page and each data column has an associated column marking data below detection limit. This being real world data there are also the odd missing data.Following Read Excel Data from R I chose to read the data via XLConnect. The whole process is wrapped in a function, So I can read all pages (all locations).

library(XLConnect)

wb = loadWorkbook("lmre_1992-2005.xls")

sheets <- getSheets(wb)[-1]

readsheet <- function(cursheet) {

df = readWorksheet(wb, sheet = cursheet , header = TRUE)

topline <- 8

test <- which(df[6,]=='C')+1

numin <- df[topline:nrow(df),test]

names(numin) <- make.names( gsub(' ','',df[6,test]))

for(i in 1:ncol(numin)) numin[,i] <- as.numeric(gsub(',','.',numin[,i]))

status = df[topline:nrow(df),test-1]

names(status) <- paste('C',make.names( gsub(' ','',df[6,test])),sep='')

numin$StartDate <- as.Date(substr(df[topline:nrow(df),1],1,10))

numin$EndDate <- as.Date(substr(df[topline:nrow(df),2],1,10))

numin <- cbind(numin,status)

tall <- reshape(numin,

varying=list(make.names( gsub(' ','',df[6,test])),

paste('C',make.names( gsub(' ','',df[6,test])),sep='')),

v.names=c('Amount','Status'),

timevar='Compound',

idvar=c('StartDate','EndDate'),

times=paste(df[6,test],'(',df[5,test],')',sep=''),

direction='long')

tall$Compound <- factor(gsub(' )',')',gsub(' +',' ',tall$Compound)))

row.names(tall)<- NULL

tall$Location <- cursheet

tall

}

tt <- lapply(sheets,readsheet)

tt2 <- do.call(rbind,tt)

tt2$Location <- factor(tt2$Location)

### Plots

There are certainly many variables to be analyzed, more than I would do in a blog post. I have chosen pH (acid rain) and Iron (Fe).

levels(tt2$Compound)

[1] "As (umol/l)" "Ca (umol/l)" "Cd (umol/l)"

[4] "Cl (umol/l)" "Co (umol/l)" "Cr (umol/l)"

[7] "Cu (umol/l)" "F (umol/l)" "Fe (umol/l)"

[10] "K (umol/l)" "K25 (uS/cm)" "massa_hc (g)"

[4] "Cl (umol/l)" "Co (umol/l)" "Cr (umol/l)"

[7] "Cu (umol/l)" "F (umol/l)" "Fe (umol/l)"

[10] "K (umol/l)" "K25 (uS/cm)" "massa_hc (g)"

[13] "massa_zm (g)" "Mg (umol/l)" "Mn (umol/l)"

[16] "Na (umol/l)" "NH4 (umol/l)" "Ni (umol/l)"

[19] "NO3 (umol/l)" "nsl (mm)" "Pb (umol/l)"

[22] "pH (pH-eenheid)" "PO4 (umol/l)" "SO4 (umol/l)"

[16] "Na (umol/l)" "NH4 (umol/l)" "Ni (umol/l)"

[19] "NO3 (umol/l)" "nsl (mm)" "Pb (umol/l)"

[22] "pH (pH-eenheid)" "PO4 (umol/l)" "SO4 (umol/l)"

[25] "st.zr. (umol/l)" "V (umol/l)" "Zn (umol/l)"

#### pH

Acid rain was quite a scare last century. We should be able to see progress, especially in the nineties, when there was quite some activity on it. One location is not shown; Leiduin has data complimentary to 'De Zilk'. Removal of this location gave a much nicer layout for the remaining data.

The plots show progress, but some locations (e.g. De Zilk, Rotterdam) there is still a way to go. Is Vredepeel going more acid again? I may need to add more data after all.

library(ggplot2)

library(scales)

(cc <- levels(tt2$Compound)[22])

pp <- ggplot(tt2[tt2$Compound==cc & tt2$Location!='Leiduin',],

aes( StartDate,Amount))

pp + geom_point() +

facet_wrap(~Location) +

ylab(cc) + xlab('Year') +

scale_x_date(breaks='4 years',labels = date_format("%Y"))

#### Iron

Iron is plotted on a logarithmic y scale. Disadvantage is that this hides extreme high (Kollumerwaard) and below zero's. However, on this scale the data seem more or less 'normal'.

(cc <- levels(tt2$Compound)[9])

pp <- ggplot(tt2[tt2$Compound==cc & tt2$Location!='Leiduin',], aes( StartDate,Amount))

pp + geom_point() +

scale_y_log10() +

facet_wrap(~Location) +

ylab(cc) + xlab('Year') +

scale_x_date(breaks='4 years',labels = date_format("%Y"))

### Mixed model for Iron

Just to get some quantification on the reduction I ran a model. The obvious choice is again to use logarithmic transformed data. However, some zeros unleashed havoc on the subsequent estimates. Hence these data are made NA prior to computations.library(lme4)

Fe <- tt2[tt2$Compound==levels(tt2$Compound)[9],]

# plot(density(Fe$Amount[!is.na(Fe$Status) & Fe$Status!=0],adjust=.5))

Fe$LAmount <- log10(Fe$Amount)

Fe$LAmount[Fe$Amount<=0] <- NA

ll <- lmer(LAmount ~ StartDate + (1|Location) +(StartDate|Location ) ,

data=Fe)

summary(ll)

Linear mixed model fit by REML ['lmerMod']

Formula: LAmount ~ StartDate + (1 | Location) +

(StartDate | Location)

Data: Fe

REML criterion at convergence: 2170.498

Random effects:

Groups Name Variance Std.Dev. Corr

Location (Intercept) 4.727e-11 6.875e-06

Location.1 (Intercept) 1.456e-04 1.207e-02

StartDate 1.348e-10 1.161e-05 1.00

Residual 1.436e-01 3.789e-01

Number of obs: 2338, groups: Location, 17

Fixed effects:

Estimate Std. Error t value

(Intercept) 5.001e-01 5.971e-02 8.376

StartDate -6.975e-05 6.338e-06 -11.005

Correlation of Fixed Effects:

(Intr)

StartDate -0.862

A parameter of interest is effect of StartDate. It is the size of the effect per day. This translates in 5% reduction per year. The t-value suggests this is significant, as does the associated ANOVA.

> (10^-6.975e-5)^365

[1] 0.9430642

ll2 <- lmer(LAmount ~ 1 + (1|Location) +(StartDate|Location ) ,

data=Fe)

anova(ll,ll2)

Data: Fe

Models:

ll2: LAmount ~ 1 + (1 | Location) + (StartDate | Location)

ll: LAmount ~ StartDate + (1 | Location) + (StartDate | Location)

Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)

ll2 6 2196.0 2230.5 -1092.0 2184.0

ll 7 2157.3 2197.6 -1071.7 2143.3 40.656 1 1.815e-10 ***

---

Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A bootstrap simulation of this significance needs some more processing. It did not work unless I removed all data with missing values prior to calculation. The result was 0 out of 1000 simulations, which in hindsight is obvious. Even if I don't trust the ANOVA, I don't believe it would be a factor 10^7 off either.

## No comments:

## Post a Comment