Sunday, September 29, 2013

Mixed Models: Influence in Heterogeneous Variance Model

In this post I extend my knowledge of mixed models by redoing section 59.7 (page 5048) of the SAS/STAT user guide. I don't think this particular example can be run in lme4, so that leaves nlme and MCMCglmm. MCMCglmm has less ability for influence measures, so this is only touched upon.

data

To quote the SAS/STAT User guide: 'a one-way classification model with heterogeneous variances is fit. The data, (...), represent amounts of different types of fat absorbed by batches of doughnuts during cooking, measured in grams'. Data are read below. The final statement is just ordering the data similar to SAS, which is convenient for comparison on my side. 
r1 <- read.table(textConnection('
1 164 1 172 1 168 1 177 1 156 1 195
2 178 2 191 2 197 2 182 2 185 2 177
3 175 3 193 3 178 3 171 3 163 3 176
4 155 4 166 4 149 4 164 4 170 4 168
') ,col.names=paste('a',1:12,sep=''))
r2 <- data.frame(FatType=factor(as.matrix(r1[,seq(1,11,by=2)])),
    Absorbed=as.numeric(as.matrix(r1[,seq(2,12,by=2)])))
r2 <- r2[c(seq(1,24,by=4),seq(2,24,by=4),seq(3,24,by=4),seq(4,24,by=4)),]

nlme

For a change, the gls() function is used from the nlme package. A few preparations are needed. A variable x is used as dummy variable in the varIdent sub-statement. Variable i is used in influence diagnostic plotting. SAS contrasts speak for themselves. 
r2$x <- 1
r2$i <-1:24
options(contrasts=c( 'contr.SAS','contr.SAS'))
library(nlme)
lme1 <- gls(Absorbed ~ FatType,
    weights=varIdent(form=~x |  FatType),data=r2 )    
summary(lme1)
Generalized least squares fit by REML
  Model: Absorbed ~ FatType 
  Data: r2 
       AIC      BIC    logLik
  170.3109 178.2767 -77.15543

Variance function:
 Structure: Different standard deviations per stratum
 Formula: ~x | FatType 
 Parameter estimates:
        1         2         3         4 
1.0000000 0.5825169 0.7404827 0.6162593 

Coefficients:
            Value Std.Error  t-value p-value
(Intercept)   162  3.356586 48.26332  0.0000
FatType1       10  6.397916  1.56301  0.1337
FatType2       23  4.618803  4.97965  0.0001
FatType3       14  5.247222  2.66808  0.0148

 Correlation: 
         (Intr) FtTyp1 FtTyp2
FatType1 -0.525              
FatType2 -0.727  0.381       
FatType3 -0.640  0.336  0.465

Standardized residuals:
          Min            Q1           Med            Q3           Max 
-1.581138e+00 -6.625646e-01 -6.533961e-15  5.473172e-01  1.723923e+00 

Residual standard error: 13.34166 
Degrees of freedom: 24 total; 20 residual

As ever, nlme shows results differently from SAS. The missing bit is the variances. These can be displayed by this statement.
(lme1$sigma*
      coef(lme1$modelStruct$varStruct,
          uncons = FALSE, allCoef = TRUE))^2
        1         2         3         4 
177.99996  60.39999  97.59999  67.60003 

diagnostics

Having set up the model, leave-one-out statistics is just a sapply() away. For brevity only a few are listed. 
sa <- sapply(1:nrow(r2) ,function(x) {
      rx <- r2
      rx$Absorbed[x] <- NA
      lmex <- gls(Absorbed ~ FatType,
          weights=varIdent(form=~x |  FatType),data=rx,na.action=na.omit )    
      cc <- coef(lmex$modelStruct$varStruct,uncons = FALSE, allCoef = TRUE)
      cc <- cc[order(names(cc))]
      c(coef(lmex),
         (lmex$sigma*cc)^2)
   })
tsa <- as.data.frame( t(sa))
head(tsa)
  (Intercept) FatType1 FatType2 FatType3         1        2        3        4
1         162     11.6       23       14 203.29918 60.40017 97.59971 67.60023
2         162     10.0       23       14 222.50049 60.39999 97.59995 67.59992
3         162     10.8       23       14 217.70003 60.40000 97.59999 67.59999
4         162      9.0       23       14 214.99969 60.40013 97.59964 67.60018
5         162     13.2       23       14 145.70021 60.39986 97.60000 67.60007
6         162      5.4       23       14  63.79997 60.40006 97.59999 67.59996
Plots are more interesting. 
tsa$i <- 1:nrow(tsa)
library(lattice)
par(mfrow=c(2,2))
plot(`(Intercept)`~i,data=tsa,xlab='')
plot(FatType1~i,data=tsa,xlab='')
plot(FatType2~i,data=tsa,xlab='')
plot(FatType3~i,data=tsa,xlab='')
It is a bit funny in influences, since fat-types are displayed equivalent to intercept. In terms of influence on the FatTypes means a plot not shown in SAS/STAT:
names(tsa)[1] <- 'Intercept'
xyplot(Intercept + I(Intercept+FatType1) 
        + I(Intercept+FatType2)+ I(Intercept+FatType3) ~ i   ,
    data=tsa,
    outer=TRUE,
    strip=function(..., factor.levels)
      strip.default(  ...,factor.levels=paste('FatType',c(4,1:3))))
The next plot shows influence on the estimated variances
names(tsa)[5:8] <- paste('FT',1:4,sep='')
xyplot(FT1 + FT2 + FT3 + FT4 ~  i   ,data=tsa,outer=TRUE)

residuals

The ground work for residuals is also present. Note that the studentized residuals come out slightly different; there seems to be a factor (5/6)^2 difference. 
resdf <- data.frame(Absorbed=r2$Absorbed,
    predicted=predict(lme1),
    residual=residuals(lme1),
    std=attr(residuals(lme1),'std')
    )
resdf$lofit <- 0
resdf$lofit[r2$FatType=='1'] <- 
   rowSums(tsa[,c('Intercept','FatType1')])[r2$FatType=='1']
resdf$lofit[r2$FatType=='2'] <- 
   rowSums(tsa[,c('Intercept','FatType2')])[r2$FatType=='2']
resdf$lofit[r2$FatType=='3'] <- 
   rowSums(tsa[,c('Intercept','FatType3')])[r2$FatType=='3']
resdf$lofit[r2$FatType=='4'] <- tsa[,c('Intercept')][r2$FatType=='4']
resdf$PRESSres <- resdf$Absorbed-resdf$lofit
resdf$StudRes <- resdf$residual/resdf$std
head(resdf)
   Absorbed predicted      residual      std lofit      PRESSres       StudRes
1       164       172 -8.000000e+00 13.34166 173.6 -9.600000e+00 -5.996254e-01
5       172       172 -2.842171e-14 13.34166 172.0  5.684342e-14 -2.130297e-15
9       168       172 -4.000000e+00 13.34166 172.8 -4.800000e+00 -2.998127e-01
13      177       172  5.000000e+00 13.34166 171.0  6.000000e+00  3.747659e-01
17      156       172 -1.600000e+01 13.34166 175.2 -1.920000e+01 -1.199251e+00
21      195       172  2.300000e+01 13.34166 167.4  2.760000e+01  1.723923e+00

nlme's own tools

nlme is not without tools for diagnostics as shown below. I labelled a number of points mentions in SAS/STAT Guide. The cut-off for labeling is determined by id. The value of 0.15 coincides with a two sided 85% interval, is chosen so as to label the four points mentioned in SAS/STAT Guide. It is certainly not a limit I would choose and I have some doubt about four influential points on a total of 24 points.
plot(lme1,form= resid(., type = "p") ~ i | FatType,id=.15)
Second diagnostic plot contains normal plots. The marked points seem reasonable in line with the others.
qqnorm(lme1, ~ resid(., type = "p") | FatType,id=.15)

MCMCglmm

While the nlme sections is really long, MCMCglmm is really short. To run the model there were some tricks. An additional variable avar is nested under FatTypes. MCMCglmm gives a units variance, which does not make sense in the context of this model. I did not know how to remove it, but a suitable prior was used to suppress it. 
library(MCMCglmm)
r2$avar <- factor(rep(1:6,4))
prior1 <- list(R=list(V=diag(1)*.1,nu=10),
    G=list(G1=list(V=diag(4),nu=.01) )
)
mc1 <- MCMCglmm(Absorbed ~ FatType,
    random=~ idh(FatType):avar,
    data=r2 ,pr=TRUE,verbose=FALSE,
    prior=prior1,    nitt=50000)
summary(mc1)
 Iterations = 3001:49991
 Thinning interval  = 10
 Sample size  = 4700 

 DIC: 36.68339 

 G-structure:  ~idh(FatType):avar

       post.mean l-95% CI u-95% CI eff.samp
1.avar     295.5    42.52    773.6     4700
2.avar     100.1    14.96    259.5     4712
3.avar     164.9    21.59    443.2     4700
4.avar     110.4    18.25    291.0     4700

 R-structure:  ~units

      post.mean l-95% CI u-95% CI eff.samp
units    0.1256  0.03751   0.2588     4359

 Location effects: Absorbed ~ FatType 

            post.mean l-95% CI u-95% CI eff.samp   pMCMC    
(Intercept)  161.9363 153.0445 169.8837     4700 < 2e-04 ***
FatType1      10.0627  -6.9373  25.8132     4969 0.18681    
FatType2      22.9649  12.0911  34.9963     4700 0.00128 ** 
FatType3      13.9831   0.8961  27.4354     4700 0.04043 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As before, the variances are larger than expected. There is also a function which looks dfferently at this. These values are lower than nlme & PROC MIXED.
posterior.mode(mc1$VCV)
      1.avar       2.avar       3.avar       4.avar        units 
133.71520455  42.92154371  73.72997958  51.18475597   0.09115953 

Influence diagnostics

It is easy enough to extract parameter estimates from the simulations, so a sapply() to get leave-one-out estimates similar to nlme is obvious. However,  residuals() is not yet implemented, so that is far away. It was my choice not to do this.

No comments:

Post a Comment