Sunday, February 3, 2013

For descriptive statistics, values below LLOQ set to ...

That is what I read the other day. For calculation of descriptive statistics, values below the LLOQ (lower limit of quantification) were set to.... Then I wondered: wasn't there a trick in JAGS to incorporate the presence of missing data while estimating the parameters of a distribution? How would that compare with standard methods such as imputing values below LLOQ as 0, half LLOQ or LLOQ, or just ignoring the missing data? A simulation seems to be called for.

Bayes estimation

A bit of searching gave me a way to estimate the mean and standard deviation, here wrapped in a function so the resulting data has a nice shape.
library(R2jags)
library(plyr)
library(ggplot2)

Bsensnorm <- function(yy,limit) {
  isCensored <- is.na(yy)
  model1 <- function() {
    for ( i in 1:N ) {
      # dinterval() trick: the indicator is 1 when yy[i] is observed
      # (above the limit) and 0 when it is censored, which constrains
      # the NA values of yy[i] to lie below the limit
      isCensored[i] ~ dinterval( yy[i] , limit )
      yy[i] ~ dnorm( mu , tau ) 
    }
    tau <- 1/pow(sigma,2)
    sigma ~ dunif(0,100)   # uninformative prior on the standard deviation
    mu ~ dnorm(0,1E-6)     # vague prior on the mean
  }
  # note: despite its name, the variable passed as isCensored is the
  # observed indicator (1 = observed, 0 = censored)
  datain <- list(yy=yy,isCensored=1-as.numeric(isCensored),N=length(yy),limit=limit)
  params <- c('mu','sigma')
  inits <- function() {
    list(mu = rnorm(1,0,1),      # one random initial value for mu
        sigma = runif(1,0,1))    # one random initial value for sigma
  }
  
  jagsfit <- jags(datain, model=model1, inits=inits, 
      parameters=params,progress.bar="gui",
      n.chains=1,n.iter=7000,n.burnin=2000,n.thin=2)
  data.frame( system='Bayes Uninf',
      mu_out=jagsfit$BUGSoutput$mean$mu,
      sigma_out=jagsfit$BUGSoutput$mean$sigma)
}
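
To see how the function is used, here is a minimal check on toy data. This example is my own sketch, not part of the original post; the seed and numbers are made up, and the exact output depends on the JAGS run.

set.seed(2013)                   # hypothetical seed, only for reproducibility
yy <- rnorm(20, mean = 1.5, sd = 1)
yy[yy < 1] <- NA                 # censor values below an LLOQ of 1
Bsensnorm(yy, limit = 1)         # one-row data frame: system, mu_out, sigma_out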

Somewhat informative prior

As the original model did not perform very well at some points, I created an informative prior on the standard deviation. It is not the best possible prior, but it is enough to give me an idea. To defend having a prior: once a method has an LLOQ, there should be some idea about its precision. Otherwise, how was the LLOQ determined? Also, in the real world there may be data points with complete data, which would allow estimation of the standard deviation in a more hierarchical model.
BsensInf <- function(yy,limit) {
  isCensored <- is.na(yy)
  model2 <- function() {
    for ( i in 1:N ) {
      isCensored[i] ~ dinterval( yy[i] , limit )
      yy[i] ~ dnorm( mu , tau ) 
    }
    tau <- 1/pow(sigma,2)
    # informative prior on sigma; %_% is the R2WinBUGS/R2jags trick to get
    # the truncation T(0.001,) through the R parser into the model file
    sigma ~ dnorm(1,1) %_% T(0.001,)
    mu ~ dnorm(0,1E-6) 
  }
  datain <- list(yy=yy,isCensored=1-as.numeric(isCensored),N=length(yy),limit=limit)
  params <- c('mu','sigma')
  inits <- function() {
    list(mu = abs(rnorm(1,0,1))+.01,
        sigma = runif(1,.1,1))
  }
  
  jagsfit <- jags(datain, model=model2, inits=inits, parameters=params,
      progress.bar="gui",
      n.chains=1,n.iter=7000,n.burnin=2000,n.thin=2)
  data.frame( system='Bayes Inf',
      mu_out=jagsfit$BUGSoutput$mean$mu,
      sigma_out=jagsfit$BUGSoutput$mean$sigma)
}

Simple imputation

These are the simple imputations, each in a separate function.
L0sensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- 0
  data.frame(system='as 0',mu_out=mean(yy),sigma_out=sd(yy))
}
Lhsensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- limit/2
  data.frame(system='as half LOQ',mu_out=mean(yy),sigma_out=sd(yy))
}
Llsensnorm <- function(yy,limit) {
  yy[is.na(yy)] <- limit
  data.frame(system='as LOQ',mu_out=mean(yy),sigma_out=sd(yy))
}
LNAsensnorm <- function(yy,limit) {
  yy <- yy[!is.na(yy)]
  data.frame(system='as missing',mu_out=mean(yy),sigma_out=sd(yy))
}
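
As a quick illustration of these four functions (my own toy numbers, not part of the simulation below):

yy <- c(NA, NA, 1.2, 1.7, 2.3, 3.1)   # two values censored below an LLOQ of 1
do.call(rbind, list(
    L0sensnorm(yy, 1), Lhsensnorm(yy, 1),
    Llsensnorm(yy, 1), LNAsensnorm(yy, 1)))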

Simulations

The script below is used to test the methods with some settings I chose. Basically, the LLOQ is set at 1, the SD is 1 and the mean ranges from 1 to 3.5. The data is simulated from a normal distribution and cut at the LLOQ of 1. If more than half of the data would be below LLOQ, new samples are drawn until more than half of the data is above LLOQ.
limitest <- function(yy,limit) {
  yy[yy<limit] <- NA    # values below the limit become missing
  do.call(rbind,list(
    Bsensnorm(yy,limit),
    BsensInf(yy,limit),
    L0sensnorm(yy,limit),
    Lhsensnorm(yy,limit),
    Llsensnorm(yy,limit), 
    LNAsensnorm(yy,limit)))
}
simtest <- function(n,muin,sdin,limit) {
  # redraw until more than half of the sample is above the limit
  numnotna <- 0
  while(numnotna<n/2) {
    yy <- rnorm(n,muin,sdin)
    numnotna <- sum(yy>limit)
  }
  ll <- limitest(yy,limit)
  ll$n_na <- sum(yy<limit)   # record how many values were censored
  ll
}
datain <- expand.grid(n=c(5,6,8,10,12,15,20,50,80),muin=seq(1,3.5,.05),sdin=1,limit=1)
dataout <- mdply(datain,simtest)
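
The full grid triggers two JAGS fits per row, so the run takes a while. A single call can be sanity checked first; this check is my own addition, not in the original post.

simtest(n = 10, muin = 2, sdin = 1, limit = 1)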

Results for means

After all simulations, the difference between the estimated mean and the simulation mean is calculated. This made interpreting the plots much easier.
dataout$dif <- dataout$mu_out-dataout$muin
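
To complement the plots below, the bias could also be summarized numerically per method and sample size, for instance with plyr, which is already loaded. This summary is a sketch of my own, not part of the original post.

ddply(dataout, .(system, n), summarise,
    mean_bias = mean(dif),
    sd_bias = sd(dif))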

The uninformed prior did not perform well with very little data

With very little data, the standard deviation gets large and the mean gets low. This may just be because there is no information stating they are not large and small respectively, so the chain just wanders off. Anyway, reason enough to conclude that with only a few samples it is not a feasible method. Adding information about the standard deviation improves this considerably.

ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system %in% c('Bayes Uninf','Bayes Inf'),],
        aes(x=muin,y=dif)) +
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~ n, drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At low numbers of observations, variation in the mean swamps bias

The variation in the estimates is rather large. Even so, there is a visible trend that removing missing data and imputing LLOQ give estimates that are too high when mu is low and hence the number of missings is high. When mu is high, all methods obtain the same result.

ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system!='Bayes Uninf',],
        aes(x=muin,y=dif)) +
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~ n, drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At intermediate numbers of observations, method choice gets more important

Ignoring missing values and imputing LLOQ are clearly biased. Half LLOQ and 0 do much better. At 15 observations the informed prior is fairly similar to the uninformed prior.
ggplot(dataout[dataout$n %in% c(10,12,15),], aes(x=muin,y=dif)) +
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~ n, drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

At large numbers of observations, bias gets far too large

At this point the 'as missing' option is not plotted any more; it is just not competitive. Setting values to LLOQ gives a big positive bias, while 0 gives a negative bias, especially at 80 observations. There is no longer a point in setting up an informed prior. Half LLOQ seems the least biased of the simple methods.

ggplot(dataout[dataout$n %in% c(20,50,80) & dataout$system!='as missing',],
        aes(x=muin,y=dif)) +
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~ n, drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('difference observed and simulation mu') 

Standard deviation is negatively biased for simple imputation

The uninformed prior estimates are too poor to plot while retaining reasonable axes (and setting limits on y means the data outside those limits get eliminated from the smoother too, so that is not an option). Especially setting missings to LLOQ or ignoring them yields estimates of the standard deviation that are too low. The informed prior gives a slightly too large standard deviation.
ggplot(dataout[dataout$n %in% c(5,6,8) & dataout$system!='Bayes Uninf',],
        aes(x=muin,y=sigma_out)) +
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~ n, drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 

ggplot(dataout[dataout$n %in% c(10,12,15) & dataout$system!='Bayes Uninf',],
        aes(x=muin,y=sigma_out)) +
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~ n, drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 
With a large number of observations, half LLOQ results in a negative bias on the standard deviation.

ggplot(dataout[dataout$n %in% c(20,50,80) & dataout$system!='as missing',],
        aes(x=muin,y=sigma_out)) +
    geom_smooth(method='loess') +
    geom_point() + 
    facet_grid(system ~ n, drop=TRUE) +
    scale_x_continuous('simulation mu') +
    scale_y_continuous('estimated sigma') 

Conclusion

At low numbers of observations, using a Bayesian model with a good prior on the standard deviation is the best way to obtain descriptive statistics. Lacking that, setting <LLOQ values to 1/2 LLOQ for estimation of means and to 0 for estimation of standard deviations seems best. But some care, and more extensive simulations for the problem at hand, are needed.

Finally, reporting data as <LLOQ has implications. As a statistician, I think using <LLOQ to display data in listings and tables is a good thing. However, when subsequent calculations by statisticians are needed, it may cause more trouble than it resolves. I have seen a simple PCA go down the drain because of <LLOQ values, which is, truth be told, a shame given the effort that went into getting the data.
