[R-sig-ME] Residuals look "mirrored" when using lmer with imputed data

Mon Aug 7 12:29:14 CEST 2017

@Phillip: Thank you for the tip, I'll be sure to be more explicit in  
the future regarding the term AUC.

I guess you are right, the data is indeed bounded because of the  
measuring method's own limits, I didn't think of that.

@Emmanuel: I log transforming did not help, because as Phillip  
mentioned, there are too many repeated values.

I guess the question then is, is this data usable at all? It seems  
it's impossible to get safe results from lmer, probably simple ttests  
won't be much better either.

Thank you both for taking the time to look into this.

Best Joao Santiago

Quoting Emmanuel Curis <emmanuel.curis at parisdescartes.fr>:

> Just a complement on this: in a pharmacokinetics context, which seems
> the one of Joao, AUC is most often estimated using the trapeze rules
> than using a rectangular approximation. So that would be a first
> improvment for Joao's computation.
>
> Most often also, these data are log-transformed before analysis,
> because of the > 0 constraint of the concentrations. This may (or not)
> improve the models….
>
> I'm not sure if imputation is relevant here, if a few data points are
> missing for some patients: after all, in a way, a tremendous amount of
> points are missing since a very few set of times is sampled on the
> continuous time axis. And the trapeze method in the case of a missing
> point is equivalent to having a linear interpolation for the missing
> point, using the two surrounding points. But that neglects the noises
> effects, of course. However, would it improve the final conclusions?
> If anyone has references...
>
> Except in one case: if the missing point is the last one (or the first
> one), because then the AUC are difficult to compare since the X spans
> differ. Unless there is an approximation for « last point to
> infinity », which I did not saw in Joao's code.
>
> By the way, if it is an hormon, you may have to include a term for the
> baseline before comparing AUC if there is both internal and external
> origin of the hormon — but it depends on the experiment. Removing the
> baseline by substraction is a way, but I guess tend to neglect the
> associated uncertainty.
>
> On Sun, Aug 06, 2017 at 01:00:47PM +0000, Alday, Phillip wrote:
> « Uh ... part of my problem is the term "AUC". Without some very explicit
> « note to the contrary, AUC on a stats forum often refers to a cumulative
> « probability or the Receiver Operating Characteristic (ROC AUC).
> « Remember, we're doing this in our free time, so I am often quick and
> « don't read every line of code ....
> «
> « You're effectively using a rectangular approximation for your integral
> « and that's something else to be careful with in terms of approximation.
> « But why impute separately here? You might be able to use a suitable
> « integral approximation to skip imputation:
> «
> « >  dfAUC <- subset(df,!is.na(value)) %>% arrange(sampleNum) %>%
> « group_by(ID, treatment) %>%  mutate(AUC =
> « sum(diff(sampleNum)*rollmean(value,2)))
> « > fit2 <- lmer(AUC ~ sex * treatment + (1|ID), data = dfAUC)
> « > qqnorm(resid(fit2))
> «
> « yields similarly flat tails on the qqplot. (Okay, maybe don't use the
> « rectangular approximation without proper imputation; this didn't work
> « as well as I'd hoped.)
> «
> « In both cases, the residual plot looks not horrible to me, but the
> « qqnorm plot suggests some issues with bounding -- those flat tails look
> « you're running into some minimal/maximal value. At the lower end, the
> « obvious answer is "zero"; at the upper end, maybe you're hitting the
> « limits of your measurement device? We see which values actually occur,
> « both with the imputed data and the implicit-imputation-via-integral-
> « approximation data:
> «
> « > xtabs(~AUC,dfImpAUC)
> « > xtabs(~AUC,dfAUC)
> «
> « In both cases, we see that the data only take on a relatively small
> « number of values and indeed there are lots of repetitions at the
> « bounds.
> «
> « Phillip
> «
> « > On Sat, 2017-08-05 at 17:59 +0200, João C P Santiago wrote:
> « > >
> « > > I'm sorry I may be misunderstanding, but why do you say my data
> « > > is  
> « > > bounded between 0 and 1? The data I shared is from multiple  
> « > > measurements of a blood hormone, which can be 0, but is certainly
> « > > not  
> « > > bounded at 1. Next I calculated the AUC i.e. the "total exposure"
> « > > of  
> « > > that hormone. X is time and y is the value of the hormone at each
> « > > time  
> « > > point. The AUC is also not bounded at 0 and 1.
> « > >
> « > > Am I missing something?
> « > >
> « > > Thanks for your reply!
> « > > Santiago
> « > >
> « > >
> « > > Quoting "Alday, Phillip" <Phillip.Alday at mpi.nl>:
> « > >
> « > > >
> « > > >
> « > > > You're fitting a normal/Gaussian LMM to data bounded on [0, 1].
> « > > > The
> « > > > model assumptions about the residuals simply won't hold for
> « > > > bounded,
> « > > > binomial-like values. 
> « > > >
> « > > > Why not fit a binomial model to the data and then use
> « > > > fitted(model)
> « > > > to
> « > > > compute AUC of the entire model? 
> « > > >
> « > > > Phillip  
> « > > >
> « > > > On Fri, 2017-08-04 at 08:52 +0200, João C P Santiago wrote:
> « > > > >
> « > > > >
> « > > > > I'm trying to assess if a treatment had any effect on the
> « > > > > levels
> « > > > > of
> « > > > > a  
> « > > > > hormone. To do this I need to calculate the area under the
> « > > > > curve
> « > > > > and  
> « > > > > then adjust it for sex (a known confounder) and smoking status
> « > > > > (not  
> « > > > > included in the demo data below to keep things simpler).
> « > > > >
> « > > > > Here's a dput of the data: https://pastebin.com/VYcQGkwb
> « > > > >
> « > > > > There's some missing values, so first step is to impute them
> « > > > > using
> « > > > > the  
> « > > > > mice package, then calculate AUC and finally fit the model:
> « > > > >
> « > > > > library(dplyr)
> « > > > > library(lme4)
> « > > > > library(mice)
> « > > > > library(zoo)
> « > > > >
> « > > > > ## Impute missing values
> « > > > > dfMids <- mice(df, m = 10, maxit = 15, seed = 2535)
> « > > > > dfImp  <- complete(dfMids)
> « > > > >
> « > > > > ## Calculate AUC
> « > > > > dfImpAUC <- dfImp %>%
> « > > > >    arrange(sampleNum) %>%
> « > > > >    group_by(ID, treatment) %>%
> « > > > >    mutate(AUC = sum(diff(sampleNum)*rollmean(value,2)))
> « > > > >
> « > > > > ## Fit model
> « > > > > fit <- lmer(AUC ~ sex * treatment + (1|ID), data = dfImpAUC)
> « > > > >
> « > > > > ## Plot residuals
> « > > > > plot(fit)  # output: https://imgur.com/a/vfL1R
> « > > > > qqnorm(resid(fit))
> « > > > >
> « > > > >
> « > > > >
> « > > > > I know it's possible to fit a model to each iteration of mids
> « > > > > model,  
> « > > > > but then I can't calculate the AUC, which is what I actually
> « > > > > need.
> « > > > > Any  
> « > > > > ideas why the residuals look like that?
> « > > > >
> « > > > > Best
> « > > > > Santiago
> « > > > >
> « > > > >
> « > > > >
> « > > > >
> « _______________________________________________
> « R-sig-mixed-models at r-project.org mailing list
> « https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
>
> --
>                                 Emmanuel CURIS
>                                 emmanuel.curis at parisdescartes.fr
>
> Page WWW: http://emmanuel.curis.online.fr/index.html

-- 
João C. P. Santiago
Institute for Medical Psychology & Behavioral Neurobiology
Center of Integrative Neuroscience
University of Tuebingen
Otfried-Mueller-Str. 25
72076 Tuebingen, Germany

Phone: +49 7071 29 88981
Fax: +49 7071 29 25016