[R-sig-ME] standard error and statistical significance in lmer versus lm
Vincenzo Lagani
vlagani at ics.forth.gr
Fri Jan 2 19:16:17 CET 2015
Dear Peter,
thanks for your suggestion. Unfortunately, even on my most powerful
machine I could not fit the models you suggested: R returns a memory
error, due to the inability of allocating ~60GB of memory (more than
18000 probesets in my data).
Moreover, I am not totally sure if the model you suggested is suitable
for my application. While I fully agree with you that it would be
interesting to observe what happens when the variability due to the
probesets is explicitly modeled, I must say that I am not interested in
the specific effect of each single gene. Moreover, the set of genes I am
using can be considered as a (more or less) random sample from the set
of all mouse genes. Thus, it seems to me that the term "probesets"
should be definitely modeled as a random term.
Thanks again and kind regards,
Vincenzo
On 1/2/2015 4:34 PM, Peter Claussen wrote:
> Vincenzo,
>
> If I’m reading this correctly, I don’t think you’re comparing the same models, lm vs lmer
>
> Using lm, you have two models
> values ~ week + genotype
> values ~ week * genotype
>
> Using lmer, you have two different models.
> week + genotype + (week | probesets)
> week * genotype + (week | probesets)
>
> If you were to include (week:probesets) in your lm models, and compare
> values ~ week + genotype + week:probesets
> values ~ week * genotype + week:probesets
>
> would that clarify the confusion about the different results?
>
> Peter
>
>
>> On Dec 26, 2014, at 1:11 PM, Vincenzo Lagani <vlagani at ics.forth.gr> wrote:
>>
>> Dear all,
>>
>> I am modelling gene expression data with mixed models using lme4. My
>> goal is to assess whether gene expression globally decreases or
>> increases with time.
>>
>> Specifically, the data consist in whole expression profiles measured at
>> five different time points in two different strains of mice. For each
>> time point and each strain there are three replicates. The data are not
>> longitudinal, i.e., different mice are used at each time point. We had
>> to remove one profile from a time point because it was not matching our
>> quality criteria, so the data became not properly balanced.
>>
>> On these data I am fitting the following model:
>>
>> lme4Model.full <- lmer(values ~ week * genotype + (week | probesets),
>> data = dataset, REML = FALSE)
>>
>> where 'values' stands for the gene expression, 'week' is a scaled
>> numeric reporting the age of the mice in weeks, and 'genotype' is a
>> factor with two levels representing the two different genetic
>> backgrounds. Each single gene is let having its own random intercept and
>> slope.
>>
>> What puzzles me is that when I compare this model with its simpler,
>> non-mixed version:
>>
>> lmModel.full <- lm(values ~ week * genotype, data = dataset)
>>
>> I obtain the same coefficients but different standard errors (see
>> below). Furthermore, while the interaction coefficient is not
>> significant in the simple linear model, it becomes highly significant in
>> the mixed model, at least according to these ANOVA tests:
>>
>> lmModel <- lm(values ~ week + genotype, data = dataset)
>> lme4Model <- lmer(values ~ week + genotype + (week | probesets), data =
>> dataset, REML = FALSE)
>>
>>> anova(lmModel.full, lmModel)
>> Analysis of Variance Table
>>
>> Model 1: values ~ week * genotype
>> Model 2: values ~ week + genotype
>> Res.Df RSS Df Sum of Sq F Pr(>F)
>> 1 414341 2162664
>> 2 414342 2162666 -1 -2.4697 0.4732 0.4915
>>
>>> anova(lme4Model.full, lme4Model)
>> Data: dataset
>> Models:
>> lme4Model: values ~ week + genotype + (week | probesets)
>> lme4Model.full: values ~ week * genotype + (week | probesets)
>> Df AIC BIC logLik deviance Chisq Chi Df Pr(>Chisq)
>> lme4Model 7 72376 72452 -36181 72362
>> lme4Model.full 8 72325 72413 -36155 72309 52.472 1 4.364e-13 ***
>>
>>
>>
>> Assessing whether the interaction coefficient is significant is actually
>> the aim of my study, and having two totally different answers confuses
>> me. My understanding is that the mixed model better catches the variance
>> structure of the data and thus it is able to better estimate the
>> standard errors and p-values of the coefficients. Is this correct? In
>> other words, can I confidently claim that the p-values obtained from the
>> mixed models are "the correct ones" and that the interaction term is
>> actually significant?
>>
>> My apologies if this issue has been already posted on this list. Despite
>> having seen multiple posts here on similar topics, I have not been able
>> to find an answer to these questions.
>>
>> Thanks in advance for your help. Any suggestion is very welcome.
>>
>> Regards,
>>
>> Vincenzo
>>
>>
>>> summary(lme4Model.full)
>> Linear mixed model fit by maximum likelihood ['lmerMod']
>> Formula: values ~ week * genotype + (week | probesets)
>> Data: dataset
>>
>> AIC BIC logLik deviance df.resid
>> 72325.3 72412.8 -36154.7 72309.3 414337
>>
>> Scaled residuals:
>> Min 1Q Median 3Q Max
>> -13.2622 -0.5246 0.0128 0.5270 23.3456
>>
>> Random effects:
>> Groups Name Variance Std.Dev. Corr
>> probesets (Intercept) 5.167338 2.27318
>> week 0.005029 0.07092 -0.23
>> Residual 0.047063 0.21694
>> Number of obs: 414345, groups: probesets, 18015
>>
>> Fixed effects:
>> Estimate Std. Error t value
>> (Intercept) 6.5727983 0.0169414 388.0
>> week -0.0151336 0.0006823 -22.2
>> genotypeCSB 0.2405300 0.0007121 337.8
>> week:genotypeCSB 0.0050430 0.0006962 7.2
>>
>> Correlation of Fixed Effects:
>> (Intr) week gntCSB
>> week -0.177
>> genotypeCSB -0.015 -0.028
>> wk:gntypCSB -0.001 -0.392 -0.054
>>
>>
>>> summary(lmModel.full)
>> Call:
>> lm(formula = values ~ week * genotype, data = dataset)
>>
>> Residuals:
>> Min 1Q Median 3Q Max
>> -5.7560 -1.9881 0.0083 1.6823 7.6407
>>
>> Coefficients:
>> Estimate Std. Error t value Pr(>|t|)
>> (Intercept) 6.572798 0.004407 1491.384 < 2e-16 ***
>> week -0.015134 0.004547 -3.329 0.000873 ***
>> genotypeCSB 0.240530 0.007500 32.072 < 2e-16 ***
>> week:genotypeCSB 0.005043 0.007331 0.688 0.491537
>> ---
>> Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
>>
>> Residual standard error: 2.285 on 414341 degrees of freedom
>> Multiple R-squared: 0.002491, Adjusted R-squared: 0.002484
>> F-statistic: 344.9 on 3 and 414341 DF, p-value: < 2.2e-16
>>
>>
>>
>> [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> R-sig-mixed-models at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-sig-mixed-models
More information about the R-sig-mixed-models
mailing list