[R-sig-ME] factor-specific variances in lmer

mikhail matz matz at mail.utexas.edu
Tue Dec 13 00:07:25 CET 2011


Dear colleagues,

A reviewer of my manuscript keeps bringing up this issue, and I ran out ideas and expertise trying to address his concern. 
I am looking at application of lmer to analyze gene expression. Typical gene expression problem involves many samples across different conditions, each sample containing measurements for the same collection of genes. Mixed modeling is primarily needed because of the random sample effects (different amount of biological material in each sample, the effect therefore is the same for all genes in the sample). The issue is that both the gene-specific effects of conditions and the residuals would tend to have their own variances. The reviewer requires that the model should explicitly include such gene-specific variances as parameters. He cites a paper that uses SAS to formulate such models. Here is what he says exactly:

"Residual and/or sample*gene effects
should have gene-specific variances. This has been observed for a number of
dataset (Steibel et al., 2009) and can be proved to be the case for the dataset
used in this model. Ignoring heterogeneous variances can potentially lead to
wrong inferences. Authors should made readers aware of this and demonstrate
why in this particular dataset homogeneous variance could be safely assumed. If
that can’t be proved, then the methods presented here should be modified to
account for this."

Now, lmer does not even include such an option, and honestly I never before came across such a concern in linear modeling. I wonder why, though. As far as I could gather from Doug's lmer algorithm paper with my rudimentary math training was that this issue is taken care of by the fitting algorithm. Here is what I wrote in rebuttal:

"It is true that a straightforward linear fitting based on least-squares would 
be affected by unequal variances depending on the values of factors 
(in our case, genes): the most highly variable gene will have the most leverage 
in determining the resulting parameter estimates.  It is also true that, for a general 
case, it is unrealistic to assume that the variances of expression of all genes 
would be equal. Fortunately, there are efficient algorithmic solutions to this problem. 
In particular, the lmer function used here relies on “penalized, iteratively 
reweighted, least squares” (PIRLS) algorithm, which is specifically designed to 
counter the artifacts induced by this type of heteroscedasticity. In other words, 
lmer assumes by default that the variances are factor-specific, and does not 
require specification and estimation of their exact parameters to deal with it."

As you might guess from the fact that I am still bugging you with all this, the reviewer was not satisfied by this reply:

"I insist that the use of a homoskedastic model under these circumstances (heterogeneous
variances) is incorrect. The author says that PIRLS algorithm is robust to
heteroskedasticity and it cites a technical document from Dr. Douglas Bates (http://cran.rproject.
org/web/packages/lme4/vignettes/Theory.pdf). The problem is that the cited
document does not support the claim made by the author. Maybe the confusion comed
from section 2.5 of the cited document that says: ”(It may look as if we have missed the
dependence on sigma on the left-hand side but it turns out that the scale parameter does
not affect the location of the optimal values of quantities in the linear predictor.)”. The
fact that “the scale parameter does not directly affect the location of the optimal values of
quantities in the linear predictor” does not mean the “algorithm is robust to
heteroskedasticity”. Actually, if you look at equation 36-39, section 4.2 of the technical
document, it is clear that the updating equations do not depend on the absolute value of
the variances, but on variance ratios. If the variances change freely across genes, so will
their ratios. This is a very well known fact. For example, borrowing from the microarray
literature, when multiple genes are analyzed, heterogeneous variances should be
explicitly modeled [2]. In summary, a fitting algorithm can not make up for a wrongly
specified model in general, and this is particularly true for the linear mixed model and
Fisher scoring (PIRLS) algorithms. Consequently, the answer provided and the
corresponding statement in the paper is originated in a clear misinterpretation of the
companion documentation to lmer. In conclusion, the statement in lines 112-115 is
incorrect."

Now, this already goes way beyond my level of expertise. I would tremendously appreciate your comment and advice as to how I am to deal with this.

best regards,

Misha

Mikhail V. Matz
Assistant Professor
University of Texas at Austin
Integrative Biology Section
1 University station C0930
Austin, TX 78712
phone 512-992-8086 cell, 512-475-6424 lab
fax 512-471-3878
web http://www.bio.utexas.edu/research/matz_lab




More information about the R-sig-mixed-models mailing list