[R-sig-ME] R-sig-mixed-models Digest, Vol 18, Issue 4

David Airey david.airey at Vanderbilt.Edu
Mon Jun 9 16:35:35 CEST 2008

Thank you Dr.s Bates and Gorjanc.

I hope to bother you for additional detail later, once we finish  
analysis with a first pass using SAS Proc Mixed to give a reference  
point. My purpose was to see what was possible in R, that seems a  
little problematic in Stata (at least with my hands).

My collaborator (Lily Wang, Biostatistician, SAS user) has a paper  
coming out in PLoS Genetics, describing the improvements that use of  
mixed models for parametric gene set enrichment analysis has over the  
usual nonparametric GSEA approach by the Broad Institute. Perhaps,  
these approaches will wind their way into SAS JMP Genomics 4.

The kind of structure involved generally includes a fixed effect  
treatment of interest in two or more groups, and random effects for  
gene set, genes in the set, and gene hybridization probes. Gene  
expression is measured at the probe level; gene sets are based on a  
priori knowledge of function.

In Stata, for example, there would be two ways of doing this model  
(thanks to Yulia at Stata Corp), by brute force or multilevel:

xi: xtmixed depvar i.Group || _all: R.Gene || _all: R.GeXPr || _all:  
R.GrXGe || _all: R.GrXGeXPr, variance

which, if we have 50 genes, 5 probes per gene, and 2 groups, requires  
900 columns for the design matrix.

xi: xtmixed depvar i.Group || Gene: R.Probes, cov(exchangeable) ||  
Group: || Probes:, variance

which requires just 7 columns for the same model.

But, as stated in the earlier thread, Stata just stops when there is a  
zero variance component, because "a ridge is formed by an interval of  
values near zero, which produce the same likelihood and look equally  
good to the optimizer [manual entry]." The default optimizer is Newton- 
Raphson. There may yet be ways of dealing with this problem within  
Stata; I have yet to query their support line.

The reason the above is interesting to me is that Lily works with a  
related approach where the comparison is not by treatment within a  
given gene set, but by enrichment within a gene set compared to the  
rest of the genes on the microarray. This is the better approach to  
take, but then I think the brute force approach could do with some  
"slimming" offered by the multilevel approach.

I know far too little statistics to be asking these questions, but I'd  
like to know about possibilities in R and Stata (and it is  
interesting!). I use Stata until I run into problems, and then I  
suffer to use R. Dang, that sounds damning of both, but it should only  
sound damning of the user.

Anyway, it does sound like R lme4 will gracefully handle variance  
components that estimate to zero, and therefore allow an uncomplicated  
loop to be performed across gene sets.

Thank you again.


On Jun 9, 2008, at 5:00 AM, r-sig-mixed-models-request at r-project.org  

> ----------------------------------------------------------------------
> Message: 1
> Date: Sun, 8 Jun 2008 11:07:43 +0000 (UTC)
> From: Gregor Gorjanc <gregor.gorjanc at bfro.uni-lj.si>
> Subject: Re: [R-sig-ME] variance components models with zero estimates
> To: r-sig-mixed-models at r-project.org
> Message-ID: <loom.20080608T110615-268 at post.gmane.org>
> Content-Type: text/plain; charset=us-ascii
> David Airey <david.airey at ...> writes:
>> When a variance components mixed model is run in Stata, if some of  
>> the
>> variance components are zero, the model may not converge, for  
>> rational
>> reasons according to the manual entry. However, when the same model  
>> is
>> run in SAS, the models with variance components that estimate to zero
> ...
> I remember that with lmer() you get a warning/message that says that
> variance component for a particular effect is effectively zero in  
> such cases and
> the model converges!
> gg
> ------------------------------
> Message: 2
> Date: Sun, 8 Jun 2008 11:33:31 -0500
> From: "Douglas Bates" <bates at stat.wisc.edu>
> Subject: Re: [R-sig-ME] variance components models with zero estimates
> To: "David Airey" <david.airey at vanderbilt.edu>
> Cc: r-sig-mixed-models at r-project.org
> Message-ID:
> 	<40e66e0b0806080933u2a779a9csa6e90c969441d6b0 at mail.gmail.com>
> Content-Type: text/plain; charset=ISO-8859-1
> On 6/7/08, David Airey <david.airey at vanderbilt.edu> wrote:
>> When a variance components mixed model is run in Stata, if some of  
>> the
>> variance components are zero, the model may not converge, for  
>> rational
>> reasons according to the manual entry. However, when the same model  
>> is run
>> in SAS, the models with variance components that estimate to zero
>> nonetheless converge. According to some SAS user friends, this is  
>> normal SAS
>> behavior (I'm new to SAS as of yesterday). If I'm interested in  
>> looping
>> through a set of such models, the SAS behavior is preferred.  
>> However, in
>> Stata such models can be formulated as multilevel models that can
>> dramatically reduce the dimension of the design matrix. The context  
>> where
>> both behaviors is important is mixed models for gene set enrichment
>> analysis, where there is a possibility of hundreds of models.
> That's a very interesting topic and something that I am tempted to
> answer in far too much detail.
> Let me provide the brief answer.  Currently there are two versions of
> the lme4 package available: a released version and a development
> version.  To install the development version use
> install.packages("lme4", repos = "http://r-forge.r-project.org")
> The major difference between these versions is how they handle
> evaluation of the log-likelihood when variance components are zero.
> That is, does the log-likelihood get evaluated in a way that behaves
> smoothly as a variance component goes to zero (and in the more general
> situation of a variance-covariance matrix approaching singularity)?
> The released version doesn't behave smoothly at zero and it is
> necessary to restrict variances to being slightly greater than zero.
> The development version does go to zero smoothly.
> Gregor Gorjanc's response relates to the released version.  Neither
> version throws an error condition when the estimates are zero or
> near-zero.  However, if you are concerned about estimates that may be
> zero I recommend using the development version of the lme4 package.
> It is more robust and reliable at determining the parameter estimates.
> (The natural question to ask is then why not release the development
> version?  I have not done so because there are some support routines,
> notably the Markov chain Monte Carlo sampler, whose versions in the
> new formulation are incomplete, and some software supporting books
> depends on those.)
>> Does R lme4 handle variance components mixed models that have  
>> estimates of
>> zero for some of the variance components like SAS or Stata?
> Neither.  The lmer function handles these situations correctly :-)
>> Is it possible
>> to loop through variance components models when some of the variance
>> components are zero?
> Yes.  Do bear in mind that much of the effort in fitting such models
> is spent establishing the structures representing the model and data -
> often more time will be spent formulating the numerical representation
> of the model than is spent in optimizing the parameter estimates.  If
> you find that your loops are taking a very long time to run you may
> want to dig deeper into the code and see if you can by-pass some of
> the steps and go directly to the optimization phase.  There is now a
> function called "refit" in lme4 that can be used to go directly to the
> optimization phase if the model and design are fixed and only response
> has been changed.
>> What is a suggested procedure for doing so in R? In
>> Stata, I would probably use an EM only guess at which variance  
>> components
>> were substantive, and then fit one of several models. What about R  
>> lme?
> i think we would need more detail to be able to answer such a  
> question.
> ------------------------------

More information about the R-sig-mixed-models mailing list