[BioC] combining GSA and lmFit

Wed Apr 29 02:12:26 CEST 2009

Dear Dick,

On Tue, 28 Apr 2009, Dick Beyer wrote:

> Hi Gordon,
>
> Thanks for sharing your views on this topic.
>
> I was wondering, when you say "Thirdly, GSA computes p-values from 
> permutation, and permutation does not perform well for linear models," 
> what is it about the permutation approach that does not perform very 
> well?  Is it due to the null hypothesis being equality of distributions 
> rather than assuming you are testing equality of means?

Well, you could put it like that.  Suppose you have a one-way layout. 
The trouble is that the groups other than the two you want to compare 
interfere rather than help with the permutations.  Have a look at the 
research literature -- there's very little on permutation outside the two 
group problem.
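
One way to read the point above is by counting relabelings.  The sketch
below is only an illustration under assumed numbers (a hypothetical
one-way layout with three groups A, B and C of three samples each; the
function names are made up for this example).  Permuting labels within
A and B alone leaves very few distinct relabelings, and group C adds
nothing to the comparison.  Permuting across all three groups gives many
more relabelings, but those correspond to the global null that all
groups share one distribution, not to the A versus B contrast.

```python
from math import comb, factorial


def two_group_relabelings(n_a, n_b):
    """Distinct ways to reassign labels when only two groups are permuted."""
    return comb(n_a + n_b, n_a)


def all_group_relabelings(sizes):
    """Distinct ways to reassign labels across every group (multinomial count)."""
    total = factorial(sum(sizes))
    for n in sizes:
        total //= factorial(n)
    return total


# Hypothetical design: three groups of three samples each.
sizes = {"A": 3, "B": 3, "C": 3}

# Permuting only within the two groups being compared; group C is ignored.
restricted = two_group_relabelings(sizes["A"], sizes["B"])

# Permuting across all groups; this matches the global null, not A vs B.
unrestricted = all_group_relabelings(list(sizes.values()))

# Smallest one-sided p-value an exact A vs B permutation test can attain,
# assuming no ties in the test statistic.
smallest_one_sided_p = 1 / restricted

print(restricted, unrestricted, smallest_one_sided_p)
```

With three samples per group, the restricted test cannot produce a
one-sided p-value below 1/20 however strong the effect is, and adding
more samples to group C does not change that.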

> I see that the Bioconductor package GSEAlm which uses linear models with 
> GSEA and uses sample permutations might have similar problems as 
> combining GSA and lmFit.  I guess I was thinking a GSA/lmFit combination 
> would be OK because of people using GSEAlm.  But if the underlying null 
> assumption is equality of distributions rather than a weaker null of 
> equality of means, then that would be important to keep in mind when 
> interpreting the resulting p-values.

My understanding is that GSEAlm implements proposals from a nice paper 
published in Bioinformatics and is concerned with issues different to your 
GSA/lm combination, but the authors may have more to say.

> I wonder if there would be a useful way to test the issues you raise 
> about the effect on gene set analysis when using limma or SAM statistics 
> which depend on ensembles of genes, as well as the effect of the 
> moderated statistics of limma or SAM on the GSA standardization method. 
> I'm not sure I understand the GSA steps enough to know if those are 
> designed to take care of problems you might otherwise use a SAM-type 
> statistic to deal with.

If it was easy to do GSA for linear models, I expect that Efron and
Tibshirani would have done it.  I'm not sure it is wise to cobble
something ad hoc together if you don't understand the theory very well.

I didn't reply in order to point you to my work, but my group has taken an 
approach to GSEA for linear models which we're very happy with, which is 
implemented in the roast() and romer() functions of the limma package.

Best wishes
Gordon

> Lots to think about.  Thanks very much for your comments.
>
> Cheers,
> Dick
> *******************************************************************************
> Richard P. Beyer, Ph.D.	University of Washington
> Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
> Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
> 			Seattle, WA 98105-6099
> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
> http://staff.washington.edu/~dbeyer
> *******************************************************************************
>
> On Tue, 28 Apr 2009, Gordon K Smyth wrote:
>
>> Dear Dick,
>>
>> Anything in GSA which works with the SAM statistic should also work fine
>> with limma moderated t-statistics.
>>
>> However there are several issues that come to my mind which affect both
>> statistics.  Firstly, both SAM and limma statistics depend on the whole
>> ensemble of genes, i.e., they are not merely computed genewise.  This is
>> unlike the floored mean statistics assumed in the GSA theory paper.  This
>> has clear computational implications, but also could give rise to some
>> theoretical issues.
>>
>> Secondly, it's not too clear to me whether it makes sense to compute
>> regularized or moderated statistics after the standardization steps that GSA
>> does.
>>
>> Thirdly, GSA computes p-values from permutation, and permutation does not
>> perform well for linear models.
>>
>> These are simply my thoughts, which you asked for.  You may have ways around
>> all these issues.
>>
>> Best wishes
>> Gordon
>>
>>> Date: Sun, 26 Apr 2009 20:43:28 -0700 (PDT)
>>> From: Dick Beyer <dbeyer at u.washington.edu>
>>> Subject: [BioC] combining GSA and lmFit
>>> To: Bioconductor <bioconductor at stat.math.ethz.ch>
>>>
>>> Hi All,
>>>
>>> I have extended the GSA code (http://www-stat.stanford.edu/~tibs/GSA/) to
>>> include lmFit() from the limma package so as to have linear model
>>> capabilities with GSA.  Basically, I'm using the modified t-statistic
>>> values from lmFit just like the SAM-like t-statistic values are used in
>>> the GSA code.
>>>
>>> I was wondering if anyone had any thoughts on whether this was, in
>>> principle, an OK thing to be doing.  I am worrying about whether there is
>>> an underlying issue I'm not aware of in using the moderated t-statistic
>>> values from limma as opposed to the SAM t-statistic values that use the
>>> s0 term in the denominator.
>>>
>>> My tests on some microarray data I have show that in a qqplot of
>>> t-statistic values from the two methods, they are in pretty close
>>> agreement except for large t-values.
>>>
>>> If anyone knows of reasons not to be doing this or could point me to
>>> places with possible explanations, I'd be very grateful.
>>>
>>> Cheers,
>>> Dick
>>>
>>> *******************************************************************************
>>> Richard P. Beyer, Ph.D.	University of Washington
>>> Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
>>> Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
>>> 			Seattle, WA 98105-6099
>>> http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
>>> http://staff.washington.edu/~dbeyer
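
For readers comparing the two statistics discussed in this thread, here
is a minimal sketch of their usual textbook forms for a single gene.  It
is not the GSA, SAM or limma code.  The function and argument names are
made up for this example, and the shrinkage inputs (s0_sam, s_prior,
df_prior) are passed in as given numbers, whereas the real packages
estimate them from the whole ensemble of genes.

```python
from math import sqrt


def ordinary_t(diff, s, v):
    """Ordinary t: effect estimate over its standard error s * sqrt(v)."""
    return diff / (s * sqrt(v))


def sam_d(diff, s, v, s0_sam):
    """SAM-style statistic: a constant s0 is added to the standard error."""
    return diff / (s * sqrt(v) + s0_sam)


def moderated_t(diff, s, v, df, s_prior, df_prior):
    """limma-style moderated t: the gene's variance is shrunk towards a prior.

    The posterior variance is a degrees-of-freedom weighted average of the
    prior variance and the gene's own residual variance.
    """
    s2_post = (df_prior * s_prior**2 + df * s**2) / (df_prior + df)
    return diff / (sqrt(s2_post) * sqrt(v))


# Hypothetical gene: two groups of three samples, so v = 1/3 + 1/3 and
# the residual degrees of freedom are 4.
diff, s, v, df = 2.0, 0.5, 2.0 / 3.0, 4

t_ord = ordinary_t(diff, s, v)
d_sam = sam_d(diff, s, v, s0_sam=0.1)
t_mod = moderated_t(diff, s, v, df, s_prior=1.0, df_prior=4)

print(t_ord, d_sam, t_mod)
```

Both adjustments change only the denominator.  Setting s0_sam to zero,
or df_prior to zero, gives back the ordinary t-statistic, which is one
reason the two statistics tend to agree for most genes and differ mainly
where the gene-specific standard deviation is unusually small.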