[BioC] combining GSA and lmFit

Wed Apr 29 19:28:45 CEST 2009

Hi Gordon,

Thanks for your thoughtful comments.  I think you have steered me in the right direction.  I see, after a more careful reading of your methods in limma and those in GSEAlm and GSA, that not knowing how to do permutations correctly when there are more than two groups is a problem.

I'm glad for all the help in keeping me from doing something dumb.

Cheers,
Dick

*******************************************************************************
Richard P. Beyer, Ph.D.	University of Washington
Tel.:(206) 616 7378	Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696	4225 Roosevelt Way NE, # 100
 			Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
http://staff.washington.edu/~dbeyer
*******************************************************************************

On Wed, 29 Apr 2009, Gordon K Smyth wrote:

> Dear Dick,
>
> On Tue, 28 Apr 2009, Dick Beyer wrote:
>
>> Hi Gordon,
>> 
>> Thanks for sharing your views on this topic.
>> 
>> I was wondering, when you say "Thirdly, GSA computes p-values from 
>> permutation, and permutation does not perform well for linear models," what 
>> is it about the permutation approach that does not perform very well?  Is it 
>> due to the null hypothesis being equality of distributions rather than 
>> assuming you are testing equality of means?
>
> Well, you could put it like that.  Suppose you have a one-way layout. The 
> trouble is that the groups other than the two you want to compare interfere 
> rather than help with the permutations.  Have a look at the research literature 
> -- there's very little on permutation outside the two group problem.
>
>> I see that the Bioconductor package GSEAlm which uses linear models with 
>> GSEA and uses sample permutations might have similar problems as combining 
>> GSA and lmFit.  I guess I was thinking a GSA/lmFit combination would be OK 
>> because of people using GSEAlm.  But if the underlying null assumption is 
>> equality of distributions rather than a weaker null of equality of means, 
>> then that would be important to keep in mind when interpreting the resulting 
>> p-values.
>
> My understanding is that GSEAlm implements proposals from a nice paper 
> published in Bioinformatics and is concerned with issues different to your 
> GSA/lm combination, but the authors may have more to say.
>
>> I wonder if there would be a useful way to test the issues you raise about 
>> the effect on gene set analysis when using limma or SAM statistics which 
>> depend on ensembles of genes, as well as the effect of the moderated 
>> statistics of limma or SAM on the GSA standardization method. I'm not sure I 
>> understand the GSA steps enough to know if those are designed to take care 
>> of problems you might otherwise use a SAM-type statistic to deal with.
>
> If it was easy to do GSA for linear models, I expect that Efron and
> Tibshirani would have done it.  I'm not sure it is wise to cobble
> something ad hoc together if you don't understand the theory very well.
>
> I didn't reply in order to point you to my work, but my group has taken an 
> approach to GSEA for linear models which we're very happy with, which is 
> implemented in the roast() and romer() functions of the limma package.
>
> Best wishes
> Gordon
>
>> Lots to think about.  Thanks very much for your comments.
>> 
>> Cheers,
>> Dick
>> 
>> On Tue, 28 Apr 2009, Gordon K Smyth wrote:
>> 
>>> Dear Dick,
>>> 
>>> Anything in GSA which works with the SAM statistic should also work fine
>>> with limma moderated t-statistics.
>>> 
>>> However there are several issues that come to my mind which affect both
>>> statistics.  Firstly, both SAM and limma statistics depend on the whole
>>> ensemble of genes, i.e., they are not merely computed genewise.  This is
>>> unlike the floored mean statistics assumed in the GSA theory paper.  This
>>> has clear computational implications, but also could give rise to some
>>> theoretical issues.
>>> 
>>> Secondly, it's not too clear to me whether it makes sense to compute
>>> regularized or moderated statistics after the standardization steps that
>>> GSA does.
>>> 
>>> Thirdly, GSA computes p-values from permutation, and permutation does not
>>> perform well for linear models.
>>> 
>>> These are simply my thoughts, which you asked for.  You may have ways
>>> around all these issues.
>>> 
>>> Best wishes
>>> Gordon
>>> 
>>>> Date: Sun, 26 Apr 2009 20:43:28 -0700 (PDT)
>>>> From: Dick Beyer <dbeyer at u.washington.edu>
>>>> Subject: [BioC] combining GSA and lmFit
>>>> To: Bioconductor <bioconductor at stat.math.ethz.ch>
>>>> 
>>>> Hi All,
>>>> 
>>>> I have extended the GSA code (http://www-stat.stanford.edu/~tibs/GSA/)
>>>> to include lmFit() from the limma package so as to have linear model
>>>> capabilities with GSA.  Basically, I'm using the moderated t-statistic
>>>> values from lmFit just like the SAM-like t-statistic values are used in
>>>> the GSA code.
>>>> 
>>>> I was wondering if anyone had any thoughts on whether this was, in
>>>> principle, an OK thing to be doing.  I am worrying about whether there
>>>> is an underlying issue I'm not aware of in using the moderated
>>>> t-statistic values from limma as opposed to the SAM t-statistic values,
>>>> which use the s0 term in the denominator.
>>>> 
>>>> My tests on some microarray data I have show that in a qqplot of
>>>> t-statistic values from the two methods, they are in pretty close
>>>> agreement except at large t-values.
>>>> 
>>>> If anyone knows of reasons not to be doing this or could point me to
>>>> places with possible explanations, I'd be very grateful.
>>>> 
>>>> Cheers,
>>>> Dick
>>>> 
>
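
A small arithmetic sketch of one possible reading of the permutation point made in the thread (that groups other than the two being compared "interfere rather than help"). It only counts label arrangements in a one-way layout. The design (three groups of three arrays) and the function names are assumptions chosen for illustration; nothing here is GSA, GSEAlm or limma code, and Python is used purely as a calculator.

```python
from math import comb


def two_group_relabelings(n1, n2):
    """Distinct ways to reassign labels within the two groups being compared."""
    return comb(n1 + n2, n1)


def all_group_relabelings(sizes):
    """Distinct ways to reassign labels across every group (multinomial coefficient)."""
    count, remaining = 1, sum(sizes)
    for size in sizes:
        count *= comb(remaining, size)
        remaining -= size
    return count


sizes = [3, 3, 3]  # hypothetical design: three groups, three arrays each
within_pair = two_group_relabelings(sizes[0], sizes[1])
across_all = all_group_relabelings(sizes)

print(within_pair)      # 20
print(1 / within_pair)  # 0.05, smallest attainable one-sided permutation p-value
print(across_all)       # 1680
```

Under these assumed sizes, relabelling only the two groups of interest gives 20 arrangements, so a one-sided permutation p-value cannot go below 0.05 and the third group contributes nothing. Relabelling across all three groups gives 1680 arrangements, but that is only justified under the stronger null that all three groups share one distribution, which is the equality-of-distributions concern raised earlier in the thread.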