[BioC] combining GSA and lmFit
Dick Beyer
dbeyer at u.washington.edu
Wed Apr 29 19:28:45 CEST 2009
Hi Gordon,
Thanks for your thoughtful comments. I think you have steered me in the right direction. I see, after a more careful reading of your methods in limma and those in GSEAlm and GSA, that not knowing how to do permutations correctly when there are more than two groups is a problem.
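To make the more-than-two-groups problem concrete, here is a minimal sketch in Python. It is only one reading of the point about other groups interfering with the permutations, not code from GSA, GSEAlm or limma. The data values, the group names and the helper perm_pvalue() are all hypothetical and chosen for illustration. The sketch compares groups A and B in a three-group one-way layout, once permuting labels only within A and B, and once permuting labels across all three groups.

```python
from itertools import combinations
from statistics import mean

# Hypothetical one-way layout: only the A-versus-B comparison is of interest,
# but a third group C sits far away from both.
groups = {
    "A": [1.0, 1.2, 1.4],
    "B": [2.0, 2.2, 2.4],
    "C": [10.0, 10.2, 10.4],
}

def perm_pvalue(values, n_a, n_b, observed):
    """Exact two-sided permutation p-value for |mean(B) - mean(A)|.

    Enumerates every way to label n_a of `values` as A and n_b of the
    remainder as B; any values left over stand in for the other groups.
    """
    idx = range(len(values))
    hits = total = 0
    for a_idx in combinations(idx, n_a):
        rest = [i for i in idx if i not in a_idx]
        for b_idx in combinations(rest, n_b):
            diff = mean(values[i] for i in b_idx) - mean(values[i] for i in a_idx)
            total += 1
            # Small tolerance so the observed labelling counts as a tie.
            if abs(diff) >= abs(observed) - 1e-12:
                hits += 1
    return hits / total

observed = mean(groups["B"]) - mean(groups["A"])

# Permute labels within A and B only (group C is ignored entirely).
p_two_group = perm_pvalue(groups["A"] + groups["B"], 3, 3, observed)

# Permute labels across all three groups (global exchangeability null).
all_values = groups["A"] + groups["B"] + groups["C"]
p_all_groups = perm_pvalue(all_values, 3, 3, observed)

print("A/B-only permutations:  ", p_two_group)
print("all-group permutations: ", p_all_groups)
```

In this toy case, permuting across all groups shuffles the C values into the relabelled A and B groups, so the permutation distribution of the A-versus-B difference is dominated by C and the real A-versus-B difference looks unremarkable. Restricting the permutations to A and B avoids that, but then C contributes nothing to the test and only 20 relabellings are available.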
I'm glad for all the help in keeping me from doing something dumb.
Cheers,
Dick
*******************************************************************************
Richard P. Beyer, Ph.D. University of Washington
Tel.:(206) 616 7378 Env. & Occ. Health Sci. , Box 354695
Fax: (206) 685 4696 4225 Roosevelt Way NE, # 100
Seattle, WA 98105-6099
http://depts.washington.edu/ceeh/ServiceCores/FC5/FC5.html
http://staff.washington.edu/~dbeyer
*******************************************************************************
On Wed, 29 Apr 2009, Gordon K Smyth wrote:
> Dear Dick,
>
> On Tue, 28 Apr 2009, Dick Beyer wrote:
>
>> Hi Gordon,
>>
>> Thanks for sharing your views on this topic.
>>
>> I was wondering, when you say "Thirdly, GSA computes p-values from
>> permutation, and permutation does not perform well for linear models," what
>> is it about the permutation approach that does not perform very well? Is it
>> due to the null hypothesis being equality of distributions rather than
>> assuming you are testing equality of means?
>
> Well, you could put it like that. Suppose you have a one-way layout. The
> trouble is that the groups other than the two you want to compare interfere
> rather than help with the permutations. Have a look at the research literature
> -- there's very little on permutation outside the two group problem.
>
>> I see that the Bioconductor package GSEAlm which uses linear models with
>> GSEA and uses sample permutations might have similar problems as combining
>> GSA and lmFit. I guess I was thinking a GSA/lmFit combination would be OK
>> because of people using GSEAlm. But if the underlying null assumption is
>> equality of distributions rather than a weaker null of equality of means,
>> then that would be important to keep in mind when interpreting the resulting
>> p-values.
>
> My understanding is that GSEAlm implements proposals from a nice paper
> published in Bioinformatics and is concerned with issues different to your
> GSA/lm combination, but the authors may have more to say.
>
>> I wonder if there would be a useful way to test the issues you raise about
>> the effect on gene set analysis when using limma or SAM statistics which
>> depend on ensembles of genes, as well as the effect of the moderated
>> statistics of limma or SAM on the GSA standardization method. I'm not sure I
>> understand the GSA steps enough to know if those are designed to take care
>> of problems you might otherwise use a SAM-type statistic to deal with.
>
> If it was easy to do GSA for linear models, I expect that Efron and
> Tibshirani would have done it. I'm not sure it is wise to cobble
> something ad hoc together if you don't understand the theory very well.
>
> I didn't reply in order to point you to my work, but my group has taken an
> approach to GSEA for linear models which we're very happy with, which is
> implemented in the roast() and romer() functions of the limma package.
>
> Best wishes
> Gordon
>
>> Lots to think about. Thanks very much for your comments.
>>
>> Cheers,
>> Dick
>>
>> On Tue, 28 Apr 2009, Gordon K Smyth wrote:
>>
>>> Dear Dick,
>>>
>>> Anything in GSA which works with the SAM statistic should also work fine
>>> with limma moderated t-statistics.
>>>
>>> However there are several issues that come to my mind which affect both
>>> statistics. Firstly, both SAM and limma statistics depend on the whole
>>> ensemble of genes, i.e., they are not merely computed genewise. This is
>>> unlike the floored mean statistics assumed in the GSA theory paper. This
>>> has clear computational implications, but also could give rise to some
>>> theoretical issues.
>>>
>>> Secondly, it's not too clear to me whether it makes sense to compute
>>> regularized or moderated statistics after the standardization steps that
>>> GSA does.
>>>
>>> Thirdly, GSA computes p-values from permutation, and permutation does not
>>> perform well for linear models.
>>>
>>> These are simply my thoughts, which you asked for. You may have ways
>>> around all these issues.
>>>
>>> Best wishes
>>> Gordon
>>>
>>>> Date: Sun, 26 Apr 2009 20:43:28 -0700 (PDT)
>>>> From: Dick Beyer <dbeyer at u.washington.edu>
>>>> Subject: [BioC] combining GSA and lmFit
>>>> To: Bioconductor <bioconductor at stat.math.ethz.ch>
>>>>
>>>> Hi All,
>>>>
>>>> I have extended the GSA code (http://www-stat.stanford.edu/~tibs/GSA/) to
>>>> include lmFit() from the limma package so as to have linear model
>>>> capabilities with GSA. Basically, I'm using the modified t-statistic
>>>> values from lmFit just like the SAM-like t-statistic values are used in
>>>> the GSA code.
>>>>
>>>> I was wondering if anyone had any thoughts on whether this was, in
>>>> principle, an OK thing to be doing. I am worrying about whether there is
>>>> an underlying issue I'm not aware of in using the moderated t-statistic
>>>> values from limma as opposed to the SAM t-statistic values that use the
>>>> s0 term in the denominator.
>>>>
>>>> My tests on some microarray data I have show that in a qqplot of
>>>> t-statistic values from the two methods, they are in pretty close
>>>> agreement except for large values of the t-values.
>>>>
>>>> If anyone knows of reasons not to be doing this or could point me to
>>>> places with possible explanations, I'd be very grateful.
>>>>
>>>> Cheers,
>>>> Dick
>>>>
>
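For reference, the two statistics compared in this thread can be sketched side by side. This is a simplified Python illustration, not the SAM or limma implementation. The function names, the data and the constants sam_s0, prior_df and prior_var are hypothetical. In the real packages the fudge constant (SAM) and the prior degrees of freedom and prior variance (limma) are estimated from the whole ensemble of genes, which is why neither statistic is purely genewise. Here they are simply fixed by hand.

```python
import math
import statistics

def gene_summary(x, y):
    """Mean difference, pooled variance and residual df for one gene."""
    n1, n2 = len(x), len(y)
    df = n1 + n2 - 2
    diff = statistics.mean(y) - statistics.mean(x)
    pooled_var = ((n1 - 1) * statistics.variance(x)
                  + (n2 - 1) * statistics.variance(y)) / df
    return diff, pooled_var, df

def ordinary_t(diff, pooled_var, n1, n2):
    """Classical two-sample t-statistic, computed from this gene alone."""
    return diff / math.sqrt(pooled_var * (1 / n1 + 1 / n2))

def sam_like_d(diff, pooled_var, n1, n2, sam_s0):
    """SAM-style statistic: a constant s0 is added to the standard error."""
    return diff / (math.sqrt(pooled_var * (1 / n1 + 1 / n2)) + sam_s0)

def moderated_t(diff, pooled_var, df, n1, n2, prior_df, prior_var):
    """limma-style statistic: the variance is shrunk toward a prior value."""
    post_var = (prior_df * prior_var + df * pooled_var) / (prior_df + df)
    return diff / math.sqrt(post_var * (1 / n1 + 1 / n2))

# Hypothetical gene with a very small sample variance.
x = [1.00, 1.01, 0.99]
y = [1.10, 1.11, 1.09]
diff, pooled_var, df = gene_summary(x, y)

# Assumed ensemble-level constants (fixed here, estimated in practice).
sam_s0, prior_df, prior_var = 0.05, 4.0, 0.04

t_ord = ordinary_t(diff, pooled_var, len(x), len(y))
d_sam = sam_like_d(diff, pooled_var, len(x), len(y), sam_s0)
t_mod = moderated_t(diff, pooled_var, df, len(x), len(y), prior_df, prior_var)

print("ordinary t: ", t_ord)
print("SAM-like d: ", d_sam)
print("moderated t:", t_mod)
```

With these made-up constants both regularized statistics are much smaller in magnitude than the ordinary t for the low-variance gene, since each replaces the gene's own tiny standard error with something informed by the other genes.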