[BioC] "romer"ing and "roast"ing around gene sets

Mon Jul 19 01:57:21 CEST 2010

Dear Ramon,

I agree.  Using roast() on a database of gene sets is fine as long as you 
allow for multiple testing in an appropriate way.  We provide the mroast() 
function to try to make this easier.  My lab recently had occasion to use 
mroast() with all canonical pathways and we found that it took only a few 
minutes on an oldish PC with nrotations=9999.
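
To make the multiple-testing point concrete: one common way to allow for testing many sets at once is the Benjamini-Hochberg false discovery rate adjustment. The sketch below is in Python purely for illustration; it is not limma code, the function name bh_adjust is hypothetical, and the set names and p-values are made up.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values, in the input order."""
    m = len(pvalues)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical per-set p-values, one per gene set tested.
set_pvalues = {"setA": 0.01, "setB": 0.04, "setC": 0.03, "setD": 0.005}
adjusted = bh_adjust(list(set_pvalues.values()))
for name, adj in zip(set_pvalues, adjusted):
    print(name, round(adj, 4))
```

With thousands of sets, it is the adjusted values rather than the raw per-set p-values that should be compared with the chosen threshold.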

You're right, romer() and roast() are answering different questions.  As 
long as you're aware of this, you're on firm ground.  The reason we 
suggest romer() for really large-scale testing is simply that roast() 
can give so many statistically significant results that they become hard 
to interpret, especially if you use set.statistic="msq".  This 
might not be a problem for you.
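
The difference between the two questions can be shown with a toy calculation. The sketch below is in Python for illustration only and does not implement roast() or romer(): both real functions use rotation to obtain p-values, whereas this only contrasts the kind of set-level quantity each question looks at. The function names, gene names and statistics are all hypothetical.

```python
from statistics import mean


def self_contained_score(stats, gene_set):
    """Average strength of differential expression inside the set only."""
    return mean(abs(stats[g]) for g in gene_set)


def competitive_score(stats, gene_set):
    """How much stronger the set is than the genes outside the set."""
    inside = [abs(t) for g, t in stats.items() if g in gene_set]
    outside = [abs(t) for g, t in stats.items() if g not in gene_set]
    return mean(inside) - mean(outside)


# Hypothetical gene-level statistics: every gene is equally strongly DE.
stats = {f"gene{i}": 3.0 for i in range(1, 9)}
gene_set = {"gene1", "gene2", "gene3"}

print(self_contained_score(stats, gene_set))  # 3.0: the set is clearly DE
print(competitive_score(stats, gene_set))     # 0.0: the set does not stand out
```

When every gene is DE to the same degree, the self-contained quantity is large for any set, while the competitive quantity is zero for all of them. This is why a self-contained test run over a whole database can flag a great many sets.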

At this stage, roast() is the more mature software product.  While we've 
used romer() for a study published in Blood, we haven't yet published the 
methodology in its own right, and it will probably be refined a bit more 
before we do so.

Thanks for the P.S. about the documentation.  I've updated it now.

Best regards
Gordon

On Fri, 16 Jul 2010, Ramon Diaz-Uriarte wrote:

> Dear Gordon,
>
> Reading your email, I think there is something I am not following 
> completely. You say, regarding the GSEA-like approach in "romer"
>
>> This is actually a biologically well-motivated approach when you are 
>> testing large numbers of sets.
>>
>> If you want to test every set in the MSigDB, then testing one by one 
>> with roast() would probably be just too slow anyway.  romer() is more 
>> efficient when the number of sets is very large.
>
> What I found very attractive about roast is that the differential 
> expression test is done for groups of genes so, in addition to possible 
> increases in power, interpretation is simplified (e.g., if we use all 
> the GO categories, we deal only with ~ 1500 entities).  Even if the 
> examples in your Bioinformatics paper involve just a few sets, I was 
> thinking about systematically using roast in, say, all GO categories, or 
> all the 690 canonical pathways.
>
> Moreover, if we want to use the "focused gene testing", even if roast 
> takes longer, I do not see how the larger efficiency of romer would make 
> it an alternative procedure: they are answering different questions, 
> right?
>
>
> But now, I am starting to think that systematically testing all 1500 
> GO categories might be a bad idea.
>
>
> Best,
>
> R.
>
> P.S. The help for roast says y must be a numeric matrix. But I think 
> it works fine with ExpressionSet objects directly, too.
>
>
> On Thursday 15 July 2010 03:29:49 Gordon K Smyth wrote:
>> Dear Robert,
>>
>> I'm just adding briefly to Di's comments.
>>
>>> From: "Robert M. Flight" <rflight79 at gmail.com>
>>> To: bioconductor at stat.math.ethz.ch
>>> Subject: [BioC] "romer"ing and "roast"ing around gene sets
>>>
>>> Hi All,
>>>
>>> I am having trouble with the distinction between the functions "roast"
>>> and "romer" in the limma package. From the publication describing
>>> "roast" (http://dx.doi.org/10.1093/bioinformatics/btq401), it seems that
>>> it tests a particular gene set for differential expression, whereas
>>> "romer" tests a battery of sets to find those that are differentially
>>> expressed compared to the rest?
>>
>> Yes.
>>
>>> I am really having trouble discerning the true difference between these
>>> two, and how they compare to GSEA. I always thought that the primary
>>> purpose of GSEA was to determine those gene sets that are significantly
>>> associated with a phenotypic comparison, i.e. those gene sets showing
>>> differential expression.
>>
>> This is an understandable assumption, which isn't quite true!  GSEA
>> actually tries to pick out the sets that stand out as more strongly
>> differentially expressed (DE) than others.  So, if all the sets were DE to
>> exactly the same degree, then GSEA wouldn't find anything significant,
>> because no set would stand out from the others.  This is actually a
>> biologically well-motivated approach when you are testing large numbers of
>> sets.
>>
>> If you want to test every set in the MSigDB, then testing one by one with
>> roast() would probably be just too slow anyway.  romer() is more efficient
>> when the number of sets is very large.
>>
>> Beware that romer(), like GSEA, tends to give pretty modest p-values.
>> The ranking of the sets may be more useful than the absolute p-values.
>>
>> Best wishes
>> Gordon
>>
>>> If any one can help me clear this up, that would be great, because as of
>>> now I am thoroughly confused. To me, if I have a dataset, and I want to
>>> know which gene sets (from say MSigDB) are differentially expressed,
>>> then it sounds like I would use "roast", but the way it is described in
>>> the publication (and the help in limma), this isn't what I would do, but
>>> rather I should use "romer", and see if any of the sets show
>>> differential expression compared to the rest in the database.
>>>
>>> Color me confused,
>>>
>>> -Robert
>>>
>>> Robert M. Flight, Ph.D.
>>> Bioinformatics and Biomedical Computing Laboratory
>>> University of Louisville
>>> Louisville, KY
>>>
>>> PH 502-852-0467
>>> EM robert.flight at louisville.edu
>>> EM rflight79 at gmail.com
>>>
>>> Williams and Holland's Law:
>>>     If enough data is collected, anything may be proven by
>>>     statistical methods.
> -- 
> Ramon Diaz-Uriarte
> Structural Biology and Biocomputing Programme
> Spanish National Cancer Centre (CNIO)
> http://ligarto.org/rdiaz
> Phone: +34-91-732-8000 ext. 3019
> Fax: +34-91-224-6972
