[BioC] Simultaneous FDR control for multiple contrasts in edgeR
Hilary Smith
Hilary.A.Smith.964 at nd.edu
Tue May 14 14:25:46 CEST 2013
Thank you very much. I looked online, and between that and what you note
below this makes more sense to me now: the total number of false
discoveries scales with more tests, such that for an FDR at 0.05, there
could be one contrast with 100 tests (genes), where there are 5 false
discoveries out of 100 tests (genes), and a second contrast with say 50
out of 1000 tests (genes). Then if one combines/analyzes both contrasts,
there are 55 false discoveries out of 1100 tests, leading to an FDR =
55/1100, which still is 0.05. This is excellent (esp. as after talking
with my supervisor, we may want to add yet another contrast or two by
testing for the time*treatment interaction within each species (in
addition to the prior tests of treatment*species and main effects within
each time point, albeit the total will probably only be ~8 contrasts). I
definitely will look through the limma user guide as well, as you
suggested, for more information.
Thank you again; I really appreciate your insight as I definitely (for the
study, and simply to understand/curiosity) want to know and apply the
correct analysis.
Best,
Hilary
--------------------------------------------------
Dr. Hilary April Smith
Postdoctoral Research Associate
University of Notre Dame
Department of Biological Sciences
321 Galvin Life Sciences
On 5/13/13 8:02 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:
>Hi Hilary,
>
>Just to make it clear what I mean by scalability: Suppose that you
>successfully control the FDR to be less than 0.05 for each contrast
>separately. Then it is an automatic consequence that the FDR for all the
>tests pooled together is also less than 0.05.
>
>Best wishes
>Gordon
>
>On Tue, 14 May 2013, Gordon K Smyth wrote:
>
>> Dear Hilary,
>>
>> Unlike p-values (Type I error control), FDR is a scalable loss
>>function, so
>> it is not generally necessary to decrease the cutoff when adding more
>>tests.
>> It is not correct or necessary to divide FDR by the number of tests
>>done (a
>> la Bonferroni p-values).
>>
>> The limma package (decideTests function) has a number of strategies for
>> combining FDR across multiple contrasts as well as multiple genes.
>>However
>> the most commmon practice is the default, which is to simply do the FDR
>> adjustment separately for each contrast. For the limited number of
>>contrasts
>> you are considering, I would be happy enough with this strategy.
>>
>> Best wishes
>> Gordon
>>
>> ---------------------------------------------
>> Professor Gordon K Smyth,
>> Bioinformatics Division,
>> Walter and Eliza Hall Institute of Medical Research,
>> 1G Royal Parade, Parkville, Vic 3052, Australia.
>> http://www.statsci.org/smyth
>>
>> On Mon, 13 May 2013, Hilary Smith wrote:
>>
>>> One further question. I'm glad to hear it's acceptable to run the
>>> cpm>1)>=4 filtering, as well as the glmFit(Y, design), once on all 8
>>> groups (32 libraries), and then just use the makeContrasts function.
>>>Yet
>>> if I use makeContrasts to specify the 6 tests as noted in the prior
>>> email/post, should I lower my "significant" FDR cutoff from 0.05 to
>>> (0.05/6) = 0.0083? I know edgeR does a BH correction / FDR to account
>>>for
>>> the multiple testing of all the different genes or tags. However, I am
>>> unsure whether it's necessary to also have a correction for the 6
>>> different contrasts I am testing, or if simply making the model
>>>>design =
>>> model.matrix(~0+Group, data=Y$samples) where Y = DGEList(counts=D,
>>> group=Group), and Group contains all 8 groups (2 species * 2
>>>treatments *
>>> 2 times), intrinsically accounts for this in how it sets up the model.
>>>I
>>> assume makeContrasts or the design function accounts for this, but I
>>>just
>>> want to be sure.
>>>
>>> Thank you again,
>>> Hilary
>>>
>>>
>>>
>>> --------------------------------------------------
>>> Dr. Hilary April Smith
>>> Postdoctoral Research Associate
>>> University of Notre Dame
>>> Department of Biological Sciences
>>> 321 Galvin Life Sciences
>>>
>>>
>>>
>>>
>>>
>>> On 5/11/13 11:53 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:
>>>
>>>> Dear Hilary,
>>>>
>>>> Makes sense to me.
>>>>
>>>> Personally, I would analyse all the libraries together (with 8 groups)
>>>> instead of using separate glmFits for the two ages. The same
>>>>contrasts
>>>> will still work. Then there is no issue with different numbers of
>>>>rows
>>>> etc.
>>>>
>>>> Best wishes
>>>> Gordon
>>>>
>>>> On Sat, 11 May 2013, Hilary Smith wrote:
>>>>
>>>>> Dear Gordon,
>>>>> Thank you. We cannot really remove the 3-way interaction, because
>>>>>there
>>>>> are some genes that respond to the 3-way interaction (even though a
>>>>> classic parameterization with the intercept, leads to about an order
>>>>>of
>>>>> magnitude more genes responding to a 2-way vs. 3-way interaction).
>>>>>
>>>>> Testing for species*treatment at each time, definitely seems the
>>>>> closest way to address the questions we have. Roughly following
>>>>> section 3.3.1 of the edgeR user guide ("Defining each treatment
>>>>> combination as a group") on the creation of specific contrasts (and
>>>>> many thanks for your updated User Guides to show this formulation in
>>>>> detail), I set up tests as below. If this is valid, that's great, as
>>>>> it does make much more intuitive sense to me (and biological in
>>>>>terms
>>>>> of addressing our questions of interest).
>>>>>
>>>>> Thank you for your help; I REALLY appreciate it!
>>>>> Best,
>>>>> Hilary
>>>>>
>>>>>> design1 = model.matrix(~0+Group_time1, data=Ytime1$samples)
>>>>>> design2 = model.matrix(~0+Group_time2, data=Ytime2$samples)
>>>>>
>>>>>> myC_TimeI = makeContrasts(
>>>>> + Treatment1_vs_Treatment2_SpeciesA_TimeI =
>>>>>SpeciesA_TimeI_Treatment2 -
>>>>> SpeciesA_TimeI_Treatment1,
>>>>> + Treatment1_vs_Treatment2_SpeciesB_TimeI =
>>>>>SpeciesB_TimeI_Treatment2 -
>>>>> SpeciesB_TimeI_Treatment1,
>>>>> + Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeI =
>>>>> (SpeciesA_TimeI_Treatment2 - SpeciesA_TimeI_Treatment1)-(
>>>>> SpeciesB_TimeI_Treatment2 - SpeciesB_TimeI_Treatment1),
>>>>> + levels=design1)
>>>>>>
>>>>>> myC_TimeII = makeContrasts(
>>>>> + Treatment1_vs_Treatment2_SpeciesA_TimeII =
>>>>>SpeciesA_TimeII_Treatment2
>>>>> -
>>>>> SpeciesA_TimeII_Treatment1,
>>>>> + Treatment1_vs_Treatment2_SpeciesB_TimeII =
>>>>>SpeciesB_TimeII_Treatment2
>>>>> -
>>>>> SpeciesB_TimeII_Treatment1,
>>>>> + Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeII =
>>>>> (SpeciesA_TimeII_Treatment2 - SpeciesA_TimeII_Treatment1)-(
>>>>> SpeciesB_TimeII_Treatment2 - SpeciesB_TimeII_Treatment1),
>>>>> + levels=design2)
>>>>>
>>>>> Then, after using glmFit on each of the two makeContrasts above, I
>>>>>ran
>>>>> glmLRT 6 times, once on each of the 6 contrasts (i.e., on
>>>>> Treatment1_vs_Treatment2_SpeciesA_TimeI,
>>>>> Treatment1_vs_Treatment2_SpeciesB_TimeI, etc. through
>>>>> Treatment1_vs_Treatment2_ SpeciesAvsSpeciesB_TimeI.
>>>>>
>>>>> Ultimately I may then try to use the VennCounts/Diagram from limma
>>>>>to
>>>>> show the overlap between the 4 sets visually:
>>>>> Treatment1vsTreatment2_SpeciesA_TimeI; Treatment1
>>>>> vsTreatment2_SpeciesA_TimeII; Treatment1
>>>>>vsTreatment2_SpeciesB_TimeI;
>>>>> and Treatment1 vsTreatment2_SpeciesB_TimeII. However, this may take
>>>>>a
>>>>> bit to set up, because my two time points have slightly different
>>>>> gene lists (i.e., to run the analyses on the Times differently,
>>>>>after
>>>>> filtering out reads to only retain those with cpm>1 in „4
>>>>> libraries, TimesI and II did not retain exactly the same genes,
>>>>> though it¹s rather close).
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> --------------------------------------------------
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>>
>>>>> On 5/10/13 8:32 PM, "Gordon K Smyth" <smyth at wehi.EDU.AU> wrote:
>>>>>
>>>>>> Dear Hilary,
>>>>>>
>>>>>> Your test for the 3-way interaction is correct, although 3-way
>>>>>> interactions are pretty hard to interpret.
>>>>>>
>>>>>> However testing for the 2-way interaction in the presence of a
>>>>>>3-way
>>>>>> interaction does not make statistical sense. This is because the
>>>>>> parametrization of the 2-way interaction as a subset of the 3-way
>>>>>>is
>>>>>> somewhat arbitrary. Before you can test the 2-way interaction
>>>>>> species*treatment in a meaningful way you would need to accept that
>>>>>> the 3-way interaction is not necessary and remove it from the model.
>>>>>>
>>>>>> In general, I am of the opinion that classical statistical
>>>>>>factorial
>>>>>> interation models do not usually provide the most meaningful
>>>>>> parametrizations for genomic experiments. In most cases, I prefer
>>>>>> to fit the saturated model (a different level for each treatment
>>>>>> combination) and make specific contrasts. There is some discussion
>>>>>> of this in the limma User's Guide.
>>>>>>
>>>>>> In your case, I guess that you might want to test for
>>>>>> species*treatment interaction separately at each time point. It is
>>>>>> almost impossible to do this within the classical 3-way factorial
>>>>>> setup. However it is easy with the one-way approach I just
>>>>>> mentioned, or else you could use:
>>>>>>
>>>>>> ~Age + Age:Species + Age:Treatment + Age:Species:Treatment
>>>>>>
>>>>>> Best wishes
>>>>>> Gordon
>>>>>>
>>>>>>
>>>>>>> Date: Thu, 9 May 2013 14:55:46 -0400
>>>>>>> From: Hilary Smith <Hilary.A.Smith.964 at nd.edu>
>>>>>>> To: "bioconductor at r-project.org" <bioconductor at r-project.org>
>>>>>>> Subject: [BioC] Statistics question for multi-factor interaction
>>>>>>>test
>>>>>>> in edgeR
>>>>>>>
>>>>>>> Hi. I need to generate two GLM tests of a factorial design with
>>>>>>> RNA-Seq count data. I have 3 factors with 2 levels apiece (2
>>>>>>> species X 2 treatments X 2 times), and 4 separate replicates each
>>>>>>> (i.e., we made a total of 2*2*2*4 = 32 separate libraries). Our
>>>>>>> main interest is in the interaction of species*treatment, as we
>>>>>>> think species A will alter gene expression in the treatment stress
>>>>>>> vs. treatment benign, whereas species B is expected to show little
>>>>>>> change. However, we¹d like to also do another test of
>>>>>>> species*treatment*time, because it is possible that the ability of
>>>>>>> species A to alter gene expression in response to the stress
>>>>>>> treatment may differ at the 1st versus 2nd time point.
>>>>>>>
>>>>>>> I think the way to set this up, is to create a design matrix as
>>>>>>> follows, with the lrt test with coef 5 giving the differentially
>>>>>>> expressed genes for the species*treatment test, and coef 8 giving
>>>>>>> the the differentially expressed gene for the
>>>>>>> species*treatment*time test (after calling topTags that is). Yet
>>>>>>>to
>>>>>>> ensure I have the statistics correct, my questions are: (1) is
>>>>>>>this
>>>>>>> thinking correct, as I don¹t see many 3x2 factorial models to
>>>>>>> follow, and (2) do I need to set up a reference somehow (which I
>>>>>>> assume would be the set of four samples with
>>>>>>> TreatmentBenign*SpeciesB*Time2, but I¹m not fully sure if that is
>>>>>>> correct or needed).
>>>>>>>
>>>>>>> Many thanks in advance for your insight!
>>>>>>> ~Hilary
>>>>>>>
>>>>>>>> designFF <- model.matrix(~Treatment*Species*Age)
>>>>>>>> colnames(designFF)
>>>>>>> [1] "(Intercept)"
>>>>>>> [2] " TreatmentStress"
>>>>>>> [3] "SpeciesA "
>>>>>>> [4] "Time1"
>>>>>>> [5] "TreatmentStress:SpeciesA"
>>>>>>> [6] "TreatmentStress:Time1"
>>>>>>> [7] "SpeciesA:Time1"
>>>>>>> [8] "TreatmentStress:SpeciesA:Time1"
>>>>>>>
>>>>>>> And then to run tests with:
>>>>>>>> fit <- glmFit(y, designFF)
>>>>>>>
>>>>>>>> lrtInteractionStressSpecies <- glmLRT(fitFF, coef=5)
>>>>>>>> lrtInteractionStressSpeciesTime <- glmLRT(fitFF, coef=8)
>
>______________________________________________________________________
>The information in this email is confidential and intended solely for the
>addressee.
>You must not disclose, forward, print or use it without the permission of
>the sender.
>______________________________________________________________________
More information about the Bioconductor
mailing list