[BioC] edgeR - coefficients in 3-factor experiment, complex contrasts and decideTestsDGE
Gordon K Smyth
smyth at wehi.EDU.AU
Tue Oct 18 01:19:12 CEST 2011
Dear Emanuel,
> From: Emanuel Heitlinger <emanuelheitlinger at googlemail.com>
> To: bioconductor at r-project.org
> Subject: [BioC] edgeR - coefficients in 3-factor experiment, complex
> contrasts and decideTestsDGE
>
> Hi all,
>
> I have some questions regarding multi-factor-glms in edgeR. I am working
> on a RNA-seq experiment:
>
> I have 24 samples from 3 "treatments" each having two levels. This means
> 3 biological replicates per treatment combination. I want to model the
> full set of possible interactions (sex.conds*eel.conds*pop.conds), as
> expecially two-fold interactions could be very interetsing for my
> research-question.
>
> I want to categorize genes (contigs form a 454-transcriptome assembly,
> genome is unsequenced) I mapped my reads against into categoris:
> a) only different for sex
> b) only different for eel
> c) only different for pop
> d1)d2)d3) different for sex:eel, sex:pop, eel:pop
I am always a bit uneasy about the use of factorial models in the context
of genomic research. As a statistician, I've used factorial models all my
working life, but the hierarchical hypotheses implied by two and three
level interactions in a three factor model don't seem to me to correspond
to scientific questions that one would want to ask in genomic research. I
have to admit that is why I haven't made it particularly easy to input
factorial models into limma or edgeR. In your case, I'd probably feel
more comfortable making direct contrasts between your eight distinct
groups.
I wonder what you will do with genes that show a significant 3-way
interaction? In the factorial model framework, there is nothing you can
do with such genes, no further hypotheses you can test. You would have to
put them into all of your categories d1, d2 and d3.
Anyway, I'll try to answer your questions directly. Thanks for giving a
code example. Let's say that you want to find genes whose expression does
or does not depend on eel or pop in any way. If not, then sex would be
the only predictor. In your example, sex in the second column of the
design matrix:
> colnames(design)
[1] "(Intercept)" "sex.condsmale"
[3] "eel.condsAj" "pop.condsTW"
[5] "sex.condsmale:eel.condsAj" "sex.condsmale:pop.condsTW"
[7] "eel.condsAj:pop.condsTW" "sex.condsmale:eel.condsAj:pop.condsTW"
So you can use:
d <- DGEList(y)
d <- estimateGLMCommonDisp(d, design=design)
fit <- glmFit(d, design)
lrt <- glmLRT(d, fit, coef=3:8)
topTags(lrt)
This does a test on 6 degrees of freedom. Genes that appear in the
toptable do depend on eel or pop either in a main effect or in an
interaction.
Some of your questions are composite questions that don't have simple
statistical answers even for univariate data. For example, when you want
genes that depend only on sex, you want sex to be significant as main
effect but not the test given above. Traditional statistics has a bit of
difficulty with testing whether a difference is not present. I think
you'll have to make your own ad hoc judgement as to what constitutes
non-significance, then make up a truth vector yourself for each hypothesis
you want to test, then find overlaps between the results yourself, rather
than using decideTestsDGE.
> In glmLRT giving simple coefficients would compare the complete model to
> a model removing one coefficient at a time. From application of glms in
> ecology I remember that an interaction effect should not be left in the
> model if the main effect is removed. Does this apply here? Should I
> compare the full model against e.g. the model minus pop, sex:pop,
> eel:pop and sex:eel:pop, when I want to remove condition "pop" from the
> model?
Yes, that's true. To test whether pop makes any contribution to
expression changes, you need:
lrt <- glmFit(d,design,coef=grep("pop",colnames(design)))
etc. That does a test on 4 degrees of freedom.
Best wishes
Gordon
> Hope this code demonstrates what I mean:
>
> ## CODE start
> library(edgeR)
>
> ## generate a df of neg. bionm. counts
> y <- as.data.frame(matrix(rnbinom(6000,mu=10,size=10),ncol=24))
> names(y) <- c("AA_R11M", "AA_R16M", "AA_R18F", "AA_R28F",
> "AA_R2M", "AA_R8F", "AA_T12F", "AA_T20F",
> "AA_T24M", "AA_T3M", "AA_T42M", "AA_T45F",
> "AJ_R1F", "AJ_R1M", "AJ_R3F", "AJ_R3M",
> "AJ_R5F", "AJ_R5M", "AJ_T19M", "AJ_T20M",
> "AJ_T25M", "AJ_T26F", "AJ_T5F", "AJ_T8F")
>
> sex.conds <- factor(ifelse(grepl("M$", names(y)), "male", "female" ))
> eel.conds <- factor(ifelse(grepl("^AA", names(y)), "Aa", "Aj" ))
> pop.conds <- factor(ifelse(grepl("\\w\\w_R.*", names(y)), "EU", "TW" ))
>
> design <- model.matrix(~sex.conds*eel.conds*pop.conds)
>
> ## Counts frame to full DGEList
> d <- DGEList(y, lib.size=colSums(y))
> d <- calcNormFactors(d)
> d <- estimateGLMCommonDisp(d, design=design)
> d <- estimateGLMTrendedDisp(d, design=design)
> d <- estimateGLMTagwiseDisp(d, design=design)
>
> glmfit <- glmFit(d, design, dispersion=d$tagwise.dispersion)
>
> glm.wrapper <- function(de.obj, fit.obj, conds.regex){
> glm.list <- list()
> coe <- names(as.data.frame(fit.obj$design))
> coe.l <- lapply(conds.regex, function (x) grep(x, coe))
> for (i in 1:length(conds.regex)){
> glm.list[[conds.regex[[i]]]] <- glmLRT(de.obj, fit.obj, coef=coe.l[[i]])
> }
> return(glm.list)
> }
>
> ## selects all coefficients being contained in each other hierachically
> combi.conds <- gsub(":", ".*", names(as.data.frame(glmfit$design))[2:8])
> glm.l <- glm.wrapper(d, glmfit, combi.conds)
>
> ## show what is compared
> lapply(glm.l, function (x) x$comparison)
>
> ## topTags works (same as using p.adjust directly)
> topTags.l <- lapply(glm.l, function (x){
> tt <- topTags(x, n=40000) ## set as high as the length
> tt[tt$table$adj.P.Val<0.05] ## get only below adj.P
> })
>
> ## Code End
>
> Then I would look through the topTags list to categorize the contigs
> as stated above. E.g. from topTags.l[[1]] I would take only those not
> in topTags.l[[c(4, 5, 7]] to get category a) stated above, from
> topTags.l[[4]] only those not in topTagl.l[[7]] to get d1. This seems
> all a bit complicated to me, is this a correct way of doing this?
>
> I am alos a bit worried that decideTestsDGE seems to not work on
> DGELRT-objects with complicated coefficients. Eg.
>
> ## Code Start
> ## decideTestsDGE does not work
> decideTestsDGE.l <- lapply(glm.l, function (x){
> subset(x$table, (decideTestsDGE(x, p.value = .05))!=0)})
> ## Code End
>
> I saw that for simple coefficients the results of decideTestsDGE
> differ from topTags. Is this expected, what is the difference between
> the two, thought both do p-value adjustment the same way (I could
> provide code if these differenced would not be the expected
> behaviour)?
>
> These are my questions for now. I would be very greatful for help!
>
> All the Best,
> Emanuel
>
>
> sessionInfo()
> R version 2.13.0 (2011-04-13)
> Platform: x86_64-redhat-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> LC_TIME=en_US.UTF-8
> [4] LC_COLLATE=en_US.UTF-8 LC_MONETARY=C
> LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> LC_ADDRESS=C
> [10] LC_TELEPHONE=C LC_MEASUREMENT=en_US.UTF-8
> LC_IDENTIFICATION=C
>
> attached base packages:
> [1] splines stats graphics grDevices utils datasets
> methods base
>
> other attached packages:
> [1] edgeR_2.2.6
>
> loaded via a namespace (and not attached):
> [1] limma_3.8.3 tools_2.13.0
>
> --
> Emanuel Heitlinger
>
> Karlsruhe Institute of Technology (KIT)
> Zoological Institute; Ecology/Parasitology
> Kornblumenstr. 13
> 76131 Karlsruhe
> Germany
> Telephone +49 (0)721-608 47654
>
> or
> University of Edinburgh
> Institute of Evolutionary Biology
> Kings Buildings, Ashworth Laboratories, West Mains Road
> Edinburgh EH9 3JT
> Scotland, UK
> Telephone:+44 (0)131-650 7403
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list