[BioC] GSEA and edgeR
Iain Gallagher
iaingallagher at btopenworld.com
Sat Sep 3 09:10:01 CEST 2011
Hello List
As a follow-up to my original query on GSEA with edgeR I recieved the following advice from Gordon Smyth:
"Of the gene set tests in limma, the only ones that can be used as part of a negative binomial based analysis of count data are wilcoxGST and geneSetTest. The others, roast and romer, make multivariate normal assumptions that are incompatible with count distributions.
To make a gene set test using geneSetTest, you would do something like
index <- lrtFilter$genes$ID %in% MyGeneSetSymbols
p.value <- geneSetTest(index, lrtFiltered$table$LR.statistic, type="f")
The approach does treat the genes as statistically independent, so gives p-values that are bit optimistic, but otherwise it is a productive approach.
To use romer, you would have to make some normal approximations:
effective.lib.size <- d$samples$lib.size * d$samples$norm.factors
y <- log2( t(t(d$counts+0.5) / (effective.lib.size+0.5)) )
Then you can use
test <- romer(geneIndex, y, design, contrast=7)
Actually, you don't even need to set the contrast argument, as the last coefficient is the default. The contrast argument of romer is like a combination of the coef and contrast arguments of glmLRT. If it is of length one, it is taken to be a coef. If it's a vector, it's taken to be a contrast."
I hope this advice is useful to others wishing to undertake GSEA with RNA-Seq data (it is for me!).
Best
Iain
--- On Thu, 1/9/11, Iain Gallagher <iaingallagher at btopenworld.com> wrote:
> From: Iain Gallagher <iaingallagher at btopenworld.com>
> Subject: Re: [BioC] GSEA and edgeR
> To: "Sean Davis" <sdavis2 at mail.nih.gov>
> Cc: smyth at wehi.edu.au, "bioconductor" <bioconductor at stat.math.ethz.ch>
> Date: Thursday, 1 September, 2011, 16:21
> Hi Sean
>
> Thanks for your reply.
>
> I have already carried out a goseq analysis on my data.
>
> As I understand it goseq (like GOstats) looks for
> enrichment in a list of genes defined a priori as regulated
> and flags categories those genes belong to as enriched or
> not in that list of regulated genes.
>
> I was more interested in testing 'sets' as a whole without
> defining a list of regulated genes up front.
>
> I should have mentioned in my original message that I'd
> used goseq already since the two techniques are related.
> Aplologies for that.
>
> Best
>
> Iain
>
>
>
> --- On Thu, 1/9/11, Sean Davis <sdavis2 at mail.nih.gov>
> wrote:
>
> > From: Sean Davis <sdavis2 at mail.nih.gov>
> > Subject: Re: [BioC] GSEA and edgeR
> > To: "Iain Gallagher" <iaingallagher at btopenworld.com>
> > Cc: "bioconductor" <bioconductor at stat.math.ethz.ch>,
> smyth at wehi.edu.au
> > Date: Thursday, 1 September, 2011, 16:11
> > Hi, Iain.
> >
> > Before proceeding, you might want to take a look at:
> >
> > http://bioconductor.org/packages/release/bioc/html/goseq.html
> >
> > This paper and others like it describe the problem
> and
> > another solution:
> >
> > http://www.ncbi.nlm.nih.gov/pubmed/21252076
> >
> > Sean
> >
> >
> > On Thu, Sep 1, 2011 at 9:13 AM, Iain Gallagher
> > <iaingallagher at btopenworld.com>
> > wrote:
> > > Dear List / Gordon
> > >
> > > I would like to carry out a GSEA analysis on
> some
> > RNA-Seq data. I have analysed the data using edgeR and
> have
> > a list of regulated genes. The data is paired - 6
> biological
> > samples, cells from each are infected or uninfected.
> > >
> > > Looking around there is little advice on the use
> of
> > GSEA in RNA-Seq data. Using edgeR (and having a
> relatively
> > small sample size) I was hoping to make use of the
> romer
> > algorithm which is implemented in limma. However I am
> having
> > a conceptual problem setting up my data for this.
> > >
> > > As the study uses paired sample data I have
> followed
> > the guidance for this type of analysis in the
> marvellous
> > edgeR manual. Searching for advice on how to conduct
> GSEA I
> > came across this post (http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg02020.html)
> > by Gordon Smyth (author of edgeR / limma) wherein he
> > describes a suitable strategy for edgeR and the
> rotational
> > GSEA algorithms.
> > >
> > > These algorithms require the specification of a
> > contrast (in my case between infected and uninfected
> > animals).
> > >
> > > e.g. test <- romer(geneIndex, countData,
> > design=design, contrast=contr[,1], nrot=9999)
> > >
> > > My design matrix looks like this:
> > >
> > > library(limma)
> > >
> > > samples <- rep(c('s1', 's2', 's3', 's4',
> 's5',
> > 's6'),2)
> > > infection <- c(rep(1,6), rep(2,6))
> > >
> > > s <- as.factor(samples)
> > > i <- as.factor(infection)
> > >
> > > design <- model.matrix(~s+i)
> > > colnames(design)[7] <- 'infection'
> > >
> > >> design
> > > (Intercept) ss2 ss3 ss4 ss5 ss6 infection
> > > 1 1 0 0 0 0 0
>
> > 0
> > > 2 1 1 0 0 0 0
>
> > 0
> > > 3 1 0 1 0 0 0
>
> > 0
> > > 4 1 0 0 1 0 0
>
> > 0
> > > 5 1 0 0 0 1 0
>
> > 0
> > > 6 1 0 0 0 0 1
>
> > 0
> > > 7 1 0 0 0 0 0
>
> > 1
> > > 8 1 1 0 0 0 0
>
> > 1
> > > 9 1 0 1 0 0 0
>
> > 1
> > > 10 1 0 0 1 0 0
>
> > 1
> > > 11 1 0 0 0 1 0
>
> > 1
> > > 12 1 0 0 0 0 1
>
> > 1
> > > attr(,"assign")
> > > [1] 0 1 1 1 1 1 2
> > > attr(,"contrasts")
> > > attr(,"contrasts")$s
> > > [1] "contr.treatment"
> > >
> > > attr(,"contrasts")$i
> > > [1] "contr.treatment"
> > >
> > > and in edgeR I carry out the following fit of a
> GLM to
> > get the genes regulated between infected and
> uninfected
> > animals i.e. fit a GLM for infection and no infection
> then
> > edgeR carries out likeliehood tests to find the genes
> where
> > the two models differ (I think!).
> > >
> > > #dispersion estimate
> > > d <- estimateGLMCommonDisp(d, design)
> > >
> > > #fit the NB GLM for each gene
> > > fitFiltered <- glmFit(d, design, dispersion =
> > d$common.dispersion)
> > >
> > > #carry out the likliehood ratio test
> > > lrtFiltered <- glmLRT(d, fitFiltered, coef =
> 7)
> > >
> > > #how many genes are DE?
> > > summary(decideTestsDGE(lrtFiltered))#DE genes
> > >
> > > So (finally) my question is how do I set up a
> contrast
> > (whilst keeping the pairing) for romer?
> > >
> > > Thanks
> > >
> > > Best
> > >
> > > Iain
> > >
> > >> sessionInfo()
> > > R version 2.13.1 (2011-07-08)
> > > Platform: x86_64-pc-linux-gnu (64-bit)
> > >
> > > locale:
> > > [1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C
> > > [3] LC_TIME=en_GB.utf8
> > LC_COLLATE=en_GB.utf8
> > > [5] LC_MONETARY=C
> > LC_MESSAGES=en_GB.utf8
> > > [7] LC_PAPER=en_GB.utf8 LC_NAME=C
> > > [9] LC_ADDRESS=C
> LC_TELEPHONE=C
> > > [11] LC_MEASUREMENT=en_GB.utf8
> LC_IDENTIFICATION=C
> > >
> > > attached base packages:
> > > [1] stats graphics grDevices utils
> > datasets methods base
> > >
> > > other attached packages:
> > > [1] annotate_1.30.0 GO.db_2.5.0
>
> > goseq_1.4.0
> > > [4] geneLenDataBase_0.99.7 BiasedUrn_1.04
>
> > org.Bt.eg.db_2.5.0
> > > [7] RSQLite_0.9-4 DBI_0.2-5
>
> > AnnotationDbi_1.14.1
> > > [10] Biobase_2.12.2 edgeR_2.2.5
>
> > limma_3.8.3
> > >
> > > loaded via a namespace (and not attached):
> > > [1] biomaRt_2.8.1 Biostrings_2.20.3
>
> > BSgenome_1.20.0
> > > [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8
>
> > grid_2.13.1
> > > [7] IRanges_1.10.6 lattice_0.19-33
>
> > Matrix_0.9996875-3
> > > [10] mgcv_1.7-6 nlme_3.1-102
>
> > RCurl_1.6-9
> > > [13] rtracklayer_1.12.4 tools_2.13.1
>
> > XML_3.4-2
> > > [16] xtable_1.5-6
> > >>
> > >
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >
> >
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
More information about the Bioconductor
mailing list