[BioC] GSEA and edgeR

Sat Sep 3 09:10:01 CEST 2011

Hello List

As a follow-up to my original query on GSEA with edgeR I recieved the following advice from Gordon Smyth:

"Of the gene set tests in limma, the only ones that can be used as part of a negative binomial based analysis of count data are wilcoxGST and geneSetTest.  The others, roast and romer, make multivariate normal assumptions that are incompatible with count distributions.

To make a gene set test using geneSetTest, you would do something like

  index <- lrtFilter$genes$ID %in% MyGeneSetSymbols
  p.value <- geneSetTest(index, lrtFiltered$table$LR.statistic, type="f")

The approach does treat the genes as statistically independent, so gives p-values that are bit optimistic, but otherwise it is a productive approach.

To use romer, you would have to make some normal approximations:

  effective.lib.size <- d$samples$lib.size * d$samples$norm.factors
  y <- log2( t(t(d$counts+0.5) / (effective.lib.size+0.5)) )

Then you can use

  test <- romer(geneIndex, y, design, contrast=7)

Actually, you don't even need to set the contrast argument, as the last coefficient is the default.  The contrast argument of romer is like a combination of the coef and contrast arguments of glmLRT.  If it is of length one, it is taken to be a coef.  If it's a vector, it's taken to be a contrast."

I hope this advice is useful to others wishing to undertake GSEA with RNA-Seq data (it is for me!).

Best

Iain

--- On Thu, 1/9/11, Iain Gallagher <iaingallagher at btopenworld.com> wrote:

> From: Iain Gallagher <iaingallagher at btopenworld.com>
> Subject: Re: [BioC] GSEA and edgeR
> To: "Sean Davis" <sdavis2 at mail.nih.gov>
> Cc: smyth at wehi.edu.au, "bioconductor" <bioconductor at stat.math.ethz.ch>
> Date: Thursday, 1 September, 2011, 16:21
> Hi Sean
> 
> Thanks for your reply.
> 
> I have already carried out a goseq analysis on my data. 
> 
> As I understand it goseq (like GOstats) looks for
> enrichment in a list of genes defined a priori as regulated
> and flags categories those genes belong to as enriched or
> not in that list of regulated genes.
> 
> I was more interested in testing 'sets' as a whole without
> defining a list of regulated genes up front.
> 
> I should have mentioned in my original message that I'd
> used goseq already since the two techniques are related.
> Aplologies for that.
> 
> Best
> 
> Iain
> 
> 
> 
> --- On Thu, 1/9/11, Sean Davis <sdavis2 at mail.nih.gov>
> wrote:
> 
> > From: Sean Davis <sdavis2 at mail.nih.gov>
> > Subject: Re: [BioC] GSEA and edgeR
> > To: "Iain Gallagher" <iaingallagher at btopenworld.com>
> > Cc: "bioconductor" <bioconductor at stat.math.ethz.ch>,
> smyth at wehi.edu.au
> > Date: Thursday, 1 September, 2011, 16:11
> > Hi, Iain.
> > 
> > Before proceeding, you might want to take a look at:
> > 
> > http://bioconductor.org/packages/release/bioc/html/goseq.html
> > 
> > This paper and others like it describe the problem
> and
> > another solution:
> > 
> > http://www.ncbi.nlm.nih.gov/pubmed/21252076
> > 
> > Sean
> > 
> > 
> > On Thu, Sep 1, 2011 at 9:13 AM, Iain Gallagher
> > <iaingallagher at btopenworld.com>
> > wrote:
> > > Dear List / Gordon
> > >
> > > I would like to carry out a GSEA analysis on
> some
> > RNA-Seq data. I have analysed the data using edgeR and
> have
> > a list of regulated genes. The data is paired - 6
> biological
> > samples, cells from each are infected or uninfected.
> > >
> > > Looking around there is little advice on the use
> of
> > GSEA in RNA-Seq data. Using edgeR (and having a
> relatively
> > small sample size) I was hoping to make use of the
> romer
> > algorithm which is implemented in limma. However I am
> having
> > a conceptual problem setting up my data for this.
> > >
> > > As the study uses paired sample data I have
> followed
> > the guidance for this type of analysis in the
> marvellous
> > edgeR manual. Searching for advice on how to conduct
> GSEA I
> > came across this post (http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg02020.html)
> > by Gordon Smyth (author of edgeR / limma) wherein he
> > describes a suitable strategy for edgeR and the
> rotational
> > GSEA algorithms.
> > >
> > > These algorithms require the specification of a
> > contrast (in my case between infected and uninfected
> > animals).
> > >
> > > e.g. test <- romer(geneIndex, countData,
> > design=design, contrast=contr[,1], nrot=9999)
> > >
> > > My design matrix looks like this:
> > >
> > > library(limma)
> > >
> > > samples <- rep(c('s1', 's2', 's3', 's4',
> 's5',
> > 's6'),2)
> > > infection <- c(rep(1,6), rep(2,6))
> > >
> > > s <- as.factor(samples)
> > > i <- as.factor(infection)
> > >
> > > design <- model.matrix(~s+i)
> > > colnames(design)[7] <- 'infection'
> > >
> > >> design
> > >   (Intercept) ss2 ss3 ss4 ss5 ss6 infection
> > > 1            1   0   0   0   0   0  
>    
> >   0
> > > 2            1   1   0   0   0   0  
>    
> >   0
> > > 3            1   0   1   0   0   0  
>    
> >   0
> > > 4            1   0   0   1   0   0  
>    
> >   0
> > > 5            1   0   0   0   1   0  
>    
> >   0
> > > 6            1   0   0   0   0   1  
>    
> >   0
> > > 7            1   0   0   0   0   0  
>    
> >   1
> > > 8            1   1   0   0   0   0  
>    
> >   1
> > > 9            1   0   1   0   0   0  
>    
> >   1
> > > 10           1   0   0   1   0   0  
>    
> >   1
> > > 11           1   0   0   0   1   0  
>    
> >   1
> > > 12           1   0   0   0   0   1  
>    
> >   1
> > > attr(,"assign")
> > > [1] 0 1 1 1 1 1 2
> > > attr(,"contrasts")
> > > attr(,"contrasts")$s
> > > [1] "contr.treatment"
> > >
> > > attr(,"contrasts")$i
> > > [1] "contr.treatment"
> > >
> > > and in edgeR I carry out the following fit of a
> GLM to
> > get the genes regulated between infected and
> uninfected
> > animals i.e. fit a GLM for infection and no infection
> then
> > edgeR carries out likeliehood tests to find the genes
> where
> > the two models differ (I think!).
> > >
> > > #dispersion estimate
> > > d <- estimateGLMCommonDisp(d, design)
> > >
> > > #fit the NB GLM for each gene
> > > fitFiltered <- glmFit(d, design, dispersion =
> > d$common.dispersion)
> > >
> > > #carry out the likliehood ratio test
> > > lrtFiltered <- glmLRT(d, fitFiltered, coef =
> 7)
> > >
> > > #how many genes are DE?
> > > summary(decideTestsDGE(lrtFiltered))#DE genes
> > >
> > > So (finally) my question is how do I set up a
> contrast
> > (whilst keeping the pairing) for romer?
> > >
> > > Thanks
> > >
> > > Best
> > >
> > > Iain
> > >
> > >> sessionInfo()
> > > R version 2.13.1 (2011-07-08)
> > > Platform: x86_64-pc-linux-gnu (64-bit)
> > >
> > > locale:
> > >  [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
> > >  [3] LC_TIME=en_GB.utf8      
> >  LC_COLLATE=en_GB.utf8
> > >  [5] LC_MONETARY=C            
> > LC_MESSAGES=en_GB.utf8
> > >  [7] LC_PAPER=en_GB.utf8       LC_NAME=C
> > >  [9] LC_ADDRESS=C            
>  LC_TELEPHONE=C
> > > [11] LC_MEASUREMENT=en_GB.utf8
> LC_IDENTIFICATION=C
> > >
> > > attached base packages:
> > > [1] stats     graphics  grDevices utils    
> > datasets  methods   base
> > >
> > > other attached packages:
> > >  [1] annotate_1.30.0        GO.db_2.5.0  
>    
> >      goseq_1.4.0
> > >  [4] geneLenDataBase_0.99.7 BiasedUrn_1.04    
>  
> >   org.Bt.eg.db_2.5.0
> > >  [7] RSQLite_0.9-4          DBI_0.2-5    
>  
> >        AnnotationDbi_1.14.1
> > > [10] Biobase_2.12.2         edgeR_2.2.5    
>  
> >      limma_3.8.3
> > >
> > > loaded via a namespace (and not attached):
> > >  [1] biomaRt_2.8.1         Biostrings_2.20.3
>  
> >   BSgenome_1.20.0
> > >  [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8
>  
> > grid_2.13.1
> > >  [7] IRanges_1.10.6        lattice_0.19-33
>    
> >   Matrix_0.9996875-3
> > > [10] mgcv_1.7-6            nlme_3.1-102  
>    
> >    RCurl_1.6-9
> > > [13] rtracklayer_1.12.4    tools_2.13.1    
>    
> >  XML_3.4-2
> > > [16] xtable_1.5-6
> > >>
> > >
> > >
> > > _______________________________________________
> > > Bioconductor mailing list
> > > Bioconductor at r-project.org
> > > https://stat.ethz.ch/mailman/listinfo/bioconductor
> > > Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
> > >
> >
> 
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>