[BioC] GSEA and edgeR

Sat Sep 3 06:38:45 CEST 2011

Dear Iain,

The short answer is that the edgeR analysis you describe seems perfectly 
correct, but there is no way to use romer() as part of this analysis.

Of the gene set tests in limma, the only ones that can be used as part of 
a negative binomial based analysis of count data are wilcoxGST and 
geneSetTest.  The others, roast and romer, make multivariate normal 
assumptions that are incompatible with count distributions.

To make a gene set test using geneSetTest, you would do something like

   index <- lrtFilter$genes$ID %in% MyGeneSetSymbols
   p.value <- geneSetTest(index, lrtFiltered$table$LR.statistic, type="f")

The approach does treat the genes as statistically independent, so gives 
p-values that are bit optimistic, but otherwise it is a productive 
approach.

To use romer, you would have to make some normal approximations:

   effective.lib.size <- d$samples$lib.size * d$samples$norm.factors
   y <- log2( t(t(d$counts+0.5) / (effective.lib.size+0.5)) )

Then you can use

   test <- romer(geneIndex, y, design, contrast=7)

Actually, you don't even need to set the contrast argument, as the last 
coefficient is the default.  The contrast argument of romer is like a 
combination of the coef and contrast arguments of glmLRT.  If it is of 
length one, it is taken to be a coef.  If it's a vector, it's taken to be 
a contrast.

Best wishes
Gordon

--------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth

On Thu, 1 Sep 2011, Iain Gallagher wrote:

Dear List / Gordon

I would like to carry out a GSEA analysis on some RNA-Seq data. I have 
analysed the data using edgeR and have a list of regulated genes. The data 
is paired - 6 biological samples, cells from each are infected or 
uninfected.

Looking around there is little advice on the use of GSEA in RNA-Seq data. 
Using edgeR (and having a relatively small sample size) I was hoping to 
make use of the romer algorithm which is implemented in limma. However I 
am having a conceptual problem setting up my data for this.

As the study uses paired sample data I have followed the guidance for this 
type of analysis in the marvellous edgeR manual. Searching for advice on 
how to conduct GSEA I came across this post 
(http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg02020.html) 
by Gordon Smyth (author of edgeR / limma) wherein he describes a suitable 
strategy for edgeR and the rotational GSEA algorithms.

These algorithms require the specification of a contrast (in my case 
between infected and uninfected animals).

e.g. test <- romer(geneIndex, countData, design=design, contrast=contr[,1], nrot=9999)

My design matrix looks like this:

library(limma)

samples <- rep(c('s1', 's2', 's3', 's4', 's5', 's6'),2)
infection <- c(rep(1,6), rep(2,6))

s <- as.factor(samples)
i <- as.factor(infection)

design <- model.matrix(~s+i)
colnames(design)[7] <- 'infection'

> design
    (Intercept) ss2 ss3 ss4 ss5 ss6 infection
1            1   0   0   0   0   0         0
2            1   1   0   0   0   0         0
3            1   0   1   0   0   0         0
4            1   0   0   1   0   0         0
5            1   0   0   0   1   0         0
6            1   0   0   0   0   1         0
7            1   0   0   0   0   0         1
8            1   1   0   0   0   0         1
9            1   0   1   0   0   0         1
10           1   0   0   1   0   0         1
11           1   0   0   0   1   0         1
12           1   0   0   0   0   1         1
attr(,"assign")
[1] 0 1 1 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$s
[1] "contr.treatment"

attr(,"contrasts")$i
[1] "contr.treatment"

and in edgeR I carry out the following fit of a GLM to get the genes regulated between infected and uninfected animals i.e. fit a GLM for infection and no infection then edgeR carries out likeliehood tests to find the genes where the two models differ (I think!).

#dispersion estimate
d <- estimateGLMCommonDisp(d, design)

#fit the NB GLM for each gene
fitFiltered <- glmFit(d, design, dispersion = d$common.dispersion)

#carry out the likliehood ratio test
lrtFiltered <- glmLRT(d, fitFiltered, coef = 7)

#how many genes are DE?
summary(decideTestsDGE(lrtFiltered))#DE genes

So (finally) my question is how do I set up a contrast (whilst keeping the pairing) for romer?

Thanks

Best

Iain

> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_GB.utf8       LC_NUMERIC=C
  [3] LC_TIME=en_GB.utf8        LC_COLLATE=en_GB.utf8
  [5] LC_MONETARY=C             LC_MESSAGES=en_GB.utf8
  [7] LC_PAPER=en_GB.utf8       LC_NAME=C
  [9] LC_ADDRESS=C              LC_TELEPHONE=C 
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
  [1] annotate_1.30.0        GO.db_2.5.0            goseq_1.4.0
  [4] geneLenDataBase_0.99.7 BiasedUrn_1.04         org.Bt.eg.db_2.5.0
  [7] RSQLite_0.9-4          DBI_0.2-5              AnnotationDbi_1.14.1 
[10] Biobase_2.12.2         edgeR_2.2.5            limma_3.8.3

loaded via a namespace (and not attached):
  [1] biomaRt_2.8.1         Biostrings_2.20.3     BSgenome_1.20.0
  [4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8   grid_2.13.1
  [7] IRanges_1.10.6        lattice_0.19-33       Matrix_0.9996875-3 
[10] mgcv_1.7-6            nlme_3.1-102          RCurl_1.6-9 
[13] rtracklayer_1.12.4    tools_2.13.1          XML_3.4-2 
[16] xtable_1.5-6 
>
>

______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}