[BioC] GSEA and edgeR
Gordon K Smyth
smyth at wehi.EDU.AU
Sat Sep 3 06:38:45 CEST 2011
Dear Iain,
The short answer is that the edgeR analysis you describe seems perfectly
correct, but there is no way to use romer() as part of this analysis.
Of the gene set tests in limma, the only ones that can be used as part of
a negative binomial based analysis of count data are wilcoxGST and
geneSetTest. The others, roast and romer, make multivariate normal
assumptions that are incompatible with count distributions.
To make a gene set test using geneSetTest, you would do something like
index <- lrtFilter$genes$ID %in% MyGeneSetSymbols
p.value <- geneSetTest(index, lrtFiltered$table$LR.statistic, type="f")
The approach does treat the genes as statistically independent, so gives
p-values that are bit optimistic, but otherwise it is a productive
approach.
To use romer, you would have to make some normal approximations:
effective.lib.size <- d$samples$lib.size * d$samples$norm.factors
y <- log2( t(t(d$counts+0.5) / (effective.lib.size+0.5)) )
Then you can use
test <- romer(geneIndex, y, design, contrast=7)
Actually, you don't even need to set the contrast argument, as the last
coefficient is the default. The contrast argument of romer is like a
combination of the coef and contrast arguments of glmLRT. If it is of
length one, it is taken to be a coef. If it's a vector, it's taken to be
a contrast.
Best wishes
Gordon
--------------------------------------------
Professor Gordon K Smyth,
Bioinformatics Division,
Walter and Eliza Hall Institute of Medical Research,
1G Royal Parade, Parkville, Vic 3052, Australia.
smyth at wehi.edu.au
http://www.wehi.edu.au
http://www.statsci.org/smyth
On Thu, 1 Sep 2011, Iain Gallagher wrote:
Dear List / Gordon
I would like to carry out a GSEA analysis on some RNA-Seq data. I have
analysed the data using edgeR and have a list of regulated genes. The data
is paired - 6 biological samples, cells from each are infected or
uninfected.
Looking around there is little advice on the use of GSEA in RNA-Seq data.
Using edgeR (and having a relatively small sample size) I was hoping to
make use of the romer algorithm which is implemented in limma. However I
am having a conceptual problem setting up my data for this.
As the study uses paired sample data I have followed the guidance for this
type of analysis in the marvellous edgeR manual. Searching for advice on
how to conduct GSEA I came across this post
(http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg02020.html)
by Gordon Smyth (author of edgeR / limma) wherein he describes a suitable
strategy for edgeR and the rotational GSEA algorithms.
These algorithms require the specification of a contrast (in my case
between infected and uninfected animals).
e.g. test <- romer(geneIndex, countData, design=design, contrast=contr[,1], nrot=9999)
My design matrix looks like this:
library(limma)
samples <- rep(c('s1', 's2', 's3', 's4', 's5', 's6'),2)
infection <- c(rep(1,6), rep(2,6))
s <- as.factor(samples)
i <- as.factor(infection)
design <- model.matrix(~s+i)
colnames(design)[7] <- 'infection'
> design
(Intercept) ss2 ss3 ss4 ss5 ss6 infection
1 1 0 0 0 0 0 0
2 1 1 0 0 0 0 0
3 1 0 1 0 0 0 0
4 1 0 0 1 0 0 0
5 1 0 0 0 1 0 0
6 1 0 0 0 0 1 0
7 1 0 0 0 0 0 1
8 1 1 0 0 0 0 1
9 1 0 1 0 0 0 1
10 1 0 0 1 0 0 1
11 1 0 0 0 1 0 1
12 1 0 0 0 0 1 1
attr(,"assign")
[1] 0 1 1 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$s
[1] "contr.treatment"
attr(,"contrasts")$i
[1] "contr.treatment"
and in edgeR I carry out the following fit of a GLM to get the genes regulated between infected and uninfected animals i.e. fit a GLM for infection and no infection then edgeR carries out likeliehood tests to find the genes where the two models differ (I think!).
#dispersion estimate
d <- estimateGLMCommonDisp(d, design)
#fit the NB GLM for each gene
fitFiltered <- glmFit(d, design, dispersion = d$common.dispersion)
#carry out the likliehood ratio test
lrtFiltered <- glmLRT(d, fitFiltered, coef = 7)
#how many genes are DE?
summary(decideTestsDGE(lrtFiltered))#DE genes
So (finally) my question is how do I set up a contrast (whilst keeping the pairing) for romer?
Thanks
Best
Iain
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C
[3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.utf8
[7] LC_PAPER=en_GB.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] annotate_1.30.0 GO.db_2.5.0 goseq_1.4.0
[4] geneLenDataBase_0.99.7 BiasedUrn_1.04 org.Bt.eg.db_2.5.0
[7] RSQLite_0.9-4 DBI_0.2-5 AnnotationDbi_1.14.1
[10] Biobase_2.12.2 edgeR_2.2.5 limma_3.8.3
loaded via a namespace (and not attached):
[1] biomaRt_2.8.1 Biostrings_2.20.3 BSgenome_1.20.0
[4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 grid_2.13.1
[7] IRanges_1.10.6 lattice_0.19-33 Matrix_0.9996875-3
[10] mgcv_1.7-6 nlme_3.1-102 RCurl_1.6-9
[13] rtracklayer_1.12.4 tools_2.13.1 XML_3.4-2
[16] xtable_1.5-6
>
>
______________________________________________________________________
The information in this email is confidential and intend...{{dropped:4}}
More information about the Bioconductor
mailing list