[BioC] GSEA and edgeR
Iain Gallagher
iaingallagher at btopenworld.com
Thu Sep 1 15:13:41 CEST 2011
Dear List / Gordon
I would like to carry out a GSEA analysis on some RNA-Seq data. I have analysed the data using edgeR and have a list of regulated genes. The data is paired - 6 biological samples, cells from each are infected or uninfected.
Looking around there is little advice on the use of GSEA in RNA-Seq data. Using edgeR (and having a relatively small sample size) I was hoping to make use of the romer algorithm which is implemented in limma. However I am having a conceptual problem setting up my data for this.
As the study uses paired sample data I have followed the guidance for this type of analysis in the marvellous edgeR manual. Searching for advice on how to conduct GSEA I came across this post (http://www.mail-archive.com/bioc-sig-sequencing@r-project.org/msg02020.html) by Gordon Smyth (author of edgeR / limma) wherein he describes a suitable strategy for edgeR and the rotational GSEA algorithms.
These algorithms require the specification of a contrast (in my case between infected and uninfected animals).
e.g. test <- romer(geneIndex, countData, design=design, contrast=contr[,1], nrot=9999)
My design matrix looks like this:
library(limma)
samples <- rep(c('s1', 's2', 's3', 's4', 's5', 's6'),2)
infection <- c(rep(1,6), rep(2,6))
s <- as.factor(samples)
i <- as.factor(infection)
design <- model.matrix(~s+i)
colnames(design)[7] <- 'infection'
> design
(Intercept) ss2 ss3 ss4 ss5 ss6 infection
1 1 0 0 0 0 0 0
2 1 1 0 0 0 0 0
3 1 0 1 0 0 0 0
4 1 0 0 1 0 0 0
5 1 0 0 0 1 0 0
6 1 0 0 0 0 1 0
7 1 0 0 0 0 0 1
8 1 1 0 0 0 0 1
9 1 0 1 0 0 0 1
10 1 0 0 1 0 0 1
11 1 0 0 0 1 0 1
12 1 0 0 0 0 1 1
attr(,"assign")
[1] 0 1 1 1 1 1 2
attr(,"contrasts")
attr(,"contrasts")$s
[1] "contr.treatment"
attr(,"contrasts")$i
[1] "contr.treatment"
and in edgeR I carry out the following fit of a GLM to get the genes regulated between infected and uninfected animals i.e. fit a GLM for infection and no infection then edgeR carries out likeliehood tests to find the genes where the two models differ (I think!).
#dispersion estimate
d <- estimateGLMCommonDisp(d, design)
#fit the NB GLM for each gene
fitFiltered <- glmFit(d, design, dispersion = d$common.dispersion)
#carry out the likliehood ratio test
lrtFiltered <- glmLRT(d, fitFiltered, coef = 7)
#how many genes are DE?
summary(decideTestsDGE(lrtFiltered))#DE genes
So (finally) my question is how do I set up a contrast (whilst keeping the pairing) for romer?
Thanks
Best
Iain
> sessionInfo()
R version 2.13.1 (2011-07-08)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.utf8 LC_NUMERIC=C
[3] LC_TIME=en_GB.utf8 LC_COLLATE=en_GB.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_GB.utf8
[7] LC_PAPER=en_GB.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] annotate_1.30.0 GO.db_2.5.0 goseq_1.4.0
[4] geneLenDataBase_0.99.7 BiasedUrn_1.04 org.Bt.eg.db_2.5.0
[7] RSQLite_0.9-4 DBI_0.2-5 AnnotationDbi_1.14.1
[10] Biobase_2.12.2 edgeR_2.2.5 limma_3.8.3
loaded via a namespace (and not attached):
[1] biomaRt_2.8.1 Biostrings_2.20.3 BSgenome_1.20.0
[4] GenomicFeatures_1.4.4 GenomicRanges_1.4.8 grid_2.13.1
[7] IRanges_1.10.6 lattice_0.19-33 Matrix_0.9996875-3
[10] mgcv_1.7-6 nlme_3.1-102 RCurl_1.6-9
[13] rtracklayer_1.12.4 tools_2.13.1 XML_3.4-2
[16] xtable_1.5-6
>
More information about the Bioconductor
mailing list