[BioC] using easyRNASeq examples

Nicolas Delhomme delhomme at embl.de
Sun Feb 12 15:24:43 CET 2012


Hi Francesco,

I've fixed a bug in easyRNASeq and made a couple of modifications to deal with novelties introduced in the IRanges and in the DESeq packages. I've just committed that whole bunch, so it will take ~2 days before a binary package is available. You can always install from SVN in the meanwhile.

I've listed below the different examples you sent me in our email exchange last week and I've added some comments which I hope will prove useful to you. I've created a new "thread" to make it easier to read, and I'm confident it will prove useful to others. This leads me to thanking you for your patience, for providing the data and for giving me the opportunity to post such an example :-)

## load the library
library(easyRNASeq)

## load the chromosome sizes
## Note that you can just provide a named list; using the BSgenome is not necessary (especially as the Hsapiens ones are huge >800Mb)
library(BSgenome.Hsapiens.UCSC.hg19)
chr.sizes=as.list(seqlengths(Hsapiens))

## set the wd directory to wherever you have your bam files
## that's just for this example, you could give that directory
## directly to the filesDirectory argument of the function
setwd("Desktop/Francesco")

## get the bam filenames
bamfiles=dir(getwd(),pattern="*\\.bam$")

## run easyRNASeq to get an RNAseq object.
## we use biomaRt to fetch the annotation directly from ensembl
## 
rnaSeq <- easyRNASeq(filesDirectory=getwd(),
                     organism="Hsapiens",
                     chr.sizes=chr.sizes,
                     readLength=100L,
                     annotationMethod="biomaRt",
                     format="bam",
                     count="exons",
                     filenames=bamfiles[1],
                     outputFormat="RNAseq"
                     )

## Fetching the annotation that way is time consuming
## Using the RNAseq object you can extract the genic information
## and save them as an rda for a faster processing time
## later on
gAnnot <- genomicAnnotation(rnaSeq)

## There are 244 "chromosomes" in that annotation, let's keep only what we need
gAnnot <- gAnnot[space(gAnnot) %in% paste("chr",c(1:22,"X","Y","M"),sep=""),]
save(gAnnot,file="gAnnot.rda")

## you could use it this way
countTable <- easyRNASeq(filesDirectory=getwd(),
                         organism="Hsapiens",
                         chr.sizes=chr.sizes,
                         readLength=100L,
                         annotationMethod="rda",
                         annotationFile="gAnnot.rda",
                         format="bam",
                         count="exons",
                         filenames=bamfiles[1]
                         )

## using different annotation (makeTranscriptDB)
library(GenomicFeatures)
hg19.tx <- makeTranscriptDbFromUCSC(
                                    genome="hg19",
                                    tablename="refGene")

## easyRNASeq can deal with GRangesList object, so no need to modify it much, i.e. no need to convert it to a RangedData
gAnnot <- exons(hg19.tx)
## change the metadata column name to suit easyRNASeq
colnames(elementMetadata(gAnnot)) <- "exon"
## finally turn it into a GrangesList
gAnnot <- split(gAnnot,seqnames(gAnnot))

## you could save it as an rda object, you can as well use it directly
## by using the annotationMethod "env" and the annotationObject arguments.
## In addition we will be selecting for a single chromosome: chr1
## as the corresponding name in your bam file is 1, that's what we'll use
countTable <- easyRNASeq(filesDirectory=getwd(),
                         organism="Hsapiens",
                         chr.sizes=chr.sizes,
                         readLength=100L,
                         annotationMethod="env",
                         annotationObject=gAnnot,
                         format="bam",
                         count="exons",
                         filenames=bamfiles[1],
                         chr.sel="1"
                         )

## applying DESeq
## this will not yield very sensitive results as we have no replicates (biological)
## These are important for DESeq to accurately model the technical and biological variance
## With no replicates for every condition, the dispersion will be based on a "pooled" estimate
## making the differential expression call lose sensitivity. In addition, DESeq is in such case
## using a conservative approach (which is good) so you'd get even less significant results.

## Defining the conditions
conditions <- c("A","B")
names(conditions) <- bamfiles

## running DESeq
## NOTE that the two last arguments are for the DESeq estimateDispersions method. They are just transferred through.
## There are necessary as we have only 1 replicate per condition; see the DESeq vignette for more details.
countDataSet <- easyRNASeq(filesDirectory=getwd(),
                           organism="Hsapiens",
                           chr.sizes=chr.sizes,
                           readLength=100L,
                           annotationMethod="env",
                           annotationObject=gAnnot,
                           format="bam",
                           count="exons",
                           filenames=bamfiles,
                           chr.sel="1",
                           outputFormat="DESeq",
                           normalize=TRUE,
                           conditions=conditions,
                           fitType="local",
                           method="blind"
                         )

## finally my session info
> sessionInfo()
R Under development (unstable) (2012-02-07 r58290)
Platform: x86_64-apple-darwin10.8.0 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
 [1] GenomicFeatures_1.7.16 AnnotationDbi_1.17.15  easyRNASeq_1.1.6      
 [4] ShortRead_1.13.12      latticeExtra_0.6-19    RColorBrewer_1.0-5    
 [7] Rsamtools_1.7.29       DESeq_1.7.6            locfit_1.5-6          
[10] lattice_0.20-0         akima_0.5-7            Biobase_2.15.3        
[13] BSgenome_1.23.2        GenomicRanges_1.7.24   Biostrings_2.23.6     
[16] IRanges_1.13.24        genomeIntervals_1.11.0 intervals_0.13.3      
[19] edgeR_2.5.9            limma_3.11.11          biomaRt_2.11.1        
[22] BiocGenerics_0.1.4    

loaded via a namespace (and not attached):
 [1] annotate_1.33.2    bitops_1.0-4.1     DBI_0.2-5          genefilter_1.37.0 
 [5] geneplotter_1.33.1 grid_2.15.0        hwriter_1.3        RCurl_1.91-1      
 [9] RSQLite_0.11.1     rtracklayer_1.15.7 splines_2.15.0     survival_2.36-12  
[13] tools_2.15.0       XML_3.9-4          xtable_1.6-0       zlibbioc_1.1.1

I hope this helps, do not hesitate to contact me if you have further questions and to suggest improvements.

Cheers,

Nico

---------------------------------------------------------------
Nicolas Delhomme

Genome Biology Computational Support

European Molecular Biology Laboratory

Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany



More information about the Bioconductor mailing list