[BioC] How to combine different annotation (chromosomic, genic) when using easyRNASeq (was easyRNASeq error)
Davis, Wade
davisjwa at health.missouri.edu
Mon Mar 12 15:41:21 CET 2012
Hi Nico,
Thanks for adding these convenient features to easyRNASeq. Everything works fine now.
I'm sure other users will benefit from the detailed example below.
Thanks again,
Wade
-----Original Message-----
From: Nicolas Delhomme [mailto:delhomme at embl.de]
Sent: Friday, March 09, 2012 7:28 AM
To: Bioconductor mailing list
Cc: Davis, Wade
Subject: How to combine different annotation (chromosomic, genic) when using easyRNASeq (was easyRNASeq error)
Hi Wade,
Here is a description on how to deal with the problem you reported in the thread: "easyRNASeq error". I just paste the R code below, check the comments for the explanation of the different steps. I'll add a section in the vignette about this.
Cheers,
Nico
## libs
library(easyRNASeq) ## >= 1.1.10
## set wd
setwd("Your working dir")
## read in your data
## this is inefficient in comparison to a bam file, a you load as well the whole sequences in memory
aln <- readAligned("data",type="SolexaExport",pattern="*.txt.gz")
## free some mem!
gc()
## the names are different from what we expect (UCSC standards), as they have an additional .fa extension
levels(chromosome(aln)) ## faster than a table on the same content: table(chromosome(aln))
## they are different from what biomaRt will return as well, as these are 1:19, X, Y and MT plus others
## therefore it will be a problem from easyRNASeq as you have seen when you want to use biomaRt as an annotation source
## for this easyRNASeq requires the name of the organism, which circumvent the "custom" chromosome name map by-pass
## The solution is to fetch the annotation first
obj <- fetchAnnotation(new('RNAseq',
organismName="Mmusculus"
),
method="biomaRt")
## the annotation are in the genomicAnnotation slot
gAnnot <- genomicAnnotation(obj)
## shall we subset to the chromosome we're interested in
## there are only 1181 NT contigs, so let's keep them
length(grep("NT_",space(gAnnot)))
## rename the chromosomes in the annotation
## that will ease our chr.map table generation
## note that it would not be necessary
names(gAnnot) <- paste("chr",names(gAnnot),".fa",sep="")
## run easyRNASeq using the local annotation
## since we are within a script, I'll directly use the gAnnot object.
## However as you will see this will raise double counting warnings.
## double counting reads is not what ones want, that's why the better way
## would be to rework the obtained annotation to remove/clarify these cases
## and save the obtained annotation as an rda file on your file system
## save(gAnnot,file="myAnnotation.rda")
## easyRNASeq can use such an rda file as annotation as well
## first I need some space
rm(aln,obj)
gc()
## get the chromosome sizes
## Again, note that the use of the BSgenome
## is not mandatory. It's just easy as they are
## available in Bioconductor. Typing in your own
## chromosome size list is as valid.
library(BSgenome.Mmusculus.UCSC.mm9)
## note that this might change to a vector soon, which is more intuitive
## it's historicaly due to the RangedData API
chr.sizes<- as.list(seqlengths(Mmusculus))
## create the chr.map
chr.map <- data.frame(from=paste("chr",c(1:19,"X","Y"),".fa",sep=""),to=paste("chr",c(1:19,"X","Y"),sep=""))
## get the geneModels count within an RNAseq object
## the advantage of using an RNAseq object is that
## you can access the geneModels annotation
## and tailor it to your needs as explained above
## Here we have several solutions
## using the entire set of annotation we got
## We need to define a set of Filter to ensure
## that the data is read properly
## The export file contains all the reads, so the one that do not pass the chastity filter have to be removed.
## In addition, some of the other reads are for internal QC, and they have no position. For that reason, we need
## to filter those out too.
## Reading in aln data is more resource exhausting that reading in bam files, as we are
## reading in the sequence and quality information as well, whereas we do not need them.
## As you will see, this generates a lot of warnings,
## because of the differing annotations
rnaSeq <- easyRNASeq(filesDirectory="data",
organism="custom",
chr.map=chr.map,
chr.sizes=chr.sizes,
filter=compose(
naPositionFilter(),
chastityFilter()),
readLength=50L,
annotationMethod="env",
annotationObject=gAnnot,
format="aln",
count="genes",
summarization= "geneModels",
filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
outputFormat="RNAseq",
nbCore=2
)
## one way to reduce the warnings is to select the chromosome we are interested in
## that's waht we do by adding the chr.sel selector
rnaSeq <- easyRNASeq(filesDirectory="data",
organism="custom",
chr.map=chr.map,
chr.sizes=chr.sizes,
chr.sel=chr.map$from,
filter=compose(
naPositionFilter(),
chastityFilter()),
readLength=50L,
annotationMethod="env",
annotationObject=gAnnot,
format="aln",
count="genes",
summarization= "geneModels",
filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
outputFormat="RNAseq",
nbCore=2
)
## To further reduce the warnings, we can manipulate the RangedData object
## to remove the unnecessary annotation. It's quite straightforward.
sel <- grep("NT_",names(gAnnot))
gAnnot <- RangedData(ranges=ranges(gAnnot)[-sel,],values=values(gAnnot)[-sel,])
colnames(gAnnot) <- gsub("values\\.","",colnames(gAnnot))
## This will only generate two warnings, one that could be easily dealt with (the chrMT)
## The other one is about what was discussed previously. There's a need to adapt the annotation
## to avoid potential double read counting.
rnaSeq <- easyRNASeq(filesDirectory="data",
organism="custom",
chr.map=chr.map,
chr.sizes=chr.sizes,
chr.sel=chr.map$from,
filter=compose(
naPositionFilter(),
chastityFilter()),
readLength=50L,
annotationMethod="env",
annotationObject=gAnnot,
format="aln",
count="genes",
summarization= "geneModels",
filenames="1-Feb_ATCACG_L003_R1_001_export.txt.gz",
outputFormat="RNAseq",
nbCore=2
)
---------------------------------------------------------------
Nicolas Delhomme
Genome Biology Computational Support
European Molecular Biology Laboratory
Tel: +49 6221 387 8310
Email: nicolas.delhomme at embl.de
Meyerhofstrasse 1 - Postfach 10.2209
69102 Heidelberg, Germany
More information about the Bioconductor
mailing list