[BioC] What populates makeTranscriptDbFromBiomart?
Ravi Karra
ravi.karra at gmail.com
Sat Apr 14 22:40:04 CEST 2012
Hi,
Just starting to learn how to look at RNA Seq data, so apologies in advance. I ran my RNA-Seq experiment on a GAII and aligned to the zebrafish genome using Bowtie2/Tophat2. I downloaded the current zebrafish genome (Zv9) and transcript gtf file from Ensembl for the reference indices. I am trying to use edgeR to look at differential expression, but am a little hung up on getting the count data.
As you can see from the code below, I input 8835090 mapped reads, but only 5380643 are overlapped with known transcripts. It seems that I am losing reads in summarizing the count data and I can't really figure out why. Is the transcript information that results from makeTranscriptDbFromBiomart identical to the transcript information in the gtf files that can be downloaded via Ensembl?
Thanks again,
Ravi
zv9txs = makeTranscriptDbFromBiomart(biomart="ensembl",dataset="drerio_gene_ensembl")
tx_by_gene=transcriptsBy(zv9txs,'gene')
#read in the mapped bam hits file
reads=readBamGappedAlignments("accepted_hits.bam")
#check compatibility of names
txs = names(seqlengths(tx_by_gene))
r = as.character(unique(rname(reads)))
table (r %in% txs) #all reads mapped reads have the same chr names
TRUE
752
#make a genomic Ranges object and remove strand ambiguity
reads=GRanges(seqnames=rname(reads),ranges=IRanges(start=start(reads),end=end(reads)), strand=rep("*",length(reads)))
#summarize counts data
counts=countOverlaps(tx_by_gene,reads)
sum (counts)
[1] 5380643
length (reads)
[1] 8835090
> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] BiocInstaller_1.2.1 Rsamtools_1.6.3 Biostrings_2.22.0 biomaRt_2.10.0
[5] GenomicFeatures_1.6.9 AnnotationDbi_1.16.19 Biobase_2.14.0 GenomicRanges_1.6.7
[9] IRanges_1.12.6
loaded via a namespace (and not attached):
[1] bitops_1.0-4.1 BSgenome_1.22.0 DBI_0.2-5 grid_2.14.1 hwriter_1.3
[6] lattice_0.20-6 RCurl_1.91-1 RSQLite_0.11.1 rtracklayer_1.14.4 ShortRead_1.12.4
[11] tools_2.14.1 XM
More information about the Bioconductor
mailing list