[BioC] What populates makeTranscriptDbFromBiomart?

Ravi Karra ravi.karra at gmail.com
Sat Apr 14 22:40:04 CEST 2012


Hi, 

Just starting to learn how to look at RNA Seq data, so apologies in advance.  I ran my RNA-Seq experiment on a GAII and aligned to the zebrafish genome using Bowtie2/Tophat2.  I downloaded the current zebrafish genome (Zv9) and transcript gtf file from Ensembl for the reference indices.   I am trying to use edgeR to look at differential expression, but am a little hung up on getting the count data.  

As you can see from the code below, I input 8835090 mapped reads, but only 5380643 are overlapped with known transcripts.  It seems that I am losing reads in summarizing the count data and I can't really figure out why.   Is the transcript information that results from makeTranscriptDbFromBiomart identical to the transcript information in the gtf files that can be downloaded via Ensembl?  

Thanks again, 
Ravi

zv9txs = makeTranscriptDbFromBiomart(biomart="ensembl",dataset="drerio_gene_ensembl")
tx_by_gene=transcriptsBy(zv9txs,'gene')
 
#read in the mapped bam hits file
reads=readBamGappedAlignments("accepted_hits.bam")

#check compatibility of names
txs = names(seqlengths(tx_by_gene))
r = as.character(unique(rname(reads)))
table (r %in% txs) #all reads mapped reads have the same chr names

TRUE 
 752 

#make a genomic Ranges object and remove strand ambiguity
reads=GRanges(seqnames=rname(reads),ranges=IRanges(start=start(reads),end=end(reads)), strand=rep("*",length(reads)))
 
#summarize counts data
 counts=countOverlaps(tx_by_gene,reads) 
 sum (counts)
[1] 5380643

 length (reads)
[1] 8835090

> sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)

locale:
[1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] BiocInstaller_1.2.1   Rsamtools_1.6.3       Biostrings_2.22.0     biomaRt_2.10.0       
[5] GenomicFeatures_1.6.9 AnnotationDbi_1.16.19 Biobase_2.14.0        GenomicRanges_1.6.7  
[9] IRanges_1.12.6       

loaded via a namespace (and not attached):
 [1] bitops_1.0-4.1     BSgenome_1.22.0    DBI_0.2-5          grid_2.14.1        hwriter_1.3       
 [6] lattice_0.20-6     RCurl_1.91-1       RSQLite_0.11.1     rtracklayer_1.14.4 ShortRead_1.12.4  
[11] tools_2.14.1       XM

 


More information about the Bioconductor mailing list