[BioC] readGAlignmentPairsFromBam and readGAlignmentFromBam differences
Emmanouela Repapi [guest]
guest at bioconductor.org
Mon Feb 10 16:19:19 CET 2014
Hello,
I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function, see the code below.
I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam but it seems to have much more strict criteria for the pairing and therefore, the counts are much lower (mean number of counts: 185 versus 440 for a random sample).
I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me.
example for one of these reads:
$ samtools view 209_accepted_hits.bam | grep "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 97 chr15 77414942 50 51M = 77415180 289 GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG &&&(((((**(**++++++++++++++++++++)+++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 145 chr15 77415180 50 51M = 77414942 -289 TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG *+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$ AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20A30 YT:Z:UU XS:A:- NH:i:1
is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam. Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria.
Also I get some reads in readGAlignmentPairsFromBam that I donât get from readGAlignmentFromBam, which again I donât understand why.
Example for one of these reads:
$samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 161 chr15 77426477 50 51M = 77426641 215 TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT %$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 81 chr15 77426641 50 51M = 77426477 -215 GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG ++++++++++++++++++++++++++*+++++++++++*****(((((&&& AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1
Can someone, please, explain why this is?
Thank you in advance for your time.
Emmanouela
-- output of sessionInfo():
library(GenomicRanges)
library(Rsamtools)
library(GenomicFeatures)
library(rtracklayer)
txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene")
range.exon <- exonsBy(txdb.ensgene, "gene")
my_bam <- "209_accepted_hits.bam.fltrd.bam"
aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T)
aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T)
temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b
temp3_aligns <- aligns[subjectHits(temp3)]
temp4 <- findOverlaps(range.exon["328561"],aligns_s)
temp4_aligns <- aligns_s[subjectHits(temp4)]
setdiff(names(temp3_aligns) , names(temp4_aligns))[1]
[1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
setdiff(names(temp4_aligns),names(temp3_aligns))[1]
[1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
sessionInfo()
R version 3.0.1 (2013-05-16)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C
[3] LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1
[5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1
[7] LC_PAPER=C LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C
attached base packages:
[1] parallel stats graphics grDevices utils datasets methods
[8] base
other attached packages:
[1] org.Mm.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7
[4] annotate_1.40.0 rtracklayer_1.22.1 GenomicFeatures_1.14.2
[7] AnnotationDbi_1.24.0 Biobase_2.22.0 Rsamtools_1.14.2
[10] Biostrings_2.30.1 GenomicRanges_1.14.4 XVector_0.2.0
[13] IRanges_1.20.6 BiocGenerics_0.8.0
loaded via a namespace (and not attached):
[1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1
[5] stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 xtable_1.7-1
[9] zlibbioc_1.8.0
--
Sent via the guest posting facility at bioconductor.org.
More information about the Bioconductor
mailing list