[BioC] [devteam-bioc] readGAlignmentPairsFromBam and readGAlignmentFromBam differences
Hervé Pagès
hpages at fhcrc.org
Tue Feb 11 03:45:06 CET 2014
Hi Emmanouela,
Short answer: It's a strand issue (and you should probably use
ignore.strand=TRUE).
Long answer: see below.
On 02/10/2014 07:19 AM, Maintainer wrote:
>
> Hello,
>
> I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function, see the code below.
> I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam but it seems to have much more strict criteria for the pairing and therefore, the counts are much lower (mean number of counts: 185 versus 440 for a random sample).
>
> I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me.
>
> example for one of these reads:
> $ samtools view 209_accepted_hits.bam | grep "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 97 chr15 77414942 50 51M = 77415180 289 GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG &&&(((((**(**++++++++++++++++++++)+++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 145 chr15 77415180 50 51M = 77414942 -289 TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG *+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$ AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20A30 YT:Z:UU XS:A:- NH:i:1
If I put those 2 records alone in a BAM file, they show up with
readGAlignmentsFromBam():
> gal1 <- readGAlignmentsFromBam("test1.bam", use.names=TRUE)
> gal1
GAlignments with 2 alignments and 0 metadata columns:
seqnames strand
cigar qwidth start end
<Rle> <Rle>
<character> <integer> <integer> <integer>
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 +
51M 51 77414942 77414992
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 -
51M 51 77415180 77415230
width ngap
<integer> <integer>
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 51 0
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 51 0
---
seqlengths:
chr15
103494974
They also get paired by readGAlignmentPairsFromBam():
> galp1 <- readGAlignmentPairsFromBam("test1.bam", use.names=TRUE)
> galp1
GAlignmentPairs with 1 alignment pair and 0 metadata columns:
seqnames strand :
ranges -- ranges
<Rle> <Rle> :
<IRanges> -- <IRanges>
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 chr15 + :
[77414942, 77414992] -- [77415180, 77415230]
---
seqlengths:
chr15
103494974
>
> is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam. Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria.
>
> Also I get some reads in readGAlignmentPairsFromBam that I don’t get from readGAlignmentFromBam, which again I don’t understand why.
>
> Example for one of these reads:
> $samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 161 chr15 77426477 50 51M = 77426641 215 TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT %$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 81 chr15 77426641 50 51M = 77426477 -215 GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG ++++++++++++++++++++++++++*+++++++++++*****(((((&&& AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1
Here too, if I put those 2 records alone in a BAM file:
> gal2 <- readGAlignmentsFromBam("test2.bam", use.names=TRUE)
> gal2
GAlignments with 2 alignments and 0 metadata columns:
seqnames strand
cigar qwidth start end
<Rle> <Rle>
<character> <integer> <integer> <integer>
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 +
51M 51 77426477 77426527
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 -
51M 51 77426641 77426691
width ngap
<integer> <integer>
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 51 0
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 51 0
---
seqlengths:
chr15
103494974
Also:
> galp2 <- readGAlignmentPairsFromBam("test2.bam", use.names=TRUE)
> galp2
GAlignmentPairs with 1 alignment pair and 0 metadata columns:
seqnames strand :
ranges -- ranges
<Rle> <Rle> :
<IRanges> -- <IRanges>
HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 chr15 - :
[77426641, 77426691] -- [77426477, 77426527]
---
seqlengths:
chr15
103494974
So the "count" issue you describe is not a "pairing criteria" issue.
You need to understand how findOverlaps() works on a GAlignmentPairs
object. First with GAlignments object 'gal1':
> findOverlaps(range.exon["328561"], gal1)
Hits of length 1
queryLength: 1
subjectLength: 2
queryHits subjectHits
<integer> <integer>
1 1 2
Your gene (which is on the minus strand) overlaps with the 2nd element
in 'gal1'.
When you pass a GAlignments or GAlignmentPairs object to findOverlaps(),
it's first turned into a GRangesList object internally, with grglist().
The key here is that, grglist() on a GAlignmentPairs object inverts the
strand of the 2nd mate:
> grglist(galp1)
GRangesList of length 1:
$HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259
GRanges with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr15 [77414942, 77414992] +
[2] chr15 [77415180, 77415230] +
---
seqlengths:
chr15
103494974
See ?GAlignmentPairs for more info about this (especially the IMPORTANT
note in the Coercion section). The reason for this behavior is because
the strand of the 2nd mate was already inverted by the sequencing
technology (when the cDNA template was sequenced from the 2nd end).
This is why you don't see any overlap when you do:
> length(findOverlaps(range.exon["328561"], galp1))
[1] 0
(because your gene is on the minus strand).
Most of the time those strand considerations don't matter because the
RNA-seq protocol is not stranded. In that case you should use
ignore.strand=TRUE:
> findOverlaps(range.exon["328561"], gal1, ignore.strand=TRUE)
Hits of length 2
queryLength: 1
subjectLength: 2
queryHits subjectHits
<integer> <integer>
1 1 1
2 1 2
> findOverlaps(range.exon["328561"], galp1, ignore.strand=TRUE)
Hits of length 1
queryLength: 1
subjectLength: 1
queryHits subjectHits
<integer> <integer>
1 1 1
Similar story for gal2/galp2:
> length(findOverlaps(range.exon["328561"], gal2))
[1] 0
No hit with the 1st element in 'gal2' because even though gal2[1]
is in the range of your gene, it's not on the same strand.
No hit with the 2nd element in 'gal2' because even though gal2[2]
is on the same strand as your gene, it's not in its range.
However, we see a hit with the pair:
> length(findOverlaps(range.exon["328561"], galp2))
[1] 1
That's because when converting to GRangesList, everything ends up
on the minus strand:
> grglist(galp2)
GRangesList of length 1:
$HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830
GRanges with 2 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] chr15 [77426477, 77426527] -
[2] chr15 [77426641, 77426691] -
---
seqlengths:
chr15
103494974
But again, if you use ignore.strand=TRUE:
> findOverlaps(range.exon["328561"], gal2, ignore.strand=TRUE)
Hits of length 1
queryLength: 1
subjectLength: 2
queryHits subjectHits
<integer> <integer>
1 1 1
Hope this helps,
H.
>
> Can someone, please, explain why this is?
> Thank you in advance for your time.
> Emmanouela
>
>
> -- output of sessionInfo():
>
>
>
> library(GenomicRanges)
> library(Rsamtools)
> library(GenomicFeatures)
> library(rtracklayer)
>
> txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene")
> range.exon <- exonsBy(txdb.ensgene, "gene")
>
> my_bam <- "209_accepted_hits.bam.fltrd.bam"
> aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T)
> aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T)
>
> temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b
> temp3_aligns <- aligns[subjectHits(temp3)]
>
> temp4 <- findOverlaps(range.exon["328561"],aligns_s)
> temp4_aligns <- aligns_s[subjectHits(temp4)]
>
> setdiff(names(temp3_aligns) , names(temp4_aligns))[1]
> [1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
>
> setdiff(names(temp4_aligns),names(temp3_aligns))[1]
> [1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
>
>
>
> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C
> [3] LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1
> [5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] org.Mm.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7
> [4] annotate_1.40.0 rtracklayer_1.22.1 GenomicFeatures_1.14.2
> [7] AnnotationDbi_1.24.0 Biobase_2.22.0 Rsamtools_1.14.2
> [10] Biostrings_2.30.1 GenomicRanges_1.14.4 XVector_0.2.0
> [13] IRanges_1.20.6 BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1
> [5] stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 xtable_1.7-1
> [9] zlibbioc_1.8.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
>
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list