[BioC] [devteam-bioc] readGAlignmentPairsFromBam and readGAlignmentFromBam differences

Tue Feb 11 03:45:06 CET 2014

Hi Emmanouela,

Short answer: It's a strand issue (and you should probably use
ignore.strand=TRUE).

Long answer: see below.

On 02/10/2014 07:19 AM, Maintainer wrote:
>
> Hello,
>
> I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function, see the code below.
> I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam but it seems to have much more strict criteria for the pairing and therefore, the counts are much lower (mean number of counts: 185 versus 440 for a random sample).
>
> I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me.
>
> example for one of these reads:
> $ samtools view 209_accepted_hits.bam  | grep  "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259	97	chr15	77414942	50	51M	=	77415180	289	GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG	&&&(((((**(**++++++++++++++++++++)+++++++++++++++++	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	XS:A:-	NH:i:1
> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259	145	chr15	77415180	50	51M	=	77414942	-289	TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG	*+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$	AS:i:-3	XN:i:0	XM:i:1	XO:i:0	XG:i:0	NM:i:1	MD:Z:20A30	YT:Z:UU	XS:A:-	NH:i:1

If I put those 2 records alone in a BAM file, they show up with
readGAlignmentsFromBam():

   > gal1 <- readGAlignmentsFromBam("test1.bam", use.names=TRUE)
   > gal1
   GAlignments with 2 alignments and 0 metadata columns:
                                                  seqnames strand 
cigar    qwidth     start       end
                                                     <Rle>  <Rle> 
<character> <integer> <integer> <integer>
     HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259    chr15      + 
   51M        51  77414942  77414992
     HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259    chr15      - 
   51M        51  77415180  77415230
                                                      width      ngap
                                                  <integer> <integer>
     HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259        51         0
     HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259        51         0
     ---
     seqlengths:
          chr15
      103494974

They also get paired by readGAlignmentPairsFromBam():

   > galp1 <- readGAlignmentPairsFromBam("test1.bam", use.names=TRUE)
   > galp1
   GAlignmentPairs with 1 alignment pair and 0 metadata columns:
                                                seqnames strand : 
         ranges --               ranges
                                                   <Rle>  <Rle> : 
      <IRanges> --            <IRanges>
HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259    chr15      + : 
[77414942, 77414992] -- [77415180, 77415230]
   ---
   seqlengths:
        chr15
    103494974

>
> is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam. Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria.
>
> Also I get some reads in readGAlignmentPairsFromBam that I donâ€™t get from readGAlignmentFromBam, which again I donâ€™t understand why.
>
> Example for one of these reads:
> $samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830	161	chr15	77426477	50	51M	=	77426641	215	TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT	%$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	XS:A:-	NH:i:1
> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830	81	chr15	77426641	50	51M	=	77426477	-215	GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG	++++++++++++++++++++++++++*+++++++++++*****(((((&&&	AS:i:0	XN:i:0	XM:i:0	XO:i:0	XG:i:0	NM:i:0	MD:Z:51	YT:Z:UU	NH:i:1

Here too, if I put those 2 records alone in a BAM file:

   > gal2 <- readGAlignmentsFromBam("test2.bam", use.names=TRUE)
   > gal2
   GAlignments with 2 alignments and 0 metadata columns:
                                                   seqnames strand 
  cigar    qwidth     start       end
                                                      <Rle>  <Rle> 
<character> <integer> <integer> <integer>
     HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830    chr15      + 
    51M        51  77426477  77426527
     HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830    chr15      - 
    51M        51  77426641  77426691
                                                       width      ngap
                                                   <integer> <integer>
     HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830        51         0
     HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830        51         0
     ---
     seqlengths:
          chr15
      103494974

Also:

   > galp2 <- readGAlignmentPairsFromBam("test2.bam", use.names=TRUE)
   > galp2
   GAlignmentPairs with 1 alignment pair and 0 metadata columns:
                                                 seqnames strand : 
          ranges --               ranges
                                                    <Rle>  <Rle> : 
       <IRanges> --            <IRanges>
   HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830    chr15      - : 
[77426641, 77426691] -- [77426477, 77426527]
   ---
   seqlengths:
        chr15
    103494974

So the "count" issue you describe is not a "pairing criteria" issue.
You need to understand how findOverlaps() works on a GAlignmentPairs
object. First with GAlignments object 'gal1':

   > findOverlaps(range.exon["328561"], gal1)
   Hits of length 1
   queryLength: 1
   subjectLength: 2
     queryHits subjectHits
      <integer>   <integer>
    1         1           2

Your gene (which is on the minus strand) overlaps with the 2nd element
in 'gal1'.

When you pass a GAlignments or GAlignmentPairs object to findOverlaps(),
it's first turned into a GRangesList object internally, with grglist().
The key here is that, grglist() on a GAlignmentPairs object inverts the
strand of the 2nd mate:

   > grglist(galp1)
   GRangesList of length 1:
   $HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259
   GRanges with 2 ranges and 0 metadata columns:
         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
     [1]    chr15 [77414942, 77414992]      +
     [2]    chr15 [77415180, 77415230]      +

   ---
   seqlengths:
        chr15
    103494974

See ?GAlignmentPairs for more info about this (especially the IMPORTANT
note in the Coercion section). The reason for this behavior is because
the strand of the 2nd mate was already inverted by the sequencing
technology (when the cDNA template was sequenced from the 2nd end).

This is why you don't see any overlap when you do:

   > length(findOverlaps(range.exon["328561"], galp1))
   [1] 0

(because your gene is on the minus strand).

Most of the time those strand considerations don't matter because the
RNA-seq protocol is not stranded. In that case you should use
ignore.strand=TRUE:

   > findOverlaps(range.exon["328561"], gal1, ignore.strand=TRUE)
   Hits of length 2
   queryLength: 1
   subjectLength: 2
     queryHits subjectHits
      <integer>   <integer>
    1         1           1
    2         1           2

   > findOverlaps(range.exon["328561"], galp1, ignore.strand=TRUE)
   Hits of length 1
   queryLength: 1
   subjectLength: 1
     queryHits subjectHits
      <integer>   <integer>
    1         1           1

Similar story for gal2/galp2:

   > length(findOverlaps(range.exon["328561"], gal2))
   [1] 0

No hit with the 1st element in 'gal2' because even though gal2[1]
is in the range of your gene, it's not on the same strand.
No hit with the 2nd element in 'gal2' because even though gal2[2]
is on the same strand as your gene, it's not in its range.
However, we see a hit with the pair:

   > length(findOverlaps(range.exon["328561"], galp2))
   [1] 1

That's because when converting to GRangesList, everything ends up
on the minus strand:

   > grglist(galp2)
   GRangesList of length 1:
   $HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830
   GRanges with 2 ranges and 0 metadata columns:
         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
     [1]    chr15 [77426477, 77426527]      -
     [2]    chr15 [77426641, 77426691]      -

   ---
   seqlengths:
        chr15
    103494974

But again, if you use ignore.strand=TRUE:

   > findOverlaps(range.exon["328561"], gal2, ignore.strand=TRUE)
   Hits of length 1
   queryLength: 1
   subjectLength: 2
     queryHits subjectHits
      <integer>   <integer>
    1         1           1

Hope this helps,
H.

>
> Can someone, please, explain why this is?
> Thank you in advance for your time.
> Emmanouela
>
>
>   -- output of sessionInfo():
>
>
>
> library(GenomicRanges)
> library(Rsamtools)
> library(GenomicFeatures)
> library(rtracklayer)
>
> txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene")
> range.exon <- exonsBy(txdb.ensgene, "gene")
>
> my_bam <- "209_accepted_hits.bam.fltrd.bam"
> aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T)
> aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T)
>
> temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b
> temp3_aligns <- aligns[subjectHits(temp3)]
>
> temp4 <- findOverlaps(range.exon["328561"],aligns_s)
> temp4_aligns <- aligns_s[subjectHits(temp4)]
>
> setdiff(names(temp3_aligns) , names(temp4_aligns))[1]
> [1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
>
> setdiff(names(temp4_aligns),names(temp3_aligns))[1]
> [1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
>
>
>
> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_GB.ISO-8859-1       LC_NUMERIC=C
>   [3] LC_TIME=en_GB.ISO-8859-1        LC_COLLATE=en_GB.ISO-8859-1
>   [5] LC_MONETARY=en_GB.ISO-8859-1    LC_MESSAGES=en_GB.ISO-8859-1
>   [7] LC_PAPER=C                      LC_NAME=C
>   [9] LC_ADDRESS=C                    LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
>   [1] org.Mm.eg.db_2.10.1    RSQLite_0.11.4         DBI_0.2-7
>   [4] annotate_1.40.0        rtracklayer_1.22.1     GenomicFeatures_1.14.2
>   [7] AnnotationDbi_1.24.0   Biobase_2.22.0         Rsamtools_1.14.2
> [10] Biostrings_2.30.1      GenomicRanges_1.14.4   XVector_0.2.0
> [13] IRanges_1.20.6         BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.18.0  bitops_1.0-6    BSgenome_1.30.0 RCurl_1.95-4.1
> [5] stats4_3.0.1    tools_3.0.1     XML_3.98-1.1    xtable_1.7-1
> [9] zlibbioc_1.8.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
>
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319