[BioC] [devteam-bioc] readGAlignmentPairsFromBam and readGAlignmentFromBam differences
Valerie Obenchain
vobencha at fhcrc.org
Mon Feb 10 21:42:24 CET 2014
Hi,
On 02/10/2014 07:19 AM, Maintainer wrote:
>
> Hello,
>
> I am doing an RNA-seq analysis for paired-end reads and I have some questions for readGAlignmentPairsFromBam. I have mapped the reads with tophatv2 and aligned them with Genomic Features using the readGAlignmentPairsFromBam function,
Just to clarify, readGAlignmentsPairsFromBam() does not align reads. It
reads in records that have been previously aligned.
see the code below.
> I understand that since I have paired-end reads I should use readGAlignmentPairsFromBam instead of readGAlignmentFromBam
There is a family of functions available for reading in single and
paired-end data. The functions with the *Bam extension are lower level
and will probably go away (deprecated) in the next release cycle to
eliminate redundancy. The functions below are the higher level counter
parts that call the *Bam functions; these are the functions we recommend
you use.
Single-end:
readGAlignments()
Paired-end"
readGAlignmentPairs()
readGAlignmentsList()
If you have paired-end data you can use either readGAlignmentPairs() or
readGAlignmentsList(). These functions both use a mate-pairing algorithm
but return different classes. In the devel branch the two call the same
algorithm, in release they are different.
readGAlignmentPairs() returns a GAlignmentPairs class. This class can
only hold pairs so any unmapped reads, singletons etc. are dumped. The
dumped reads can be retrieved with getDumpedAlignments().
readGAlignmentsList() returns a List of GAlignments. The class can hold
both pairs and records that cannot be paired. See ?readGAlignmentsList
for details.
but it seems to have much more strict criteria for the pairing and
therefore, the counts are much lower (mean number of counts: 185 versus
440 for a random sample).
>
> I read the pairing criteria trying to understand why lots of the reads are not being counted but I get differences for which the reason of exclusion is not clear to me.
>
> example for one of these reads:
> $ samtools view 209_accepted_hits.bam | grep "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 97 chr15 77414942 50 51M = 77415180 289 GTCTGCAGAGTCTCCACTCTGCCGGAAGTCGGATCCCCTTTTAGACTCCAG &&&(((((**(**++++++++++++++++++++)+++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
> HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259 145 chr15 77415180 50 51M = 77414942 -289 TGCATTTTGATCCAGTCACCGTTTCTTTAGGCTGCTCTGCCCTTAACCCAG *+(#++)+)+++**($)(++++*)(+++*)'++++*)+**(((&&%%&$$$ AS:i:-3 XN:i:0 XM:i:1 XO:i:0 XG:i:0 NM:i:1 MD:Z:20A30 YT:Z:UU XS:A:- NH:i:1
>
> is included in the result from readGAlignmentFromBam but not from readGAlignmentPairsFromBam.
readGAlignmentsFromBam() returns both records because it treats all
reads as single-end; there is no mate-paring here. The output of the two
functions is not expected to be the same because one assumes single-end
reads, the other assumes paired-end.
Are you saying that one of these reads above is not returned by
readGAlignmentPairs()? If they are pairs (which it looks like they are)
either both or neither should come back from readGAlignmentPairs().
Remember the class has right and left pairs that can be accessed with
right() and left(). If neither are returned then they did not meet the
mate-pairing criteria. As an fyi, you can investigate bam flags further
with the bamFlagAsBitMatrix() function.
> Also, many pairs with bitwise flags of 97 and 145 (for pair) are not included in readGAlignmentPairsFromBam although their flags satisfy the pairing criteria.
>
> Also I get some reads in readGAlignmentPairsFromBam that I don’t get from readGAlignmentFromBam, which again I don’t understand why.
readGAlignmentPairsFromBam() returns records that satisfy a mate-pairing
algorithm. readGAlignmentFromBam() treats all records as single-end;
there is no mate-pairing criteria in this function so all records are
returned (sans those you filter out with ScanBamParam).
>
> Example for one of these reads:
> $samtools view 209_accepted_hits.bam.fltrd.bam | grep "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 161 chr15 77426477 50 51M = 77426641 215 TGTGGTTCCAGTCACCAGCTGCACATCTGAGACACACCAGACACAAGCTCT %$%&('&(**)(*+++)++++++++++++++++++++++++++++++++++ AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU XS:A:- NH:i:1
> HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830 81 chr15 77426641 50 51M = 77426477 -215 GCCATAGCTGTCCATGATCTCTGATTACCTTTCTTCTATGATTTAAAAGGG ++++++++++++++++++++++++++*+++++++++++*****(((((&&& AS:i:0 XN:i:0 XM:i:0 XO:i:0 XG:i:0 NM:i:0 MD:Z:51 YT:Z:UU NH:i:1
>
Can you try reading the data in with readGAlignmentsList()? This class
will hold pairs and records that did not meet the pairing criteria.
Using GenomicRanges 1.14.4 in release:
library(GenomicRanges)
library(Rsamtools)
library(pasillaBamSubset)
fl <- untreated3_chr4()
galist <- readGAlignmentsList(fl)
The metadata column 'mates' indicates if they passed the mate criteria.
>> head(galist, 2)
> GAlignmentsList of length 2:
> $1
> GAlignments with 2 alignments and 1 metadata column:
> seqnames strand cigar qwidth start end width ngap | mates
> [1] chr4 + 37M 37 169 205 37 0 | TRUE
> [2] chr4 - 37M 37 326 362 37 0 | TRUE
>
> $2
> GAlignments with 2 alignments and 1 metadata column:
> seqnames strand cigar qwidth start end width ngap | mates
> [1] chr4 + 37M 37 946 982 37 0 | TRUE
> [2] chr4 - 37M 37 986 1022 37 0 | TRUE
>
Value is FALSE for records that did not pass.
>> tail(galist, 2)
> GAlignmentsList of length 2:
> $95788
> GAlignments with 2 alignments and 1 metadata column:
> seqnames strand cigar qwidth start end width ngap | mates
> [1] chr4 + 37M 37 87190 87226 37 0 | FALSE
> [2] chr4 - 37M 37 87476 87512 37 0 | FALSE
>
> $95789
> GAlignments with 2 alignments and 1 metadata column:
> seqnames strand cigar qwidth start end width ngap | mates
> [1] chr4 + 37M 37 12491 12527 37 0 | FALSE
> [2] chr4 - 37M 37 12667 12703 37 0 | FALSE
If these pairs appear in the output from readGAlignmentsList() with
mates=FALSE then they did not meet some aspect of the criteria. In this
case the records can also be found with getDumpedAlignments() after
runing readGAlignmentPairs().
Let me know how it goes.
Valerie
> Can someone, please, explain why this is?
> Thank you in advance for your time.
> Emmanouela
>
>
> -- output of sessionInfo():
>
>
>
> library(GenomicRanges)
> library(Rsamtools)
> library(GenomicFeatures)
> library(rtracklayer)
>
> txdb.ensgene <- makeTranscriptDbFromUCSC(genome = "mm9", tablename = "refGene")
> range.exon <- exonsBy(txdb.ensgene, "gene")
>
> my_bam <- "209_accepted_hits.bam.fltrd.bam"
> aligns <- readGAlignmentPairsFromBam(my_bam,use.names=T)
> aligns_s <- readGAlignmentsFromBam(my_bam,use.names=T)
>
> temp3 <- findOverlaps(range.exon["328561"],aligns) # gene Apol10b
> temp3_aligns <- aligns[subjectHits(temp3)]
>
> temp4 <- findOverlaps(range.exon["328561"],aligns_s)
> temp4_aligns <- aligns_s[subjectHits(temp4)]
>
> setdiff(names(temp3_aligns) , names(temp4_aligns))[1]
> [1] "HISEQ2000-02:440:C35VLACXX:7:2311:15754:64830"
>
> setdiff(names(temp4_aligns),names(temp3_aligns))[1]
> [1] "HISEQ2000-02:440:C35VLACXX:7:1102:3519:68259"
>
>
>
> sessionInfo()
> R version 3.0.1 (2013-05-16)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_GB.ISO-8859-1 LC_NUMERIC=C
> [3] LC_TIME=en_GB.ISO-8859-1 LC_COLLATE=en_GB.ISO-8859-1
> [5] LC_MONETARY=en_GB.ISO-8859-1 LC_MESSAGES=en_GB.ISO-8859-1
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_GB.ISO-8859-1 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] org.Mm.eg.db_2.10.1 RSQLite_0.11.4 DBI_0.2-7
> [4] annotate_1.40.0 rtracklayer_1.22.1 GenomicFeatures_1.14.2
> [7] AnnotationDbi_1.24.0 Biobase_2.22.0 Rsamtools_1.14.2
> [10] Biostrings_2.30.1 GenomicRanges_1.14.4 XVector_0.2.0
> [13] IRanges_1.20.6 BiocGenerics_0.8.0
>
> loaded via a namespace (and not attached):
> [1] biomaRt_2.18.0 bitops_1.0-6 BSgenome_1.30.0 RCurl_1.95-4.1
> [5] stats4_3.0.1 tools_3.0.1 XML_3.98-1.1 xtable_1.7-1
> [9] zlibbioc_1.8.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>
>
>
> ________________________________________________________________________
> devteam-bioc mailing list
> To unsubscribe from this mailing list send a blank email to
> devteam-bioc-leave at lists.fhcrc.org
> You can also unsubscribe or change your personal options at
> https://lists.fhcrc.org/mailman/listinfo/devteam-bioc
>
--
Valerie Obenchain
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: vobencha at fhcrc.org
Phone: (206) 667-3158
Fax: (206) 667-1319
More information about the Bioconductor
mailing list