[BioC] The ambiguous alignment examples in findMateAlignment don't have to be dumped
Hervé Pagès
hpages at fhcrc.org
Sat Jan 25 00:50:16 CET 2014
Hi Yun,
On 01/24/2014 02:01 PM, yun YAN wrote:
> Hi Hervé,
> Thanks for your tips on `readGAlignmentsList`. Unlike
> `readGAlignmentPairs`, it is not compatible with `first`, `last`,
> `left`, `right` function with feedbacks:
>
> unable to find an inherited method for function ‘first’ for
> signature ‘"GAlignmentsList"’
>
> I'd better speak out that the goal of my sub-modules here is to get the
> reads coverage on strand(+)/(-) separately as my RNA-seq library is
> strand-specific, therefore the small fraction of ambiguous reads are
> better not dumped. (BTW you are right the fraction is too small, 1340/13
> million alignments in my test case, to deserve computational resources).
If you want to compute coverage, you don't need to go thru the pairing
process. Just load them in a GAlignments object with readGAlignments()
and call coverage() on it. You won't get *exactly* the same coverage
vector as if you load the alignments with readGAlignmentPairs() because
the latter cannot pair everything. But it will be *very* close.
Alternatively, you can load them with readGAlignmentsList() and do
coverage(unlist(.)) on your GAlignmentsList object. That should give
you exactly the same result as when you load with readGAlignments().
>
> bamData <- readGAlignmentsList("my_alignment.bam", ...)
>
> signalOnPositiveStrand <- coverage(grglist(left(bamData),
> drop.D.range=T))
>
>
> The psudo-code would make it in one-step if my understanding were right
> together with the compatibility of `readGAlignmentsList` and `left`.
> **Is there any other functions or methods that could help me to filter
> the first/second mate when processing the readGAlignmentsList object or
> dumped alignments retrieved by getDumpedAlignments?** Like this:
>
> firstMate <- bamData[mates(bamData) == 1, ]
>
> secondMate <- bamData[mates(bamData) == 2, ]
>
> firstMate <- readsRetrived[mates(readsRetrived) == 1, ]
>
> secondMate <- readsRetrived[mates(readsRetrived) == 2, ]
>
Like scanBam(), which they rely on, the readGAlignment* functions can
take a ScanBamParam object thru their 'param' argument. Using
a ScanBamParam object lets you filter the alignments based on strand,
1st/2nd mate, etc... See ?ScanBamParam for more info.
>
> I do have other three strategies:
> 1> back to root, the `scanBam` function with the following parameters.
> But it is slow (as a result some guys prefer ShortRead package and
> python pysam package), and returns a list, the fundamental R class, thus
> I'd better not recreate wheels to dig out information, construct data
> table, etc.
>
> isFirstMatePrmt <- ScanBamParam(flag = scanBamFlag(isPaired=TRUE,
> isProperPair=TRUE, isFirstMateRead=TRUE), what=scanBamWhat())
So you know about ScanBamParam() and its 'flag' arg. Now use that with
any of the readGAlignment* functions. Note that, when using these
function, some flags are implicit e.g. readGAlignmentPairs()
automatically sets the isPaired flag to TRUE for you.
Cheers,
H.
>
> 2> feed R with the first/second mate bam by using samtools. But it would
> cost extra disk storage and not strongly dependent on R.
> 3> betray to HTSeq, a python package, the worst choice.
>
> Cheers,
>
> Yun
>
> On Thu, Jan 23, 2014 at 7:25 PM, Hervé Pagès <hpages at fhcrc.org
> <mailto:hpages at fhcrc.org>> wrote:
>
> Hi Yun,
>
>
> On 01/23/2014 12:56 PM, yun YAN wrote:
>
> Hi Hervé Pagès and other maintainers in Bioconductor,
>
> Thanks for your package instruction
> (http://140.107.3.20/packages/__2.14/bioc/manuals/__GenomicAlignments/man/__GenomicAlignments.pdf
> <http://140.107.3.20/packages/2.14/bioc/manuals/GenomicAlignments/man/GenomicAlignments.pdf>)
> so that I could understand the how does it work. Here you
> mentioned the
> dumped alignments but I was wondering whether they should be dumped?
>
> seqnames strand cigar qwidth start end
> <Rle> <Rle> <character> <integer> <integer> <integer>
> SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd
> mate/read2 ...
> SRR031714.2658602 chr2R + 21M384N16M 37 6983850 6984270 ## 2nd
> mate/read2 ...
> SRR031714.2658602 chr2R - 13M372N24M 37 6983858 6984266 ## 1st
> mate/read1 ...
> SRR031714.2658602 chr2R - 13M378N24M 37 6983858 6984272 ## 1st
> mate/read1 ...
>
>
> For them you mentioned:
>
> pairs could be rec(1) <-> rec(3) and rec(2) <->
> rec(4), or they could be rec(1) <-> rec(4) and rec(2) <->
> rec(3).
> There is no way to disambiguate!
>
>
> I agree the two pairing strategies above are ambiguous in light of
> codes/computation. But it is fact that rec(1) and rec(2) are exactly
> same,
>
>
> Yes in that case. But it's not always the case. Maybe I should put an
> example in this man page (i.e. the man page for findMateAlignment) where
> the 2 records are not the same.
>
> Note that in order to decide whether they are exactly the same or
> not, one would need to look at all the fixed fields in the 2 records
> plus
> all the possible tags (there can be an arbitrary number of them).
> There is a computational cost associated to this and I don't want to
> impose that cost by default just to save a tiny fraction of ambiguous
> pairs. Not worth it. My understanding is that it's not even possible
> at the moment to load all the tags with Rsamtools (unless you already
> know the tags but I'm not aware of an easy way to discover that).
>
> thusthese two alignments are actually one type of multiple
>
> alignment, though wired but potentially useful. The factor that
> leads to
> the multiple mappable positions of one mate of the fragment and
> fixed
> position of the other mate could probably:
> a> the reference sequence similarity between these
> two neighboring regions together with different splicing
> isoforms. Here
> first mate could have 2 mapped positions. # is exon while = is
> intron.
>
> 2nd mate: ---- ----
> 1st mate: ---- ----
> 1st mate: ---- ----
> isoform1: ==####======#####==========###__#==####===
> isoform2: ==######====#####==========###__#==####===
>
> b> or just aligner artifact, etc. Therefore I cannot agree your
> package
> dumps those alignments.
>
>
> If you want to keep the ambiguous alignments, you can use
> readGAlignmentsList() instead of readGAlignmentPairs().
> See ?readGAlignmentsList for the details.
>
>
> It needs better function name, at least, to
> retrieve them.
>
>
> The utility to retrieve them is called getDumpedAlignments(). What do
> you propose?
>
> Thanks,
> H.
>
>
> Regards,
> Yun
>
>
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org <mailto:hpages at fhcrc.org>
> Phone: (206) 667-5791 <tel:%28206%29%20667-5791>
> Fax: (206) 667-1319 <tel:%28206%29%20667-1319>
>
>
>
>
> --
> YAN Yun, Master of Science
> RNA Lab, 6101
> Molecular Biology and Biochemistry Department
> College of Life Sciences, Wuhan University
> Wuhan, Hubei, P.R.China
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list