[BioC] Fwd: BioStrings PairwiseAlignment deletion() insertion() function
Hervé Pagès
hpages at fhcrc.org
Sat Sep 28 02:22:03 CEST 2013
Hi Marcin,
On 09/27/2013 07:48 AM, Marcin Imielinski wrote:
> Sorry, forgot to cc you. Thanks in advance!
>
> ---------- Forwarded message ----------
> From: *Marcin Imielinski* <marcin at broad.mit.edu
> <mailto:marcin at broad.mit.edu>>
> Date: Fri, Sep 27, 2013 at 10:40 AM
> Subject: BioStrings PairwiseAlignment deletion() insertion() function
> To: bioconductor at r-project.org <mailto:bioconductor at r-project.org>
>
>
> Hi -
>
> I'm confused about the output of deletion(pa) and insertion(pa)
> functions for pa = pairwiseAlignment(). My understanding is that they
> should output IRanges corresponding to gaps in the pattern (deletion())
> and gaps in the subject (insertion()) in terms of alignment coordinates.
>
> However, it appears that the outputted ranges can overlap. For example,
> the alignment (below) of a 101 letter pattern and 404 letter subject.
> The deletion ranges overlap. What sequence positions do these ranges
> refer to (pattern, subject, or alignment)?
>
> Is this is a bug or am I misinterpreting this output?
It's a questionable design choice but the ranges describing the
deletions are reported with respect to the "original" (aka unaligned)
pattern, not to the aligned pattern. For example the 1st range you
see in 'deletion(pa)' means there is a deletion of 3 letters (when
going from subject to pattern) and that this deletion starts at position
4 in the original pattern. With such conventions, the *ranges*
describing the deletions can overlap but the deletions don't, which
I must admit is very confusing.
As you noticed, those ranges can be shifted to make them refer to the
aligned pattern:
> offset <- c(0L, cumsum(head(width(deletion(pa)[[1]]), n=-1)))
> shift(deletion(pa)[[1]], offset)
IRanges of length 9
start end width
[1] 4 6 3
[2] 21 39 19
[3] 49 76 28
[4] 84 128 45
[5] 135 172 38
[6] 183 186 4
[7] 195 204 10
[8] 212 215 4
[9] 245 284 40
Hope this helps,
H.
>
> Thanks
> Marcin
>
> ##############################################################
>
> > str[1]
> [1]
> "TCCTTGCACATTGATAAGTTCACATCTGAAATTTGCATGACATAAACATACAGTTGAGAAGGAGAGAACGTATGCCCTATGGTAAATATTGACATTTTAAA"
> > str[2]
> [1]
> "CTGGGCTTTCGATGAAATAGTTCATTTATCTGTGGGTAGATATTACTTACTGGTTGAGTTAAACTGGGTTAAACATCAATTCTATTTCCATTTTTCATTTTTATAAATAGGTACTGAGAATCTTTGTTCATATAAATAGATGGATAGGATTAGCCACTTCTTTGAATTTCTTTTTCAAGTTTCATGCCAAGATTCACATCATAACACATGTAACTGCATGTCTGGATGGAGAACAGATGTACCTATGCAGCGGCAGGGACATCAACACTCTCACTGATGAATTGGCCGAGGAATGAGGAATAGCACAAATCAGCTACGGAACATTGACAAACTGGGAGCTAAACTTTGCTTCATGCCTGTGAGGCAGTATTTTGATGAGCGGTGGATGCCCAGTGCTTCCTTGT"
> >
> > pa = pairwiseAlignment(str[1], str[2])
> > deletion(pa)
> IRangesList of length 1
> [[1]]
> IRanges of length 9
> start end width
> [1] 4 6 3
> [2] 18 36 19
> [3] 27 54 28
> [4] 34 78 45
> [5] 40 77 38
> [6] 50 53 4
> [7] 58 67 10
> [8] 65 68 4
> [9] 94 133 40
>
> > insertion(pa)
> IRangesList of length 1
> [[1]]
> IRanges of length 1
> start end width
> [1] 234 235 2
>
> > nchar(pa)
> [1] 292
>
> > as.character(aligned(pattern(pa)))
> [1]
> "TCC---TTGCACATTGATAA-------------------GTTCACATC----------------------------TGAAATT---------------------------------------------TGCATG--------------------------------------ACATAAACAT----ACAGTTGA----------GAAGGAG----AGAACGTATGCCCTATGGTAAATATTGAC----------------------------------------ATTTTAAA"
> > as.character(aligned(subject(pa)))
> [1]
> "TCCATTTTTCATTTTTATAAATAGGTACTGAGAATCTTTGTTCATATAAATAGATGGATAGGATTAGCCACTTCTTTGAATTTCTTTTTCAAGTTTCATGCCAAGATTCACATCATAACACATGTAACTGCATGTCTGGATGGAGAACAGATGTACCTATGCAGCGGCAGGGACATCAACACTCTCACTGATGAATTGGCCGAGGAATGAGGAATAGCACAAATCAGCTACGG--AACATTGACAAACTGGGAGCTAAACTTTGCTTCATGCCTGTGAGGCAGTATTTTGAT"
>
> ### here is what I would expect deletion(pa) to output ... notice that
> it resembles
> ### the above deletion(pa) output with a shift corresponding to
> cumsum(width)
> ### is this a bug?
>
> > as(which(strsplit(as.character(aligned(subject(pa))), '')[[1]] ==
> "-"), 'IRanges')
> IRanges of length 9
> start end width
> [1] 4 6 3
> [2] 21 39 19
> [3] 49 76 28
> [4] 84 128 45
> [5] 135 172 38
> [6] 183 186 4
> [7] 195 204 10
> [8] 212 215 4
> [9] 245 284 40
>
>
> > sessionInfo()
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] C
>
> attached base packages:
> [1] parallel stats graphics grDevices utils datasets methods
> [8] base
>
> other attached packages:
> [1] Biostrings_2.28.0 IRanges_1.18.4 BiocGenerics_0.6.0
> [4] BiocInstaller_1.10.3 multicore_0.1-7
>
> loaded via a namespace (and not attached):
> [1] stats4_3.0.0 tools_3.0.0
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list