[Bioc-devel] Best object structure for representing a pairwise genome alignment ?
Pages, Herve
hpages using fredhutch.org
Mon Sep 21 23:17:55 CEST 2020
Hi Charles, Vince,
Yes, a PairwiseAlignments object will contain the sequences of the 2
genomes being aligned, so it will be big. This could be mitigated by using
one object per chromosome instead of trying to represent the full genome
alignment in a single object, but then you lose the ability to
represent regions that align across chromosomes.
Other downsides of using PairwiseAlignments are:
- You lose the nice/simple block-to-block mapping that GRangePairs
gives you, together with the easy/straightforward way to annotate the
links between blocks (via the metadata columns of the GRangePairs).
- A PairwiseAlignments object can only represent replacements and indels
while the block-to-block mapping in a GRangePairs object can support
rearrangements (in addition to indels and replacements).
- The GRangePairs approach even allows you to represent a many-to-many
relationship between the blocks/regions of the 2 genomes, something that
a PairwiseAlignments-based approach cannot do.
So the GRangePairs approach seems more flexible.
Maybe a better way to support an arbitrary relationship between the
blocks/regions of the 2 genomes would be to use a 3-slot data structure:
2 slots for 2 GRanges objects defining regions on the 2 genomes + 1 slot
for representing the links between the regions defined on each genome
(these links could be stored in a Hits object). Note that this is a
classic bipartite graph. Would particularly make sense if the mapping
between the regions is expected to be many-to-many. This kind of
container would be able to represent a side-by-side comparison of 2
arbitrary genomes, in its most general form, not just a pairwise genome
alignment, which is more restrictive.
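A minimal sketch of what such a 3-slot container could look like, assuming
the GenomicRanges and S4Vectors packages. The class name "GenomeLinks" and
its slot names are hypothetical (this is not an existing Bioconductor
class), and the third link below is made up only to illustrate a
many-to-many mapping; the coordinates are borrowed from Charles' example.

```r
library(GenomicRanges)
library(S4Vectors)

# Regions on each genome (coordinates taken from Charles' example).
target <- GRanges(c("S1:162-550:+", "S1:833-3738:+", "S1:4532-5990:+"))
query  <- GRanges(c("XSR:909374-909853:+", "XSR:910181-913291:+",
                    "chr2:6694031-6695569:+"))

# Edges of the bipartite graph: link i connects target[from[i]] to query[to[i]].
# The third link (2 -> 3) is made up to show a one-to-many relationship.
links <- Hits(from = c(1L, 2L, 2L, 3L), to = c(1L, 2L, 3L, 3L),
              nLnode = length(target), nRnode = length(query))

# Hypothetical container; not an existing Bioconductor class.
setClass("GenomeLinks",
    slots = c(first = "GRanges", second = "GRanges", links = "Hits"),
    validity = function(object) {
        if (nLnode(object@links) != length(object@first) ||
            nRnode(object@links) != length(object@second))
            return("'links' must index into 'first' and 'second'")
        TRUE
    })

x <- new("GenomeLinks", first = target, second = query, links = links)

# Expand the links into parallel block pairs, one element per link.
pairs_first  <- x@first[from(x@links)]
pairs_second <- x@second[to(x@links)]

# Number of partners per region on each side (> 1 means one-to-many).
countLnodeHits(x@links)
countRnodeHits(x@links)
```

Annotations on the links (e.g. an alignment score) could go in the metadata
columns of the Hits object, in the same way they would go in the metadata
columns of a GRangePairs.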
Cheers,
H.
On 9/18/20 02:41, Vincent Carey wrote:
> Starting from
>
> PairwiseAlignments-class package:Biostrings R Documentation
>
> PairwiseAlignments, PairwiseAlignmentsSingleSubject, and
> PairwiseAlignmentsSingleSubjectSummary objects
>
> Description:
>
> The ‘PairwiseAlignments’ class is a container for storing a set of
> pairwise alignments.
>
> The ‘PairwiseAlignmentsSingleSubject’ class is a container for
> storing a set of pairwise alignments with a single subject.
>
> The ‘PairwiseAlignmentsSingleSubjectSummary’ class is a container
> for storing the summary of a set of pairwise alignments.
>
> Usage:
>
> ## Constructors:
> ## When subject is missing, pattern must be of length 2
> ## S4 method for signature 'XString,XString'
> PairwiseAlignments(pattern, subject,
>     type = "global", substitutionMatrix = NULL, gapOpening = 0,
>     gapExtension = 1)
> ## S4 method for signature 'XStringSet,missing'
> PairwiseAlignments(pattern, subject,
>     type = "global", substitutionMatrix = NULL, gapOpening = 0,
>     gapExtension = 1)
> ## S4 method for signature 'character,character'
> PairwiseAlignments(pattern, subject,
>     type = "global", substitutionMatrix = NULL, gapOpening = 0,
>     gapExtension = 1,
>     baseClass = "BString")
>
> ...
>
> my question would be whether this is a relevant starting place? Clearly
> the focus is not on coordinates, but perhaps a structure that maintains
> genomic content and coordinates together would be of use?
>
>
> On Fri, Sep 18, 2020 at 2:49 AM Charles Plessy <charles.plessy using oist.jp>
> wrote:
>
>> Dear Bioc developers,
>>
>> I am currently analysing pairwise genome alignments with Bioconductor,
>> and I represent them with a GRanges object of the first genome,
>> containing one element per alignment block, and storing the coordinates
>> in the other genome in a metadata column containing another GRanges object.
>>
>> Something like this.
>>
>> GRanges object with 36582 ranges and 2 metadata columns:
>>           seqnames      ranges strand |     score                query
>>              <Rle>   <IRanges>  <Rle> | <numeric>            <GRanges>
>>       [1]       S1     162-550      + |       861    XSR:909374-909853
>>       [2]       S1    833-3738      + |      7238    XSR:910181-913291
>>       [3]       S1   3769-4212      + |      1165    XSR:913510-913953
>>       [4]       S1   4246-4381      + |       359    XSR:914134-914275
>>       [5]       S1   4532-5990      + |      2977 chr2:6694031-6695569
>>       ...      ...         ...    ... .       ...                  ...
>>   [36578]      S99 17228-17759      - |       793 chr1:2375870-2376379
>>   [36579]      S99 16417-16935      - |       632 chr1:2376612-2377077
>>   [36580]      S99 12370-12759      - |       773 chr1:2379949-2380343
>>   [36581]      S99   5270-5384      - |       295   chr1:843397-843511
>>   [36582]      S99   1949-3053      - |      2105   chr1:845358-846326
>>   -------
>>
>> Using "Pairwise genome alignment" as a keyword in a search engine, I
>> found that the CNEr package does something similar, although it
>> uses a dedicated "GRangePairs" object for the purpose.
>>
>> Before I start to invest time in either direction, I wanted to check on
>> this mailing list whether other solutions already exist, in
>> particular closer to the core packages?
>>
>> Have a nice day,
>>
>> Charles
>>
>> --
>> Charles Plessy - - ~ ~ ~ ~ ~ ~~~~ ~ ~ ~ ~ ~ - - charles.plessy using oist.jp
>> Okinawa Institute of Science and Technology Graduate University
>> Staff scientist in the Luscombe Unit - ~ - https://groups.oist.jp/grsu
>> Toots from work - ~ ~~ ~ - https://mastodon.technology/@charles_plessy
>>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages using fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319