[Bioc-devel] Best object structure for representing a pairwise genome alignment ?

Pages, Herve hpages using fredhutch.org
Mon Sep 21 23:17:55 CEST 2020


Hi Charles, Vince,

Yes, a PairwiseAlignments object will contain the sequences of the 2 
genomes being aligned, so it will be big. This could be mitigated by 
using one object per chromosome instead of trying to represent the full 
genome alignment in a single object, but then you lose the ability to 
represent regions that align across chromosomes.

Other downsides of using PairwiseAlignments are:
- You lose the nice/simple block-to-block mapping that GRangePairs 
gives you, together with the easy/straightforward way to annotate the 
links between blocks (via the metadata columns of the GRangePairs).
- A PairwiseAlignments object can only represent replacements and indels, 
while the block-to-block mapping in a GRangePairs object can support 
rearrangements (in addition to indels and replacements).
- The GRangePairs approach even allows you to represent a many-to-many 
relationship between the blocks/regions of the 2 genomes, something that 
a PairwiseAlignments-based approach cannot do.
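
To make the block-to-block mapping concrete, here is a minimal sketch. 
It uses the generic Pairs container from S4Vectors as a stand-in (an 
assumption made for illustration, not CNEr's GRangePairs class itself). 
The 3 blocks and their scores are rows [1], [2] and [5] of Charles's 
example below.

```r
library(GenomicRanges)  # also attaches S4Vectors and IRanges

# Alignment blocks on the first genome (rows [1], [2], [5] of the example).
target <- GRanges(c("S1:162-550:+", "S1:833-3738:+", "S1:4532-5990:+"))

# Matching blocks on the second genome, in the same order.
query <- GRanges(c("XSR:909374-909853",
                   "XSR:910181-913291",
                   "chr2:6694031-6695569"))

# One pair per block; the link annotation (here the score) is stored
# as a metadata column on the pairs.
blocks <- Pairs(target, query, score = c(861, 7238, 2977))

# The two sides and the annotation stay in sync when subsetting.
# A block that lands on another sequence of the second genome is just
# another pair.
on_chr2 <- blocks[as.character(seqnames(second(blocks))) == "chr2"]
first(on_chr2)
mcols(on_chr2)$score
```

Each pair links exactly one range on each side, so a region that maps 
to several places has to be repeated across pairs.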

So the GRangePairs approach seems more flexible.

Maybe a better way to support an arbitrary relationship between the 
blocks/regions of the 2 genomes would be to use a 3-slot data structure: 
2 slots for 2 GRanges objects defining regions on the 2 genomes + 1 slot 
for representing the links between the regions defined on each genome 
(these links could be stored in a Hits object). Note that this is a 
classic bipartite graph. This would make particular sense if the mapping 
between the regions is expected to be many-to-many. This kind of 
container would be able to represent a side-by-side comparison of 2 
arbitrary genomes in its most general form, not just a pairwise genome 
alignment, which is more restrictive.
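
A rough sketch of what such a 3-slot container could look like is 
below. The class name GenomeComparison, its slot names, and all the 
coordinates and scores are hypothetical, made up for illustration; no 
such class exists in the core packages as far as this thread is 
concerned.

```r
library(GenomicRanges)  # also attaches S4Vectors and IRanges

# Hypothetical container: regions on each genome + links between them.
setClass("GenomeComparison",
    slots = c(regions1 = "GRanges", regions2 = "GRanges", links = "Hits"),
    validity = function(object) {
        # The Hits object must index into the two sets of regions.
        if (nLnode(object@links) != length(object@regions1) ||
            nRnode(object@links) != length(object@regions2))
            return("'links' must match the lengths of the 2 region sets")
        TRUE
    }
)

# Toy regions (made-up coordinates).
regions1 <- GRanges(c("S1:1-100", "S1:201-300"))
regions2 <- GRanges(c("chr1:1001-1100", "chr1:5001-5100", "chr2:1-100"))

# Many-to-many links: region 1 of genome 1 is linked to regions 1 and 3
# of genome 2, and region 3 of genome 2 is linked from both regions of
# genome 1. Link annotations go in the metadata columns of the Hits.
links <- Hits(from = c(1L, 1L, 2L, 2L), to = c(1L, 3L, 2L, 3L),
              nLnode = length(regions1), nRnode = length(regions2),
              score = c(10, 20, 30, 40))

cmp <- new("GenomeComparison",
           regions1 = regions1, regions2 = regions2, links = links)

# Regions of genome 2 linked to the first region of genome 1.
partners <- cmp@regions2[to(cmp@links)[from(cmp@links) == 1L]]
partners
```

Because the regions are stored once and the links only hold integer 
indices, a region involved in many links is not duplicated.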

Cheers,
H.

On 9/18/20 02:41, Vincent Carey wrote:
> Starting from
> 
> PairwiseAlignments-class      package:Biostrings       R Documentation
> 
> PairwiseAlignments, PairwiseAlignmentsSingleSubject, and
> PairwiseAlignmentsSingleSubjectSummary objects
> 
> Description:
> 
>       The ‘PairwiseAlignments’ class is a container for storing a set of
>       pairwise alignments.
> 
>       The ‘PairwiseAlignmentsSingleSubject’ class is a container for
>       storing a set of pairwise alignments with a single subject.
> 
>       The ‘PairwiseAlignmentsSingleSubjectSummary’ class is a container
>       for storing the summary of a set of pairwise alignments.
> 
> Usage:
> 
>       ## Constructors:
>       ## When subject is missing, pattern must be of length 2
>       ## S4 method for signature 'XString,XString'
>       PairwiseAlignments(pattern, subject,
>         type = "global", substitutionMatrix = NULL, gapOpening = 0,
>         gapExtension = 1)
>       ## S4 method for signature 'XStringSet,missing'
>       PairwiseAlignments(pattern, subject,
>         type = "global", substitutionMatrix = NULL, gapOpening = 0,
>         gapExtension = 1)
>       ## S4 method for signature 'character,character'
>       PairwiseAlignments(pattern, subject,
>         type = "global", substitutionMatrix = NULL, gapOpening = 0,
>         gapExtension = 1,
>         baseClass = "BString")
> 
> ...
> 
> my question would be whether this is a relevant starting place?  Clearly
> the focus is not on coordinates, but perhaps a structure that maintains
> genomic content and coordinates together would be of use?
> 
> 
> On Fri, Sep 18, 2020 at 2:49 AM Charles Plessy <charles.plessy using oist.jp>
> wrote:
> 
>> Dear Bioc developers,
>>
>> I am currently analysing pairwise genome alignments with Bioconductor,
>> and I represent them with a GRanges object of the first genome,
>> containing one element per alignment block, and storing the coordinates
>> in the other genome in a metadata column containing another GRanges object.
>>
>> Something like this.
>>
>> GRanges object with 36582 ranges and 2 metadata columns:
>>             seqnames      ranges strand |     score                query
>>                <Rle>   <IRanges>  <Rle> | <numeric>            <GRanges>
>>         [1]       S1     162-550      + |       861    XSR:909374-909853
>>         [2]       S1    833-3738      + |      7238    XSR:910181-913291
>>         [3]       S1   3769-4212      + |      1165    XSR:913510-913953
>>         [4]       S1   4246-4381      + |       359    XSR:914134-914275
>>         [5]       S1   4532-5990      + |      2977 chr2:6694031-6695569
>>         ...      ...         ...    ... .       ...                  ...
>>     [36578]      S99 17228-17759      - |       793 chr1:2375870-2376379
>>     [36579]      S99 16417-16935      - |       632 chr1:2376612-2377077
>>     [36580]      S99 12370-12759      - |       773 chr1:2379949-2380343
>>     [36581]      S99   5270-5384      - |       295   chr1:843397-843511
>>     [36582]      S99   1949-3053      - |      2105   chr1:845358-846326
>>     -------
>>
>> Using "Pairwise genome alignment" as a keyword in a search engine, I
>> found that the CNEr package is doing something similar, although it
>> uses a dedicated "GRangePairs" object for the purpose.
>>
>> Before I start to invest time in either direction, I wanted to check on
>> this mailing list whether other solutions already exist, in
>> particular closer to the core packages?
>>
>> Have a nice day,
>>
>> Charles
>>
>> --
>> Charles Plessy - - ~ ~ ~ ~ ~ ~~~~ ~ ~ ~ ~ ~ - - charles.plessy using oist.jp
>> Okinawa  Institute  of  Science  and  Technology  Graduate  University
>> Staff scientist in the Luscombe Unit - ~ - https://groups.oist.jp/grsu
>> Toots from work - ~ ~~ ~ - https://mastodon.technology/@charles_plessy
>>
>> _______________________________________________
>> Bioc-devel using r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
> 

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages using fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319