[Bioc-devel] zero-width ranges representing insertions

Kasper Daniel Hansen kasperdanielhansen at gmail.com
Mon Mar 16 21:00:46 CET 2015


There would be a LOT of value in having core packages use exactly the same
representation; I don't have any opinion about which one is best.

Kasper

On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès <hpages at fredhutch.org> wrote:

> On 03/16/2015 09:22 AM, Michael Lawrence wrote:
>
>> Yes, I think it would make sense for the Xtra package to follow the
>> established convention in VariantAnnotation/VCF.
>>
>
> I don't know. I agree it would be nice to use a more consistent
> representation of insertions across the software but I'm not convinced
> we should necessarily follow the VCF way, which is to include the base
> before the event in the ref and alt alleles as well as in the reported
> range.
>
> Note that there doesn't seem to be any consensus in the broader
> Bioinformatics community either. For example dbSNP and HGVS report the
> range that corresponds to the 2 flanking nucleotides but they don't
> include these nucleotides in the ref or alt alleles. VCF does the same
> as GFF3 which says "start equals end and the implied site is to the
> right of the indicated base" except that VCF wants to treat events that
> occur at position 1 in a special way. In that case VCF says the base
> *after* the event should be included (seems like the VCF authors want
> to avoid both: empty ranges and also ranges that start at POS 0).
> BTW it doesn't seem that VariantAnnotation::isInsertion() is aware of
> that special treatment.
>
> UCSC uses a zero-width range, and so does the XtraSNPlocs.* packages.
> The advantage of this representation is its simplicity. There is no
> special cases. It also reflects the notion that an insertion is
> a replacement of an empty string with a non-empty string. No need
> to include flanking nucleotides in the representation (which is
> artificial and distorts the "real" alt allele).
>
>
> H.
>
>
>> On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo <robert.castelo at upf.edu>
>> wrote:
>>
>>  dear devel people, specially Val and Michael,
>>>
>>> Hervé has recently added an annotation package that includes non-SNVs
>>> variants from dbSNP, it is called:
>>>
>>> library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>>>
>>> if you execute the corresponding example you'll see the kind of
>>> information stored in the package:
>>>
>>> example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>>>
>>>
>>> if you pay attention to the following case:
>>>
>>> my_snps1[1:2]
>>> GRanges object with 2 ranges and 3 metadata columns:
>>>        seqnames               ranges strand |   RefSNP_id     alleles
>>>           <Rle>            <IRanges>  <Rle> | <character> <character>
>>>    [1]       22 [10513380, 10513380]      - | rs386831164         -/T
>>>    [2]       22 [10519678, 10519677]      + |  rs71286731       -/TTT
>>>            ref_allele
>>>        <DNAStringSet>
>>>    [1]              T
>>>    [2]              -
>>>    -------
>>>    seqinfo: 25 sequences (1 circular) from GRCh38 genome
>>>
>>> you'll see the first variant (rs386831164) is a deletion of one
>>> nucleotide
>>> and the second (rs71286731) is an insertion of three nucleotides (TTT).
>>>
>>> it struck me that the ranges representing the insertion had an start
>>> position one nucleotide larger than then and i contacted Hervé thinking
>>> that this was a mistake. however, i've learned from him that these are
>>> so-called "zero-width" ranges and they actually allow to distinguish
>>> insertions from every other type of variant without the need to know
>>> anything about the reference or the alternate allele.
>>>
>>> currently, the VCF specification 4.2 (http://samtools.github.io/
>>> hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the REF
>>> column to help distinguishing insertions by including the flanking
>>> nucleotide of the inserted sequence. As a result,
>>> VariantAnnotation::readVcf() produces ranges that mimic this standard
>>> having identical start and end positions leading to 1-width ranges:
>>>
>>> fl <- system.file("extdata", "CEUtrio.vcf.bgz",
>>> package="VariantFiltering")
>>> vcf <- readVcf(fl, genome="hg19")
>>> rowRanges(vcf[isInsertion(vcf), ])[1:2]
>>> GRanges object with 2 ranges and 5 metadata columns:
>>>                      seqnames               ranges strand | paramRangeID
>>>                         <Rle>            <IRanges>  <Rle> |     <factor>
>>>           rs11474033       20 [45093330, 45093330]      * |         <NA>
>>>    20:47592746_G/GGC       20 [47592746, 47592746]      * |         <NA>
>>>                                 REF                ALT      QUAL
>>> FILTER
>>>                      <DNAStringSet> <DNAStringSetList> <numeric>
>>> <character>
>>>           rs11474033              C              CTTCT   2901.12
>>>    .
>>>    20:47592746_G/GGC              G                GGC    608.88
>>>    .
>>>    -------
>>>    seqinfo: 84 sequences from hg19 genome
>>>
>>>
>>> table(width(rowRanges(vcf[isInsertion(vcf), ])))
>>>
>>>   1
>>> 78
>>>
>>> i would like to ask whether it would make sense to harmonize the way in
>>> which dbSNP insertions and variants are imported into Bioconductor by
>>> making VariantAnnotation::readVcf() to produce zero-width ranges for
>>> insertion variants. this not a feature request, i only would like to know
>>> what whether there is a particular reason not to use the available
>>> zero-width ranges that seem to be implemented for this purpose.
>>>
>>>
>>> cheers,
>>>
>>> robert.
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>         [[alternative HTML version deleted]]
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org
> Phone:  (206) 667-5791
> Fax:    (206) 667-1319
>
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list