[Bioc-devel] zero-width ranges representing insertions

Michael Lawrence lawrence.michael at gene.com
Mon Mar 16 17:22:02 CET 2015


Yes, I think it would make sense for the Xtra package to follow the
established convention in VariantAnnotation/VCF.

On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo <robert.castelo at upf.edu>
wrote:

> dear devel people, especially Val and Michael,
>
> Hervé has recently added an annotation package that includes non-SNV
> variants from dbSNP. it is called:
>
> library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>
> if you execute the corresponding example you'll see the kind of
> information stored in the package:
>
> example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>
>
> if you pay attention to the following case:
>
> my_snps1[1:2]
> GRanges object with 2 ranges and 3 metadata columns:
>       seqnames               ranges strand |   RefSNP_id     alleles
>          <Rle>            <IRanges>  <Rle> | <character> <character>
>   [1]       22 [10513380, 10513380]      - | rs386831164         -/T
>   [2]       22 [10519678, 10519677]      + |  rs71286731       -/TTT
>           ref_allele
>       <DNAStringSet>
>   [1]              T
>   [2]              -
>   -------
>   seqinfo: 25 sequences (1 circular) from GRCh38 genome
>
> you'll see the first variant (rs386831164) is a deletion of one nucleotide
> and the second (rs71286731) is an insertion of three nucleotides (TTT).
>
> it struck me that the range representing the insertion had a start
> position one nucleotide larger than its end, and i contacted Hervé thinking
> that this was a mistake. however, i've learned from him that these are
> so-called "zero-width" ranges and that they actually allow one to
> distinguish insertions from every other type of variant without the need
> to know anything about the reference or the alternate allele.
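
A minimal sketch of the zero-width idea described above, assuming only the IRanges package and reusing the two coordinates from the printed example. The object names (del, ins, x) are illustrative and not taken from the XtraSNPlocs package.

```r
library(IRanges)

# One-nucleotide deletion (rs386831164): start == end, so width is 1.
del <- IRanges(start = 10513380L, end = 10513380L)

# Insertion (rs71286731): start == end + 1, so the range covers no
# reference base and its width is 0.
ins <- IRanges(start = 10519678L, end = 10519677L)

width(del)
width(ins)

# Under this convention the width alone flags the insertion,
# without inspecting any allele column.
x <- c(del, ins)
which(width(x) == 0L)
```

Here width(del) is 1 and width(ins) is 0, so the last call picks out the second range only.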
>
> currently, the VCF specification 4.2
> (http://samtools.github.io/hts-specs/VCFv4.2.pdf page 5) uses the
> nucleotide composition of the REF column to help distinguish insertions,
> by including the nucleotide flanking the inserted sequence. As a result,
> VariantAnnotation::readVcf() produces ranges that mimic this standard,
> with identical start and end positions, leading to 1-width ranges:
>
> fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering")
> vcf <- readVcf(fl, genome="hg19")
> rowRanges(vcf[isInsertion(vcf), ])[1:2]
> GRanges object with 2 ranges and 5 metadata columns:
>                     seqnames               ranges strand | paramRangeID
>                        <Rle>            <IRanges>  <Rle> |     <factor>
>          rs11474033       20 [45093330, 45093330]      * |         <NA>
>   20:47592746_G/GGC       20 [47592746, 47592746]      * |         <NA>
>                                REF                ALT      QUAL      FILTER
>                     <DNAStringSet> <DNAStringSetList> <numeric> <character>
>          rs11474033              C              CTTCT   2901.12           .
>   20:47592746_G/GGC              G                GGC    608.88           .
>   -------
>   seqinfo: 84 sequences from hg19 genome
>
>
> table(width(rowRanges(vcf[isInsertion(vcf), ])))
>
>  1
> 78
>
> i would like to ask whether it would make sense to harmonize the way in
> which insertions from dbSNP and from VCF files are imported into
> Bioconductor, by having VariantAnnotation::readVcf() produce zero-width
> ranges for insertion variants. this is not a feature request, i only would
> like to know whether there is a particular reason not to use the available
> zero-width ranges that seem to be implemented for this purpose.
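
The harmonization asked about above could be sketched in base R as follows. This is only an illustration under stated assumptions: the helper name insertion_to_zero_width is hypothetical and not part of VariantAnnotation, and it handles only simple insertions where REF is a single flanking base and ALT starts with that base, as in the two records shown earlier.

```r
# Hypothetical helper (not a VariantAnnotation function): map VCF-style
# insertion records (POS, REF, ALT) to zero-width coordinates.
insertion_to_zero_width <- function(pos, ref, alt) {
  # Assumption: REF is one flanking base and ALT extends it.
  stopifnot(nchar(ref) == 1L,
            startsWith(alt, ref),
            nchar(alt) > nchar(ref))
  data.frame(
    start    = pos + 1,          # first base after the insertion point
    end      = pos,              # flanking base before the insertion point
    width    = 0L,               # end - start + 1
    inserted = substring(alt, 2L),  # ALT without the flanking base
    stringsAsFactors = FALSE
  )
}

# The two insertions printed from CEUtrio.vcf.bgz above.
zw <- insertion_to_zero_width(
  pos = c(45093330, 47592746),
  ref = c("C", "G"),
  alt = c("CTTCT", "GGC")
)
print(zw)
```

The inserted sequences come out as TTCT and GC, each placed between the flanking base and the next reference position, which matches the start = end + 1 layout of the dbSNP example.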
>
>
> cheers,
>
> robert.
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
