[Bioc-devel] zero-width ranges representing insertions
Michael Lawrence
lawrence.michael at gene.com
Mon Mar 16 17:22:02 CET 2015
Yes, I think it would make sense for the Xtra package to follow the
established convention in VariantAnnotation/VCF.
On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo <robert.castelo at upf.edu>
wrote:
> dear devel people, specially Val and Michael,
>
> Hervé has recently added an annotation package that includes non-SNVs
> variants from dbSNP, it is called:
>
> library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>
> if you execute the corresponding example you'll see the kind of
> information stored in the package:
>
> example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>
>
> if you pay attention to the following case:
>
> my_snps1[1:2]
> GRanges object with 2 ranges and 3 metadata columns:
> seqnames ranges strand | RefSNP_id alleles
> <Rle> <IRanges> <Rle> | <character> <character>
> [1] 22 [10513380, 10513380] - | rs386831164 -/T
> [2] 22 [10519678, 10519677] + | rs71286731 -/TTT
> ref_allele
> <DNAStringSet>
> [1] T
> [2] -
> -------
> seqinfo: 25 sequences (1 circular) from GRCh38 genome
>
> you'll see the first variant (rs386831164) is a deletion of one nucleotide
> and the second (rs71286731) is an insertion of three nucleotides (TTT).
>
> it struck me that the ranges representing the insertion had an start
> position one nucleotide larger than then and i contacted Hervé thinking
> that this was a mistake. however, i've learned from him that these are
> so-called "zero-width" ranges and they actually allow to distinguish
> insertions from every other type of variant without the need to know
> anything about the reference or the alternate allele.
>
> currently, the VCF specification 4.2 (http://samtools.github.io/
> hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the REF
> column to help distinguishing insertions by including the flanking
> nucleotide of the inserted sequence. As a result,
> VariantAnnotation::readVcf() produces ranges that mimic this standard
> having identical start and end positions leading to 1-width ranges:
>
> fl <- system.file("extdata", "CEUtrio.vcf.bgz", package="VariantFiltering")
> vcf <- readVcf(fl, genome="hg19")
> rowRanges(vcf[isInsertion(vcf), ])[1:2]
> GRanges object with 2 ranges and 5 metadata columns:
> seqnames ranges strand | paramRangeID
> <Rle> <IRanges> <Rle> | <factor>
> rs11474033 20 [45093330, 45093330] * | <NA>
> 20:47592746_G/GGC 20 [47592746, 47592746] * | <NA>
> REF ALT QUAL FILTER
> <DNAStringSet> <DNAStringSetList> <numeric> <character>
> rs11474033 C CTTCT 2901.12 .
> 20:47592746_G/GGC G GGC 608.88 .
> -------
> seqinfo: 84 sequences from hg19 genome
>
>
> table(width(rowRanges(vcf[isInsertion(vcf), ])))
>
> 1
> 78
>
> i would like to ask whether it would make sense to harmonize the way in
> which dbSNP insertions and variants are imported into Bioconductor by
> making VariantAnnotation::readVcf() to produce zero-width ranges for
> insertion variants. this not a feature request, i only would like to know
> what whether there is a particular reason not to use the available
> zero-width ranges that seem to be implemented for this purpose.
>
>
> cheers,
>
> robert.
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list