[Bioc-devel] zero-width ranges representing insertions
Robert Castelo
robert.castelo at upf.edu
Mon Mar 16 23:12:38 CET 2015
+1 IMO BioC could adopt the zero-width ranges representation for
insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to deal
with each corresponding beast, be VCF, dbSNP or the like. Who knows, VCF
could also change their representation in the future and it'll be a
headache to update the affected packages if we decide to keep using its
insertion representation internally to store variant ranges in BioC.
robert.
On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote:
> There would be a LOT of value in having core packages use exactly the
> same representation; I don't have any opinion about which one is
> best.
>
> Kasper
>
> On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès <hpages at fredhutch.org
> <mailto:hpages at fredhutch.org>> wrote:
>
> On 03/16/2015 09:22 AM, Michael Lawrence wrote:
>
> Yes, I think it would make sense for the Xtra package to follow the
> established convention in VariantAnnotation/VCF.
>
>
> I don't know. I agree it would be nice to use a more consistent
> representation of insertions across the software but I'm not
> convinced we should necessarily follow the VCF way, which is to
> include the base before the event in the ref and alt alleles as well
> as in the reported range.
>
> Note that there doesn't seem to be any consensus in the broader
> Bioinformatics community either. For example dbSNP and HGVS report
> the range that corresponds to the 2 flanking nucleotides but they
> don't include these nucleotides in the ref or alt alleles. VCF does
> the same as GFF3 which says "start equals end and the implied site is
> to the right of the indicated base" except that VCF wants to treat
> events that occur at position 1 in a special way. In that case VCF
> says the base *after* the event should be included (seems like the
> VCF authors want to avoid both: empty ranges and also ranges that
> start at POS 0). BTW it doesn't seem that
> VariantAnnotation::isInsertion() is aware of that special treatment.
>
> UCSC uses a zero-width range, and so does the XtraSNPlocs.*
> packages. The advantage of this representation is its simplicity.
> There is no special cases. It also reflects the notion that an
> insertion is a replacement of an empty string with a non-empty
> string. No need to include flanking nucleotides in the representation
> (which is artificial and distorts the "real" alt allele).
>
>
> H.
>
>
> On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo
> <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>> wrote:
>
> dear devel people, specially Val and Michael,
>
> Hervé has recently added an annotation package that includes
> non-SNVs variants from dbSNP, it is called:
>
> library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>
> if you execute the corresponding example you'll see the kind of
> information stored in the package:
>
> example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>
>
> if you pay attention to the following case:
>
> my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns:
> seqnames ranges strand | RefSNP_id alleles <Rle>
> <IRanges> <Rle> | <character> <character> [1] 22 [10513380,
> 10513380] - | rs386831164 -/T [2] 22 [10519678,
> 10519677] + | rs71286731 -/TTT ref_allele <DNAStringSet>
> [1] T [2] - ------- seqinfo: 25 sequences
> (1 circular) from GRCh38 genome
>
> you'll see the first variant (rs386831164) is a deletion of one
> nucleotide and the second (rs71286731) is an insertion of three
> nucleotides (TTT).
>
> it struck me that the ranges representing the insertion had an start
> position one nucleotide larger than then and i contacted Hervé
> thinking that this was a mistake. however, i've learned from him that
> these are so-called "zero-width" ranges and they actually allow to
> distinguish insertions from every other type of variant without the
> need to know anything about the reference or the alternate allele.
>
> currently, the VCF specification 4.2 (http://samtools.github.io/
> hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the
> REF column to help distinguishing insertions by including the
> flanking nucleotide of the inserted sequence. As a result,
> VariantAnnotation::readVcf() produces ranges that mimic this
> standard having identical start and end positions leading to 1-width
> ranges:
>
> fl <- system.file("extdata", "CEUtrio.vcf.bgz",
> package="VariantFiltering") vcf <- readVcf(fl, genome="hg19")
> rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2 ranges
> and 5 metadata columns: seqnames ranges strand |
> paramRangeID <Rle> <IRanges> <Rle> | <factor>
> rs11474033 20 [45093330, 45093330] * | <NA>
> 20:47592746_G/GGC 20 [47592746, 47592746] * |
> <NA> REF ALT QUAL FILTER <DNAStringSet>
> <DNAStringSetList> <numeric> <character> rs11474033 C
> CTTCT 2901.12 . 20:47592746_G/GGC G
> GGC 608.88 . ------- seqinfo: 84 sequences from hg19
> genome
>
>
> table(width(rowRanges(vcf[isInsertion(vcf), ])))
>
> 1 78
>
> i would like to ask whether it would make sense to harmonize the way
> in which dbSNP insertions and variants are imported into Bioconductor
> by making VariantAnnotation::readVcf() to produce zero-width ranges
> for insertion variants. this not a feature request, i only would like
> to know what whether there is a particular reason not to use the
> available zero-width ranges that seem to be implemented for this
> purpose.
>
>
> cheers,
>
> robert.
>
> _______________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
> list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
> list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
> -- Hervé Pagès
>
> Program in Computational Biology Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024 Seattle, WA 98109-1024
>
> E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org> Phone:
> (206) 667-5791 <tel:%28206%29%20667-5791> Fax: (206) 667-1319
> <tel:%28206%29%20667-1319>
>
>
> _______________________________________________
> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
> list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list