[Bioc-devel] zero-width ranges representing insertions

Robert Castelo robert.castelo at upf.edu
Mon Mar 16 23:12:38 CET 2015


+1 IMO BioC could adopt the zero-width ranges representation for 
insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to deal 
with each corresponding beast, be VCF, dbSNP or the like. Who knows, VCF 
could also change their representation in the future and it'll be a 
headache to update the affected packages if we decide to keep using its 
insertion representation internally to store variant ranges in BioC.

robert.

On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote:
> There would be a LOT of value  in having core packages use exactly the
 > same representation; I don't have any opinion about which one is
 > best.
 >
 > Kasper
 >
 > On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès <hpages at fredhutch.org
 > <mailto:hpages at fredhutch.org>> wrote:
 >
 > On 03/16/2015 09:22 AM, Michael Lawrence wrote:
 >
 > Yes, I think it would make sense for the Xtra package to follow the
 > established convention in VariantAnnotation/VCF.
 >
 >
 > I don't know. I agree it would be nice to use a more consistent
 > representation of insertions across the software but I'm not
 > convinced we should necessarily follow the VCF way, which is to
 > include the base before the event in the ref and alt alleles as well
 > as in the reported range.
 >
 > Note that there doesn't seem to be any consensus in the broader
 > Bioinformatics community either. For example dbSNP and HGVS report
 > the range that corresponds to the 2 flanking nucleotides but they
 > don't include these nucleotides in the ref or alt alleles. VCF does
 > the same as GFF3 which says "start equals end and the implied site is
 > to the right of the indicated base" except that VCF wants to treat
 > events that occur at position 1 in a special way. In that case VCF
 > says the base *after* the event should be included (seems like the
 > VCF authors want to avoid both: empty ranges and also ranges that
 > start at POS 0). BTW it doesn't seem that
 > VariantAnnotation::isInsertion() is aware of that special treatment.
 >
 > UCSC uses a zero-width range, and so does the XtraSNPlocs.*
 > packages. The advantage of this representation is its simplicity.
 > There is no special cases. It also reflects the notion that an
 > insertion is a replacement of an empty string with a non-empty
 > string. No need to include flanking nucleotides in the representation
 > (which is artificial and distorts the "real" alt allele).
 >
 >
 > H.
 >
 >
 > On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo
 > <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>> wrote:
 >
 > dear devel people, specially Val and Michael,
 >
 > Hervé has recently added an annotation package that includes
 > non-SNVs variants from dbSNP, it is called:
 >
 > library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
 >
 > if you execute the corresponding example you'll see the kind of
 > information stored in the package:
 >
 > example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
 >
 >
 > if you pay attention to the following case:
 >
 > my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns:
 > seqnames               ranges strand |   RefSNP_id alleles <Rle>
 > <IRanges>  <Rle> | <character> <character> [1]       22 [10513380,
 > 10513380]      - | rs386831164         -/T [2]       22 [10519678,
 > 10519677]      + |  rs71286731       -/TTT ref_allele <DNAStringSet>
 > [1]              T [2]              - ------- seqinfo: 25 sequences
 > (1 circular) from GRCh38 genome
 >
 > you'll see the first variant (rs386831164) is a deletion of one
 > nucleotide and the second (rs71286731) is an insertion of three
 > nucleotides (TTT).
 >
 > it struck me that the ranges representing the insertion had an start
 > position one nucleotide larger than then and i contacted Hervé
 > thinking that this was a mistake. however, i've learned from him that
 > these are so-called "zero-width" ranges and they actually allow to
 > distinguish insertions from every other type of variant without the
 > need to know anything about the reference or the alternate allele.
 >
 > currently, the VCF specification 4.2 (http://samtools.github.io/
 > hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the
 > REF column to help distinguishing insertions by including the
 > flanking nucleotide of the inserted sequence. As a result,
 > VariantAnnotation::readVcf() produces ranges that mimic this
 > standard having identical start and end positions leading to 1-width
 > ranges:
 >
 > fl <- system.file("extdata", "CEUtrio.vcf.bgz",
 > package="VariantFiltering") vcf <- readVcf(fl, genome="hg19")
 > rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2 ranges
 > and 5 metadata columns: seqnames               ranges strand |
 > paramRangeID <Rle>            <IRanges> <Rle> |     <factor>
 > rs11474033       20 [45093330, 45093330]      * | <NA>
 > 20:47592746_G/GGC       20 [47592746, 47592746]      * |
 > <NA> REF                ALT      QUAL      FILTER <DNAStringSet>
 > <DNAStringSetList> <numeric> <character> rs11474033              C
 > CTTCT   2901.12           . 20:47592746_G/GGC              G
 > GGC    608.88           . ------- seqinfo: 84 sequences from hg19
 > genome
 >
 >
 > table(width(rowRanges(vcf[isInsertion(vcf), ])))
 >
 > 1 78
 >
 > i would like to ask whether it would make sense to harmonize the way
 > in which dbSNP insertions and variants are imported into Bioconductor
 > by making VariantAnnotation::readVcf() to produce zero-width ranges
 > for insertion variants. this not a feature request, i only would like
 > to know what whether there is a particular reason not to use the
 > available zero-width ranges that seem to be implemented for this
 > purpose.
 >
 >
 > cheers,
 >
 > robert.
 >
 > _______________________________________________
 > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
 > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
 > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
 >
 >
 > [[alternative HTML version deleted]]
 >
 > _______________________________________________
 > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
 > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
 > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
 >
 >
 > -- Hervé Pagès
 >
 > Program in Computational Biology Division of Public Health Sciences
 > Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514
 > P.O. Box 19024 Seattle, WA 98109-1024
 >
 > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org> Phone:
 > (206) 667-5791 <tel:%28206%29%20667-5791> Fax:    (206) 667-1319
 > <tel:%28206%29%20667-1319>
 >
 >
 > _______________________________________________
 > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
 > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
 > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
 >
 >



	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list