[Bioc-devel] zero-width ranges representing insertions
Hervé Pagès
hpages at fredhutch.org
Tue Mar 17 01:31:12 CET 2015
On 03/16/2015 04:06 PM, Michael Lawrence wrote:
>
>
> On Mon, Mar 16, 2015 at 3:12 PM, Robert Castelo <robert.castelo at upf.edu
> <mailto:robert.castelo at upf.edu>> wrote:
>
> +1 IMO BioC could adopt the zero-width ranges representation for
> insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to
> deal with each corresponding beast, be VCF, dbSNP or the like. Who
> knows, VCF could also change their representation in the future and
> it'll be a headache to update the affected packages if we decide to
> keep using its insertion representation internally to store variant
> ranges in BioC.
>
>
> That would break just about every tool, so let's hope not. There's a
> bunch of code on top of Bioc that currently depends on the current
> representation. For example, zero width ranges do not overlap anything,
> so they need special treatment to e.g. detect whether an insertion falls
> within a gene. There are real benefits to keeping the representation of
> indels consistent with the rest of the field (VCF). There was much
> thought put into this.
Note that findOverlaps() now handles zero-width ranges.
Straight use of findOverlaps() on the ranges of a VCF object leads to
some subtle problems on insertions. For example predictCoding() (which
I guess uses findOverlaps() internally) reports strange things for
these 2 insertions (1 right before and 1 right after the stop codon):
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> cdsBy(txdb, use.names=TRUE)$uc002wcw.3
GRanges object with 2 ranges and 3 metadata columns:
seqnames ranges strand | cds_id cds_name exon_rank
<Rle> <IRanges> <Rle> | <integer> <character> <integer>
[1] chr20 [68351, 68408] + | 206100 <NA> 1
[2] chr20 [76646, 77058] + | 206101 <NA> 2
-------
seqinfo: 93 sequences (1 circular) from hg19 genome
library(VariantAnnotation)
> rowRanges(vcf) # hand-made VCF
GRanges object with 2 ranges and 5 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
ins before stop codon chr20 [77055, 77055] * | <NA>
ins after stop codon chr20 [77058, 77058] * | <NA>
REF ALT QUAL
FILTER
<DNAStringSet> <DNAStringSetList> <numeric>
<character>
ins before stop codon T TG 70
PASS
ins after stop codon A AG 70
PASS
-------
seqinfo: 1 sequence from hg19 genome
Calling predictCoding():
> library(BSgenome.Hsapiens.UCSC.hg19)
> predictCoding(vcf, txdb, Hsapiens)
GRanges object with 2 ranges and 17 metadata columns:
seqnames ranges strand | paramRangeID
<Rle> <IRanges> <Rle> | <factor>
ins before stop codon chr20 [77055, 77055] + | <NA>
ins after stop codon chr20 [77058, 77058] + | <NA>
REF ALT QUAL
FILTER
<DNAStringSet> <DNAStringSetList> <numeric>
<character>
ins before stop codon T TG 70
PASS
ins after stop codon A AG 70
PASS
varAllele CDSLOC PROTEINLOC QUERYID
<DNAStringSet> <IRanges> <IntegerList> <integer>
ins before stop codon TG [468, 468] 156 1
ins after stop codon AG [471, 471] 157 2
TXID CDSID GENEID CONSEQUENCE
<character> <IntegerList> <character> <factor>
ins before stop codon 70477 245938 frameshift
ins after stop codon 70477 245938 frameshift
REFCODON VARCODON REFAA
<DNAStringSet> <DNAStringSet> <AAStringSet>
ins before stop codon AAT AAT
ins after stop codon TAA TAA
VARAA
<AAStringSet>
ins before stop codon
ins after stop codon
-------
seqinfo: 1 sequence from hg19 genome
PROTEINLOC, REFCODON, VARCODON, and CONSEQUENCE don't seem quite right
to me. Could be that my hand-made vcf is missing some important data
needeed by predictCoding() though...
H.
>
> robert.
>
> On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote:
> > There would be a LOT of value in having core packages use exactly the
> > same representation; I don't have any opinion about which one is
> > best.
> >
> > Kasper
> >
> > On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès
> <hpages at fredhutch.org <mailto:hpages at fredhutch.org>
> > <mailto:hpages at fredhutch.org> <mailto:hpages at fredhutch.org>> wrote:
> >
> > On 03/16/2015 09:22 AM, Michael Lawrence wrote:
> >
> > Yes, I think it would make sense for the Xtra package to follow the
> > established convention in VariantAnnotation/VCF.
> >
> >
> > I don't know. I agree it would be nice to use a more consistent
> > representation of insertions across the software but I'm not
> > convinced we should necessarily follow the VCF way, which is to
> > include the base before the event in the ref and alt alleles as well
> > as in the reported range.
> >
> > Note that there doesn't seem to be any consensus in the broader
> > Bioinformatics community either. For example dbSNP and HGVS report
> > the range that corresponds to the 2 flanking nucleotides but they
> > don't include these nucleotides in the ref or alt alleles. VCF does
> > the same as GFF3 which says "start equals end and the implied site is
> > to the right of the indicated base" except that VCF wants to treat
> > events that occur at position 1 in a special way. In that case VCF
> > says the base *after* the event should be included (seems like the
> > VCF authors want to avoid both: empty ranges and also ranges that
> > start at POS 0). BTW it doesn't seem that
> > VariantAnnotation::isInsertion() is aware of that special treatment.
> >
> > UCSC uses a zero-width range, and so does the XtraSNPlocs.*
> > packages. The advantage of this representation is its simplicity.
> > There is no special cases. It also reflects the notion that an
> > insertion is a replacement of an empty string with a non-empty
> > string. No need to include flanking nucleotides in the representation
> > (which is artificial and distorts the "real" alt allele).
> >
> >
> > H.
> >
> >
> > On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo
> > <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>
> <mailto:robert.castelo at upf.edu> <mailto:robert.castelo at upf.edu>> wrote:
> >
> > dear devel people, specially Val and Michael,
> >
> > Hervé has recently added an annotation package that includes
> > non-SNVs variants from dbSNP, it is called:
> >
> > library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
> >
> > if you execute the corresponding example you'll see the kind of
> > information stored in the package:
> >
> > example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
> >
> >
> > if you pay attention to the following case:
> >
> > my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns:
> > seqnames ranges strand | RefSNP_id alleles <Rle>
> > <IRanges> <Rle> | <character> <character> [1] 22 [10513380,
> > 10513380] - | rs386831164 -/T [2] 22 [10519678,
> > 10519677] + | rs71286731 -/TTT ref_allele <DNAStringSet>
> > [1] T [2] - ------- seqinfo: 25 sequences
> > (1 circular) from GRCh38 genome
> >
> > you'll see the first variant (rs386831164) is a deletion of one
> > nucleotide and the second (rs71286731) is an insertion of three
> > nucleotides (TTT).
> >
> > it struck me that the ranges representing the insertion had an start
> > position one nucleotide larger than then and i contacted Hervé
> > thinking that this was a mistake. however, i've learned from him that
> > these are so-called "zero-width" ranges and they actually allow to
> > distinguish insertions from every other type of variant without the
> > need to know anything about the reference or the alternate allele.
> >
> > currently, the VCF specification 4.2 (http://samtools.github.io/
> > hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the
> > REF column to help distinguishing insertions by including the
> > flanking nucleotide of the inserted sequence. As a result,
> > VariantAnnotation::readVcf() produces ranges that mimic this
> > standard having identical start and end positions leading to 1-width
> > ranges:
> >
> > fl <- system.file("extdata", "CEUtrio.vcf.bgz",
> > package="VariantFiltering") vcf <- readVcf(fl, genome="hg19")
> > rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2 ranges
> > and 5 metadata columns: seqnames ranges strand |
> > paramRangeID <Rle> <IRanges> <Rle> | <factor>
> > rs11474033 20 [45093330, 45093330] * | <NA>
> > 20:47592746_G/GGC 20 [47592746, 47592746] * |
> > <NA> REF ALT QUAL FILTER <DNAStringSet>
> > <DNAStringSetList> <numeric> <character> rs11474033 C
> > CTTCT 2901.12 . 20:47592746_G/GGC G
> > GGC 608.88 . ------- seqinfo: 84 sequences from hg19
> > genome
> >
> >
> > table(width(rowRanges(vcf[isInsertion(vcf), ])))
> >
> > 1 78
> >
> > i would like to ask whether it would make sense to harmonize the way
> > in which dbSNP insertions and variants are imported into Bioconductor
> > by making VariantAnnotation::readVcf() to produce zero-width ranges
> > for insertion variants. this not a feature request, i only would like
> > to know what whether there is a particular reason not to use the
> > available zero-width ranges that seem to be implemented for this
> > purpose.
> >
> >
> > cheers,
> >
> > robert.
> >
> > _______________________________________________
> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
> mailing
> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> >
> >
> > [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
> mailing
> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> >
> >
> > -- Hervé Pagès
> >
> > Program in Computational Biology Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024 Seattle, WA 98109-1024
> >
> > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
> <mailto:hpages at fredhutch.org> <mailto:hpages at fredhutch.org> Phone:
> > (206) 667-5791 <tel:%28206%29%20667-5791>
> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> Fax: (206)
> 667-1319
> > <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
> >
> >
> > _______________________________________________
> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
> <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
> mailing
> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> >
> >
>
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list