[Bioc-devel] zero-width ranges representing insertions

Hervé Pagès hpages at fredhutch.org
Tue Mar 17 01:31:12 CET 2015


On 03/16/2015 04:06 PM, Michael Lawrence wrote:
>
>
> On Mon, Mar 16, 2015 at 3:12 PM, Robert Castelo <robert.castelo at upf.edu
> <mailto:robert.castelo at upf.edu>> wrote:
>
>     +1 IMO BioC could adopt the zero-width ranges representation for
>     insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to
>     deal with each corresponding beast, be VCF, dbSNP or the like. Who
>     knows, VCF could also change their representation in the future and
>     it'll be a headache to update the affected packages if we decide to
>     keep using its insertion representation internally to store variant
>     ranges in BioC.
>
>
> That would break just about every tool, so let's hope not. There's a
> bunch of code on top of Bioc that currently depends on the current
> representation. For example, zero width ranges do not overlap anything,
> so they need special treatment to e.g. detect whether an insertion falls
> within a gene. There are real benefits to keeping the representation of
> indels consistent with the rest of the field (VCF). There was much
> thought put into this.

Note that findOverlaps() now handles zero-width ranges.

Straight use of findOverlaps() on the ranges of a VCF object leads to
some subtle problems on insertions. For example predictCoding() (which
I guess uses findOverlaps() internally) reports strange things for
these 2 insertions (1 right before and 1 right after the stop codon):

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

 > cdsBy(txdb, use.names=TRUE)$uc002wcw.3
GRanges object with 2 ranges and 3 metadata columns:
       seqnames         ranges strand |    cds_id    cds_name exon_rank
          <Rle>      <IRanges>  <Rle> | <integer> <character> <integer>
   [1]    chr20 [68351, 68408]      + |    206100        <NA>         1
   [2]    chr20 [76646, 77058]      + |    206101        <NA>         2
   -------
   seqinfo: 93 sequences (1 circular) from hg19 genome


library(VariantAnnotation)
 > rowRanges(vcf)  # hand-made VCF
GRanges object with 2 ranges and 5 metadata columns:
                         seqnames         ranges strand | paramRangeID
                            <Rle>      <IRanges>  <Rle> |     <factor>
   ins before stop codon    chr20 [77055, 77055]      * |         <NA>
    ins after stop codon    chr20 [77058, 77058]      * |         <NA>
                                    REF                ALT      QUAL 
   FILTER
                         <DNAStringSet> <DNAStringSetList> <numeric> 
<character>
   ins before stop codon              T                 TG        70 
     PASS
    ins after stop codon              A                 AG        70 
     PASS
   -------
   seqinfo: 1 sequence from hg19 genome

Calling predictCoding():

 > library(BSgenome.Hsapiens.UCSC.hg19)
 > predictCoding(vcf, txdb, Hsapiens)
GRanges object with 2 ranges and 17 metadata columns:
                         seqnames         ranges strand | paramRangeID
                            <Rle>      <IRanges>  <Rle> |     <factor>
   ins before stop codon    chr20 [77055, 77055]      + |         <NA>
    ins after stop codon    chr20 [77058, 77058]      + |         <NA>
                                    REF                ALT      QUAL 
   FILTER
                         <DNAStringSet> <DNAStringSetList> <numeric> 
<character>
   ins before stop codon              T                 TG        70 
     PASS
    ins after stop codon              A                 AG        70 
     PASS
                              varAllele     CDSLOC    PROTEINLOC   QUERYID
                         <DNAStringSet>  <IRanges> <IntegerList> <integer>
   ins before stop codon             TG [468, 468]           156         1
    ins after stop codon             AG [471, 471]           157         2
                                TXID         CDSID      GENEID CONSEQUENCE
                         <character> <IntegerList> <character>    <factor>
   ins before stop codon       70477                    245938  frameshift
    ins after stop codon       70477                    245938  frameshift
                               REFCODON       VARCODON         REFAA
                         <DNAStringSet> <DNAStringSet> <AAStringSet>
   ins before stop codon            AAT            AAT
    ins after stop codon            TAA            TAA
                                 VARAA
                         <AAStringSet>
   ins before stop codon
    ins after stop codon
   -------
   seqinfo: 1 sequence from hg19 genome

PROTEINLOC, REFCODON, VARCODON, and CONSEQUENCE don't seem quite right
to me. Could be that my hand-made vcf is missing some important data
needeed by predictCoding() though...

H.


>
>     robert.
>
>     On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote:
>     > There would be a LOT of value  in having core packages use exactly the
>      > same representation; I don't have any opinion about which one is
>      > best.
>      >
>      > Kasper
>      >
>      > On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès
>     <hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>      > <mailto:hpages at fredhutch.org> <mailto:hpages at fredhutch.org>> wrote:
>      >
>      > On 03/16/2015 09:22 AM, Michael Lawrence wrote:
>      >
>      > Yes, I think it would make sense for the Xtra package to follow the
>      > established convention in VariantAnnotation/VCF.
>      >
>      >
>      > I don't know. I agree it would be nice to use a more consistent
>      > representation of insertions across the software but I'm not
>      > convinced we should necessarily follow the VCF way, which is to
>      > include the base before the event in the ref and alt alleles as well
>      > as in the reported range.
>      >
>      > Note that there doesn't seem to be any consensus in the broader
>      > Bioinformatics community either. For example dbSNP and HGVS report
>      > the range that corresponds to the 2 flanking nucleotides but they
>      > don't include these nucleotides in the ref or alt alleles. VCF does
>      > the same as GFF3 which says "start equals end and the implied site is
>      > to the right of the indicated base" except that VCF wants to treat
>      > events that occur at position 1 in a special way. In that case VCF
>      > says the base *after* the event should be included (seems like the
>      > VCF authors want to avoid both: empty ranges and also ranges that
>      > start at POS 0). BTW it doesn't seem that
>      > VariantAnnotation::isInsertion() is aware of that special treatment.
>      >
>      > UCSC uses a zero-width range, and so does the XtraSNPlocs.*
>      > packages. The advantage of this representation is its simplicity.
>      > There is no special cases. It also reflects the notion that an
>      > insertion is a replacement of an empty string with a non-empty
>      > string. No need to include flanking nucleotides in the representation
>      > (which is artificial and distorts the "real" alt allele).
>      >
>      >
>      > H.
>      >
>      >
>      > On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo
>      > <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>
>     <mailto:robert.castelo at upf.edu> <mailto:robert.castelo at upf.edu>> wrote:
>      >
>      > dear devel people, specially Val and Michael,
>      >
>      > Hervé has recently added an annotation package that includes
>      > non-SNVs variants from dbSNP, it is called:
>      >
>      > library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>      >
>      > if you execute the corresponding example you'll see the kind of
>      > information stored in the package:
>      >
>      > example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>      >
>      >
>      > if you pay attention to the following case:
>      >
>      > my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns:
>      > seqnames               ranges strand |   RefSNP_id alleles <Rle>
>      > <IRanges>  <Rle> | <character> <character> [1]       22 [10513380,
>      > 10513380]      - | rs386831164         -/T [2]       22 [10519678,
>      > 10519677]      + |  rs71286731       -/TTT ref_allele <DNAStringSet>
>      > [1]              T [2]              - ------- seqinfo: 25 sequences
>      > (1 circular) from GRCh38 genome
>      >
>      > you'll see the first variant (rs386831164) is a deletion of one
>      > nucleotide and the second (rs71286731) is an insertion of three
>      > nucleotides (TTT).
>      >
>      > it struck me that the ranges representing the insertion had an start
>      > position one nucleotide larger than then and i contacted Hervé
>      > thinking that this was a mistake. however, i've learned from him that
>      > these are so-called "zero-width" ranges and they actually allow to
>      > distinguish insertions from every other type of variant without the
>      > need to know anything about the reference or the alternate allele.
>      >
>      > currently, the VCF specification 4.2 (http://samtools.github.io/
>      > hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the
>      > REF column to help distinguishing insertions by including the
>      > flanking nucleotide of the inserted sequence. As a result,
>      > VariantAnnotation::readVcf() produces ranges that mimic this
>      > standard having identical start and end positions leading to 1-width
>      > ranges:
>      >
>      > fl <- system.file("extdata", "CEUtrio.vcf.bgz",
>      > package="VariantFiltering") vcf <- readVcf(fl, genome="hg19")
>      > rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2 ranges
>      > and 5 metadata columns: seqnames               ranges strand |
>      > paramRangeID <Rle>            <IRanges> <Rle> |     <factor>
>      > rs11474033       20 [45093330, 45093330]      * | <NA>
>      > 20:47592746_G/GGC       20 [47592746, 47592746]      * |
>      > <NA> REF                ALT      QUAL      FILTER <DNAStringSet>
>      > <DNAStringSetList> <numeric> <character> rs11474033              C
>      > CTTCT   2901.12           . 20:47592746_G/GGC              G
>      > GGC    608.88           . ------- seqinfo: 84 sequences from hg19
>      > genome
>      >
>      >
>      > table(width(rowRanges(vcf[isInsertion(vcf), ])))
>      >
>      > 1 78
>      >
>      > i would like to ask whether it would make sense to harmonize the way
>      > in which dbSNP insertions and variants are imported into Bioconductor
>      > by making VariantAnnotation::readVcf() to produce zero-width ranges
>      > for insertion variants. this not a feature request, i only would like
>      > to know what whether there is a particular reason not to use the
>      > available zero-width ranges that seem to be implemented for this
>      > purpose.
>      >
>      >
>      > cheers,
>      >
>      > robert.
>      >
>      > _______________________________________________
>      > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>     <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
>     mailing
>      > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>      > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>      >
>      >
>      > [[alternative HTML version deleted]]
>      >
>      > _______________________________________________
>      > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>     <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
>     mailing
>      > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>      > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>      >
>      >
>      > -- Hervé Pagès
>      >
>      > Program in Computational Biology Division of Public Health Sciences
>      > Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514
>      > P.O. Box 19024 Seattle, WA 98109-1024
>      >
>      > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>     <mailto:hpages at fredhutch.org> <mailto:hpages at fredhutch.org> Phone:
>      > (206) 667-5791 <tel:%28206%29%20667-5791>
>     <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> Fax:    (206)
>     667-1319
>      > <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
>      >
>      >
>      > _______________________________________________
>      > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>     <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
>     mailing
>      > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>      > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>     <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>      >
>      >
>
>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fredhutch.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioc-devel mailing list