[Bioc-devel] zero-width ranges representing insertions
Valerie Obenchain
vobencha at fredhutch.org
Fri Mar 20 00:04:11 CET 2015
On 03/16/2015 05:31 PM, Hervé Pagès wrote:
> On 03/16/2015 04:06 PM, Michael Lawrence wrote:
>>
>>
>> On Mon, Mar 16, 2015 at 3:12 PM, Robert Castelo <robert.castelo at upf.edu
>> <mailto:robert.castelo at upf.edu>> wrote:
>>
>> +1 IMO BioC could adopt the zero-width ranges representation for
>> insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to
>> deal with each corresponding beast, be VCF, dbSNP or the like. Who
>> knows, VCF could also change their representation in the future and
>> it'll be a headache to update the affected packages if we decide to
>> keep using its insertion representation internally to store variant
>> ranges in BioC.
>>
>>
>> That would break just about every tool, so let's hope not. There's a
>> bunch of code on top of Bioc that currently depends on the current
>> representation. For example, zero width ranges do not overlap anything,
>> so they need special treatment to e.g. detect whether an insertion falls
>> within a gene. There are real benefits to keeping the representation of
>> indels consistent with the rest of the field (VCF). There was much
>> thought put into this.
>
> Note that findOverlaps() now handles zero-width ranges.
I've had a chance to take a closer look at how VA handles zero-width ranges.
Previously, both predictCoding() and locateVariants() treated zero-width
ranges as width 1 (start decremented to equal end). In VA 1.13.42 this
has been changed for predictCoding() so now zero width are dropped. The
function internals expect REF and ALT to conform with the vcf specs and
zero width ranges aren't used. So, it seemed wise to drop the zero-width
for now.
locateVariants() remains the same because this is more general. I think
it's still useful to identify where a zero width range falls with
respect to gene features.
>
> Straight use of findOverlaps() on the ranges of a VCF object leads to
> some subtle problems on insertions. For example predictCoding() (which
> I guess uses findOverlaps() internally) reports strange things for
> these 2 insertions (1 right before and 1 right after the stop codon):
This output is actually fine. The VARCODON values may be slightly
misleading but the data are correct. predictCoding() only computes amino
acid sequences for snps or indels that conform to the 'groups of 3'
idea. The substitution or deletion must result in the sequence being
divisible by 3 otherwise there is a partial codon at the end that must
be inferred (consider all possible combinations) and then one must be
chosen (consensus). The code does not currently do this and I'm not sure
there is common agreement on how to do it.
This GRanges has a snv followed by 1, 2, and 3 base pair insertions:
>>> rowRanges(vcf)
>> GRanges object with 4 ranges and 5 metadata columns:
>> seqnames ranges strand | paramRangeID REF
>> <Rle> <IRanges> <Rle> | <logical> <DNAStringSet>
>> snv chr20 [77055, 77055] * | <NA> T
>> 1bp ins chr20 [77054, 77055] * | <NA> AT
>> 2bp ins chr20 [77054, 77055] * | <NA> AT
>> 3bp ins chr20 [77054, 77055] * | <NA> AT
>> ALT QUAL FILTER
>> <DNAStringSetList> <numeric> <character>
>> snv G 70 PASS
>> 1bp ins ATC 70 PASS
>> 2bp ins ATCG 70 PASS
>> 3bp ins ATCGG 70 PASS
>> -------
>> seqinfo: 1 sequence from an unspecified genome; no seqlengths
Coding changes are computed for the snv and 3bp insertion but the others
are marked as 'frameshift'. Previously when an indel couldn't be
translated the VARCODON was the same as the REFCODON which may have been
confusing (was intended to mean nothing has changed). I've changed this
so VARCODON is now missing (like VARAA) when it can't be translated.
>
>>> predictCoding(vcf, txdb, Hsapiens)
>> GRanges object with 4 ranges and 17 metadata columns:
>> seqnames ranges strand | paramRangeID REF
>> <Rle> <IRanges> <Rle> | <logical> <DNAStringSet>
>> snv chr20 [77055, 77055] + | <NA> T
>> 1bp ins chr20 [77054, 77055] + | <NA> AT
>> 2bp ins chr20 [77054, 77055] + | <NA> AT
>> 3bp ins chr20 [77054, 77055] + | <NA> AT
>> ALT QUAL FILTER varAllele CDSLOC
>> <DNAStringSetList> <numeric> <character> <DNAStringSet> <IRanges>
>> snv G 70 PASS G [468, 468]
>> 1bp ins ATC 70 PASS ATC [467, 468]
>> 2bp ins ATCG 70 PASS ATCG [467, 468]
>> 3bp ins ATCGG 70 PASS ATCGG [467, 468]
>> PROTEINLOC QUERYID TXID CDSID GENEID
>> <IntegerList> <integer> <character> <IntegerList> <character>
>> snv 156 1 70477 206101 245938
>> 1bp ins 156 2 70477 206101 245938
>> 2bp ins 156 3 70477 206101 245938
>> 3bp ins 156 4 70477 206101 245938
>> CONSEQUENCE REFCODON VARCODON REFAA
>> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
>> snv nonsynonymous AAT AAG N
>> 1bp ins frameshift AAT N
>> 2bp ins frameshift AAT N
>> 3bp ins nonsynonymous AAT AATCGG N
>> VARAA
>> <AAStringSet>
>> snv K
>> 1bp ins
>> 2bp ins
>> 3bp ins NR
>> -------
>> seqinfo: 1 sequence from an unspecified genome; no seqlengths
PROTEINLOC is the codon number in the coding sequence. These are the cds
regions:
>>> cds
>> GRanges object with 2 ranges and 3 metadata columns:
>> seqnames ranges strand | cds_id cds_name exon_rank
>> <Rle> <IRanges> <Rle> | <integer> <character> <integer>
>> [1] chr20 [68351, 68408] + | 206100 <NA> 1
>> [2] chr20 [76646, 77058] + | 206101 <NA> 2
>> -------
>> seqinfo: 93 sequences (1 circular) from hg19 genome
There are 157 codons, position 77055 falls in the second to last codon,
so 156.
>>> sum(width(cds)) / 3
>> [1] 157
>>>
Val
>
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>
> > cdsBy(txdb, use.names=TRUE)$uc002wcw.3
> GRanges object with 2 ranges and 3 metadata columns:
> seqnames ranges strand | cds_id cds_name exon_rank
> <Rle> <IRanges> <Rle> | <integer> <character> <integer>
> [1] chr20 [68351, 68408] + | 206100 <NA> 1
> [2] chr20 [76646, 77058] + | 206101 <NA> 2
> -------
> seqinfo: 93 sequences (1 circular) from hg19 genome
>
>
> library(VariantAnnotation)
> > rowRanges(vcf) # hand-made VCF
> GRanges object with 2 ranges and 5 metadata columns:
> seqnames ranges strand | paramRangeID
> <Rle> <IRanges> <Rle> | <factor>
> ins before stop codon chr20 [77055, 77055] * | <NA>
> ins after stop codon chr20 [77058, 77058] * | <NA>
> REF ALT QUAL
> FILTER
> <DNAStringSet> <DNAStringSetList> <numeric>
> <character>
> ins before stop codon T TG 70
> PASS
> ins after stop codon A AG 70
> PASS
> -------
> seqinfo: 1 sequence from hg19 genome
>
> Calling predictCoding():
>
> > library(BSgenome.Hsapiens.UCSC.hg19)
> > predictCoding(vcf, txdb, Hsapiens)
> GRanges object with 2 ranges and 17 metadata columns:
> seqnames ranges strand | paramRangeID
> <Rle> <IRanges> <Rle> | <factor>
> ins before stop codon chr20 [77055, 77055] + | <NA>
> ins after stop codon chr20 [77058, 77058] + | <NA>
> REF ALT QUAL
> FILTER
> <DNAStringSet> <DNAStringSetList> <numeric>
> <character>
> ins before stop codon T TG 70
> PASS
> ins after stop codon A AG 70
> PASS
> varAllele CDSLOC PROTEINLOC QUERYID
> <DNAStringSet> <IRanges> <IntegerList> <integer>
> ins before stop codon TG [468, 468] 156 1
> ins after stop codon AG [471, 471] 157 2
> TXID CDSID GENEID CONSEQUENCE
> <character> <IntegerList> <character> <factor>
> ins before stop codon 70477 245938 frameshift
> ins after stop codon 70477 245938 frameshift
> REFCODON VARCODON REFAA
> <DNAStringSet> <DNAStringSet> <AAStringSet>
> ins before stop codon AAT AAT
> ins after stop codon TAA TAA
> VARAA
> <AAStringSet>
> ins before stop codon
> ins after stop codon
> -------
> seqinfo: 1 sequence from hg19 genome
>
> PROTEINLOC, REFCODON, VARCODON, and CONSEQUENCE don't seem quite right
> to me. Could be that my hand-made vcf is missing some important data
> needeed by predictCoding() though...
>
> H.
>
>
>>
>> robert.
>>
>> On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote:
>> > There would be a LOT of value in having core packages use
>> exactly the
>> > same representation; I don't have any opinion about which one is
>> > best.
>> >
>> > Kasper
>> >
>> > On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès
>> <hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>> > <mailto:hpages at fredhutch.org> <mailto:hpages at fredhutch.org>>
>> wrote:
>> >
>> > On 03/16/2015 09:22 AM, Michael Lawrence wrote:
>> >
>> > Yes, I think it would make sense for the Xtra package to follow
>> the
>> > established convention in VariantAnnotation/VCF.
>> >
>> >
>> > I don't know. I agree it would be nice to use a more consistent
>> > representation of insertions across the software but I'm not
>> > convinced we should necessarily follow the VCF way, which is to
>> > include the base before the event in the ref and alt alleles as
>> well
>> > as in the reported range.
>> >
>> > Note that there doesn't seem to be any consensus in the broader
>> > Bioinformatics community either. For example dbSNP and HGVS report
>> > the range that corresponds to the 2 flanking nucleotides but they
>> > don't include these nucleotides in the ref or alt alleles. VCF
>> does
>> > the same as GFF3 which says "start equals end and the implied
>> site is
>> > to the right of the indicated base" except that VCF wants to treat
>> > events that occur at position 1 in a special way. In that case VCF
>> > says the base *after* the event should be included (seems like the
>> > VCF authors want to avoid both: empty ranges and also ranges that
>> > start at POS 0). BTW it doesn't seem that
>> > VariantAnnotation::isInsertion() is aware of that special
>> treatment.
>> >
>> > UCSC uses a zero-width range, and so does the XtraSNPlocs.*
>> > packages. The advantage of this representation is its simplicity.
>> > There is no special cases. It also reflects the notion that an
>> > insertion is a replacement of an empty string with a non-empty
>> > string. No need to include flanking nucleotides in the
>> representation
>> > (which is artificial and distorts the "real" alt allele).
>> >
>> >
>> > H.
>> >
>> >
>> > On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo
>> > <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>
>> <mailto:robert.castelo at upf.edu> <mailto:robert.castelo at upf.edu>>
>> wrote:
>> >
>> > dear devel people, specially Val and Michael,
>> >
>> > Hervé has recently added an annotation package that includes
>> > non-SNVs variants from dbSNP, it is called:
>> >
>> > library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>> >
>> > if you execute the corresponding example you'll see the kind of
>> > information stored in the package:
>> >
>> > example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
>> >
>> >
>> > if you pay attention to the following case:
>> >
>> > my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns:
>> > seqnames ranges strand | RefSNP_id alleles <Rle>
>> > <IRanges> <Rle> | <character> <character> [1] 22 [10513380,
>> > 10513380] - | rs386831164 -/T [2] 22 [10519678,
>> > 10519677] + | rs71286731 -/TTT ref_allele
>> <DNAStringSet>
>> > [1] T [2] - ------- seqinfo: 25
>> sequences
>> > (1 circular) from GRCh38 genome
>> >
>> > you'll see the first variant (rs386831164) is a deletion of one
>> > nucleotide and the second (rs71286731) is an insertion of three
>> > nucleotides (TTT).
>> >
>> > it struck me that the ranges representing the insertion had an
>> start
>> > position one nucleotide larger than then and i contacted Hervé
>> > thinking that this was a mistake. however, i've learned from
>> him that
>> > these are so-called "zero-width" ranges and they actually allow to
>> > distinguish insertions from every other type of variant without
>> the
>> > need to know anything about the reference or the alternate allele.
>> >
>> > currently, the VCF specification 4.2 (http://samtools.github.io/
>> > hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition
>> of the
>> > REF column to help distinguishing insertions by including the
>> > flanking nucleotide of the inserted sequence. As a result,
>> > VariantAnnotation::readVcf() produces ranges that mimic this
>> > standard having identical start and end positions leading to
>> 1-width
>> > ranges:
>> >
>> > fl <- system.file("extdata", "CEUtrio.vcf.bgz",
>> > package="VariantFiltering") vcf <- readVcf(fl, genome="hg19")
>> > rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2
>> ranges
>> > and 5 metadata columns: seqnames ranges strand |
>> > paramRangeID <Rle> <IRanges> <Rle> | <factor>
>> > rs11474033 20 [45093330, 45093330] * | <NA>
>> > 20:47592746_G/GGC 20 [47592746, 47592746] * |
>> > <NA> REF ALT QUAL FILTER <DNAStringSet>
>> > <DNAStringSetList> <numeric> <character> rs11474033 C
>> > CTTCT 2901.12 . 20:47592746_G/GGC G
>> > GGC 608.88 . ------- seqinfo: 84 sequences from hg19
>> > genome
>> >
>> >
>> > table(width(rowRanges(vcf[isInsertion(vcf), ])))
>> >
>> > 1 78
>> >
>> > i would like to ask whether it would make sense to harmonize
>> the way
>> > in which dbSNP insertions and variants are imported into
>> Bioconductor
>> > by making VariantAnnotation::readVcf() to produce zero-width
>> ranges
>> > for insertion variants. this not a feature request, i only
>> would like
>> > to know what whether there is a particular reason not to use the
>> > available zero-width ranges that seem to be implemented for this
>> > purpose.
>> >
>> >
>> > cheers,
>> >
>> > robert.
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
>> mailing
>> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> >
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
>> mailing
>> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> >
>> >
>> > -- Hervé Pagès
>> >
>> > Program in Computational Biology Division of Public Health
>> Sciences
>> > Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N,
>> M1-B514
>> > P.O. Box 19024 Seattle, WA 98109-1024
>> >
>> > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org>
>> <mailto:hpages at fredhutch.org> <mailto:hpages at fredhutch.org> Phone:
>> > (206) 667-5791 <tel:%28206%29%20667-5791>
>> <tel:%28206%29%20667-5791> <tel:%28206%29%20667-5791> Fax: (206)
>> 667-1319
>> > <tel:%28206%29%20667-1319> <tel:%28206%29%20667-1319>
>> >
>> >
>> > _______________________________________________
>> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org>
>> <mailto:Bioc-devel at r-project.org> <mailto:Bioc-devel at r-project.org>
>> mailing
>> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
>> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
>> >
>> >
>>
>>
>>
>
More information about the Bioc-devel
mailing list