[Bioc-devel] zero-width ranges representing insertions
Hervé Pagès
hpages at fredhutch.org
Mon Mar 16 23:54:03 CET 2015
Using a GAlignments object to show 3 different representations of the
same insertion (three nucleotides (TTT) inserted between positions 10
and 11):
library(GenomicAlignments)
gal <- GAlignments(Rle("chr1", 3), pos=c(10L, 10L, 11L),
cigar=c("1M3I1M", "1M3I", "3I"),
strand=Rle(strand("+"), 3),
ref=c("-", "A", ""),
alt=c("TTT", "ATTT", "TTT"),
who=c("dbSNP/HGVS", "VariantAnnotation/VCF",
"XtraSNPlocs/UCSC"))
> gal
GAlignments object with 3 alignments and 3 metadata columns:
seqnames strand cigar qwidth start end width
<Rle> <Rle> <character> <integer> <integer> <integer> <integer>
[1] chr1 + 1M3I1M 5 10 11 2
[2] chr1 + 1M3I 4 10 10 1
[3] chr1 + 3I 3 11 10 0
njunc | ref alt who
<integer> | <character> <character> <character>
[1] 0 | - TTT dbSNP/HGVS
[2] 0 | A ATTT VariantAnnotation/VCF
[3] 0 | TTT XtraSNPlocs/UCSC
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
In the context of a GAlignments object the alt allele can be seen
as the sequence of the query (SEQ field in BAM). So we expect the
number of nucleotides in the alt allele to match the qwidth:
> nchar(mcols(gal)$alt) == qwidth(gal)
[1] FALSE TRUE TRUE
Not true for the dbSNP/HGVS representation.
The range of the insertion on the reference is:
> as(gal, "GRanges")
GRanges object with 3 ranges and 3 metadata columns:
seqnames ranges strand | ref alt
who
<Rle> <IRanges> <Rle> | <character> <character>
<character>
[1] chr1 [10, 11] + | - TTT
dbSNP/HGVS
[2] chr1 [10, 10] + | A ATTT
VariantAnnotation/VCF
[3] chr1 [11, 10] + | TTT
XtraSNPlocs/UCSC
-------
seqinfo: 1 sequence from an unspecified genome; no seqlengths
We expect its width to be the same as the nb of nucleotides in the
ref alleles:
> width(gal) == nchar(mcols(gal)$ref)
[1] FALSE TRUE TRUE
Again, not true for the dbSNP/HGVS representation.
H.
On 03/16/2015 03:12 PM, Robert Castelo wrote:
> +1 IMO BioC could adopt the zero-width ranges representation for
> insertions, adapting readVcf(), writeVcf(), XtraSNPlocs.*, etc., to deal
> with each corresponding beast, be VCF, dbSNP or the like. Who knows, VCF
> could also change their representation in the future and it'll be a
> headache to update the affected packages if we decide to keep using its
> insertion representation internally to store variant ranges in BioC.
>
> robert.
>
> On 3/16/15 9:00 PM, Kasper Daniel Hansen wrote:
>> There would be a LOT of value in having core packages use exactly the
> > same representation; I don't have any opinion about which one is
> > best.
> >
> > Kasper
> >
> > On Mon, Mar 16, 2015 at 3:38 PM, Hervé Pagès <hpages at fredhutch.org
> > <mailto:hpages at fredhutch.org>> wrote:
> >
> > On 03/16/2015 09:22 AM, Michael Lawrence wrote:
> >
> > Yes, I think it would make sense for the Xtra package to follow the
> > established convention in VariantAnnotation/VCF.
> >
> >
> > I don't know. I agree it would be nice to use a more consistent
> > representation of insertions across the software but I'm not
> > convinced we should necessarily follow the VCF way, which is to
> > include the base before the event in the ref and alt alleles as well
> > as in the reported range.
> >
> > Note that there doesn't seem to be any consensus in the broader
> > Bioinformatics community either. For example dbSNP and HGVS report
> > the range that corresponds to the 2 flanking nucleotides but they
> > don't include these nucleotides in the ref or alt alleles. VCF does
> > the same as GFF3 which says "start equals end and the implied site is
> > to the right of the indicated base" except that VCF wants to treat
> > events that occur at position 1 in a special way. In that case VCF
> > says the base *after* the event should be included (seems like the
> > VCF authors want to avoid both: empty ranges and also ranges that
> > start at POS 0). BTW it doesn't seem that
> > VariantAnnotation::isInsertion() is aware of that special treatment.
> >
> > UCSC uses a zero-width range, and so does the XtraSNPlocs.*
> > packages. The advantage of this representation is its simplicity.
> > There is no special cases. It also reflects the notion that an
> > insertion is a replacement of an empty string with a non-empty
> > string. No need to include flanking nucleotides in the representation
> > (which is artificial and distorts the "real" alt allele).
> >
> >
> > H.
> >
> >
> > On Mon, Mar 16, 2015 at 9:16 AM, Robert Castelo
> > <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>> wrote:
> >
> > dear devel people, specially Val and Michael,
> >
> > Hervé has recently added an annotation package that includes
> > non-SNVs variants from dbSNP, it is called:
> >
> > library(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
> >
> > if you execute the corresponding example you'll see the kind of
> > information stored in the package:
> >
> > example(XtraSNPlocs.Hsapiens.dbSNP141.GRCh38)
> >
> >
> > if you pay attention to the following case:
> >
> > my_snps1[1:2] GRanges object with 2 ranges and 3 metadata columns:
> > seqnames ranges strand | RefSNP_id alleles <Rle>
> > <IRanges> <Rle> | <character> <character> [1] 22 [10513380,
> > 10513380] - | rs386831164 -/T [2] 22 [10519678,
> > 10519677] + | rs71286731 -/TTT ref_allele <DNAStringSet>
> > [1] T [2] - ------- seqinfo: 25 sequences
> > (1 circular) from GRCh38 genome
> >
> > you'll see the first variant (rs386831164) is a deletion of one
> > nucleotide and the second (rs71286731) is an insertion of three
> > nucleotides (TTT).
> >
> > it struck me that the ranges representing the insertion had an start
> > position one nucleotide larger than then and i contacted Hervé
> > thinking that this was a mistake. however, i've learned from him that
> > these are so-called "zero-width" ranges and they actually allow to
> > distinguish insertions from every other type of variant without the
> > need to know anything about the reference or the alternate allele.
> >
> > currently, the VCF specification 4.2 (http://samtools.github.io/
> > hts-specs/VCFv4.2.pdf page 5) uses the nucleotide composition of the
> > REF column to help distinguishing insertions by including the
> > flanking nucleotide of the inserted sequence. As a result,
> > VariantAnnotation::readVcf() produces ranges that mimic this
> > standard having identical start and end positions leading to 1-width
> > ranges:
> >
> > fl <- system.file("extdata", "CEUtrio.vcf.bgz",
> > package="VariantFiltering") vcf <- readVcf(fl, genome="hg19")
> > rowRanges(vcf[isInsertion(vcf), ])[1:2] GRanges object with 2 ranges
> > and 5 metadata columns: seqnames ranges strand |
> > paramRangeID <Rle> <IRanges> <Rle> | <factor>
> > rs11474033 20 [45093330, 45093330] * | <NA>
> > 20:47592746_G/GGC 20 [47592746, 47592746] * |
> > <NA> REF ALT QUAL FILTER <DNAStringSet>
> > <DNAStringSetList> <numeric> <character> rs11474033 C
> > CTTCT 2901.12 . 20:47592746_G/GGC G
> > GGC 608.88 . ------- seqinfo: 84 sequences from hg19
> > genome
> >
> >
> > table(width(rowRanges(vcf[isInsertion(vcf), ])))
> >
> > 1 78
> >
> > i would like to ask whether it would make sense to harmonize the way
> > in which dbSNP insertions and variants are imported into Bioconductor
> > by making VariantAnnotation::readVcf() to produce zero-width ranges
> > for insertion variants. this not a feature request, i only would like
> > to know what whether there is a particular reason not to use the
> > available zero-width ranges that seem to be implemented for this
> > purpose.
> >
> >
> > cheers,
> >
> > robert.
> >
> > _______________________________________________
> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> >
> >
> > [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> >
> >
> > -- Hervé Pagès
> >
> > Program in Computational Biology Division of Public Health Sciences
> > Fred Hutchinson Cancer Research Center 1100 Fairview Ave. N, M1-B514
> > P.O. Box 19024 Seattle, WA 98109-1024
> >
> > E-mail: hpages at fredhutch.org <mailto:hpages at fredhutch.org> Phone:
> > (206) 667-5791 <tel:%28206%29%20667-5791> Fax: (206) 667-1319
> > <tel:%28206%29%20667-1319>
> >
> >
> > _______________________________________________
> > Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
> > list https://stat.ethz.ch/mailman/listinfo/bioc-devel
> > <https://stat.ethz.ch/mailman/listinfo/bioc-devel>
> >
> >
>
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fredhutch.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioc-devel
mailing list