[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness
Robert Castelo
robert.castelo at upf.edu
Thu Jun 11 18:53:10 CEST 2015
In fact, a feature like locateVariants() returning an object of the same
class of the input query argument would be (IMO) a useful feature at
least in the case of an input CollapsedVCF object, because the user
could then dump directly the annotations to a VCF file with writeVcf(),
which is one way to store variant annotations into files.
On 6/11/15 6:38 PM, Michael Lawrence wrote:
> I didn't realize that locateVariants() returned an object with its
> strand matching that of the subject. I would have expected the subject
> strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
> sounds like you want to merge the locateVariants() output with the
> input. Merging the output strand as LOCSTRAND on the VRanges sounds
> like a reasonable approach, for now. I don't know if Val is listening,
> but it sounds like it would be nice to have convenient functions for
> merging locateVariants() output with its input. The one for VRanges
> might do something like the above.
>
> Michael
>
> On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
>> Of course, the inclusion of strand would imply an interpretation of the
>> variant and its strand (e.g., "-") with respect to an annotated feature. I
>> can see a practical problem of integrity of the information on a VRanges
>> object, by which a mandatory column, such as strand, depends on a
>> non-mandatory column, such as some feature annotation stored as a metadata
>> column.
>>
>> A solution would be to add the transcript identifier (TXID) as mandatory
>> column on the VRanges object but I suspect this is a big change to do, so
>> adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
>> locateVariants) in the metadata columns of the VRanges object would allow me
>> to use a VRanges object as a container of variant x allele x sample x
>> annotation.
>>
>> Just to clear up the issue of merging strand and variant: a noisy variant (a
>> variant that is not silent) and has a, e.g., loss-of-function effect such as
>> the gain of a stop codon, is usually interpreted in the strand of the
>> transcript and coding sequence in which the stop codon is gained, saying
>> something like and A changed to a T producting the stop codon TAA. Ref and
>> alt alleles are called in the strand of the reference chromosome, so if the
>> transcript was annotated in the negative strand, we would know that we need
>> to reverse-complement ref and alt to interpret the variant, although I see
>> no need to do anything on the VRanges object to ref and alt because we know
>> they are always in the strand of the reference chromosome. Only if you want
>> to detect this stop-gain event (with predictCoding) then you would have to
>> reverse-complement the ref and alt alleles. Conversely, if the variant falls
>> in an intergenic region, then obviously the strand plays no role in the
>> interpretation of the variant and nothing needs to be done when interpreting
>> the ref and alt alleles.
>>
>>
>> On 6/11/15 5:47 PM, Michael Lawrence wrote:
>>> The fact that the position describes the variant, but the strand
>>> refers to the transcript is confusing to me. What is the concrete use
>>> case for merging the two features like that? VRanges constrains its
>>> strand for at least 2 reasons: (1) to be less error prone [of course
>>> this runs completely counter to flexibility] and (2) simplicity [we
>>> don't have to worry about what "-" means for ref/alt, overlap, etc].
>>>
>>> On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo <robert.castelo at upf.edu>
>>> wrote:
>>>> one option for me is just to add a metadata column with the strand of the
>>>> overlapping feature. however, i'm interested to fully understand the
>>>> rationale behind this aspect of the design of the VRanges object.
>>>>
>>>> a VRanges object unrolls variants in a VCF file per alternative allele
>>>> and
>>>> sample. variants in VCF files are obtained from tallying reads aligned on
>>>> a
>>>> reference genome. so, my understanding is that the reference allele is
>>>> the
>>>> allele of the reference genome against which the reads were aligned while
>>>> the alternate allele(s) are allele calls different from the reference.
>>>> from
>>>> this perspective, my interpretation is that ref and alt alleles have
>>>> already
>>>> a strand, which is the strand of the reference chromosome against which
>>>> the
>>>> reads were aligned to. i'm interested in this interpretation of the
>>>> strand
>>>> of the variants because i'm interested in the interpretation of
>>>> sequence-features containing the reference and the alternate alleles,
>>>> such
>>>> as differences in a binding site with the reference and the alternate
>>>> allele.
>>>>
>>>> if we relax the meaning of elements in a VRanges object to, not only
>>>> variants x allele x sample, but to variants x allele x sample x
>>>> annotated-feature, then i think it would make sense to have the
>>>> strand-specific annotation in the strand slot of the VRanges object.
>>>>
>>>> while this idea may be good or not for a number of reasons, i'm now
>>>> mostly
>>>> interested in knowing whether i'm misinterpreting the design of VRanges
>>>> objects, and maybe variant calling in general or i'm in a more or less
>>>> right
>>>> path in using a VRanges object to hold variant annotations.
>>>>
>>>>
>>>> thanks!!!
>>>>
>>>> robert.
>>>>
>>>>
>>>> On 06/11/2015 04:30 AM, Michael Lawrence wrote:
>>>>> I guess it depends on what the strand should mean. Would having a
>>>>> negative strand indicate that the ref/alt should be complemented? I'm
>>>>> not sure it's a good idea to conflate the strand of the variant itself
>>>>> with the strand of an overlapping feature.
>>>>>
>>>>> On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo<robert.castelo at upf.edu>
>>>>> wrote:
>>>>>> my understanding was that VRanges is a container for variants and
>>>>>> variant
>>>>>> annotations and strand is just one annotation more. when we use
>>>>>> locateVariants() a variant can be annotated to multiple transcripts
>>>>>> where
>>>>>> in
>>>>>> one overlaps an exon, in another an intron and so on. In all those
>>>>>> transcripts annotations there is a strand annotation, the strand of the
>>>>>> transcript. if the transcript is annotated in the negative strand of
>>>>>> the
>>>>>> reference chromosome then the annotation of a transcript region to a
>>>>>> variant
>>>>>> is going to be also on the negative strand.
>>>>>>
>>>>>> both locateVariants() and predictCoding() return GRanges objects with
>>>>>> strand
>>>>>> annotations according to the transcripts being annotated. I thought it
>>>>>> made
>>>>>> sense in VariantFiltering to use VRanges objects as a container for
>>>>>> variants and annotations and, for this reason, I would like to carry on
>>>>>> the
>>>>>> strand annotation into the VRanges object. Is there a strong reason for
>>>>>> a
>>>>>> VRanges object, derived from GRanges, not to have strand?
>>>>>>
>>>>>>
>>>>>> On 6/10/15 6:26 PM, Michael Lawrence wrote:
>>>>>>>
>>>>>>> VRanges is supposed to enforce strand. The goal is to use "*" always,
>>>>>>> for simplicity and consistency with the result of readVcf(). Is there
>>>>>>> a use case for negative strand variants?
>>>>>>>
>>>>>>> On Wed, Jun 10, 2015 at 5:54 AM, Robert
>>>>>>> Castelo<robert.castelo at upf.edu>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Michael,
>>>>>>>>
>>>>>>>> regarding our email exchange three weeks ago, I found a couple of
>>>>>>>> places
>>>>>>>> in
>>>>>>>> VariantAnnotation that IMO need to be updated to avoid enforcing
>>>>>>>> strand
>>>>>>>> on
>>>>>>>> VRanges.
>>>>>>>>
>>>>>>>> these places occur when constructing and validating VRanges objects,
>>>>>>>> concretely at:
>>>>>>>>
>>>>>>>> 1. file R/methods-VRanges-class.R at the VRanges class constructor:
>>>>>>>>
>>>>>>>> VRanges<-
>>>>>>>> function(seqnames = Rle(), ranges = IRanges(),
>>>>>>>> ref = character(), alt = NA_character_,
>>>>>>>> totalDepth = NA_integer_, refDepth = NA_integer_,
>>>>>>>> altDepth = NA_integer_, ..., sampleNames =
>>>>>>>> NA_character_,
>>>>>>>> softFilterMatrix = FilterMatrix(matrix(nrow =
>>>>>>>> length(gr),
>>>>>>>> ncol =
>>>>>>>> 0L),
>>>>>>>> FilterRules()),
>>>>>>>> hardFilters = FilterRules())
>>>>>>>> {
>>>>>>>> gr<- GRanges(seqnames, ranges,
>>>>>>>> strand = .rleRecycleVector("*", length(ranges)),
>>>>>>>> ...)
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> that precludes setting the strand at construction time:
>>>>>>>>
>>>>>>>> library(VariantAnnotation)
>>>>>>>> VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C",
>>>>>>>> strand="-")
>>>>>>>> Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
>>>>>>>> length(ranges)), :
>>>>>>>> formal argument "strand" matched by multiple actual arguments
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. R/AllClasses.R at the VRanges class validity function
>>>>>>>> .valid.VRanges():
>>>>>>>>
>>>>>>>> .valid.VRanges.strand<- function(object) {
>>>>>>>> if (any(object at strand == "-"))
>>>>>>>> paste("'strand' must always be '+' or '*'")
>>>>>>>> }
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> .valid.VRanges<- function(object)
>>>>>>>> {
>>>>>>>> c(.valid.VRanges.ref(object),
>>>>>>>> .valid.VRanges.alt(object),
>>>>>>>> .valid.VRanges.strand(object),
>>>>>>>> .valid.VRanges.depth(object))
>>>>>>>> }
>>>>>>>>
>>>>>>>> that prompts an error when variants annotated on the negative strand
>>>>>>>> are
>>>>>>>> detected:
>>>>>>>>
>>>>>>>> library(VariantAnnotation)
>>>>>>>> example(VRanges)
>>>>>>>> strand(vr)<- "-"
>>>>>>>> c(vr)
>>>>>>>> Error in validObject(.Object) :
>>>>>>>> invalid class “VRanges” object: 'strand' must always be '+' or
>>>>>>>> '*'
>>>>>>>>
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>>
>>>>>>>> robert.
>>>>>>>>
>>>>>>>> On 05/22/2015 09:49 PM, Michael Lawrence wrote:
>>>>>>>>>
>>>>>>>>> This changed recently. VariantAnnotation in devel no longer enforces
>>>>>>>>> a
>>>>>>>>> strand on VRanges, or at least it allows the "*" case.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, May 22, 2015 at 11:33 AM, Robert
>>>>>>>>> Castelo<robert.castelo at upf.edu
>>>>>>>>> <mailto:robert.castelo at upf.edu>> wrote:
>>>>>>>>>
>>>>>>>>> Hi,
>>>>>>>>>
>>>>>>>>> I have encountered myself in a strange situation when using
>>>>>>>>> the
>>>>>>>>> function locateVariants() from VariantAnnotation with an
>>>>>>>>> input
>>>>>>>>> VRanges object. The problem is that some of the expected
>>>>>>>>> coding
>>>>>>>>> annotations are not showing up when using locateVariants()
>>>>>>>>> with
>>>>>>>>> default parameters.
>>>>>>>>>
>>>>>>>>> After investigating this situation I think I found the
>>>>>>>>> reason,
>>>>>>>>> which
>>>>>>>>> does not look like a bug but I would like that you give me
>>>>>>>>> some
>>>>>>>>> clarification about the logic behind using locateVariants()
>>>>>>>>> with
>>>>>>>>> VRanges objects.
>>>>>>>>>
>>>>>>>>> The documentation of the VRanges-class says that in this
>>>>>>>>> class
>>>>>>>>> of
>>>>>>>>> objects "The strand is always constrained to be positive
>>>>>>>>> (+).".
>>>>>>>>> I
>>>>>>>>> guess there may be a good reason for this but I could not
>>>>>>>>> find
>>>>>>>>> it
>>>>>>>>> in
>>>>>>>>> the documentation or googling about it.
>>>>>>>>>
>>>>>>>>> This means that when you coerce a CollapsedVCF object
>>>>>>>>> (obtained,
>>>>>>>>> for
>>>>>>>>> example, from a VCF file via readVcf()) to a VRanges object,
>>>>>>>>> even
>>>>>>>>> though variants in the VCF may have no strand, they get a
>>>>>>>>> positive
>>>>>>>>> strand in the VRanges object.
>>>>>>>>>
>>>>>>>>> The problem arises then, when you call locateVariants() with
>>>>>>>>> this
>>>>>>>>> VRanges object, because features on the negative strand are
>>>>>>>>> never
>>>>>>>>> going to overlap with the variants since, by default, the
>>>>>>>>> argument
>>>>>>>>> ignore.strand=FALSE.
>>>>>>>>>
>>>>>>>>> Let me illustrate this with a toy example. Consider the SNP
>>>>>>>>> rs1129038
>>>>>>>>>
>>>>>>>>> (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038)
>>>>>>>>> at
>>>>>>>>> chr15:28356859 with allels A/G. It is located on the 3' UTR
>>>>>>>>> of
>>>>>>>>> the
>>>>>>>>> gene HERC2 coded on the negative strand of the human
>>>>>>>>> reference
>>>>>>>>> genome. Let's build a toy VRanges object having this variant:
>>>>>>>>>
>>>>>>>>> library(VariantAnnotation)
>>>>>>>>> vr<- VRanges(seqnames="chr15",
>>>>>>>>> ranges=IRanges(28356859, 28356859),
>>>>>>>>> ref="A", alt="G",
>>>>>>>>> refDepth=5, altDepth=7,
>>>>>>>>> totalDepth=12, sampleNames="A")
>>>>>>>>> strand(vr)
>>>>>>>>> factor-Rle of length 1 with 1 run
>>>>>>>>> Lengths: 1
>>>>>>>>> Values : +
>>>>>>>>> Levels(3): + - *
>>>>>>>>>
>>>>>>>>> Let's build now its CollapsedVCF counterpart by using the
>>>>>>>>> corresponding coercion method and set the strand to "*":
>>>>>>>>>
>>>>>>>>> vcf<- asVCF(vr)
>>>>>>>>> strand(vcf)<- "*"
>>>>>>>>>
>>>>>>>>> Now run locateVariants() on both objects with UCSC
>>>>>>>>> annotations:
>>>>>>>>>
>>>>>>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>>>>> txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>>>>>>
>>>>>>>>> locateVariants(vcf, txdb, region=AllVariants())
>>>>>>>>> GRanges object with 2 ranges and 9 metadata columns:
>>>>>>>>> seqnames ranges strand | LOCATION
>>>>>>>>> LOCSTART
>>>>>>>>> LOCEND QUERYID TXID CDSID
>>>>>>>>> <Rle> <IRanges> <Rle> |<factor> <integer> <integer>
>>>>>>>>> <integer>
>>>>>>>>> <character> <IntegerList>
>>>>>>>>> [1] chr15 [28356859, 28356859] * | threeUTR
>>>>>>>>> 50
>>>>>>>>> 50
>>>>>>>>> 1 55386
>>>>>>>>> [2] chr15 [28356859, 28356859] * | threeUTR
>>>>>>>>> 50
>>>>>>>>> 50
>>>>>>>>> 1 55387
>>>>>>>>> GENEID PRECEDEID FOLLOWID
>>>>>>>>> <character> <CharacterList> <CharacterList>
>>>>>>>>> [1] 8924
>>>>>>>>> [2] 8924
>>>>>>>>> -------
>>>>>>>>> seqinfo: 1 sequence from an unspecified genome; no
>>>>>>>>> seqlengths
>>>>>>>>>
>>>>>>>>> locateVariants(vr, txdb, region=AllVariants())
>>>>>>>>> GRanges object with 1 range and 9 metadata columns:
>>>>>>>>> seqnames ranges strand | LOCATION
>>>>>>>>> LOCSTART
>>>>>>>>> LOCEND QUERYID TXID CDSID
>>>>>>>>> <Rle> <IRanges> <Rle> |<factor> <integer> <integer>
>>>>>>>>> <integer>
>>>>>>>>> <integer> <IntegerList>
>>>>>>>>> [1] chr15 [28356859, 28356859] + | intergenic<NA>
>>>>>>>>> <NA>
>>>>>>>>> 1<NA>
>>>>>>>>> GENEID PRECEDEID
>>>>>>>>> FOLLOWID
>>>>>>>>> <character> <CharacterList> <CharacterList>
>>>>>>>>> [1]<NA> 100132565,100289656,100616223,... 2567
>>>>>>>>> -------
>>>>>>>>> seqinfo: 1 sequence from an unspecified genome; no
>>>>>>>>> seqlengths
>>>>>>>>>
>>>>>>>>> Note that while we get the 3' UTR annotation from the
>>>>>>>>> strandless
>>>>>>>>> VCF
>>>>>>>>> object we do not get it from the VRanges object with the
>>>>>>>>> positive
>>>>>>>>> strand. To make my point clear: this positive strand shows up
>>>>>>>>> when
>>>>>>>>> you coerce a strandless VCF object to a VRanges one, because
>>>>>>>>> positive strandness seems to be the convention for VRanges
>>>>>>>>> objects:
>>>>>>>>>
>>>>>>>>> as(vcf, VRanges)
>>>>>>>>> VRanges object with 1 range and 1 metadata column:
>>>>>>>>> seqnames ranges strand ref
>>>>>>>>> alt totalDepth refDepth altDepth
>>>>>>>>> <Rle> <IRanges> <Rle> <character> <characterOrRle>
>>>>>>>>> <integerOrRle>
>>>>>>>>> <integerOrRle> <integerOrRle>
>>>>>>>>> [1] chr15 [28356859, 28356859] + A
>>>>>>>>> G 12 5 7
>>>>>>>>> sampleNames softFilterMatrix | QUAL
>>>>>>>>> <factorOrRle> <matrix> |<numeric>
>>>>>>>>> [1] A |<NA>
>>>>>>>>> -------
>>>>>>>>> seqinfo: 1 sequence from an unspecified genome; no
>>>>>>>>> seqlengths
>>>>>>>>> hardFilters: NULL
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> Of course, if I run locateVariants() with the argument
>>>>>>>>> ignore.strand=TRUE, then I get the expected annotation:
>>>>>>>>>
>>>>>>>>> locateVariants(vr, txdb, region=AllVariants(),
>>>>>>>>> ignore.strand=TRUE)
>>>>>>>>> GRanges object with 2 ranges and 9 metadata columns:
>>>>>>>>> seqnames ranges strand | LOCATION
>>>>>>>>> LOCSTART
>>>>>>>>> LOCEND QUERYID TXID CDSID
>>>>>>>>> <Rle> <IRanges> <Rle> |<factor> <integer> <integer>
>>>>>>>>> <integer>
>>>>>>>>> <character> <IntegerList>
>>>>>>>>> [1] chr15 [28356859, 28356859] + | threeUTR
>>>>>>>>> 677
>>>>>>>>> 677 1 55386
>>>>>>>>> [2] chr15 [28356859, 28356859] + | threeUTR
>>>>>>>>> 677
>>>>>>>>> 677 1 55387
>>>>>>>>> GENEID PRECEDEID FOLLOWID
>>>>>>>>> <character> <CharacterList> <CharacterList>
>>>>>>>>> [1] 8924
>>>>>>>>> [2] 8924
>>>>>>>>> -------
>>>>>>>>> seqinfo: 1 sequence from an unspecified genome; no
>>>>>>>>> seqlengths
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> So, my question is, given that VRanges objects are enforced
>>>>>>>>> to
>>>>>>>>> have
>>>>>>>>> a positive strand, would not be better to have
>>>>>>>>> ignore.strand=TRUE
>>>>>>>>> as
>>>>>>>>> default in locateVariants?
>>>>>>>>>
>>>>>>>>> Alternatively, I would recommend that locateVariants() issues
>>>>>>>>> a
>>>>>>>>> warning, or maybe an error, when the input object is VRanges
>>>>>>>>> and
>>>>>>>>> ignore.strand=FALSE.
>>>>>>>>>
>>>>>>>>> Finally, out of curiosity, why a VRanges object enforces the
>>>>>>>>> positive strand in all its genomic ranges? Would not be
>>>>>>>>> better
>>>>>>>>> just
>>>>>>>>> taking the strand of the CollapsedVCF object when coercing
>>>>>>>>> the
>>>>>>>>> CollapsedVCF object to VRanges?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> thanks!!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> robert.
>>>>>>>>>
>>>>>>>>> _______________________________________________
>>>>>>>>> Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org>
>>>>>>>>> mailing
>>>>>>>>> list
>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Castelo, PhD
>>>>>>>> Associate Professor
>>>>>>>> Dept. of Experimental and Health Sciences
>>>>>>>> Universitat Pompeu Fabra (UPF)
>>>>>>>> Barcelona Biomedical Research Park (PRBB)
>>>>>>>> Dr Aiguader 88
>>>>>>>> E-08003 Barcelona, Spain
>>>>>>>> telf: +34.933.160.514
>>>>>>>> fax: +34.933.160.550
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>>
>>>> --
>>>> Robert Castelo, PhD
>>>> Associate Professor
>>>> Dept. of Experimental and Health Sciences
>>>> Universitat Pompeu Fabra (UPF)
>>>> Barcelona Biomedical Research Park (PRBB)
>>>> Dr Aiguader 88
>>>> E-08003 Barcelona, Spain
>>>> telf: +34.933.160.514
>>>> fax: +34.933.160.550
>>
More information about the Bioc-devel
mailing list