[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Robert Castelo robert.castelo at upf.edu
Thu Jun 11 18:53:10 CEST 2015


In fact, a feature like locateVariants() returning an object of the same 
class of the input query argument would be (IMO) a useful feature at 
least in the case of an input CollapsedVCF object, because the user 
could then dump directly the annotations to a VCF file with writeVcf(), 
which is one way to store variant annotations into files.

On 6/11/15 6:38 PM, Michael Lawrence wrote:
> I didn't realize that locateVariants() returned an object with its
> strand matching that of the subject. I would have expected the subject
> strand to be stored in a LOCSTRAND column, as you suggest. Anyway, it
> sounds like you want to merge the locateVariants() output with the
> input. Merging the output strand as LOCSTRAND on the VRanges sounds
> like a reasonable approach, for now. I don't know if Val is listening,
> but it sounds like it would be nice to have convenient functions for
> merging locateVariants() output with its input. The one for VRanges
> might do something like the above.
>
> Michael
>
> On Thu, Jun 11, 2015 at 9:14 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
>> Of course, the inclusion of strand would imply an interpretation of the
>> variant and its strand (e.g., "-") with respect to an annotated feature. I
>> can see a practical problem of integrity of the information on a VRanges
>> object, by which a mandatory column, such as strand, depends on a
>> non-mandatory column, such as some feature annotation stored as a metadata
>> column.
>>
>> A solution would be to add the transcript identifier (TXID) as mandatory
>> column on the VRanges object but I suspect this is a big change to do, so
>> adding a LOCSTRAND column (next to LOCSTART and LOCEND generated by
>> locateVariants) in the metadata columns of the VRanges object would allow me
>> to use a VRanges object as a container of variant x allele x sample x
>> annotation.
>>
>> Just to clear up the issue of merging strand and variant: a noisy variant (a
>> variant that is not silent) and has a, e.g., loss-of-function effect such as
>> the gain of a stop codon, is usually interpreted in the strand of the
>> transcript and coding sequence in which the stop codon is gained, saying
>> something like and A changed to a T producting the stop codon TAA. Ref and
>> alt alleles are called in the strand of the reference chromosome, so if the
>> transcript was annotated in the negative strand, we would know that we need
>> to reverse-complement ref and alt to interpret the variant, although I see
>> no need to do anything on the VRanges object to ref and alt because we know
>> they are always in the strand of the reference chromosome. Only if you want
>> to detect this stop-gain event (with predictCoding) then you would have to
>> reverse-complement the ref and alt alleles. Conversely, if the variant falls
>> in an intergenic region, then obviously the strand plays no role in the
>> interpretation of the variant and nothing needs to be done when interpreting
>> the ref and alt alleles.
>>
>>
>> On 6/11/15 5:47 PM, Michael Lawrence wrote:
>>> The fact that the position describes the variant, but the strand
>>> refers to the transcript is confusing to me. What is the concrete use
>>> case for merging the two features like that? VRanges constrains its
>>> strand for at least 2 reasons: (1) to be less error prone [of course
>>> this runs completely counter to flexibility] and (2) simplicity [we
>>> don't have to worry about what "-" means for ref/alt, overlap, etc].
>>>
>>> On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo <robert.castelo at upf.edu>
>>> wrote:
>>>> one option for me is just to add a metadata column with the strand of the
>>>> overlapping feature. however, i'm interested to fully understand the
>>>> rationale behind this aspect of the design of the VRanges object.
>>>>
>>>> a VRanges object unrolls variants in a VCF file per alternative allele
>>>> and
>>>> sample. variants in VCF files are obtained from tallying reads aligned on
>>>> a
>>>> reference genome. so, my understanding is that the reference allele is
>>>> the
>>>> allele of the reference genome against which the reads were aligned while
>>>> the alternate allele(s) are allele calls different from the reference.
>>>> from
>>>> this perspective, my interpretation is that ref and alt alleles have
>>>> already
>>>> a strand, which is the strand of the reference chromosome against which
>>>> the
>>>> reads were aligned to. i'm interested in this interpretation of the
>>>> strand
>>>> of the variants because i'm interested in the interpretation of
>>>> sequence-features containing the reference and the alternate alleles,
>>>> such
>>>> as differences in a binding site with the reference and the alternate
>>>> allele.
>>>>
>>>> if we relax the meaning of elements in a VRanges object to, not only
>>>> variants x allele x sample, but to variants x allele x sample x
>>>> annotated-feature, then i think it would make sense to have the
>>>> strand-specific annotation in the strand slot of the VRanges object.
>>>>
>>>> while this idea may be good or not for a number of reasons, i'm now
>>>> mostly
>>>> interested in knowing whether i'm misinterpreting the design of VRanges
>>>> objects, and maybe variant calling in general or i'm in a more or less
>>>> right
>>>> path in using a VRanges object to hold variant annotations.
>>>>
>>>>
>>>> thanks!!!
>>>>
>>>> robert.
>>>>
>>>>
>>>> On 06/11/2015 04:30 AM, Michael Lawrence wrote:
>>>>> I guess it depends on what the strand should mean. Would having a
>>>>> negative strand indicate that the ref/alt should be complemented? I'm
>>>>> not sure it's a good idea to conflate the strand of the variant itself
>>>>> with the strand of an overlapping feature.
>>>>>
>>>>> On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo<robert.castelo at upf.edu>
>>>>> wrote:
>>>>>> my understanding was that VRanges is a container for variants and
>>>>>> variant
>>>>>> annotations and strand is just one annotation more. when we use
>>>>>> locateVariants() a variant can be annotated to multiple transcripts
>>>>>> where
>>>>>> in
>>>>>> one overlaps an exon, in another an intron and so on. In all those
>>>>>> transcripts annotations there is a strand annotation, the strand of the
>>>>>> transcript. if the transcript is annotated in the negative strand of
>>>>>> the
>>>>>> reference chromosome then the annotation of a transcript region to a
>>>>>> variant
>>>>>> is going to be also on the negative strand.
>>>>>>
>>>>>> both locateVariants() and predictCoding() return GRanges objects with
>>>>>> strand
>>>>>> annotations according to the transcripts being annotated. I thought it
>>>>>> made
>>>>>> sense in VariantFiltering to use VRanges objects as a  container for
>>>>>> variants and annotations and, for this reason, I would like to carry on
>>>>>> the
>>>>>> strand annotation into the VRanges object. Is there a strong reason for
>>>>>> a
>>>>>> VRanges object, derived from GRanges, not to have strand?
>>>>>>
>>>>>>
>>>>>> On 6/10/15 6:26 PM, Michael Lawrence wrote:
>>>>>>>
>>>>>>> VRanges is supposed to enforce strand. The goal is to use "*" always,
>>>>>>> for simplicity and consistency with the result of readVcf(). Is there
>>>>>>> a use case for negative strand variants?
>>>>>>>
>>>>>>> On Wed, Jun 10, 2015 at 5:54 AM, Robert
>>>>>>> Castelo<robert.castelo at upf.edu>
>>>>>>> wrote:
>>>>>>>>
>>>>>>>> Michael,
>>>>>>>>
>>>>>>>> regarding our email exchange three weeks ago, I found a couple of
>>>>>>>> places
>>>>>>>> in
>>>>>>>> VariantAnnotation that IMO need to be updated to avoid enforcing
>>>>>>>> strand
>>>>>>>> on
>>>>>>>> VRanges.
>>>>>>>>
>>>>>>>> these places occur when constructing and validating VRanges objects,
>>>>>>>> concretely at:
>>>>>>>>
>>>>>>>> 1. file R/methods-VRanges-class.R at the VRanges class constructor:
>>>>>>>>
>>>>>>>> VRanges<-
>>>>>>>>       function(seqnames = Rle(), ranges = IRanges(),
>>>>>>>>                ref = character(), alt = NA_character_,
>>>>>>>>                totalDepth = NA_integer_, refDepth = NA_integer_,
>>>>>>>>                altDepth = NA_integer_, ..., sampleNames =
>>>>>>>> NA_character_,
>>>>>>>>                softFilterMatrix = FilterMatrix(matrix(nrow =
>>>>>>>> length(gr),
>>>>>>>> ncol =
>>>>>>>> 0L),
>>>>>>>>                  FilterRules()),
>>>>>>>>                hardFilters = FilterRules())
>>>>>>>> {
>>>>>>>>       gr<- GRanges(seqnames, ranges,
>>>>>>>>                     strand = .rleRecycleVector("*", length(ranges)),
>>>>>>>> ...)
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> that precludes setting the strand at construction time:
>>>>>>>>
>>>>>>>> library(VariantAnnotation)
>>>>>>>> VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C",
>>>>>>>> strand="-")
>>>>>>>> Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
>>>>>>>> length(ranges)),  :
>>>>>>>>       formal argument "strand" matched by multiple actual arguments
>>>>>>>>
>>>>>>>>
>>>>>>>> 2. R/AllClasses.R at the VRanges class validity function
>>>>>>>> .valid.VRanges():
>>>>>>>>
>>>>>>>> .valid.VRanges.strand<- function(object) {
>>>>>>>>       if (any(object at strand == "-"))
>>>>>>>>         paste("'strand' must always be '+' or '*'")
>>>>>>>> }
>>>>>>>>
>>>>>>>> [...]
>>>>>>>>
>>>>>>>> .valid.VRanges<- function(object)
>>>>>>>> {
>>>>>>>>       c(.valid.VRanges.ref(object),
>>>>>>>>         .valid.VRanges.alt(object),
>>>>>>>>         .valid.VRanges.strand(object),
>>>>>>>>         .valid.VRanges.depth(object))
>>>>>>>> }
>>>>>>>>
>>>>>>>> that prompts an error when variants annotated on the negative strand
>>>>>>>> are
>>>>>>>> detected:
>>>>>>>>
>>>>>>>> library(VariantAnnotation)
>>>>>>>> example(VRanges)
>>>>>>>> strand(vr)<- "-"
>>>>>>>> c(vr)
>>>>>>>> Error in validObject(.Object) :
>>>>>>>>       invalid class “VRanges” object: 'strand' must always be '+' or
>>>>>>>> '*'
>>>>>>>>
>>>>>>>>
>>>>>>>> cheers,
>>>>>>>>
>>>>>>>> robert.
>>>>>>>>
>>>>>>>> On 05/22/2015 09:49 PM, Michael Lawrence wrote:
>>>>>>>>>
>>>>>>>>> This changed recently. VariantAnnotation in devel no longer enforces
>>>>>>>>> a
>>>>>>>>> strand on VRanges, or at least it allows the "*" case.
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, May 22, 2015 at 11:33 AM, Robert
>>>>>>>>> Castelo<robert.castelo at upf.edu
>>>>>>>>> <mailto:robert.castelo at upf.edu>>  wrote:
>>>>>>>>>
>>>>>>>>>         Hi,
>>>>>>>>>
>>>>>>>>>         I have encountered myself in a strange situation when using
>>>>>>>>> the
>>>>>>>>>         function locateVariants() from VariantAnnotation with an
>>>>>>>>> input
>>>>>>>>>         VRanges object. The problem is that some of the expected
>>>>>>>>> coding
>>>>>>>>>         annotations are not showing up when using locateVariants()
>>>>>>>>> with
>>>>>>>>>         default parameters.
>>>>>>>>>
>>>>>>>>>         After investigating this situation I think I found the
>>>>>>>>> reason,
>>>>>>>>> which
>>>>>>>>>         does not look like a bug but I would like that you give me
>>>>>>>>> some
>>>>>>>>>         clarification about the logic behind using locateVariants()
>>>>>>>>> with
>>>>>>>>>         VRanges objects.
>>>>>>>>>
>>>>>>>>>         The documentation of the VRanges-class says that in this
>>>>>>>>> class
>>>>>>>>> of
>>>>>>>>>         objects "The strand is always constrained to be positive
>>>>>>>>> (+).".
>>>>>>>>> I
>>>>>>>>>         guess there may be a good reason for this but I could not
>>>>>>>>> find
>>>>>>>>> it
>>>>>>>>> in
>>>>>>>>>         the documentation or googling about it.
>>>>>>>>>
>>>>>>>>>         This means that when you coerce a CollapsedVCF object
>>>>>>>>> (obtained,
>>>>>>>>> for
>>>>>>>>>         example, from a VCF file via readVcf()) to a VRanges object,
>>>>>>>>> even
>>>>>>>>>         though variants in the VCF may have no strand, they get a
>>>>>>>>> positive
>>>>>>>>>         strand in the VRanges object.
>>>>>>>>>
>>>>>>>>>         The problem arises then, when you call locateVariants() with
>>>>>>>>> this
>>>>>>>>>         VRanges object, because features on the negative strand are
>>>>>>>>> never
>>>>>>>>>         going to overlap with the variants since, by default, the
>>>>>>>>> argument
>>>>>>>>>         ignore.strand=FALSE.
>>>>>>>>>
>>>>>>>>>         Let me illustrate this with a toy example. Consider the SNP
>>>>>>>>>         rs1129038
>>>>>>>>>
>>>>>>>>> (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038)
>>>>>>>>> at
>>>>>>>>>         chr15:28356859 with allels A/G. It is located on the 3' UTR
>>>>>>>>> of
>>>>>>>>> the
>>>>>>>>>         gene HERC2 coded on the negative strand of the human
>>>>>>>>> reference
>>>>>>>>>         genome. Let's build a toy VRanges object having this variant:
>>>>>>>>>
>>>>>>>>>         library(VariantAnnotation)
>>>>>>>>>         vr<- VRanges(seqnames="chr15",
>>>>>>>>>                        ranges=IRanges(28356859, 28356859),
>>>>>>>>>                        ref="A", alt="G",
>>>>>>>>>                        refDepth=5, altDepth=7,
>>>>>>>>>                        totalDepth=12, sampleNames="A")
>>>>>>>>>         strand(vr)
>>>>>>>>>         factor-Rle of length 1 with 1 run
>>>>>>>>>            Lengths: 1
>>>>>>>>>            Values : +
>>>>>>>>>         Levels(3): + - *
>>>>>>>>>
>>>>>>>>>         Let's build now its CollapsedVCF counterpart by using the
>>>>>>>>>         corresponding coercion method and set the strand to "*":
>>>>>>>>>
>>>>>>>>>         vcf<- asVCF(vr)
>>>>>>>>>         strand(vcf)<- "*"
>>>>>>>>>
>>>>>>>>>         Now run locateVariants() on both objects with UCSC
>>>>>>>>> annotations:
>>>>>>>>>
>>>>>>>>>         library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>>>>>         txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>>>>>>
>>>>>>>>>         locateVariants(vcf, txdb, region=AllVariants())
>>>>>>>>>         GRanges object with 2 ranges and 9 metadata columns:
>>>>>>>>>                seqnames               ranges strand | LOCATION
>>>>>>>>> LOCSTART
>>>>>>>>>         LOCEND   QUERYID        TXID         CDSID
>>>>>>>>>         <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
>>>>>>>>> <integer>
>>>>>>>>>         <character>  <IntegerList>
>>>>>>>>>            [1]    chr15 [28356859, 28356859]      * | threeUTR
>>>>>>>>> 50
>>>>>>>>> 50
>>>>>>>>>                 1       55386
>>>>>>>>>            [2]    chr15 [28356859, 28356859]      * | threeUTR
>>>>>>>>> 50
>>>>>>>>> 50
>>>>>>>>>                 1       55387
>>>>>>>>>                     GENEID       PRECEDEID        FOLLOWID
>>>>>>>>>         <character>  <CharacterList>  <CharacterList>
>>>>>>>>>            [1]        8924
>>>>>>>>>            [2]        8924
>>>>>>>>>            -------
>>>>>>>>>            seqinfo: 1 sequence from an unspecified genome; no
>>>>>>>>> seqlengths
>>>>>>>>>
>>>>>>>>>         locateVariants(vr, txdb, region=AllVariants())
>>>>>>>>>         GRanges object with 1 range and 9 metadata columns:
>>>>>>>>>                seqnames               ranges strand |   LOCATION
>>>>>>>>> LOCSTART
>>>>>>>>>         LOCEND   QUERYID      TXID         CDSID
>>>>>>>>>         <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
>>>>>>>>> <integer>
>>>>>>>>>         <integer>  <IntegerList>
>>>>>>>>>            [1]    chr15 [28356859, 28356859]      + | intergenic<NA>
>>>>>>>>> <NA>
>>>>>>>>>                 1<NA>
>>>>>>>>>                     GENEID                         PRECEDEID
>>>>>>>>> FOLLOWID
>>>>>>>>>         <character>  <CharacterList>  <CharacterList>
>>>>>>>>>            [1]<NA>  100132565,100289656,100616223,...            2567
>>>>>>>>>            -------
>>>>>>>>>            seqinfo: 1 sequence from an unspecified genome; no
>>>>>>>>> seqlengths
>>>>>>>>>
>>>>>>>>>         Note that while we get the 3' UTR annotation from the
>>>>>>>>> strandless
>>>>>>>>> VCF
>>>>>>>>>         object we do not get it from the VRanges object with the
>>>>>>>>> positive
>>>>>>>>>         strand. To make my point clear: this positive strand shows up
>>>>>>>>> when
>>>>>>>>>         you coerce a strandless VCF object to a VRanges one, because
>>>>>>>>>         positive strandness seems to be the convention for VRanges
>>>>>>>>> objects:
>>>>>>>>>
>>>>>>>>>         as(vcf, VRanges)
>>>>>>>>>         VRanges object with 1 range and 1 metadata column:
>>>>>>>>>                seqnames               ranges strand         ref
>>>>>>>>>         alt     totalDepth       refDepth       altDepth
>>>>>>>>>         <Rle>  <IRanges>  <Rle>  <character>  <characterOrRle>
>>>>>>>>> <integerOrRle>
>>>>>>>>>         <integerOrRle>  <integerOrRle>
>>>>>>>>>            [1]    chr15 [28356859, 28356859]      +           A
>>>>>>>>>            G             12              5              7
>>>>>>>>>                  sampleNames softFilterMatrix |      QUAL
>>>>>>>>>         <factorOrRle>  <matrix>  |<numeric>
>>>>>>>>>            [1]             A                  |<NA>
>>>>>>>>>            -------
>>>>>>>>>            seqinfo: 1 sequence from an unspecified genome; no
>>>>>>>>> seqlengths
>>>>>>>>>            hardFilters: NULL
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>         Of course, if I run locateVariants() with the argument
>>>>>>>>>         ignore.strand=TRUE, then I get the expected annotation:
>>>>>>>>>
>>>>>>>>>         locateVariants(vr, txdb, region=AllVariants(),
>>>>>>>>> ignore.strand=TRUE)
>>>>>>>>>         GRanges object with 2 ranges and 9 metadata columns:
>>>>>>>>>                seqnames               ranges strand | LOCATION
>>>>>>>>> LOCSTART
>>>>>>>>>         LOCEND   QUERYID        TXID         CDSID
>>>>>>>>>         <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
>>>>>>>>> <integer>
>>>>>>>>>         <character>  <IntegerList>
>>>>>>>>>            [1]    chr15 [28356859, 28356859]      + | threeUTR
>>>>>>>>> 677
>>>>>>>>>         677         1       55386
>>>>>>>>>            [2]    chr15 [28356859, 28356859]      + | threeUTR
>>>>>>>>> 677
>>>>>>>>>         677         1       55387
>>>>>>>>>                     GENEID       PRECEDEID        FOLLOWID
>>>>>>>>>         <character>  <CharacterList>  <CharacterList>
>>>>>>>>>            [1]        8924
>>>>>>>>>            [2]        8924
>>>>>>>>>            -------
>>>>>>>>>            seqinfo: 1 sequence from an unspecified genome; no
>>>>>>>>> seqlengths
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>         So, my question is, given that VRanges objects are enforced
>>>>>>>>> to
>>>>>>>>> have
>>>>>>>>>         a positive strand, would not be better to have
>>>>>>>>> ignore.strand=TRUE
>>>>>>>>> as
>>>>>>>>>         default in locateVariants?
>>>>>>>>>
>>>>>>>>>         Alternatively, I would recommend that locateVariants() issues
>>>>>>>>> a
>>>>>>>>>         warning, or maybe an error, when the input object is VRanges
>>>>>>>>> and
>>>>>>>>>         ignore.strand=FALSE.
>>>>>>>>>
>>>>>>>>>         Finally, out of curiosity, why a VRanges object enforces the
>>>>>>>>>         positive strand in all its genomic ranges? Would not be
>>>>>>>>> better
>>>>>>>>> just
>>>>>>>>>         taking the strand of the CollapsedVCF object when coercing
>>>>>>>>> the
>>>>>>>>>         CollapsedVCF object to VRanges?
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>         thanks!!
>>>>>>>>>
>>>>>>>>>
>>>>>>>>>         robert.
>>>>>>>>>
>>>>>>>>>         _______________________________________________
>>>>>>>>>         Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org>
>>>>>>>>> mailing
>>>>>>>>> list
>>>>>>>>>         https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>>>>
>>>>>>>>>
>>>>>>>> --
>>>>>>>> Robert Castelo, PhD
>>>>>>>> Associate Professor
>>>>>>>> Dept. of Experimental and Health Sciences
>>>>>>>> Universitat Pompeu Fabra (UPF)
>>>>>>>> Barcelona Biomedical Research Park (PRBB)
>>>>>>>> Dr Aiguader 88
>>>>>>>> E-08003 Barcelona, Spain
>>>>>>>> telf: +34.933.160.514
>>>>>>>> fax: +34.933.160.550
>>>>>>>>
>>>>>>>> _______________________________________________
>>>>>>>> Bioc-devel at r-project.org mailing list
>>>>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>>
>>>> --
>>>> Robert Castelo, PhD
>>>> Associate Professor
>>>> Dept. of Experimental and Health Sciences
>>>> Universitat Pompeu Fabra (UPF)
>>>> Barcelona Biomedical Research Park (PRBB)
>>>> Dr Aiguader 88
>>>> E-08003 Barcelona, Spain
>>>> telf: +34.933.160.514
>>>> fax: +34.933.160.550
>>



More information about the Bioc-devel mailing list