[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Michael Lawrence lawrence.michael at gene.com
Thu Jun 11 17:47:16 CEST 2015


The fact that the position describes the variant, but the strand
refers to the transcript is confusing to me. What is the concrete use
case for merging the two features like that? VRanges constrains its
strand for at least 2 reasons: (1) to be less error prone [of course
this runs completely counter to flexibility] and (2) simplicity [we
don't have to worry about what "-" means for ref/alt, overlap, etc].

On Thu, Jun 11, 2015 at 6:05 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
> one option for me is just to add a metadata column with the strand of the
> overlapping feature. however, i'm interested to fully understand the
> rationale behind this aspect of the design of the VRanges object.
>
> a VRanges object unrolls variants in a VCF file per alternative allele and
> sample. variants in VCF files are obtained from tallying reads aligned on a
> reference genome. so, my understanding is that the reference allele is the
> allele of the reference genome against which the reads were aligned while
> the alternate allele(s) are allele calls different from the reference. from
> this perspective, my interpretation is that ref and alt alleles have already
> a strand, which is the strand of the reference chromosome against which the
> reads were aligned to. i'm interested in this interpretation of the strand
> of the variants because i'm interested in the interpretation of
> sequence-features containing the reference and the alternate alleles, such
> as differences in a binding site with the reference and the alternate
> allele.
>
> if we relax the meaning of elements in a VRanges object to, not only
> variants x allele x sample, but to variants x allele x sample x
> annotated-feature, then i think it would make sense to have the
> strand-specific annotation in the strand slot of the VRanges object.
>
> while this idea may be good or not for a number of reasons, i'm now mostly
> interested in knowing whether i'm misinterpreting the design of VRanges
> objects, and maybe variant calling in general or i'm in a more or less right
> path in using a VRanges object to hold variant annotations.
>
>
> thanks!!!
>
> robert.
>
>
> On 06/11/2015 04:30 AM, Michael Lawrence wrote:
>>
>> I guess it depends on what the strand should mean. Would having a
>> negative strand indicate that the ref/alt should be complemented? I'm
>> not sure it's a good idea to conflate the strand of the variant itself
>> with the strand of an overlapping feature.
>>
>> On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo<robert.castelo at upf.edu>
>> wrote:
>>>
>>> my understanding was that VRanges is a container for variants and variant
>>> annotations and strand is just one annotation more. when we use
>>> locateVariants() a variant can be annotated to multiple transcripts where
>>> in
>>> one overlaps an exon, in another an intron and so on. In all those
>>> transcripts annotations there is a strand annotation, the strand of the
>>> transcript. if the transcript is annotated in the negative strand of the
>>> reference chromosome then the annotation of a transcript region to a
>>> variant
>>> is going to be also on the negative strand.
>>>
>>> both locateVariants() and predictCoding() return GRanges objects with
>>> strand
>>> annotations according to the transcripts being annotated. I thought it
>>> made
>>> sense in VariantFiltering to use VRanges objects as a  container for
>>> variants and annotations and, for this reason, I would like to carry on
>>> the
>>> strand annotation into the VRanges object. Is there a strong reason for a
>>> VRanges object, derived from GRanges, not to have strand?
>>>
>>>
>>> On 6/10/15 6:26 PM, Michael Lawrence wrote:
>>>>
>>>>
>>>> VRanges is supposed to enforce strand. The goal is to use "*" always,
>>>> for simplicity and consistency with the result of readVcf(). Is there
>>>> a use case for negative strand variants?
>>>>
>>>> On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo<robert.castelo at upf.edu>
>>>> wrote:
>>>>>
>>>>>
>>>>> Michael,
>>>>>
>>>>> regarding our email exchange three weeks ago, I found a couple of
>>>>> places
>>>>> in
>>>>> VariantAnnotation that IMO need to be updated to avoid enforcing strand
>>>>> on
>>>>> VRanges.
>>>>>
>>>>> these places occur when constructing and validating VRanges objects,
>>>>> concretely at:
>>>>>
>>>>> 1. file R/methods-VRanges-class.R at the VRanges class constructor:
>>>>>
>>>>> VRanges<-
>>>>>     function(seqnames = Rle(), ranges = IRanges(),
>>>>>              ref = character(), alt = NA_character_,
>>>>>              totalDepth = NA_integer_, refDepth = NA_integer_,
>>>>>              altDepth = NA_integer_, ..., sampleNames = NA_character_,
>>>>>              softFilterMatrix = FilterMatrix(matrix(nrow = length(gr),
>>>>> ncol =
>>>>> 0L),
>>>>>                FilterRules()),
>>>>>              hardFilters = FilterRules())
>>>>> {
>>>>>     gr<- GRanges(seqnames, ranges,
>>>>>                   strand = .rleRecycleVector("*", length(ranges)), ...)
>>>>> [...]
>>>>>
>>>>> that precludes setting the strand at construction time:
>>>>>
>>>>> library(VariantAnnotation)
>>>>> VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C",
>>>>> strand="-")
>>>>> Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
>>>>> length(ranges)),  :
>>>>>     formal argument "strand" matched by multiple actual arguments
>>>>>
>>>>>
>>>>> 2. R/AllClasses.R at the VRanges class validity function
>>>>> .valid.VRanges():
>>>>>
>>>>> .valid.VRanges.strand<- function(object) {
>>>>>     if (any(object at strand == "-"))
>>>>>       paste("'strand' must always be '+' or '*'")
>>>>> }
>>>>>
>>>>> [...]
>>>>>
>>>>> .valid.VRanges<- function(object)
>>>>> {
>>>>>     c(.valid.VRanges.ref(object),
>>>>>       .valid.VRanges.alt(object),
>>>>>       .valid.VRanges.strand(object),
>>>>>       .valid.VRanges.depth(object))
>>>>> }
>>>>>
>>>>> that prompts an error when variants annotated on the negative strand
>>>>> are
>>>>> detected:
>>>>>
>>>>> library(VariantAnnotation)
>>>>> example(VRanges)
>>>>> strand(vr)<- "-"
>>>>> c(vr)
>>>>> Error in validObject(.Object) :
>>>>>     invalid class “VRanges” object: 'strand' must always be '+' or '*'
>>>>>
>>>>>
>>>>> cheers,
>>>>>
>>>>> robert.
>>>>>
>>>>> On 05/22/2015 09:49 PM, Michael Lawrence wrote:
>>>>>>
>>>>>>
>>>>>> This changed recently. VariantAnnotation in devel no longer enforces a
>>>>>> strand on VRanges, or at least it allows the "*" case.
>>>>>>
>>>>>>
>>>>>> On Fri, May 22, 2015 at 11:33 AM, Robert
>>>>>> Castelo<robert.castelo at upf.edu
>>>>>> <mailto:robert.castelo at upf.edu>>  wrote:
>>>>>>
>>>>>>       Hi,
>>>>>>
>>>>>>       I have encountered myself in a strange situation when using the
>>>>>>       function locateVariants() from VariantAnnotation with an input
>>>>>>       VRanges object. The problem is that some of the expected coding
>>>>>>       annotations are not showing up when using locateVariants() with
>>>>>>       default parameters.
>>>>>>
>>>>>>       After investigating this situation I think I found the reason,
>>>>>> which
>>>>>>       does not look like a bug but I would like that you give me some
>>>>>>       clarification about the logic behind using locateVariants() with
>>>>>>       VRanges objects.
>>>>>>
>>>>>>       The documentation of the VRanges-class says that in this class
>>>>>> of
>>>>>>       objects "The strand is always constrained to be positive (+).".
>>>>>> I
>>>>>>       guess there may be a good reason for this but I could not find
>>>>>> it
>>>>>> in
>>>>>>       the documentation or googling about it.
>>>>>>
>>>>>>       This means that when you coerce a CollapsedVCF object (obtained,
>>>>>> for
>>>>>>       example, from a VCF file via readVcf()) to a VRanges object,
>>>>>> even
>>>>>>       though variants in the VCF may have no strand, they get a
>>>>>> positive
>>>>>>       strand in the VRanges object.
>>>>>>
>>>>>>       The problem arises then, when you call locateVariants() with
>>>>>> this
>>>>>>       VRanges object, because features on the negative strand are
>>>>>> never
>>>>>>       going to overlap with the variants since, by default, the
>>>>>> argument
>>>>>>       ignore.strand=FALSE.
>>>>>>
>>>>>>       Let me illustrate this with a toy example. Consider the SNP
>>>>>>       rs1129038
>>>>>>
>>>>>> (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038)
>>>>>> at
>>>>>>       chr15:28356859 with allels A/G. It is located on the 3' UTR of
>>>>>> the
>>>>>>       gene HERC2 coded on the negative strand of the human reference
>>>>>>       genome. Let's build a toy VRanges object having this variant:
>>>>>>
>>>>>>       library(VariantAnnotation)
>>>>>>       vr<- VRanges(seqnames="chr15",
>>>>>>                      ranges=IRanges(28356859, 28356859),
>>>>>>                      ref="A", alt="G",
>>>>>>                      refDepth=5, altDepth=7,
>>>>>>                      totalDepth=12, sampleNames="A")
>>>>>>       strand(vr)
>>>>>>       factor-Rle of length 1 with 1 run
>>>>>>          Lengths: 1
>>>>>>          Values : +
>>>>>>       Levels(3): + - *
>>>>>>
>>>>>>       Let's build now its CollapsedVCF counterpart by using the
>>>>>>       corresponding coercion method and set the strand to "*":
>>>>>>
>>>>>>       vcf<- asVCF(vr)
>>>>>>       strand(vcf)<- "*"
>>>>>>
>>>>>>       Now run locateVariants() on both objects with UCSC annotations:
>>>>>>
>>>>>>       library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>>>       txdb<- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>>>
>>>>>>       locateVariants(vcf, txdb, region=AllVariants())
>>>>>>       GRanges object with 2 ranges and 9 metadata columns:
>>>>>>              seqnames               ranges strand | LOCATION  LOCSTART
>>>>>>       LOCEND   QUERYID        TXID         CDSID
>>>>>>       <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
>>>>>> <integer>
>>>>>>       <character>  <IntegerList>
>>>>>>          [1]    chr15 [28356859, 28356859]      * | threeUTR        50
>>>>>> 50
>>>>>>               1       55386
>>>>>>          [2]    chr15 [28356859, 28356859]      * | threeUTR        50
>>>>>> 50
>>>>>>               1       55387
>>>>>>                   GENEID       PRECEDEID        FOLLOWID
>>>>>>       <character>  <CharacterList>  <CharacterList>
>>>>>>          [1]        8924
>>>>>>          [2]        8924
>>>>>>          -------
>>>>>>          seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>>>>
>>>>>>       locateVariants(vr, txdb, region=AllVariants())
>>>>>>       GRanges object with 1 range and 9 metadata columns:
>>>>>>              seqnames               ranges strand |   LOCATION
>>>>>> LOCSTART
>>>>>>       LOCEND   QUERYID      TXID         CDSID
>>>>>>       <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
>>>>>> <integer>
>>>>>>       <integer>  <IntegerList>
>>>>>>          [1]    chr15 [28356859, 28356859]      + | intergenic<NA>
>>>>>> <NA>
>>>>>>               1<NA>
>>>>>>                   GENEID                         PRECEDEID
>>>>>> FOLLOWID
>>>>>>       <character>  <CharacterList>  <CharacterList>
>>>>>>          [1]<NA>  100132565,100289656,100616223,...            2567
>>>>>>          -------
>>>>>>          seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>>>>
>>>>>>       Note that while we get the 3' UTR annotation from the strandless
>>>>>> VCF
>>>>>>       object we do not get it from the VRanges object with the
>>>>>> positive
>>>>>>       strand. To make my point clear: this positive strand shows up
>>>>>> when
>>>>>>       you coerce a strandless VCF object to a VRanges one, because
>>>>>>       positive strandness seems to be the convention for VRanges
>>>>>> objects:
>>>>>>
>>>>>>       as(vcf, VRanges)
>>>>>>       VRanges object with 1 range and 1 metadata column:
>>>>>>              seqnames               ranges strand         ref
>>>>>>       alt     totalDepth       refDepth       altDepth
>>>>>>       <Rle>  <IRanges>  <Rle>  <character>  <characterOrRle>
>>>>>> <integerOrRle>
>>>>>>       <integerOrRle>  <integerOrRle>
>>>>>>          [1]    chr15 [28356859, 28356859]      +           A
>>>>>>          G             12              5              7
>>>>>>                sampleNames softFilterMatrix |      QUAL
>>>>>>       <factorOrRle>  <matrix>  |<numeric>
>>>>>>          [1]             A                  |<NA>
>>>>>>          -------
>>>>>>          seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>>>>          hardFilters: NULL
>>>>>>
>>>>>>
>>>>>>       Of course, if I run locateVariants() with the argument
>>>>>>       ignore.strand=TRUE, then I get the expected annotation:
>>>>>>
>>>>>>       locateVariants(vr, txdb, region=AllVariants(),
>>>>>> ignore.strand=TRUE)
>>>>>>       GRanges object with 2 ranges and 9 metadata columns:
>>>>>>              seqnames               ranges strand | LOCATION  LOCSTART
>>>>>>       LOCEND   QUERYID        TXID         CDSID
>>>>>>       <Rle>  <IRanges>  <Rle>  |<factor>  <integer>  <integer>
>>>>>> <integer>
>>>>>>       <character>  <IntegerList>
>>>>>>          [1]    chr15 [28356859, 28356859]      + | threeUTR       677
>>>>>>       677         1       55386
>>>>>>          [2]    chr15 [28356859, 28356859]      + | threeUTR       677
>>>>>>       677         1       55387
>>>>>>                   GENEID       PRECEDEID        FOLLOWID
>>>>>>       <character>  <CharacterList>  <CharacterList>
>>>>>>          [1]        8924
>>>>>>          [2]        8924
>>>>>>          -------
>>>>>>          seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>>>>
>>>>>>
>>>>>>       So, my question is, given that VRanges objects are enforced to
>>>>>> have
>>>>>>       a positive strand, would not be better to have
>>>>>> ignore.strand=TRUE
>>>>>> as
>>>>>>       default in locateVariants?
>>>>>>
>>>>>>       Alternatively, I would recommend that locateVariants() issues a
>>>>>>       warning, or maybe an error, when the input object is VRanges and
>>>>>>       ignore.strand=FALSE.
>>>>>>
>>>>>>       Finally, out of curiosity, why a VRanges object enforces the
>>>>>>       positive strand in all its genomic ranges? Would not be better
>>>>>> just
>>>>>>       taking the strand of the CollapsedVCF object when coercing the
>>>>>>       CollapsedVCF object to VRanges?
>>>>>>
>>>>>>
>>>>>>       thanks!!
>>>>>>
>>>>>>
>>>>>>       robert.
>>>>>>
>>>>>>       _______________________________________________
>>>>>>       Bioc-devel at r-project.org<mailto:Bioc-devel at r-project.org>
>>>>>> mailing
>>>>>> list
>>>>>>       https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>>
>>>>>>
>>>>> --
>>>>> Robert Castelo, PhD
>>>>> Associate Professor
>>>>> Dept. of Experimental and Health Sciences
>>>>> Universitat Pompeu Fabra (UPF)
>>>>> Barcelona Biomedical Research Park (PRBB)
>>>>> Dr Aiguader 88
>>>>> E-08003 Barcelona, Spain
>>>>> telf: +34.933.160.514
>>>>> fax: +34.933.160.550
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>
>
> --
> Robert Castelo, PhD
> Associate Professor
> Dept. of Experimental and Health Sciences
> Universitat Pompeu Fabra (UPF)
> Barcelona Biomedical Research Park (PRBB)
> Dr Aiguader 88
> E-08003 Barcelona, Spain
> telf: +34.933.160.514
> fax: +34.933.160.550



More information about the Bioc-devel mailing list