[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Wed Jun 10 22:17:37 CEST 2015

my understanding was that VRanges is a container for variants and 
variant annotations and strand is just one annotation more. when we use 
locateVariants() a variant can be annotated to multiple transcripts 
where in one overlaps an exon, in another an intron and so on. In all 
those transcripts annotations there is a strand annotation, the strand 
of the transcript. if the transcript is annotated in the negative strand 
of the reference chromosome then the annotation of a transcript region 
to a variant is going to be also on the negative strand.

both locateVariants() and predictCoding() return GRanges objects with 
strand annotations according to the transcripts being annotated. I 
thought it made sense in VariantFiltering to use VRanges objects as a  
container for variants and annotations and, for this reason, I would 
like to carry on the strand annotation into the VRanges object. Is there 
a strong reason for a VRanges object, derived from GRanges, not to have 
strand?

On 6/10/15 6:26 PM, Michael Lawrence wrote:
> VRanges is supposed to enforce strand. The goal is to use "*" always,
> for simplicity and consistency with the result of readVcf(). Is there
> a use case for negative strand variants?
>
> On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
>> Michael,
>>
>> regarding our email exchange three weeks ago, I found a couple of places in
>> VariantAnnotation that IMO need to be updated to avoid enforcing strand on
>> VRanges.
>>
>> these places occur when constructing and validating VRanges objects,
>> concretely at:
>>
>> 1. file R/methods-VRanges-class.R at the VRanges class constructor:
>>
>> VRanges <-
>>    function(seqnames = Rle(), ranges = IRanges(),
>>             ref = character(), alt = NA_character_,
>>             totalDepth = NA_integer_, refDepth = NA_integer_,
>>             altDepth = NA_integer_, ..., sampleNames = NA_character_,
>>             softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), ncol =
>> 0L),
>>               FilterRules()),
>>             hardFilters = FilterRules())
>> {
>>    gr <- GRanges(seqnames, ranges,
>>                  strand = .rleRecycleVector("*", length(ranges)), ...)
>> [...]
>>
>> that precludes setting the strand at construction time:
>>
>> library(VariantAnnotation)
>> VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C", strand="-")
>> Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
>> length(ranges)),  :
>>    formal argument "strand" matched by multiple actual arguments
>>
>>
>> 2. R/AllClasses.R at the VRanges class validity function .valid.VRanges():
>>
>> .valid.VRanges.strand <- function(object) {
>>    if (any(object at strand == "-"))
>>      paste("'strand' must always be '+' or '*'")
>> }
>>
>> [...]
>>
>> .valid.VRanges <- function(object)
>> {
>>    c(.valid.VRanges.ref(object),
>>      .valid.VRanges.alt(object),
>>      .valid.VRanges.strand(object),
>>      .valid.VRanges.depth(object))
>> }
>>
>> that prompts an error when variants annotated on the negative strand are
>> detected:
>>
>> library(VariantAnnotation)
>> example(VRanges)
>> strand(vr) <- "-"
>> c(vr)
>> Error in validObject(.Object) :
>>    invalid class “VRanges” object: 'strand' must always be '+' or '*'
>>
>>
>> cheers,
>>
>> robert.
>>
>> On 05/22/2015 09:49 PM, Michael Lawrence wrote:
>>> This changed recently. VariantAnnotation in devel no longer enforces a
>>> strand on VRanges, or at least it allows the "*" case.
>>>
>>>
>>> On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.castelo at upf.edu
>>> <mailto:robert.castelo at upf.edu>> wrote:
>>>
>>>      Hi,
>>>
>>>      I have encountered myself in a strange situation when using the
>>>      function locateVariants() from VariantAnnotation with an input
>>>      VRanges object. The problem is that some of the expected coding
>>>      annotations are not showing up when using locateVariants() with
>>>      default parameters.
>>>
>>>      After investigating this situation I think I found the reason, which
>>>      does not look like a bug but I would like that you give me some
>>>      clarification about the logic behind using locateVariants() with
>>>      VRanges objects.
>>>
>>>      The documentation of the VRanges-class says that in this class of
>>>      objects "The strand is always constrained to be positive (+).". I
>>>      guess there may be a good reason for this but I could not find it in
>>>      the documentation or googling about it.
>>>
>>>      This means that when you coerce a CollapsedVCF object (obtained, for
>>>      example, from a VCF file via readVcf()) to a VRanges object, even
>>>      though variants in the VCF may have no strand, they get a positive
>>>      strand in the VRanges object.
>>>
>>>      The problem arises then, when you call locateVariants() with this
>>>      VRanges object, because features on the negative strand are never
>>>      going to overlap with the variants since, by default, the argument
>>>      ignore.strand=FALSE.
>>>
>>>      Let me illustrate this with a toy example. Consider the SNP
>>>      rs1129038
>>>      (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
>>>      chr15:28356859 with allels A/G. It is located on the 3' UTR of the
>>>      gene HERC2 coded on the negative strand of the human reference
>>>      genome. Let's build a toy VRanges object having this variant:
>>>
>>>      library(VariantAnnotation)
>>>      vr <- VRanges(seqnames="chr15",
>>>                     ranges=IRanges(28356859, 28356859),
>>>                     ref="A", alt="G",
>>>                     refDepth=5, altDepth=7,
>>>                     totalDepth=12, sampleNames="A")
>>>      strand(vr)
>>>      factor-Rle of length 1 with 1 run
>>>         Lengths: 1
>>>         Values : +
>>>      Levels(3): + - *
>>>
>>>      Let's build now its CollapsedVCF counterpart by using the
>>>      corresponding coercion method and set the strand to "*":
>>>
>>>      vcf <- asVCF(vr)
>>>      strand(vcf) <- "*"
>>>
>>>      Now run locateVariants() on both objects with UCSC annotations:
>>>
>>>      library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>      txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>
>>>      locateVariants(vcf, txdb, region=AllVariants())
>>>      GRanges object with 2 ranges and 9 metadata columns:
>>>             seqnames               ranges strand | LOCATION  LOCSTART
>>>      LOCEND   QUERYID        TXID         CDSID
>>>      <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>>>      <character> <IntegerList>
>>>         [1]    chr15 [28356859, 28356859]      * | threeUTR        50 50
>>>              1       55386
>>>         [2]    chr15 [28356859, 28356859]      * | threeUTR        50 50
>>>              1       55387
>>>                  GENEID       PRECEDEID        FOLLOWID
>>>      <character> <CharacterList> <CharacterList>
>>>         [1]        8924
>>>         [2]        8924
>>>         -------
>>>         seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>
>>>      locateVariants(vr, txdb, region=AllVariants())
>>>      GRanges object with 1 range and 9 metadata columns:
>>>             seqnames               ranges strand |   LOCATION  LOCSTART
>>>      LOCEND   QUERYID      TXID         CDSID
>>>      <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>>>      <integer> <IntegerList>
>>>         [1]    chr15 [28356859, 28356859]      + | intergenic <NA> <NA>
>>>              1 <NA>
>>>                  GENEID                         PRECEDEID        FOLLOWID
>>>      <character> <CharacterList> <CharacterList>
>>>         [1] <NA> 100132565,100289656,100616223,...            2567
>>>         -------
>>>         seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>
>>>      Note that while we get the 3' UTR annotation from the strandless VCF
>>>      object we do not get it from the VRanges object with the positive
>>>      strand. To make my point clear: this positive strand shows up when
>>>      you coerce a strandless VCF object to a VRanges one, because
>>>      positive strandness seems to be the convention for VRanges objects:
>>>
>>>      as(vcf, VRanges)
>>>      VRanges object with 1 range and 1 metadata column:
>>>             seqnames               ranges strand         ref
>>>      alt     totalDepth       refDepth       altDepth
>>>      <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
>>>      <integerOrRle> <integerOrRle>
>>>         [1]    chr15 [28356859, 28356859]      +           A
>>>         G             12              5              7
>>>               sampleNames softFilterMatrix |      QUAL
>>>      <factorOrRle> <matrix> | <numeric>
>>>         [1]             A                  | <NA>
>>>         -------
>>>         seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>         hardFilters: NULL
>>>
>>>
>>>      Of course, if I run locateVariants() with the argument
>>>      ignore.strand=TRUE, then I get the expected annotation:
>>>
>>>      locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE)
>>>      GRanges object with 2 ranges and 9 metadata columns:
>>>             seqnames               ranges strand | LOCATION  LOCSTART
>>>      LOCEND   QUERYID        TXID         CDSID
>>>      <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>>>      <character> <IntegerList>
>>>         [1]    chr15 [28356859, 28356859]      + | threeUTR       677
>>>      677         1       55386
>>>         [2]    chr15 [28356859, 28356859]      + | threeUTR       677
>>>      677         1       55387
>>>                  GENEID       PRECEDEID        FOLLOWID
>>>      <character> <CharacterList> <CharacterList>
>>>         [1]        8924
>>>         [2]        8924
>>>         -------
>>>         seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>
>>>
>>>      So, my question is, given that VRanges objects are enforced to have
>>>      a positive strand, would not be better to have ignore.strand=TRUE as
>>>      default in locateVariants?
>>>
>>>      Alternatively, I would recommend that locateVariants() issues a
>>>      warning, or maybe an error, when the input object is VRanges and
>>>      ignore.strand=FALSE.
>>>
>>>      Finally, out of curiosity, why a VRanges object enforces the
>>>      positive strand in all its genomic ranges? Would not be better just
>>>      taking the strand of the CollapsedVCF object when coercing the
>>>      CollapsedVCF object to VRanges?
>>>
>>>
>>>      thanks!!
>>>
>>>
>>>      robert.
>>>
>>>      _______________________________________________
>>>      Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>>> list
>>>      https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>> --
>> Robert Castelo, PhD
>> Associate Professor
>> Dept. of Experimental and Health Sciences
>> Universitat Pompeu Fabra (UPF)
>> Barcelona Biomedical Research Park (PRBB)
>> Dr Aiguader 88
>> E-08003 Barcelona, Spain
>> telf: +34.933.160.514
>> fax: +34.933.160.550
>>
>> _______________________________________________
>> Bioc-devel at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel