[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Michael Lawrence lawrence.michael at gene.com
Thu Jun 11 04:30:49 CEST 2015


I guess it depends on what the strand should mean. Would having a
negative strand indicate that the ref/alt should be complemented? I'm
not sure it's a good idea to conflate the strand of the variant itself
with the strand of an overlapping feature.

On Wed, Jun 10, 2015 at 1:17 PM, Robert Castelo <robert.castelo at upf.edu> wrote:
> my understanding was that VRanges is a container for variants and variant
> annotations and strand is just one annotation more. when we use
> locateVariants() a variant can be annotated to multiple transcripts where in
> one overlaps an exon, in another an intron and so on. In all those
> transcripts annotations there is a strand annotation, the strand of the
> transcript. if the transcript is annotated in the negative strand of the
> reference chromosome then the annotation of a transcript region to a variant
> is going to be also on the negative strand.
>
> both locateVariants() and predictCoding() return GRanges objects with strand
> annotations according to the transcripts being annotated. I thought it made
> sense in VariantFiltering to use VRanges objects as a  container for
> variants and annotations and, for this reason, I would like to carry on the
> strand annotation into the VRanges object. Is there a strong reason for a
> VRanges object, derived from GRanges, not to have strand?
>
>
> On 6/10/15 6:26 PM, Michael Lawrence wrote:
>>
>> VRanges is supposed to enforce strand. The goal is to use "*" always,
>> for simplicity and consistency with the result of readVcf(). Is there
>> a use case for negative strand variants?
>>
>> On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo <robert.castelo at upf.edu>
>> wrote:
>>>
>>> Michael,
>>>
>>> regarding our email exchange three weeks ago, I found a couple of places
>>> in
>>> VariantAnnotation that IMO need to be updated to avoid enforcing strand
>>> on
>>> VRanges.
>>>
>>> these places occur when constructing and validating VRanges objects,
>>> concretely at:
>>>
>>> 1. file R/methods-VRanges-class.R at the VRanges class constructor:
>>>
>>> VRanges <-
>>>    function(seqnames = Rle(), ranges = IRanges(),
>>>             ref = character(), alt = NA_character_,
>>>             totalDepth = NA_integer_, refDepth = NA_integer_,
>>>             altDepth = NA_integer_, ..., sampleNames = NA_character_,
>>>             softFilterMatrix = FilterMatrix(matrix(nrow = length(gr),
>>> ncol =
>>> 0L),
>>>               FilterRules()),
>>>             hardFilters = FilterRules())
>>> {
>>>    gr <- GRanges(seqnames, ranges,
>>>                  strand = .rleRecycleVector("*", length(ranges)), ...)
>>> [...]
>>>
>>> that precludes setting the strand at construction time:
>>>
>>> library(VariantAnnotation)
>>> VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C",
>>> strand="-")
>>> Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
>>> length(ranges)),  :
>>>    formal argument "strand" matched by multiple actual arguments
>>>
>>>
>>> 2. R/AllClasses.R at the VRanges class validity function
>>> .valid.VRanges():
>>>
>>> .valid.VRanges.strand <- function(object) {
>>>    if (any(object at strand == "-"))
>>>      paste("'strand' must always be '+' or '*'")
>>> }
>>>
>>> [...]
>>>
>>> .valid.VRanges <- function(object)
>>> {
>>>    c(.valid.VRanges.ref(object),
>>>      .valid.VRanges.alt(object),
>>>      .valid.VRanges.strand(object),
>>>      .valid.VRanges.depth(object))
>>> }
>>>
>>> that prompts an error when variants annotated on the negative strand are
>>> detected:
>>>
>>> library(VariantAnnotation)
>>> example(VRanges)
>>> strand(vr) <- "-"
>>> c(vr)
>>> Error in validObject(.Object) :
>>>    invalid class “VRanges” object: 'strand' must always be '+' or '*'
>>>
>>>
>>> cheers,
>>>
>>> robert.
>>>
>>> On 05/22/2015 09:49 PM, Michael Lawrence wrote:
>>>>
>>>> This changed recently. VariantAnnotation in devel no longer enforces a
>>>> strand on VRanges, or at least it allows the "*" case.
>>>>
>>>>
>>>> On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.castelo at upf.edu
>>>> <mailto:robert.castelo at upf.edu>> wrote:
>>>>
>>>>      Hi,
>>>>
>>>>      I have encountered myself in a strange situation when using the
>>>>      function locateVariants() from VariantAnnotation with an input
>>>>      VRanges object. The problem is that some of the expected coding
>>>>      annotations are not showing up when using locateVariants() with
>>>>      default parameters.
>>>>
>>>>      After investigating this situation I think I found the reason,
>>>> which
>>>>      does not look like a bug but I would like that you give me some
>>>>      clarification about the logic behind using locateVariants() with
>>>>      VRanges objects.
>>>>
>>>>      The documentation of the VRanges-class says that in this class of
>>>>      objects "The strand is always constrained to be positive (+).". I
>>>>      guess there may be a good reason for this but I could not find it
>>>> in
>>>>      the documentation or googling about it.
>>>>
>>>>      This means that when you coerce a CollapsedVCF object (obtained,
>>>> for
>>>>      example, from a VCF file via readVcf()) to a VRanges object, even
>>>>      though variants in the VCF may have no strand, they get a positive
>>>>      strand in the VRanges object.
>>>>
>>>>      The problem arises then, when you call locateVariants() with this
>>>>      VRanges object, because features on the negative strand are never
>>>>      going to overlap with the variants since, by default, the argument
>>>>      ignore.strand=FALSE.
>>>>
>>>>      Let me illustrate this with a toy example. Consider the SNP
>>>>      rs1129038
>>>>      (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038)
>>>> at
>>>>      chr15:28356859 with allels A/G. It is located on the 3' UTR of the
>>>>      gene HERC2 coded on the negative strand of the human reference
>>>>      genome. Let's build a toy VRanges object having this variant:
>>>>
>>>>      library(VariantAnnotation)
>>>>      vr <- VRanges(seqnames="chr15",
>>>>                     ranges=IRanges(28356859, 28356859),
>>>>                     ref="A", alt="G",
>>>>                     refDepth=5, altDepth=7,
>>>>                     totalDepth=12, sampleNames="A")
>>>>      strand(vr)
>>>>      factor-Rle of length 1 with 1 run
>>>>         Lengths: 1
>>>>         Values : +
>>>>      Levels(3): + - *
>>>>
>>>>      Let's build now its CollapsedVCF counterpart by using the
>>>>      corresponding coercion method and set the strand to "*":
>>>>
>>>>      vcf <- asVCF(vr)
>>>>      strand(vcf) <- "*"
>>>>
>>>>      Now run locateVariants() on both objects with UCSC annotations:
>>>>
>>>>      library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>      txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>
>>>>      locateVariants(vcf, txdb, region=AllVariants())
>>>>      GRanges object with 2 ranges and 9 metadata columns:
>>>>             seqnames               ranges strand | LOCATION  LOCSTART
>>>>      LOCEND   QUERYID        TXID         CDSID
>>>>      <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>>>>      <character> <IntegerList>
>>>>         [1]    chr15 [28356859, 28356859]      * | threeUTR        50 50
>>>>              1       55386
>>>>         [2]    chr15 [28356859, 28356859]      * | threeUTR        50 50
>>>>              1       55387
>>>>                  GENEID       PRECEDEID        FOLLOWID
>>>>      <character> <CharacterList> <CharacterList>
>>>>         [1]        8924
>>>>         [2]        8924
>>>>         -------
>>>>         seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>>
>>>>      locateVariants(vr, txdb, region=AllVariants())
>>>>      GRanges object with 1 range and 9 metadata columns:
>>>>             seqnames               ranges strand |   LOCATION  LOCSTART
>>>>      LOCEND   QUERYID      TXID         CDSID
>>>>      <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>>>>      <integer> <IntegerList>
>>>>         [1]    chr15 [28356859, 28356859]      + | intergenic <NA> <NA>
>>>>              1 <NA>
>>>>                  GENEID                         PRECEDEID
>>>> FOLLOWID
>>>>      <character> <CharacterList> <CharacterList>
>>>>         [1] <NA> 100132565,100289656,100616223,...            2567
>>>>         -------
>>>>         seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>>
>>>>      Note that while we get the 3' UTR annotation from the strandless
>>>> VCF
>>>>      object we do not get it from the VRanges object with the positive
>>>>      strand. To make my point clear: this positive strand shows up when
>>>>      you coerce a strandless VCF object to a VRanges one, because
>>>>      positive strandness seems to be the convention for VRanges objects:
>>>>
>>>>      as(vcf, VRanges)
>>>>      VRanges object with 1 range and 1 metadata column:
>>>>             seqnames               ranges strand         ref
>>>>      alt     totalDepth       refDepth       altDepth
>>>>      <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
>>>>      <integerOrRle> <integerOrRle>
>>>>         [1]    chr15 [28356859, 28356859]      +           A
>>>>         G             12              5              7
>>>>               sampleNames softFilterMatrix |      QUAL
>>>>      <factorOrRle> <matrix> | <numeric>
>>>>         [1]             A                  | <NA>
>>>>         -------
>>>>         seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>>         hardFilters: NULL
>>>>
>>>>
>>>>      Of course, if I run locateVariants() with the argument
>>>>      ignore.strand=TRUE, then I get the expected annotation:
>>>>
>>>>      locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE)
>>>>      GRanges object with 2 ranges and 9 metadata columns:
>>>>             seqnames               ranges strand | LOCATION  LOCSTART
>>>>      LOCEND   QUERYID        TXID         CDSID
>>>>      <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>>>>      <character> <IntegerList>
>>>>         [1]    chr15 [28356859, 28356859]      + | threeUTR       677
>>>>      677         1       55386
>>>>         [2]    chr15 [28356859, 28356859]      + | threeUTR       677
>>>>      677         1       55387
>>>>                  GENEID       PRECEDEID        FOLLOWID
>>>>      <character> <CharacterList> <CharacterList>
>>>>         [1]        8924
>>>>         [2]        8924
>>>>         -------
>>>>         seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>>>
>>>>
>>>>      So, my question is, given that VRanges objects are enforced to have
>>>>      a positive strand, would not be better to have ignore.strand=TRUE
>>>> as
>>>>      default in locateVariants?
>>>>
>>>>      Alternatively, I would recommend that locateVariants() issues a
>>>>      warning, or maybe an error, when the input object is VRanges and
>>>>      ignore.strand=FALSE.
>>>>
>>>>      Finally, out of curiosity, why a VRanges object enforces the
>>>>      positive strand in all its genomic ranges? Would not be better just
>>>>      taking the strand of the CollapsedVCF object when coercing the
>>>>      CollapsedVCF object to VRanges?
>>>>
>>>>
>>>>      thanks!!
>>>>
>>>>
>>>>      robert.
>>>>
>>>>      _______________________________________________
>>>>      Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>>>> list
>>>>      https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>
>>>>
>>> --
>>> Robert Castelo, PhD
>>> Associate Professor
>>> Dept. of Experimental and Health Sciences
>>> Universitat Pompeu Fabra (UPF)
>>> Barcelona Biomedical Research Park (PRBB)
>>> Dr Aiguader 88
>>> E-08003 Barcelona, Spain
>>> telf: +34.933.160.514
>>> fax: +34.933.160.550
>>>
>>> _______________________________________________
>>> Bioc-devel at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>



More information about the Bioc-devel mailing list