[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Robert Castelo robert.castelo at upf.edu
Wed Jun 10 14:54:20 CEST 2015


Michael,

regarding our email exchange three weeks ago, I found a couple of places 
in VariantAnnotation that IMO need to be updated to avoid enforcing 
strand on VRanges.

these places occur when constructing and validating VRanges objects, 
concretely at:

1. file R/methods-VRanges-class.R at the VRanges class constructor:

VRanges <-
   function(seqnames = Rle(), ranges = IRanges(),
            ref = character(), alt = NA_character_,
            totalDepth = NA_integer_, refDepth = NA_integer_,
            altDepth = NA_integer_, ..., sampleNames = NA_character_,
            softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), 
ncol = 0L),
              FilterRules()),
            hardFilters = FilterRules())
{
   gr <- GRanges(seqnames, ranges,
                 strand = .rleRecycleVector("*", length(ranges)), ...)
[...]

that precludes setting the strand at construction time:

library(VariantAnnotation)
VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C", strand="-")
Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*", 
length(ranges)),  :
   formal argument "strand" matched by multiple actual arguments


2. R/AllClasses.R at the VRanges class validity function .valid.VRanges():

.valid.VRanges.strand <- function(object) {
   if (any(object at strand == "-"))
     paste("'strand' must always be '+' or '*'")
}

[...]

.valid.VRanges <- function(object)
{
   c(.valid.VRanges.ref(object),
     .valid.VRanges.alt(object),
     .valid.VRanges.strand(object),
     .valid.VRanges.depth(object))
}

that prompts an error when variants annotated on the negative strand are 
detected:

library(VariantAnnotation)
example(VRanges)
strand(vr) <- "-"
c(vr)
Error in validObject(.Object) :
   invalid class “VRanges” object: 'strand' must always be '+' or '*'


cheers,

robert.

On 05/22/2015 09:49 PM, Michael Lawrence wrote:
> This changed recently. VariantAnnotation in devel no longer enforces a
> strand on VRanges, or at least it allows the "*" case.
>
>
> On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.castelo at upf.edu
> <mailto:robert.castelo at upf.edu>> wrote:
>
>     Hi,
>
>     I have encountered myself in a strange situation when using the
>     function locateVariants() from VariantAnnotation with an input
>     VRanges object. The problem is that some of the expected coding
>     annotations are not showing up when using locateVariants() with
>     default parameters.
>
>     After investigating this situation I think I found the reason, which
>     does not look like a bug but I would like that you give me some
>     clarification about the logic behind using locateVariants() with
>     VRanges objects.
>
>     The documentation of the VRanges-class says that in this class of
>     objects "The strand is always constrained to be positive (+).". I
>     guess there may be a good reason for this but I could not find it in
>     the documentation or googling about it.
>
>     This means that when you coerce a CollapsedVCF object (obtained, for
>     example, from a VCF file via readVcf()) to a VRanges object, even
>     though variants in the VCF may have no strand, they get a positive
>     strand in the VRanges object.
>
>     The problem arises then, when you call locateVariants() with this
>     VRanges object, because features on the negative strand are never
>     going to overlap with the variants since, by default, the argument
>     ignore.strand=FALSE.
>
>     Let me illustrate this with a toy example. Consider the SNP
>     rs1129038
>     (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
>     chr15:28356859 with allels A/G. It is located on the 3' UTR of the
>     gene HERC2 coded on the negative strand of the human reference
>     genome. Let's build a toy VRanges object having this variant:
>
>     library(VariantAnnotation)
>     vr <- VRanges(seqnames="chr15",
>                    ranges=IRanges(28356859, 28356859),
>                    ref="A", alt="G",
>                    refDepth=5, altDepth=7,
>                    totalDepth=12, sampleNames="A")
>     strand(vr)
>     factor-Rle of length 1 with 1 run
>        Lengths: 1
>        Values : +
>     Levels(3): + - *
>
>     Let's build now its CollapsedVCF counterpart by using the
>     corresponding coercion method and set the strand to "*":
>
>     vcf <- asVCF(vr)
>     strand(vcf) <- "*"
>
>     Now run locateVariants() on both objects with UCSC annotations:
>
>     library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>     txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>
>     locateVariants(vcf, txdb, region=AllVariants())
>     GRanges object with 2 ranges and 9 metadata columns:
>            seqnames               ranges strand | LOCATION  LOCSTART
>     LOCEND   QUERYID        TXID         CDSID
>     <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>     <character> <IntegerList>
>        [1]    chr15 [28356859, 28356859]      * | threeUTR        50 50
>             1       55386
>        [2]    chr15 [28356859, 28356859]      * | threeUTR        50 50
>             1       55387
>                 GENEID       PRECEDEID        FOLLOWID
>     <character> <CharacterList> <CharacterList>
>        [1]        8924
>        [2]        8924
>        -------
>        seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>     locateVariants(vr, txdb, region=AllVariants())
>     GRanges object with 1 range and 9 metadata columns:
>            seqnames               ranges strand |   LOCATION  LOCSTART
>     LOCEND   QUERYID      TXID         CDSID
>     <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>     <integer> <IntegerList>
>        [1]    chr15 [28356859, 28356859]      + | intergenic <NA> <NA>
>             1 <NA>
>                 GENEID                         PRECEDEID        FOLLOWID
>     <character> <CharacterList> <CharacterList>
>        [1] <NA> 100132565,100289656,100616223,...            2567
>        -------
>        seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>     Note that while we get the 3' UTR annotation from the strandless VCF
>     object we do not get it from the VRanges object with the positive
>     strand. To make my point clear: this positive strand shows up when
>     you coerce a strandless VCF object to a VRanges one, because
>     positive strandness seems to be the convention for VRanges objects:
>
>     as(vcf, VRanges)
>     VRanges object with 1 range and 1 metadata column:
>            seqnames               ranges strand         ref
>     alt     totalDepth       refDepth       altDepth
>     <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
>     <integerOrRle> <integerOrRle>
>        [1]    chr15 [28356859, 28356859]      +           A
>        G             12              5              7
>              sampleNames softFilterMatrix |      QUAL
>     <factorOrRle> <matrix> | <numeric>
>        [1]             A                  | <NA>
>        -------
>        seqinfo: 1 sequence from an unspecified genome; no seqlengths
>        hardFilters: NULL
>
>
>     Of course, if I run locateVariants() with the argument
>     ignore.strand=TRUE, then I get the expected annotation:
>
>     locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE)
>     GRanges object with 2 ranges and 9 metadata columns:
>            seqnames               ranges strand | LOCATION  LOCSTART
>     LOCEND   QUERYID        TXID         CDSID
>     <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>     <character> <IntegerList>
>        [1]    chr15 [28356859, 28356859]      + | threeUTR       677
>     677         1       55386
>        [2]    chr15 [28356859, 28356859]      + | threeUTR       677
>     677         1       55387
>                 GENEID       PRECEDEID        FOLLOWID
>     <character> <CharacterList> <CharacterList>
>        [1]        8924
>        [2]        8924
>        -------
>        seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>
>     So, my question is, given that VRanges objects are enforced to have
>     a positive strand, would not be better to have ignore.strand=TRUE as
>     default in locateVariants?
>
>     Alternatively, I would recommend that locateVariants() issues a
>     warning, or maybe an error, when the input object is VRanges and
>     ignore.strand=FALSE.
>
>     Finally, out of curiosity, why a VRanges object enforces the
>     positive strand in all its genomic ranges? Would not be better just
>     taking the strand of the CollapsedVCF object when coercing the
>     CollapsedVCF object to VRanges?
>
>
>     thanks!!
>
>
>     robert.
>
>     _______________________________________________
>     Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing list
>     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>

-- 
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550



More information about the Bioc-devel mailing list