[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness
Michael Lawrence
lawrence.michael at gene.com
Wed Jun 10 18:26:15 CEST 2015
VRanges is supposed to enforce strand. The goal is to use "*" always,
for simplicity and consistency with the result of readVcf(). Is there
a use case for negative strand variants?
On Wed, Jun 10, 2015 at 5:54 AM, Robert Castelo <robert.castelo at upf.edu> wrote:
> Michael,
>
> regarding our email exchange three weeks ago, I found a couple of places in
> VariantAnnotation that IMO need to be updated to avoid enforcing strand on
> VRanges.
>
> these places occur when constructing and validating VRanges objects,
> concretely at:
>
> 1. file R/methods-VRanges-class.R at the VRanges class constructor:
>
> VRanges <-
> function(seqnames = Rle(), ranges = IRanges(),
> ref = character(), alt = NA_character_,
> totalDepth = NA_integer_, refDepth = NA_integer_,
> altDepth = NA_integer_, ..., sampleNames = NA_character_,
> softFilterMatrix = FilterMatrix(matrix(nrow = length(gr), ncol =
> 0L),
> FilterRules()),
> hardFilters = FilterRules())
> {
> gr <- GRanges(seqnames, ranges,
> strand = .rleRecycleVector("*", length(ranges)), ...)
> [...]
>
> that precludes setting the strand at construction time:
>
> library(VariantAnnotation)
> VRanges(seqnames="chr1", ranges=IRanges(1, 5), ref="T", alt="C", strand="-")
> Error in GRanges(seqnames, ranges, strand = .rleRecycleVector("*",
> length(ranges)), :
> formal argument "strand" matched by multiple actual arguments
>
>
> 2. R/AllClasses.R at the VRanges class validity function .valid.VRanges():
>
> .valid.VRanges.strand <- function(object) {
> if (any(object at strand == "-"))
> paste("'strand' must always be '+' or '*'")
> }
>
> [...]
>
> .valid.VRanges <- function(object)
> {
> c(.valid.VRanges.ref(object),
> .valid.VRanges.alt(object),
> .valid.VRanges.strand(object),
> .valid.VRanges.depth(object))
> }
>
> that prompts an error when variants annotated on the negative strand are
> detected:
>
> library(VariantAnnotation)
> example(VRanges)
> strand(vr) <- "-"
> c(vr)
> Error in validObject(.Object) :
> invalid class “VRanges” object: 'strand' must always be '+' or '*'
>
>
> cheers,
>
> robert.
>
> On 05/22/2015 09:49 PM, Michael Lawrence wrote:
>>
>> This changed recently. VariantAnnotation in devel no longer enforces a
>> strand on VRanges, or at least it allows the "*" case.
>>
>>
>> On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.castelo at upf.edu
>> <mailto:robert.castelo at upf.edu>> wrote:
>>
>> Hi,
>>
>> I have encountered myself in a strange situation when using the
>> function locateVariants() from VariantAnnotation with an input
>> VRanges object. The problem is that some of the expected coding
>> annotations are not showing up when using locateVariants() with
>> default parameters.
>>
>> After investigating this situation I think I found the reason, which
>> does not look like a bug but I would like that you give me some
>> clarification about the logic behind using locateVariants() with
>> VRanges objects.
>>
>> The documentation of the VRanges-class says that in this class of
>> objects "The strand is always constrained to be positive (+).". I
>> guess there may be a good reason for this but I could not find it in
>> the documentation or googling about it.
>>
>> This means that when you coerce a CollapsedVCF object (obtained, for
>> example, from a VCF file via readVcf()) to a VRanges object, even
>> though variants in the VCF may have no strand, they get a positive
>> strand in the VRanges object.
>>
>> The problem arises then, when you call locateVariants() with this
>> VRanges object, because features on the negative strand are never
>> going to overlap with the variants since, by default, the argument
>> ignore.strand=FALSE.
>>
>> Let me illustrate this with a toy example. Consider the SNP
>> rs1129038
>> (http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
>> chr15:28356859 with allels A/G. It is located on the 3' UTR of the
>> gene HERC2 coded on the negative strand of the human reference
>> genome. Let's build a toy VRanges object having this variant:
>>
>> library(VariantAnnotation)
>> vr <- VRanges(seqnames="chr15",
>> ranges=IRanges(28356859, 28356859),
>> ref="A", alt="G",
>> refDepth=5, altDepth=7,
>> totalDepth=12, sampleNames="A")
>> strand(vr)
>> factor-Rle of length 1 with 1 run
>> Lengths: 1
>> Values : +
>> Levels(3): + - *
>>
>> Let's build now its CollapsedVCF counterpart by using the
>> corresponding coercion method and set the strand to "*":
>>
>> vcf <- asVCF(vr)
>> strand(vcf) <- "*"
>>
>> Now run locateVariants() on both objects with UCSC annotations:
>>
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>
>> locateVariants(vcf, txdb, region=AllVariants())
>> GRanges object with 2 ranges and 9 metadata columns:
>> seqnames ranges strand | LOCATION LOCSTART
>> LOCEND QUERYID TXID CDSID
>> <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>> <character> <IntegerList>
>> [1] chr15 [28356859, 28356859] * | threeUTR 50 50
>> 1 55386
>> [2] chr15 [28356859, 28356859] * | threeUTR 50 50
>> 1 55387
>> GENEID PRECEDEID FOLLOWID
>> <character> <CharacterList> <CharacterList>
>> [1] 8924
>> [2] 8924
>> -------
>> seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>
>> locateVariants(vr, txdb, region=AllVariants())
>> GRanges object with 1 range and 9 metadata columns:
>> seqnames ranges strand | LOCATION LOCSTART
>> LOCEND QUERYID TXID CDSID
>> <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>> <integer> <IntegerList>
>> [1] chr15 [28356859, 28356859] + | intergenic <NA> <NA>
>> 1 <NA>
>> GENEID PRECEDEID FOLLOWID
>> <character> <CharacterList> <CharacterList>
>> [1] <NA> 100132565,100289656,100616223,... 2567
>> -------
>> seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>
>> Note that while we get the 3' UTR annotation from the strandless VCF
>> object we do not get it from the VRanges object with the positive
>> strand. To make my point clear: this positive strand shows up when
>> you coerce a strandless VCF object to a VRanges one, because
>> positive strandness seems to be the convention for VRanges objects:
>>
>> as(vcf, VRanges)
>> VRanges object with 1 range and 1 metadata column:
>> seqnames ranges strand ref
>> alt totalDepth refDepth altDepth
>> <Rle> <IRanges> <Rle> <character> <characterOrRle> <integerOrRle>
>> <integerOrRle> <integerOrRle>
>> [1] chr15 [28356859, 28356859] + A
>> G 12 5 7
>> sampleNames softFilterMatrix | QUAL
>> <factorOrRle> <matrix> | <numeric>
>> [1] A | <NA>
>> -------
>> seqinfo: 1 sequence from an unspecified genome; no seqlengths
>> hardFilters: NULL
>>
>>
>> Of course, if I run locateVariants() with the argument
>> ignore.strand=TRUE, then I get the expected annotation:
>>
>> locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE)
>> GRanges object with 2 ranges and 9 metadata columns:
>> seqnames ranges strand | LOCATION LOCSTART
>> LOCEND QUERYID TXID CDSID
>> <Rle> <IRanges> <Rle> | <factor> <integer> <integer> <integer>
>> <character> <IntegerList>
>> [1] chr15 [28356859, 28356859] + | threeUTR 677
>> 677 1 55386
>> [2] chr15 [28356859, 28356859] + | threeUTR 677
>> 677 1 55387
>> GENEID PRECEDEID FOLLOWID
>> <character> <CharacterList> <CharacterList>
>> [1] 8924
>> [2] 8924
>> -------
>> seqinfo: 1 sequence from an unspecified genome; no seqlengths
>>
>>
>> So, my question is, given that VRanges objects are enforced to have
>> a positive strand, would not be better to have ignore.strand=TRUE as
>> default in locateVariants?
>>
>> Alternatively, I would recommend that locateVariants() issues a
>> warning, or maybe an error, when the input object is VRanges and
>> ignore.strand=FALSE.
>>
>> Finally, out of curiosity, why a VRanges object enforces the
>> positive strand in all its genomic ranges? Would not be better just
>> taking the strand of the CollapsedVCF object when coercing the
>> CollapsedVCF object to VRanges?
>>
>>
>> thanks!!
>>
>>
>> robert.
>>
>> _______________________________________________
>> Bioc-devel at r-project.org <mailto:Bioc-devel at r-project.org> mailing
>> list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>
> --
> Robert Castelo, PhD
> Associate Professor
> Dept. of Experimental and Health Sciences
> Universitat Pompeu Fabra (UPF)
> Barcelona Biomedical Research Park (PRBB)
> Dr Aiguader 88
> E-08003 Barcelona, Spain
> telf: +34.933.160.514
> fax: +34.933.160.550
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
More information about the Bioc-devel
mailing list