[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Robert Castelo robert.castelo at upf.edu
Fri May 22 20:33:00 CEST 2015


Hi,

I have encountered myself in a strange situation when using the function 
locateVariants() from VariantAnnotation with an input VRanges object. 
The problem is that some of the expected coding annotations are not 
showing up when using locateVariants() with default parameters.

After investigating this situation I think I found the reason, which 
does not look like a bug but I would like that you give me some 
clarification about the logic behind using locateVariants() with VRanges 
objects.

The documentation of the VRanges-class says that in this class of 
objects "The strand is always constrained to be positive (+).". I guess 
there may be a good reason for this but I could not find it in the 
documentation or googling about it.

This means that when you coerce a CollapsedVCF object (obtained, for 
example, from a VCF file via readVcf()) to a VRanges object, even though 
variants in the VCF may have no strand, they get a positive strand in 
the VRanges object.

The problem arises then, when you call locateVariants() with this 
VRanges object, because features on the negative strand are never going 
to overlap with the variants since, by default, the argument 
ignore.strand=FALSE.

Let me illustrate this with a toy example. Consider the SNP rs1129038 
(http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at 
chr15:28356859 with allels A/G. It is located on the 3' UTR of the gene 
HERC2 coded on the negative strand of the human reference genome. Let's 
build a toy VRanges object having this variant:

library(VariantAnnotation)
vr <- VRanges(seqnames="chr15",
               ranges=IRanges(28356859, 28356859),
               ref="A", alt="G",
               refDepth=5, altDepth=7,
               totalDepth=12, sampleNames="A")
strand(vr)
factor-Rle of length 1 with 1 run
   Lengths: 1
   Values : +
Levels(3): + - *

Let's build now its CollapsedVCF counterpart by using the corresponding 
coercion method and set the strand to "*":

vcf <- asVCF(vr)
strand(vcf) <- "*"

Now run locateVariants() on both objects with UCSC annotations:

library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

locateVariants(vcf, txdb, region=AllVariants())
GRanges object with 2 ranges and 9 metadata columns:
       seqnames               ranges strand | LOCATION  LOCSTART 
LOCEND   QUERYID        TXID         CDSID
          <Rle>            <IRanges>  <Rle> | <factor> <integer> 
<integer> <integer> <character> <IntegerList>
   [1]    chr15 [28356859, 28356859]      * | threeUTR        50 
50         1       55386
   [2]    chr15 [28356859, 28356859]      * | threeUTR        50 
50         1       55387
            GENEID       PRECEDEID        FOLLOWID
       <character> <CharacterList> <CharacterList>
   [1]        8924
   [2]        8924
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

locateVariants(vr, txdb, region=AllVariants())
GRanges object with 1 range and 9 metadata columns:
       seqnames               ranges strand |   LOCATION  LOCSTART 
LOCEND   QUERYID      TXID         CDSID
          <Rle>            <IRanges>  <Rle> |   <factor> <integer> 
<integer> <integer> <integer> <IntegerList>
   [1]    chr15 [28356859, 28356859]      + | intergenic      <NA> 
<NA>         1      <NA>
            GENEID                         PRECEDEID        FOLLOWID
       <character>                   <CharacterList> <CharacterList>
   [1]        <NA> 100132565,100289656,100616223,...            2567
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths

Note that while we get the 3' UTR annotation from the strandless VCF 
object we do not get it from the VRanges object with the positive 
strand. To make my point clear: this positive strand shows up when you 
coerce a strandless VCF object to a VRanges one, because positive 
strandness seems to be the convention for VRanges objects:

as(vcf, VRanges)
VRanges object with 1 range and 1 metadata column:
       seqnames               ranges strand         ref              alt 
     totalDepth       refDepth       altDepth
          <Rle>            <IRanges>  <Rle> <character> <characterOrRle> 
<integerOrRle> <integerOrRle> <integerOrRle>
   [1]    chr15 [28356859, 28356859]      +           A                G 
             12              5              7
         sampleNames softFilterMatrix |      QUAL
       <factorOrRle>         <matrix> | <numeric>
   [1]             A                  |      <NA>
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths
   hardFilters: NULL


Of course, if I run locateVariants() with the argument 
ignore.strand=TRUE, then I get the expected annotation:

locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE)
GRanges object with 2 ranges and 9 metadata columns:
       seqnames               ranges strand | LOCATION  LOCSTART 
LOCEND   QUERYID        TXID         CDSID
          <Rle>            <IRanges>  <Rle> | <factor> <integer> 
<integer> <integer> <character> <IntegerList>
   [1]    chr15 [28356859, 28356859]      + | threeUTR       677 
677         1       55386
   [2]    chr15 [28356859, 28356859]      + | threeUTR       677 
677         1       55387
            GENEID       PRECEDEID        FOLLOWID
       <character> <CharacterList> <CharacterList>
   [1]        8924
   [2]        8924
   -------
   seqinfo: 1 sequence from an unspecified genome; no seqlengths


So, my question is, given that VRanges objects are enforced to have a 
positive strand, would not be better to have ignore.strand=TRUE as 
default in locateVariants?

Alternatively, I would recommend that locateVariants() issues a warning, 
or maybe an error, when the input object is VRanges and ignore.strand=FALSE.

Finally, out of curiosity, why a VRanges object enforces the positive 
strand in all its genomic ranges? Would not be better just taking the 
strand of the CollapsedVCF object when coercing the CollapsedVCF object 
to VRanges?


thanks!!


robert.



More information about the Bioc-devel mailing list