[Bioc-devel] VRanges-class positive strandness and locateVariants() strandawareness

Michael Lawrence lawrence.michael at gene.com
Fri May 22 21:49:50 CEST 2015


This changed recently. VariantAnnotation in devel no longer enforces a
strand on VRanges, or at least it allows the "*" case.


On Fri, May 22, 2015 at 11:33 AM, Robert Castelo <robert.castelo at upf.edu>
wrote:

> Hi,
>
> I have encountered myself in a strange situation when using the function
> locateVariants() from VariantAnnotation with an input VRanges object. The
> problem is that some of the expected coding annotations are not showing up
> when using locateVariants() with default parameters.
>
> After investigating this situation I think I found the reason, which does
> not look like a bug but I would like that you give me some clarification
> about the logic behind using locateVariants() with VRanges objects.
>
> The documentation of the VRanges-class says that in this class of objects
> "The strand is always constrained to be positive (+).". I guess there may
> be a good reason for this but I could not find it in the documentation or
> googling about it.
>
> This means that when you coerce a CollapsedVCF object (obtained, for
> example, from a VCF file via readVcf()) to a VRanges object, even though
> variants in the VCF may have no strand, they get a positive strand in the
> VRanges object.
>
> The problem arises then, when you call locateVariants() with this VRanges
> object, because features on the negative strand are never going to overlap
> with the variants since, by default, the argument ignore.strand=FALSE.
>
> Let me illustrate this with a toy example. Consider the SNP rs1129038 (
> http://www.ncbi.nlm.nih.gov/projects/SNP/snp_ref.cgi?rs=1129038) at
> chr15:28356859 with allels A/G. It is located on the 3' UTR of the gene
> HERC2 coded on the negative strand of the human reference genome. Let's
> build a toy VRanges object having this variant:
>
> library(VariantAnnotation)
> vr <- VRanges(seqnames="chr15",
>               ranges=IRanges(28356859, 28356859),
>               ref="A", alt="G",
>               refDepth=5, altDepth=7,
>               totalDepth=12, sampleNames="A")
> strand(vr)
> factor-Rle of length 1 with 1 run
>   Lengths: 1
>   Values : +
> Levels(3): + - *
>
> Let's build now its CollapsedVCF counterpart by using the corresponding
> coercion method and set the strand to "*":
>
> vcf <- asVCF(vr)
> strand(vcf) <- "*"
>
> Now run locateVariants() on both objects with UCSC annotations:
>
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>
> locateVariants(vcf, txdb, region=AllVariants())
> GRanges object with 2 ranges and 9 metadata columns:
>       seqnames               ranges strand | LOCATION  LOCSTART LOCEND
>  QUERYID        TXID         CDSID
>          <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer>
> <integer> <character> <IntegerList>
>   [1]    chr15 [28356859, 28356859]      * | threeUTR        50 50
>  1       55386
>   [2]    chr15 [28356859, 28356859]      * | threeUTR        50 50
>  1       55387
>            GENEID       PRECEDEID        FOLLOWID
>       <character> <CharacterList> <CharacterList>
>   [1]        8924
>   [2]        8924
>   -------
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> locateVariants(vr, txdb, region=AllVariants())
> GRanges object with 1 range and 9 metadata columns:
>       seqnames               ranges strand |   LOCATION  LOCSTART LOCEND
>  QUERYID      TXID         CDSID
>          <Rle>            <IRanges>  <Rle> |   <factor> <integer>
> <integer> <integer> <integer> <IntegerList>
>   [1]    chr15 [28356859, 28356859]      + | intergenic      <NA> <NA>
>      1      <NA>
>            GENEID                         PRECEDEID        FOLLOWID
>       <character>                   <CharacterList> <CharacterList>
>   [1]        <NA> 100132565,100289656,100616223,...            2567
>   -------
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
> Note that while we get the 3' UTR annotation from the strandless VCF
> object we do not get it from the VRanges object with the positive strand.
> To make my point clear: this positive strand shows up when you coerce a
> strandless VCF object to a VRanges one, because positive strandness seems
> to be the convention for VRanges objects:
>
> as(vcf, VRanges)
> VRanges object with 1 range and 1 metadata column:
>       seqnames               ranges strand         ref              alt
>  totalDepth       refDepth       altDepth
>          <Rle>            <IRanges>  <Rle> <character> <characterOrRle>
> <integerOrRle> <integerOrRle> <integerOrRle>
>   [1]    chr15 [28356859, 28356859]      +           A                G
>          12              5              7
>         sampleNames softFilterMatrix |      QUAL
>       <factorOrRle>         <matrix> | <numeric>
>   [1]             A                  |      <NA>
>   -------
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>   hardFilters: NULL
>
>
> Of course, if I run locateVariants() with the argument ignore.strand=TRUE,
> then I get the expected annotation:
>
> locateVariants(vr, txdb, region=AllVariants(), ignore.strand=TRUE)
> GRanges object with 2 ranges and 9 metadata columns:
>       seqnames               ranges strand | LOCATION  LOCSTART LOCEND
>  QUERYID        TXID         CDSID
>          <Rle>            <IRanges>  <Rle> | <factor> <integer> <integer>
> <integer> <character> <IntegerList>
>   [1]    chr15 [28356859, 28356859]      + | threeUTR       677 677
>  1       55386
>   [2]    chr15 [28356859, 28356859]      + | threeUTR       677 677
>  1       55387
>            GENEID       PRECEDEID        FOLLOWID
>       <character> <CharacterList> <CharacterList>
>   [1]        8924
>   [2]        8924
>   -------
>   seqinfo: 1 sequence from an unspecified genome; no seqlengths
>
>
> So, my question is, given that VRanges objects are enforced to have a
> positive strand, would not be better to have ignore.strand=TRUE as default
> in locateVariants?
>
> Alternatively, I would recommend that locateVariants() issues a warning,
> or maybe an error, when the input object is VRanges and ignore.strand=FALSE.
>
> Finally, out of curiosity, why a VRanges object enforces the positive
> strand in all its genomic ranges? Would not be better just taking the
> strand of the CollapsedVCF object when coercing the CollapsedVCF object to
> VRanges?
>
>
> thanks!!
>
>
> robert.
>
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>

	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list