[BioC] problems with strand in predictCoding

Fri Apr 20 03:59:26 CEST 2012

On Thu, Apr 19, 2012 at 9:46 PM, Jeremiah Degenhardt
<degenhardt.jeremiah at gene.com> wrote:
> Hello BioC,
>
> I am using the predictCoding function in the VariantAnnotation package
> and have run into a few issues.
>
> The first issue that I found is a small one. While I can supply any
> txdb object that I want to the predictCoding, the function seems to
> still be looking for TxDb.Hsapiens.UCSC.hg19.knownGene. Running
> predict coding without this package installed results in the following
> error:
>
> Error in isCircular(TxDb.Hsapiens.UCSC.hg19.knownGene) :
>   error in evaluating the argument 'x' in selecting a method for
> function 'isCircular': Error: object
> 'TxDb.Hsapiens.UCSC.hg19.knownGene' not found
>
> The second set of issues are a little more subtle. It seems that
> predictCoding is not properly handling the strand of the variants, at
> least in my opinion.
>
> If I provide an unstranded GRanges with variants that overlap a gene
> on the negative strand, the predictCoding function will not reverse
> complement the varAllele resulting in an incorrect functional
> consequence prediction. Here is some code to generate an example
>
> library(VariantAnnotation)
> library(BSgenome.Hsapiens.UCSC.hg19)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
> GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1),
> variant = DNAStringSet("T"))
> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
>
> GRanges with 4 ranges and 13 elementMetadata cols:
>      seqnames               ranges strand |        variant
> varAllele
>         <Rle>            <IRanges>  <Rle> | <DNAStringSet>
> <DNAStringSet>
>  [1]    chr17 [63554271, 63554271]      * |              T
>  T
>  [2]    chr17 [63554271, 63554271]      * |              T
>  T
>  [3]    chr17 [63554271, 63554271]      * |              T
>  T
>  [4]    chr17 [63554271, 63554271]      * |              T
>  T
>          cdsLoc              proteinLoc   queryID        txID
> cdsID
>       <IRanges> <CompressedIntegerList> <integer> <character>
> <integer>
>  [1] [468, 468]                     156         1       65536
> 193561
>  [2] [468, 468]                     156         1       65537
> 193561
>  [3] [468, 468]                     156         1       65538
> 193561
>  [4] [468, 468]                     156         1       65539
> 193564
>           geneID consequence       refCodon       varCodon
> refAA
>      <character>    <factor> <DNAStringSet> <DNAStringSet>
> <AAStringSet>
>  [1]        8313  synonymous            TAC            TAT
>  Y
>  [2]        8313  synonymous            TAC            TAT
>  Y
>  [3]        8313  synonymous            TAC            TAT
>  Y
>  [4]        8313  synonymous            TAC            TAT
>  Y
>              varAA
>      <AAStringSet>
>  [1]             Y
>  [2]             Y
>  [3]             Y
>  [4]             Y
>  ---
>  seqlengths:
>   chr17
>      NA
>
> Running this the function predicts this to be TAC> TAT, a synonymous
> change. However, as the gene is on the negative strand the actual
> result of this variant should be TAC> TAA or a stop mutation.
>
> If I instead give predictCoding a stranded GRanges to let the function
> know that the base reported is with respect to the positive strand I
> get this result:
>
>> GR <- GRanges(seqnames="chr17", IRanges(start = 63554271, width=1), strand="+", variant = DNAStringSet("T"))
>> predictCoding(GR, txdb, Hsapiens, values(GR)$variant)
> GRanges with 0 ranges and 0 elementMetadata cols:
>
>   seqnames    ranges strand
>      <Rle> <IRanges>  <Rle>
>  ---
>  seqlengths:
>
>
> Now instead of reverse complementing the varAllele, and giving me the
> correct annotation, the function simply misses that the mutation
> overlaps with any gene and returns an empty GRanges.
>
> This issue is not really a bug in predictCoding, but results from what
> is (again in my opinion) an unfortunate treatment of the negative and
> positive strand as separate entities in GRanges. From a biological
> prospective this almost never makes sense and is not how I would
> expect the functions in GRanges to behave. Just because a gene is
> annotated on the negative strand does not mean that I don't want to
> know it overlaps with a gene on the positive strand. Maybe there are
> cases where this is the desired behavior, so it might make sense to
> add this a switch you could turn on or off but in most cases this will
> simply lead to erroneous results such as the one above. I think the
> most useful information that comes from knowing the strand is knowing
> when you need to reverse complement a sequence for the underlying
> position of interest.
>
> A third possible option here would be to call the function twice, once
> with for the positive strand and once for the negative strand with the
> variants reverse complemented and then merge the results. This
> however, seems quite inefficient and not like something that I should
> have to do.
>
> >From my perspective the proper behavior of predictCoding with respect
> to strand would be to treat unstranded GRanges as positive strand as
> this is the reference strand for things like the genome builds and vcf
> files. Then the variants should overlap positive and negative strand
> genes and should be reverse complemented for consequence prediction on
> the negative strand genes. It should also allow overlap of negative
> and positive stranded variants with both negative and positive
> stranded genes, but properly reverse complement the variant in each
> case to get the proper consequence.

I agree with Jeremiah that the treatment of unstranded variants should
be to default to treating them as on the positive strand and reverse
complement them for negative strand genes.  All other variant
prediction softwares that I know of make this assumption and the VCF
format itself seems to make something of an implicit assumption on
this point.

Sean

> thanks in advance,
>
> Jeremiah
>
> here is my session info incase it is needed:
>
>> sessionInfo()
> R Under development (unstable) (2012-04-11 r58985)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>  [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>  [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>  [7] LC_PAPER=C                 LC_NAME=C
>  [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
>  [1] TxDb.Hsapiens.UCSC.hg19.knownGene_2.7.1
>  [2] GenomicFeatures_1.8.1
>  [3] AnnotationDbi_1.18.0
>  [4] Biobase_2.16.0
>  [5] BSgenome.Hsapiens.UCSC.hg19_1.3.17
>  [6] BSgenome_1.24.0
>  [7] VariantAnnotation_1.2.5
>  [8] Rsamtools_1.8.1
>  [9] Biostrings_2.24.1
> [10] GenomicRanges_1.8.3
> [11] IRanges_1.14.2
> [12] BiocGenerics_0.2.0
>
> loaded via a namespace (and not attached):
>  [1] biomaRt_2.12.0     bitops_1.0-4.1     compiler_2.16.0
> DBI_0.2-5
>  [5] grid_2.16.0        lattice_0.20-6     Matrix_1.0-6
> RCurl_1.91-1
>  [9] RSQLite_0.11.1     rtracklayer_1.16.1 snpStats_1.6.0
> splines_2.16.0
> [13] stats4_2.16.0      survival_2.36-12   tools_2.16.0
> XML_3.9-4
> [17] zlibbioc_1.2.0
>
> --
> Jeremiah Degenhardt, Ph.D.
> Computational Biologist
> Bioinformatics and Computational Biology
> Genentech, Inc.
> degenhardt.jeremiah at gene.com
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor