[Bioc-devel] contradictions in the return value of predictCoding() from VariantAnnotation

Thu Mar 7 14:43:19 CET 2013

Here is the second attachment ("variants.vcf") that was missing from the
first message.

Michael

On 03/07/2013 02:40 PM, Michael Stadler wrote:
> Dear Valerie and colleagues,
> 
> I have recently started using the VariantAnnotation package, which is a
> huge asset for much of my work with sequence variants - thank you!
> 
> I have a question regarding the following example (it makes use of two
> small text files, "annot.gtf" and "variants.vcf", attached to this message):
> 
> library(VariantAnnotation)
> library(GenomicFeatures)
> library(BSgenome.Celegans.UCSC.ce6)
> 
> # build a TranscriptDb from "annot.gtf"
> chrominfo <- data.frame(chrom=as.character(seqnames(Celegans)),
>                         length=seqlengths(Celegans),
>                         is_circular=FALSE)
> txdb <- makeTranscriptDbFromGFF(file="annot.gtf",
>                                 format="gtf",
>                                 chrominfo=chrominfo,
>                                 dataSource="WormBase v. WS190",
>                                 species="Caenorhabditis elegans")
> txdb
> #TranscriptDb object:
> #| Db type: TranscriptDb
> #| Supporting package: GenomicFeatures
> #| Data source: WormBase v. WS190
> #| Organism: Caenorhabditis elegans
> #| miRBase build ID: NA
> #| transcript_nrow: 7
> #| exon_nrow: 42
> #| cds_nrow: 42
> #| Db created by: GenomicFeatures package from Bioconductor
> #| Creation time: 2013-03-07 14:21:08 +0100 (Thu, 07 Mar 2013)
> #| GenomicFeatures version at creation time: 1.11.14
> #| RSQLite version at creation time: 0.11.2
> #| DBSCHEMAVERSION: 1.0
> 
> # read in sequence variants
> vcf <- readVcf("variants.vcf", genome="ce6")
> dim(vcf)
> #[1] 3 1
> 
> # predict the impact on coding sequences
> aa <- predictCoding(query=vcf, subject=txdb, seqSource=Celegans)
> length(aa)
> #[1] 7
> 
> 
> # My question is related to the impact of variant "chrV:15822727":
> # This SNV is overlapping two transcripts (same base, plus strand),
> # yet the reported VARCODON values are different (GAT -> TAT/AAT):
> aa[names(aa)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]
> 
> #GRanges with 2 ranges and 5 metadata columns:
> #                seqnames               ranges strand |
> #                   <Rle>            <IRanges>  <Rle> |
> #  chrV:15822727     chrV [15822727, 15822727]      + |
> #  chrV:15822727     chrV [15822727, 15822727]      + |
> #                           REF                ALT      varAllele
> #                <DNAStringSet> <DNAStringSetList> <DNAStringSet>
> #  chrV:15822727              G                  A              A
> #  chrV:15822727              G                  A              A
> #                      REFCODON       VARCODON
> #                <DNAStringSet> <DNAStringSet>
> #  chrV:15822727            GAT            TAT
> #  chrV:15822727            GAT            AAT
> #  ---
> #  seqlengths:
> #   chrI chrV
> #     NA   NA
> 
> 
> # interestingly, when altering the annotation or the set of variants,
> # this contradictions disapears (GAT -> AAT):
> vcf2 <- vcf[-1] # remove the first SNV
> aa2 <- predictCoding(query=vcf2, subject=txdb, seqSource=Celegans)
> length(aa2)
> #[1] 5
> aa2[names(aa2)=="chrV:15822727",c("REF","ALT","varAllele","REFCODON","VARCODON")]
> 
> #GRanges with 2 ranges and 5 metadata columns:
> #                seqnames               ranges strand |
> #                   <Rle>            <IRanges>  <Rle> |
> #  chrV:15822727     chrV [15822727, 15822727]      + |
> #  chrV:15822727     chrV [15822727, 15822727]      + |
> #                           REF                ALT      varAllele
> #                <DNAStringSet> <DNAStringSetList> <DNAStringSet>
> #  chrV:15822727              G                  A              A
> #  chrV:15822727              G                  A              A
> #                      REFCODON       VARCODON
> #                <DNAStringSet> <DNAStringSet>
> #  chrV:15822727            GAT            AAT
> #  chrV:15822727            GAT            AAT
> #  ---
> #  seqlengths:
> #   chrI chrV
> #     NA   NA
> 
> Do you know why this could be the case?
> Regards,
> Michael
> 
> 
> Here is my envirnoment:
> sessionInfo()
> R Under development (unstable) (2013-02-25 r62061)
> Platform: x86_64-unknown-linux-gnu (64-bit)
> 
> locale:
> [1] C
> 
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets
> [7] methods   base
> 
> other attached packages:
>  [1] BSgenome.Celegans.UCSC.ce6_1.3.19
>  [2] BSgenome_1.27.1
>  [3] GenomicFeatures_1.11.14
>  [4] AnnotationDbi_1.21.12
>  [5] Biobase_2.19.3
>  [6] VariantAnnotation_1.5.42
>  [7] Rsamtools_1.11.21
>  [8] Biostrings_2.27.11
>  [9] GenomicRanges_1.11.35
> [10] IRanges_1.17.35
> [11] BiocGenerics_0.5.6
> [12] RColorBrewer_1.0-5
> 
> loaded via a namespace (and not attached):
>  [1] DBI_0.2-5          RCurl_1.95-4.1     RSQLite_0.11.2
>  [4] XML_3.95-0.1       biomaRt_2.15.0     bitops_1.0-5
>  [7] rtracklayer_1.19.9 stats4_3.0.0       tools_3.0.0
> [10] zlibbioc_1.5.0
> 
> 
> 
> _______________________________________________
> Bioc-devel at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/bioc-devel
> 

-- 
--------------------------------------------
Michael Stadler, PhD
Head of Computational Biology
Friedrich Miescher Institute
Basel (Switzerland)
Phone : +41 61 697 6492
Fax   : +41 61 697 3976
Mail  : michael.stadler at fmi.ch
--------------------------------------------