[BioC] ensemblVEP: passing on coverage / frequency information

Wed Apr 10 18:05:50 CEST 2013

Dear Valerie,

Thanks a lot for making your great ensemblVEP package available.

I have been using it to assess the consequences of variants detected by the VariantTools package (version 1.6.1).  
ensemblVEP retrieves the variantEffectPredictor output, but triggers a number of warnings (see below).

library(ensemblVEP)

## example.vcf is available at http://dl.dropbox.com/u/126180/example.vcf
vcf <- readVcf( "example.vcf", genome="hg19")

## the vcf object contains coverage information
vcf
geno(vcf)$DP

## running VEP triggers warnings
vep.param <- VEPParam()
output( vep.param )$vcf <- TRUE

vep <- ensemblVEP( "example.vcf",
           genome="hg19",
           param=vep.param
           )

warnings()

1: In doTryCatch(return(expr), name, parentenv, handler) : 
  record 1 (and others?) INFO 'AD:DP:AP' not found 
2: In doTryCatch(return(expr), name, parentenv, handler) :
  record 1 (and others?) FORMAT '0,2' not found
3: In doTryCatch(return(expr), name, parentenv, handler) :
  record 1 (and others?) FORMAT '2' not found
  ...

I think these warnings refer to the "geno" slot of the VCF object read from the vcf file. When I request a VCF object as output from ensemblVEP, it contains the same elements in its geno slot as the original vcf input, but they contain only NAs. Is this expected, or should the geno slot be passed on to the VCF object generated by ensemblVEP?

My final objective is to obtain a GRanges or data.frame containing, for each variant, both the predicted consequences and the coverage / frequencies from the vcf input file. I have seen that parseCSQToGRanges returns a "VCFRowID" column associating each row with the original entry in the VCF object.

Would you recommend using this column to extract the corresponding rows from the vectors / arrays stored in the "geno" slot? Or is there a simpler, more elegant solution you could point me to?
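
To make the idea concrete, here is a rough sketch of the lookup I have in mind. It uses base R only, with made-up stand-ins for the real objects: "dp" plays the role of geno(vcf)$DP and "csq" plays the role of the parsed consequences. The variant names, sample name and values are hypothetical, and I am assuming that VCFRowID holds the row names of the original VCF object.

```r
## Hypothetical stand-in for geno(vcf)$DP:
## one row per variant, one column per sample.
dp <- matrix(c(12L, 30L, 7L), ncol = 1,
             dimnames = list(c("chr1:100_A/G", "chr1:200_C/T", "chr2:50_G/A"),
                             "sample1"))

## Hypothetical stand-in for the parsed consequences: a variant can have
## several rows (e.g. one per transcript), and each row points back to its
## originating VCF row through VCFRowID (assumed to be the VCF row name).
csq <- data.frame(
    VCFRowID    = c("chr1:100_A/G", "chr1:100_A/G", "chr2:50_G/A"),
    Consequence = c("missense_variant", "intron_variant",
                    "synonymous_variant"),
    stringsAsFactors = FALSE
)

## Look up the coverage of the originating variant for every consequence row.
csq$DP <- unname(dp[csq$VCFRowID, "sample1"])
csq
```

With the real objects, the same indexing would be applied to the matrices in the geno slot of the input VCF, assuming the IDs match its row names.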

Thanks a lot!
Thomas

 -- output of sessionInfo(): 

R version 3.0.0 (2013-04-03)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
 [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
 [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods
[8] base

other attached packages:
[1] ensemblVEP_1.0.0        VariantAnnotation_1.6.1 Rsamtools_1.12.0
[4] Biostrings_2.28.0       GenomicRanges_1.12.1    IRanges_1.18.0
[7] BiocGenerics_0.6.0

loaded via a namespace (and not attached):
 [1] AnnotationDbi_1.22.1   Biobase_2.20.0         biomaRt_2.16.0
 [4] bitops_1.0-5           BSgenome_1.28.0        compiler_3.0.0
 [7] DBI_0.2-5              GenomicFeatures_1.12.0 RCurl_1.95-4.1
[10] RSQLite_0.11.2         rtracklayer_1.20.0     stats4_3.0.0
[13] tools_3.0.0            XML_3.96-1.1           zlibbioc_1.6.0

--
Sent via the guest posting facility at bioconductor.org.