[BioC] ensemblVEP: passing on coverage / frequency information

Wed Apr 10 23:51:38 CEST 2013

Hi Thomas,

You've hit an interesting bug. The problem is with how CSQ is returned 
from Ensembl.

The documentation states that CSQ will be returned as an INFO column. 
However your file doesn't have INFO data and CSQ is being appended to 
the FORMAT variables instead.

This is a problem when parsing the vcf with scanVcf/readVcf because now 
the number of FORMAT variables doesn't match the number of FORMAT data 
fields for each sample. This is why you see NA for all geno variables, 
they aren't being parsed.

For example, the FORMAT field is coming back like this,
AD:DP:AP;CSQ=C|ENSG00000128602|ENST00000462420|...

but the samples only have data for the 3 original fields,
0,2:2:0,1

This seems like a bug and not intended behavior. Ensembl has even 
created a new INFO header line in the vcf returned to indicate the data 
would be in the INFO column,

##INFO=<ID=CSQ,Number=.,Type=String,Description="Consequence type as 
predicted by VEP. ....

If you want to see this yourself you can write the vcf to disk and open 
it in an editor.

param <- VEPParam(output=c(vcf=TRUE),
                   input=c(output_file="/path/myfile.vcf"))
ensemblVEP("example.vcf", param=param)

I will report this bug to Ensembl.

As for putting together the summary GRanges, yes, I would use the 
'VCFRowID' column. Using ex2.vcf as an example,

 > file <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
 > myparam <- VEPParam(output=c(vcf=TRUE))
 > vep <- ensemblVEP(file, param=myparam)
 > vcf <- readVcf(file, "hg19")
 > gr <- parseCSQToGRanges(vep)

Subset the full vcf on 'VCFRowID':
 > vcf_sub <- vcf[gr$VCFRowID]

Add info or geno data as mcols():
 > df <- DataFrame(mcols(gr), AF=info(vcf_sub)$AF, DP=geno(vcf_sub)$DP)
 > mcols(gr) <- df

I've checked in version 1.1.1 to devel. 'VCFRowID' is now returned as an 
integer so you can do the subsetting shown above (previously returned as 
factor). I also added the 'database' field to the VEPParam database 
options. This was a recent addition from Ensembl. Testing was done with 
the most recent VEP API version 71.

Valerie

On 04/10/2013 09:05 AM, Thomas Sandmann [guest] wrote:
> Dear Valerie,
>
> thanks a lot for making your great ensemblVEP package available.
>
> I have been using it to assess the consequences of variants detected by the VariantTools package (version 1.6.1).
> ensemblVEP retrieves the variantEffectPredictor output, but triggers a number of warnings (see below).
>
> library(ensemblVEP)
>
> ## example.vcf is available at http://dl.dropbox.com/u/126180/example.vcf
> vcf <- readVcf( "example.vcf", genome="hg19")
>
> ## the vcf object contains coverage information
> vcf
> geno(vcf)$DP
>
> ## running VEP triggers  warnings
> vep.param <- VEPParam()
> output( vep.param )$vcf <- TRUE
>
> vep <- ensemblVEP( "example.vcf",
>             genome="hg19",
>             param=vep.param
>             )
>
> warnings()
>
> 1: In doTryCatch(return(expr), name, parentenv, handler) :
>    record 1 (and others?) INFO 'AD:DP:AP' not found
> 2: In doTryCatch(return(expr), name, parentenv, handler)
>    record 1 (and others?) FORMAT '0,2' not found
> 3: In doTryCatch(return(expr), name, parentenv, handler) :
>    record 1 (and others?) FORMAT '2' not found
>    ...
>
> I think these warnings refer to the "geno" slot of the vcf file. When I request a VCF object as output from ensemblVEP, the object contains the same elements in its geno slot as the original vcf input file, but they only contain NAs. Is this expected or should the geno slot be passed on to the VCF object generated by ensemblVEP ?
>
> My final objective is to obtain a GRanges or data.frame containing both the predicted consequences and the coverage / frequencies from the vcf input file for each variant. I have seen that the parseCSQToGRanges returns a "VCFRowID" column associating each row with the original entry in the VCF object.
>
> Would you recommend to use this column to extract the corresponding rows from the vectors / arrays stored in the "geno" slot ? Or is there a simpler, more elegant solution you could point me to ?
>
> Thanks a lot !
> Thomas
>
>   -- output of sessionInfo():
>
> R version 3.0.0 (2013-04-03)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=C                 LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] parallel  stats     graphics  grDevices utils     datasets  methods
> [8] base
>
> other attached packages:
> [1] ensemblVEP_1.0.0        VariantAnnotation_1.6.1 Rsamtools_1.12.0
> [4] Biostrings_2.28.0       GenomicRanges_1.12.1    IRanges_1.18.0
> [7] BiocGenerics_0.6.0
>
> loaded via a namespace (and not attached):
>   [1] AnnotationDbi_1.22.1   Biobase_2.20.0         biomaRt_2.16.0
>   [4] bitops_1.0-5           BSgenome_1.28.0        compiler_3.0.0
>   [7] DBI_0.2-5              GenomicFeatures_1.12.0 RCurl_1.95-4.1
> [10] RSQLite_0.11.2         rtracklayer_1.20.0     stats4_3.0.0
> [13] tools_3.0.0            XML_3.96-1.1           zlibbioc_1.6.0
>
>
> --
> Sent via the guest posting facility at bioconductor.org.
>