[BioC] VCF as custom variant annotation database

Wed Jan 8 19:09:03 CET 2014

Hi Francesco,

On 01/08/2014 05:59 AM, Francesco Lescai wrote:
> Hi there,
> following this suggestion I’ve converted the results of a number of
> analyses into VCF files.
> I believe however I never exploited the full potential of the VCF object
> and I’m not sure which would be the most efficient way for a targeted
> query, i.e. check if a variant from a VCF object is present in any of
> the others, like I would to with an SQL database.
>
> Ideally, I’d like to add a metadata field to one of my objects, that
> could be reported in the INFO field like any other annotation when I
> then write the VCF on a file.
> To do this I’d have to add the field both in the info(header(vcf)) and
> in the info(vcf), with/after the result of my query.
>
> Two questions therefore:
> 1) how do you add an INFO field in a way that doesn’t generate error
> from within R

 > fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
 > vcf <- readVcf(fl, "hg19")

Original info variables.

 > vcf <- readVcf(fl, "hg19")
 > names(info(vcf))
[1] "NS" "DP" "AF" "AA" "DB" "H2"
 > dim(info(vcf))
[1] 5 6

Add new variables with the info<- setter.

 > newInfo <- info(vcf)
 > newInfo$foo <- 1:5
 > info(vcf) <- newInfo
 > names(info(vcf))
[1] "NS"  "DP"  "AF"  "AA"  "DB"  "H2"  "foo"
 > info(vcf)$foo
[1] 1 2 3 4 5

>
> 2) what’s the most computationally efficient way to query for
> presence/absence of a variant (I thought about a match on the names, a
> countOverlaps on the GRanges etc, but that doesn’t exploit the tabix
> index, and still seems to me a bit clumsy).

You mention the tabix index which is only applicable to the file on disk 
and not the VCF class in R. I'm not sure if you are interested in 
querying the file on disk or class in R so below I've addressed both.

Query vcf file on disk:

To match variant position in the file using a ScanVcfParam with GRanges 
is best if you want the VCF class as the return object.

Alternatively you could use filterVcf() where the output is a reduced 
vcf file on disk subset by the filter criteria. filterVcf() can subset 
by position, character string or specific field values and has options 
of 'prefilter' and 'filter'. A 'prefilter' searches unparsed lines from 
the file for a specific character string. 'filters' operate on the data 
once it has been parsed into the VCF object (this allows 'filters' to be 
defined in VCF-class syntax).

Query VCF class in R:

To match on position I'd use subsetByOverlaps() on the VCF GRanges if 
you want a GRanges back. If you only want a count of the hits (i.e., 
location if present) then use countOverlaps(). This is essentially what 
you're doing now but if chromosome and/or strand are important then this 
is the best way to go. Others free to chime in.

You could gain a little performance if the data are large and chromosome 
and strand are not important. Using countOverlaps() on the ranges() 
extracted from the GRanges will save a some time on big data.

To match on variant name I would match on the colnames(vcf) as you are 
doing.

Are any of these operations taking prohibitively long? Please let me 
know if they are.

Valerie

>
> thanks for any advice on how to best use this class,
> best
> Francesco
>
>
>
> On 3 Oct 2013, at 15:53, Vincent Carey <stvjc at channing.harvard.edu
> <mailto:stvjc at channing.harvard.edu>> wrote:
>
>>
>>
>>
>> On Thu, Oct 3, 2013 at 7:18 AM, Francesco Lescai
>> <francesco.lescai at hum-gen.au.dk
>> <mailto:francesco.lescai at hum-gen.au.dk>> wrote:
>>
>>     Hi guys,
>>     I would like to store lists of variants (SNPs and INDELs) in a
>>     convenient format to be used frequently with other bioconductor
>>     packages, and I was thinking to store them as Annotation databases.
>>     In this way I could store and query my own list of variants with
>>     annotations coming from my work and/or previous experiments.
>>
>>
>>     Most of the examples/vignettes I could find like SQLForge are
>>     however gene-centric.
>>     Can I create a custom one, maybe using SNPlocs as template?
>>
>>
>> Have you considered tabix-indexed VCF?  This can be stored in a
>> package and Rsamtools/VariantAnnotation facilities can be used for
>> targeted querying.  The SNPlocs containers are likely to be redesigned
>> in the near future.
>>
>>     thanks for any suggestions,
>>     Francesco
>>
>>
>>
>>             [[alternative HTML version deleted]]
>>
>>     _______________________________________________
>>     Bioconductor mailing list
>>     Bioconductor at r-project.org <mailto:Bioconductor at r-project.org>
>>     https://stat.ethz.ch/mailman/listinfo/bioconductor
>>     Search the archives:
>>     http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>
>
>

-- 
Valerie Obenchain

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B155
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: vobencha at fhcrc.org
Phone:  (206) 667-3158
Fax:    (206) 667-1319