[BioC] VariantAnnotation::readVcf is reordering the FORMAT field of my VCF file
Valerie Obenchain
vobencha at fhcrc.org
Tue Mar 5 05:25:31 CET 2013
Hi Pete,
The order is taken from the header. readVcf() attempts to assign data
type and order based on header information. If there is no header line
then no data type is assigned and the variable is left unparsed.
> hd <- scanVcfHeader('readVcf_problem.vcf.gz')
> geno(hd)
DataFrame with 5 rows and 3 columns
Number Type
<character> <character>
AD . Integer
DP 1 Integer
GQ 1 Integer
GT 1 String
PL G Integer
...
...
> vcf <- readVcf("readVcf_problem.vcf.gz", "hg19")
> geno(vcf)
List of length 5
names(5): AD DP GQ GT PL
I haven't run into 'GT' being out of order before - or at least causing
problems downstream. Your quick fix is to move your GT header line above
'AD'. I will look into keeping 'GT' first reguardles of header order.
Thanks for pointing this out.
Valerie
On 03/04/13 18:32, hickey at wehi.EDU.AU wrote:
> Hi Valerie,
>
> readVcf is reordering the fields in the FORMAT field of my VCF file
> which is causing me problems downstream. I am using VariantAnnotation
> to subset my VCF file and then write the subset to file using
> writeVcf. However, the reordering of the FORMAT fields means by new
> VCF does not comply with VCF spec, specifically the GT-field is not
> the first sub-field of the FORMAT field. A reproducible example follows.
>
> Thanks,
> Pete
>
> > sessionInfo()
> R version 2.15.2 Patched (2013-02-10 r61900)
> Platform: x86_64-apple-darwin9.8.0/x86_64 (64-bit)
>
> locale:
> [1] en_AU.UTF-8/en_AU.UTF-8/en_AU.UTF-8/C/en_AU.UTF-8/en_AU.UTF-8
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] VariantAnnotation_1.4.9 Rsamtools_1.10.2 Biostrings_2.26.3
> GenomicRanges_1.10.7
> [5] IRanges_1.16.6 BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] AnnotationDbi_1.20.5 Biobase_2.18.0 biomaRt_2.14.0
> bitops_1.0-4.2
> [5] BSgenome_1.26.1 DBI_0.2-5 GenomicFeatures_1.10.2
> parallel_2.15.2
> [9] RCurl_1.95-3 RSQLite_0.11.2 rtracklayer_1.18.2
> stats4_2.15.2
> [13] tools_2.15.2 XML_3.95-0.1 zlibbioc_1.4.0
>
> #### Bug report: readVcf() reordering FORMAT column of VCF file in
> violation of VCF spec ####
> library(VariantAnnotation)
> ## Download my example VCF
> download.file('http://dl.dropbox.com/u/17421746/readVcf_problem.vcf.gz',
> destfile = '/tmp/readVcf_problem.vcf.gz')
> ## Cat the VCF: Note that the FORMAT column is GT:AD:DP:GQ:PL
> system(command = 'zcat /tmp/readVcf_problem.vcf.gz') # Might need to
> alter this command depending on system; works on my linux machine but
> need 'gzcat' on my mac
> ## Read in the VCF using readVcf()
> problem_vcf <- readVcf('/tmp/readVcf_problem.vcf.gz', genome = 'mm9')
> ## Note the order of geno(), which stores each field in the FORMAT
> column as a list element, is "AD DP GQ GT PL"
> geno(problem_vcf)
> ## Write the file back to disk using writeVcf()
> writeVcf(problem_vcf, '/tmp/problem.vcf', index = TRUE)
> ## Cat the newly written version of the VCF: Note that the FORMAT
> field is now AD:DP:GQ:GT:PL, which is contrary to VCF spec which says
> of the FORMAT field: "The first sub-field must always be the genotype
> (GT) if it is present" (see
> http://www.1000genomes.org/wiki/Analysis/Variant%20Call%20Format/vcf-variant-call-format-version-41)
> system(command = 'zcat /tmp/problem.vcf.gz') # Might need to alter
> this command depending on system; works on my linux machine but need
> 'gzcat' on my mac
>
> ## No such problem with the example VCF
> fl <- system.file("extdata", "ex2.vcf", package="VariantAnnotation")
> good_vcf <- readVcf(fl, genome = 'hg19')
> ## Sub-fields of FORMAT field are correctly ordered
> geno(good_vcf)
>
> --------------------------------
> Peter Hickey,
> PhD Student/Research Assistant,
> Bioinformatics Division,
> Walter and Eliza Hall Institute of Medical Research,
> 1G Royal Parade, Parkville, Vic 3052, Australia.
> Ph: +613 9345 2324
>
> hickey at wehi.edu.au <mailto:hickey at wehi.edu.au>
> http://www.wehi.edu.au
>
>
> ______________________________________________________________________
> The information in this email is confidential and inte...{{dropped:6}}
More information about the Bioconductor
mailing list