[Bioc-devel] VariantAnnotation bug when loading multiple regions from vcf

Moritz Gerstung mg14 at sanger.ac.uk
Tue Mar 5 16:05:58 CET 2013


Hi,

I discovered the following problem in the geno() elements of the VCF object when I load multiple regions from a .vcf file:

## Correct
> tmp = readVcf(system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation"), "hg19")[1:10]
> geno(tmp)$GT
           HG00096 HG00097 HG00099 HG00100 HG00101
rs7410291   "0|0"   "0|0"   "1|0"   "0|0"   "0|0"  
rs147922003 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs114143073 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs141778433 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs182170314 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs115145310 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs186769856 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs77627744  "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs193230365 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs9627788   "0|0"   "0|0"   "1|0"   "0|0"   "0|0"  

## Wrong: Load regions separately using ScanVcfParam(which= )
> tmp2 = readVcf(system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation"), "hg19", param=ScanVcfParam(which=rowData(tmp)))
> geno(tmp2)$GT
           HG00096 HG00097 HG00099 HG00100 HG00101
rs7410291   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs147922003 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs114143073 "1|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs141778433 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs182170314 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs115145310 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs186769856 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs77627744  "0|0"   "0|0"   "0|0"   "0|0"   "1|0"  
rs193230365 "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  
rs9627788   "0|0"   "0|0"   "0|0"   "0|0"   "0|0"  

Note how the elements have been added by column instead of by row.

The following solved the issue for me:

$svn diff VariantAnnotation/
Index: VariantAnnotation/R/AllUtilities.R
===================================================================
--- VariantAnnotation/R/AllUtilities.R	(revision 73920)
+++ VariantAnnotation/R/AllUtilities.R	(working copy)
@@ -62,8 +62,8 @@
                    cmb
                } else {
                    trans <- lapply(elt, t)
-                    cmb <- matrix(t(do.call(c, trans)), 
-                                  length(lst[["rowData"]]), d[2])
+                    cmb <- matrix(do.call(c, trans), 
+                                  length(lst[["rowData"]]), d[2], byrow=TRUE)
                    cmb
                }
            }) 


My suspicion is that the transposition t() on line 65 didn't do the desired job, as do.call() returned a vector.. 

I don't fully understand the code, so it may be worth double checking.
For example, I didn't check the case when the individual geno entries contain more than a single value for each variant and sample.

The same problem also occurred for VariantAnnotation_1.4.9.

Best wishes,

Moritz



P.S.

> sessionInfo()
R Under development (unstable) (2013-02-05 r61843)
Platform: x86_64-apple-darwin11.4.2 (64-bit)

locale:
[1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods  
[8] base     

other attached packages:
[1] VariantAnnotation_1.5.41 Rsamtools_1.11.20        Biostrings_2.27.11      
[4] biomaRt_2.15.0           GenomicRanges_1.11.34    IRanges_1.17.35         
[7] BiocGenerics_0.5.6       BiocInstaller_1.9.6      RColorBrewer_1.0-5      

loaded via a namespace (and not attached):
[1] AnnotationDbi_1.21.12   Biobase_2.19.2          bitops_1.0-5           
[4] BSgenome_1.27.1         DBI_0.2-5               GenomicFeatures_1.11.13
[7] RCurl_1.95-3            RSQLite_0.11.2          rtracklayer_1.19.9     
[10] stats4_3.0.0            tools_3.0.0             XML_3.95-0.1           
[13] zlibbioc_1.5.0 



---
Moritz Gerstung
Postdoctoral Fellow

Cancer Genome Project
Wellcome Trust Sanger Institute
Wellcome Trust Genome Campus
Hinxton, Cambridgeshire
CB10 1SA
UK



-- 
 The Wellcome Trust Sanger Institute is operated by Genome Research 
 Limited, a charity registered in England with number 1021457 and a 
 company registered in England with number 2742969, whose registered 
 office is 215 Euston Road, London, NW1 2BE.



More information about the Bioc-devel mailing list