[BioC] Bug in subsetting vcf file by subject on more than 5 samples

Robert Castelo robert.castelo at upf.edu
Tue Dec 17 12:28:51 CET 2013


hi,

when using the recently added feature of subsetting a vcf file by 
subject, described initially here:

https://stat.ethz.ch/pipermail/bioconductor/2013-April/052146.html

i found out that it breaks when the samples to subset are more than 5.
here is the code reproducing the problem:

suppressPackageStartupMessages(library(VariantAnnotation))
fl <- system.file("extdata", "gl_chr1.vcf", package="VariantAnnotation")
hdr <- scanVcfHeader(fl)
length(samples(hdr))
[1] 85

so the above example file has 85 samples, let's try to subset on the 
first 6 ones:

param <- ScanVcfParam(samples=samples(hdr)[1:6])
vcf <- readVcf(fl, "hg19", param)
Warning message:
In .vcf_map_samples(samples(hdr), samples) : samples not in file: ‘...’

next to this warning, the resulting 'CollapsedVCF' object has only 4 
samples instead of 6:

dim(vcf)
[1] 9 4

a hint about what might be happening comes from the sample names in the 
parameter object and the '...' referred to in the warning:

vcfSamples(param)
[1] "NA06984" "NA06986" "..."     "NA07000" "NA07037"

this happens in both the release and devel versions, here's my 
sessionInfo() for the release version:

R version 3.0.2 (2013-09-25)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.UTF8       LC_NUMERIC=C 
LC_TIME=en_US.UTF8
  [4] LC_COLLATE=en_US.UTF8     LC_MONETARY=en_US.UTF8 
LC_MESSAGES=en_US.UTF8
  [7] LC_PAPER=en_US.UTF8       LC_NAME=C                 LC_ADDRESS=C 

[10] LC_TELEPHONE=C            LC_MEASUREMENT=en_US.UTF8 
LC_IDENTIFICATION=C

attached base packages:
[1] parallel  stats     graphics  grDevices utils     datasets  methods 
   base

other attached packages:
  [1] VariantAnnotation_1.8.8 Rsamtools_1.14.2        Biostrings_2.30.1 

  [4] GenomicRanges_1.14.4    XVector_0.2.0           IRanges_1.20.6 

  [7] BiocGenerics_0.8.0      vimcom_0.9-92           setwidth_1.0-3 

[10] colorout_1.0-1

loaded via a namespace (and not attached):
  [1] AnnotationDbi_1.24.0   Biobase_2.22.0         biomaRt_2.18.0 
    bitops_1.0-6
  [5] BSgenome_1.30.0        DBI_0.2-7 
GenomicFeatures_1.14.2 RCurl_1.95-4.1
  [9] RSQLite_0.11.4         rtracklayer_1.22.0     stats4_3.0.2 
    tools_3.0.2
[13] XML_3.98-1.1           zlibbioc_1.8.0

thanks!
robert.

-- 
Robert Castelo, PhD
Associate Professor
Dept. of Experimental and Health Sciences
Universitat Pompeu Fabra (UPF)
Barcelona Biomedical Research Park (PRBB)
Dr Aiguader 88
E-08003 Barcelona, Spain
telf: +34.933.160.514
fax: +34.933.160.550



More information about the Bioconductor mailing list