[Bioc-devel] VariantAnnotation::readVcf() sets the wrong seqlevelsStyle in devel

Robert Castelo robert@c@@te|o @end|ng |rom up|@edu
Thu Aug 6 16:42:38 CEST 2020


hi Hervé,

thank you very much for your clarifications, but this behavior is 
different in release and has been different until now, this is BioC 3.11:

library(VariantAnnotation)

fl <- system.file("extdata", "chr22.vcf.gz", package="VariantAnnotation")
vcf <- readVcf(fl, "hg19")
seqlevels(vcf)
[1] "22"
seqlevelsStyle(vcf)
[1] "NCBI"    "Ensembl"

i appreciate that the behavior now in devel is more consistent, i 
actually never understood the need to specify the 'genome="hg19"' 
argument since this in principle can be figured out from the VCF header 
information. However, the documentation has become right now confusing, 
if you go to subsection 2.1 and 2.1.2 from the introductory vignette, it 
shows using readVcf() with "hg19" but then the sequence names are 
literally what they are in the VCF file (NCBI style)

because of the large user base of VariantAnnotation (top-49 download) 
and the many possible reverse dependencies downstream, i'd suggest that 
either readVcf() issues an error or, maybe even better, overrides the 
sequence level style in the VCF file maybe with a warning, when the 
'genome' argument does not match the sequence style of the VCF file.

cheers,

robert.


On 04/08/2020 18:29, Hervé Pagès wrote:
> Hi Robert,
>
> The VCF file uses "22" for the chromosome name which is the name used 
> by NCBI. So explicitly specifying "hg19" in the readVcf() call is like 
> saying that this chromosome name is a UCSC name which is why 
> seqlevelsStyle() gets confused later.
>
> If you specify the name of the NCBI assembly, things work as expected:
>
>   fl <- system.file("extdata", "chr22.vcf.gz", 
> package="VariantAnnotation")
>   vcf <- readVcf(fl, "GRCh37")
>   seqlevels(vcf)
>   # [1] "22"
>   seqlevelsStyle(vcf)
>   # [1] "NCBI"
>   seqlevelsStyle(vcf) <- "UCSC"
>   seqlevels(vcf)
>   # [1] "chr22"
>
> Or, if you don't know what reference genome the file is based on, 
> don't specify it:
>
>   fl <- system.file("extdata", "chr22.vcf.gz", 
> package="VariantAnnotation")
>   vcf <- readVcf(fl)
>   seqlevels(vcf)
>   # [1] "22"
>   seqlevelsStyle(vcf)
>   # [1] "NCBI"    "Ensembl"
>   seqlevelsStyle(vcf) <- "UCSC"
>   seqlevels(vcf)
>   # [1] "chr22"
>
> or specify it later:
>
>   genome(vcf) <- "hg19"
>   seqinfo(vcf)
>   # Seqinfo object with 1 sequence from hg19 genome; no seqlengths:
>   #   seqnames seqlengths isCircular genome
>   #   chr22            NA         NA   hg19
>
> Hope this helps,
> H.
>
>
> On 7/29/20 08:30, Robert Castelo wrote:
>> hi,
>>
>> it looks like either VariantAnnotation::readVcf() or something in the 
>> CollapsedVCF class broke in devel with respect to reading and setting 
>> sequence styles:
>>
>> library(VariantAnnotation)
>>
>> fl <- system.file("extdata", "chr22.vcf.gz", 
>> package="VariantAnnotation")
>> vcf <- readVcf(fl, "hg19")
>> seqlevels(vcf)
>> [1] "22"
>> seqlevelsStyle(vcf)
>> [1] "UCSC"
>> seqlevelsStyle(vcf) <- "UCSC"
>> seqlevels(vcf)
>> [1] "22"
>>
>> you can find my session information below. let me know if you want me 
>> to open an issue at the GitHub repo (VariantAnnotatoin or 
>> GenomeInfoDb?).
>>
>> thanks!
>>
>> robert.
>>
>> BiocManager::version()
>> [1] ‘3.12’
>> sessionInfo()
>> R version 4.0.0 (2020-04-24)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> Running under: Ubuntu 18.04.4 LTS
>>
>> Matrix products: default
>> BLAS/LAPACK: /usr/lib/x86_64-linux-gnu/libopenblasp-r0.2.20.so
>>
>> locale:
>>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>>   [5] LC_MONETARY=en_US.UTF-8    LC_MESSAGES=C
>>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats4    parallel  stats     graphics  grDevices utils datasets
>> [8] methods   base
>>
>> other attached packages:
>>   [1] VariantAnnotation_1.35.3 Rsamtools_2.5.3
>>   [3] Biostrings_2.57.2 XVector_0.29.3
>>   [5] SummarizedExperiment_1.19.6 DelayedArray_0.15.7
>>   [7] matrixStats_0.56.0 Matrix_1.2-18
>>   [9] Biobase_2.49.0 GenomicRanges_1.41.5
>> [11] GenomeInfoDb_1.25.8 IRanges_2.23.10
>> [13] S4Vectors_0.27.12 BiocGenerics_0.35.4
>> [15] BiocManager_1.30.10
>>
>> loaded via a namespace (and not attached):
>>   [1] progress_1.2.2           tidyselect_1.1.0 purrr_0.3.4
>>   [4] lattice_0.20-41          vctrs_0.3.1 generics_0.0.2
>>   [7] BiocFileCache_1.13.0     rtracklayer_1.49.4 GenomicFeatures_1.41.2
>> [10] blob_1.2.1               XML_3.99-0.4 rlang_0.4.6
>> [13] pillar_1.4.4             glue_1.4.1 DBI_1.1.0
>> [16] rappdirs_0.3.1           BiocParallel_1.23.2 bit64_0.9-7.1
>> [19] dbplyr_1.4.4             GenomeInfoDbData_1.2.3 lifecycle_0.2.0
>> [22] stringr_1.4.0            zlibbioc_1.35.0 memoise_1.1.0
>> [25] biomaRt_2.45.2           curl_4.3 AnnotationDbi_1.51.3
>> [28] Rcpp_1.0.4.6             BSgenome_1.57.5 openssl_1.4.1
>> [31] bit_1.1-15.2             hms_0.5.3 askpass_1.1
>> [34] digest_0.6.25            stringi_1.4.6 dplyr_1.0.0
>> [37] grid_4.0.0               tools_4.0.0 bitops_1.0-6
>> [40] magrittr_1.5             RCurl_1.98-1.2 RSQLite_2.2.0
>> [43] tibble_3.0.1             crayon_1.3.4 pkgconfig_2.0.3
>> [46] ellipsis_0.3.1           prettyunits_1.1.1 assertthat_0.2.1
>> [49] httr_1.4.1               R6_2.4.1 GenomicAlignments_1.25.3
>> [52] compiler_4.0.0
>>
>> _______________________________________________
>> Bioc-devel using r-project.org mailing list
>> https://urldefense.proofpoint.com/v2/url?u=https-3A__stat.ethz.ch_mailman_listinfo_bioc-2Ddevel&d=DwIDaQ&c=eRAMFD45gAfqt84VtBcfhQ&r=BK7q3XeAvimeWdGbWY_wJYbW0WYiZvSXAJJKaaPhzWA&m=gp0KKC6W1uS1YnyFI5iSuxF5WSUpOhbHwL94GRP8yu0&s=Co1P5SErF64uPYhHMltM3De48hQLl-XHK3gfZOEnSKc&e= 
>>
>



More information about the Bioc-devel mailing list