[Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

Robert Castelo robert.castelo at upf.edu
Sat Oct 25 00:41:53 CEST 2014


hi Michael,

if we assume that a seqname style does not imply a specific genome 
build, then i'd say that the error below about having incompatible 
genomes should not pop up because sequence styles have been already 
matched, right?


On 10/24/14 10:22 PM, Michael Lawrence wrote:
> I don't think a seqname style implies a specific genome build. But the 
> inverse might make sense. Given a genome build identifier, we could 
> check for consistent naming. Perhaps an option on "genome<-" could 
> support this?
>
>
>
> On Fri, Oct 24, 2014 at 11:52 AM, Valerie Obenchain 
> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>
>     This is a good question. I'm not sure we want seqlevelsStyle() to
>     also alter the genome value. I think it's a reasonable request but
>     I'd like to open it up to discussion. I've cc'd a few others for
>     input.
>
>     Valerie
>
>
>
>     On 10/24/14 09:05, Robert Castelo wrote:
>
>         hi Valerie,
>
>         thanks for the quick fix and updating the documentation, i have a
>         further question about the seqinfo slot and particularly the
>         use of
>         seqlevelsStyle(). Let me illustrate it with an example again:
>
>
>         ==============
>         library(VariantAnnotation)
>         library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
>         ## read again the same VCF file
>         fl <- file.path(system.file("extdata",
>         package="VariantFiltering"),
>         "CEUtrio.vcf.bgz")
>         vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>
>         txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>
>         ## select the standard chromosomes
>         vcf <- keepStandardChromosomes(vcf)
>
>         ## since the input VCF file had NCBI style, let's match
>         ## the style of the TxDb annotations
>         seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
>
>         ## drop the mitochondrial chromosome (b/c of the different lengths
>         ## between b37 and hg19
>         vcf <- dropSeqlevels(vcf, "chrM")
>
>         ## try to annotate the location of the variants. it prompts an
>         ## error because the 'genome' slot of the Seqinfo object still
>         ## has b37 after running seqlevelsStyle
>         vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>         Error in mergeNamedAtomicVectors(genome(x), genome(y), what =
>         c("sequence",  :
>            sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9,
>         chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18,
>         chr19,
>         chr20, chr21, chr22, chrX, chrY, chrM have incompatible genomes:
>            - in 'x': b37, b37, b37, b37, b37, b37, b37, b37, b37, b37,
>         b37, b37,
>         b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37
>            - in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>         hg19, hg19,
>         hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>         hg19, hg19,
>         hg19, hg19, hg19
>
>         ## this can be fixed by setting the 'genome' slot to the values of
>         ## the TxDb object
>         genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
>         names(genome(txdb)))]
>
>         ## now this works
>         vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>         =================
>
>         so my question is, should not seqlevelsStyle() also change the
>         'genome'
>         slot of the Seqinfo object in the updated object?
>
>         if not, would the solution be updating the 'genome' slot in
>         the way i
>         did it?
>
>         thanks!
>         robert.
>
>
>
>         On 10/23/2014 11:14 PM, Valerie Obenchain wrote:
>
>             Hi Robert,
>
>             Thanks for the bug report and reproducible example. Now
>             fixed in release
>             1.12.2 and devel 1.13.4.
>
>             I've also updated the docs to better explain how the
>             Seqinfo objects are
>             propagated / merged when supplied as 'genome'.
>
>             Valerie
>
>
>             On 10/23/2014 06:45 AM, Robert Castelo wrote:
>
>                 hi there,
>
>                 in my package VariantFiltering i have an example VCF
>                 file from a Hapmap
>                 CEU trio including three chromosomes only to
>                 illustrate its vignette.
>                 i've come across a problem with the function readVcf() in
>                 VariantAnnotation that may be specific of the
>                 situation of a VCF file
>                 not having all chromosomes, but which it will be great
>                 for me if this
>                 could be addressed.
>
>                 the problem is reproduced as follows:
>
>                 ===========================
>                 library(VariantAnnotation)
>
>                 fl <- file.path(system.file("extdata",
>                 package="VariantFiltering"),
>                 "CEUtrio.vcf.bgz")
>
>                 vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>                 Error in GenomeInfoDb:::makeNewSeqnames(x, new2old =
>                 new2old,
>                 seqlevels(value)) :
>                 when 'new2old' is NULL, the first elements in the
>                 supplied 'seqlevels' must be identical to 'seqlevels(x)'
>                 ====================
>
>                 this is caused because although i'm providing the
>                 Seqinfo object derived
>                 from the header of the VCF file itself, at some point
>                 the ordering of
>                 the seqlevels between the header and the rest of the
>                 VCF file differs
>                 due to the smaller subset of chromosomes in the VCF file.
>
>                 This can be easily fixed by replacing the line:
>
>                 if (length(newsi) > length(oldsi)) {
>
>                 within the .scanVcfToVCF() function in
>                 methods-readVcf.R, by
>
>                 if (length(newsi) >= length(oldsi)) {
>
>                 this is happening both in release and devel. i'm
>                 pasting below my
>                 sessionInfo() for the release.
>
>                 let me know if you think this fix is feasible or i'm
>                 wrongly using the
>                 function readVcf(). i'm basically trying to use
>                 readVcf() without having
>                 to figure out the appropriate value for the argument
>                 'genome', i.e.,
>                 without knowing beforehand what version of the genome
>                 was used to
>                 produce the VCF file.
>
>                 thanks!!
>                 robert.
>
>
>                 sessionInfo()
>                 R version 3.1.1 Patched (2014-10-13 r66751)
>                 Platform: x86_64-unknown-linux-gnu (64-bit)
>
>                 locale:
>                 [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
>                 [3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
>                 [5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
>                 [7] LC_PAPER=en_US.UTF8 LC_NAME=C
>                 [9] LC_ADDRESS=C LC_TELEPHONE=C
>                 [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>
>                 attached base packages:
>                 [1] stats4 parallel stats graphics grDevices
>                 [6] utils datasets methods base
>
>                 other attached packages:
>                 [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
>                 [3] Biostrings_2.34.0 XVector_0.6.0
>                 [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
>                 [7] IRanges_2.0.0 S4Vectors_0.4.0
>                 [9] BiocGenerics_0.12.0 vimcom_1.0-0
>                 [11] setwidth_1.0-3 colorout_1.0-3
>
>                 loaded via a namespace (and not attached):
>                 [1] AnnotationDbi_1.28.0 base64enc_0.1-2
>                 [3] BatchJobs_1.4 BBmisc_1.7
>                 [5] Biobase_2.26.0 BiocParallel_1.0.0
>                 [7] biomaRt_2.22.0 bitops_1.0-6
>                 [9] brew_1.0-6 BSgenome_1.34.0
>                 [11] checkmate_1.4 codetools_0.2-9
>                 [13] DBI_0.3.1 digest_0.6.4
>                 [15] fail_1.2 foreach_1.4.2
>                 [17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
>                 [19] iterators_1.0.7 RCurl_1.95-4.3
>                 [21] RSQLite_0.11.4 rtracklayer_1.26.0
>                 [23] sendmailR_1.2-1 stringr_0.6.2
>                 [25] tools_3.1.1 XML_3.98-1.1
>                 [27] zlibbioc_1.12.0
>
>                 _______________________________________________
>                 Bioc-devel at r-project.org
>                 <mailto:Bioc-devel at r-project.org> mailing list
>                 https://stat.ethz.ch/mailman/listinfo/bioc-devel
>
>
>
>
>
>


	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list