[Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

Robert Castelo robert.castelo at upf.edu
Sat Oct 25 01:21:26 CEST 2014


you're right, i was just thinking about b37 (w/o MT) and hg19 but one 
can have the same seqlevelsStyle with two different builds such as hg18 
and hg19.  Well then, the solution i was using below,

genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
names(genome(txdb)))]

was not that bad, but it's a bit cumbersome to write, something like 
what you propose below looks better to me. I also wonder why each 
chromosome should have a genome build (which forces me to intersect in 
the line before), could not we assume that all chromosomes come from the 
same build?

On 10/25/14 12:56 AM, Michael Lawrence wrote:
> Not sure I understand. Setting the seqlevelsStyle cannot change the 
> genome build, so the two Seqinfos will remain incompatible in that 
> way. I think what you want is to be able to say "let's consider these 
> two genome builds to be the same", which seems reasonable after 
> dropping chrM. I was proposing that this could be done with something 
> like:
>
> genome(vcf, rectifySeqlevels=TRUE) <- genome(txdb)
>
> For convenience, we might want that argument to be TRUE by default, 
> but that would change current behavior.
>
>
> On Fri, Oct 24, 2014 at 3:41 PM, Robert Castelo 
> <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>> wrote:
>
>     hi Michael,
>
>     if we assume that a seqname style does not imply a specific genome
>     build, then i'd say that the error below about having incompatible
>     genomes should not pop up because sequence styles have been
>     already matched, right?
>
>
>
>     On 10/24/14 10:22 PM, Michael Lawrence wrote:
>>     I don't think a seqname style implies a specific genome build.
>>     But the inverse might make sense. Given a genome build
>>     identifier, we could check for consistent naming. Perhaps an
>>     option on "genome<-" could support this?
>>
>>
>>
>>     On Fri, Oct 24, 2014 at 11:52 AM, Valerie Obenchain
>>     <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>>
>>         This is a good question. I'm not sure we want
>>         seqlevelsStyle() to also alter the genome value. I think it's
>>         a reasonable request but I'd like to open it up to
>>         discussion. I've cc'd a few others for input.
>>
>>         Valerie
>>
>>
>>
>>         On 10/24/14 09:05, Robert Castelo wrote:
>>
>>             hi Valerie,
>>
>>             thanks for the quick fix and updating the documentation,
>>             i have a
>>             further question about the seqinfo slot and particularly
>>             the use of
>>             seqlevelsStyle(). Let me illustrate it with an example again:
>>
>>
>>             ==============
>>             library(VariantAnnotation)
>>             library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>
>>             ## read again the same VCF file
>>             fl <- file.path(system.file("extdata",
>>             package="VariantFiltering"),
>>             "CEUtrio.vcf.bgz")
>>             vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>>
>>             txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>
>>             ## select the standard chromosomes
>>             vcf <- keepStandardChromosomes(vcf)
>>
>>             ## since the input VCF file had NCBI style, let's match
>>             ## the style of the TxDb annotations
>>             seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
>>
>>             ## drop the mitochondrial chromosome (b/c of the
>>             different lengths
>>             ## between b37 and hg19
>>             vcf <- dropSeqlevels(vcf, "chrM")
>>
>>             ## try to annotate the location of the variants. it
>>             prompts an
>>             ## error because the 'genome' slot of the Seqinfo object
>>             still
>>             ## has b37 after running seqlevelsStyle
>>             vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>>             Error in mergeNamedAtomicVectors(genome(x), genome(y), what =
>>             c("sequence",  :
>>                sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7,
>>             chr8, chr9,
>>             chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17,
>>             chr18, chr19,
>>             chr20, chr21, chr22, chrX, chrY, chrM have incompatible
>>             genomes:
>>                - in 'x': b37, b37, b37, b37, b37, b37, b37, b37, b37,
>>             b37, b37, b37,
>>             b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37,
>>             b37, b37
>>                - in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>>             hg19, hg19, hg19,
>>             hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>>             hg19, hg19, hg19,
>>             hg19, hg19, hg19
>>
>>             ## this can be fixed by setting the 'genome' slot to the
>>             values of
>>             ## the TxDb object
>>             genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
>>             names(genome(txdb)))]
>>
>>             ## now this works
>>             vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>>             =================
>>
>>             so my question is, should not seqlevelsStyle() also
>>             change the 'genome'
>>             slot of the Seqinfo object in the updated object?
>>
>>             if not, would the solution be updating the 'genome' slot
>>             in the way i
>>             did it?
>>
>>             thanks!
>>             robert.
>>
>>
>>
>>             On 10/23/2014 11:14 PM, Valerie Obenchain wrote:
>>
>>                 Hi Robert,
>>
>>                 Thanks for the bug report and reproducible example.
>>                 Now fixed in release
>>                 1.12.2 and devel 1.13.4.
>>
>>                 I've also updated the docs to better explain how the
>>                 Seqinfo objects are
>>                 propagated / merged when supplied as 'genome'.
>>
>>                 Valerie
>>
>>
>>                 On 10/23/2014 06:45 AM, Robert Castelo wrote:
>>
>>                     hi there,
>>
>>                     in my package VariantFiltering i have an example
>>                     VCF file from a Hapmap
>>                     CEU trio including three chromosomes only to
>>                     illustrate its vignette.
>>                     i've come across a problem with the function
>>                     readVcf() in
>>                     VariantAnnotation that may be specific of the
>>                     situation of a VCF file
>>                     not having all chromosomes, but which it will be
>>                     great for me if this
>>                     could be addressed.
>>
>>                     the problem is reproduced as follows:
>>
>>                     ===========================
>>                     library(VariantAnnotation)
>>
>>                     fl <- file.path(system.file("extdata",
>>                     package="VariantFiltering"),
>>                     "CEUtrio.vcf.bgz")
>>
>>                     vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>>                     Error in GenomeInfoDb:::makeNewSeqnames(x,
>>                     new2old = new2old,
>>                     seqlevels(value)) :
>>                     when 'new2old' is NULL, the first elements in the
>>                     supplied 'seqlevels' must be identical to
>>                     'seqlevels(x)'
>>                     ====================
>>
>>                     this is caused because although i'm providing the
>>                     Seqinfo object derived
>>                     from the header of the VCF file itself, at some
>>                     point the ordering of
>>                     the seqlevels between the header and the rest of
>>                     the VCF file differs
>>                     due to the smaller subset of chromosomes in the
>>                     VCF file.
>>
>>                     This can be easily fixed by replacing the line:
>>
>>                     if (length(newsi) > length(oldsi)) {
>>
>>                     within the .scanVcfToVCF() function in
>>                     methods-readVcf.R, by
>>
>>                     if (length(newsi) >= length(oldsi)) {
>>
>>                     this is happening both in release and devel. i'm
>>                     pasting below my
>>                     sessionInfo() for the release.
>>
>>                     let me know if you think this fix is feasible or
>>                     i'm wrongly using the
>>                     function readVcf(). i'm basically trying to use
>>                     readVcf() without having
>>                     to figure out the appropriate value for the
>>                     argument 'genome', i.e.,
>>                     without knowing beforehand what version of the
>>                     genome was used to
>>                     produce the VCF file.
>>
>>                     thanks!!
>>                     robert.
>>
>>
>>                     sessionInfo()
>>                     R version 3.1.1 Patched (2014-10-13 r66751)
>>                     Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>>                     locale:
>>                     [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
>>                     [3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
>>                     [5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
>>                     [7] LC_PAPER=en_US.UTF8 LC_NAME=C
>>                     [9] LC_ADDRESS=C LC_TELEPHONE=C
>>                     [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>>
>>                     attached base packages:
>>                     [1] stats4 parallel stats graphics grDevices
>>                     [6] utils datasets methods base
>>
>>                     other attached packages:
>>                     [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
>>                     [3] Biostrings_2.34.0 XVector_0.6.0
>>                     [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
>>                     [7] IRanges_2.0.0 S4Vectors_0.4.0
>>                     [9] BiocGenerics_0.12.0 vimcom_1.0-0
>>                     [11] setwidth_1.0-3 colorout_1.0-3
>>
>>                     loaded via a namespace (and not attached):
>>                     [1] AnnotationDbi_1.28.0 base64enc_0.1-2
>>                     [3] BatchJobs_1.4 BBmisc_1.7
>>                     [5] Biobase_2.26.0 BiocParallel_1.0.0
>>                     [7] biomaRt_2.22.0 bitops_1.0-6
>>                     [9] brew_1.0-6 BSgenome_1.34.0
>>                     [11] checkmate_1.4 codetools_0.2-9
>>                     [13] DBI_0.3.1 digest_0.6.4
>>                     [15] fail_1.2 foreach_1.4.2
>>                     [17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
>>                     [19] iterators_1.0.7 RCurl_1.95-4.3
>>                     [21] RSQLite_0.11.4 rtracklayer_1.26.0
>>                     [23] sendmailR_1.2-1 stringr_0.6.2
>>                     [25] tools_3.1.1 XML_3.98-1.1
>>                     [27] zlibbioc_1.12.0
>>
>>                     _______________________________________________
>>                     Bioc-devel at r-project.org
>>                     <mailto:Bioc-devel at r-project.org> mailing list
>>                     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>>
>>
>>
>
>


	[[alternative HTML version deleted]]



More information about the Bioc-devel mailing list