[Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

Sat Oct 25 00:56:58 CEST 2014

Not sure I understand. Setting the seqlevelsStyle cannot change the genome
build, so the two Seqinfos will remain incompatible in that way. I think
what you want is to be able to say "let's consider these two genome builds
to be the same", which seems reasonable after dropping chrM. I was
proposing that this could be done with something like:

genome(vcf, rectifySeqlevels=TRUE) <- genome(txdb)

For convenience, we might want that argument to be TRUE by default, but
that would change current behavior.

On Fri, Oct 24, 2014 at 3:41 PM, Robert Castelo <robert.castelo at upf.edu>
wrote:

>  hi Michael,
>
> if we assume that a seqname style does not imply a specific genome build,
> then i'd say that the error below about having incompatible genomes should
> not pop up because sequence styles have been already matched, right?
>
>
>
> On 10/24/14 10:22 PM, Michael Lawrence wrote:
>
> I don't think a seqname style implies a specific genome build. But the
> inverse might make sense. Given a genome build identifier, we could check
> for consistent naming. Perhaps an option on "genome<-" could support this?
>
>
>
> On Fri, Oct 24, 2014 at 11:52 AM, Valerie Obenchain <vobencha at fhcrc.org>
> wrote:
>
>> This is a good question. I'm not sure we want seqlevelsStyle() to also
>> alter the genome value. I think it's a reasonable request but I'd like to
>> open it up to discussion. I've cc'd a few others for input.
>>
>> Valerie
>>
>>
>>
>> On 10/24/14 09:05, Robert Castelo wrote:
>>
>>> hi Valerie,
>>>
>>> thanks for the quick fix and updating the documentation, i have a
>>> further question about the seqinfo slot and particularly the use of
>>> seqlevelsStyle(). Let me illustrate it with an example again:
>>>
>>>
>>> ==============
>>> library(VariantAnnotation)
>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>
>>> ## read again the same VCF file
>>> fl <- file.path(system.file("extdata", package="VariantFiltering"),
>>> "CEUtrio.vcf.bgz")
>>> vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>>>
>>>  txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>
>>> ## select the standard chromosomes
>>> vcf <- keepStandardChromosomes(vcf)
>>>
>>> ## since the input VCF file had NCBI style, let's match
>>> ## the style of the TxDb annotations
>>> seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
>>>
>>> ## drop the mitochondrial chromosome (b/c of the different lengths
>>> ## between b37 and hg19
>>> vcf <- dropSeqlevels(vcf, "chrM")
>>>
>>> ## try to annotate the location of the variants. it prompts an
>>> ## error because the 'genome' slot of the Seqinfo object still
>>> ## has b37 after running seqlevelsStyle
>>> vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>>> Error in mergeNamedAtomicVectors(genome(x), genome(y), what =
>>> c("sequence",  :
>>>    sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7, chr8, chr9,
>>> chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr19,
>>> chr20, chr21, chr22, chrX, chrY, chrM have incompatible genomes:
>>>    - in 'x': b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37,
>>> b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37
>>>    - in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>>> hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>>> hg19, hg19, hg19
>>>
>>> ## this can be fixed by setting the 'genome' slot to the values of
>>> ## the TxDb object
>>> genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
>>> names(genome(txdb)))]
>>>
>>> ## now this works
>>> vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>>> =================
>>>
>>> so my question is, should not seqlevelsStyle() also change the 'genome'
>>> slot of the Seqinfo object in the updated object?
>>>
>>> if not, would the solution be updating the 'genome' slot in the way i
>>> did it?
>>>
>>> thanks!
>>> robert.
>>>
>>>
>>>
>>> On 10/23/2014 11:14 PM, Valerie Obenchain wrote:
>>>
>>>> Hi Robert,
>>>>
>>>> Thanks for the bug report and reproducible example. Now fixed in release
>>>> 1.12.2 and devel 1.13.4.
>>>>
>>>> I've also updated the docs to better explain how the Seqinfo objects are
>>>> propagated / merged when supplied as 'genome'.
>>>>
>>>> Valerie
>>>>
>>>>
>>>> On 10/23/2014 06:45 AM, Robert Castelo wrote:
>>>>
>>>>> hi there,
>>>>>
>>>>> in my package VariantFiltering i have an example VCF file from a Hapmap
>>>>> CEU trio including three chromosomes only to illustrate its vignette.
>>>>> i've come across a problem with the function readVcf() in
>>>>> VariantAnnotation that may be specific of the situation of a VCF file
>>>>> not having all chromosomes, but which it will be great for me if this
>>>>> could be addressed.
>>>>>
>>>>> the problem is reproduced as follows:
>>>>>
>>>>> ===========================
>>>>> library(VariantAnnotation)
>>>>>
>>>>> fl <- file.path(system.file("extdata", package="VariantFiltering"),
>>>>> "CEUtrio.vcf.bgz")
>>>>>
>>>>> vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>>>>> Error in GenomeInfoDb:::makeNewSeqnames(x, new2old = new2old,
>>>>> seqlevels(value)) :
>>>>> when 'new2old' is NULL, the first elements in the
>>>>> supplied 'seqlevels' must be identical to 'seqlevels(x)'
>>>>> ====================
>>>>>
>>>>> this is caused because although i'm providing the Seqinfo object
>>>>> derived
>>>>> from the header of the VCF file itself, at some point the ordering of
>>>>> the seqlevels between the header and the rest of the VCF file differs
>>>>> due to the smaller subset of chromosomes in the VCF file.
>>>>>
>>>>> This can be easily fixed by replacing the line:
>>>>>
>>>>> if (length(newsi) > length(oldsi)) {
>>>>>
>>>>> within the .scanVcfToVCF() function in methods-readVcf.R, by
>>>>>
>>>>> if (length(newsi) >= length(oldsi)) {
>>>>>
>>>>> this is happening both in release and devel. i'm pasting below my
>>>>> sessionInfo() for the release.
>>>>>
>>>>> let me know if you think this fix is feasible or i'm wrongly using the
>>>>> function readVcf(). i'm basically trying to use readVcf() without
>>>>> having
>>>>> to figure out the appropriate value for the argument 'genome', i.e.,
>>>>> without knowing beforehand what version of the genome was used to
>>>>> produce the VCF file.
>>>>>
>>>>> thanks!!
>>>>> robert.
>>>>>
>>>>>
>>>>> sessionInfo()
>>>>> R version 3.1.1 Patched (2014-10-13 r66751)
>>>>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>>>>
>>>>> locale:
>>>>> [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
>>>>> [3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
>>>>> [5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
>>>>> [7] LC_PAPER=en_US.UTF8 LC_NAME=C
>>>>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>>> [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>>>>>
>>>>> attached base packages:
>>>>> [1] stats4 parallel stats graphics grDevices
>>>>> [6] utils datasets methods base
>>>>>
>>>>> other attached packages:
>>>>> [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
>>>>> [3] Biostrings_2.34.0 XVector_0.6.0
>>>>> [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
>>>>> [7] IRanges_2.0.0 S4Vectors_0.4.0
>>>>> [9] BiocGenerics_0.12.0 vimcom_1.0-0
>>>>> [11] setwidth_1.0-3 colorout_1.0-3
>>>>>
>>>>> loaded via a namespace (and not attached):
>>>>> [1] AnnotationDbi_1.28.0 base64enc_0.1-2
>>>>> [3] BatchJobs_1.4 BBmisc_1.7
>>>>> [5] Biobase_2.26.0 BiocParallel_1.0.0
>>>>> [7] biomaRt_2.22.0 bitops_1.0-6
>>>>> [9] brew_1.0-6 BSgenome_1.34.0
>>>>> [11] checkmate_1.4 codetools_0.2-9
>>>>> [13] DBI_0.3.1 digest_0.6.4
>>>>> [15] fail_1.2 foreach_1.4.2
>>>>> [17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
>>>>> [19] iterators_1.0.7 RCurl_1.95-4.3
>>>>> [21] RSQLite_0.11.4 rtracklayer_1.26.0
>>>>> [23] sendmailR_1.2-1 stringr_0.6.2
>>>>> [25] tools_3.1.1 XML_3.98-1.1
>>>>> [27] zlibbioc_1.12.0
>>>>>
>>>>> _______________________________________________
>>>>> Bioc-devel at r-project.org mailing list
>>>>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>>>
>>>>
>>>>
>>>>
>>>
>>
>
>

	[[alternative HTML version deleted]]