[Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem

Sat Oct 25 03:05:53 CEST 2014

Hi Robert, Michael,

On 10/24/2014 04:21 PM, Robert Castelo wrote:
> you're right, i was just thinking about b37 (w/o MT) and hg19 but one
> can have the same seqlevelsStyle with two different builds such as hg18
> and hg19.  Well then, the solution i was using below,
>
> genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
> names(genome(txdb)))]
>
> was not that bad, but it's a bit cumbersome to write,

genome(vcf) <- genome(txdb)[[1]] should do the same and is simpler.
But we should be able to just do

   genome(vcf) <- genome(txdb)

instead. Unfortunately this didn't work (until a change I just
committed) because the genome() setter was being too paranoid
when checking the supplied value. I just relaxed its logic a little
so now the genome(x) <- genome(y) idiom should almost always work
(except for some rare edge cases).

This is in GenomeInfoDb 1.2.2 (release) and 1.3.6 (devel).

> something like
> what you propose below looks better to me. I also wonder why each
> chromosome should have a genome build (which forces me to intersect in
> the line before), could not we assume that all chromosomes come from the
> same build?

When the genome slot was added to Seqinfo objects, the decision was
made to make this slot parallel to the other slots (seqnames, etc...)
to support use cases where sequences come from different organisms.

Cheers,
H.

>
> On 10/25/14 12:56 AM, Michael Lawrence wrote:
>> Not sure I understand. Setting the seqlevelsStyle cannot change the
>> genome build, so the two Seqinfos will remain incompatible in that
>> way. I think what you want is to be able to say "let's consider these
>> two genome builds to be the same", which seems reasonable after
>> dropping chrM. I was proposing that this could be done with something
>> like:
>>
>> genome(vcf, rectifySeqlevels=TRUE) <- genome(txdb)
>>
>> For convenience, we might want that argument to be TRUE by default,
>> but that would change current behavior.
>>
>>
>> On Fri, Oct 24, 2014 at 3:41 PM, Robert Castelo
>> <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>> wrote:
>>
>>     hi Michael,
>>
>>     if we assume that a seqname style does not imply a specific genome
>>     build, then i'd say that the error below about having incompatible
>>     genomes should not pop up because sequence styles have been
>>     already matched, right?
>>
>>
>>
>>     On 10/24/14 10:22 PM, Michael Lawrence wrote:
>>>     I don't think a seqname style implies a specific genome build.
>>>     But the inverse might make sense. Given a genome build
>>>     identifier, we could check for consistent naming. Perhaps an
>>>     option on "genome<-" could support this?
>>>
>>>
>>>
>>>     On Fri, Oct 24, 2014 at 11:52 AM, Valerie Obenchain
>>>     <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>>>
>>>         This is a good question. I'm not sure we want
>>>         seqlevelsStyle() to also alter the genome value. I think it's
>>>         a reasonable request but I'd like to open it up to
>>>         discussion. I've cc'd a few others for input.
>>>
>>>         Valerie
>>>
>>>
>>>
>>>         On 10/24/14 09:05, Robert Castelo wrote:
>>>
>>>             hi Valerie,
>>>
>>>             thanks for the quick fix and updating the documentation,
>>>             i have a
>>>             further question about the seqinfo slot and particularly
>>>             the use of
>>>             seqlevelsStyle(). Let me illustrate it with an example again:
>>>
>>>
>>>             ==============
>>>             library(VariantAnnotation)
>>>             library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>
>>>             ## read again the same VCF file
>>>             fl <- file.path(system.file("extdata",
>>>             package="VariantFiltering"),
>>>             "CEUtrio.vcf.bgz")
>>>             vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>>>
>>>             txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>>
>>>             ## select the standard chromosomes
>>>             vcf <- keepStandardChromosomes(vcf)
>>>
>>>             ## since the input VCF file had NCBI style, let's match
>>>             ## the style of the TxDb annotations
>>>             seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
>>>
>>>             ## drop the mitochondrial chromosome (b/c of the
>>>             different lengths
>>>             ## between b37 and hg19
>>>             vcf <- dropSeqlevels(vcf, "chrM")
>>>
>>>             ## try to annotate the location of the variants. it
>>>             prompts an
>>>             ## error because the 'genome' slot of the Seqinfo object
>>>             still
>>>             ## has b37 after running seqlevelsStyle
>>>             vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>>>             Error in mergeNamedAtomicVectors(genome(x), genome(y), what =
>>>             c("sequence",  :
>>>                sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7,
>>>             chr8, chr9,
>>>             chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17,
>>>             chr18, chr19,
>>>             chr20, chr21, chr22, chrX, chrY, chrM have incompatible
>>>             genomes:
>>>                - in 'x': b37, b37, b37, b37, b37, b37, b37, b37, b37,
>>>             b37, b37, b37,
>>>             b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37,
>>>             b37, b37
>>>                - in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>>>             hg19, hg19, hg19,
>>>             hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>>>             hg19, hg19, hg19,
>>>             hg19, hg19, hg19
>>>
>>>             ## this can be fixed by setting the 'genome' slot to the
>>>             values of
>>>             ## the TxDb object
>>>             genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
>>>             names(genome(txdb)))]
>>>
>>>             ## now this works
>>>             vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>>>             =================
>>>
>>>             so my question is, should not seqlevelsStyle() also
>>>             change the 'genome'
>>>             slot of the Seqinfo object in the updated object?
>>>
>>>             if not, would the solution be updating the 'genome' slot
>>>             in the way i
>>>             did it?
>>>
>>>             thanks!
>>>             robert.
>>>
>>>
>>>
>>>             On 10/23/2014 11:14 PM, Valerie Obenchain wrote:
>>>
>>>                 Hi Robert,
>>>
>>>                 Thanks for the bug report and reproducible example.
>>>                 Now fixed in release
>>>                 1.12.2 and devel 1.13.4.
>>>
>>>                 I've also updated the docs to better explain how the
>>>                 Seqinfo objects are
>>>                 propagated / merged when supplied as 'genome'.
>>>
>>>                 Valerie
>>>
>>>
>>>                 On 10/23/2014 06:45 AM, Robert Castelo wrote:
>>>
>>>                     hi there,
>>>
>>>                     in my package VariantFiltering i have an example
>>>                     VCF file from a Hapmap
>>>                     CEU trio including three chromosomes only to
>>>                     illustrate its vignette.
>>>                     i've come across a problem with the function
>>>                     readVcf() in
>>>                     VariantAnnotation that may be specific of the
>>>                     situation of a VCF file
>>>                     not having all chromosomes, but which it will be
>>>                     great for me if this
>>>                     could be addressed.
>>>
>>>                     the problem is reproduced as follows:
>>>
>>>                     ===========================
>>>                     library(VariantAnnotation)
>>>
>>>                     fl <- file.path(system.file("extdata",
>>>                     package="VariantFiltering"),
>>>                     "CEUtrio.vcf.bgz")
>>>
>>>                     vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>>>                     Error in GenomeInfoDb:::makeNewSeqnames(x,
>>>                     new2old = new2old,
>>>                     seqlevels(value)) :
>>>                     when 'new2old' is NULL, the first elements in the
>>>                     supplied 'seqlevels' must be identical to
>>>                     'seqlevels(x)'
>>>                     ====================
>>>
>>>                     this is caused because although i'm providing the
>>>                     Seqinfo object derived
>>>                     from the header of the VCF file itself, at some
>>>                     point the ordering of
>>>                     the seqlevels between the header and the rest of
>>>                     the VCF file differs
>>>                     due to the smaller subset of chromosomes in the
>>>                     VCF file.
>>>
>>>                     This can be easily fixed by replacing the line:
>>>
>>>                     if (length(newsi) > length(oldsi)) {
>>>
>>>                     within the .scanVcfToVCF() function in
>>>                     methods-readVcf.R, by
>>>
>>>                     if (length(newsi) >= length(oldsi)) {
>>>
>>>                     this is happening both in release and devel. i'm
>>>                     pasting below my
>>>                     sessionInfo() for the release.
>>>
>>>                     let me know if you think this fix is feasible or
>>>                     i'm wrongly using the
>>>                     function readVcf(). i'm basically trying to use
>>>                     readVcf() without having
>>>                     to figure out the appropriate value for the
>>>                     argument 'genome', i.e.,
>>>                     without knowing beforehand what version of the
>>>                     genome was used to
>>>                     produce the VCF file.
>>>
>>>                     thanks!!
>>>                     robert.
>>>
>>>
>>>                     sessionInfo()
>>>                     R version 3.1.1 Patched (2014-10-13 r66751)
>>>                     Platform: x86_64-unknown-linux-gnu (64-bit)
>>>
>>>                     locale:
>>>                     [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
>>>                     [3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
>>>                     [5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
>>>                     [7] LC_PAPER=en_US.UTF8 LC_NAME=C
>>>                     [9] LC_ADDRESS=C LC_TELEPHONE=C
>>>                     [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>>>
>>>                     attached base packages:
>>>                     [1] stats4 parallel stats graphics grDevices
>>>                     [6] utils datasets methods base
>>>
>>>                     other attached packages:
>>>                     [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
>>>                     [3] Biostrings_2.34.0 XVector_0.6.0
>>>                     [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
>>>                     [7] IRanges_2.0.0 S4Vectors_0.4.0
>>>                     [9] BiocGenerics_0.12.0 vimcom_1.0-0
>>>                     [11] setwidth_1.0-3 colorout_1.0-3
>>>
>>>                     loaded via a namespace (and not attached):
>>>                     [1] AnnotationDbi_1.28.0 base64enc_0.1-2
>>>                     [3] BatchJobs_1.4 BBmisc_1.7
>>>                     [5] Biobase_2.26.0 BiocParallel_1.0.0
>>>                     [7] biomaRt_2.22.0 bitops_1.0-6
>>>                     [9] brew_1.0-6 BSgenome_1.34.0
>>>                     [11] checkmate_1.4 codetools_0.2-9
>>>                     [13] DBI_0.3.1 digest_0.6.4
>>>                     [15] fail_1.2 foreach_1.4.2
>>>                     [17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
>>>                     [19] iterators_1.0.7 RCurl_1.95-4.3
>>>                     [21] RSQLite_0.11.4 rtracklayer_1.26.0
>>>                     [23] sendmailR_1.2-1 stringr_0.6.2
>>>                     [25] tools_3.1.1 XML_3.98-1.1
>>>                     [27] zlibbioc_1.12.0
>>>
>>>                     _______________________________________________
>>>                     Bioc-devel at r-project.org
>>>                     <mailto:Bioc-devel at r-project.org> mailing list
>>>                     https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>>
>>>
>>>
>>>
>>>
>>>
>>
>>
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319