[Bioc-devel] VariantAnnotation::readVcf(fl, seqinfo(scanVcfHeader(fl)) problem
Robert Castelo
robert.castelo at upf.edu
Sat Oct 25 01:21:26 CEST 2014
you're right, i was just thinking about b37 (w/o MT) and hg19 but one
can have the same seqlevelsStyle with two different builds such as hg18
and hg19. Well then, the solution i was using below,
genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
names(genome(txdb)))]
was not that bad, but it's a bit cumbersome to write, something like
what you propose below looks better to me. I also wonder why each
chromosome should have a genome build (which forces me to intersect in
the line before), could not we assume that all chromosomes come from the
same build?
On 10/25/14 12:56 AM, Michael Lawrence wrote:
> Not sure I understand. Setting the seqlevelsStyle cannot change the
> genome build, so the two Seqinfos will remain incompatible in that
> way. I think what you want is to be able to say "let's consider these
> two genome builds to be the same", which seems reasonable after
> dropping chrM. I was proposing that this could be done with something
> like:
>
> genome(vcf, rectifySeqlevels=TRUE) <- genome(txdb)
>
> For convenience, we might want that argument to be TRUE by default,
> but that would change current behavior.
>
>
> On Fri, Oct 24, 2014 at 3:41 PM, Robert Castelo
> <robert.castelo at upf.edu <mailto:robert.castelo at upf.edu>> wrote:
>
> hi Michael,
>
> if we assume that a seqname style does not imply a specific genome
> build, then i'd say that the error below about having incompatible
> genomes should not pop up because sequence styles have been
> already matched, right?
>
>
>
> On 10/24/14 10:22 PM, Michael Lawrence wrote:
>> I don't think a seqname style implies a specific genome build.
>> But the inverse might make sense. Given a genome build
>> identifier, we could check for consistent naming. Perhaps an
>> option on "genome<-" could support this?
>>
>>
>>
>> On Fri, Oct 24, 2014 at 11:52 AM, Valerie Obenchain
>> <vobencha at fhcrc.org <mailto:vobencha at fhcrc.org>> wrote:
>>
>> This is a good question. I'm not sure we want
>> seqlevelsStyle() to also alter the genome value. I think it's
>> a reasonable request but I'd like to open it up to
>> discussion. I've cc'd a few others for input.
>>
>> Valerie
>>
>>
>>
>> On 10/24/14 09:05, Robert Castelo wrote:
>>
>> hi Valerie,
>>
>> thanks for the quick fix and updating the documentation,
>> i have a
>> further question about the seqinfo slot and particularly
>> the use of
>> seqlevelsStyle(). Let me illustrate it with an example again:
>>
>>
>> ==============
>> library(VariantAnnotation)
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>
>> ## read again the same VCF file
>> fl <- file.path(system.file("extdata",
>> package="VariantFiltering"),
>> "CEUtrio.vcf.bgz")
>> vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>>
>> txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
>>
>> ## select the standard chromosomes
>> vcf <- keepStandardChromosomes(vcf)
>>
>> ## since the input VCF file had NCBI style, let's match
>> ## the style of the TxDb annotations
>> seqlevelsStyle(vcf) <- seqlevelsStyle(txdb)
>>
>> ## drop the mitochondrial chromosome (b/c of the
>> different lengths
>> ## between b37 and hg19
>> vcf <- dropSeqlevels(vcf, "chrM")
>>
>> ## try to annotate the location of the variants. it
>> prompts an
>> ## error because the 'genome' slot of the Seqinfo object
>> still
>> ## has b37 after running seqlevelsStyle
>> vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>> Error in mergeNamedAtomicVectors(genome(x), genome(y), what =
>> c("sequence", :
>> sequences chr1, chr2, chr3, chr4, chr5, chr6, chr7,
>> chr8, chr9,
>> chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17,
>> chr18, chr19,
>> chr20, chr21, chr22, chrX, chrY, chrM have incompatible
>> genomes:
>> - in 'x': b37, b37, b37, b37, b37, b37, b37, b37, b37,
>> b37, b37, b37,
>> b37, b37, b37, b37, b37, b37, b37, b37, b37, b37, b37,
>> b37, b37
>> - in 'y': hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>> hg19, hg19, hg19,
>> hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19, hg19,
>> hg19, hg19, hg19,
>> hg19, hg19, hg19
>>
>> ## this can be fixed by setting the 'genome' slot to the
>> values of
>> ## the TxDb object
>> genome(vcf) <- genome(txdb)[intersect(names(genome(vcf)),
>> names(genome(txdb)))]
>>
>> ## now this works
>> vcf_annot <- locateVariants(vcf, txdb, AllVariants())
>> =================
>>
>> so my question is, should not seqlevelsStyle() also
>> change the 'genome'
>> slot of the Seqinfo object in the updated object?
>>
>> if not, would the solution be updating the 'genome' slot
>> in the way i
>> did it?
>>
>> thanks!
>> robert.
>>
>>
>>
>> On 10/23/2014 11:14 PM, Valerie Obenchain wrote:
>>
>> Hi Robert,
>>
>> Thanks for the bug report and reproducible example.
>> Now fixed in release
>> 1.12.2 and devel 1.13.4.
>>
>> I've also updated the docs to better explain how the
>> Seqinfo objects are
>> propagated / merged when supplied as 'genome'.
>>
>> Valerie
>>
>>
>> On 10/23/2014 06:45 AM, Robert Castelo wrote:
>>
>> hi there,
>>
>> in my package VariantFiltering i have an example
>> VCF file from a Hapmap
>> CEU trio including three chromosomes only to
>> illustrate its vignette.
>> i've come across a problem with the function
>> readVcf() in
>> VariantAnnotation that may be specific of the
>> situation of a VCF file
>> not having all chromosomes, but which it will be
>> great for me if this
>> could be addressed.
>>
>> the problem is reproduced as follows:
>>
>> ===========================
>> library(VariantAnnotation)
>>
>> fl <- file.path(system.file("extdata",
>> package="VariantFiltering"),
>> "CEUtrio.vcf.bgz")
>>
>> vcf <- readVcf(fl, seqinfo(scanVcfHeader(fl)))
>> Error in GenomeInfoDb:::makeNewSeqnames(x,
>> new2old = new2old,
>> seqlevels(value)) :
>> when 'new2old' is NULL, the first elements in the
>> supplied 'seqlevels' must be identical to
>> 'seqlevels(x)'
>> ====================
>>
>> this is caused because although i'm providing the
>> Seqinfo object derived
>> from the header of the VCF file itself, at some
>> point the ordering of
>> the seqlevels between the header and the rest of
>> the VCF file differs
>> due to the smaller subset of chromosomes in the
>> VCF file.
>>
>> This can be easily fixed by replacing the line:
>>
>> if (length(newsi) > length(oldsi)) {
>>
>> within the .scanVcfToVCF() function in
>> methods-readVcf.R, by
>>
>> if (length(newsi) >= length(oldsi)) {
>>
>> this is happening both in release and devel. i'm
>> pasting below my
>> sessionInfo() for the release.
>>
>> let me know if you think this fix is feasible or
>> i'm wrongly using the
>> function readVcf(). i'm basically trying to use
>> readVcf() without having
>> to figure out the appropriate value for the
>> argument 'genome', i.e.,
>> without knowing beforehand what version of the
>> genome was used to
>> produce the VCF file.
>>
>> thanks!!
>> robert.
>>
>>
>> sessionInfo()
>> R version 3.1.1 Patched (2014-10-13 r66751)
>> Platform: x86_64-unknown-linux-gnu (64-bit)
>>
>> locale:
>> [1] LC_CTYPE=en_US.UTF8 LC_NUMERIC=C
>> [3] LC_TIME=en_US.UTF8 LC_COLLATE=en_US.UTF8
>> [5] LC_MONETARY=en_US.UTF8 LC_MESSAGES=en_US.UTF8
>> [7] LC_PAPER=en_US.UTF8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_US.UTF8 LC_IDENTIFICATION=C
>>
>> attached base packages:
>> [1] stats4 parallel stats graphics grDevices
>> [6] utils datasets methods base
>>
>> other attached packages:
>> [1] VariantAnnotation_1.12.0 Rsamtools_1.18.0
>> [3] Biostrings_2.34.0 XVector_0.6.0
>> [5] GenomicRanges_1.18.0 GenomeInfoDb_1.2.0
>> [7] IRanges_2.0.0 S4Vectors_0.4.0
>> [9] BiocGenerics_0.12.0 vimcom_1.0-0
>> [11] setwidth_1.0-3 colorout_1.0-3
>>
>> loaded via a namespace (and not attached):
>> [1] AnnotationDbi_1.28.0 base64enc_0.1-2
>> [3] BatchJobs_1.4 BBmisc_1.7
>> [5] Biobase_2.26.0 BiocParallel_1.0.0
>> [7] biomaRt_2.22.0 bitops_1.0-6
>> [9] brew_1.0-6 BSgenome_1.34.0
>> [11] checkmate_1.4 codetools_0.2-9
>> [13] DBI_0.3.1 digest_0.6.4
>> [15] fail_1.2 foreach_1.4.2
>> [17] GenomicAlignments_1.2.0 GenomicFeatures_1.18.0
>> [19] iterators_1.0.7 RCurl_1.95-4.3
>> [21] RSQLite_0.11.4 rtracklayer_1.26.0
>> [23] sendmailR_1.2-1 stringr_0.6.2
>> [25] tools_3.1.1 XML_3.98-1.1
>> [27] zlibbioc_1.12.0
>>
>> _______________________________________________
>> Bioc-devel at r-project.org
>> <mailto:Bioc-devel at r-project.org> mailing list
>> https://stat.ethz.ch/mailman/listinfo/bioc-devel
>>
>>
>>
>>
>>
>>
>
>
[[alternative HTML version deleted]]
More information about the Bioc-devel
mailing list