[BioC] VCF predictCoding(...) providing inconsistent results?
Valerie Obenchain
vobencha at fhcrc.org
Wed Jan 23 07:47:30 CET 2013
Sorry about that, the alleles are right there in front of me in the
varAllele column.
When I run this code I get the same results as your second output:
library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(BSgenome.Hsapiens.UCSC.hg19)
gr <- GRanges("chr1", IRanges(177901629, width=1), strand="-")
predictCoding(gr, txdb, Hsapiens, varAllele=DNAStringSet("A"))
Can you provide the output of the subset and non-subset VCF for just
"rs78192809"? For example,
nonsubsetVCF["rs78192809",]
and
subsetVCF["rs78192809",]
Valerie
On 01/22/13 21:02, Valerie Obenchain wrote:
> Hi Murat,
>
> Sorry for the delayed response, I've been out of town. I can't
> reproduce this error because I don't have the alleles for the range.
> Can you send a small subset of the VCF?
>
> The seqlevel preprocessing is still necessary to make the annotation
> and VCF file match. This can be done with renameSeqlevels() or
> seqlevels(). However removing the structural variants is no longer
> needed in >= VariantAnnotation 1.5.28 in the devel branch. From that
> version on predictCoding() will ignore structural variants and return
> results for non-structural only.
>
> Valerie
>
>
> On 01/15/13 13:57, Murat Tasan wrote:
>> i should add that i actually had to pre-process the vcf file a bit to
>> remedy two problems:
>>
>> (1) i changed the seqlevels like so:
>>
>> seqlevels(vcf)<- sprintf("chr%s", seqlevels(vcf))
>>
>> and (2) i had to remove structural variants, but i think i did this
>> without corrupting the resulting vcf structure... here's that bit of
>> code:
>>
>> structural_lix<- rep(FALSE, length(alt(vcf)))
>> structural_lix[sapply(alt(vcf), function(x) any(grepl("<", x)))]<-
>> TRUE
>> table(structural_lix) ## just to see how many entries will be
>> removed.
>>
>> vcf<- vcf[!structural_lix,]
>> alt(vcf)<- VariantAnnotation:::.toDNAStringSetList(unlist(alt(vcf),
>> use.names = FALSE))
>>
>> (the structural-variant removal code was inspired and closely follows
>> that described here:
>> https://stat.ethz.ch/pipermail/bioconductor/2012-October/048448.html)
>>
>> after these two steps, the vcf object seemed to be perfectly happy,
>> until the predictCoding(...) strangeness described in the previous
>> post.
>>
>> cheers,
>>
>> -m
>>
>>
>> On Tue, Jan 15, 2013 at 4:34 PM, Murat Tasan<mmuurr at gmail.com> wrote:
>>> hi all - i found a strange case while examining a VCF object extracted
>>> from chromosome 1 of the 1000 Genomes Project "integrated_call_set"
>>> files.
>>> i first extracted all variants falling in CDS regions of some of my
>>> genes of interest into a vcf object 'vcf'.
>>>
>>> then i ran predictCoding(vcf, TXDB, Hsapiens), where TXDB and Hsapiens
>>> come from TxDb.Hsapiens.UCSC.hg19.knownGene and
>>> BSgenome.Hsapiens.UCSC.hg19, respectively:
>>>
>>> vcf_coding_info_GRanges<- predictCoding(vcf, TXDB, Hsapiens).
>>>
>>> now, examining a single variant i got pretty confused, from two cases:
>>>
>>> CASE 1
>>>
>>>> vcf_coding_info_GRanges[names(vcf_coding_info_GRanges) ==
>>>> "rs78192809",]
>>> GRanges with 6 ranges and 13 metadata columns:
>>> seqnames ranges strand | paramRangeID
>>> varAllele CDSLOC PROTEINLOC QUERYID
>>> TXID CDSID GENEID CONSEQUENCE
>>> <Rle> <IRanges> <Rle> |<factor>
>>> <DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
>>> <character> <integer> <character> <factor>
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [1564, 1564] 522 9426
>>> 6991 20852 89866 nonsense
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [1988, 1988] 663 9426
>>> 6992 20852 89866 nonsynonymous
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [3008, 3008] 1003 9426
>>> 6993 20852 89866 nonsynonymous
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [1985, 1985] 662 9426
>>> 6994 20852 89866 nonsynonymous
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [3011, 3011] 1004 9426
>>> 6995 20852 89866 synonymous
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [2039, 2039] 680 9426
>>> 6996 20852 89866 synonymous
>>> REFCODON VARCODON REFAA VARAA
>>> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
>>> rs78192809 CAG TAG Q *
>>> rs78192809 CCA CTA P L
>>> rs78192809 CCA CTA P L
>>> rs78192809 CCA CTA P L
>>> rs78192809 CCA CCA P P
>>> rs78192809 CCA CCA P P
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>> CASE 2: i extracted the subset of the original vcf object and ran
>>> predictCoding(...) on that sub-object:
>>>
>>>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
>>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
>>> GRanges with 6 ranges and 13 metadata columns:
>>> seqnames ranges strand | paramRangeID
>>> varAllele CDSLOC PROTEINLOC QUERYID
>>> TXID CDSID GENEID CONSEQUENCE
>>> <Rle> <IRanges> <Rle> |<factor>
>>> <DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
>>> <character> <integer> <character> <factor>
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [1564, 1564] 522 1
>>> 6991 20852 89866 nonsense
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [1988, 1988] 663 1
>>> 6992 20852 89866 nonsynonymous
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [3008, 3008] 1003 1
>>> 6993 20852 89866 nonsynonymous
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [1985, 1985] 662 1
>>> 6994 20852 89866 nonsynonymous
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [3011, 3011] 1004 1
>>> 6995 20852 89866 nonsynonymous
>>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>>> T [2039, 2039] 680 1
>>> 6996 20852 89866 nonsynonymous
>>> REFCODON VARCODON REFAA VARAA
>>> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
>>> rs78192809 CAG TAG Q *
>>> rs78192809 CCA CTA P L
>>> rs78192809 CCA CTA P L
>>> rs78192809 CCA CTA P L
>>> rs78192809 CCA CTA P L
>>> rs78192809 CCA CTA P L
>>> ---
>>> seqlengths:
>>> chr1
>>> NA
>>>
>>> for nearly all fields, especially the CDSID, TXID, and the location
>>> data of the variant in the sequences, the results are the same.
>>> BUT, the CONSEQUENCE, VARCODON, and VARAA are inconsistent between the
>>> two calls.
>>>
>>> have i done something wrong in the second attempt (i.e. does
>>> subsetting on the VCF variants first cause a problem when submitting
>>> the result to predictCoding(...))?
>>>
>>> thanks for any help/insight.
>>>
>>> -m
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives:
> http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list