[BioC] VCF predictCoding(...) providing inconsistent results?
Valerie Obenchain
vobencha at fhcrc.org
Wed Jan 23 06:02:16 CET 2013
Hi Murat,
Sorry for the delayed response, I've been out of town. I can't reproduce
this error because I don't have the alleles for the range. Can you send
a small subset of the VCF?
The seqlevel preprocessing is still necessary to make the annotation and
VCF file match. This can be done with renameSeqlevels() or seqlevels().
However removing the structural variants is no longer needed in >=
VariantAnnotation 1.5.28 in the devel branch. From that version on
predictCoding() will ignore structural variants and return results for
non-structural only.
Valerie
On 01/15/13 13:57, Murat Tasan wrote:
> i should add that i actually had to pre-process the vcf file a bit to
> remedy two problems:
>
> (1) i changed the seqlevels like so:
>
> seqlevels(vcf)<- sprintf("chr%s", seqlevels(vcf))
>
> and (2) i had to remove structural variants, but i think i did this
> without corrupting the resulting vcf structure... here's that bit of
> code:
>
> structural_lix<- rep(FALSE, length(alt(vcf)))
> structural_lix[sapply(alt(vcf), function(x) any(grepl("<", x)))]<- TRUE
> table(structural_lix) ## just to see how many entries will be
> removed.
>
> vcf<- vcf[!structural_lix,]
> alt(vcf)<- VariantAnnotation:::.toDNAStringSetList(unlist(alt(vcf),
> use.names = FALSE))
>
> (the structural-variant removal code was inspired and closely follows
> that described here:
> https://stat.ethz.ch/pipermail/bioconductor/2012-October/048448.html)
>
> after these two steps, the vcf object seemed to be perfectly happy,
> until the predictCoding(...) strangeness described in the previous
> post.
>
> cheers,
>
> -m
>
>
> On Tue, Jan 15, 2013 at 4:34 PM, Murat Tasan<mmuurr at gmail.com> wrote:
>> hi all - i found a strange case while examining a VCF object extracted
>> from chromosome 1 of the 1000 Genomes Project "integrated_call_set"
>> files.
>> i first extracted all variants falling in CDS regions of some of my
>> genes of interest into a vcf object 'vcf'.
>>
>> then i ran predictCoding(vcf, TXDB, Hsapiens), where TXDB and Hsapiens
>> come from TxDb.Hsapiens.UCSC.hg19.knownGene and
>> BSgenome.Hsapiens.UCSC.hg19, respectively:
>>
>> vcf_coding_info_GRanges<- predictCoding(vcf, TXDB, Hsapiens).
>>
>> now, examining a single variant i got pretty confused, from two cases:
>>
>> CASE 1
>>
>>> vcf_coding_info_GRanges[names(vcf_coding_info_GRanges) == "rs78192809",]
>> GRanges with 6 ranges and 13 metadata columns:
>> seqnames ranges strand | paramRangeID
>> varAllele CDSLOC PROTEINLOC QUERYID
>> TXID CDSID GENEID CONSEQUENCE
>> <Rle> <IRanges> <Rle> |<factor>
>> <DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
>> <character> <integer> <character> <factor>
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [1564, 1564] 522 9426
>> 6991 20852 89866 nonsense
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [1988, 1988] 663 9426
>> 6992 20852 89866 nonsynonymous
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [3008, 3008] 1003 9426
>> 6993 20852 89866 nonsynonymous
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [1985, 1985] 662 9426
>> 6994 20852 89866 nonsynonymous
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [3011, 3011] 1004 9426
>> 6995 20852 89866 synonymous
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [2039, 2039] 680 9426
>> 6996 20852 89866 synonymous
>> REFCODON VARCODON REFAA VARAA
>> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
>> rs78192809 CAG TAG Q *
>> rs78192809 CCA CTA P L
>> rs78192809 CCA CTA P L
>> rs78192809 CCA CTA P L
>> rs78192809 CCA CCA P P
>> rs78192809 CCA CCA P P
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>> CASE 2: i extracted the subset of the original vcf object and ran
>> predictCoding(...) on that sub-object:
>>
>>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
>> GRanges with 6 ranges and 13 metadata columns:
>> seqnames ranges strand | paramRangeID
>> varAllele CDSLOC PROTEINLOC QUERYID
>> TXID CDSID GENEID CONSEQUENCE
>> <Rle> <IRanges> <Rle> |<factor>
>> <DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
>> <character> <integer> <character> <factor>
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [1564, 1564] 522 1
>> 6991 20852 89866 nonsense
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [1988, 1988] 663 1
>> 6992 20852 89866 nonsynonymous
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [3008, 3008] 1003 1
>> 6993 20852 89866 nonsynonymous
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [1985, 1985] 662 1
>> 6994 20852 89866 nonsynonymous
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [3011, 3011] 1004 1
>> 6995 20852 89866 nonsynonymous
>> rs78192809 chr1 [177901629, 177901629] - | entrezgene:89866
>> T [2039, 2039] 680 1
>> 6996 20852 89866 nonsynonymous
>> REFCODON VARCODON REFAA VARAA
>> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
>> rs78192809 CAG TAG Q *
>> rs78192809 CCA CTA P L
>> rs78192809 CCA CTA P L
>> rs78192809 CCA CTA P L
>> rs78192809 CCA CTA P L
>> rs78192809 CCA CTA P L
>> ---
>> seqlengths:
>> chr1
>> NA
>>
>> for nearly all fields, especially the CDSID, TXID, and the location
>> data of the variant in the sequences, the results are the same.
>> BUT, the CONSEQUENCE, VARCODON, and VARAA are inconsistent between the
>> two calls.
>>
>> have i done something wrong in the second attempt (i.e. does
>> subsetting on the VCF variants first cause a problem when submitting
>> the result to predictCoding(...))?
>>
>> thanks for any help/insight.
>>
>> -m
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list