[BioC] VCF predictCoding(...) providing inconsistent results?

Wed Jan 23 07:47:30 CET 2013

Sorry about that, the alleles are right there in front of me in the 
varAllele column.

When I run this code I get the same results as your second output:

library(VariantAnnotation)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene
library(BSgenome.Hsapiens.UCSC.hg19)
gr <- GRanges("chr1", IRanges(177901629, width=1), strand="-")
predictCoding(gr, txdb, Hsapiens, varAllele=DNAStringSet("A"))

Can you provide the output of the subset and non-subset VCF for just 
"rs78192809"?  For example,

nonsubsetVCF["rs78192809",]

and

subsetVCF["rs78192809",]

Valerie

On 01/22/13 21:02, Valerie Obenchain wrote:
> Hi Murat,
>
> Sorry for the delayed response, I've been out of town. I can't 
> reproduce this error because I don't have the alleles for the range. 
> Can you send a small subset of the VCF?
>
> The seqlevel preprocessing is still necessary to make the annotation 
> and VCF file match. This can be done with renameSeqlevels() or 
> seqlevels(). However removing the structural variants is no longer 
> needed in >= VariantAnnotation 1.5.28 in the devel branch. From that 
> version on predictCoding() will ignore structural variants and return 
> results for non-structural only.
>
> Valerie
>
>
> On 01/15/13 13:57, Murat Tasan wrote:
>> i should add that i actually had to pre-process the vcf file a bit to
>> remedy two problems:
>>
>> (1) i changed the seqlevels like so:
>>
>>    seqlevels(vcf)<- sprintf("chr%s", seqlevels(vcf))
>>
>> and (2) i had to remove structural variants, but i think i did this
>> without corrupting the resulting vcf structure... here's that bit of
>> code:
>>
>>    structural_lix<- rep(FALSE, length(alt(vcf)))
>>    structural_lix[sapply(alt(vcf), function(x) any(grepl("<", x)))]<- 
>> TRUE
>>    table(structural_lix) ## just to see how many entries will be
>> removed.
>>
>>    vcf<- vcf[!structural_lix,]
>>    alt(vcf)<- VariantAnnotation:::.toDNAStringSetList(unlist(alt(vcf),
>> use.names = FALSE))
>>
>> (the structural-variant removal code was inspired and closely follows
>> that described here:
>> https://stat.ethz.ch/pipermail/bioconductor/2012-October/048448.html)
>>
>> after these two steps, the vcf object seemed to be perfectly happy,
>> until the predictCoding(...) strangeness described in the previous
>> post.
>>
>> cheers,
>>
>> -m
>>
>>
>> On Tue, Jan 15, 2013 at 4:34 PM, Murat Tasan<mmuurr at gmail.com>  wrote:
>>> hi all - i found a strange case while examining a VCF object extracted
>>> from chromosome 1 of the 1000 Genomes Project "integrated_call_set"
>>> files.
>>> i first extracted all variants falling in CDS regions of some of my
>>> genes of interest into a vcf object 'vcf'.
>>>
>>> then i ran predictCoding(vcf, TXDB, Hsapiens), where TXDB and Hsapiens
>>> come from TxDb.Hsapiens.UCSC.hg19.knownGene and
>>> BSgenome.Hsapiens.UCSC.hg19, respectively:
>>>
>>> vcf_coding_info_GRanges<- predictCoding(vcf, TXDB, Hsapiens).
>>>
>>> now, examining a single variant i got pretty confused, from two cases:
>>>
>>> CASE 1
>>>
>>>> vcf_coding_info_GRanges[names(vcf_coding_info_GRanges) == 
>>>> "rs78192809",]
>>> GRanges with 6 ranges and 13 metadata columns:
>>>               seqnames                 ranges strand |     paramRangeID
>>>       varAllele       CDSLOC              PROTEINLOC   QUERYID
>>> TXID     CDSID      GENEID   CONSEQUENCE
>>> <Rle> <IRanges> <Rle>  |<factor>
>>> <DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
>>> <character> <integer> <character> <factor>
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [1564, 1564]                     522      9426
>>> 6991     20852       89866      nonsense
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [1988, 1988]                     663      9426
>>> 6992     20852       89866 nonsynonymous
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [3008, 3008]                    1003      9426
>>> 6993     20852       89866 nonsynonymous
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [1985, 1985]                     662      9426
>>> 6994     20852       89866 nonsynonymous
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [3011, 3011]                    1004      9426
>>> 6995     20852       89866    synonymous
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [2039, 2039]                     680      9426
>>> 6996     20852       89866    synonymous
>>>                     REFCODON       VARCODON         REFAA         VARAA
>>> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
>>>    rs78192809            CAG            TAG             Q             *
>>>    rs78192809            CCA            CTA             P             L
>>>    rs78192809            CCA            CTA             P             L
>>>    rs78192809            CCA            CTA             P             L
>>>    rs78192809            CCA            CCA             P             P
>>>    rs78192809            CCA            CCA             P             P
>>>    ---
>>>    seqlengths:
>>>     chr1
>>>       NA
>>>
>>> CASE 2: i extracted the subset of the original vcf object and ran
>>> predictCoding(...) on that sub-object:
>>>
>>>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
>>> predictCoding(vcf["rs78192809",], TXDB, Hsapiens)
>>> GRanges with 6 ranges and 13 metadata columns:
>>>               seqnames                 ranges strand |     paramRangeID
>>>       varAllele       CDSLOC              PROTEINLOC   QUERYID
>>> TXID     CDSID      GENEID   CONSEQUENCE
>>> <Rle> <IRanges> <Rle>  |<factor>
>>> <DNAStringSet> <IRanges> <CompressedIntegerList> <integer>
>>> <character> <integer> <character> <factor>
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [1564, 1564]                     522         1
>>> 6991     20852       89866      nonsense
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [1988, 1988]                     663         1
>>> 6992     20852       89866 nonsynonymous
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [3008, 3008]                    1003         1
>>> 6993     20852       89866 nonsynonymous
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [1985, 1985]                     662         1
>>> 6994     20852       89866 nonsynonymous
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [3011, 3011]                    1004         1
>>> 6995     20852       89866 nonsynonymous
>>>    rs78192809     chr1 [177901629, 177901629]      - | entrezgene:89866
>>>               T [2039, 2039]                     680         1
>>> 6996     20852       89866 nonsynonymous
>>>                     REFCODON       VARCODON         REFAA         VARAA
>>> <DNAStringSet> <DNAStringSet> <AAStringSet> <AAStringSet>
>>>    rs78192809            CAG            TAG             Q             *
>>>    rs78192809            CCA            CTA             P             L
>>>    rs78192809            CCA            CTA             P             L
>>>    rs78192809            CCA            CTA             P             L
>>>    rs78192809            CCA            CTA             P             L
>>>    rs78192809            CCA            CTA             P             L
>>>    ---
>>>    seqlengths:
>>>     chr1
>>>       NA
>>>
>>> for nearly all fields, especially the CDSID, TXID, and the location
>>> data of the variant in the sequences, the results are the same.
>>> BUT, the CONSEQUENCE, VARCODON, and VARAA are inconsistent between the
>>> two calls.
>>>
>>> have i done something wrong in the second attempt (i.e. does
>>> subsetting on the VCF variants first cause a problem when submitting
>>> the result to predictCoding(...))?
>>>
>>> thanks for any help/insight.
>>>
>>> -m
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives: 
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor