[BioC] Finding coding SNPs with predictCoding

Tue Feb 28 17:20:17 CET 2012

Good suggestion. Yes, predictCoding is does this internally. I'll post 
back here when this has been added.

Valerie

On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
> Hi Valerie,
>
> Thanks everything works great now. One small feature request - would 
> it be hard to output the protein sequence position of the coding SNPs? 
> At the moment once I've run predictCoding I'm re-extracting the cds 
> and working out the position of each coding SNP so I can see where in 
> the protein sequence it is, but it seems like this is probably just 
> replicating what predictCoding must be doing internally anyway?
>
> Alex Gutteridge

On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
> Hi Alex,
>
> Thanks for the bug report. The cdsID was taken from an overlap between 
> the query and GRangesList of cds by transcripts. This gave the correct 
> transcript number but (incorrectly) took the first cds number in the 
> list by default. Now fixed in devel 1.1.55.
>
> I've also updated the man page.
>
> Valerie
>
>
>
> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
>> On 22.02.2012 18:58, Hervé Pagès wrote:
>>> Hi Alex,
>>>
>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>>
>> [...]
>>
>>>> But the predictCoding call gives this error:
>>>>
>>>> Error in .setSeqNames(x, value) :
>>>> The replacement value for isActiveSeq must be a logical vector, with
>>>> names that match the seqlevels of the object
>>>
>>> The error message doesn't help much but I think the pb is that you
>>> didn't rename chMT properly. Try to do this:
>>>
>>>   seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>>
>>> before you start the for(eg in entrez.ids){..} loop again.
>>>
>>> Cheers,
>>> H.
>>
>> Thanks Hervé that nailed it. I'm having some difficulty joining up 
>> the output of predictCoding() with the query SNPs though. If someone 
>> could point out where the disconnect in my thinking is I would 
>> appreciate it!
>>
>> Here's my (now edited down) script:
>>
>> library(BSgenome.Hsapiens.UCSC.hg19)
>> library(VariantAnnotation)
>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>
>> entrez.ids = c('6335')
>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>>
>> snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>
>> gene.list = cdsBy(txdb19, by="gene")
>> vsd.list = gene.list[entrez.ids]
>> cds.list = cdsBy(txdb19,by="tx")
>>
>> eg = entrez.ids[1]
>>
>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
>> eg.snps = snps[snp.idx]
>> iupac   = values(eg.snps)[,"alleles_as_ambig"]
>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
>> variant.alleles = 
>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
>>
>> aa = 
>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>>
>> #####
>>
>> Then if I query the predictCoding results in aa in an interactive 
>> session I get the following (see inline comments for what I think 
>> should be happening, but I must be misinterpreting what queryID means)
>>
>> The docs for predictCoding() contain a small typo 
>> (s/queryHits/queryID), but otherwise seem clear?
>>
>> Columns include ‘queryID’, ‘consequence’, ‘refSeq’, ‘varSeq’,
>>      ‘refAA’, ‘varAA’, ‘txID’, ‘geneID’, and ‘cdsID’.
>>
>>      ‘queryHits’ The ‘queryHits’ column provides a map back to the
>>           variants in the original ‘query’. If the ‘query’ was a ‘VCF’
>>           object this index corresponds to the row in the ‘GRanges’ in
>>           the ‘rowData’ slot. If ‘query’ was an expanded ‘GRanges’,
>>           ‘RangedData’ or ‘RangesList’ the index corresponds to the row
>>           in the expanded object.
>>
>> #####
>>
>>> aa[1,]
>> DataFrame with 1 row and 9 columns
>>     queryID   consequence         refSeq         varSeq         refAA
>> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
>> 1         1 nonsynonymous            CTC            ATC             L
>>           varAA        txID   geneID     cdsID
>> <AAStringSet> <character> <factor> <integer>
>> 1             I       10921     6335     33668
>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx 
>>> '10921' and cds '33668'.
>>> #If I look at the first query SNP I get this:
>>> eg.snps.exp[aa[1,'queryID'],]
>> GRanges with 1 range and 2 elementMetadata values:
>>       seqnames                 ranges strand |   RefSNP_id 
>> alleles_as_ambig
>> <Rle> <IRanges> <Rle> | <character> <character>
>>   [1]     chr2 [167055370, 167055370]      * |   
>> 111558968               R
>>   ---
>>   seqlengths:
>>     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  chrX  
>> chrY  chrM
>>       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    
>> NA    NA
>>> #So SNP 1 is at 167055370 on chr2
>>> #But if I check tx '10921' I see that the cds overlapping 167055370 
>>> is actually '33651'
>>> #And cds '33668' is at the other end of the tx:
>>> cds.list[[aa[1,'txID']]]
>> GRanges with 26 ranges and 3 elementMetadata values:
>>        seqnames                 ranges strand   |    cds_id    cds_name
>> <Rle> <IRanges> <Rle>   | <integer> <character>
>>    [1]     chr2 [167168009, 167168266]      -   |     33668 <NA>
>>    [2]     chr2 [167163466, 167163584]      -   |     33667 <NA>
>>    [3]     chr2 [167163020, 167163109]      -   |     33666 <NA>
>>    [4]     chr2 [167162302, 167162430]      -   |     33647 <NA>
>>    [5]     chr2 [167160748, 167160839]      -   |     33646 <NA>
>>    [6]     chr2 [167159600, 167159812]      -   |     33645 <NA>
>>    [7]     chr2 [167151109, 167151172]      -   |     33644 <NA>
>>    [8]     chr2 [167149741, 167149882]      -   |     33643 <NA>
>>    [9]     chr2 [167144947, 167145153]      -   |     33642 <NA>
>>    ...      ...                    ...    ... ...       ...         ...
>>   [18]     chr2 [167099012, 167099166]      -   |     33659 <NA>
>>   [19]     chr2 [167094604, 167094777]      -   |     33658 <NA>
>>   [20]     chr2 [167089850, 167089972]      -   |     33657 <NA>
>>   [21]     chr2 [167085201, 167085482]      -   |     33656 <NA>
>>   [22]     chr2 [167084180, 167084233]      -   |     33655 <NA>
>>   [23]     chr2 [167083077, 167083214]      -   |     33654 <NA>
>>   [24]     chr2 [167060870, 167060974]      -   |     33653 <NA>
>>   [25]     chr2 [167060465, 167060735]      -   |     33652 <NA>
>>   [26]     chr2 [167055182, 167056374]      -   |     33651 <NA>
>>        exon_rank
>> <integer>
>>    [1]         2
>>    [2]         3
>>    [3]         4
>>    [4]         5
>>    [5]         6
>>    [6]         7
>>    [7]         8
>>    [8]         9
>>    [9]        10
>>    ...       ...
>>   [18]        19
>>   [19]        20
>>   [20]        21
>>   [21]        22
>>   [22]        23
>>   [23]        24
>>   [24]        25
>>   [25]        26
>>   [26]        27
>>   ---
>>   seqlengths:
>>                     chr1                  chr2 ... chr18_gl000207_random
>>                249250621             243199373 ...                  4262
>>
>>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: 
> http://news.gmane.org/gmane.science.biology.informatics.conductor