[BioC] Finding coding SNPs with predictCoding
Alex Gutteridge
alexg at ruggedtextile.com
Fri Mar 2 10:11:53 CET 2012
Thanks Valerie - much appreciated!
On 01.03.2012 21:30, Valerie Obenchain wrote:
> A 'txLoc' column has been added to the output of predictCoding.
> Available in devel version 1.1.57.
>
> Valerie
>
>
> On 02/28/2012 08:20 AM, Valerie Obenchain wrote:
>> Good suggestion. Yes, predictCoding is does this internally. I'll
>> post back here when this has been added.
>>
>> Valerie
>>
>>
>>
>> On 02/28/2012 01:49 AM, Alex Gutteridge wrote:
>>> Hi Valerie,
>>>
>>> Thanks everything works great now. One small feature request -
>>> would it be hard to output the protein sequence position of the
>>> coding SNPs? At the moment once I've run predictCoding I'm
>>> re-extracting the cds and working out the position of each coding SNP
>>> so I can see where in the protein sequence it is, but it seems like
>>> this is probably just replicating what predictCoding must be doing
>>> internally anyway?
>>>
>>> Alex Gutteridge
>>
>>
>> On 02/24/2012 10:39 AM, Valerie Obenchain wrote:
>>> Hi Alex,
>>>
>>> Thanks for the bug report. The cdsID was taken from an overlap
>>> between the query and GRangesList of cds by transcripts. This gave
>>> the correct transcript number but (incorrectly) took the first cds
>>> number in the list by default. Now fixed in devel 1.1.55.
>>>
>>> I've also updated the man page.
>>>
>>> Valerie
>>>
>>>
>>>
>>> On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
>>>> On 22.02.2012 18:58, Hervé Pagès wrote:
>>>>> Hi Alex,
>>>>>
>>>>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>>>>
>>>> [...]
>>>>
>>>>>> But the predictCoding call gives this error:
>>>>>>
>>>>>> Error in .setSeqNames(x, value) :
>>>>>> The replacement value for isActiveSeq must be a logical vector,
>>>>>> with
>>>>>> names that match the seqlevels of the object
>>>>>
>>>>> The error message doesn't help much but I think the pb is that
>>>>> you
>>>>> didn't rename chMT properly. Try to do this:
>>>>>
>>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>>>>
>>>>> before you start the for(eg in entrez.ids){..} loop again.
>>>>>
>>>>> Cheers,
>>>>> H.
>>>>
>>>> Thanks Hervé that nailed it. I'm having some difficulty joining up
>>>> the output of predictCoding() with the query SNPs though. If someone
>>>> could point out where the disconnect in my thinking is I would
>>>> appreciate it!
>>>>
>>>> Here's my (now edited down) script:
>>>>
>>>> library(BSgenome.Hsapiens.UCSC.hg19)
>>>> library(VariantAnnotation)
>>>> library(SNPlocs.Hsapiens.dbSNP.20110815)
>>>> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>>>>
>>>> entrez.ids = c('6335')
>>>> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>>>>
>>>> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
>>>> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
>>>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>>>
>>>> gene.list = cdsBy(txdb19, by="gene")
>>>> vsd.list = gene.list[entrez.ids]
>>>> cds.list = cdsBy(txdb19,by="tx")
>>>>
>>>> eg = entrez.ids[1]
>>>>
>>>> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
>>>> eg.snps = snps[snp.idx]
>>>> iupac = values(eg.snps)[,"alleles_as_ambig"]
>>>> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
>>>> variant.alleles =
>>>> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
>>>>
>>>>
>>>> aa =
>>>> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>>>>
>>>> #####
>>>>
>>>> Then if I query the predictCoding results in aa in an interactive
>>>> session I get the following (see inline comments for what I think
>>>> should be happening, but I must be misinterpreting what queryID
>>>> means)
>>>>
>>>> The docs for predictCoding() contain a small typo
>>>> (s/queryHits/queryID), but otherwise seem clear?
>>>>
>>>> Columns include ‘queryID’, ‘consequence’, ‘refSeq’, ‘varSeq’,
>>>> ‘refAA’, ‘varAA’, ‘txID’, ‘geneID’, and ‘cdsID’.
>>>>
>>>> ‘queryHits’ The ‘queryHits’ column provides a map back to the
>>>> variants in the original ‘query’. If the ‘query’ was a
>>>> ‘VCF’
>>>> object this index corresponds to the row in the
>>>> ‘GRanges’ in
>>>> the ‘rowData’ slot. If ‘query’ was an expanded
>>>> ‘GRanges’,
>>>> ‘RangedData’ or ‘RangesList’ the index corresponds to
>>>> the row
>>>> in the expanded object.
>>>>
>>>> #####
>>>>
>>>>> aa[1,]
>>>> DataFrame with 1 row and 9 columns
>>>> queryID consequence refSeq varSeq
>>>> refAA
>>>> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
>>>> 1 1 nonsynonymous CTC ATC
>>>> L
>>>> varAA txID geneID cdsID
>>>> <AAStringSet> <character> <factor> <integer>
>>>> 1 I 10921 6335 33668
>>>>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx
>>>>> '10921' and cds '33668'.
>>>>> #If I look at the first query SNP I get this:
>>>>> eg.snps.exp[aa[1,'queryID'],]
>>>> GRanges with 1 range and 2 elementMetadata values:
>>>> seqnames ranges strand | RefSNP_id
>>>> alleles_as_ambig
>>>> <Rle> <IRanges> <Rle> | <character> <character>
>>>> [1] chr2 [167055370, 167055370] * | 111558968
>>>> R
>>>> ---
>>>> seqlengths:
>>>> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX
>>>> chrY chrM
>>>> NA NA NA NA NA NA ... NA NA NA NA
>>>> NA NA
>>>>> #So SNP 1 is at 167055370 on chr2
>>>>> #But if I check tx '10921' I see that the cds overlapping
>>>>> 167055370 is actually '33651'
>>>>> #And cds '33668' is at the other end of the tx:
>>>>> cds.list[[aa[1,'txID']]]
>>>> GRanges with 26 ranges and 3 elementMetadata values:
>>>> seqnames ranges strand | cds_id
>>>> cds_name
>>>> <Rle> <IRanges> <Rle> | <integer> <character>
>>>> [1] chr2 [167168009, 167168266] - | 33668 <NA>
>>>> [2] chr2 [167163466, 167163584] - | 33667 <NA>
>>>> [3] chr2 [167163020, 167163109] - | 33666 <NA>
>>>> [4] chr2 [167162302, 167162430] - | 33647 <NA>
>>>> [5] chr2 [167160748, 167160839] - | 33646 <NA>
>>>> [6] chr2 [167159600, 167159812] - | 33645 <NA>
>>>> [7] chr2 [167151109, 167151172] - | 33644 <NA>
>>>> [8] chr2 [167149741, 167149882] - | 33643 <NA>
>>>> [9] chr2 [167144947, 167145153] - | 33642 <NA>
>>>> ... ... ... ... ... ...
>>>> ...
>>>> [18] chr2 [167099012, 167099166] - | 33659 <NA>
>>>> [19] chr2 [167094604, 167094777] - | 33658 <NA>
>>>> [20] chr2 [167089850, 167089972] - | 33657 <NA>
>>>> [21] chr2 [167085201, 167085482] - | 33656 <NA>
>>>> [22] chr2 [167084180, 167084233] - | 33655 <NA>
>>>> [23] chr2 [167083077, 167083214] - | 33654 <NA>
>>>> [24] chr2 [167060870, 167060974] - | 33653 <NA>
>>>> [25] chr2 [167060465, 167060735] - | 33652 <NA>
>>>> [26] chr2 [167055182, 167056374] - | 33651 <NA>
>>>> exon_rank
>>>> <integer>
>>>> [1] 2
>>>> [2] 3
>>>> [3] 4
>>>> [4] 5
>>>> [5] 6
>>>> [6] 7
>>>> [7] 8
>>>> [8] 9
>>>> [9] 10
>>>> ... ...
>>>> [18] 19
>>>> [19] 20
>>>> [20] 21
>>>> [21] 22
>>>> [22] 23
>>>> [23] 24
>>>> [24] 25
>>>> [25] 26
>>>> [26] 27
>>>> ---
>>>> seqlengths:
>>>> chr1 chr2 ...
>>>> chr18_gl000207_random
>>>> 249250621 243199373 ...
>>>> 4262
>>>>
>>>>
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>> _______________________________________________
>> Bioconductor mailing list
>> Bioconductor at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>> Search the archives:
>> http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Alex Gutteridge
More information about the Bioconductor
mailing list