[BioC] Finding coding SNPs with predictCoding
Valerie Obenchain
vobencha at fhcrc.org
Fri Feb 24 19:39:51 CET 2012
Hi Alex,
Thanks for the bug report. The cdsID was taken from an overlap between
the query and GRangesList of cds by transcripts. This gave the correct
transcript number but (incorrectly) took the first cds number in the
list by default. Now fixed in devel 1.1.55.
I've also updated the man page.
Valerie
On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
> On 22.02.2012 18:58, Hervé Pagès wrote:
>> Hi Alex,
>>
>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>
> [...]
>
>>> But the predictCoding call gives this error:
>>>
>>> Error in .setSeqNames(x, value) :
>>> The replacement value for isActiveSeq must be a logical vector, with
>>> names that match the seqlevels of the object
>>
>> The error message doesn't help much but I think the pb is that you
>> didn't rename chMT properly. Try to do this:
>>
>> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>
>> before you start the for(eg in entrez.ids){..} loop again.
>>
>> Cheers,
>> H.
>
> Thanks Hervé that nailed it. I'm having some difficulty joining up the
> output of predictCoding() with the query SNPs though. If someone could
> point out where the disconnect in my thinking is I would appreciate it!
>
> Here's my (now edited down) script:
>
> library(BSgenome.Hsapiens.UCSC.hg19)
> library(VariantAnnotation)
> library(SNPlocs.Hsapiens.dbSNP.20110815)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
> entrez.ids = c('6335')
> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>
> snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>
> gene.list = cdsBy(txdb19, by="gene")
> vsd.list = gene.list[entrez.ids]
> cds.list = cdsBy(txdb19,by="tx")
>
> eg = entrez.ids[1]
>
> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
> eg.snps = snps[snp.idx]
> iupac = values(eg.snps)[,"alleles_as_ambig"]
> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
> variant.alleles =
> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
>
> aa =
> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>
> #####
>
> Then if I query the predictCoding results in aa in an interactive
> session I get the following (see inline comments for what I think
> should be happening, but I must be misinterpreting what queryID means)
>
> The docs for predictCoding() contain a small typo
> (s/queryHits/queryID), but otherwise seem clear?
>
> Columns include ‘queryID’, ‘consequence’, ‘refSeq’, ‘varSeq’,
> ‘refAA’, ‘varAA’, ‘txID’, ‘geneID’, and ‘cdsID’.
>
> ‘queryHits’ The ‘queryHits’ column provides a map back to the
> variants in the original ‘query’. If the ‘query’ was a ‘VCF’
> object this index corresponds to the row in the ‘GRanges’ in
> the ‘rowData’ slot. If ‘query’ was an expanded ‘GRanges’,
> ‘RangedData’ or ‘RangesList’ the index corresponds to the row
> in the expanded object.
>
> #####
>
>> aa[1,]
> DataFrame with 1 row and 9 columns
> queryID consequence refSeq varSeq refAA
> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
> 1 1 nonsynonymous CTC ATC L
> varAA txID geneID cdsID
> <AAStringSet> <character> <factor> <integer>
> 1 I 10921 6335 33668
>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx
>> '10921' and cds '33668'.
>> #If I look at the first query SNP I get this:
>> eg.snps.exp[aa[1,'queryID'],]
> GRanges with 1 range and 2 elementMetadata values:
> seqnames ranges strand | RefSNP_id
> alleles_as_ambig
> <Rle> <IRanges> <Rle> | <character> <character>
> [1] chr2 [167055370, 167055370] * |
> 111558968 R
> ---
> seqlengths:
> chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX
> chrY chrM
> NA NA NA NA NA NA ... NA NA NA NA
> NA NA
>> #So SNP 1 is at 167055370 on chr2
>> #But if I check tx '10921' I see that the cds overlapping 167055370
>> is actually '33651'
>> #And cds '33668' is at the other end of the tx:
>> cds.list[[aa[1,'txID']]]
> GRanges with 26 ranges and 3 elementMetadata values:
> seqnames ranges strand | cds_id cds_name
> <Rle> <IRanges> <Rle> | <integer> <character>
> [1] chr2 [167168009, 167168266] - | 33668 <NA>
> [2] chr2 [167163466, 167163584] - | 33667 <NA>
> [3] chr2 [167163020, 167163109] - | 33666 <NA>
> [4] chr2 [167162302, 167162430] - | 33647 <NA>
> [5] chr2 [167160748, 167160839] - | 33646 <NA>
> [6] chr2 [167159600, 167159812] - | 33645 <NA>
> [7] chr2 [167151109, 167151172] - | 33644 <NA>
> [8] chr2 [167149741, 167149882] - | 33643 <NA>
> [9] chr2 [167144947, 167145153] - | 33642 <NA>
> ... ... ... ... ... ... ...
> [18] chr2 [167099012, 167099166] - | 33659 <NA>
> [19] chr2 [167094604, 167094777] - | 33658 <NA>
> [20] chr2 [167089850, 167089972] - | 33657 <NA>
> [21] chr2 [167085201, 167085482] - | 33656 <NA>
> [22] chr2 [167084180, 167084233] - | 33655 <NA>
> [23] chr2 [167083077, 167083214] - | 33654 <NA>
> [24] chr2 [167060870, 167060974] - | 33653 <NA>
> [25] chr2 [167060465, 167060735] - | 33652 <NA>
> [26] chr2 [167055182, 167056374] - | 33651 <NA>
> exon_rank
> <integer>
> [1] 2
> [2] 3
> [3] 4
> [4] 5
> [5] 6
> [6] 7
> [7] 8
> [8] 9
> [9] 10
> ... ...
> [18] 19
> [19] 20
> [20] 21
> [21] 22
> [22] 23
> [23] 24
> [24] 25
> [25] 26
> [26] 27
> ---
> seqlengths:
> chr1 chr2 ... chr18_gl000207_random
> 249250621 243199373 ... 4262
>
>
More information about the Bioconductor
mailing list