[BioC] Finding coding SNPs with predictCoding
Alex Gutteridge
alexg at ruggedtextile.com
Fri Feb 24 11:08:17 CET 2012
On 22.02.2012 18:58, Hervé Pagès wrote:
> Hi Alex,
>
> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
[...]
>> But the predictCoding call gives this error:
>>
>> Error in .setSeqNames(x, value) :
>> The replacement value for isActiveSeq must be a logical vector, with
>> names that match the seqlevels of the object
>
> The error message doesn't help much but I think the pb is that you
> didn't rename chMT properly. Try to do this:
>
> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>
> before you start the for(eg in entrez.ids){..} loop again.
>
> Cheers,
> H.
Thanks Hervé that nailed it. I'm having some difficulty joining up the
output of predictCoding() with the query SNPs though. If someone could
point out where the disconnect in my thinking is I would appreciate it!
Here's my (now edited down) script:
library(BSgenome.Hsapiens.UCSC.hg19)
library(VariantAnnotation)
library(SNPlocs.Hsapiens.dbSNP.20110815)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)
entrez.ids = c('6335')
txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
snps = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
gene.list = cdsBy(txdb19, by="gene")
vsd.list = gene.list[entrez.ids]
cds.list = cdsBy(txdb19,by="tx")
eg = entrez.ids[1]
snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
eg.snps = snps[snp.idx]
iupac = values(eg.snps)[,"alleles_as_ambig"]
eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
variant.alleles =
DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
aa =
predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
#####
Then if I query the predictCoding results in aa in an interactive
session I get the following (see inline comments for what I think should
be happening, but I must be misinterpreting what queryID means)
The docs for predictCoding() contain a small typo
(s/queryHits/queryID), but otherwise seem clear?
Columns include ‘queryID’, ‘consequence’, ‘refSeq’, ‘varSeq’,
‘refAA’, ‘varAA’, ‘txID’, ‘geneID’, and ‘cdsID’.
‘queryHits’ The ‘queryHits’ column provides a map back to the
variants in the original ‘query’. If the ‘query’ was a ‘VCF’
object this index corresponds to the row in the ‘GRanges’ in
the ‘rowData’ slot. If ‘query’ was an expanded ‘GRanges’,
‘RangedData’ or ‘RangesList’ the index corresponds to the row
in the expanded object.
#####
> aa[1,]
DataFrame with 1 row and 9 columns
queryID consequence refSeq varSeq refAA
<integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
1 1 nonsynonymous CTC ATC L
varAA txID geneID cdsID
<AAStringSet> <character> <factor> <integer>
1 I 10921 6335 33668
>#So the first SNP (queryID: 1) is nonsynonymous and maps to tx '10921'
> and cds '33668'.
>#If I look at the first query SNP I get this:
> eg.snps.exp[aa[1,'queryID'],]
GRanges with 1 range and 2 elementMetadata values:
seqnames ranges strand | RefSNP_id
alleles_as_ambig
<Rle> <IRanges> <Rle> | <character>
<character>
[1] chr2 [167055370, 167055370] * | 111558968
R
---
seqlengths:
chr1 chr2 chr3 chr4 chr5 chr6 ... chr20 chr21 chr22 chrX
chrY chrM
NA NA NA NA NA NA ... NA NA NA NA
NA NA
>#So SNP 1 is at 167055370 on chr2
>#But if I check tx '10921' I see that the cds overlapping 167055370 is
> actually '33651'
>#And cds '33668' is at the other end of the tx:
> cds.list[[aa[1,'txID']]]
GRanges with 26 ranges and 3 elementMetadata values:
seqnames ranges strand | cds_id cds_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr2 [167168009, 167168266] - | 33668 <NA>
[2] chr2 [167163466, 167163584] - | 33667 <NA>
[3] chr2 [167163020, 167163109] - | 33666 <NA>
[4] chr2 [167162302, 167162430] - | 33647 <NA>
[5] chr2 [167160748, 167160839] - | 33646 <NA>
[6] chr2 [167159600, 167159812] - | 33645 <NA>
[7] chr2 [167151109, 167151172] - | 33644 <NA>
[8] chr2 [167149741, 167149882] - | 33643 <NA>
[9] chr2 [167144947, 167145153] - | 33642 <NA>
... ... ... ... ... ... ...
[18] chr2 [167099012, 167099166] - | 33659 <NA>
[19] chr2 [167094604, 167094777] - | 33658 <NA>
[20] chr2 [167089850, 167089972] - | 33657 <NA>
[21] chr2 [167085201, 167085482] - | 33656 <NA>
[22] chr2 [167084180, 167084233] - | 33655 <NA>
[23] chr2 [167083077, 167083214] - | 33654 <NA>
[24] chr2 [167060870, 167060974] - | 33653 <NA>
[25] chr2 [167060465, 167060735] - | 33652 <NA>
[26] chr2 [167055182, 167056374] - | 33651 <NA>
exon_rank
<integer>
[1] 2
[2] 3
[3] 4
[4] 5
[5] 6
[6] 7
[7] 8
[8] 9
[9] 10
... ...
[18] 19
[19] 20
[20] 21
[21] 22
[22] 23
[23] 24
[24] 25
[25] 26
[26] 27
---
seqlengths:
chr1 chr2 ...
chr18_gl000207_random
249250621 243199373 ...
4262
--
Alex Gutteridge
More information about the Bioconductor
mailing list