[BioC] Finding coding SNPs with predictCoding

Fri Feb 24 19:39:51 CET 2012

Hi Alex,

Thanks for the bug report. The cdsID was taken from an overlap between 
the query and GRangesList of cds by transcripts. This gave the correct 
transcript number but (incorrectly) took the first cds number in the 
list by default. Now fixed in devel 1.1.55.

I've also updated the man page.

Valerie

On 02/24/2012 02:08 AM, Alex Gutteridge wrote:
> On 22.02.2012 18:58, Hervé Pagès wrote:
>> Hi Alex,
>>
>> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:
>
> [...]
>
>>> But the predictCoding call gives this error:
>>>
>>> Error in .setSeqNames(x, value) :
>>> The replacement value for isActiveSeq must be a logical vector, with
>>> names that match the seqlevels of the object
>>
>> The error message doesn't help much but I think the pb is that you
>> didn't rename chMT properly. Try to do this:
>>
>>   seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>>
>> before you start the for(eg in entrez.ids){..} loop again.
>>
>> Cheers,
>> H.
>
> Thanks Hervé that nailed it. I'm having some difficulty joining up the 
> output of predictCoding() with the query SNPs though. If someone could 
> point out where the disconnect in my thinking is I would appreciate it!
>
> Here's my (now edited down) script:
>
> library(BSgenome.Hsapiens.UCSC.hg19)
> library(VariantAnnotation)
> library(SNPlocs.Hsapiens.dbSNP.20110815)
> library(TxDb.Hsapiens.UCSC.hg19.knownGene)
>
> entrez.ids = c('6335')
> txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene
>
> snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
> seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
> seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>
> gene.list = cdsBy(txdb19, by="gene")
> vsd.list = gene.list[entrez.ids]
> cds.list = cdsBy(txdb19,by="tx")
>
> eg = entrez.ids[1]
>
> snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
> eg.snps = snps[snp.idx]
> iupac   = values(eg.snps)[,"alleles_as_ambig"]
> eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
> variant.alleles = 
> DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])
>
> aa = 
> predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)
>
> #####
>
> Then if I query the predictCoding results in aa in an interactive 
> session I get the following (see inline comments for what I think 
> should be happening, but I must be misinterpreting what queryID means)
>
> The docs for predictCoding() contain a small typo 
> (s/queryHits/queryID), but otherwise seem clear?
>
> Columns include ‘queryID’, ‘consequence’, ‘refSeq’, ‘varSeq’,
>      ‘refAA’, ‘varAA’, ‘txID’, ‘geneID’, and ‘cdsID’.
>
>      ‘queryHits’ The ‘queryHits’ column provides a map back to the
>           variants in the original ‘query’. If the ‘query’ was a ‘VCF’
>           object this index corresponds to the row in the ‘GRanges’ in
>           the ‘rowData’ slot. If ‘query’ was an expanded ‘GRanges’,
>           ‘RangedData’ or ‘RangesList’ the index corresponds to the row
>           in the expanded object.
>
> #####
>
>> aa[1,]
> DataFrame with 1 row and 9 columns
>     queryID   consequence         refSeq         varSeq         refAA
> <integer> <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
> 1         1 nonsynonymous            CTC            ATC             L
>           varAA        txID   geneID     cdsID
> <AAStringSet> <character> <factor> <integer>
> 1             I       10921     6335     33668
>> #So the first SNP (queryID: 1) is nonsynonymous and maps to tx 
>> '10921' and cds '33668'.
>> #If I look at the first query SNP I get this:
>> eg.snps.exp[aa[1,'queryID'],]
> GRanges with 1 range and 2 elementMetadata values:
>       seqnames                 ranges strand |   RefSNP_id 
> alleles_as_ambig
> <Rle> <IRanges> <Rle> | <character> <character>
>   [1]     chr2 [167055370, 167055370]      * |   
> 111558968               R
>   ---
>   seqlengths:
>     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  chrX  
> chrY  chrM
>       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    
> NA    NA
>> #So SNP 1 is at 167055370 on chr2
>> #But if I check tx '10921' I see that the cds overlapping 167055370 
>> is actually '33651'
>> #And cds '33668' is at the other end of the tx:
>> cds.list[[aa[1,'txID']]]
> GRanges with 26 ranges and 3 elementMetadata values:
>        seqnames                 ranges strand   |    cds_id    cds_name
> <Rle> <IRanges> <Rle>   | <integer> <character>
>    [1]     chr2 [167168009, 167168266]      -   |     33668 <NA>
>    [2]     chr2 [167163466, 167163584]      -   |     33667 <NA>
>    [3]     chr2 [167163020, 167163109]      -   |     33666 <NA>
>    [4]     chr2 [167162302, 167162430]      -   |     33647 <NA>
>    [5]     chr2 [167160748, 167160839]      -   |     33646 <NA>
>    [6]     chr2 [167159600, 167159812]      -   |     33645 <NA>
>    [7]     chr2 [167151109, 167151172]      -   |     33644 <NA>
>    [8]     chr2 [167149741, 167149882]      -   |     33643 <NA>
>    [9]     chr2 [167144947, 167145153]      -   |     33642 <NA>
>    ...      ...                    ...    ... ...       ...         ...
>   [18]     chr2 [167099012, 167099166]      -   |     33659 <NA>
>   [19]     chr2 [167094604, 167094777]      -   |     33658 <NA>
>   [20]     chr2 [167089850, 167089972]      -   |     33657 <NA>
>   [21]     chr2 [167085201, 167085482]      -   |     33656 <NA>
>   [22]     chr2 [167084180, 167084233]      -   |     33655 <NA>
>   [23]     chr2 [167083077, 167083214]      -   |     33654 <NA>
>   [24]     chr2 [167060870, 167060974]      -   |     33653 <NA>
>   [25]     chr2 [167060465, 167060735]      -   |     33652 <NA>
>   [26]     chr2 [167055182, 167056374]      -   |     33651 <NA>
>        exon_rank
> <integer>
>    [1]         2
>    [2]         3
>    [3]         4
>    [4]         5
>    [5]         6
>    [6]         7
>    [7]         8
>    [8]         9
>    [9]        10
>    ...       ...
>   [18]        19
>   [19]        20
>   [20]        21
>   [21]        22
>   [22]        23
>   [23]        24
>   [24]        25
>   [25]        26
>   [26]        27
>   ---
>   seqlengths:
>                     chr1                  chr2 ... chr18_gl000207_random
>                249250621             243199373 ...                  4262
>
>