[BioC] Finding coding SNPs with predictCoding

Fri Feb 24 11:08:17 CET 2012

On 22.02.2012 18:58, Hervé Pagès wrote:
> Hi Alex,
>
> On 02/22/2012 03:56 AM, Alex Gutteridge wrote:

[...]

>> But the predictCoding call gives this error:
>>
>> Error in .setSeqNames(x, value) :
>> The replacement value for isActiveSeq must be a logical vector, with
>> names that match the seqlevels of the object
>
> The error message doesn't help much but I think the pb is that you
> didn't rename chMT properly. Try to do this:
>
>   seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))
>
> before you start the for(eg in entrez.ids){..} loop again.
>
> Cheers,
> H.

Thanks Hervé that nailed it. I'm having some difficulty joining up the 
output of predictCoding() with the query SNPs though. If someone could 
point out where the disconnect in my thinking is I would appreciate it!

Here's my (now edited down) script:

library(BSgenome.Hsapiens.UCSC.hg19)
library(VariantAnnotation)
library(SNPlocs.Hsapiens.dbSNP.20110815)
library(TxDb.Hsapiens.UCSC.hg19.knownGene)

entrez.ids = c('6335')
txdb19 = TxDb.Hsapiens.UCSC.hg19.knownGene

snps   = getSNPlocs(c("ch1","ch2"),as.GRanges=T)
seqlevels(snps) <- gsub("ch", "chr", seqlevels(snps))
seqlevels(snps) <- gsub("chrMT", "chrM", seqlevels(snps))

gene.list = cdsBy(txdb19, by="gene")
vsd.list = gene.list[entrez.ids]
cds.list = cdsBy(txdb19,by="tx")

eg = entrez.ids[1]

snp.idx = unique(queryHits(findOverlaps(snps, vsd.list[[eg]])))
eg.snps = snps[snp.idx]
iupac   = values(eg.snps)[,"alleles_as_ambig"]
eg.snps.exp = rep(eg.snps, nchar(IUPAC_CODE_MAP[iupac]))
variant.alleles = 
DNAStringSet(strsplit(paste(IUPAC_CODE_MAP[iupac],collapse=""),"")[[1]])

aa = 
predictCoding(eg.snps.exp,txdb19,seqSource=Hsapiens,varAllele=variant.alleles)

#####

Then if I query the predictCoding results in aa in an interactive 
session I get the following (see inline comments for what I think should 
be happening, but I must be misinterpreting what queryID means)

The docs for predictCoding() contain a small typo 
(s/queryHits/queryID), but otherwise seem clear?

Columns include ‘queryID’, ‘consequence’, ‘refSeq’, ‘varSeq’,
      ‘refAA’, ‘varAA’, ‘txID’, ‘geneID’, and ‘cdsID’.

      ‘queryHits’ The ‘queryHits’ column provides a map back to the
           variants in the original ‘query’. If the ‘query’ was a ‘VCF’
           object this index corresponds to the row in the ‘GRanges’ in
           the ‘rowData’ slot. If ‘query’ was an expanded ‘GRanges’,
           ‘RangedData’ or ‘RangesList’ the index corresponds to the row
           in the expanded object.

#####

> aa[1,]
DataFrame with 1 row and 9 columns
     queryID   consequence         refSeq         varSeq         refAA
   <integer>      <factor> <DNAStringSet> <DNAStringSet> <AAStringSet>
1         1 nonsynonymous            CTC            ATC             L
           varAA        txID   geneID     cdsID
   <AAStringSet> <character> <factor> <integer>
1             I       10921     6335     33668
>#So the first SNP (queryID: 1) is nonsynonymous and maps to tx '10921' 
> and cds '33668'.
>#If I look at the first query SNP I get this:
> eg.snps.exp[aa[1,'queryID'],]
GRanges with 1 range and 2 elementMetadata values:
       seqnames                 ranges strand |   RefSNP_id 
alleles_as_ambig
          <Rle>              <IRanges>  <Rle> | <character>      
<character>
   [1]     chr2 [167055370, 167055370]      * |   111558968              
  R
   ---
   seqlengths:
     chr1  chr2  chr3  chr4  chr5  chr6 ... chr20 chr21 chr22  chrX  
chrY  chrM
       NA    NA    NA    NA    NA    NA ...    NA    NA    NA    NA    
NA    NA
>#So SNP 1 is at 167055370 on chr2
>#But if I check tx '10921' I see that the cds overlapping 167055370 is 
> actually '33651'
>#And cds '33668' is at the other end of the tx:
> cds.list[[aa[1,'txID']]]
GRanges with 26 ranges and 3 elementMetadata values:
        seqnames                 ranges strand   |    cds_id    cds_name
           <Rle>              <IRanges>  <Rle>   | <integer> <character>
    [1]     chr2 [167168009, 167168266]      -   |     33668        <NA>
    [2]     chr2 [167163466, 167163584]      -   |     33667        <NA>
    [3]     chr2 [167163020, 167163109]      -   |     33666        <NA>
    [4]     chr2 [167162302, 167162430]      -   |     33647        <NA>
    [5]     chr2 [167160748, 167160839]      -   |     33646        <NA>
    [6]     chr2 [167159600, 167159812]      -   |     33645        <NA>
    [7]     chr2 [167151109, 167151172]      -   |     33644        <NA>
    [8]     chr2 [167149741, 167149882]      -   |     33643        <NA>
    [9]     chr2 [167144947, 167145153]      -   |     33642        <NA>
    ...      ...                    ...    ... ...       ...         ...
   [18]     chr2 [167099012, 167099166]      -   |     33659        <NA>
   [19]     chr2 [167094604, 167094777]      -   |     33658        <NA>
   [20]     chr2 [167089850, 167089972]      -   |     33657        <NA>
   [21]     chr2 [167085201, 167085482]      -   |     33656        <NA>
   [22]     chr2 [167084180, 167084233]      -   |     33655        <NA>
   [23]     chr2 [167083077, 167083214]      -   |     33654        <NA>
   [24]     chr2 [167060870, 167060974]      -   |     33653        <NA>
   [25]     chr2 [167060465, 167060735]      -   |     33652        <NA>
   [26]     chr2 [167055182, 167056374]      -   |     33651        <NA>
        exon_rank
        <integer>
    [1]         2
    [2]         3
    [3]         4
    [4]         5
    [5]         6
    [6]         7
    [7]         8
    [8]         9
    [9]        10
    ...       ...
   [18]        19
   [19]        20
   [20]        21
   [21]        22
   [22]        23
   [23]        24
   [24]        25
   [25]        26
   [26]        27
   ---
   seqlengths:
                     chr1                  chr2 ... 
chr18_gl000207_random
                249250621             243199373 ...                  
4262

-- 
Alex Gutteridge