[BioC] RE : maping SNPs

Simon Noël simon.noel.2 at ulaval.ca
Thu Feb 9 21:28:39 CET 2012


Hello Mr. Pagès,

At the begining of my master, you really helped me to map my SNPs to their gene with the code bellow.  As you remember, with your help we changed it a little bit and we have got with R2.10 :

library("IRanges")
library("GenomicRanges")
library("GenomicFeatures")
#À changer si une version plus récente de la librairie est téléchargée.
library(SNPlocs.Hsapiens.dbSNP.20101109)
library("org.Hs.eg.db")

#Allocation de la mémoire sous windows
memory.limit(size = 4095)
#vérification de la librairie SNPlocs.Hsapiens.dbSNP
getSNPcount()
ch22snps <- getSNPlocs("ch22")
ch22snps[1:5, ]

#Create a GRange objetc to use with GenomicRanges library
makeGRangesFromRefSNPids <- function(myids, verbose=FALSE)
{
     ans_seqnames <- character(length(myids))
     ans_seqnames[] <- "unknown"
     ans_locs <- integer(length(myids))
     for (seqname in names(getSNPcount())) {
         if (verbose)
             cat("Processing ", seqname, " SNPs ...\n", sep="")
         locs <- getSNPlocs(seqname)
         ids <- paste("rs", locs$RefSNP_id, sep="")
         myrows <- match(myids, ids)
         hit_idx <- !is.na(myrows)
         ans_seqnames[hit_idx] <- seqname
         ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]]
     }
     GRanges(seqnames=ans_seqnames,
             IRanges(start=ans_locs, width=1),
             RefSNP_id=myids)
}

#Test en utilisant les premières sondes du premier et second chormosome
#myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436")
#ouverture du fichier pour aller chercher nos numéros rs
rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE)
myids <- rs_SNPs[,1]
mysnps <- makeGRangesFromRefSNPids(myids)
mysnps  # a GRanges object with 1 SNP per row
#create a TranscriptDb
txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
txdb
#extract the transcript locations together with their genes
tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
tx  # a GRanges object with 1 transcript per row
#rename chromosome to fit USCS standard
seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
#vérifier pour X/Y  -> seqnames(mysnps) <- sub("chrX", "chrY", seqnames(mysnps))
#mapping but not on a readable format
map <- as.matrix(findOverlaps(mysnps, tx))

#making the mapping readable
mapped_genes <- values(tx)$gene_id[map[, 2]]
mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes))
snp2gene <- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes)))
rownames(snp2gene) <- NULL
snp2gene

#snp2gene se travaille mal alors on le transfère en matrice
snp2geneTmp = t(t(snp2gene))
#aller chercher les symboles correspondants à nos gene id
symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA))

save.image(file = "map.RData")





And everything was working perfectly.

Now, I have done a lot of script to analyse my data on a lot of way and I think it's time to update my old mapping.  I have try the same script on R 2.14 but changed library(SNPlocs.Hsapiens.dbSNP.20101109) for library(SNPlocs.Hsapiens.dbSNP.20110815) and now I get some error...  Can you help me?  The problems seems to start with "map <- as.matrix(findOverlaps(mysnps, tx))" and the other error seems to result from that problem.



sessionInfo()
R version 2.14.1 (2011-12-22)
Platform: x86_64-pc-linux-gnu (64-bit)
locale:
 [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
 [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
 [5] LC_MONETARY=en_CA.UTF-8    LC_MESSAGES=en_CA.UTF-8
 [7] LC_PAPER=C                 LC_NAME=C
 [9] LC_ADDRESS=C               LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base
other attached packages:
[1] org.Hs.eg.db_2.6.4
[2] RSQLite_0.11.1
[3] DBI_0.2-5
[4] SNPlocs.Hsapiens.dbSNP.20110815_0.99.6
[5] GenomicFeatures_1.6.7
[6] AnnotationDbi_1.16.11
[7] Biobase_2.14.0
[8] GenomicRanges_1.6.7
[9] IRanges_1.12.6
loaded via a namespace (and not attached):
[1] biomaRt_2.10.0     Biostrings_2.22.0  BSgenome_1.22.0    RCurl_1.9-5
[5] rtracklayer_1.14.4 tools_2.14.1       XML_3.9-4          zlibbioc_1.0.0

> library("IRanges")
Attaching package: âIRangesâ
The following object(s) are masked from âpackage:baseâ:
    cbind, eval, intersect, Map, mapply, order, paste, pmax, pmax.int,
    pmin, pmin.int, rbind, rep.int, setdiff, table, union
> library("GenomicRanges")
> library("GenomicFeatures")
Loading required package: AnnotationDbi
Loading required package: Biobase
Welcome to Bioconductor
  Vignettes contain introductory material. To view, type
  'browseVignettes()'. To cite Bioconductor, see
  'citation("Biobase")' and for packages 'citation("pkgname")'.

Attaching package: âBiobaseâ
The following object(s) are masked from âpackage:IRangesâ:
    updateObject
> #À changer si une version plus récente de la librairie est téléchargée.
> library(SNPlocs.Hsapiens.dbSNP.20110815)
> library("org.Hs.eg.db")
Loading required package: DBI
>
>
> #Allocation de la mémoire sous windows
> memory.limit(size = 4095)
[1] Inf
Warning message:
'memory.limit()' is Windows-specific
>
> #vérification de la librairie SNPlocs.Hsapiens.dbSNP
> getSNPcount()
    ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9    ch10
2509872 2612484 2240663 2143896 1964926 1975896 1818616 1699977 1403368 1544307
   ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19    ch20
1542256 1521919 1104719 1031214  949642 1084538  917737  886293  732039  788556
   ch21    ch22     chX     chY    chMT
 468379  454939  920890   75108     942
> ch22snps <- getSNPlocs("ch22")
> ch22snps[1:5, ]
  RefSNP_id alleles_as_ambig      loc
1  56342815                K 16050353
2 149201999                Y 16050408
3 146752890                S 16050612
4 139377059                Y 16050678
5 143300205                R 16050822
>
> #########################À FAIRE CANOPUS###################
>
> #Create a GRange objetc to use with GenomicRanges library
> makeGRangesFromRefSNPids <- function(myids, verbose=FALSE)
+ {
+      ans_seqnames <- character(length(myids))
+      ans_seqnames[] <- "unknown"
+      ans_locs <- integer(length(myids))
+      for (seqname in names(getSNPcount())) {
+          if (verbose)
+              cat("Processing ", seqname, " SNPs ...\n", sep="")
+          locs <- getSNPlocs(seqname)
+          ids <- paste("rs", locs$RefSNP_id, sep="")
+          myrows <- match(myids, ids)
+          hit_idx <- !is.na(myrows)
+          ans_seqnames[hit_idx] <- seqname
+          ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]]
+      }
+      GRanges(seqnames=ans_seqnames,
+              IRanges(start=ans_locs, width=1),
+              RefSNP_id=myids)
+ }
>
>
> #Test en utilisant les premières sondes du premier et second chormosome
> #myids <- c("rs4637157", "rs11900053", "rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs9681213", "rs1516321", "rs1400176", "rs990284", "rs954824", "rs10915459", "rs16838750", "rs12128230", "rs12557436")
>
> #ouverture du fichier pour aller chercher nos numéros rs
> rs_SNPs <- read.csv("info_snps.txt", sep = "\t", header=TRUE)
> myids <- rs_SNPs[,1]
>
> mysnps <- makeGRangesFromRefSNPids(myids)
> mysnps  # a GRanges object with 1 SNP per row
GRanges with 348411 ranges and 1 elementMetadata value:
           seqnames                 ranges strand   |  RefSNP_id
              <Rle>              <IRanges>  <Rle>   |   <factor>
       [1]      ch1     [2195117, 2195117]      *   |  rs7547453
       [2]      ch1     [2291680, 2291680]      *   |  rs2840542
       [3]      ch1     [3256108, 3256108]      *   |  rs1999527
       [4]      ch1     [3577321, 3577321]      *   |  rs4648545
       [5]      ch1     [4230463, 4230463]      *   | rs10915459
       [6]      ch1     [4404344, 4404344]      *   | rs16838750
       [7]      ch1     [4501911, 4501911]      *   | rs12128230
       [8]      ch1     [4535148, 4535148]      *   |  rs7541288
       [9]      ch1     [4581230, 4581230]      *   | rs12039682
       ...      ...                    ...    ... ...        ...
  [348403]      chX [154514047, 154514047]      *   |   rs499428
  [348404]      chX [154514919, 154514919]      *   |   rs507127
  [348405]      chX [154737376, 154737376]      *   |  rs5940372
  [348406]      chX [154780283, 154780283]      *   |  rs6642287
  [348407]      chX [154830377, 154830377]      *   |  rs5983658
  [348408]      chX [154870197, 154870197]      *   |   rs473772
  [348409]      chX [154892230, 154892230]      *   |   rs553678
  [348410]      chX [154899846, 154899846]      *   |   rs473491
  [348411]      chX [154929412, 154929412]      *   |   rs557132
  ---
  seqlengths:
       ch1    ch10    ch11    ch12    ch13 ...     ch8     ch9     chX unknown
        NA      NA      NA      NA      NA ...      NA      NA      NA      NA
>
> #create a TranscriptDb
> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
Download the refGene table ... OK
Download the refLink table ... OK
Extract the 'transcripts' data frame ... OK
Extract the 'splicings' data frame ... OK
Download and preprocess the 'chrominfo' data frame ... OK
Prepare the 'metadata' data frame ... OK
Make the TranscriptDb object ... OK
There were 50 or more warnings (use warnings() to see the first 50)
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| Genus and Species: Homo sapiens
| UCSC Table: refGene
| Resource URL: http://genome.ucsc.edu/
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 41677
| exon_nrow: 235596
| cds_nrow: 206518
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2012-02-09 15:20:45 -0500 (Thu, 09 Feb 2012)
| GenomicFeatures version at creation time: 1.6.7
| RSQLite version at creation time: 0.11.1
| DBSCHEMAVERSION: 1.0
| package: GenomicFeatures
>
> #extract the transcript locations together with their genes
> tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx  # a GRanges object with 1 transcript per row
GRanges with 41677 ranges and 3 elementMetadata values:
          seqnames               ranges strand   |     tx_id      tx_name
             <Rle>            <IRanges>  <Rle>   | <integer>  <character>
      [1]     chr1     [ 11874,  14408]      +   |      1127    NR_046018
      [2]     chr1     [ 69091,  70008]      +   |      1128 NM_001005484
      [3]     chr1     [323892, 328581]      +   |      1130    NR_028327
      [4]     chr1     [323892, 328581]      +   |      1132    NR_028325
      [5]     chr1     [323892, 328581]      +   |      1133    NR_028322
      [6]     chr1     [367659, 368597]      +   |      1131 NM_001005221
      [7]     chr1     [367659, 368597]      +   |      1134 NM_001005224
      [8]     chr1     [367659, 368597]      +   |      1135 NM_001005277
      [9]     chr1     [763064, 789740]      +   |       198    NR_015368
      ...      ...                  ...    ... ...       ...          ...
  [41669]     chrY [27177050, 27198251]      -   |     20790    NM_004678
  [41670]     chrY [27177050, 27198251]      -   |     20793 NM_001002761
  [41671]     chrY [27177050, 27198251]      -   |     20794 NM_001002760
  [41672]     chrY [27209230, 27246039]      -   |     20791    NR_002177
  [41673]     chrY [27209230, 27246039]      -   |     20792    NR_002178
  [41674]     chrY [27209230, 27246039]      -   |     20795    NR_001525
  [41675]     chrY [27329790, 27330920]      -   |     20796    NR_002179
  [41676]     chrY [27329790, 27330920]      -   |     20797    NR_002180
  [41677]     chrY [27329790, 27330920]      -   |     20798    NR_001526
                            gene_id
          <CompressedCharacterList>
      [1]                 100287102
      [2]                     79501
      [3]                 100133331
      [4]                 100132062
      [5]                 100132287
      [6]                    729759
      [7]                     26683
      [8]                     81399
      [9]                    643837
      ...                       ...
  [41669]                      9083
  [41670]                    442868
  [41671]                    442867
  [41672]                    474150
  [41673]                    474149
  [41674]                    114761
  [41675]                    474152
  [41676]                    474151
  [41677]                    252949
  ---
  seqlengths:
                    chr1                  chr2 ... chr18_gl000207_random
               249250621             243199373 ...                  4262
>
> #rename chromosome to fit USCS standard
> seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Error in `seqnames<-`(`*tmp*`, value = <S4 object of class "Rle">) :
  levels of supplied 'seqnames' must be identical to 'seqlevels(x)'
> #vérifier pour X/Y  -> seqnames(mysnps) <- sub("chrX", "chrY", seqnames(mysnps))
>
> #mapping but not on a readable format
> map <- as.matrix(findOverlaps(mysnps, tx))
Warning message:
In .Seqinfo.mergexy(x, y) :
  Each of the 2 combined objects has sequence levels not in the other:
  - in 'x': ch1, ch10, ch11, ch12, ch13, ch14, ch15, ch16, ch17, ch18, ch19, ch2, ch20, ch21, ch22, ch3, ch4, ch5, ch6, ch7, ch8, ch9, chX, unknown
  - in 'y': chr1, chr2, chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9, chr10, chr11, chr12, chr13, chr14, chr15, chr16, chr17, chr18, chr20, chrY, chr19, chr22, chr21, chr6_ssto_hap7, chr6_mcf_hap5, chr6_cox_hap2, chr6_mann_hap4, chr6_apd_hap1, chr6_qbl_hap6, chr6_dbb_hap3, chr17_ctg5_hap1, chr4_ctg9_hap1, chr1_gl000192_random, chrUn_gl000225, chr4_gl000194_random, chr4_gl000193_random, chr9_gl000200_random, chrUn_gl000222, chrUn_gl000212, chr7_gl000195_random, chrUn_gl000223, chrUn_gl000224, chrUn_gl000219, chr17_gl000205_random, chrUn_gl000215, chrUn_gl000216, chrUn_gl000217, chr9_gl000199_random, chrUn_gl000211, chrUn_gl000213, chrUn_gl000220, chrUn_gl000218, chr19_gl000209_random, chrUn_gl000221, chrUn_gl000214, chrUn_gl000228, chrUn_gl000227, chr1_gl000191_random, ch [... truncated]
>
>
> #making the mapping readable
> mapped_genes <- values(tx)$gene_id[map[, 2]]
> mapped_snps <- rep.int(values(mysnps)$RefSNP_id[map[, 1]], elementLengths(mapped_genes))
> snp2gene <- unique(data.frame(SNPNAME=mapped_snps, gene_id=unlist(mapped_genes)))
> rownames(snp2gene) <- NULL
> snp2gene
[1] SNPNAME gene_id
<0 rows> (or 0-length row.names)
>
>
> #snp2gene se travaille mal alors on le transfère en matrice
> snp2geneTmp = t(t(snp2gene))
>
> #aller chercher les symboles correspondants à nos gene id
> symbol <- unlist(mget(snp2geneTmp[,2], org.Hs.egSYMBOL, ifnotfound = NA))
Error in unlist(mget(snp2geneTmp[, 2], org.Hs.egSYMBOL, ifnotfound = NA)) :
  error in evaluating the argument 'x' in selecting a method for function 'unlist': Error in .checkKeysAreWellFormed(keys) :
  keys must be supplied in a character vector with no NAs
>
>
> save.image(file = "map.RData")
>





Simon Noël
CdeC
________________________________________
De : Hervé Pagès [hpages at fhcrc.org]
Date d'envoi : 5 décembre 2010 23:43
À : Simon Noël
Cc : bioconductor at r-project.org
Objet : Re: [BioC] maping SNPs

Hi Simon,

On 12/03/2010 10:17 AM, Simon Noël wrote:
>
>     Hi,
>
>
>
>     I have a really big list of SNPs names like :
>
>
>
>     SNPNAME
>
>     rs7547453
>
>     rs2840542
>
>     rs1999527
>
>     rs4648545
>
>     rs10915459
>
>     rs16838750
>
>     rs12128230
>
>     ...
>
>
>
>     I woudlike to map them to their official gene symbol.  What the best way to
>     procede?

Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
from SNPs to genes and I don't think we have this kind of mappings
either in our collection of annotations (*.db packages).

But if your SNPs are Human then you can do the mapping yourself by
using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
packages.

The latest SNPlocs.Hsapies.dbSNP.* package is
SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
SNP locations relative to the GRCh37 genome:

 > library(SNPlocs.Hsapiens.dbSNP.20101109)
 > getSNPcount()
     ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9
    ch10
1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129  995075
1158707
    ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19
    ch20
1147722 1105364  815729  740129  657719  757926  641905  645646  520666
  586708
    ch21    ch22     chX     chY    chMT
  338254  331060  529608   67438     624

Note that it doesn't contain *all* SNPs from dbSNP Build 132:
only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
for the details).

 > ch22snps <- getSNPlocs("ch22")
 > ch22snps[1:5, ]
   RefSNP_id alleles_as_ambig      loc
1  56342815                K 16050353
2   7288968                S 16050994
3   6518357                M 16051107
4   7292503                R 16051209
5   6518368                Y 16051241

Note that the rs prefix has been dropped.

So here is how to proceed:

First you can use the following function to make a GRanges object from
your SNP ids:

makeGRangesFromRefSNPids <- function(myids)
{
     ans_seqnames <- character(length(myids))
     ans_seqnames[] <- "unknown"
     ans_locs <- integer(length(myids))
     for (seqname in names(getSNPcount())) {
         locs <- getSNPlocs(seqname)
         ids <- paste("rs", locs$RefSNP_id, sep="")
         myrows <- match(myids, ids)
         ans_seqnames[!is.na(myrows)] <- seqname
         ans_locs[!is.na(myrows)] <- locs$loc[myrows]
     }
     GRanges(seqnames=ans_seqnames,
             IRanges(start=ans_locs, width=1),
             RefSNP_id=myids)
}

This takes between 3 and 5 minutes:

 > myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
              "rs10915459", "rs16838750", "rs12128230", "rs999999999")
 > mysnps <- makeGRangesFromRefSNPids(myids)
 > mysnps  # a GRanges object with 1 SNP per row
GRanges with 8 ranges and 1 elementMetadata value
     seqnames             ranges strand |       myids
        <Rle>          <IRanges>  <Rle> | <character>
[1]      ch1 [2195117, 2195117]      * |   rs7547453
[2]      ch1 [2291680, 2291680]      * |   rs2840542
[3]      ch1 [3256108, 3256108]      * |   rs1999527
[4]      ch1 [3577321, 3577321]      * |   rs4648545
[5]      ch1 [4230463, 4230463]      * |  rs10915459
[6]      ch1 [4404344, 4404344]      * |  rs16838750
[7]      ch1 [4501911, 4501911]      * |  rs12128230
[8]  unknown [      0,       0]      * | rs999999999

seqlengths
      ch1 unknown
       NA      NA

The next step is to create a TranscriptDb object with
makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
from the GenomicFeatures package. This TranscriptDb object will
contain the transcript locations and their associated
genes extracted from the annotation source you choose.
For example, if you want to use RefSeq genes:

## Takes about 3 minutes:
 > txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
 > txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| UCSC Table: refGene
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 37924
| exon_nrow: 230024
| cds_nrow: 204571
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
| GenomicFeatures version at creation time: 1.2.2
| RSQLite version at creation time: 0.9-4
| DBSCHEMAVERSION: 1.0

Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb
object: this means that later you will be able to use the org.Hs.eg.db
package to map your gene ids to their symbol (the org.*.eg.db packages
are Entrez Gene ID centric).

To extract the transcript locations together with their genes:

 > tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
 > tx  # a GRanges object with 1 transcript per row
GRanges with 37924 ranges and 1 elementMetadata value
         seqnames               ranges strand   |                   gene_id
            <Rle>            <IRanges>  <Rle>   | <CompressedCharacterList>
     [1]     chr1     [ 69091,  70008]      +   |                     79501
     [2]     chr1     [323892, 328581]      +   |                 100133331
     [3]     chr1     [323892, 328581]      +   |                 100132287
     [4]     chr1     [323892, 328581]      +   |                 100132062
     [5]     chr1     [367659, 368597]      +   |                     81399
     [6]     chr1     [367659, 368597]      +   |                    729759
     [7]     chr1     [367659, 368597]      +   |                     26683
     [8]     chr1     [763064, 789740]      +   |                    643837
     [9]     chr1     [861121, 879961]      +   |                    148398
     ...      ...                  ...    ... ...                       ...
[37916]     chrY [27177050, 27198251]      -   |                      9083
[37917]     chrY [27177050, 27198251]      -   |                    442867
[37918]     chrY [27177050, 27198251]      -   |                    442868
[37919]     chrY [27209230, 27246039]      -   |                    114761
[37920]     chrY [27209230, 27246039]      -   |                    474150
[37921]     chrY [27209230, 27246039]      -   |                    474149
[37922]     chrY [27329790, 27330920]      -   |                    252949
[37923]     chrY [27329790, 27330920]      -   |                    474152
[37924]     chrY [27329790, 27330920]      -   |                    474151

seqlengths
                   chr1                  chr2 ... chr18_gl000207_random
              249250621             243199373 ...                  4262

Now you can use findOverlaps() on 'mysnps' and 'tx' to find
the transcripts hits by your snps. But before you can do this,
you need to rename the sequences in 'mysnps' because dbSNPs and
UCSC use different naming conventions for the chromosomes:

 > seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))

Then:

 > map <- as.matrix(findOverlaps(mysnps, tx))

'map' contains the mapping between your SNPs and their genes but not
in a readable form (this matrix contains indices) so we make the
'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
the associated gene ids:

 > mapped_genes <- values(tx)$gene_id[map[, 2]]
 > mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]],
elementLengths(mapped_genes))
 > snp2gene <- unique(data.frame(snp_id=mapped_snps,
gene_id=unlist(mapped_genes)))
 > rownames(snp2gene) <- NULL
 > snp2gene[1:4, ]
      snp_id gene_id
1 rs7547453    6497
2 rs2840542   79906
3 rs1999527   63976
4 rs4648545    7161

Note that there is no guarantee that the number of rows in this
data frame is the number of your original SNP ids because the
relation between SNP ids and gene ids is of course not one-to-one.

Also the method described above considers that a SNP hits a gene
if it's located between the start and the end of one of its
transcripts but it doesn't take in account the exon structure of
the transcripts. If you want to do this you need to use exonsBy()
(from GenomicFeatures) to extract the exons grouped by transcripts
(this will be stored in a GRangesList object) and use this object
instead of 'tx' in the call to findOverlaps().

Hope this helps,
H.


>
>
>
>     Simon Noël
>     CdeC
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor


--
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list