[BioC] From location to get the SNP rs ID

Hervé Pagès hpages at fhcrc.org
Fri May 31 09:42:55 CEST 2013


Hi Fabrice,

With the loc2rsid() function (see code at the bottom) and using
SNPlocs.Hsapiens.dbSNP.20100427 (I don't have
SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment):


   mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1, 1)),
                     IRanges(c(200, 16369861, 146, 73, 9411609, 
16051107), width=1))

   > mylocs
   GRanges with 6 ranges and 0 metadata columns:
         seqnames               ranges strand
            <Rle>            <IRanges>  <Rle>
     [1]     ch22 [     200,      200]      *
     [2]     ch22 [16369861, 16369861]      *
     [3]      chM [     146,      146]      *
     [4]      chM [      73,       73]      *
     [5]     ch21 [ 9411609,  9411609]      *
     [6]     ch22 [16051107, 16051107]      *
     ---
     seqlengths:
      ch21 ch22  chM
        NA   NA   NA

Then:

   library(SNPlocs.Hsapiens.dbSNP.20100427)

   > loc2rsid(mylocs)
   CharacterList of length 6
   [[1]] character(0)
   [[2]] rs75258394 rs78180314
   [[3]] character(0)
   [[4]] character(0)
   [[5]] rs76676778
   [[6]] rs6518357

You need to be careful to use the same chromosome naming convention as
the SNPlocs package (which uses dbSNP naming convention). For example
in the 'mylocs' object, the mitochondrial chromosome is named chM but
in the SNPlocs package it's chMT:

   > getSNPcount()
       ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8 
ch9    ch10
   1369185 1422439 1186807 1191984 1064540 1040203  977275  919207 
740516  859433
      ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18 
ch19    ch20
    855225  822825  615382  543041  499101  556394  464686  482359 
375259  450685
      ch21    ch22     chX     chY    chMT
    253480  244482  445596   53908     667

This is why the locations on chM are not mapped to anything. But
after renaming:

   > seqlevels(mylocs)[3] <- "chMT"
   > seqlevels(mylocs)
   [1] "ch21" "ch22" "chMT"

   > loc2rsid(mylocs)
   CharacterList of length 6
   [[1]] character(0)
   [[2]] rs75258394 rs78180314
   [[3]] rs72619361
   [[4]] rs3087742
   [[5]] rs76676778
   [[6]] rs6518357

they get mapped.

The CharacterList object returned by loc2rsid() can be set as a
metadata column of the input GRanges object:

   > mcols(mylocs)$rsid <- loc2rsid(mylocs)
   > mylocs
   GRanges with 6 ranges and 1 metadata column:
         seqnames               ranges strand |                  rsid
            <Rle>            <IRanges>  <Rle> |       <CharacterList>
     [1]     ch22 [     200,      200]      * |
     [2]     ch22 [16369861, 16369861]      * | rs75258394,rs78180314
     [3]     chMT [     146,      146]      * |            rs72619361
     [4]     chMT [      73,       73]      * |             rs3087742
     [5]     ch21 [ 9411609,  9411609]      * |            rs76676778
     [6]     ch22 [16051107, 16051107]      * |             rs6518357
     ---
     seqlengths:
      ch21 ch22 chMT
        NA   NA   NA

Hope this helps,
H.


## Maps the genomic locations in 'locs' to the corresponding rs ids.
## 'locs' must be a GRanges object where all ranges are of with 1.
## Because dbSNP can assign distinct ids to SNPs located at the same
## position, the mapping is a 1-to-many relationship.
## Returns a CharacterList of the same length as 'locs'.
loc2rsid <- function(locs)
{
     if (!is(locs, "GRanges"))
         stop("'locs' must be a GRanges object")
     if (!all(width(locs) == 1L))
         stop("all ranges in 'locs' must be of width 1")
     common_seqlevels <- intersect(seqlevels(locs), names(getSNPcount()))
     if (length(common_seqlevels) == 0L)
         stop("chromosome names (a.k.a. seqlevels) in 'locs' don't seem 
to ",
              "be\n  compatible with the chromosome names in the SNPlocs ",
              "package. Maybe they\n  use a different naming convention? ",
              "If that's the case then you first need\n  to rename the ",
              "seqlevels in 'locs'. See '?seqlevels' for how to do this.")
     f <- as.factor(seqnames(locs))
     locs_by_chrom <- split(start(locs), f)
     rsids_by_chrom <- lapply(seq_along(locs_by_chrom),
            function(i)
            {
                seqname <- levels(f)[i]
                locs2 <- locs_by_chrom[[i]]
                ans2 <- vector("list", length=length(locs2))
                if (length(locs2) == 0L || !(seqname %in% common_seqlevels))
                    return(ans2)
                snplocs <- getSNPlocs(seqname, as.GRanges=FALSE)
                hits <- findMatches(locs2, snplocs$loc)
                rsids <- paste0("rs", snplocs$RefSNP_id[subjectHits(hits)])
                q_hits <- queryHits(hits)
                tmp <- split(rsids, q_hits)
                ans2[as.integer(names(tmp))] <- tmp
                ans2
            })
     CharacterList(unsplit(rsids_by_chrom, f))
}

PS: This functionality will make it to the SNPlocs packages (in this
form or another).


On 05/30/2013 09:54 PM, Fabrice Tourre wrote:
> Dear list,
>
> I have a list of genome positions. Is there is a package to get the
> snp name at this position?
>
> It is the reverse function as rsid2loc in SNPlocs.Hsapiens.dbSNP.20120608
>
> Thank you.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319



More information about the Bioconductor mailing list