[BioC] From location to get the SNP rs ID

Mon Jun 3 22:58:40 CEST 2013

Hi Fabrice,

On 06/03/2013 11:55 AM, Fabrice Tourre wrote:
> Hervé Pagès,
>
> When I get rsid from position, I think some of them cannot mapped to a
> rsid. How can we let they also return a value, otherwise we will have
> a problem to map the position to rsid.

Any genomic position can be mapped to 0, 1, or more than 1 rs ids (the
"more than 1 rs ids" case can happen because dbSNP sometimes assign
distinct ids to SNPs located at the same position, don't ask me why).
This is summarized by saying that the mapping is a 1-to-many
relationship.

>
> For example:
>
> posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T)
> chrs <- paste("ch",posi[,1],sep="")
> mylocs <- GRanges(chrs,
>          IRanges(as.integer(posi[,2]), width=1))
> rsids <- loc2rsid(mylocs)
>
> dat$SNP <- unlist(rsids)
> Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570",
> "rs145939433",  :
>    replacement has 92428 rows, data has 92489
>> length(chrs)
> [1] 92489
>> length(unlist(rsids))
> [1] 92428

THE PROBLEM: Because the relationship is 1-to-many, loc2rsid() cannot
in general return a character vector of the same length as the input
vector (i.e. the GRanges object holding the genomic positions). This
is why the function returns a *CharacterList* object. This CharacterList
is guaranteed to have the length of the input vector. However the list
elements of that CharacterList can be of length 0, 1, or more than 1.
Because of that, you should not expect that the result of unlisting this
CharacterList will give you a character vector of the length of the
input vector.

THE SOLUTION: The general advice is to not use unlist() if you want to
preserve length and positioning (i.e. if you want the i-th element in
the result to correspond to the i-th in the input):

   > x <- CharacterList("a", "bb", character(), "ccc")

   > x
   CharacterList of length 4
   [[1]] a
   [[2]] bb
   [[3]] character(0)
   [[4]] ccc

   > unlist(x)
   [1] "a"   "bb"  "ccc"

Use as.character() instead:

   > as.character(x)
   [1] "a"   "bb"  NA    "ccc"

as.character() is guaranteed to preserve length and positioning... or
to fail. It will fail if a list element has a length >= 2. Because that
means a choice needs to be made and as.character() won't make that
choice for you:

   > x2 <- CharacterList(letters[1:3], "bb", character(), "ccc")

   > x2
   CharacterList of length 4
   [[1]] a b c
   [[2]] bb
   [[3]] character(0)
   [[4]] ccc

   > as.character(x2)
   Error in as.vector(x, mode = "character") :
     coercing an AtomicList object to an atomic vector is supported only for
     objects with top-level elements of length <= 1

Possible choices are (the goal being to end up with a CharacterList
object where all the list elements have a length <= 1):

   (1) Keep 1 arbitrary element of any list element of length >= 2.
       For example, to keep the 1st element:

         idx <- elementLengths(x2) >= 2
         x2_subset <- x2[idx]
         x2_subset <- sapply(x2_subset, `[`, 1L)
         x2[idx] <- x2_subset

       Equivalently, you could choose to keep the last element:

         idx <- elementLengths(x2) >= 2
         x2_subset <- x2[idx]
         x2_subset <- sapply(x2_subset, function(x) x[length(x)])
         x2[idx] <- x2_subset

       Note that the above sapply() can be replaced by much faster

         x2_subset <- unlist(x2_subset)[cumsum(elementLengths(x2_subset))]

   (2) Keep nothing:

         idx <- elementLengths(x2) >= 2
         x2[idx] <- NA

   (3) Keep an element randomly...

Then you should be able to do 'as.character(x2)'.

Genomic positions mapped to more than 1 rs id are rare though so it
could be that most of the times you won't run into that problem.
However if the code you're writing is going to be re-used by other
people, they might supply those problematic genomic positions in
the input so it's good if your code can handle them gracefully...

Cheers,
H.

>
> On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>> On 05/31/2013 03:31 AM, Fabrice Tourre wrote:
>>>
>>> Hervé Pagès,
>>>
>>> It helps. Thank you very much.
>>>
>>> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042
>>> and SNPlocs.Hsapiens.dbSNP.20120608.
>>
>>
>> Nope. That's why I sent you the code ;-)
>>
>>
>> H.
>>
>>>
>>> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>
>>>> Hi Fabrice,
>>>>
>>>> With the loc2rsid() function (see code at the bottom) and using
>>>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have
>>>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment):
>>>>
>>>>
>>>>     mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1,
>>>> 1)),
>>>>                       IRanges(c(200, 16369861, 146, 73, 9411609,
>>>> 16051107),
>>>> width=1))
>>>>
>>>>     > mylocs
>>>>     GRanges with 6 ranges and 0 metadata columns:
>>>>           seqnames               ranges strand
>>>>              <Rle>            <IRanges>  <Rle>
>>>>       [1]     ch22 [     200,      200]      *
>>>>       [2]     ch22 [16369861, 16369861]      *
>>>>       [3]      chM [     146,      146]      *
>>>>       [4]      chM [      73,       73]      *
>>>>       [5]     ch21 [ 9411609,  9411609]      *
>>>>       [6]     ch22 [16051107, 16051107]      *
>>>>       ---
>>>>       seqlengths:
>>>>        ch21 ch22  chM
>>>>          NA   NA   NA
>>>>
>>>> Then:
>>>>
>>>>     library(SNPlocs.Hsapiens.dbSNP.20100427)
>>>>
>>>>     > loc2rsid(mylocs)
>>>>     CharacterList of length 6
>>>>     [[1]] character(0)
>>>>     [[2]] rs75258394 rs78180314
>>>>     [[3]] character(0)
>>>>     [[4]] character(0)
>>>>     [[5]] rs76676778
>>>>     [[6]] rs6518357
>>>>
>>>> You need to be careful to use the same chromosome naming convention as
>>>> the SNPlocs package (which uses dbSNP naming convention). For example
>>>> in the 'mylocs' object, the mitochondrial chromosome is named chM but
>>>> in the SNPlocs package it's chMT:
>>>>
>>>>     > getSNPcount()
>>>>         ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8 ch9
>>>> ch10
>>>>     1369185 1422439 1186807 1191984 1064540 1040203  977275  919207 740516
>>>> 859433
>>>>        ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18 ch19
>>>> ch20
>>>>      855225  822825  615382  543041  499101  556394  464686  482359 375259
>>>> 450685
>>>>        ch21    ch22     chX     chY    chMT
>>>>      253480  244482  445596   53908     667
>>>>
>>>> This is why the locations on chM are not mapped to anything. But
>>>> after renaming:
>>>>
>>>>     > seqlevels(mylocs)[3] <- "chMT"
>>>>     > seqlevels(mylocs)
>>>>     [1] "ch21" "ch22" "chMT"
>>>>
>>>>     > loc2rsid(mylocs)
>>>>     CharacterList of length 6
>>>>     [[1]] character(0)
>>>>     [[2]] rs75258394 rs78180314
>>>>     [[3]] rs72619361
>>>>     [[4]] rs3087742
>>>>     [[5]] rs76676778
>>>>     [[6]] rs6518357
>>>>
>>>> they get mapped.
>>>>
>>>> The CharacterList object returned by loc2rsid() can be set as a
>>>> metadata column of the input GRanges object:
>>>>
>>>>     > mcols(mylocs)$rsid <- loc2rsid(mylocs)
>>>>     > mylocs
>>>>     GRanges with 6 ranges and 1 metadata column:
>>>>           seqnames               ranges strand |                  rsid
>>>>              <Rle>            <IRanges>  <Rle> |       <CharacterList>
>>>>       [1]     ch22 [     200,      200]      * |
>>>>       [2]     ch22 [16369861, 16369861]      * | rs75258394,rs78180314
>>>>       [3]     chMT [     146,      146]      * |            rs72619361
>>>>       [4]     chMT [      73,       73]      * |             rs3087742
>>>>       [5]     ch21 [ 9411609,  9411609]      * |            rs76676778
>>>>       [6]     ch22 [16051107, 16051107]      * |             rs6518357
>>>>       ---
>>>>       seqlengths:
>>>>        ch21 ch22 chMT
>>>>          NA   NA   NA
>>>>
>>>> Hope this helps,
>>>> H.
>>>>
>>>>
>>>> ## Maps the genomic locations in 'locs' to the corresponding rs ids.
>>>> ## 'locs' must be a GRanges object where all ranges are of with 1.
>>>> ## Because dbSNP can assign distinct ids to SNPs located at the same
>>>> ## position, the mapping is a 1-to-many relationship.
>>>> ## Returns a CharacterList of the same length as 'locs'.
>>>> loc2rsid <- function(locs)
>>>> {
>>>>       if (!is(locs, "GRanges"))
>>>>           stop("'locs' must be a GRanges object")
>>>>       if (!all(width(locs) == 1L))
>>>>           stop("all ranges in 'locs' must be of width 1")
>>>>       common_seqlevels <- intersect(seqlevels(locs), names(getSNPcount()))
>>>>       if (length(common_seqlevels) == 0L)
>>>>           stop("chromosome names (a.k.a. seqlevels) in 'locs' don't seem
>>>> to ",
>>>>                "be\n  compatible with the chromosome names in the SNPlocs
>>>> ",
>>>>                "package. Maybe they\n  use a different naming convention?
>>>> ",
>>>>                "If that's the case then you first need\n  to rename the ",
>>>>                "seqlevels in 'locs'. See '?seqlevels' for how to do
>>>> this.")
>>>>       f <- as.factor(seqnames(locs))
>>>>       locs_by_chrom <- split(start(locs), f)
>>>>       rsids_by_chrom <- lapply(seq_along(locs_by_chrom),
>>>>              function(i)
>>>>              {
>>>>                  seqname <- levels(f)[i]
>>>>                  locs2 <- locs_by_chrom[[i]]
>>>>                  ans2 <- vector("list", length=length(locs2))
>>>>                  if (length(locs2) == 0L || !(seqname %in%
>>>> common_seqlevels))
>>>>                      return(ans2)
>>>>                  snplocs <- getSNPlocs(seqname, as.GRanges=FALSE)
>>>>                  hits <- findMatches(locs2, snplocs$loc)
>>>>                  rsids <- paste0("rs",
>>>> snplocs$RefSNP_id[subjectHits(hits)])
>>>>                  q_hits <- queryHits(hits)
>>>>                  tmp <- split(rsids, q_hits)
>>>>                  ans2[as.integer(names(tmp))] <- tmp
>>>>                  ans2
>>>>              })
>>>>       CharacterList(unsplit(rsids_by_chrom, f))
>>>> }
>>>>
>>>> PS: This functionality will make it to the SNPlocs packages (in this
>>>> form or another).
>>>>
>>>>
>>>>
>>>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote:
>>>>>
>>>>>
>>>>> Dear list,
>>>>>
>>>>> I have a list of genome positions. Is there is a package to get the
>>>>> snp name at this position?
>>>>>
>>>>> It is the reverse function as rsid2loc in
>>>>> SNPlocs.Hsapiens.dbSNP.20120608
>>>>>
>>>>> Thank you.
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone:  (206) 667-5791
>>>> Fax:    (206) 667-1319
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319