[BioC] From location to get the SNP rs ID
Hervé Pagès
hpages at fhcrc.org
Mon Jun 3 22:58:40 CEST 2013
Hi Fabrice,
On 06/03/2013 11:55 AM, Fabrice Tourre wrote:
> Hervé Pagès,
>
> When I get rsid from position, I think some of them cannot mapped to a
> rsid. How can we let they also return a value, otherwise we will have
> a problem to map the position to rsid.
Any genomic position can be mapped to 0, 1, or more than 1 rs ids (the
"more than 1 rs ids" case can happen because dbSNP sometimes assign
distinct ids to SNPs located at the same position, don't ask me why).
This is summarized by saying that the mapping is a 1-to-many
relationship.
>
> For example:
>
> posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T)
> chrs <- paste("ch",posi[,1],sep="")
> mylocs <- GRanges(chrs,
> IRanges(as.integer(posi[,2]), width=1))
> rsids <- loc2rsid(mylocs)
>
> dat$SNP <- unlist(rsids)
> Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570",
> "rs145939433", :
> replacement has 92428 rows, data has 92489
>> length(chrs)
> [1] 92489
>> length(unlist(rsids))
> [1] 92428
THE PROBLEM: Because the relationship is 1-to-many, loc2rsid() cannot
in general return a character vector of the same length as the input
vector (i.e. the GRanges object holding the genomic positions). This
is why the function returns a *CharacterList* object. This CharacterList
is guaranteed to have the length of the input vector. However the list
elements of that CharacterList can be of length 0, 1, or more than 1.
Because of that, you should not expect that the result of unlisting this
CharacterList will give you a character vector of the length of the
input vector.
THE SOLUTION: The general advice is to not use unlist() if you want to
preserve length and positioning (i.e. if you want the i-th element in
the result to correspond to the i-th in the input):
> x <- CharacterList("a", "bb", character(), "ccc")
> x
CharacterList of length 4
[[1]] a
[[2]] bb
[[3]] character(0)
[[4]] ccc
> unlist(x)
[1] "a" "bb" "ccc"
Use as.character() instead:
> as.character(x)
[1] "a" "bb" NA "ccc"
as.character() is guaranteed to preserve length and positioning... or
to fail. It will fail if a list element has a length >= 2. Because that
means a choice needs to be made and as.character() won't make that
choice for you:
> x2 <- CharacterList(letters[1:3], "bb", character(), "ccc")
> x2
CharacterList of length 4
[[1]] a b c
[[2]] bb
[[3]] character(0)
[[4]] ccc
> as.character(x2)
Error in as.vector(x, mode = "character") :
coercing an AtomicList object to an atomic vector is supported only for
objects with top-level elements of length <= 1
Possible choices are (the goal being to end up with a CharacterList
object where all the list elements have a length <= 1):
(1) Keep 1 arbitrary element of any list element of length >= 2.
For example, to keep the 1st element:
idx <- elementLengths(x2) >= 2
x2_subset <- x2[idx]
x2_subset <- sapply(x2_subset, `[`, 1L)
x2[idx] <- x2_subset
Equivalently, you could choose to keep the last element:
idx <- elementLengths(x2) >= 2
x2_subset <- x2[idx]
x2_subset <- sapply(x2_subset, function(x) x[length(x)])
x2[idx] <- x2_subset
Note that the above sapply() can be replaced by much faster
x2_subset <- unlist(x2_subset)[cumsum(elementLengths(x2_subset))]
(2) Keep nothing:
idx <- elementLengths(x2) >= 2
x2[idx] <- NA
(3) Keep an element randomly...
Then you should be able to do 'as.character(x2)'.
Genomic positions mapped to more than 1 rs id are rare though so it
could be that most of the times you won't run into that problem.
However if the code you're writing is going to be re-used by other
people, they might supply those problematic genomic positions in
the input so it's good if your code can handle them gracefully...
Cheers,
H.
>
> On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>> On 05/31/2013 03:31 AM, Fabrice Tourre wrote:
>>>
>>> Hervé Pagès,
>>>
>>> It helps. Thank you very much.
>>>
>>> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042
>>> and SNPlocs.Hsapiens.dbSNP.20120608.
>>
>>
>> Nope. That's why I sent you the code ;-)
>>
>>
>> H.
>>
>>>
>>> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>
>>>> Hi Fabrice,
>>>>
>>>> With the loc2rsid() function (see code at the bottom) and using
>>>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have
>>>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment):
>>>>
>>>>
>>>> mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1,
>>>> 1)),
>>>> IRanges(c(200, 16369861, 146, 73, 9411609,
>>>> 16051107),
>>>> width=1))
>>>>
>>>> > mylocs
>>>> GRanges with 6 ranges and 0 metadata columns:
>>>> seqnames ranges strand
>>>> <Rle> <IRanges> <Rle>
>>>> [1] ch22 [ 200, 200] *
>>>> [2] ch22 [16369861, 16369861] *
>>>> [3] chM [ 146, 146] *
>>>> [4] chM [ 73, 73] *
>>>> [5] ch21 [ 9411609, 9411609] *
>>>> [6] ch22 [16051107, 16051107] *
>>>> ---
>>>> seqlengths:
>>>> ch21 ch22 chM
>>>> NA NA NA
>>>>
>>>> Then:
>>>>
>>>> library(SNPlocs.Hsapiens.dbSNP.20100427)
>>>>
>>>> > loc2rsid(mylocs)
>>>> CharacterList of length 6
>>>> [[1]] character(0)
>>>> [[2]] rs75258394 rs78180314
>>>> [[3]] character(0)
>>>> [[4]] character(0)
>>>> [[5]] rs76676778
>>>> [[6]] rs6518357
>>>>
>>>> You need to be careful to use the same chromosome naming convention as
>>>> the SNPlocs package (which uses dbSNP naming convention). For example
>>>> in the 'mylocs' object, the mitochondrial chromosome is named chM but
>>>> in the SNPlocs package it's chMT:
>>>>
>>>> > getSNPcount()
>>>> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9
>>>> ch10
>>>> 1369185 1422439 1186807 1191984 1064540 1040203 977275 919207 740516
>>>> 859433
>>>> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19
>>>> ch20
>>>> 855225 822825 615382 543041 499101 556394 464686 482359 375259
>>>> 450685
>>>> ch21 ch22 chX chY chMT
>>>> 253480 244482 445596 53908 667
>>>>
>>>> This is why the locations on chM are not mapped to anything. But
>>>> after renaming:
>>>>
>>>> > seqlevels(mylocs)[3] <- "chMT"
>>>> > seqlevels(mylocs)
>>>> [1] "ch21" "ch22" "chMT"
>>>>
>>>> > loc2rsid(mylocs)
>>>> CharacterList of length 6
>>>> [[1]] character(0)
>>>> [[2]] rs75258394 rs78180314
>>>> [[3]] rs72619361
>>>> [[4]] rs3087742
>>>> [[5]] rs76676778
>>>> [[6]] rs6518357
>>>>
>>>> they get mapped.
>>>>
>>>> The CharacterList object returned by loc2rsid() can be set as a
>>>> metadata column of the input GRanges object:
>>>>
>>>> > mcols(mylocs)$rsid <- loc2rsid(mylocs)
>>>> > mylocs
>>>> GRanges with 6 ranges and 1 metadata column:
>>>> seqnames ranges strand | rsid
>>>> <Rle> <IRanges> <Rle> | <CharacterList>
>>>> [1] ch22 [ 200, 200] * |
>>>> [2] ch22 [16369861, 16369861] * | rs75258394,rs78180314
>>>> [3] chMT [ 146, 146] * | rs72619361
>>>> [4] chMT [ 73, 73] * | rs3087742
>>>> [5] ch21 [ 9411609, 9411609] * | rs76676778
>>>> [6] ch22 [16051107, 16051107] * | rs6518357
>>>> ---
>>>> seqlengths:
>>>> ch21 ch22 chMT
>>>> NA NA NA
>>>>
>>>> Hope this helps,
>>>> H.
>>>>
>>>>
>>>> ## Maps the genomic locations in 'locs' to the corresponding rs ids.
>>>> ## 'locs' must be a GRanges object where all ranges are of with 1.
>>>> ## Because dbSNP can assign distinct ids to SNPs located at the same
>>>> ## position, the mapping is a 1-to-many relationship.
>>>> ## Returns a CharacterList of the same length as 'locs'.
>>>> loc2rsid <- function(locs)
>>>> {
>>>> if (!is(locs, "GRanges"))
>>>> stop("'locs' must be a GRanges object")
>>>> if (!all(width(locs) == 1L))
>>>> stop("all ranges in 'locs' must be of width 1")
>>>> common_seqlevels <- intersect(seqlevels(locs), names(getSNPcount()))
>>>> if (length(common_seqlevels) == 0L)
>>>> stop("chromosome names (a.k.a. seqlevels) in 'locs' don't seem
>>>> to ",
>>>> "be\n compatible with the chromosome names in the SNPlocs
>>>> ",
>>>> "package. Maybe they\n use a different naming convention?
>>>> ",
>>>> "If that's the case then you first need\n to rename the ",
>>>> "seqlevels in 'locs'. See '?seqlevels' for how to do
>>>> this.")
>>>> f <- as.factor(seqnames(locs))
>>>> locs_by_chrom <- split(start(locs), f)
>>>> rsids_by_chrom <- lapply(seq_along(locs_by_chrom),
>>>> function(i)
>>>> {
>>>> seqname <- levels(f)[i]
>>>> locs2 <- locs_by_chrom[[i]]
>>>> ans2 <- vector("list", length=length(locs2))
>>>> if (length(locs2) == 0L || !(seqname %in%
>>>> common_seqlevels))
>>>> return(ans2)
>>>> snplocs <- getSNPlocs(seqname, as.GRanges=FALSE)
>>>> hits <- findMatches(locs2, snplocs$loc)
>>>> rsids <- paste0("rs",
>>>> snplocs$RefSNP_id[subjectHits(hits)])
>>>> q_hits <- queryHits(hits)
>>>> tmp <- split(rsids, q_hits)
>>>> ans2[as.integer(names(tmp))] <- tmp
>>>> ans2
>>>> })
>>>> CharacterList(unsplit(rsids_by_chrom, f))
>>>> }
>>>>
>>>> PS: This functionality will make it to the SNPlocs packages (in this
>>>> form or another).
>>>>
>>>>
>>>>
>>>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote:
>>>>>
>>>>>
>>>>> Dear list,
>>>>>
>>>>> I have a list of genome positions. Is there is a package to get the
>>>>> snp name at this position?
>>>>>
>>>>> It is the reverse function as rsid2loc in
>>>>> SNPlocs.Hsapiens.dbSNP.20120608
>>>>>
>>>>> Thank you.
>>>>>
>>>>> _______________________________________________
>>>>> Bioconductor mailing list
>>>>> Bioconductor at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>> Search the archives:
>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>
>>>>
>>>> --
>>>> Hervé Pagès
>>>>
>>>> Program in Computational Biology
>>>> Division of Public Health Sciences
>>>> Fred Hutchinson Cancer Research Center
>>>> 1100 Fairview Ave. N, M1-B514
>>>> P.O. Box 19024
>>>> Seattle, WA 98109-1024
>>>>
>>>> E-mail: hpages at fhcrc.org
>>>> Phone: (206) 667-5791
>>>> Fax: (206) 667-1319
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone: (206) 667-5791
>> Fax: (206) 667-1319
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list