[BioC] From location to get the SNP rs ID
Hervé Pagès
hpages at fhcrc.org
Fri May 31 09:42:55 CEST 2013
Hi Fabrice,
With the loc2rsid() function (see code at the bottom) and using
SNPlocs.Hsapiens.dbSNP.20100427 (I don't have
SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment):
mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1, 1)),
IRanges(c(200, 16369861, 146, 73, 9411609,
16051107), width=1))
> mylocs
GRanges with 6 ranges and 0 metadata columns:
seqnames ranges strand
<Rle> <IRanges> <Rle>
[1] ch22 [ 200, 200] *
[2] ch22 [16369861, 16369861] *
[3] chM [ 146, 146] *
[4] chM [ 73, 73] *
[5] ch21 [ 9411609, 9411609] *
[6] ch22 [16051107, 16051107] *
---
seqlengths:
ch21 ch22 chM
NA NA NA
Then:
library(SNPlocs.Hsapiens.dbSNP.20100427)
> loc2rsid(mylocs)
CharacterList of length 6
[[1]] character(0)
[[2]] rs75258394 rs78180314
[[3]] character(0)
[[4]] character(0)
[[5]] rs76676778
[[6]] rs6518357
You need to be careful to use the same chromosome naming convention as
the SNPlocs package (which uses dbSNP naming convention). For example
in the 'mylocs' object, the mitochondrial chromosome is named chM but
in the SNPlocs package it's chMT:
> getSNPcount()
ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8
ch9 ch10
1369185 1422439 1186807 1191984 1064540 1040203 977275 919207
740516 859433
ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
ch19 ch20
855225 822825 615382 543041 499101 556394 464686 482359
375259 450685
ch21 ch22 chX chY chMT
253480 244482 445596 53908 667
This is why the locations on chM are not mapped to anything. But
after renaming:
> seqlevels(mylocs)[3] <- "chMT"
> seqlevels(mylocs)
[1] "ch21" "ch22" "chMT"
> loc2rsid(mylocs)
CharacterList of length 6
[[1]] character(0)
[[2]] rs75258394 rs78180314
[[3]] rs72619361
[[4]] rs3087742
[[5]] rs76676778
[[6]] rs6518357
they get mapped.
The CharacterList object returned by loc2rsid() can be set as a
metadata column of the input GRanges object:
> mcols(mylocs)$rsid <- loc2rsid(mylocs)
> mylocs
GRanges with 6 ranges and 1 metadata column:
seqnames ranges strand | rsid
<Rle> <IRanges> <Rle> | <CharacterList>
[1] ch22 [ 200, 200] * |
[2] ch22 [16369861, 16369861] * | rs75258394,rs78180314
[3] chMT [ 146, 146] * | rs72619361
[4] chMT [ 73, 73] * | rs3087742
[5] ch21 [ 9411609, 9411609] * | rs76676778
[6] ch22 [16051107, 16051107] * | rs6518357
---
seqlengths:
ch21 ch22 chMT
NA NA NA
Hope this helps,
H.
## Maps the genomic locations in 'locs' to the corresponding rs ids.
## 'locs' must be a GRanges object where all ranges are of with 1.
## Because dbSNP can assign distinct ids to SNPs located at the same
## position, the mapping is a 1-to-many relationship.
## Returns a CharacterList of the same length as 'locs'.
loc2rsid <- function(locs)
{
if (!is(locs, "GRanges"))
stop("'locs' must be a GRanges object")
if (!all(width(locs) == 1L))
stop("all ranges in 'locs' must be of width 1")
common_seqlevels <- intersect(seqlevels(locs), names(getSNPcount()))
if (length(common_seqlevels) == 0L)
stop("chromosome names (a.k.a. seqlevels) in 'locs' don't seem
to ",
"be\n compatible with the chromosome names in the SNPlocs ",
"package. Maybe they\n use a different naming convention? ",
"If that's the case then you first need\n to rename the ",
"seqlevels in 'locs'. See '?seqlevels' for how to do this.")
f <- as.factor(seqnames(locs))
locs_by_chrom <- split(start(locs), f)
rsids_by_chrom <- lapply(seq_along(locs_by_chrom),
function(i)
{
seqname <- levels(f)[i]
locs2 <- locs_by_chrom[[i]]
ans2 <- vector("list", length=length(locs2))
if (length(locs2) == 0L || !(seqname %in% common_seqlevels))
return(ans2)
snplocs <- getSNPlocs(seqname, as.GRanges=FALSE)
hits <- findMatches(locs2, snplocs$loc)
rsids <- paste0("rs", snplocs$RefSNP_id[subjectHits(hits)])
q_hits <- queryHits(hits)
tmp <- split(rsids, q_hits)
ans2[as.integer(names(tmp))] <- tmp
ans2
})
CharacterList(unsplit(rsids_by_chrom, f))
}
PS: This functionality will make it to the SNPlocs packages (in this
form or another).
On 05/30/2013 09:54 PM, Fabrice Tourre wrote:
> Dear list,
>
> I have a list of genome positions. Is there is a package to get the
> snp name at this position?
>
> It is the reverse function as rsid2loc in SNPlocs.Hsapiens.dbSNP.20120608
>
> Thank you.
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list