[BioC] From location to get the SNP rs ID
Fabrice Tourre
fabrice.ciup at gmail.com
Mon Jun 3 20:55:05 CEST 2013
Hervé Pagès,
When I get rsid from position, I think some of them cannot mapped to a
rsid. How can we let they also return a value, otherwise we will have
a problem to map the position to rsid.
For example:
posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T)
chrs <- paste("ch",posi[,1],sep="")
mylocs <- GRanges(chrs,
IRanges(as.integer(posi[,2]), width=1))
rsids <- loc2rsid(mylocs)
dat$SNP <- unlist(rsids)
Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570",
"rs145939433", :
replacement has 92428 rows, data has 92489
> length(chrs)
[1] 92489
> length(unlist(rsids))
[1] 92428
On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
> On 05/31/2013 03:31 AM, Fabrice Tourre wrote:
>>
>> Hervé Pagès,
>>
>> It helps. Thank you very much.
>>
>> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042
>> and SNPlocs.Hsapiens.dbSNP.20120608.
>
>
> Nope. That's why I sent you the code ;-)
>
>
> H.
>
>>
>> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>> Hi Fabrice,
>>>
>>> With the loc2rsid() function (see code at the bottom) and using
>>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have
>>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment):
>>>
>>>
>>> mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1,
>>> 1)),
>>> IRanges(c(200, 16369861, 146, 73, 9411609,
>>> 16051107),
>>> width=1))
>>>
>>> > mylocs
>>> GRanges with 6 ranges and 0 metadata columns:
>>> seqnames ranges strand
>>> <Rle> <IRanges> <Rle>
>>> [1] ch22 [ 200, 200] *
>>> [2] ch22 [16369861, 16369861] *
>>> [3] chM [ 146, 146] *
>>> [4] chM [ 73, 73] *
>>> [5] ch21 [ 9411609, 9411609] *
>>> [6] ch22 [16051107, 16051107] *
>>> ---
>>> seqlengths:
>>> ch21 ch22 chM
>>> NA NA NA
>>>
>>> Then:
>>>
>>> library(SNPlocs.Hsapiens.dbSNP.20100427)
>>>
>>> > loc2rsid(mylocs)
>>> CharacterList of length 6
>>> [[1]] character(0)
>>> [[2]] rs75258394 rs78180314
>>> [[3]] character(0)
>>> [[4]] character(0)
>>> [[5]] rs76676778
>>> [[6]] rs6518357
>>>
>>> You need to be careful to use the same chromosome naming convention as
>>> the SNPlocs package (which uses dbSNP naming convention). For example
>>> in the 'mylocs' object, the mitochondrial chromosome is named chM but
>>> in the SNPlocs package it's chMT:
>>>
>>> > getSNPcount()
>>> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9
>>> ch10
>>> 1369185 1422439 1186807 1191984 1064540 1040203 977275 919207 740516
>>> 859433
>>> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19
>>> ch20
>>> 855225 822825 615382 543041 499101 556394 464686 482359 375259
>>> 450685
>>> ch21 ch22 chX chY chMT
>>> 253480 244482 445596 53908 667
>>>
>>> This is why the locations on chM are not mapped to anything. But
>>> after renaming:
>>>
>>> > seqlevels(mylocs)[3] <- "chMT"
>>> > seqlevels(mylocs)
>>> [1] "ch21" "ch22" "chMT"
>>>
>>> > loc2rsid(mylocs)
>>> CharacterList of length 6
>>> [[1]] character(0)
>>> [[2]] rs75258394 rs78180314
>>> [[3]] rs72619361
>>> [[4]] rs3087742
>>> [[5]] rs76676778
>>> [[6]] rs6518357
>>>
>>> they get mapped.
>>>
>>> The CharacterList object returned by loc2rsid() can be set as a
>>> metadata column of the input GRanges object:
>>>
>>> > mcols(mylocs)$rsid <- loc2rsid(mylocs)
>>> > mylocs
>>> GRanges with 6 ranges and 1 metadata column:
>>> seqnames ranges strand | rsid
>>> <Rle> <IRanges> <Rle> | <CharacterList>
>>> [1] ch22 [ 200, 200] * |
>>> [2] ch22 [16369861, 16369861] * | rs75258394,rs78180314
>>> [3] chMT [ 146, 146] * | rs72619361
>>> [4] chMT [ 73, 73] * | rs3087742
>>> [5] ch21 [ 9411609, 9411609] * | rs76676778
>>> [6] ch22 [16051107, 16051107] * | rs6518357
>>> ---
>>> seqlengths:
>>> ch21 ch22 chMT
>>> NA NA NA
>>>
>>> Hope this helps,
>>> H.
>>>
>>>
>>> ## Maps the genomic locations in 'locs' to the corresponding rs ids.
>>> ## 'locs' must be a GRanges object where all ranges are of with 1.
>>> ## Because dbSNP can assign distinct ids to SNPs located at the same
>>> ## position, the mapping is a 1-to-many relationship.
>>> ## Returns a CharacterList of the same length as 'locs'.
>>> loc2rsid <- function(locs)
>>> {
>>> if (!is(locs, "GRanges"))
>>> stop("'locs' must be a GRanges object")
>>> if (!all(width(locs) == 1L))
>>> stop("all ranges in 'locs' must be of width 1")
>>> common_seqlevels <- intersect(seqlevels(locs), names(getSNPcount()))
>>> if (length(common_seqlevels) == 0L)
>>> stop("chromosome names (a.k.a. seqlevels) in 'locs' don't seem
>>> to ",
>>> "be\n compatible with the chromosome names in the SNPlocs
>>> ",
>>> "package. Maybe they\n use a different naming convention?
>>> ",
>>> "If that's the case then you first need\n to rename the ",
>>> "seqlevels in 'locs'. See '?seqlevels' for how to do
>>> this.")
>>> f <- as.factor(seqnames(locs))
>>> locs_by_chrom <- split(start(locs), f)
>>> rsids_by_chrom <- lapply(seq_along(locs_by_chrom),
>>> function(i)
>>> {
>>> seqname <- levels(f)[i]
>>> locs2 <- locs_by_chrom[[i]]
>>> ans2 <- vector("list", length=length(locs2))
>>> if (length(locs2) == 0L || !(seqname %in%
>>> common_seqlevels))
>>> return(ans2)
>>> snplocs <- getSNPlocs(seqname, as.GRanges=FALSE)
>>> hits <- findMatches(locs2, snplocs$loc)
>>> rsids <- paste0("rs",
>>> snplocs$RefSNP_id[subjectHits(hits)])
>>> q_hits <- queryHits(hits)
>>> tmp <- split(rsids, q_hits)
>>> ans2[as.integer(names(tmp))] <- tmp
>>> ans2
>>> })
>>> CharacterList(unsplit(rsids_by_chrom, f))
>>> }
>>>
>>> PS: This functionality will make it to the SNPlocs packages (in this
>>> form or another).
>>>
>>>
>>>
>>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote:
>>>>
>>>>
>>>> Dear list,
>>>>
>>>> I have a list of genome positions. Is there is a package to get the
>>>> snp name at this position?
>>>>
>>>> It is the reverse function as rsid2loc in
>>>> SNPlocs.Hsapiens.dbSNP.20120608
>>>>
>>>> Thank you.
>>>>
>>>> _______________________________________________
>>>> Bioconductor mailing list
>>>> Bioconductor at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>> Search the archives:
>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioconductor
mailing list