[BioC] From location to get the SNP rs ID
Fabrice Tourre
fabrice.ciup at gmail.com
Mon Jun 3 23:21:24 CEST 2013
Hervé Pagès,
It helps. Thank you very much.
On Mon, Jun 3, 2013 at 4:58 PM, Hervé Pagès <hpages at fhcrc.org> wrote:
> Hi Fabrice,
>
>
> On 06/03/2013 11:55 AM, Fabrice Tourre wrote:
>>
>> Hervé Pagès,
>>
>> When I get rsid from position, I think some of them cannot mapped to a
>> rsid. How can we let they also return a value, otherwise we will have
>> a problem to map the position to rsid.
>
>
> Any genomic position can be mapped to 0, 1, or more than 1 rs ids (the
> "more than 1 rs ids" case can happen because dbSNP sometimes assign
> distinct ids to SNPs located at the same position, don't ask me why).
> This is summarized by saying that the mapping is a 1-to-many
> relationship.
>
>>
>> For example:
>>
>> posi <- matrix(unlist(strsplit(as.character(dat[,1]),":")),ncol=2,byrow=T)
>> chrs <- paste("ch",posi[,1],sep="")
>> mylocs <- GRanges(chrs,
>> IRanges(as.integer(posi[,2]), width=1))
>> rsids <- loc2rsid(mylocs)
>>
>> dat$SNP <- unlist(rsids)
>> Error in `$<-.data.frame`(`*tmp*`, "SNP", value = c("rs1374570",
>> "rs145939433", :
>> replacement has 92428 rows, data has 92489
>>>
>>> length(chrs)
>>
>> [1] 92489
>>>
>>> length(unlist(rsids))
>>
>> [1] 92428
>
>
> THE PROBLEM: Because the relationship is 1-to-many, loc2rsid() cannot
> in general return a character vector of the same length as the input
> vector (i.e. the GRanges object holding the genomic positions). This
> is why the function returns a *CharacterList* object. This CharacterList
> is guaranteed to have the length of the input vector. However the list
> elements of that CharacterList can be of length 0, 1, or more than 1.
> Because of that, you should not expect that the result of unlisting this
> CharacterList will give you a character vector of the length of the
> input vector.
>
> THE SOLUTION: The general advice is to not use unlist() if you want to
> preserve length and positioning (i.e. if you want the i-th element in
> the result to correspond to the i-th in the input):
>
> > x <- CharacterList("a", "bb", character(), "ccc")
>
> > x
> CharacterList of length 4
> [[1]] a
> [[2]] bb
> [[3]] character(0)
> [[4]] ccc
>
> > unlist(x)
> [1] "a" "bb" "ccc"
>
> Use as.character() instead:
>
> > as.character(x)
> [1] "a" "bb" NA "ccc"
>
> as.character() is guaranteed to preserve length and positioning... or
> to fail. It will fail if a list element has a length >= 2. Because that
> means a choice needs to be made and as.character() won't make that
> choice for you:
>
> > x2 <- CharacterList(letters[1:3], "bb", character(), "ccc")
>
> > x2
> CharacterList of length 4
> [[1]] a b c
> [[2]] bb
> [[3]] character(0)
> [[4]] ccc
>
> > as.character(x2)
> Error in as.vector(x, mode = "character") :
> coercing an AtomicList object to an atomic vector is supported only for
> objects with top-level elements of length <= 1
>
> Possible choices are (the goal being to end up with a CharacterList
> object where all the list elements have a length <= 1):
>
> (1) Keep 1 arbitrary element of any list element of length >= 2.
> For example, to keep the 1st element:
>
> idx <- elementLengths(x2) >= 2
> x2_subset <- x2[idx]
> x2_subset <- sapply(x2_subset, `[`, 1L)
> x2[idx] <- x2_subset
>
> Equivalently, you could choose to keep the last element:
>
> idx <- elementLengths(x2) >= 2
> x2_subset <- x2[idx]
> x2_subset <- sapply(x2_subset, function(x) x[length(x)])
> x2[idx] <- x2_subset
>
> Note that the above sapply() can be replaced by much faster
>
> x2_subset <- unlist(x2_subset)[cumsum(elementLengths(x2_subset))]
>
> (2) Keep nothing:
>
> idx <- elementLengths(x2) >= 2
> x2[idx] <- NA
>
> (3) Keep an element randomly...
>
> Then you should be able to do 'as.character(x2)'.
>
> Genomic positions mapped to more than 1 rs id are rare though so it
> could be that most of the times you won't run into that problem.
> However if the code you're writing is going to be re-used by other
> people, they might supply those problematic genomic positions in
> the input so it's good if your code can handle them gracefully...
>
> Cheers,
>
> H.
>
>>
>> On Fri, May 31, 2013 at 7:01 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>
>>> On 05/31/2013 03:31 AM, Fabrice Tourre wrote:
>>>>
>>>>
>>>> Hervé Pagès,
>>>>
>>>> It helps. Thank you very much.
>>>>
>>>> But loc2rsid seems does not in version SNPlocs.Hsapiens.dbSNP.2010042
>>>> and SNPlocs.Hsapiens.dbSNP.20120608.
>>>
>>>
>>>
>>> Nope. That's why I sent you the code ;-)
>>>
>>>
>>> H.
>>>
>>>>
>>>> On Fri, May 31, 2013 at 3:42 AM, Hervé Pagès <hpages at fhcrc.org> wrote:
>>>>>
>>>>>
>>>>> Hi Fabrice,
>>>>>
>>>>> With the loc2rsid() function (see code at the bottom) and using
>>>>> SNPlocs.Hsapiens.dbSNP.20100427 (I don't have
>>>>> SNPlocs.Hsapiens.dbSNP.20120608 installed at the moment):
>>>>>
>>>>>
>>>>> mylocs <- GRanges(Rle(c("ch22", "chM", "ch21", "ch22"), c(2, 2, 1,
>>>>> 1)),
>>>>> IRanges(c(200, 16369861, 146, 73, 9411609,
>>>>> 16051107),
>>>>> width=1))
>>>>>
>>>>> > mylocs
>>>>> GRanges with 6 ranges and 0 metadata columns:
>>>>> seqnames ranges strand
>>>>> <Rle> <IRanges> <Rle>
>>>>> [1] ch22 [ 200, 200] *
>>>>> [2] ch22 [16369861, 16369861] *
>>>>> [3] chM [ 146, 146] *
>>>>> [4] chM [ 73, 73] *
>>>>> [5] ch21 [ 9411609, 9411609] *
>>>>> [6] ch22 [16051107, 16051107] *
>>>>> ---
>>>>> seqlengths:
>>>>> ch21 ch22 chM
>>>>> NA NA NA
>>>>>
>>>>> Then:
>>>>>
>>>>> library(SNPlocs.Hsapiens.dbSNP.20100427)
>>>>>
>>>>> > loc2rsid(mylocs)
>>>>> CharacterList of length 6
>>>>> [[1]] character(0)
>>>>> [[2]] rs75258394 rs78180314
>>>>> [[3]] character(0)
>>>>> [[4]] character(0)
>>>>> [[5]] rs76676778
>>>>> [[6]] rs6518357
>>>>>
>>>>> You need to be careful to use the same chromosome naming convention as
>>>>> the SNPlocs package (which uses dbSNP naming convention). For example
>>>>> in the 'mylocs' object, the mitochondrial chromosome is named chM but
>>>>> in the SNPlocs package it's chMT:
>>>>>
>>>>> > getSNPcount()
>>>>> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9
>>>>> ch10
>>>>> 1369185 1422439 1186807 1191984 1064540 1040203 977275 919207
>>>>> 740516
>>>>> 859433
>>>>> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18
>>>>> ch19
>>>>> ch20
>>>>> 855225 822825 615382 543041 499101 556394 464686 482359
>>>>> 375259
>>>>> 450685
>>>>> ch21 ch22 chX chY chMT
>>>>> 253480 244482 445596 53908 667
>>>>>
>>>>> This is why the locations on chM are not mapped to anything. But
>>>>> after renaming:
>>>>>
>>>>> > seqlevels(mylocs)[3] <- "chMT"
>>>>> > seqlevels(mylocs)
>>>>> [1] "ch21" "ch22" "chMT"
>>>>>
>>>>> > loc2rsid(mylocs)
>>>>> CharacterList of length 6
>>>>> [[1]] character(0)
>>>>> [[2]] rs75258394 rs78180314
>>>>> [[3]] rs72619361
>>>>> [[4]] rs3087742
>>>>> [[5]] rs76676778
>>>>> [[6]] rs6518357
>>>>>
>>>>> they get mapped.
>>>>>
>>>>> The CharacterList object returned by loc2rsid() can be set as a
>>>>> metadata column of the input GRanges object:
>>>>>
>>>>> > mcols(mylocs)$rsid <- loc2rsid(mylocs)
>>>>> > mylocs
>>>>> GRanges with 6 ranges and 1 metadata column:
>>>>> seqnames ranges strand | rsid
>>>>> <Rle> <IRanges> <Rle> | <CharacterList>
>>>>> [1] ch22 [ 200, 200] * |
>>>>> [2] ch22 [16369861, 16369861] * | rs75258394,rs78180314
>>>>> [3] chMT [ 146, 146] * | rs72619361
>>>>> [4] chMT [ 73, 73] * | rs3087742
>>>>> [5] ch21 [ 9411609, 9411609] * | rs76676778
>>>>> [6] ch22 [16051107, 16051107] * | rs6518357
>>>>> ---
>>>>> seqlengths:
>>>>> ch21 ch22 chMT
>>>>> NA NA NA
>>>>>
>>>>> Hope this helps,
>>>>> H.
>>>>>
>>>>>
>>>>> ## Maps the genomic locations in 'locs' to the corresponding rs ids.
>>>>> ## 'locs' must be a GRanges object where all ranges are of with 1.
>>>>> ## Because dbSNP can assign distinct ids to SNPs located at the same
>>>>> ## position, the mapping is a 1-to-many relationship.
>>>>> ## Returns a CharacterList of the same length as 'locs'.
>>>>> loc2rsid <- function(locs)
>>>>> {
>>>>> if (!is(locs, "GRanges"))
>>>>> stop("'locs' must be a GRanges object")
>>>>> if (!all(width(locs) == 1L))
>>>>> stop("all ranges in 'locs' must be of width 1")
>>>>> common_seqlevels <- intersect(seqlevels(locs),
>>>>> names(getSNPcount()))
>>>>> if (length(common_seqlevels) == 0L)
>>>>> stop("chromosome names (a.k.a. seqlevels) in 'locs' don't
>>>>> seem
>>>>> to ",
>>>>> "be\n compatible with the chromosome names in the
>>>>> SNPlocs
>>>>> ",
>>>>> "package. Maybe they\n use a different naming
>>>>> convention?
>>>>> ",
>>>>> "If that's the case then you first need\n to rename the
>>>>> ",
>>>>> "seqlevels in 'locs'. See '?seqlevels' for how to do
>>>>> this.")
>>>>> f <- as.factor(seqnames(locs))
>>>>> locs_by_chrom <- split(start(locs), f)
>>>>> rsids_by_chrom <- lapply(seq_along(locs_by_chrom),
>>>>> function(i)
>>>>> {
>>>>> seqname <- levels(f)[i]
>>>>> locs2 <- locs_by_chrom[[i]]
>>>>> ans2 <- vector("list", length=length(locs2))
>>>>> if (length(locs2) == 0L || !(seqname %in%
>>>>> common_seqlevels))
>>>>> return(ans2)
>>>>> snplocs <- getSNPlocs(seqname, as.GRanges=FALSE)
>>>>> hits <- findMatches(locs2, snplocs$loc)
>>>>> rsids <- paste0("rs",
>>>>> snplocs$RefSNP_id[subjectHits(hits)])
>>>>> q_hits <- queryHits(hits)
>>>>> tmp <- split(rsids, q_hits)
>>>>> ans2[as.integer(names(tmp))] <- tmp
>>>>> ans2
>>>>> })
>>>>> CharacterList(unsplit(rsids_by_chrom, f))
>>>>> }
>>>>>
>>>>> PS: This functionality will make it to the SNPlocs packages (in this
>>>>> form or another).
>>>>>
>>>>>
>>>>>
>>>>> On 05/30/2013 09:54 PM, Fabrice Tourre wrote:
>>>>>>
>>>>>>
>>>>>>
>>>>>> Dear list,
>>>>>>
>>>>>> I have a list of genome positions. Is there is a package to get the
>>>>>> snp name at this position?
>>>>>>
>>>>>> It is the reverse function as rsid2loc in
>>>>>> SNPlocs.Hsapiens.dbSNP.20120608
>>>>>>
>>>>>> Thank you.
>>>>>>
>>>>>> _______________________________________________
>>>>>> Bioconductor mailing list
>>>>>> Bioconductor at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>>>>> Search the archives:
>>>>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>>>>>
>>>>>
>>>>> --
>>>>> Hervé Pagès
>>>>>
>>>>> Program in Computational Biology
>>>>> Division of Public Health Sciences
>>>>> Fred Hutchinson Cancer Research Center
>>>>> 1100 Fairview Ave. N, M1-B514
>>>>> P.O. Box 19024
>>>>> Seattle, WA 98109-1024
>>>>>
>>>>> E-mail: hpages at fhcrc.org
>>>>> Phone: (206) 667-5791
>>>>> Fax: (206) 667-1319
>>>
>>>
>>>
>>> --
>>> Hervé Pagès
>>>
>>> Program in Computational Biology
>>> Division of Public Health Sciences
>>> Fred Hutchinson Cancer Research Center
>>> 1100 Fairview Ave. N, M1-B514
>>> P.O. Box 19024
>>> Seattle, WA 98109-1024
>>>
>>> E-mail: hpages at fhcrc.org
>>> Phone: (206) 667-5791
>>> Fax: (206) 667-1319
>
>
> --
> Hervé Pagès
>
> Program in Computational Biology
> Division of Public Health Sciences
> Fred Hutchinson Cancer Research Center
> 1100 Fairview Ave. N, M1-B514
> P.O. Box 19024
> Seattle, WA 98109-1024
>
> E-mail: hpages at fhcrc.org
> Phone: (206) 667-5791
> Fax: (206) 667-1319
More information about the Bioconductor
mailing list