[BioC] get chr position for a batch of human SNPs

Fri Sep 23 09:11:17 CEST 2011

Hi Shirley,

The following page explains the difference between Build 36.1
and Build 36.3:

   http://www.ncbi.nlm.nih.gov/genome/guide/human/release_notes.html

In short, the *reference* genome assembly is the same for builds 36.1,
36.2 and 36.3. In those 3 builds, the reference assembly corresponds
to UCSC hg18.

Note that what NCBI calls a "build" can contain other assemblies in
addition to the reference assembly. For example builds 36.1 and 36.2
contain only the reference assembly but Build 36.3 also contains
the HuRef and Celera assemblies.

All the SNP positions relative to the HuRef or Celera assemblies were
ignored when the SNPlocs.Hsapiens.dbSNP.20090506 package was made.
Only the positions relative to the "reference assembly" were used
(i.e. positions relative to hg18).

I've tried to refine a little bit the SNPlocs packages over the
time, and, in the more recent ones (starting with
SNPlocs.Hsapiens.dbSNP.20100427), the man page provides some
details on how the SNPs were curated before inclusion in the
package. In particular SNPs mapped to more than 1 location on
the reference genome were dropped.

Hope this clarifies,

H.

On 11-09-22 05:48 PM, shirley zhang wrote:
> Dear Herve,
>
> Thanks for your quick reply.
>
> I've checked the link you suggested. The SNPs in
> "SNPlocs.Hsapiens.dbSNP.20090506" do map the hg18 genome (NCBI Build
> 36.1). But when I manually check the hg18 positions for a few SNPs on
> dbSNP at NCBI. I can only get position information for NCBI Build 36.3
> and NCBI Build 37. What is the difference between NCBI Build 36.3 and
> NCBI Build 36.1? Are they basically the same? Further, in dbSNP at
> NCBI, even for NCBI Build 36.3, there are 3 versions of the hg18
> position per SNP (each version has a different Group Label, i.e.
> "reference", "Celera" or "HuRef"). Could you kindly tell me what's the
> difference among these 3 versions?
>
> For my 2nd question, getting SNP position for SNPs without knowing the
> chr information, since I need to map SNPids to hg18 positions very
> often in my research, after all SNPs from all of chromosomes  were
> retrieved using "SNPlocs.Hsapiens.dbSNP.20090506",  I combined them
> into one big data frame and saved it on the server. But later on if I
> need to map SNPids to hg19, I will follow your suggestions by creating
> a big GRanges object using more recent SNPlocs packages.
>
> Thanks a lot for your detailed explanation. It is very helpful.
>
> Shirley
>
> 2011/9/22 Hervé Pagès<hpages at fhcrc.org>:
>> Hi Shirley,
>>
>> On 11-09-22 11:29 AM, shirley zhang wrote:
>>>
>>> Dear All,
>>>
>>> I am planing to map the SNPids to hg18 positions (chr and position)
>>> for a huge list of human snps. I've tried the package
>>> "SNPlocs.Hsapiens.dbSNP.20090506" and have 2 questions regarding this
>>> package:
>>>
>>> 1. Do the SNPs in this package map the hg18 genome (NCBI Build 36.3
>>> with Group Label "reference" instead of "Celera" or "HuRef"?
>>
>> Yes, they are mapped to hg18. See:
>>
>>
>> http://bioconductor.org/packages/release/data/annotation/html/SNPlocs.Hsapiens.dbSNP.20090506.html
>>
>> and the man page of the package for additional details:
>>
>>   >  library(SNPlocs.Hsapiens.dbSNP.20090506)
>>   >  ?SNPlocs.Hsapiens.dbSNP.20090506
>>
>>>
>>> 2. If I don't know the chr information (seqname), can I obtain the
>>> position with dbSNP Id only?
>>
>> Unfortunately, because SNPs are stored in one data frame per
>> chromosome, if you don't know the chr then you need to load and
>> query each data frame individually.
>>
>> With more recent SNPlocs packages (e.g.
>> SNPlocs.Hsapiens.dbSNP.20100427), provision was added to
>> let the user load SNPs from more than 1 chromosome in a single
>> GRanges object, so you can do something like:
>>
>>   ## Load all the SNPs in a big GRanges object (takes about
>>   ## 13 minutes and requires 6GB of RAM!):
>>   all_snps<- getSNPlocs(names(getSNPcount()), as.GRanges=TRUE)
>>
>>   ## Use the rs ids to set the names (takes about 6 minutes):
>>   names(all_snps)<- paste("rs", elementMetadata(all_snps)$RefSNP_id,
>>                            sep="")
>>
>>   ## Then extract your SNPs from the big GRanges object (again,
>>   ## this can take a long time, depending on how many SNPs you
>>   ## extract):
>>   my_rs_ids<- sample(names(all_snps), 1000)
>>   my_snps<- all_snps[my_rs_ids]
>>
>> However, please note that, starting with
>> SNPlocs.Hsapiens.dbSNP.20100427 (i.e. dbSNP Build 131),
>> SNPs are mapped to GRCh37 (UCSC hg19) instead of hg18.
>>
>> Hope this helps,
>> H.
>>
>>>
>>> Further, I find dbSNP batch queries a little more difficult to work
>>> with because they map to different versions of the hg18 like Celera,
>>> HumanRef, etc.Can anybody let me know a better option to get hg18 chr
>>> position with the most popular or confident version of dbSNP?
>>>
>>> Thanks in advance
>>>
>>> _______________________________________________
>>> Bioconductor mailing list
>>> Bioconductor at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/bioconductor
>>> Search the archives:
>>> http://news.gmane.org/gmane.science.biology.informatics.conductor
>>
>>
>> --
>> Hervé Pagès
>>
>> Program in Computational Biology
>> Division of Public Health Sciences
>> Fred Hutchinson Cancer Research Center
>> 1100 Fairview Ave. N, M1-B514
>> P.O. Box 19024
>> Seattle, WA 98109-1024
>>
>> E-mail: hpages at fhcrc.org
>> Phone:  (206) 667-5791
>> Fax:    (206) 667-1319
>>

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319