[BioC] Problem locating SNP by rsID for SNPlocs.Hsapiens.dbSNP.20120608 package
Hervé Pagès
hpages at fhcrc.org
Wed Jan 16 08:00:56 CET 2013
Hi Christina,
According to the official announcement:
http://www.ncbi.nlm.nih.gov/mailman/pipermail/dbsnp-announce/2012q2/000122.html
there are 53,558,214 rs ids in dbSNP 137 for Human.
But in SNPlocs.Hsapiens.dbSNP.20120608:
> library(SNPlocs.Hsapiens.dbSNP.20120608)
> sum(getSNPcount())
[1] 45416711
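For reference, the gap between the two counts can be worked out directly. This is only arithmetic on the two numbers quoted above; the variable names are illustrative, and it assumes both counts refer to the same dbSNP build.

```r
announced  <- 53558214  # rs ids announced for Human in dbSNP 137
in_package <- 45416711  # sum(getSNPcount()) reported above

# Number of rs ids that are in dbSNP 137 but not in the package
announced - in_package
# [1] 8141503
```

So roughly 8.1 million rs ids are absent from the package.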
As explained in ?SNPlocs.Hsapiens.dbSNP.20120608, the package (like
all other SNPlocs packages) was curated:
SNPs from dbSNP were filtered to keep only those satisfying the 3
following criteria:
• The SNP is a single-base substitution i.e. its type is "snp".
Other types used by dbSNP are: "in-del", "mixed",
"microsatellite", "named-locus",
"multinucleotide-polymorphism", etc... All those SNPs were
dropped.
• The SNP is marked as notwithdrawn.
• A *single* location on the reference genome (GRCh37.p5) is
reported for the SNP, and this location is on chromosomes
1-22, X, Y, MT.
rs7775397 was dropped for this last reason.
More precisely, the record in ds_flat_ch6.flat for this SNP contains
the following CTG lines:
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=32261252 | NT_007592.15 |
ctg-start=32201252 | ctg-end=32201252 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_113891.2 |
ctg-start=3732030 | ctg-end=3732030 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167245.1 |
ctg-start=3540499 | ctg-end=3540499 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167246.1 |
ctg-start=3604088 | ctg-end=3604088 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167248.1 |
ctg-start=3522471 | ctg-end=3522471 | loctype=2 | orient=+
CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167249.1 |
ctg-start=3609047 | ctg-end=3609047 | loctype=2 | orient=+
That is, the record contains more than 1 CTG line for the reference
assembly (GRCh37.p5), so it does not report a *single* location. This
is why the SNP was dropped.
I realize now that maybe I could keep those SNPs that have more than
1 CTG line corresponding to the reference assembly as long as exactly
1 of them actually provides a value for the chr-pos field. Would that
be reasonable?
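To make the proposal concrete, here is a minimal base-R sketch that applies both rules to the CTG lines quoted above. It is not the code used to build the SNPlocs packages: get_field(), keep_current() and keep_relaxed() are hypothetical helpers written for this illustration, and the sketch assumes each CTG record has been joined onto one line with fields separated by " | ". It only looks at the location criterion, not at the SNP type, the withdrawn status, or the chromosome name.

```r
# CTG lines for rs7775397, copied from the record quoted above
# (each wrapped record joined onto a single line).
ctg_lines <- c(
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=32261252 | NT_007592.15 | ctg-start=32201252 | ctg-end=32201252 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_113891.2 | ctg-start=3732030 | ctg-end=3732030 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167245.1 | ctg-start=3540499 | ctg-end=3540499 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167246.1 | ctg-start=3604088 | ctg-end=3604088 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167248.1 | ctg-start=3522471 | ctg-end=3522471 | loctype=2 | orient=+",
  "CTG | assembly=GRCh37.p5 | chr=6 | chr-pos=? | NT_167249.1 | ctg-start=3609047 | ctg-end=3609047 | loctype=2 | orient=+"
)

# Hypothetical helper: value of one "key=value" field for each line,
# or NA when the field is absent.
get_field <- function(lines, key) {
  fields <- strsplit(lines, " | ", fixed = TRUE)
  vapply(fields, function(f) {
    hit <- f[startsWith(f, paste0(key, "="))]
    if (length(hit) == 0L) NA_character_ else sub("^[^=]*=", "", hit[1L])
  }, character(1))
}

# Current rule as described above: exactly 1 CTG line on the
# reference assembly.
keep_current <- function(lines, assembly = "GRCh37.p5") {
  sum(get_field(lines, "assembly") %in% assembly) == 1L
}

# Proposed rule: any number of CTG lines on the reference assembly,
# as long as exactly 1 of them gives a numeric chr-pos value.
keep_relaxed <- function(lines, assembly = "GRCh37.p5") {
  on_ref <- lines[get_field(lines, "assembly") %in% assembly]
  sum(grepl("^[0-9]+$", get_field(on_ref, "chr-pos"))) == 1L
}

keep_current(ctg_lines)  # FALSE: 6 CTG lines on GRCh37.p5
keep_relaxed(ctg_lines)  # TRUE: only 1 of them has a chr-pos value
```

Under this sketch rs7775397 would be kept at chr-pos 32261252 on chromosome 6, while a SNP whose CTG lines all have chr-pos=? would still be dropped.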
Thanks,
H.
On 01/15/2013 05:19 PM, Christina Chaivorapol wrote:
> Hi,
>
> Has anyone ever had a case where a SNP was not found in
> SNPlocs.Hsapiens.dbSNP.20120608, but is found in dbSNP 137? I am
> having this problem with SNP rs7775397.
>
>> library(SNPlocs.Hsapiens.dbSNP.20120608)
>> rsidsToGRanges('rs7775397')
> Error in .snpid2rowidx(x, snpid) : SNP id(s) not found: 7775397
>
> Thanks,
> Christina
>
>> sessionInfo()
> R version 2.15.2 (2012-10-26)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=C LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] datasets utils grDevices graphics stats methods base
>
> other attached packages:
> [1] SNPlocs.Hsapiens.dbSNP.20120608_0.99.8
> [2] BSgenome_1.26.1
> [3] Biostrings_2.26.2
> [4] GenomicRanges_1.10.5
> [5] IRanges_1.16.4
> [6] BiocGenerics_0.4.0
>
> loaded via a namespace (and not attached):
> [1] parallel_2.15.2 stats4_2.15.2
>
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M1-B514
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319