[BioC] RE : maping SNPs
Simon Noël
simon.noel.2 at ulaval.ca
Wed Dec 22 15:51:00 CET 2010
Hello,
Sorry to be slow for a response. What you say that should take 3-5 min took 3-5 days... I don't have the best computer in the world.
We do have a supercomputer but even if it's the same version of R and package than my litle computer, it's not working... Well... It's not working on my computer to but not at the same place. We both have R 2.12.0 and the lastest version of every package.
Here is what I have. I have try with the test you sugested.
On the supercomputer :
> getSNPcount()
ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10
1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075 1158707
ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20
1147722 1105364 815729 740129 657719 757926 641905 645646 520666 586708
ch21 ch22 chX chY chMT
338254 331060 529608 67438 624
> ch22snps <- getSNPlocs("ch22")
> ch22snps[1:5, ]
RefSNP_id alleles_as_ambig loc
1 56342815 K 16050353
2 7288968 S 16050994
3 6518357 M 16051107
4 7292503 R 16051209
5 6518368 Y 16051241
>
>
> makeGRangesFromRefSNPids <- function(myids)
+ {
+ ans_seqnames <- character(length(myids))
+ ans_seqnames[] <- "unknown"
+ ans_locs <- integer(length(myids))
+ for (seqname in names(getSNPcount())) {
+ locs <- getSNPlocs(seqname)
+ ids <- paste("rs", locs$RefSNP_id, sep="")
+ myrows <- match(myids, ids)
+ ans_seqnames[!is.na(myrows)] <- seqname
+ ans_locs[!is.na(myrows)] <- locs$loc[myrows]
+ }
+ GRanges(seqnames=ans_seqnames,
+ IRanges(start=ans_locs, width=1),
+ RefSNP_id=myids)
+ }
>
>
> myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs10915459", "rs16838750", "rs12128230", "rs999999999")
> mysnps <- makeGRangesFromRefSNPids(myids)
Warning message:
In ans_locs[!is.na(myrows)] <- locs$loc[myrows] :
number of items to replace is not a multiple of replacement length
> mysnps # a GRanges object with 1 SNP per row
GRangeswith 8 ranges and 1 elementMetadata value
seqnames ranges strand | RefSNP_id
<Rle> <IRanges> <Rle> | <character>
[1] ch1 [2195117, 2195117] * | rs7547453
[2] ch1 [2291680, 2291680] * | rs2840542
[3] ch1 [3256108, 3256108] * | rs1999527
[4] ch1 [3577321, 3577321] * | rs4648545
[5] ch1 [4230463, 4230463] * | rs10915459
[6] ch1 [4404344, 4404344] * | rs16838750
[7] ch1 [4501911, 4501911] * | rs12128230
[8] unknown [ 0, 0] * | rs999999999
seqlengths
ch1 unknown
NA NA
> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
Downloadthe refGene table ... OK
Downloadthe refLink table ... OK
Extractthe 'transcripts' data frame ... OK
Extractthe 'splicings' data frame ... OK
Downloadand preprocess the 'chrominfo' data frame ... OK
Preparethe 'metadata' data frame ... OK
Makethe TranscriptDb object ... Error in .writeMetadataTable(conn, metadata) : subscript out of bounds
In addition: There were 50 or more warnings (use warnings() to see the first 50)
On my computer :
> getSNPcount()
ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10
1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075 1158707
ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20
1147722 1105364 815729 740129 657719 757926 641905 645646 520666 586708
ch21 ch22 chX chY chMT
338254 331060 529608 67438 624
> ch22snps <- getSNPlocs("ch22")
> ch22snps[1:5, ]
RefSNP_id alleles_as_ambig loc
1 56342815 K 16050353
2 7288968 S 16050994
3 6518357 M 16051107
4 7292503 R 16051209
5 6518368 Y 16051241
>
>
> makeGRangesFromRefSNPids <- function(myids)
+ {
+ ans_seqnames <- character(length(myids))
+ ans_seqnames[] <- "unknown"
+ ans_locs <- integer(length(myids))
+ for (seqname in names(getSNPcount())) {
+ locs <- getSNPlocs(seqname)
+ ids <- paste("rs", locs$RefSNP_id, sep="")
+ myrows <- match(myids, ids)
+ ans_seqnames[!is.na(myrows)] <- seqname
+ ans_locs[!is.na(myrows)] <- locs$loc[myrows]
+ }
+ GRanges(seqnames=ans_seqnames,
+ IRanges(start=ans_locs, width=1),
+ RefSNP_id=myids)
+ }
>
>
> myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545", "rs10915459", "rs16838750", "rs12128230", "rs999999999")
> mysnps <- makeGRangesFromRefSNPids(myids)
Warning message:
In ans_locs[!is.na(myrows)] <- locs$loc[myrows] :
number of items to replace is not a multiple of replacement length
> mysnps # a GRanges object with 1 SNP per row
GRangeswith 8 ranges and 1 elementMetadata value
seqnames ranges strand | RefSNP_id
<Rle> <IRanges> <Rle> | <character>
[1] ch1 [2195117, 2195117] * | rs7547453
[2] ch1 [2291680, 2291680] * | rs2840542
[3] ch1 [3256108, 3256108] * | rs1999527
[4] ch1 [3577321, 3577321] * | rs4648545
[5] ch1 [4230463, 4230463] * | rs10915459
[6] ch1 [4404344, 4404344] * | rs16838750
[7] ch1 [4501911, 4501911] * | rs12128230
[8] unknown [ 0, 0] * | rs999999999
seqlengths
ch1 unknown
NA NA
> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
Downloadthe refGene table ... OK
Downloadthe refLink table ... OK
Extractthe 'transcripts' data frame ... OK
Extractthe 'splicings' data frame ... OK
Downloadand preprocess the 'chrominfo' data frame ... OK
Preparethe 'metadata' data frame ... OK
Makethe TranscriptDb object ... OK
Il y a eu 50 avis ou plus (utilisez warnings() pour voir les 50 premiers)
> txdb
TranscriptDbobject:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| UCSC Table: refGene
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 38098
| exon_nrow: 230201
| cds_nrow: 204683
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-12-22 09:34:40 -0500 (Wed, 22 Dec 2010)
| GenomicFeatures version at creation time: 1.2.3
| RSQLite version at creation time: 0.9-4
| DBSCHEMAVERSION: 1.0
> tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx # a GRanges object with 1 transcript per row
GRangeswith 38098 ranges and 3 elementMetadata values
seqnames ranges strand | tx_id tx_name
<Rle> <IRanges> <Rle> | <integer> <character>
[1] chr1 [ 69091, 70008] + | 1021 NM_001005484
[2] chr1 [323892, 328581] + | 1023 NR_028327
[3] chr1 [323892, 328581] + | 1024 NR_028322
[4] chr1 [323892, 328581] + | 1025 NR_028325
[5] chr1 [367659, 368597] + | 1022 NM_001005277
[6] chr1 [367659, 368597] + | 1026 NM_001005221
[7] chr1 [367659, 368597] + | 1027 NM_001005224
[8] chr1 [763064, 789740] + | 174 NR_015368
[9] chr1 [861121, 879961] + | 1035 NM_152486
... ... ... ... ... ... ...
[38090] chrY [27177050, 27198251] - | 18991 NM_004678
[38091] chrY [27177050, 27198251] - | 18992 NM_001002760
[38092] chrY [27177050, 27198251] - | 18993 NM_001002761
[38093] chrY [27209230, 27246039] - | 18994 NR_001525
[38094] chrY [27209230, 27246039] - | 18995 NR_002177
[38095] chrY [27209230, 27246039] - | 18996 NR_002178
[38096] chrY [27329790, 27330920] - | 18997 NR_001526
[38097] chrY [27329790, 27330920] - | 18998 NR_002179
[38098] chrY [27329790, 27330920] - | 18999 NR_002180
gene_id
<CompressedCharacterList>
[1] 79501
[2] 100133331
[3] 100132287
[4] 100132062
[5] 81399
[6] 729759
[7] 26683
[8] 643837
[9] 148398
... ...
[38090] 9083
[38091] 442867
[38092] 442868
[38093] 114761
[38094] 474150
[38095] 474149
[38096] 252949
[38097] 474152
[38098] 474151
seqlengths
chr1 chr2 ... chr18_gl000207_random
249250621 243199373 ... 4262
> seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
> map <- as.matrix(findOverlaps(mysnps, tx))
Warning message :
In .local(query, subject, maxgap, minoverlap, type, select, ...) :
Only some seqnames from 'query' and 'subject' were not identical
> mapped_genes <- values(tx)$gene_id[map[, 2]]
> mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]], elementLengths(mapped_genes))
Erreur dans base::rep.int(x, ...) : valeur 'times' incorrecte
Simon Noël
CdeC
________________________________________
De : Hervé Pagès [hpages at fhcrc.org]
Date d'envoi : 5 décembre 2010 23:43
À : Simon Noël
Cc : bioconductor at r-project.org
Objet : Re: [BioC] maping SNPs
Hi Simon,
On 12/03/2010 10:17 AM, Simon Noël wrote:
>
> Hi,
>
>
>
> I have a really big list of SNPs names like :
>
>
>
> SNPNAME
>
> rs7547453
>
> rs2840542
>
> rs1999527
>
> rs4648545
>
> rs10915459
>
> rs16838750
>
> rs12128230
>
> ...
>
>
>
> I woudlike to map them to their official gene symbol. What the best way to
> procede?
Those ids look like RefSNP ids. AFAIK dbSNP doesn't provide mappings
from SNPs to genes and I don't think we have this kind of mappings
either in our collection of annotations (*.db packages).
But if your SNPs are Human then you can do the mapping yourself by
using a SNPlocs.Hsapies.dbSNP.* package and the GenomicFeatures
packages.
The latest SNPlocs.Hsapies.dbSNP.* package is
SNPlocs.Hsapiens.dbSNP.20101109 (dbSNP Build 132): it contains
SNP locations relative to the GRCh37 genome:
> library(SNPlocs.Hsapiens.dbSNP.20101109)
> getSNPcount()
ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9
ch10
1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129 995075
1158707
ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19
ch20
1147722 1105364 815729 740129 657719 757926 641905 645646 520666
586708
ch21 ch22 chX chY chMT
338254 331060 529608 67438 624
Note that it doesn't contain *all* SNPs from dbSNP Build 132:
only a subset of "clean" SNPs (see ?SNPlocs.Hsapiens.dbSNP.20101109
for the details).
> ch22snps <- getSNPlocs("ch22")
> ch22snps[1:5, ]
RefSNP_id alleles_as_ambig loc
1 56342815 K 16050353
2 7288968 S 16050994
3 6518357 M 16051107
4 7292503 R 16051209
5 6518368 Y 16051241
Note that the rs prefix has been dropped.
So here is how to proceed:
First you can use the following function to make a GRanges object from
your SNP ids:
makeGRangesFromRefSNPids <- function(myids)
{
ans_seqnames <- character(length(myids))
ans_seqnames[] <- "unknown"
ans_locs <- integer(length(myids))
for (seqname in names(getSNPcount())) {
locs <- getSNPlocs(seqname)
ids <- paste("rs", locs$RefSNP_id, sep="")
myrows <- match(myids, ids)
ans_seqnames[!is.na(myrows)] <- seqname
ans_locs[!is.na(myrows)] <- locs$loc[myrows]
}
GRanges(seqnames=ans_seqnames,
IRanges(start=ans_locs, width=1),
RefSNP_id=myids)
}
This takes between 3 and 5 minutes:
> myids <- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
"rs10915459", "rs16838750", "rs12128230", "rs999999999")
> mysnps <- makeGRangesFromRefSNPids(myids)
> mysnps # a GRanges object with 1 SNP per row
GRanges with 8 ranges and 1 elementMetadata value
seqnames ranges strand | myids
<Rle> <IRanges> <Rle> | <character>
[1] ch1 [2195117, 2195117] * | rs7547453
[2] ch1 [2291680, 2291680] * | rs2840542
[3] ch1 [3256108, 3256108] * | rs1999527
[4] ch1 [3577321, 3577321] * | rs4648545
[5] ch1 [4230463, 4230463] * | rs10915459
[6] ch1 [4404344, 4404344] * | rs16838750
[7] ch1 [4501911, 4501911] * | rs12128230
[8] unknown [ 0, 0] * | rs999999999
seqlengths
ch1 unknown
NA NA
The next step is to create a TranscriptDb object with
makeTranscriptDbFromUCSC() or makeTranscriptDbFromBiomart()
from the GenomicFeatures package. This TranscriptDb object will
contain the transcript locations and their associated
genes extracted from the annotation source you choose.
For example, if you want to use RefSeq genes:
## Takes about 3 minutes:
> txdb <- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> txdb
TranscriptDb object:
| Db type: TranscriptDb
| Data source: UCSC
| Genome: hg19
| UCSC Table: refGene
| Type of Gene ID: Entrez Gene ID
| Full dataset: yes
| transcript_nrow: 37924
| exon_nrow: 230024
| cds_nrow: 204571
| Db created by: GenomicFeatures package from Bioconductor
| Creation time: 2010-12-05 19:41:40 -0800 (Sun, 05 Dec 2010)
| GenomicFeatures version at creation time: 1.2.2
| RSQLite version at creation time: 0.9-4
| DBSCHEMAVERSION: 1.0
Note the type of gene IDs (Entrez Gene ID) stored in this TranscriptDb
object: this means that later you will be able to use the org.Hs.eg.db
package to map your gene ids to their symbol (the org.*.eg.db packages
are Entrez Gene ID centric).
To extract the transcript locations together with their genes:
> tx <- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
> tx # a GRanges object with 1 transcript per row
GRanges with 37924 ranges and 1 elementMetadata value
seqnames ranges strand | gene_id
<Rle> <IRanges> <Rle> | <CompressedCharacterList>
[1] chr1 [ 69091, 70008] + | 79501
[2] chr1 [323892, 328581] + | 100133331
[3] chr1 [323892, 328581] + | 100132287
[4] chr1 [323892, 328581] + | 100132062
[5] chr1 [367659, 368597] + | 81399
[6] chr1 [367659, 368597] + | 729759
[7] chr1 [367659, 368597] + | 26683
[8] chr1 [763064, 789740] + | 643837
[9] chr1 [861121, 879961] + | 148398
... ... ... ... ... ...
[37916] chrY [27177050, 27198251] - | 9083
[37917] chrY [27177050, 27198251] - | 442867
[37918] chrY [27177050, 27198251] - | 442868
[37919] chrY [27209230, 27246039] - | 114761
[37920] chrY [27209230, 27246039] - | 474150
[37921] chrY [27209230, 27246039] - | 474149
[37922] chrY [27329790, 27330920] - | 252949
[37923] chrY [27329790, 27330920] - | 474152
[37924] chrY [27329790, 27330920] - | 474151
seqlengths
chr1 chr2 ... chr18_gl000207_random
249250621 243199373 ... 4262
Now you can use findOverlaps() on 'mysnps' and 'tx' to find
the transcripts hits by your snps. But before you can do this,
you need to rename the sequences in 'mysnps' because dbSNPs and
UCSC use different naming conventions for the chromosomes:
> seqnames(mysnps) <- sub("ch", "chr", seqnames(mysnps))
Then:
> map <- as.matrix(findOverlaps(mysnps, tx))
'map' contains the mapping between your SNPs and their genes but not
in a readable form (this matrix contains indices) so we make the
'snp2gene' data frame with 2 cols: 1 for your SNP ids and 1 for
the associated gene ids:
> mapped_genes <- values(tx)$gene_id[map[, 2]]
> mapped_snps <- rep.int(values(mysnps)$myids[map[, 1]],
elementLengths(mapped_genes))
> snp2gene <- unique(data.frame(snp_id=mapped_snps,
gene_id=unlist(mapped_genes)))
> rownames(snp2gene) <- NULL
> snp2gene[1:4, ]
snp_id gene_id
1 rs7547453 6497
2 rs2840542 79906
3 rs1999527 63976
4 rs4648545 7161
Note that there is no guarantee that the number of rows in this
data frame is the number of your original SNP ids because the
relation between SNP ids and gene ids is of course not one-to-one.
Also the method described above considers that a SNP hits a gene
if it's located between the start and the end of one of its
transcripts but it doesn't take in account the exon structure of
the transcripts. If you want to do this you need to use exonsBy()
(from GenomicFeatures) to extract the exons grouped by transcripts
(this will be stored in a GRangesList object) and use this object
instead of 'tx' in the call to findOverlaps().
Hope this helps,
H.
>
>
>
> Simon Noël
> CdeC
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list