[BioC] RE : RE : maping SNPs
Valerie Obenchain
vobencha at fhcrc.org
Tue Jan 11 17:32:10 CET 2011
Hi Simon,
The windows binary of SNPlocs.Hsapiens.dbSNP.20101109 is now available.
Valerie
On 01/07/11 10:48, Valerie Obenchain wrote:
> Hi Simon,
>
> Please see responses below.
>
>
>
> On 01/06/2011 09:47 AM, Simon Noël wrote:
>
>> Thank you. Sorry for the delay. As I say, on my
>> computer, thing run really slowy... I have no error but I have some
>> warning and no result...
>>
>>
>>
>> library("IRanges")
>> library("GenomicRanges")
>> library("GenomicFeatures")
>> #À changer si une version plus récente de la librairie est téléchargée.
>> library(SNPlocs.Hsapiens.dbSNP.20101109)
>>
>>
>>> getSNPcount()
>>>
>> ch1 ch2 ch3 ch4 ch5 ch6 ch7 ch8 ch9 ch10
>>
>> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129
>> 995075 1158707
>> ch11 ch12 ch13 ch14 ch15 ch16 ch17 ch18 ch19 ch20
>>
>> 1147722 1105364 815729 740129 657719 757926 641905 645646
>> 520666 586708
>> ch21 ch22 chX chY chMT
>> 338254 331060 529608 67438 624
>>
>>> ch22snps<- getSNPlocs("ch22")
>>> ch22snps[1:5, ]
>>>
>> RefSNP_id alleles_as_ambig loc
>> 1 56342815 K 16050353
>> 2 7288968 S 16050994
>> 3 6518357 M 16051107
>> 4 7292503 R 16051209
>> 5 6518368 Y 16051241
>>
>>>
>>> makeGRangesFromRefSNPids<- function(myids)
>>>
>> + {
>> + ans_seqnames<- character(length(myids))
>> + ans_seqnames[]<- "unknown"
>> + ans_locs<- integer(length(myids))
>> + for (seqname in names(getSNPcount())) {
>> + locs<- getSNPlocs(seqname)
>> + ids<- paste("rs", locs$RefSNP_id, sep="")
>> + myrows<- match(myids, ids)
>> + ans_seqnames[!is.na(myrows)]<- seqname
>> + ans_locs[!is.na(myrows)]<- locs$loc[myrows]
>> + }
>> + GRanges(seqnames=ans_seqnames,
>> + IRanges(start=ans_locs, width=1),
>> + RefSNP_id=myids)
>> + }
>>
>>>
>>> myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
>>>
>> "rs10915459", "rs16838750", "rs12128230", "rs999999999")
>>
>>> mysnps<- makeGRangesFromRefSNPids(myids)
>>>
>> Warning message:
>> In ans_locs[!is.na(myrows)]<- locs$loc[myrows] :
>> number of items to replace is not a multiple of replacement length
>>
>>> mysnps # a GRanges object with 1 SNP per row
>>>
>> GRangeswith 8 ranges and 1 elementMetadata value
>> seqnames ranges strand | RefSNP_id
>> <Rle> <IRanges> <Rle> |<character>
>> [1] ch1 [2195117, 2195117] * | rs7547453
>> [2] ch1 [2291680, 2291680] * | rs2840542
>> [3] ch1 [3256108, 3256108] * | rs1999527
>> [4] ch1 [3577321, 3577321] * | rs4648545
>> [5] ch1 [4230463, 4230463] * | rs10915459
>> [6] ch1 [4404344, 4404344] * | rs16838750
>> [7] ch1 [4501911, 4501911] * | rs12128230
>> [8] unknown [ 0, 0] * | rs999999999
>> seqlengths
>> ch1 unknown
>> NA NA
>>
>>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
>>>
>> Downloadthe refGene table ... OK
>> Downloadthe refLink table ... OK
>> Extractthe 'transcripts' data frame ... OK
>> Extractthe 'splicings' data frame ... OK
>> Downloadand preprocess the 'chrominfo' data frame ... OK
>> Preparethe 'metadata' data frame ... OK
>> Makethe TranscriptDb object ... OK
>> Il y a eu 50 avis ou plus (utilisez warnings() pour voir les 50 premiers)
>>
>>> txdb
>>>
>> TranscriptDbobject:
>> | Db type: TranscriptDb
>> | Data source: UCSC
>> | Genome: hg19
>> | UCSC Table: refGene
>> | Type of Gene ID: Entrez Gene ID
>> | Full dataset: yes
>> | transcript_nrow: 38098
>> | exon_nrow: 230201
>> | cds_nrow: 204683
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2011-01-06 11:55:44 -0500 (Thu, 06 Jan 2011)
>> | GenomicFeatures version at creation time: 1.2.3
>> | RSQLite version at creation time: 0.9-4
>> | DBSCHEMAVERSION: 1.0
>>
>>> #voir pour exonsBy() à la place de tx
>>> txExon = exonsBy(txdb, by= "tx")
>>> txExon
>>>
>> GRangesListof length 38098
>> $1
>> GRanges with 25 ranges and 3 elementMetadata values
>> seqnames ranges strand
>> | exon_id exon_name exon_rank
>> <Rle> <IRanges> <Rle> |<integer> <character>
>> <integer>
>> [1] chr1 [66999825, 67000051] + | 1
>> NA 1
>> [2] chr1 [67091530, 67091593] + | 2
>> NA 2
>> [3] chr1 [67098753, 67098777] + | 3
>> NA 3
>> [4] chr1 [67101627, 67101698] + | 4
>> NA 4
>> [5] chr1 [67105460, 67105516] + | 5
>> NA 5
>> [6] chr1 [67108493, 67108547] + | 6
>> NA 6
>> [7] chr1 [67109227, 67109402] + | 7
>> NA 7
>> [8] chr1 [67126196, 67126207] + | 8
>> NA 8
>> [9] chr1 [67133213, 67133224] + | 9
>> NA 9
>> ... ... ... ... ... ...
>> ... ...
>> [17] chr1 [67155873, 67155999] + | 17
>> NA 17
>> [18] chr1 [67161117, 67161176] + | 18
>> NA 18
>> [19] chr1 [67184977, 67185088] + | 19
>> NA 19
>> [20] chr1 [67194947, 67195102] + | 20
>> NA 20
>> [21] chr1 [67199431, 67199563] + | 21
>> NA 21
>> [22] chr1 [67205018, 67205220] + | 22
>> NA 22
>> [23] chr1 [67206341, 67206405] + | 23
>> NA 23
>> [24] chr1 [67206955, 67207119] + | 24
>> NA 24
>> [25] chr1 [67208756, 67210768] + | 25
>> NA 25
>> ...
>> <38097 more elements>
>>
>> seqlengths
>> chr1 chr2 ... chr18_gl000207_random
>> 249250621 243199373 ... 4262
>>
>>> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
>>> tx # a GRanges object with 1 transcript per row
>>>
>> GRangeswith 38098 ranges and 3 elementMetadata values
>> seqnames ranges strand | tx_id tx_name
>> <Rle> <IRanges> <Rle> |<integer> <character>
>> [1] chr1 [ 69091, 70008] + | 1021 NM_001005484
>> [2] chr1 [323892, 328581] + | 1023 NR_028327
>> [3] chr1 [323892, 328581] + | 1024 NR_028322
>> [4] chr1 [323892, 328581] + | 1025 NR_028325
>> [5] chr1 [367659, 368597] + | 1022 NM_001005277
>> [6] chr1 [367659, 368597] + | 1026 NM_001005221
>> [7] chr1 [367659, 368597] + | 1027 NM_001005224
>> [8] chr1 [763064, 789740] + | 174 NR_015368
>> [9] chr1 [861121, 879961] + | 1035 NM_152486
>> ... ... ... ... ... ... ...
>> [38090] chrY [27177050, 27198251] - | 18991 NM_004678
>> [38091] chrY [27177050, 27198251] - | 18992 NM_001002760
>> [38092] chrY [27177050, 27198251] - | 18993 NM_001002761
>> [38093] chrY [27209230, 27246039] - | 18994 NR_001525
>> [38094] chrY [27209230, 27246039] - | 18995 NR_002177
>> [38095] chrY [27209230, 27246039] - | 18996 NR_002178
>> [38096] chrY [27329790, 27330920] - | 18997 NR_001526
>> [38097] chrY [27329790, 27330920] - | 18998 NR_002179
>> [38098] chrY [27329790, 27330920] - | 18999 NR_002180
>> gene_id
>> <CompressedCharacterList>
>> [1] 79501
>> [2] 100133331
>> [3] 100132287
>> [4] 100132062
>> [5] 81399
>> [6] 729759
>> [7] 26683
>> [8] 643837
>> [9] 148398
>> ... ...
>> [38090] 9083
>> [38091] 442867
>> [38092] 442868
>> [38093] 114761
>> [38094] 474150
>> [38095] 474149
>> [38096] 252949
>> [38097] 474152
>> [38098] 474151
>> seqlengths
>> chr1 chr2 ... chr18_gl000207_random
>> 249250621 243199373 ... 4262
>>
>>> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
>>> map<- as.matrix(findOverlaps(mysnps, tx))
>>>
>> Message d'avis :
>> In .local(query, subject, maxgap, minoverlap, type, select, ...) :
>> No seqnames from the 'query' and 'subject' were identical
>>
> It looks like the seqnames in mysnps were not renamed to the 'chr'
> prefix. Prior to running findOverlaps, check the seqnames in both
> objects to make sure the ranges you are interested in mapping have
> identical seqnames,
>
> > unique(seqnames(mysnps))
> [1] chr1 unknown
> Levels: chr1 unknown
>
>
> > unique(seqnames(tx))
> [1] chr1 chr10 chr11
> [4] chr12 chr13 chr14
> .... and many more.
>
> Both objects have 'chr1' and when we run findOverlaps,
>
> > map<- as.matrix(findOverlaps(mysnps, tx))
> Warning message:
> In .local(query, subject, maxgap, minoverlap, type, select, ...) :
> Only some seqnames from 'query' and 'subject' were not identical
>
> This warning makes sense because mysnps only has 'chr1' and 'unknown' to
> map to tx but there are many more seqnames in tx. Additionally, tx does
> not have an 'unknown' seqname.
>
> > map
> query subject
> [1,] 1 49
> [2,] 2 1944
> [3,] 3 62
> [4,] 3 63
> [5,] 4 67
>
>
>
>>> mapExon<- as.matrix(findOverlaps(mysnps, txExon))
>>>
>> Message d'avis :
>> In .local(query, subject, maxgap, minoverlap, type, select, ...) :
>> No seqnames from the 'query' and 'subject' were identical
>>
>>> mapped_genes<- values(tx)$gene_id[map[, 2]]
>>> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
>>>
>> elementLengths(mapped_genes))
>>
>>> snp2gene<- unique(data.frame(snp_id=mapped_snps,
>>>
>> gene_id=unlist(mapped_genes)))
>>
>>> rownames(snp2gene)<- NULL
>>> snp2gene[1:4, ]
>>>
>> snp_id gene_id
>> NA<NA> <NA>
>> NA.1<NA> <NA>
>> NA.2<NA> <NA>
>> NA.3<NA> <NA>
>>
>> On our super computer, I can't update manualy the package. They
>> are suposed to update automatically once per month according to our
>> informactician. The result of sessionInfo() is :
>>
>>
>>> sessionInfo()
>>>
>> R version 2.12.0 (2010-10-15)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> locale:
>> [1] LC_CTYPE=en_CA.UTF-8 LC_NUMERIC=C
>> [3] LC_TIME=en_CA.UTF-8 LC_COLLATE=en_CA.UTF-8
>> [5] LC_MONETARY=C LC_MESSAGES=en_CA.UTF-8
>> [7] LC_PAPER=en_CA.UTF-8 LC_NAME=C
>> [9] LC_ADDRESS=C LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>> attachedbase packages:
>> [1] stats graphics grDevices utils datasets methods base
>> otherattached packages:
>> [1] GenomicFeatures_1.2.3
>> [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2
>> [3] GenomicRanges_1.2.2
>> [4] IRanges_1.8.7
>> loadedvia a namespace (and not attached):
>> [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2 BSgenome_1.18.2
>> [5] DBI_0.2-5 RCurl_1.5-0
>> RSQLite_0.9-4 rtracklayer_1.10.6
>> [9] tools_2.12.0 XML_3.2-0
>>
>> On my computer , something stragge happen. The
>> package get unstalled when I run
>> | source("http://bioconductor.org/biocLite.R")
>> update.packages(repos=biocinstallRepos(), ask=FALSE, checkBuilt=TRUE)
>> |
>>
>> And when I try to reinstall SNPlocs.Hsapiens.dbSNP.20101109, I now get
>> an error message...
>>
>>
>>> source("http://www.bioconductor.org/biocLite.R")
>>>
>> BioC_mirror= http://www.bioconductor.org
>> Change using chooseBioCmirror().
>>
>>> biocLite("SNPlocs.Hsapiens.dbSNP.20101109")
>>>
>> Using R version 2.12.0, biocinstall version 2.7.4.
>> InstallingBioconductor version 2.7 packages:
>> [1] "SNPlocs.Hsapiens.dbSNP.20101109"
>> Pleasewait...
>> Message d'avis :
>> In getDependencies(pkgs, dependencies, available, lib) :
>> package ‘SNPlocs.Hsapiens.dbSNP.20101109’ is not available
>>
>>
>>
>>
>>
>>
>>
>> But for sessionInfo, it's :
>>
>>
>>
>>
>>> sessionInfo()
>>>
>> R version 2.12.0 (2010-10-15)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=French_Canada.1252 LC_CTYPE=French_Canada.1252
>> [3] LC_MONETARY=French_Canada.1252 LC_NUMERIC=C
>> [5] LC_TIME=French_Canada.1252
>>
>> attachedbase packages:
>> [1] stats graphics grDevices utils datasets methods base
>>
>> otherattached packages:
>> [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.2 IRanges_1.8.8
>>
>> loadedvia a namespace (and not attached):
>> [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2 BSgenome_1.18.2
>>
>> [5] DBI_0.2-5 RCurl_1.5-0.1 RSQLite_0.9-4
>> rtracklayer_1.10.6
>> [9] XML_3.2-0.2
>>
>>
>>
>> Now I have a super computer that can't run the program because of
>> a subscript out of bounds and a personal computer than have a missing
>> library... So now nothing work:( Other problem : why when
>> I install again IRanges I got a version older than your for exemple?
>>
> The code should run on the linux machine (super computer) once you
> address the seqnames problem above. You are having a problem installing
> SNPlocs.Hsapiens.dbSNP.20101109 on your Windows pc because a Windows
> binary of this package is not currently available. We should have this
> available and I will let you know as soon as it is ready. Sorry about that.
>
> I had a newer version of IRanges in my last email because I was using
> R-2.13, the devel branch. When using R-2.12 I have,
>
>
>> sessionInfo()
>>
> R version 2.12.0 Patched (2010-10-18 r53360)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
> [5] LC_MONETARY=C LC_MESSAGES=en_US.UTF-8
> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
> [9] LC_ADDRESS=C LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats graphics grDevices utils datasets methods base
>
> other attached packages:
> [1] GenomicFeatures_1.2.3
> [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2
> [3] GenomicRanges_1.2.2
> [4] IRanges_1.8.8
>
> loaded via a namespace (and not attached):
> [1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2
> BSgenome_1.18.2
> [5] DBI_0.2-5 RCurl_1.5-0 RSQLite_0.9-4
> rtracklayer_1.10.6
> [9] tools_2.12.0 XML_3.2-0
>
> Valerie
>
>>
>> Simon Noël
>> CdeC
>>
>
> [[alternative HTML version deleted]]
>
>
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
More information about the Bioconductor
mailing list