[BioC] RE : RE : maping SNPs

Tue Jan 11 17:32:10 CET 2011

Hi Simon,

The windows binary of SNPlocs.Hsapiens.dbSNP.20101109 is now available.

Valerie

On 01/07/11 10:48, Valerie Obenchain wrote:
> Hi Simon,
>
> Please see responses below.
>
>
>
> On 01/06/2011 09:47 AM, Simon Noël wrote:
>    
>> Thank you.  Sorry for the delay.  As I say, on my
>> computer, thing run really slowy...  I have no error but I have some
>> warning and no result...
>>
>>
>>
>> library("IRanges")
>> library("GenomicRanges")
>> library("GenomicFeatures")
>> #À changer si une version plus récente de la librairie est téléchargée.
>> library(SNPlocs.Hsapiens.dbSNP.20101109)
>>
>>      
>>> getSNPcount()
>>>        
>>      ch1     ch2     ch3     ch4     ch5     ch6     ch7     ch8     ch9    ch10
>>
>> 1849438 1936836 1613418 1613633 1453710 1446827 1335745 1243129
>> 995075 1158707
>>     ch11    ch12    ch13    ch14    ch15    ch16    ch17    ch18    ch19    ch20
>>
>> 1147722 1105364  815729  740129  657719  757926  641905  645646
>> 520666  586708
>>     ch21    ch22     chX     chY     chMT
>>   338254  331060  529608   67438     624
>>      
>>> ch22snps<- getSNPlocs("ch22")
>>> ch22snps[1:5, ]
>>>        
>>    RefSNP_id alleles_as_ambig      loc
>> 1  56342815                K 16050353
>> 2   7288968                S 16050994
>> 3   6518357                M 16051107
>> 4   7292503                R 16051209
>> 5   6518368                Y 16051241
>>      
>>>
>>> makeGRangesFromRefSNPids<- function(myids)
>>>        
>> + {
>> +      ans_seqnames<- character(length(myids))
>> +      ans_seqnames[]<- "unknown"
>> +      ans_locs<- integer(length(myids))
>> +      for (seqname in names(getSNPcount())) {
>> +          locs<- getSNPlocs(seqname)
>> +          ids<- paste("rs", locs$RefSNP_id, sep="")
>> +          myrows<- match(myids, ids)
>> +          ans_seqnames[!is.na(myrows)]<- seqname
>> +          ans_locs[!is.na(myrows)]<- locs$loc[myrows]
>> +      }
>> +      GRanges(seqnames=ans_seqnames,
>> +              IRanges(start=ans_locs, width=1),
>> +              RefSNP_id=myids)
>> + }
>>      
>>>
>>> myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
>>>        
>> "rs10915459", "rs16838750", "rs12128230", "rs999999999")
>>      
>>> mysnps<- makeGRangesFromRefSNPids(myids)
>>>        
>> Warning message:
>> In ans_locs[!is.na(myrows)]<- locs$loc[myrows] :
>>    number of items to replace is not a multiple of replacement length
>>      
>>> mysnps  # a GRanges object with 1 SNP per row
>>>        
>>   GRangeswith 8 ranges and 1 elementMetadata value
>>      seqnames             ranges strand |   RefSNP_id
>>         <Rle>           <IRanges>   <Rle>  |<character>
>> [1]      ch1 [2195117, 2195117]      * |   rs7547453
>> [2]      ch1 [2291680, 2291680]      * |   rs2840542
>> [3]      ch1 [3256108, 3256108]      * |   rs1999527
>> [4]      ch1 [3577321, 3577321]      * |   rs4648545
>> [5]      ch1 [4230463, 4230463]      * |  rs10915459
>> [6]      ch1 [4404344, 4404344]      * |  rs16838750
>> [7]      ch1 [4501911, 4501911]      * |  rs12128230
>> [8]  unknown [      0,       0]      * | rs999999999
>> seqlengths
>>       ch1 unknown
>>        NA      NA
>>      
>>> txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
>>>        
>>   Downloadthe refGene table ... OK
>>   Downloadthe refLink table ... OK
>>   Extractthe 'transcripts' data frame ... OK
>>   Extractthe 'splicings' data frame ... OK
>>   Downloadand preprocess the 'chrominfo' data frame ... OK
>>   Preparethe 'metadata' data frame ... OK
>>   Makethe TranscriptDb object ... OK
>> Il y a eu 50 avis ou plus (utilisez warnings() pour voir les 50 premiers)
>>      
>>> txdb
>>>        
>>   TranscriptDbobject:
>> | Db type: TranscriptDb
>> | Data source: UCSC
>> | Genome: hg19
>> | UCSC Table: refGene
>> | Type of Gene ID: Entrez Gene ID
>> | Full dataset: yes
>> | transcript_nrow: 38098
>> | exon_nrow: 230201
>> | cds_nrow: 204683
>> | Db created by: GenomicFeatures package from Bioconductor
>> | Creation time: 2011-01-06 11:55:44 -0500 (Thu, 06 Jan 2011)
>> | GenomicFeatures version at creation time: 1.2.3
>> | RSQLite version at creation time: 0.9-4
>> | DBSCHEMAVERSION: 1.0
>>      
>>> #voir pour exonsBy() à la place de tx
>>> txExon = exonsBy(txdb, by= "tx")
>>> txExon
>>>        
>>   GRangesListof length 38098
>> $1
>> GRanges with 25 ranges and 3 elementMetadata values
>>       seqnames               ranges strand
>> |   exon_id   exon_name exon_rank
>>          <Rle>             <IRanges>   <Rle>    |<integer>  <character>
>> <integer>
>>   [1]     chr1 [66999825, 67000051]      +   |         1
>> NA         1
>>   [2]     chr1 [67091530, 67091593]      +   |         2
>> NA         2
>>   [3]     chr1 [67098753, 67098777]      +   |         3
>> NA         3
>>   [4]     chr1 [67101627, 67101698]      +   |         4
>> NA         4
>>   [5]     chr1 [67105460, 67105516]      +   |         5
>> NA         5
>>   [6]     chr1 [67108493, 67108547]      +   |         6
>> NA         6
>>   [7]     chr1 [67109227, 67109402]      +   |         7
>> NA         7
>>   [8]     chr1 [67126196, 67126207]      +   |         8
>> NA         8
>>   [9]     chr1 [67133213, 67133224]      +   |         9
>> NA         9
>>   ...      ...                  ...    ... ...       ...
>> ...       ...
>> [17]     chr1 [67155873, 67155999]      +   |        17
>> NA        17
>> [18]     chr1 [67161117, 67161176]      +   |        18
>> NA        18
>> [19]     chr1 [67184977, 67185088]      +   |        19
>> NA        19
>> [20]     chr1 [67194947, 67195102]      +   |        20
>> NA        20
>> [21]     chr1 [67199431, 67199563]      +   |        21
>> NA        21
>> [22]     chr1 [67205018, 67205220]      +   |        22
>> NA        22
>> [23]     chr1 [67206341, 67206405]      +   |        23
>> NA        23
>> [24]     chr1 [67206955, 67207119]      +   |        24
>> NA        24
>> [25]     chr1 [67208756, 67210768]      +   |        25
>> NA        25
>> ...
>> <38097 more elements>
>>
>> seqlengths
>>                    chr1                  chr2 ... chr18_gl000207_random
>>               249250621             243199373 ...                  4262
>>      
>>> tx<- transcripts(txdb, columns=c("tx_id", "tx_name", "gene_id"))
>>> tx  # a GRanges object with 1 transcript per row
>>>        
>>   GRangeswith 38098 ranges and 3 elementMetadata values
>>          seqnames               ranges strand   |     tx_id      tx_name
>>             <Rle>             <IRanges>   <Rle>    |<integer>   <character>
>>      [1]     chr1     [ 69091,  70008]      +   |      1021 NM_001005484
>>      [2]     chr1     [323892, 328581]      +   |      1023    NR_028327
>>      [3]     chr1     [323892, 328581]      +   |      1024    NR_028322
>>      [4]     chr1     [323892, 328581]      +   |      1025    NR_028325
>>      [5]     chr1     [367659, 368597]      +   |      1022 NM_001005277
>>      [6]     chr1     [367659, 368597]      +   |      1026 NM_001005221
>>      [7]     chr1     [367659, 368597]      +   |      1027 NM_001005224
>>      [8]     chr1     [763064, 789740]      +   |       174    NR_015368
>>      [9]     chr1     [861121, 879961]      +   |      1035    NM_152486
>>      ...      ...                  ...    ... ...       ...          ...
>> [38090]     chrY [27177050, 27198251]      -   |     18991    NM_004678
>> [38091]     chrY [27177050, 27198251]      -   |     18992 NM_001002760
>> [38092]     chrY [27177050, 27198251]      -   |     18993 NM_001002761
>> [38093]     chrY [27209230, 27246039]      -   |     18994    NR_001525
>> [38094]     chrY [27209230, 27246039]      -   |     18995    NR_002177
>> [38095]     chrY [27209230, 27246039]      -   |     18996    NR_002178
>> [38096]     chrY [27329790, 27330920]      -   |     18997    NR_001526
>> [38097]     chrY [27329790, 27330920]      -   |     18998    NR_002179
>> [38098]     chrY [27329790, 27330920]      -   |     18999    NR_002180
>>                            gene_id
>>          <CompressedCharacterList>
>>      [1]                     79501
>>      [2]                 100133331
>>      [3]                 100132287
>>      [4]                 100132062
>>      [5]                     81399
>>      [6]                    729759
>>      [7]                     26683
>>      [8]                    643837
>>      [9]                    148398
>>      ...                       ...
>> [38090]                      9083
>> [38091]                    442867
>> [38092]                    442868
>> [38093]                    114761
>> [38094]                    474150
>> [38095]                    474149
>> [38096]                    252949
>> [38097]                    474152
>> [38098]                    474151
>> seqlengths
>>                    chr1                  chr2 ... chr18_gl000207_random
>>               249250621             243199373 ...                  4262
>>      
>>> seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
>>> map<- as.matrix(findOverlaps(mysnps, tx))
>>>        
>> Message d'avis :
>> In .local(query, subject, maxgap, minoverlap, type, select, ...) :
>>    No seqnames from the 'query' and 'subject' were identical
>>      
> It looks like the seqnames in mysnps were not renamed to the 'chr'
> prefix. Prior to running findOverlaps, check the seqnames in both
> objects to make sure the ranges you are interested in mapping have
> identical seqnames,
>
>      >  unique(seqnames(mysnps))
>      [1] chr1    unknown
>      Levels: chr1 unknown
>
>
>      >  unique(seqnames(tx))
>       [1] chr1                  chr10                 chr11
>       [4] chr12                 chr13                 chr14
>        .... and many more.
>
> Both objects have 'chr1' and when we run findOverlaps,
>
>      >  map<- as.matrix(findOverlaps(mysnps, tx))
>      Warning message:
>      In .local(query, subject, maxgap, minoverlap, type, select, ...) :
>      Only some seqnames from 'query' and 'subject' were not identical
>
> This warning makes sense because mysnps only has 'chr1' and 'unknown' to
> map to tx but there are many more seqnames in tx. Additionally,  tx does
> not have an 'unknown' seqname.
>
>      >  map
>           query subject
>      [1,]     1      49
>      [2,]     2    1944
>      [3,]     3      62
>      [4,]     3      63
>      [5,]     4      67
>
>
>    
>>> mapExon<- as.matrix(findOverlaps(mysnps, txExon))
>>>        
>> Message d'avis :
>> In .local(query, subject, maxgap, minoverlap, type, select, ...) :
>>    No seqnames from the 'query' and 'subject' were identical
>>      
>>> mapped_genes<- values(tx)$gene_id[map[, 2]]
>>> mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
>>>        
>> elementLengths(mapped_genes))
>>      
>>> snp2gene<- unique(data.frame(snp_id=mapped_snps,
>>>        
>> gene_id=unlist(mapped_genes)))
>>      
>>> rownames(snp2gene)<- NULL
>>> snp2gene[1:4, ]
>>>        
>>       snp_id gene_id
>> NA<NA>     <NA>
>> NA.1<NA>     <NA>
>> NA.2<NA>     <NA>
>> NA.3<NA>     <NA>
>>
>> On our super computer, I can't update manualy the package.  They
>> are suposed to update automatically once per month according to our
>> informactician.  The result of sessionInfo() is :
>>
>>      
>>> sessionInfo()
>>>        
>> R version 2.12.0 (2010-10-15)
>> Platform: x86_64-pc-linux-gnu (64-bit)
>> locale:
>>   [1] LC_CTYPE=en_CA.UTF-8       LC_NUMERIC=C
>>   [3] LC_TIME=en_CA.UTF-8        LC_COLLATE=en_CA.UTF-8
>>   [5] LC_MONETARY=C              LC_MESSAGES=en_CA.UTF-8
>>   [7] LC_PAPER=en_CA.UTF-8       LC_NAME=C
>>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
>> [11] LC_MEASUREMENT=en_CA.UTF-8 LC_IDENTIFICATION=C
>>   attachedbase packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>   otherattached packages:
>> [1] GenomicFeatures_1.2.3
>> [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2
>> [3] GenomicRanges_1.2.2
>> [4] IRanges_1.8.7
>>   loadedvia a namespace (and not attached):
>>   [1] Biobase_2.10.0     biomaRt_2.6.0      Biostrings_2.18.2  BSgenome_1.18.2
>>   [5] DBI_0.2-5          RCurl_1.5-0
>> RSQLite_0.9-4      rtracklayer_1.10.6
>>   [9] tools_2.12.0       XML_3.2-0
>>
>> On my computer , something stragge happen.  The
>> package get unstalled when I run
>> |   source("http://bioconductor.org/biocLite.R")
>>     update.packages(repos=biocinstallRepos(), ask=FALSE, checkBuilt=TRUE)
>> |
>>
>> And when I try to reinstall SNPlocs.Hsapiens.dbSNP.20101109, I now get
>> an error message...
>>
>>      
>>> source("http://www.bioconductor.org/biocLite.R")
>>>        
>>   BioC_mirror= http://www.bioconductor.org
>> Change using chooseBioCmirror().
>>      
>>> biocLite("SNPlocs.Hsapiens.dbSNP.20101109")
>>>        
>> Using R version 2.12.0, biocinstall version 2.7.4.
>>   InstallingBioconductor version 2.7 packages:
>> [1] "SNPlocs.Hsapiens.dbSNP.20101109"
>>   Pleasewait...
>> Message d'avis :
>> In getDependencies(pkgs, dependencies, available, lib) :
>>    package ‘SNPlocs.Hsapiens.dbSNP.20101109’ is not available
>>
>>
>>
>>
>>
>>
>>
>> But for sessionInfo, it's :
>>
>>
>>
>>      
>>> sessionInfo()
>>>        
>> R version 2.12.0 (2010-10-15)
>> Platform: i386-pc-mingw32/i386 (32-bit)
>>
>> locale:
>> [1] LC_COLLATE=French_Canada.1252  LC_CTYPE=French_Canada.1252
>> [3] LC_MONETARY=French_Canada.1252 LC_NUMERIC=C
>> [5] LC_TIME=French_Canada.1252
>>
>>   attachedbase packages:
>> [1] stats     graphics  grDevices utils     datasets  methods   base
>>
>>   otherattached packages:
>> [1] GenomicFeatures_1.2.3 GenomicRanges_1.2.2   IRanges_1.8.8
>>
>>   loadedvia a namespace (and not attached):
>> [1] Biobase_2.10.0     biomaRt_2.6.0      Biostrings_2.18.2  BSgenome_1.18.2
>>
>> [5] DBI_0.2-5          RCurl_1.5-0.1      RSQLite_0.9-4
>> rtracklayer_1.10.6
>> [9] XML_3.2-0.2
>>
>>
>>
>> Now I have a super computer that can't run the program because of
>> a subscript out of bounds and a personal computer than have a missing
>> library...  So now nothing work:(  Other problem : why when
>> I install again IRanges I got a version older than your for exemple?
>>      
> The code should run on the linux machine (super computer) once you
> address the seqnames problem above. You are having a problem installing
> SNPlocs.Hsapiens.dbSNP.20101109 on your Windows pc because a Windows
> binary of this package is not currently available. We should have this
> available and I will let you know as soon as it is ready. Sorry about that.
>
> I had a newer version of IRanges in my last email because I was using
> R-2.13, the devel branch. When using R-2.12 I have,
>
>    
>> sessionInfo()
>>      
> R version 2.12.0 Patched (2010-10-18 r53360)
> Platform: x86_64-unknown-linux-gnu (64-bit)
>
> locale:
>   [1] LC_CTYPE=en_US.UTF-8       LC_NUMERIC=C
>   [3] LC_TIME=en_US.UTF-8        LC_COLLATE=en_US.UTF-8
>   [5] LC_MONETARY=C              LC_MESSAGES=en_US.UTF-8
>   [7] LC_PAPER=en_US.UTF-8       LC_NAME=C
>   [9] LC_ADDRESS=C               LC_TELEPHONE=C
> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
>
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> other attached packages:
> [1] GenomicFeatures_1.2.3
> [2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2
> [3] GenomicRanges_1.2.2
> [4] IRanges_1.8.8
>
> loaded via a namespace (and not attached):
>   [1] Biobase_2.10.0     biomaRt_2.6.0      Biostrings_2.18.2
> BSgenome_1.18.2
>   [5] DBI_0.2-5          RCurl_1.5-0        RSQLite_0.9-4
> rtracklayer_1.10.6
>   [9] tools_2.12.0       XML_3.2-0
>
> Valerie
>    
>>
>> Simon Noël
>> CdeC
>>      
>
> 	[[alternative HTML version deleted]]
>
>    
>
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor