[BioC] RE : RE : RE : maping SNPs

Wed Jan 12 03:05:54 CET 2011

Hi Simon,

First thanks to Valerie for providing the Windows binary of
SNPlocs.Hsapiens.dbSNP.20101109.

More below...

On 01/11/2011 01:18 PM, Simon Noël wrote:
>
>     Hi,
>
>
>     You say that one of the warning was because I was only having SNPs from
>     chr1.  I decided to add the 2 first SNPs I have from chrd but I get a new
>     error...
>
>     >  myids<-  c("rs7547453",  "rs2840542",  "rs1999527",  "rs4648545",
>     "rs10915459",  "rs16838750",  "rs12128230", "rs4637157", "rs11900053",
>     "rs999999999")
>     >  mysnps<- makeGRangesFromRefSNPids(myids)
>      Errorin solveUserSEW0(start = start, end = end, width = width) :
>       solving row 8: range cannot be determined from the supplied arguments
>     (too many NAs)
>     In addition: Warning messages:
>     1: In ans_locs[!is.na(myrows)]<- locs$loc[myrows] :
>       number of items to replace is not a multiple of replacement length
>     2: In ans_locs[!is.na(myrows)]<- locs$loc[myrows] :
>       number of items to replace is not a multiple of replacement length
>
>     But if I only try with the 2 first SNPs on chr2, I have
>
>     >  myids<- c("rs4637157", "rs11900053", "rs999999999")
>     >  mysnps<- makeGRangesFromRefSNPids(myids)
>     Warning message:
>     In ans_locs[!is.na(myrows)]<- locs$loc[myrows] :
>       number of items to replace is not a multiple of replacement length
>     >  mysnps
>      GRangeswith 3 ranges and 1 elementMetadata value
>         seqnames         ranges strand |   RefSNP_id
>            <Rle>       <IRanges>   <Rle>  |<character>
>     [1]      ch2 [29443, 29443]      * |   rs4637157
>     [2]      ch2 [36787, 36787]      * |  rs11900053
>     [3]  unknown [    0,     0]      * | rs999999999
>     seqlengths
>          ch2 unknown
>           NA      NA
>
>     So is that mean that I will have to go chr by chr and split my big file?

No it just means that my initial makeGRangesFromRefSNPids() was
broken :-/  Please try with this one instead:

makeGRangesFromRefSNPids <- function(myids, verbose=FALSE)
{
     ans_seqnames <- character(length(myids))
     ans_seqnames[] <- "unknown"
     ans_locs <- integer(length(myids))
     for (seqname in names(getSNPcount())) {
         if (verbose)
             cat("Processing ", seqname, " SNPs ...\n", sep="")
         locs <- getSNPlocs(seqname)
         ids <- paste("rs", locs$RefSNP_id, sep="")
         myrows <- match(myids, ids)
         hit_idx <- !is.na(myrows)
         ans_seqnames[hit_idx] <- seqname
         ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]]
     }
     GRanges(seqnames=ans_seqnames,
             IRanges(start=ans_locs, width=1),
             RefSNP_id=myids)
}

Since this function ends up loading all the SNPs in memory (and there 
are 23 millions of them) you need a machine with at least 2.5 or 3 GB
of RAM (although makeGRangesFromRefSNPids() could easily be improved
to use less memory).

>
>     Now for the problem of changing ch1 to chr1
>
>     >  seqnames(mysnps)
>     'factor' Rle of length 8 with 2 runs
>       Lengths:       7       1
>       Values :     ch1 unknown
>     Levels(2): ch1 unknown
>     >  seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
>     >  seqnames(mysnps)
>     'factor' Rle of length 8 with 2 runs
>       Lengths:       7       1
>       Values :    chr1 unknown
>     Levels(2): chr1 unknown
>     >  map<- as.matrix(findOverlaps(mysnps, tx))
>     Message d'avis :
>     In .local(query, subject, maxgap, minoverlap, type, select, ...) :
>       Only some seqnames from 'query' and 'subject' were not identical
>     >  mapExon<- as.matrix(findOverlaps(mysnps, txExon))
>     Message d'avis :
>     In .local(query, subject, maxgap, minoverlap, type, select, ...) :
>       Only some seqnames from 'query' and 'subject' were not identical

This warning is caused by the presence of the fake "unknown" sequence
in 'mysnps' so you can safely ignore it. I agree that it might not be
totally clear what this warning is about or why we need such warning in
the first place. In BioC devel, the warning message has been slightly 
modified (in the hope that it would be a little bit more explicit) to
something like:

Warning message:
In .Seqinfo.mergexy(x, y) :
   Each of the 2 combined objects has sequence levels not in the other:
   - in 'x': unknown
   - in 'y': chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9...
   Make sure to always combine/compare objects based on the same reference
   genome (use suppressWarnings() to suppress this warning).

>     >
>     >  mapped_genes<- values(tx)$gene_id[map[, 2]]
>     >  mapped_snps<-    rep.int(values(mysnps)$RefSNP_id[map[,    1]],
>     elementLengths(mapped_genes))
>     >  snp2gene<-          unique(data.frame(snp_id=mapped_snps,
>     gene_id=unlist(mapped_genes)))
>     >  rownames(snp2gene)<- NULL
>     >  snp2gene[1:4, ]
>          snp_id gene_id
>     1 rs7547453    6497
>     2 rs2840542   79906
>     3 rs1999527   63976
>     4 rs4648545    7161
>     So now it's working on my computer:)  but I am only able to do SNPs from one
>     chromosome as I say.
>
>     On     the    super    computer, it still doesn't work    and    on my
>     computer, it still taking a lot of time.  What isn't working is
>
>     >  txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
>      Downloadthe refGene table ... OK
>      Downloadthe refLink table ... OK
>      Extractthe 'transcripts' data frame ... OK
>      Extractthe 'splicings' data frame ... OK
>      Downloadand preprocess the 'chrominfo' data frame ... OK
>      Preparethe 'metadata' data frame ... OK
>      Makethe TranscriptDb object  ... Error  in  .writeMetadataTable(conn,
>     metadata) : subscript out of bounds
>     In addition: There were 50 or more warnings (use warnings() to see the first
>     50)

Probably because some of the packages need to be updated on this machine
(as you are already aware). See my sessionInfo below.

Cheers,
H.

R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)

locale:
  [1] LC_CTYPE=en_US.utf8       LC_NUMERIC=C
  [3] LC_TIME=en_US.utf8        LC_COLLATE=en_US.utf8
  [5] LC_MONETARY=C             LC_MESSAGES=en_US.utf8
  [7] LC_PAPER=en_US.utf8       LC_NAME=C
  [9] LC_ADDRESS=C              LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base

other attached packages:
[1] GenomicFeatures_1.2.3
[2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2
[3] GenomicRanges_1.2.3
[4] IRanges_1.8.8

loaded via a namespace (and not attached):
[1] Biobase_2.10.0     biomaRt_2.6.0      Biostrings_2.18.2 
BSgenome_1.18.2
[5] DBI_0.2-5          RCurl_1.5-0        RSQLite_0.9-4 
rtracklayer_1.10.6
[9] XML_3.2-0

>
>
>     Simon NoÃ«l
>     CdeC
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor

-- 
Hervé Pagès

Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024

E-mail: hpages at fhcrc.org
Phone:  (206) 667-5791
Fax:    (206) 667-1319