[BioC] RE : RE : RE : maping SNPs
Hervé Pagès
hpages at fhcrc.org
Wed Jan 12 03:05:54 CET 2011
Hi Simon,
First thanks to Valerie for providing the Windows binary of
SNPlocs.Hsapiens.dbSNP.20101109.
More below...
On 01/11/2011 01:18 PM, Simon Noël wrote:
>
> Hi,
>
>
> You say that one of the warning was because I was only having SNPs from
> chr1. I decided to add the 2 first SNPs I have from chrd but I get a new
> error...
>
> > myids<- c("rs7547453", "rs2840542", "rs1999527", "rs4648545",
> "rs10915459", "rs16838750", "rs12128230", "rs4637157", "rs11900053",
> "rs999999999")
> > mysnps<- makeGRangesFromRefSNPids(myids)
> Errorin solveUserSEW0(start = start, end = end, width = width) :
> solving row 8: range cannot be determined from the supplied arguments
> (too many NAs)
> In addition: Warning messages:
> 1: In ans_locs[!is.na(myrows)]<- locs$loc[myrows] :
> number of items to replace is not a multiple of replacement length
> 2: In ans_locs[!is.na(myrows)]<- locs$loc[myrows] :
> number of items to replace is not a multiple of replacement length
>
> But if I only try with the 2 first SNPs on chr2, I have
>
> > myids<- c("rs4637157", "rs11900053", "rs999999999")
> > mysnps<- makeGRangesFromRefSNPids(myids)
> Warning message:
> In ans_locs[!is.na(myrows)]<- locs$loc[myrows] :
> number of items to replace is not a multiple of replacement length
> > mysnps
> GRangeswith 3 ranges and 1 elementMetadata value
> seqnames ranges strand | RefSNP_id
> <Rle> <IRanges> <Rle> |<character>
> [1] ch2 [29443, 29443] * | rs4637157
> [2] ch2 [36787, 36787] * | rs11900053
> [3] unknown [ 0, 0] * | rs999999999
> seqlengths
> ch2 unknown
> NA NA
>
> So is that mean that I will have to go chr by chr and split my big file?
No it just means that my initial makeGRangesFromRefSNPids() was
broken :-/ Please try with this one instead:
makeGRangesFromRefSNPids <- function(myids, verbose=FALSE)
{
ans_seqnames <- character(length(myids))
ans_seqnames[] <- "unknown"
ans_locs <- integer(length(myids))
for (seqname in names(getSNPcount())) {
if (verbose)
cat("Processing ", seqname, " SNPs ...\n", sep="")
locs <- getSNPlocs(seqname)
ids <- paste("rs", locs$RefSNP_id, sep="")
myrows <- match(myids, ids)
hit_idx <- !is.na(myrows)
ans_seqnames[hit_idx] <- seqname
ans_locs[hit_idx] <- locs$loc[myrows[hit_idx]]
}
GRanges(seqnames=ans_seqnames,
IRanges(start=ans_locs, width=1),
RefSNP_id=myids)
}
Since this function ends up loading all the SNPs in memory (and there
are 23 millions of them) you need a machine with at least 2.5 or 3 GB
of RAM (although makeGRangesFromRefSNPids() could easily be improved
to use less memory).
>
> Now for the problem of changing ch1 to chr1
>
> > seqnames(mysnps)
> 'factor' Rle of length 8 with 2 runs
> Lengths: 7 1
> Values : ch1 unknown
> Levels(2): ch1 unknown
> > seqnames(mysnps)<- sub("ch", "chr", seqnames(mysnps))
> > seqnames(mysnps)
> 'factor' Rle of length 8 with 2 runs
> Lengths: 7 1
> Values : chr1 unknown
> Levels(2): chr1 unknown
> > map<- as.matrix(findOverlaps(mysnps, tx))
> Message d'avis :
> In .local(query, subject, maxgap, minoverlap, type, select, ...) :
> Only some seqnames from 'query' and 'subject' were not identical
> > mapExon<- as.matrix(findOverlaps(mysnps, txExon))
> Message d'avis :
> In .local(query, subject, maxgap, minoverlap, type, select, ...) :
> Only some seqnames from 'query' and 'subject' were not identical
This warning is caused by the presence of the fake "unknown" sequence
in 'mysnps' so you can safely ignore it. I agree that it might not be
totally clear what this warning is about or why we need such warning in
the first place. In BioC devel, the warning message has been slightly
modified (in the hope that it would be a little bit more explicit) to
something like:
Warning message:
In .Seqinfo.mergexy(x, y) :
Each of the 2 combined objects has sequence levels not in the other:
- in 'x': unknown
- in 'y': chr3, chr4, chr5, chr6, chr7, chrX, chr8, chr9...
Make sure to always combine/compare objects based on the same reference
genome (use suppressWarnings() to suppress this warning).
> >
> > mapped_genes<- values(tx)$gene_id[map[, 2]]
> > mapped_snps<- rep.int(values(mysnps)$RefSNP_id[map[, 1]],
> elementLengths(mapped_genes))
> > snp2gene<- unique(data.frame(snp_id=mapped_snps,
> gene_id=unlist(mapped_genes)))
> > rownames(snp2gene)<- NULL
> > snp2gene[1:4, ]
> snp_id gene_id
> 1 rs7547453 6497
> 2 rs2840542 79906
> 3 rs1999527 63976
> 4 rs4648545 7161
> So now it's working on my computer:) but I am only able to do SNPs from one
> chromosome as I say.
>
> On the super computer, it still doesn't work and on my
> computer, it still taking a lot of time. What isn't working is
>
> > txdb<- makeTranscriptDbFromUCSC(genome="hg19", tablename="refGene")
> Downloadthe refGene table ... OK
> Downloadthe refLink table ... OK
> Extractthe 'transcripts' data frame ... OK
> Extractthe 'splicings' data frame ... OK
> Downloadand preprocess the 'chrominfo' data frame ... OK
> Preparethe 'metadata' data frame ... OK
> Makethe TranscriptDb object ... Error in .writeMetadataTable(conn,
> metadata) : subscript out of bounds
> In addition: There were 50 or more warnings (use warnings() to see the first
> 50)
Probably because some of the packages need to be updated on this machine
(as you are already aware). See my sessionInfo below.
Cheers,
H.
R version 2.12.0 (2010-10-15)
Platform: x86_64-unknown-linux-gnu (64-bit)
locale:
[1] LC_CTYPE=en_US.utf8 LC_NUMERIC=C
[3] LC_TIME=en_US.utf8 LC_COLLATE=en_US.utf8
[5] LC_MONETARY=C LC_MESSAGES=en_US.utf8
[7] LC_PAPER=en_US.utf8 LC_NAME=C
[9] LC_ADDRESS=C LC_TELEPHONE=C
[11] LC_MEASUREMENT=en_US.utf8 LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] GenomicFeatures_1.2.3
[2] SNPlocs.Hsapiens.dbSNP.20101109_0.99.2
[3] GenomicRanges_1.2.3
[4] IRanges_1.8.8
loaded via a namespace (and not attached):
[1] Biobase_2.10.0 biomaRt_2.6.0 Biostrings_2.18.2
BSgenome_1.18.2
[5] DBI_0.2-5 RCurl_1.5-0 RSQLite_0.9-4
rtracklayer_1.10.6
[9] XML_3.2-0
>
>
> Simon Noël
> CdeC
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at r-project.org
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
--
Hervé Pagès
Program in Computational Biology
Division of Public Health Sciences
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N, M2-B876
P.O. Box 19024
Seattle, WA 98109-1024
E-mail: hpages at fhcrc.org
Phone: (206) 667-5791
Fax: (206) 667-1319
More information about the Bioconductor
mailing list