[R] speed up subsetting with certain conditions
Martin Morgan
mtmorgan at fhcrc.org
Thu Jan 13 00:12:19 CET 2011
On 1/12/2011 2:52 PM, Duke wrote:
> Hi folks,
>
> I am working on a project that requires subsetting of a found file
> based on some known file. The known file contains several lines like
> below:
>
> chr1 3237546 3237547 rs52310428 0 +
> chr1 3237549 3237550 rs52097582 0 +
> chr2 4513326 4513327 rs29769280 0 +
> chr2 4513337 4513338 rs33286009 0 +
>
> where the first column can be chr2, chr1, chr12 etc... The second and
> third are numbers (cordinates). The found file contains lines like:
>
> chr1 3213435 G C
> chr1 3237547 T C
> chr1 3237549 G T
> chr2 4513326 A G
> chr2 4513337 C G
>
> where the first column, again, can be chr1, chr2, chr12 etc... and the
> second is a number. What I have to do is to separate the found file to
> two files: one (foundY) contains lines that have the same first column
> and the second column in range of the two columns 2 and 3 of any line
> of known file, and one (foundN) contains lines that do not meet the
> previous condition. For the two examples above, foundN will be the
> first line, and foundY will be the next 4 lines.
>
> What I came up with is this algorithm:
>
> * get the uniq item in the first column of found file (chr1, chr2,
> chr12, chr13 etc...)
> * for each of the uniq item, set subset of the known file and the
> found file that have same first column, then scanning each item in the
> known subset to see if any line meets any condition
>
> The code is like below:
>
> ########## CODE START###########
> # import known and found files to data frames
> known <- read.table( "known.txt", sep="\t", header=FALSE )
> found <- read.table( "found.txt", sep="\t", header=FALSE, fill=TRUE )
>
> # get the uniq item in first column of found file
> found.Chr <- as.character(found[!duplicated(found[[1]]),1])
>
> # create two empty result data frames
> foundN <- found[0,]
> foundY <- found[0,]
>
> # scan for each of the uniq items
> for ( iChr in found.Chr ) {
> # subset of known and found with specific item
> found.iChr <- found[found[[1]]==iChr,]
> known.iChr <- known[known[[1]]==iChr,]
>
> # scan through all found subset items
> if ( nrow(known.iChr)>0 ) {
> for ( i in 1:nrow(found.iChr) ) {
> if ( nrow(known.iChr[known.iChr[[3]]>=found.iChr[i,2] &
> known.iChr[[2]]<=found.iChr[i,2],])==0 ) {
> foundN <- rbind( foundN, found.iChr[i,] )
> } else {
> foundY <- rbind( foundN, found.iChr[i,] )
> }
> }
> }
> }
>
> ########## CODE END###########
>
> The code works well, but I tested it for only small known and found
> files. When trying with larger files (the known file can contains ~ 15
> million lines, the found ~ 15k lines), it takes like hrs to run.
>
> I want to speed up the process, and I believe there must be a better
> algorithm to do this with R. My questions are:
>
> * any body has a better algorithm or comments or suggestion?
The Bioconductor project has many tools for dealing with
sequence-related data. With the data
k <- read.table(textConnection(
"chr1 3237546 3237547 rs52310428 0 +
chr1 3237549 3237550 rs52097582 0 +
chr2 4513326 4513327 rs29769280 0 +
chr2 4513337 4513338 rs33286009 0 +"))
f <- read.table(textConnection(
"chr1 3213435 G C
chr1 3237547 T C
chr1 3237549 G T
chr2 4513326 A G
chr2 4513337 C G"))
One might use the GenomicRanges package as
library(GenomicRanges)
kgr <- with(k, GRanges(V1, IRanges(V2, V3, names=V4), V6, score=V5))
fgr <- with(f, GRanges(V1, IRanges(V2, width=1), V3=V3, V4=V4))
olaps <- findOverlaps(fgr, kgr)
idx <- countOverlaps(fgr, kgr) != 0
resulting in
> idx
[1] FALSE TRUE TRUE TRUE TRUE
This will be fast.
One could write foundY with as.data.frame(fgr[idx]) (maybe a little
editing) but likely one would want to stay in R / Bioc and do something
more interesting...
See
http://bioconductor.org/install/index.html
Martin
> * I read (google) that matrices work faster than data frame. Can I use
> matrices for this case? (is matrices for numbers only?)
> * I read (google) that I should avoid rbind, and prelocate data frame
> for faster speed. How would I do that in this case?
>
> Thank you very much in advance,
>
> Bests,
>
> D.
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide
> http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
Dr. Martin Morgan, PhD
Fred Hutchinson Cancer Research Center
1100 Fairview Ave. N.
PO Box 19024 Seattle, WA 98109
More information about the R-help
mailing list