[R] need any advice for code optimization.
Rich FitzJohn
rich.fitzjohn at gmail.com
Mon Apr 4 11:29:01 CEST 2005
Hi,
One fruitful avenue for optimisation is to vectorise wherever possible
and avoid for-loops.
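To give a toy illustration of what I mean (nothing to do with your
filtering problem as such), squaring a vector in a loop and in a single
vectorised call give the same answer, but the loop pays R's
interpretation overhead on every element:

x <- runif(100000)
## Loop version: each element handled separately by interpreted code.
sq.loop <- numeric(length(x))
for ( i in seq(along=x) )
  sq.loop[i] <- x[i]^2
## Vectorised version: one call; the real work happens in compiled code.
sq.vec <- x^2
all.equal(sq.loop, sq.vec)  # TRUE, but the vectorised form is much faster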
Something like the code below might be a good place to start.
=============
## Generate a thousand rows of data
cube.half.size <- 2
mult.sigma <- 2
n <- 1000
whole <- data.frame(a=runif(n), b=runif(n), c=rnorm(n), d=runif(n))*10
cube.look <- function() {
  ## For row x, flag all rows falling inside the cube around it; if the
  ## cube is non-empty, return the indices of those rows whose value of
  ## c lies more than mult.sigma SDs from the cube's mean of c.
  f <- function(x) {
    with(whole, {
      i <- (abs(a - a[x]) < cube.half.size &
            abs(b - b[x]) < cube.half.size &
            abs(d - d[x]) < cube.half.size)
      if ( any(i) ) {
        subdat <- c[i]
        which(i)[abs(subdat - mean(subdat)) > sd(subdat)*mult.sigma]
      } else NULL
    })
  }
  ## Apply f to every row, then drop the union of flagged rows.
  td <- lapply(seq(length=n), f)
  whole[-unique(unlist(td)),]
}
## And wrap the original in a function for comparison:
cube.look.orig <- function() {
  to.drop <- data.frame()
  for(i in 1:length(whole$c)){
    pv <- subset(whole, abs(a-whole$a[i])<cube.half.size &
                        abs(b-whole$b[i])<cube.half.size &
                        abs(d-whole$d[i])<cube.half.size)
    if(length(pv$c)>1){
      mean.c <- mean(pv$c)
      sd.c <- sd(pv$c)
      td <- subset(pv, abs(c-mean.c)>sd.c*mult.sigma)
      if(length(td$c)>0){
        td.index <- which(row.names(td) %in% row.names(to.drop))
        to.drop <- rbind(to.drop, if(length(td.index)>0) td[-td.index,] else td)
        if(length(td.index)!=length(td$c))
          print(c("i=",i,"Points to drop: ",length(to.drop$c)))
      }
    }
  }
  td.orig <<- to.drop
  ## This does not subset the way you want:
  ## whole[-which(row.names(to.drop) %in% row.names(whole)),]
  whole[-as.integer(row.names(to.drop)),]
}
## Time how long it takes to run over the test data.frame (10 runs):
t.new <- t(replicate(10, system.time(cube.look())))
t.orig <- t(replicate(10, system.time(cube.look.orig())))
## On my alpha, the version using lapply() takes 4.9 seconds of CPU
## time (avg 10 runs), while the original version takes 23.3 seconds -
## so we're 4.8 times faster.
apply(t.orig, 2, mean)/apply(t.new, 2, mean)
## And the results are the same.
r.new <- cube.look()
r.orig <- cube.look.orig()
identical(r.new, r.orig)
==============
The above code could almost certainly be tweaked further (replacing
which() with a stripped-down version would probably save some time,
since profiling indicates about 10% of the run time is spent there).
Using with() saved another 10% or so, compared with indexing a..d
(e.g. whole$a) on every iteration. However, trying a completely
different approach is more likely to yield bigger savings. mapply()
might be one to try, though I have a feeling it is just a wrapper
around lapply(). The "Writing R Extensions" manual has a section that
deals with profiling ("Tidying and profiling R code"), and it may
touch on optimisation as well.
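(One such stripped-down replacement for which(), assuming the logical
vector i has no NAs, is seq(along=i)[i], which gives the same indices
without the names.) If you want to run the profiler yourself, here is
a minimal sketch using Rprof() and summaryRprof() (the output file
name is arbitrary; see ?Rprof):

Rprof("cube-look.out")         # start collecting profiling samples
invisible(cube.look())         # the call we want to profile
Rprof(NULL)                    # stop profiling
summaryRprof("cube-look.out")  # breakdown of time spent per function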
Cheers,
Rich
On Apr 4, 2005 6:50 PM, Wladimir Eremeev <wl at eimb.ru> wrote:
> Dear colleagues,
>
> I have the following code, which 'filters' a data set.
>
> It works on the data frame 'whole' with four numeric columns: a,b,d, and c.
> Every row in the data frame is considered as a point in 3-D space.
> Variables a,b, and d are the point's coordinates, and c is its value.
> This code looks at every point, builds a cube 'centered' at this
> point, selects the set of points inside this cube,
> calculates mean and SD of their values,
> and drops points whose values differ from the mean by more than 2 SD.
>
> Here is the code.
>
> =======
> # initialization
> cube.half.size<-2 # half size of a cube to be built around every point
>
> mult.sigma<-2 # we will drop every point whose value differs
> # from the mean by more than mult.sigma * SD
>
> to.drop<-data.frame() # the list of points to drop.
>
> for(i in 1:length(whole$c)){ # look at every point...
>   pv<-subset(whole,abs(a-whole$a[i])<cube.half.size & # make the subset...
>              abs(b-whole$b[i])<cube.half.size &
>              abs(d-whole$d[i])<cube.half.size);
>   if(length(pv$c)>1){ # if the subset holds more than the point itself, then
>     mean.c<-mean(pv$c) # calculate mean and SD
>     sd.c<-sd(pv$c)
>
>     # make a list of points to drop from the current subset
>     td<-subset(pv,abs(c-mean.c)>sd.c*mult.sigma)
>     if(length(td$c)>0){
>
>       # check which of these points are already in the list to drop
>       td.index<-which(row.names(td) %in% row.names(to.drop))
>
>       # and extend the list of points to drop
>       to.drop<-rbind(to.drop,if(length(td.index)>0) td[-td.index,] else td)
>
>       # print out a message showing we're alive (these messages will
>       # not appear regularly; that's OK)
>       if(length(td.index)!=length(td$c))
>         print(c("i=",i,"Points to drop: ",length(to.drop$c)))
>     }
>   }
> }
>
> # make a new data set without dropped points.
> whole.flt.3<-whole[-which(row.names(to.drop) %in% row.names(whole)),]
> =======
>
> The problem is that the 'whole' data set is large (more than 100000
> rows), and the script runs for several hours.
> The running time gets even longer if I build a sphere instead of a
> cube.
>
> I would like to optimize the code so that it runs faster.
> Is that possible?
> Would sorting the data first help?
> Thank you for your attention and any feedback.
>
> --
> Best regards
> Wladimir Eremeev mailto:wl at eimb.ru
>
> ==========================================================================
> Research Scientist, PhD
> Russian Academy of Sciences
>
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide! http://www.R-project.org/posting-guide.html
>
--
Rich FitzJohn
rich.fitzjohn <at> gmail.com | http://homepages.paradise.net.nz/richa183
You are in a maze of twisty little functions, all alike