[R] dramatic speed difference in lapply
jim holtman
jholtman at gmail.com
Fri Feb 26 22:18:42 CET 2010
On my computer your two examples seem to execute about the same:
> fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions,
+ ...){
+ lapplyFunctionRecurse <- function(cdata, level=1, ...){
+ if(level==1){
+
+
return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]],
drop=T),
+ function(x) lapplyFunctionRecurse(x, level+1, ...)))
+ } else if (level==length(pivotColumns)) {
+ #
+
return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T),
+ function(x, ...) listNameFunctions(data[x,], ...)))
+
return(lapply(split(cdata,data[cdata,pivotColumns[level]],
+ drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]],
+ data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T),
+ sum(data[cdata,"A"], na.rm=T))))
+ } else {
+ return(lapply(split(cdata,data[cdata,pivotColumns[level]],
+ drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
+ }
+ }
+ result = lapplyFunctionRecurse(data, ...)
+ matrix2 <- do.call('rbind', lapply(result, function(x)
+ do.call('rbind',x)))
+ return(matrix2)
+ }
>
> Rprof()
> dat <- data.frame(D=sample(32000:33000, 666000,
+ T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
> temp = proc.time(); ret = fedb.ddplyWrapper2Fast(dat, c("D", "Fid"),
+ function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));
> proc.time()-temp
user system elapsed
23.44 7.37 30.86
> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){
+ lapplyFunctionRecurse <- function(cdata, level=1, ...){
+ if(level==1){
+
+ return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
+ function(x) lapplyFunctionRecurse(x, level+1, ...)))
+ } else if (level==length(pivotColumns)) {
+ #this line is different. it essentially calls the
function you pass in
+ return(lapply(split(cdata,data[cdata,pivotColumns[level]],
+ drop=T), function(x, ...) listNameFunctions(data[x,], ...)))
+ } else {
+ return(lapply(split(cdata,data[cdata,pivotColumns[level]],
+ drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
+ }
+ }
+ result = lapplyFunctionRecurse(data, ...)
+ matrix2 <- do.call('rbind', lapply(result, function(x)
+ do.call('rbind',x)))
+ return(matrix2)
+ }
>
> dat <- data.frame(D=sample(32000:33000, 666000,
+ T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
+ function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));
> proc.time()-temp
user system elapsed
24.06 7.38 31.50
If you run Rprof, most of the time is being spent accessing the
dataframe. I would suggest that you convert the dataframe to a matrix
to get better performance. Here is what I saw in the Rprof of the
first example:
0 19.9 root
1. 19.7 fedb.ddplyWrapper2Fast
2. . 19.7 lapplyFunctionRecurse
3. . . 19.7 lapply
4. . . . 19.4 FUN
5. . . . . 19.4 lapplyFunctionRecurse
6. . . . . . 19.3 lapply
7. . . . . . . 18.6 FUN
8. . . . . . . . 18.6 listNameFunctions
9. . . . . . . . . 18.5 [
10. . . . . . . . . . 18.3 [.data.frame <<- most of the time in
accessing the data within a data frame.
11. . . . . . . . . . . 14.6 attr
11. . . . . . . . . . . 0.5 %in%
12. . . . . . . . . . . . 0.4 match
13. . . . . . . . . . . . . 0.4 is.factor
14. . . . . . . . . . . . . . 0.3 inherits
11. . . . . . . . . . . 0.5 [[
12. . . . . . . . . . . . 0.5 [[.data.frame
13. . . . . . . . . . . . . 0.2 %in%
14. . . . . . . . . . . . . . 0.2 match
15. . . . . . . . . . . . . . . 0.1 is.factor
16. . . . . . . . . . . . . . . . 0.1 inherits
11. . . . . . . . . . . 0.4 anyDuplicated
12. . . . . . . . . . . . 0.2 anyDuplicated.default
11. . . . . . . . . . . 0.2 names
12. . . . . . . . . . . . 0.2 names
11. . . . . . . . . . . 0.1 vector
12. . . . . . . . . . . . 0.1 length
13. . . . . . . . . . . . . 0.1 length
7. . . . . . . 0.7 is.vector
8. . . . . . . . 0.7 split
9. . . . . . . . . 0.6 split.default
10. . . . . . . . . . 0.5 factor
11. . . . . . . . . . . 0.2 as.character
11. . . . . . . . . . . 0.1 unique
12. . . . . . . . . . . . 0.1 unique.default
10. . . . . . . . . . 0.2 [
11. . . . . . . . . . . 0.1 [.data.frame
4. . . . 0.4 is.vector
5. . . . . 0.4 split
6. . . . . . 0.4 split.default
7. . . . . . . 0.4 factor
8. . . . . . . . 0.3 as.character
1. 0.1 data.frame
On Fri, Feb 26, 2010 at 2:58 PM, Rob Forler <rforler at uchicago.edu> wrote:
> So I have a function that does lapply's for me based on dimension. Currently
> only works for length(pivotColumns)=2 because I haven't fixed the rbinds. I
> have two versions. One runs WAYYY faster than the other. And I'm not sure
> why.
>
> Fast Version:
>
> fedb.ddplyWrapper2Fast <- function(data, pivotColumns, listNameFunctions,
> ...){
> lapplyFunctionRecurse <- function(cdata, level=1, ...){
> if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
> } else if (level==length(pivotColumns)) {
> #
> return(lapply(split(cdata,data[cdata,pivotColumns[level]], drop=T),
> function(x, ...) listNameFunctions(data[x,], ...)))
> return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) c(data[cdata[1],pivotColumns[2]],
> data[cdata[1],pivotColumns[1]], sum(data[cdata,"A"], na.rm=T),
> sum(data[cdata,"A"], na.rm=T))))
> } else {
> return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
> }
> }
> result = lapplyFunctionRecurse(data, ...)
> matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
> return(matrix2)
> }
>
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));
> proc.time()-temp
> user system elapsed
> 4.616 0.006 4.630
> #note in thie case the anonymous function I pass in isn't used because I
> hardcode the function into the lapply.
>
> approx 4 seconds
>
> This runs very fast. This runs very slow:
>
> fedb.ddplyWrapper2 <- function(data, pivotColumns, listNameFunctions, ...){
> lapplyFunctionRecurse <- function(cdata, level=1, ...){
> if(level==1){
>
> return(lapply(split(seq(nrow(cdata)),cdata[,pivotColumns[level]], drop=T),
> function(x) lapplyFunctionRecurse(x, level+1, ...)))
> } else if (level==length(pivotColumns)) {
> #this line is different. it essentially calls the function you
> pass in
> return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x, ...) listNameFunctions(data[x,], ...)))
> } else {
> return(lapply(split(cdata,data[cdata,pivotColumns[level]],
> drop=T), function(x) lapplyFunctionRecurse(x, level+1, ...)))
> }
> }
> result = lapplyFunctionRecurse(data, ...)
> matrix2 <- do.call('rbind', lapply(result, function(x)
> do.call('rbind',x)))
> return(matrix2)
> }
>
> dat <- data.frame(D=sample(32000:33000, 666000,
> T),Fid=sample(1:10,666000,T), A=sample(1:5,666000,T))
>> temp = proc.time(); ret = fedb.ddplyWrapper2(dat, c("D", "Fid"),
> function(x) c(sum(x[,"A"], na.rm=T), sum(x[,"A"], na.rm=T)));
> proc.time()-temp
> user system elapsed
> 16.346 65.059 81.680
>
>
>
> Can anyone explain to me why there is a 4x time difference? I don't want to
> have to hardcore into the recursion function, but if I have to I will.
>
> Thanks,
> Rob
>
> [[alternative HTML version deleted]]
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
>
--
Jim Holtman
Cincinnati, OH
+1 513 646 9390
What is the problem that you are trying to solve?
More information about the R-help
mailing list