[R] Efficient cbind of elements from two lists
William Dunlap
wdunlap at tibco.com
Thu Nov 19 17:06:23 CET 2009
> -----Original Message-----
> From: r-help-bounces at r-project.org
> [mailto:r-help-bounces at r-project.org] On Behalf Of Stephan Dlugosz
> Sent: Thursday, November 19, 2009 7:03 AM
> To: r-help at r-project.org
> Subject: [R] Efficient cbind of elements from two lists
>
> Hi!
>
> I have a data.frame "data" and have split it.
>
> data <- split(data, data[,1])
>
> This is quite a slow procedure, and I do not want to do it again. So
> any unsplit and "resplit" is not an option for me.
> But I have to cbind "variables" to the split data from another list,
> which consists of vectors with matching sizes, so
>
> for (i in seq_along(data)) {
>   data[[i]] <- cbind(data[[i]], l[[i]])
> }
>
> works well; but very, very slowly.
> The lapply solution:
>
> data <- lapply(1:k, function(i) cbind(data[[i]], l[[i]]))
>
> does not improve the situation, but allows for mclapply from the
> multicore package...
> Is there a more efficient way to combine elements from two lists?
Can you restructure your analysis so you don't need
to split the data.frame itself? I'm assuming the split
was slow because there are a lot of groups. Splitting
a data.frame into lots of pieces is considerably slower
than splitting a few numeric or character columns in it.
> df <- data.frame(group=rep(1:1e5, each=2), score=1:2e5)
> system.time(split(df, df$group)) # split entire data.frame into 1e5 parts
   user  system elapsed
 117.32   38.42  154.34
> system.time(split(df$score, df$group)) # split 2nd column into 1e5 parts
   user  system elapsed
   0.43    0.03    0.46
If R does things the way S+ does, this is because splitting
simple vectors is done in C code, but splitting data.frames
invokes the S-language [.data.frame function, which is
relatively slow when selecting rows from a data.frame.
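One way to act on this, sketched here under assumptions, is to split the row numbers instead of the data.frame and then index single columns with them. The smaller group count and the name groupMean are made up for illustration, and I have not timed this against the poster's data.

```r
# Smaller version of the df used above (1e3 groups rather than 1e5)
df <- data.frame(group = rep(1:1e3, each = 2), score = 1:2e3)

# Split the row numbers, not the data.frame itself; this is a plain
# integer vector, so no data.frame row selection is involved
idx <- split(seq_len(nrow(df)), df$group)

# Per-group work then indexes one column by row number.
# groupMean is a hypothetical name used only in this sketch.
groupMean <- vapply(idx, function(i) mean(df$score[i]), numeric(1))
```

The list idx has one element per group, named by the group value, so it can stand in for the split data.frame wherever only a few columns are needed.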
I'd suggest using ave() (or a function from the plyr package),
working on columns from your data.frame and adding ave's
output as a column in your big data.frame. E.g., to compute
the average score in each group:
> system.time(df$meanScore <- ave(df$score, df$group, FUN=mean))
   user  system elapsed
   3.37    0.00    3.50
> df[1:6,]
  group score meanScore
1     1     1       1.5
2     1     2       1.5
3     2     3       3.5
4     2     4       3.5
5     3     5       5.5
6     3     6       5.5
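If the extra "variables" already exist as a list with one vector per group, unsplit() may be able to put them back in row order so they can be added as a column of the unsplit data.frame. This is only a sketch: the list l below is a made-up stand-in for the poster's list, and it assumes the list elements are in the order split() would produce and have matching lengths.

```r
# Small example with the groups deliberately not in sorted order
df <- data.frame(group = c(2, 1, 2, 3, 1, 3), score = 1:6)

# Hypothetical stand-in for the poster's list `l`: one vector per
# group, ordered and sized the way split() would return them
l <- split(df$score * 10, df$group)

# unsplit() reverses the split, returning values in original row order,
# so the result lines up with the rows of df
df$extra <- unsplit(l, df$group)
```

This adds the column in a single vectorized step, with no loop over the groups and no cbind() on each piece.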
Bill Dunlap
Spotfire, TIBCO Software
wdunlap tibco.com
>
> Thank you very much!
>
> Greetings,
> Stephan
>