[R] Quicker way of combining vectors into a data.frame

Fri Dec 1 11:44:05 CET 2006

[ Resending to the list as I fell foul of the too many recipients rule ]

On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:

Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
for your comments and suggestions.

I noticed that two of the vectors were named and so I removed the names
(names(vec) <- NULL) and that pushed the execution time for the function
from c. 40 seconds to c. 115 seconds and all the time was taken within
the data.frame(...) call. So having names *on* some of the vectors
seemed to help things along, which was the opposite of what I had
expected.

If I use the cbind method of Marc, then the execution time for the
function drops to c. 1 second (most of which is in the calculation of
one of the vectors). So I guess I can work round this now.
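
A minimal sketch of the cbind route, assuming all of the vectors are plain
numeric vectors of equal length. The vectors below are simulated stand-ins
for the real data, and the exact form of Marc's suggestion is not reproduced
in this message, so treat this as an illustration rather than his code:

```r
# Simulated stand-ins for the real vectors (assumption: all numeric, same length)
set.seed(1)
n <- 4471
lc.ratio <- rnorm(n)
Q <- rnorm(n)
fNupt <- rnorm(n)

# cbind() on bare variable names builds a numeric matrix and takes the
# column names from the argument names
mat <- cbind(lc.ratio, Q, fNupt)

# Converting the matrix gives a data frame with the same column names
fab <- as.data.frame(mat)
str(fab)
```

This only works cleanly when every column has the same type, because a
matrix can hold only one type.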

What I find interesting is that building a 10-column data frame from
simulated data is almost instantaneous:

> test.dat <- rnorm(4471)
> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

Doing exactly the same thing with the real vectors inside the function,
however, gives the following timings:

> system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1] 173.415   0.260 192.192   0.000   0.000

Memory use was flat for most of that time, but for c. 5 seconds towards
the end R's memory use increased by 200-300 MB.

and...

> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1]  99.966   0.140 114.091   0.000   0.000

Again there was a slight increase in memory usage in the last 5 seconds.
So now, having stripped the names from two of the vectors (so that all
are un-named), the un-named version of the data.frame call is almost
twice as slow as the named data.frame call.

If I leave the names on the two vectors that had them, I get the
following timings for those same calls:

> system.time(fab <- data.frame(lc.ratio, Q,
+                      fNupt,
+                      rho.n, rho.s,
+                      net.Nimm,
+                      net.Nden,
+                      CLminN,
+                      CLmaxN,
+                      CLmaxS))
[1]  96.234   0.244 101.706   0.000   0.000

> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+                      fNupt = fNupt,
+                      rho.n = rho.n, rho.s = rho.s,
+                      net.Nimm = net.Nimm,
+                      net.Nden = net.Nden,
+                      CLminN = CLminN,
+                      CLmaxN = CLmaxN,
+                      CLmaxS = CLmaxS))
[1] 13.597  0.088 15.868  0.000  0.000

So having the 2 named vectors and using the named version of the
data.frame call is the fastest combination.
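
The comparison can be laid out as a small, self-contained timing harness.
This is only a sketch on simulated data: the vector names (plain, named,
stripped) and the "obs" labels are hypothetical, and there is no claim that
it reproduces the slow timings above. It also shows one documented effect of
names on a vector, namely that data.frame() takes its row names from the
first component that has suitable names:

```r
set.seed(42)
n <- 4471
plain <- rnorm(n)
named <- plain
names(named) <- paste0("obs", seq_len(n))   # hypothetical names

# Un-named arguments: column names are deparsed from the call
t1 <- system.time(d1 <- data.frame(plain, named))

# Named arguments, with the names left on one vector
t2 <- system.time(d2 <- data.frame(plain = plain, named = named))

# Named arguments, after stripping the names from the vector
stripped <- unname(named)
t3 <- system.time(d3 <- data.frame(plain = plain, named = stripped))

# Compare user, system and elapsed times side by side
print(rbind(unnamed.args = t1, named.args = t2, stripped = t3)[, 1:3])

# Names on a vector become the row names of the data frame
head(rownames(d2))
head(rownames(d3))
```

Running the same three calls on the real vectors, outside the debugger as
well as inside it, would show whether the slowdown follows the data or the
context of the call.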

This is all done within the debugger at the time when I would be
generating fab, and if I do,

> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000

(as above) at this point in the debugger it is exceedingly quick.

I just don't understand what is going on with data.frame.
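
Since the simulated vectors are fast and the real ones are not, one way to
narrow this down is to compare the properties of the vectors themselves. The
helper below (describe_vec, a hypothetical name) is a rough sketch that
reports class, type, length, whether names are present, and how many
attributes each vector carries. The two vectors in the list are stand-ins
and would be replaced by the real ones:

```r
# Hypothetical helper: one-row summary of a vector's basic properties
describe_vec <- function(v) {
  data.frame(class = paste(class(v), collapse = "/"),
             type = typeof(v),
             length = length(v),
             has.names = !is.null(names(v)),
             n.attributes = length(attributes(v)),
             stringsAsFactors = FALSE)
}

# Stand-in vectors (assumption: replace with lc.ratio, Q, fNupt, ...)
vecs <- list(plain = rnorm(5), named = c(a = 1, b = 2, c = 3))

# One row per vector, so that differences stand out
info <- do.call(rbind, lapply(vecs, describe_vec))
print(info)
```

Any vector whose class, length or attributes differ from a plain numeric
vector of length 4471 would be a natural first suspect.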

I have yet to try Prof. Ripley's suggestion of being a bit naughty with
R - I'll see if that is any quicker.

Once again, thanks to you all for your suggestions.

All the best,

G

> Gavin,
> 
> One more note, which is that even timing the direct data frame creation
> on my system with colnames, again using the same 10 numeric columns, I
> get:
> 
> > system.time(DF1 <- data.frame(lc.ratio = Col1, Q = Col2, fNupt = Col3,
>                                 rho.n = Col4, rho.s = Col5, 
>                                 net.Nimm = Col6, net.Nden = Col7, 
>                                 CLminN = Col8, CLmaxN = Col9, 
>                                 CLmaxS = Col10))
> [1] 0.012 0.000 0.028 0.000 0.000
> 
> 
> > str(DF1)
> 'data.frame':   4471 obs. of  10 variables:
>  $ lc.ratio: num   0.1423  0.1873 -1.8129  0.0255 -1.7650 ...
>  $ Q       : num   0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
>  $ fNupt   : num  -0.1718 -0.0549  1.5194 -1.6127 -1.2019 ...
>  $ rho.n   : num  -0.740  0.240  0.522 -1.492  1.003 ...
>  $ rho.s   : num  -0.2363 -1.6248 -0.3045  0.0294  0.1240 ...
>  $ net.Nimm: num  -0.774  0.947 -1.098  0.809  1.216 ...
>  $ net.Nden: num  -0.198 -0.135 -0.300 -0.618 -0.784 ...
>  $ CLminN  : num   0.924 -3.265  0.211  0.813  0.262 ...
>  $ CLmaxN  : num   0.3212 -0.0502 -0.9978  0.9005 -1.6535 ...
>  $ CLmaxS  : num  -0.520  0.278 -0.546 -0.925  1.507 ...
>
> So there is something else going on, either with your code or some other
> conflict, unless my assumptions about your data are incorrect.
> 
> HTH,
> 
> Marc
> 
-- 
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
 Gavin Simpson                 [t] +44 (0)20 7679 0522
 ECRC & ENSIS, UCL Geography,  [f] +44 (0)20 7679 0565
 Pearson Building,             [e] gavin.simpsonATNOSPAMucl.ac.uk
 Gower Street, London          [w] http://www.ucl.ac.uk/~ucfagls/
 UK. WC1E 6BT.                 [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%