[R] Quicker way of combining vectors into a data.frame
Gavin Simpson
gavin.simpson at ucl.ac.uk
Fri Dec 1 11:44:05 CET 2006
[ Resending to the list as I fell foul of the too many recipients rule ]
On Thu, 2006-11-30 at 11:34 -0600, Marc Schwartz wrote:
Thanks to Marc, Prof. Ripley, Sebastian and Sebastian (Luque - offline)
for your comments and suggestions.
I noticed that two of the vectors were named and so I removed the names
(names(vec) <- NULL) and that pushed the execution time for the function
from c. 40 seconds to c. 115 seconds and all the time was taken within
the data.frame(...) call. So having names *on* some of the vectors
seemed to help things along, which was the opposite of what I had
expected.
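For anyone wanting to repeat the comparison, here is a minimal sketch on made-up data. The vector `vec` and its "site" names are hypothetical stand-ins, as the real vectors are not shown in this message, and synthetic data like this is not expected to reproduce the slow timings. One documented difference it does show is that data.frame() can take its row names from a named vector; whether that has anything to do with the timings reported here is not established.

```r
# Hypothetical example vector; the real vectors are not shown in this message.
set.seed(1)
vec <- rnorm(4471)
names(vec) <- paste0("site", seq_along(vec))

# Keep an un-named copy rather than modifying the original in place.
vec_plain <- unname(vec)   # same effect as names(vec_plain) <- NULL

# Time the same call with the named and the un-named vector.
t_named <- system.time(d1 <- data.frame(x = vec))
t_plain <- system.time(d2 <- data.frame(x = vec_plain))
print(c(named = t_named[["elapsed"]], plain = t_plain[["elapsed"]]))

# The column values are the same; only the row names differ.
print(head(rownames(d1), 3))
print(head(rownames(d2), 3))
```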
If I use the cbind method of Marc, then the execution time for the
function drops to c. 1 second (most of which is in the calculation of
one of the vectors). So I guess I can work round this now.
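Marc's code is not quoted in this message, so the following is only a sketch of what a cbind-based route might look like, assuming all ten vectors are numeric and of equal length. The list `vecs` holds randomly generated stand-ins for the real vectors.

```r
# Hypothetical stand-ins for the ten numeric vectors of length 4471.
set.seed(1)
n <- 4471
vecs <- replicate(10, rnorm(n), simplify = FALSE)
names(vecs) <- c("lc.ratio", "Q", "fNupt", "rho.n", "rho.s",
                 "net.Nimm", "net.Nden", "CLminN", "CLmaxN", "CLmaxS")

# Bind the vectors column-wise into one numeric matrix; the list names
# become the column names.
mat <- do.call(cbind, vecs)

# Convert the matrix to a data frame in a single step.
fab <- as.data.frame(mat)

str(fab)
```

This only makes sense when every column has the same type, because cbind() on atomic vectors produces a matrix with a single storage mode.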
What I find interesting is that:
> test.dat <- rnorm(4471)
> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000
Whereas doing exactly the same thing with different data in the function
gives the following timings:
> system.time(fab <- data.frame(lc.ratio, Q,
+ fNupt,
+ rho.n, rho.s,
+ net.Nimm,
+ net.Nden,
+ CLminN,
+ CLmaxN,
+ CLmaxS))
[1] 173.415 0.260 192.192 0.000 0.000
Memory use was flat for most of that time, but for roughly the last
5 seconds memory use by R increased by 200-300 MB.
And with the arguments named in the call:
> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+ fNupt = fNupt,
+ rho.n = rho.n, rho.s = rho.s,
+ net.Nimm = net.Nimm,
+ net.Nden = net.Nden,
+ CLminN = CLminN,
+ CLmaxN = CLmaxN,
+ CLmaxS = CLmaxS))
[1] 99.966 0.140 114.091 0.000 0.000
Again with a slight increase in memory usage in the last 5 seconds. So now,
with the names stripped from the two vectors (so that all ten are
un-named), the data.frame call with untagged arguments is almost twice
as slow as the call with tagged (name = value) arguments.
If I leave the names on the two vectors that had them, I get the
following timings for those same calls
> system.time(fab <- data.frame(lc.ratio, Q,
+ fNupt,
+ rho.n, rho.s,
+ net.Nimm,
+ net.Nden,
+ CLminN,
+ CLmaxN,
+ CLmaxS))
[1] 96.234 0.244 101.706 0.000 0.000
> system.time(fab <- data.frame(lc.ratio = lc.ratio, Q = Q,
+ fNupt = fNupt,
+ rho.n = rho.n, rho.s = rho.s,
+ net.Nimm = net.Nimm,
+ net.Nden = net.Nden,
+ CLminN = CLminN,
+ CLmaxN = CLmaxN,
+ CLmaxS = CLmaxS))
[1] 13.597 0.088 15.868 0.000 0.000
So having the 2 named vectors and using the named version of the
data.frame call is the fastest combination.
This is all done within the debugger at the time when I would be
generating fab, and if I do,
> system.time(z <- data.frame(col1 = test.dat, col2 = test.dat, col3 = test.dat,
+ col4 = test.dat, col5 = test.dat, col6 = test.dat, col7 = test.dat,
+ col8 = test.dat, col9 = test.dat, col10 = test.dat))
[1] 0.008 0.000 0.007 0.000 0.000
(as above) at this point in the debugger it is exceedingly quick.
I just don't understand what is going on with data.frame.
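Since the same call is fast on plain rnorm() output and slow on the real vectors, one thing worth checking is what each vector carries besides its values. Below is a rough diagnostic sketch; `describe_inputs` is a hypothetical helper (not part of R), and the two example vectors are made up.

```r
# Hypothetical helper: summarise class, length, names and attributes of
# each vector passed in, plus its size in memory.
describe_inputs <- function(...) {
  args <- list(...)
  data.frame(
    class     = vapply(args, function(x) class(x)[1], character(1)),
    length    = vapply(args, length, integer(1)),
    has_names = vapply(args, function(x) !is.null(names(x)), logical(1)),
    n_attr    = vapply(args, function(x) length(attributes(x)), integer(1)),
    bytes     = vapply(args, function(x) as.numeric(object.size(x)), numeric(1)),
    stringsAsFactors = FALSE
  )
}

# Made-up example: one plain vector and one carrying names.
x <- rnorm(100)
y <- x
names(y) <- paste0("site", seq_along(y))

info <- describe_inputs(plain = x, named = y)
print(info)
```

Calling it as describe_inputs(lc.ratio = lc.ratio, Q = Q, ...) at the point in the debugger where fab is built would show whether any of the ten vectors is not a bare numeric vector of length 4471.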
I have yet to try Prof. Ripley's suggestion of being a bit naughty with
R - I'll see if that is any quicker.
Once again, thanks to you all for your suggestions.
All the best,
G
> Gavin,
>
> One more note, which is that even timing the direct data frame creation
> on my system with colnames, again using the same 10 numeric columns, I
> get:
>
> > system.time(DF1 <- data.frame(lc.ratio = Col1, Q = Col2, fNupt = Col3,
> rho.n = Col4, rho.s = Col5,
> net.Nimm = Col6, net.Nden = Col7,
> CLminN = Col8, CLmaxN = Col9,
> CLmaxS = Col10))
> [1] 0.012 0.000 0.028 0.000 0.000
>
>
> > str(DF1)
> 'data.frame': 4471 obs. of 10 variables:
> $ lc.ratio: num 0.1423 0.1873 -1.8129 0.0255 -1.7650 ...
> $ Q : num 0.8340 -0.2387 -0.0864 -1.1184 -0.3368 ...
> $ fNupt : num -0.1718 -0.0549 1.5194 -1.6127 -1.2019 ...
> $ rho.n : num -0.740 0.240 0.522 -1.492 1.003 ...
> $ rho.s : num -0.2363 -1.6248 -0.3045 0.0294 0.1240 ...
> $ net.Nimm: num -0.774 0.947 -1.098 0.809 1.216 ...
> $ net.Nden: num -0.198 -0.135 -0.300 -0.618 -0.784 ...
> $ CLminN : num 0.924 -3.265 0.211 0.813 0.262 ...
> $ CLmaxN : num 0.3212 -0.0502 -0.9978 0.9005 -1.6535 ...
> $ CLmaxS : num -0.520 0.278 -0.546 -0.925 1.507 ...
>
> So there is something else going on, either with your code or some other
> conflict, unless my assumptions about your data are incorrect.
>
> HTH,
>
> Marc
>
--
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%
Gavin Simpson [t] +44 (0)20 7679 0522
ECRC & ENSIS, UCL Geography, [f] +44 (0)20 7679 0565
Pearson Building, [e] gavin.simpsonATNOSPAMucl.ac.uk
Gower Street, London [w] http://www.ucl.ac.uk/~ucfagls/
UK. WC1E 6BT. [w] http://www.freshwaters.org.uk
%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%~%