[R-sig-hpc] Large Zoo object processing with Rcpp

David Rosenberg david.davidr at gmail.com
Wed Dec 16 21:48:35 CET 2009


Hi:

(sorry about the previous posting being HTML)

I have ~500,000 columns in a zoo object (i.e. 500,000 time series with the
same index set).

I need to run the same function on each of the 500,000 series, and the
function is implemented in C++/Rcpp.

The storage for all the series is significant: it fits in memory, but it's
at least 1 GB in size (for scale, 500,000 double-precision series of 250
observations each would come to about 1 GB).

I have no sense of whether it's better to process all 500,000 columns in a
single C++ call, or to use some kind of apply function in R to call the C++
function on each column individually (a sketch of the single-call option
follows the list below).

The issues I have in mind are:
* the time to copy the data (I vaguely recall reading that the data are
copied when an Rcpp function is run on them, but I could be mistaken)
* the space constraint imposed by holding two copies of all the data
* the cost of using apply in R (it seems to be just a loop), though maybe a
different flavor of apply would be better.
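
To make the single-call option concrete, here is roughly what I have in
mind, as an untested sketch using the plain R C API (process_series is just
a stand-in for my real per-series computation, and I'm assuming coredata(z)
is a plain double matrix with one series per column):

#include <R.h>
#include <Rinternals.h>

/* Stand-in for the real per-series computation: reads n contiguous
   doubles and returns one summary value. */
static double process_series(const double *x, int n)
{
    double s = 0.0;
    for (int i = 0; i < n; i++)
        s += x[i];
    return s / n;
}

/* Single entry point, called once as .Call("process_all", coredata(z)).
   REAL(mat) points at R's own storage, so the input is not copied, and
   there is no per-column R-level call overhead. */
extern "C" SEXP process_all(SEXP mat)
{
    int nrow = Rf_nrows(mat), ncol = Rf_ncols(mat);
    SEXP out = PROTECT(Rf_allocVector(REALSXP, ncol));
    const double *x = REAL(mat);
    for (int j = 0; j < ncol; j++)
        REAL(out)[j] = process_series(x + (size_t)j * nrow, nrow);
    UNPROTECT(1);
    return out;
}

The alternative I'm weighing against this is something like
apply(coredata(z), 2, f) on the R side, with f calling into C++ half a
million times.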

As an important but possibly orthogonal issue, I notice that when I access
the zoo object in C++ (via a conversion to vector<vector<double>> using
getDataMat()), the data seem to be laid out as
zooArrs[timeIndex][seriesNumber]. This means that if I'm processing one
series at a time, and I have a large number of series, I'll probably take a
cache miss on every access as I walk the consecutive entries of a single
series (i.e. zooArrs[0][0], zooArrs[1][0], zooArrs[2][0], ...).
I'm not sure whether this is how the data are stored in R/zoo, whether the
conversion to this layout happens inside Rcpp, or something else. In any
case, are there any obvious/natural places to change the way the data are
stored?
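
For what it's worth, my (possibly mistaken) understanding is that R itself
stores a matrix column-major, i.e. element (i, j) of an nrow x ncol double
matrix lives at REAL(mat)[i + j*nrow], so the observations of one series
should already be contiguous in coredata(z); the [time][series] layout would
then be an artifact of the copy getDataMat() makes. If that's right, going
at the matrix directly would give the sequential access pattern I want,
along the lines of this untested sketch (series_mean is just a toy
computation):

#include <Rinternals.h>

/* Mean of one series: reads column `col` (0-based) of the matrix.
   Because R matrices are column-major, the nrow observations of a
   series sit in consecutive memory, so these reads are sequential. */
extern "C" SEXP series_mean(SEXP mat, SEXP col)
{
    int nrow = Rf_nrows(mat);
    const double *series = REAL(mat) + (size_t)Rf_asInteger(col) * nrow;
    double s = 0.0;
    for (int i = 0; i < nrow; i++)
        s += series[i];
    return Rf_ScalarReal(s / nrow);
}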


Thanks,

David


