[R-sig-hpc] 48K csv files, 1000 lines each. How to redesign? (big picture)

Saptarshi Guha saptarshi.guha at gmail.com
Thu Mar 2 07:09:24 CET 2017

This is likely not going to help, but I've had immense success using
RHIPE [^1] . I say likely not going help because installing it
appears to be painful (though I have scripts for Amazon EMR
c4.2xlarge clusters).

My approach to this is

1. Read the hundreds of gb of CSV files into RHIPE
2. Chunk them as data tables [^2] each data table corresponding to
   the information of one subject.

Then analyze this data. I've worked with the tens of millions of
   subjects (the data tbales for each of these were < 1000 rows).

At the end of the following code, my data set consists of ~ 6MM
subjects each with few tens to a few hundred rows of data. The data
for each subject('cid') is stored as a data table.

I can then compute across subjects very easily. Upping the number of
compute nodes if i feel the need to do so(Elastic MapReduce makes
this simple)


z <- rhwatch(map=expression({
        z <- fread(paste(unlist(map.values),collapse="\n")


        z[, subdate:=as.Date(subdate,"%Y%m%d")]
        z[, rhcollect(.BY$cid, .SD) by=cid]
        rhcollect(sample(1:1000,1), z)
    }, error=function(e) { rhcounter("errors",as.character(e),1)})
   , reduce=expression(
       pre = {
           .r <- NULL
       reduce = {
           .r <- rbind(.r,rbindlist(reduce.values))
       post = {
           .r <- .r[order(subdate),]
           rhcollect(reduce.key, .r)
   , mapred = list(mapred.reduce.tasks=300)
   , output =
   , input  =


[^1]: http://deltarho.org/

[^2]: https://cran.r-project.org/web/packages/data.table/index.html

On Wed, Mar 1, 2017 at 9:56 PM, Paul Gilbert <pgilbert902 at gmail.com> wrote:

> Well, having lived to tell the tale, I would like to mention one option
> that never seems as obvious as it should. With simulation exercises you can
> save the seed and regenerate only portions of the data you want for
> specific analysis. It can be especially fast if the analysis can be done
> without actually saving data to file. This is a trade-off between compute
> speed and storage access/query speed, and depends of course on the
> complexity of the model computation. The trade off does not seem to always
> work the way one is inclined to think it should. (BTW, it is important to
> beware of the details needed for regenerating simulations on clusters.)
> Paul Gilbert
> On 03/01/2017 05:50 PM, Paul Johnson wrote:
>> Hi
>> I asked this question on stack overflow and 3 people voted it up
>> within 5 minutes, and then the admins froze it because it is too
>> interesting. Sorry. It is too broad and not specific. If you have ever
>> tried to store a giant collection of simulation exercises and lived to
>> tell the tale, would be glad to know your experience. (here it was:
>> http://stackoverflow.com/questions/42394583/48k-csv-files-
>> 1000-lines-each-how-to-redesign-the-data-storage).
>> This is it:
>> One of the people that I help decided to scale up a simulation
>> exercise to massive proportions. The usual sort of thing we do will
>> have 100 conditions with 1000 runs with each one, and the result can
>> "easily" fit into a single file or data frame. We do that kind of
>> thing with SAS, R, or Mplus. This one is in R. I should have seen
>> trouble coming when I heard that the project was failing for lack of
>> memory. We see that sometimes with Bayesian models, where holding all
>> of the results from chains in memory becomes too demanding. The fix in
>> those cases has been to save batches of iterations in separate files.
>> Without paying attention to details, I suggested they write smaller
>> files on disk as the simulation proceeds.
>> Later, I realized the magnitude of my error. They had generated 48,000
>> output CSV files, in each of which there are 1000 lines and about 80
>> columns of real numbers. These are written out in CSV files because
>> the researchers are comfortable with data they can see. Again, I was
>> not paying attention when they asked me how to analyze that. I was
>> thinking small data, and told them to stack up the csv files using a
>> shell script. The result is a 40+GB csv file. R can't hope to open
>> that on the computers we have around here.
>> I believe/hope that the analysis will never need to use all 40GB of
>> data in one regression model :) I expect it is more likely they will
>> want to summarize smaller segments. The usual exercise in this ilk has
>> 3 - 5 columns of simulation parameters and then 10 columns of results
>> from analysis. In this project, the result is much more massive
>> because they have 10 columns of parameters and all of the mix and
>> match combinations made the project expand.
>> I believe that the best plan is to store the data in a "database" like
>> structure. I want you to advise me about which approach to take.
>> Mysql? Not open anymore, I'm not too enthusiastic.
>> PostgreSQL? Seems more and more popular, have not administered a server
>> before.
>> SQlite3? Some admins here supply us with data for analysis in that
>> format, but never have we received anything larger than 1.5GB.
>> HDF5 (Maybe netCDF?) It used to be (say 2005) these specialized
>> science style container database-like formats would work well.
>> However, I have not heard mention of them since I started helping the
>> social science students. Back when R started, we were using HDF5 and
>> one of my friends wrote the original R code to interact with HDF5.
>> My top priority is rapid data retrieval. I think if one of the
>> technicians can learn to retrieve a rectangular chunk, we can show
>> researchers how to do the same.
>> Warm Regards
>> PJ
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc

	[[alternative HTML version deleted]]

More information about the R-sig-hpc mailing list