[R-sig-hpc] 48K csv files, 1000 lines each. How to redesign? (big picture)

Paul Johnson pauljohn32 at gmail.com
Wed Mar 1 23:50:30 CET 2017


Hi

I asked this question on Stack Overflow and 3 people voted it up
within 5 minutes, and then the admins froze it because it is too
interesting. Sorry: it is too broad and not specific enough. If you
have ever tried to store a giant collection of simulation output and
lived to tell the tale, I would be glad to hear about your experience.
(Here it was:
http://stackoverflow.com/questions/42394583/48k-csv-files-1000-lines-each-how-to-redesign-the-data-storage.)

This is it:

One of the people I help decided to scale up a simulation exercise to
massive proportions. The usual sort of thing we do has 100 conditions
with 1000 runs for each one, and the result can "easily" fit into a
single file or data frame. We do that kind of thing with SAS, R, or
Mplus. This one is in R. I should have seen trouble coming when I
heard that the project was failing for lack of memory. We see that
sometimes with Bayesian models, where holding all of the results from
the chains in memory becomes too demanding. The fix in those cases has
been to save batches of iterations in separate files. Without paying
attention to the details, I suggested they write smaller files to disk
as the simulation proceeds.

Later, I realized the magnitude of my error. They had generated
48,000 output CSV files, each of which has 1000 lines and about 80
columns of real numbers. They are written out as CSV because the
researchers are comfortable with data they can see. Again, I was not
paying attention when they asked me how to analyze that. I was
thinking small data, and told them to stack the CSV files using a
shell script. The result is a 40+ GB CSV file. R can't hope to open
that on the computers we have around here.

I believe/hope that the analysis will never need to use all 40GB of
data in one regression model :) I expect it is more likely they will
want to summarize smaller segments. The usual exercise of this ilk has
3 - 5 columns of simulation parameters and then 10 columns of results
from the analysis. In this project, the result is much more massive
because they have 10 columns of parameters, and all of the
mix-and-match combinations of those made the project expand.
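
For what it is worth, the stopgap I have in mind is to never build the
40GB file at all, and instead summarize each CSV on its own and keep
only the small per-file summaries. A minimal sketch, assuming the
files sit in an "output" directory and using a made-up result column
"est1" (the real summary would be whatever the researchers need):

library(data.table)

files <- list.files("output", pattern = "\\.csv$", full.names = TRUE)

summarize_one <- function(f) {
    dt <- fread(f)                       # read one 1000 x 80 file
    dt[, .(file = basename(f),           # keep a one-row summary per file
           mean_est1 = mean(est1),
           sd_est1   = sd(est1))]
}

summaries <- rbindlist(lapply(files, summarize_one))
fwrite(summaries, "summaries.csv")       # 48,000 rows, fits in memory

That keeps memory use down to one file at a time, but it does not
solve the real problem of ad hoc retrieval, which is why I am asking
about databases.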

I believe that the best plan is to store the data in a database-like
structure. I want you to advise me about which approach to take.

MySQL? It does not feel truly open anymore since the Oracle takeover,
so I'm not too enthusiastic.

PostgreSQL? It seems more and more popular, but I have not
administered a server before.

SQLite3? Some admins here supply us with data for analysis in that
format, but we have never received anything larger than 1.5GB.
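
If we went the SQLite route, my rough idea is to append every CSV
into one table and then pull rectangular chunks with SQL. A sketch
only; the table name "sims" and the parameter/result column names are
placeholders:

library(DBI)
library(RSQLite)

con <- dbConnect(RSQLite::SQLite(), "simulation.sqlite")

for (f in list.files("output", pattern = "\\.csv$", full.names = TRUE)) {
    dbWriteTable(con, "sims", read.csv(f), append = TRUE)
}

## pull only the rows and columns needed for one analysis
chunk <- dbGetQuery(con,
    "SELECT param1, param2, est1, est2
       FROM sims
      WHERE param1 = 0.5 AND param2 = 2")

dbDisconnect(con)

The same DBI code would run against a PostgreSQL server by swapping
RSQLite::SQLite() for RPostgreSQL::PostgreSQL(), which is part of why
I am undecided between the two.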

HDF5 (maybe netCDF)? It used to be (say, around 2005) that these
specialized scientific container/database-like formats worked well.
However, I have not heard them mentioned since I started helping the
social science students. Back when R started, we were using HDF5, and
one of my friends wrote the original R code to interact with it.
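
If HDF5 still makes sense, my understanding (untested, and assuming
the Bioconductor rhdf5 package with made-up file names and chunk
sizes) is that one chunked dataset would let us read arbitrary
row/column slices without touching the rest:

library(rhdf5)

## one big matrix: 48,000 files x 1000 rows each, by ~80 columns
h5createFile("simulation.h5")
h5createDataset("simulation.h5", "sims",
                dims = c(48000 * 1000, 80),
                storage.mode = "double",
                chunk = c(1000, 80))      # one chunk per original CSV

files <- list.files("output", pattern = "\\.csv$", full.names = TRUE)
for (i in seq_along(files)) {
    block <- as.matrix(read.csv(files[i]))        # 1000 x 80 block
    rows  <- ((i - 1) * 1000 + 1):(i * 1000)
    h5write(block, "simulation.h5", "sims", index = list(rows, 1:80))
}

## read back a rectangular chunk: rows 1..1000, first 10 columns
x <- h5read("simulation.h5", "sims", index = list(1:1000, 1:10))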

My top priority is rapid data retrieval. I think if one of the
technicians can learn to retrieve a rectangular chunk, we can show
researchers how to do the same.
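
Concretely, the sort of thing I hope a technician could maintain is a
tiny wrapper (purely hypothetical names, written against the SQLite
sketch above) so the researchers never have to see the SQL or HDF5
calls:

## hypothetical helper: return a data frame for one slice of the results
get_chunk <- function(where, columns = "*", db = "simulation.sqlite") {
    con <- DBI::dbConnect(RSQLite::SQLite(), db)
    on.exit(DBI::dbDisconnect(con))
    DBI::dbGetQuery(con, sprintf("SELECT %s FROM sims WHERE %s",
                                 paste(columns, collapse = ", "), where))
}

## e.g. d <- get_chunk(where = "param1 = 0.5", columns = c("param1", "est1"))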


Warm Regards
PJ
-- 
Paul E. Johnson   http://pj.freefaculty.org
Director, Center for Research Methods and Data Analysis http://crmda.ku.edu

To write to me directly, please address me at pauljohn at ku.edu.


