[R-sig-hpc] 48K csv files, 1000 lines each. How to redesign? (big picture)

Wed Mar 1 23:57:38 CET 2017

Upon a cursory read, an SQLite database can handle up to 140 TB, so I
think your 40 GB are safe. I would advocate SQLite because it's one file,
easy to understand, requires no installation, and is supported by R's
sqldf package (maybe more by now).
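
To make the one-file idea concrete, here is a minimal sketch of stacking many CSV files into a single SQLite table and then pulling back a rectangular chunk. It is written in Python with only the standard library so it runs without any installation; the same SQL would apply from R. Everything named here is an assumption for illustration: the table `runs`, the added column `source_file`, the helpers `load_csv_files` and `fetch_chunk`, and the demo columns `n`, `beta`, `estimate`. It also assumes every CSV file has the same header row and purely numeric columns.

```python
import csv
import os
import sqlite3
import tempfile
from pathlib import Path


def load_csv_files(db_path, csv_paths, table="runs"):
    """Append each CSV file to one SQLite table (hypothetical name: runs).

    Assumes all files share the same header and hold numeric columns.
    A source_file column records which CSV each row came from.
    """
    con = sqlite3.connect(db_path)
    created = False
    for path in csv_paths:
        with open(path, newline="") as fh:
            reader = csv.reader(fh)
            header = next(reader)
            if not created:
                cols = ", ".join(f'"{name}" REAL' for name in header)
                con.execute(
                    f'CREATE TABLE IF NOT EXISTS "{table}" '
                    f"(source_file TEXT, {cols})"
                )
                created = True
            marks = ", ".join("?" for _ in range(len(header) + 1))
            name = Path(path).name
            # REAL column affinity converts the numeric text to floats.
            con.executemany(
                f'INSERT INTO "{table}" VALUES ({marks})',
                ([name, *row] for row in reader),
            )
        con.commit()  # one transaction per file keeps memory use small
    return con


def fetch_chunk(con, columns, where, params=(), table="runs"):
    """Return selected columns for the rows matching a trusted WHERE clause."""
    cols = ", ".join(f'"{c}"' for c in columns)
    sql = f'SELECT {cols} FROM "{table}" WHERE {where}'
    return con.execute(sql, params).fetchall()


if __name__ == "__main__":
    # Tiny made-up demo files standing in for the real simulation output.
    workdir = tempfile.mkdtemp()
    paths = []
    for i in range(2):
        p = os.path.join(workdir, f"run{i}.csv")
        with open(p, "w", newline="") as fh:
            fh.write("n,beta,estimate\n")
            fh.write(f"100,0.5,{0.4 + i}\n")
            fh.write(f"200,0.5,{0.6 + i}\n")
        paths.append(p)

    con = load_csv_files(os.path.join(workdir, "sim.sqlite"), paths)
    # Rectangular chunk: two columns, only the rows where n is 100.
    for row in fetch_chunk(con, ["n", "estimate"], '"n" = ?', (100,)):
        print(row)
```

With 48,000 files of 1000 lines each, the table would hold about 48 million rows, so an index on the parameter columns used in the WHERE clause (CREATE INDEX) would probably be needed to make retrieval fast.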

Cheers,
Roman

On Mar 1, 2017 11:52 PM, "Paul Johnson" <pauljohn32 at gmail.com> wrote:

> Hi
>
> I asked this question on stack overflow and 3 people voted it up
> within 5 minutes, and then the admins froze it because it is too
> interesting. Sorry. It is too broad and not specific. If you have ever
> tried to store a giant collection of simulation exercises and lived to
> tell the tale, I would be glad to know your experience. (Here it was:
> http://stackoverflow.com/questions/42394583/48k-csv-files-1000-lines-each-how-to-redesign-the-data-storage).
>
> This is it:
>
> One of the people that I help decided to scale up a simulation
> exercise to massive proportions. The usual sort of thing we do will
> have 100 conditions with 1000 runs for each one, and the result can
> "easily" fit into a single file or data frame. We do that kind of
> thing with SAS, R, or Mplus. This one is in R. I should have seen
> trouble coming when I heard that the project was failing for lack of
> memory. We see that sometimes with Bayesian models, where holding all
> of the results from chains in memory becomes too demanding. The fix in
> those cases has been to save batches of iterations in separate files.
> Without paying attention to details, I suggested they write smaller
> files on disk as the simulation proceeds.
>
> Later, I realized the magnitude of my error. They had generated 48,000
> output CSV files, in each of which there are 1000 lines and about 80
> columns of real numbers. These are written out in CSV files because
> the researchers are comfortable with data they can see. Again, I was
> not paying attention when they asked me how to analyze that. I was
> thinking small data, and told them to stack up the csv files using a
> shell script. The result is a 40+GB csv file. R can't hope to open
> that on the computers we have around here.
>
> I believe/hope that the analysis will never need to use all 40GB of
> data in one regression model :) I expect it is more likely they will
> want to summarize smaller segments. The usual exercise of this ilk has
> 3 - 5 columns of simulation parameters and then 10 columns of results
> from analysis. In this project, the result is much more massive
> because they have 10 columns of parameters and all of the mix and
> match combinations made the project expand.
>
> I believe that the best plan is to store the data in a "database" like
> structure. I want you to advise me about which approach to take.
>
> MySQL? Not open anymore, so I'm not too enthusiastic.
>
> PostgreSQL? Seems more and more popular, but I have not administered a
> server before.
>
> SQLite3? Some admins here supply us with data for analysis in that
> format, but never have we received anything larger than 1.5GB.
>
> HDF5 (maybe netCDF?) It used to be (say 2005) that these specialized,
> science-style, database-like container formats would work well.
> However, I have not heard mention of them since I started helping the
> social science students. Back when R started, we were using HDF5 and
> one of my friends wrote the original R code to interact with HDF5.
>
> My top priority is rapid data retrieval. I think if one of the
> technicians can learn to retrieve a rectangular chunk, we can show
> researchers how to do the same.
>
>
> Warm Regards
> PJ
> --
> Paul E. Johnson   http://pj.freefaculty.org
> Director, Center for Research Methods and Data Analysis
> http://crmda.ku.edu
>
> To write to me directly, please address me at pauljohn at ku.edu.
>
> _______________________________________________
> R-sig-hpc mailing list
> R-sig-hpc at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-hpc
>
