[R-sig-hpc] 48K csv files, 1000 lines each. How to redesign? (big picture)

Cristian Bologa CBologa at salud.unm.edu
Thu Mar 2 01:10:51 CET 2017


Hi Paul,

In your case, you might be better with MonetDBLite.

https://www.monetdb.org/blog/monetdblite-r
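
For what it's worth, a minimal sketch of the load-and-query pattern
(the directory paths, table name, and column names below are just
placeholders):

library(DBI)
library(MonetDBLite)

## embedded, no server to administer; the database lives in a directory
con <- dbConnect(MonetDBLite::MonetDBLite(), "/path/to/monetdb-dir")

## append all the CSV files into one table
files <- list.files("/path/to/csv-dir", pattern = "\\.csv$",
                    full.names = TRUE)
for (f in files) {
  dbWriteTable(con, "results", read.csv(f), append = TRUE)
}

## columnar storage keeps aggregates over a few columns fast
dbGetQuery(con,
  "SELECT param1, AVG(estimate) AS m FROM results GROUP BY param1")

dbDisconnect(con, shutdown = TRUE)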

Good luck,
Cristian


Cristian Bologa, Ph.D.
Research Professor,
Div. of Translational Informatics, 
Dept. of Internal Medicine,
Univ. of New Mexico, School of Medicine,
Innovation Discovery & Training Center, MSC09 5025, 
700 Camino de Salud NE, Albuquerque, NM 87131
Telephone: +1 (505) 925-7534
Fax: +1 (505) 925-7625



-----Original Message-----
From: R-sig-hpc [mailto:r-sig-hpc-bounces at r-project.org] On Behalf Of romunov
Sent: Wednesday, March 01, 2017 3:58 PM
To: Paul Johnson <pauljohn32 at gmail.com>
Cc: R SIG High Performance Computing <r-sig-hpc at r-project.org>
Subject: Re: [R-sig-hpc] 48K csv files, 1000 lines each. How to redesign? (big picture)

Upon a cursory read, an SQLite db can handle up to 140 TB, so I think your 40 GB are safe. I would advocate SQLite because it's one file, easy to understand, requires no installation, and is supported by R's sqldf package (maybe more now).
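
Roughly like this with RSQLite (the directory, table, and column
names below are invented for illustration):

library(DBI)
library(RSQLite)

## one database file holds everything
con <- dbConnect(RSQLite::SQLite(), "simulations.sqlite")

## append each of the 48K CSV files into a single table
for (f in list.files("csv-dir", pattern = "\\.csv$", full.names = TRUE)) {
  dbWriteTable(con, "results", read.csv(f), append = TRUE)
}

## an index on the parameter columns speeds up later chunk retrievals
dbExecute(con, "CREATE INDEX idx_params ON results (param1, param2)")

## sqldf can then query the same file, e.g.
## sqldf("SELECT AVG(estimate) FROM results", dbname = "simulations.sqlite")

dbDisconnect(con)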

Cheers,
Roman

On Mar 1, 2017 11:52 PM, "Paul Johnson" <pauljohn32 at gmail.com> wrote:

> Hi
>
> I asked this question on Stack Overflow and 3 people voted it up
> within 5 minutes, and then the admins froze it because it is too
> interesting. Sorry. It is too broad and not specific. If you have
> ever tried to store a giant collection of simulation exercises and
> lived to tell the tale, I would be glad to hear about your
> experience. (Here it was:
> http://stackoverflow.com/questions/42394583/48k-csv-files-1000-lines-each-how-to-redesign-the-data-storage)
>
> This is it:
>
> One of the people that I help decided to scale up a simulation 
> exercise to massive proportions. The usual sort of thing we do has
> 100 conditions with 1000 runs each, and the result "easily" fits
> into a single file or data frame. We do that kind of
> thing with SAS, R, or Mplus. This one is in R. I should have seen 
> trouble coming when I heard that the project was failing for lack of 
> memory. We see that sometimes with Bayesian models, where holding all 
> of the results from chains in memory becomes too demanding. The fix in 
> those cases has been to save batches of iterations in separate files.
> Without paying attention to details, I suggested they write smaller 
> files on disk as the simulation proceeds.
>
> Later, I realized the magnitude of my error. They had generated 48,000 
> output CSV files, in each of which there are 1000 lines and about 80 
> columns of real numbers. These are written out in CSV files because 
> the researchers are comfortable with data they can see. Again, I was 
> not paying attention when they asked me how to analyze that. I was 
> thinking small data, and told them to stack up the csv files using a 
> shell script. The result is a 40+GB csv file. R can't hope to open 
> that on the computers we have around here.
>
> I believe/hope that the analysis will never need to use all 40GB of 
> data in one regression model :) I expect it is more likely they will 
> want to summarize smaller segments. The usual exercise of this ilk
> has 3-5 columns of simulation parameters and then 10 columns of
> results from the analysis. In this project, the result is much more
> massive because they have 10 columns of parameters, and all of the
> mix-and-match combinations made the project expand.
>
> I believe that the best plan is to store the data in a database-like
> structure. I would like your advice about which approach to take.
>
> MySQL? It does not feel as open as it used to be, so I'm not too
> enthusiastic.
>
> PostgreSQL? It seems more and more popular, but I have never
> administered a server before.
>
> SQLite3? Some admins here supply us with data for analysis in that
> format, but we have never received anything larger than 1.5GB.
>
> HDF5 (maybe netCDF)? It used to be (say, 2005) that these
> specialized, science-style container formats worked well. However, I
> have not heard them mentioned since I started helping the social
> science students. Back when R started, we were using HDF5, and one
> of my friends wrote the original R code to interact with it.
>
> My top priority is rapid data retrieval. I think if one of the 
> technicians can learn to retrieve a rectangular chunk, we can show 
> researchers how to do the same.
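>
> To be concrete, the kind of retrieval I have in mind looks roughly
> like this in DBI syntax, whichever backend we pick (SQLite as an
> example here; the table and column names are invented):
>
> library(DBI)
> con <- dbConnect(RSQLite::SQLite(), "simulations.sqlite")
> ## pull just the rectangular chunk for one simulation condition
> chunk <- dbGetQuery(con, "SELECT * FROM results
>                           WHERE n_obs = 1000 AND rho = 0.5")
> dbDisconnect(con)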
>
>
> Warm Regards
> PJ
> --
> Paul E. Johnson   http://pj.freefaculty.org
> Director, Center for Research Methods and Data Analysis 
> http://crmda.ku.edu
>
> To write to me directly, please address me at pauljohn at ku.edu.

_______________________________________________
R-sig-hpc mailing list
R-sig-hpc at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-hpc


