[R-pkg-devel] Advice on in-RAM out of RAM (MonetDB) in data import package

Lucas Ferreira Mation lucasmation at gmail.com
Mon Jul 11 16:28:04 CEST 2016


I am writing a package that imports most of the Brazilian socio-economic
micro datasets
(microdadosBrasil <https://github.com/lucasmation/microdadosBrasil>). The
idea of the package is to make data import very simple, so that even users
with very little R programming knowledge can use the data easily.
Although I would like to have decent performance, the first concern is
usability.

The package imports data into an in-memory data.table object.
I am now trying to implement support for out-of-memory datasets using
MonetDBLite.

Is there a (non-OS-dependent) way to predict whether a dataset will fit
into memory? Ideally the package would ask the computer for the maximum
amount of RAM that R can use. The package would then default to MonetDBLite
if the available RAM is smaller than 3x the in-memory size of the dataset.
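
A rough sketch of the size-estimation half (assuming a delimited text file
readable by data.table::fread; fixed-width files would need the package's
own reader, and the scaling is deliberately crude):

    ## Read a small sample, measure its in-memory footprint, and scale
    ## by the ratio of total file size to the sample's size on disk.
    estimate_inmem_bytes <- function(path, sample_rows = 1000L) {
      sample_dt  <- data.table::fread(path, nrows = sample_rows)
      mem_bytes  <- as.numeric(utils::object.size(sample_dt))
      sample_txt <- readLines(path, n = sample_rows + 1L)   # +1: header line
      disk_bytes <- sum(nchar(sample_txt, type = "bytes") + 1L)  # +1 per line: newline
      mem_bytes * file.size(path) / disk_bytes
    }

For the available-RAM half I do not know of a base-R, OS-independent
answer; the 'memuse' package (Sys.meminfo()) looks like one candidate,
though I have not verified its behaviour across platforms.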

There will also be an argument for the user to choose whether to use
in-RAM or out-of-RAM storage, but if that argument is not provided the
package would choose for them.
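
Something like this dispatch could be the default, where
available_ram_bytes() is a hypothetical helper (however it ends up being
implemented) and estimate_inmem_bytes() is the sketch above:

    choose_backend <- function(path, backend = NULL, safety_factor = 3) {
      if (!is.null(backend)) return(backend)   # explicit user choice wins
      need <- safety_factor * estimate_inmem_bytes(path)
      if (available_ram_bytes() > need) "data.table" else "MonetDBLite"
    }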

In any case, does that seem reasonable? Or should I force the user to be
aware of this choice?

Another option would be to default to MonetDB (unless the user explicitly
asks for in-memory data). Is MonetDB performance so good that it would not
make much of a difference?

Another disadvantage of the MonetDB default is that the user will not be
able to run base-R data manipulation commands. They will have to use dplyr
(which is great and simple) or SQL queries (which few users will know).
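
For illustration, this is roughly what the out-of-RAM path would look like
to a user via DBI (the table and column names are made up, and the exact
connection arguments should be checked against the MonetDBLite docs):

    library(DBI)
    ## Open an embedded MonetDBLite database stored in a local directory
    con <- dbConnect(MonetDBLite::MonetDBLite(), "~/microdados_db")
    res <- dbGetQuery(con,
      "SELECT uf, AVG(renda) AS renda_media FROM pnad2014 GROUP BY uf")
    dbDisconnect(con, shutdown = TRUE)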

Regards,
Lucas
