[R] package for saving large datasets in ASCII
Ott Toomet
siim at obs.ee
Sun Aug 11 14:51:33 CEST 2002
Hi,
I am continuing the discussion about saving dataframes in ASCII.
I have not overlooked the blocksize argument of write.matrix(), but
which size is sensible? I assumed that blocksize=1 is the most
memory-efficient, but (on a smaller example) I experimented with
different sizes. Initially, speed increased slightly, but it seemed to
be constant or even decreasing from a value of around 5.
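A minimal sketch of how such a blocksize experiment could be run. The data frame d and the list of block sizes are made-up stand-ins, not the data I actually used, and the timings will of course differ from machine to machine:

```r
library(MASS)  # provides write.matrix()

# Hypothetical stand-in data: 200 rows, 50 numeric columns.
set.seed(1)
d <- as.data.frame(matrix(rnorm(200 * 50), nrow = 200))

# Time write.matrix() for several block sizes, writing to a fresh temp file each time.
blocksizes <- c(1, 2, 5, 10, 50)
elapsed <- sapply(blocksizes, function(b) {
  f <- tempfile()
  on.exit(unlink(f))  # remove the temp file when this call finishes
  system.time(write.matrix(d, file = f, sep = ",", blocksize = b))[["elapsed"]]
})
names(elapsed) <- blocksizes
print(elapsed)
```

On a data set this small the differences may be within timing noise, so repeating the runs is probably needed before drawing conclusions.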
The problem for me is not the speed for small dataframes but the fact
that I was not able to save a large dataframe at all. I think the
reason is associated with the first line of write.matrix() which is
x <- as.matrix(x)
This converts the whole dataframe into a new ASCII matrix, a process which
is both slow and memory-consuming if the original object is large. The
second place I am not sure about is the lines
cat(format(t(x[nlines + (1:nb), ])), file = file,
    append = TRUE, sep = c(rep(sep, p - 1), "\n"))
Isn't t(x[...]) creating new temporary objects?
Or have I misunderstood something?
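To illustrate the concern, here is a rough sketch of a writer that avoids converting the whole dataframe up front. The function write_in_blocks() is hypothetical (it is neither part of MASS nor the C code in my package); it only uses write.table() with append = TRUE on small row slices, so temporary objects are still created, but only for one block at a time:

```r
# Hypothetical helper, not part of MASS or base R: write a data frame in row
# blocks so that only `blocksize` rows are converted to text at any one time.
write_in_blocks <- function(x, file, sep = ",", blocksize = 100) {
  n <- nrow(x)
  # Header line first; this truncates any existing file.
  cat(names(x), file = file, sep = c(rep(sep, ncol(x) - 1), "\n"))
  start <- 1
  while (start <= n) {
    end <- min(start + blocksize - 1, n)
    # Only this slice is copied and formatted, not the whole data frame.
    write.table(x[start:end, , drop = FALSE], file = file, append = TRUE,
                sep = sep, quote = FALSE, row.names = FALSE, col.names = FALSE)
    start <- end + 1
  }
  invisible(n)
}

# Made-up example data with mixed column types.
d <- data.frame(id = 1:5,
                grp = c("a", "b", "a", "c", "b"),
                y = c(0.5, 1.5, 2.5, 3.5, 4.5))
is.character(as.matrix(d))  # mixed columns are converted to a character matrix

f <- tempfile()
write_in_blocks(d, f, blocksize = 2)
back <- read.csv(f)
```

This trades some speed for a bounded amount of temporary memory; it says nothing about how write.matrix() behaves internally beyond the lines quoted above.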
BTW, is there any way to check the memory consumption of individual
objects and functions?
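As far as I know, base R offers object.size() for a single object and gc() for overall usage; whether these are fine-grained enough for what I need is a separate question. A small sketch on made-up data:

```r
# Made-up data: one numeric and one character column.
d <- data.frame(x = rnorm(1000),
                g = sample(letters, 1000, replace = TRUE))

# Estimated size of individual objects, in bytes.
size_df  <- object.size(d)
size_mat <- object.size(as.matrix(d))
print(size_df)
print(size_mat)

# Rough per-call accounting: reset the "max used" counters, run the code of
# interest, then read the gc() summary again.
gc(reset = TRUE)
m <- as.matrix(d)
usage <- gc()
print(usage)
```

The gc() figures cover the whole R session, so they only give an upper bound on what a single function call needed.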
Best wishes,
Ott
On Sat, 10 Aug 2002 ripley at stats.ox.ac.uk wrote:
?write.matrix will tell you what you have overlooked, a sensible
blocksize.

If `I am not sure about write.matrix()', surely reading the help page is a
first step?

On Sat, 10 Aug 2002, Ott Toomet wrote:

> Hi,
>
> I have made a tiny package for saving dataframes in ASCII format. The
> package contains functions save.table() and save.delim(), the first
> mimics (not completely) write.table() and the second uses just
> different default values, suitable for read.delim().
>
> The reason I have written the functions is that I have had problems
> with saving large dataframes in ASCII form. write.table() essentially
> makes a huge string in memory from the dataframe. I am not sure about
> write.matrix() (in MASS), but in my experience it is too
> memory-intensive as well. My approach was to write the whole thing in C
> in such a way that the function takes the values from the dataframe, one
> scalar value at a time, and writes them immediately to the file. This,
> of course, puts certain limitations on the contents of dataframe and
> output format.
>
> Here is an example of the result:
>
> > dim(e2000)
> [1] 7505 1197
> > library(savetable)
> > system.time(save.table(e2000, "e2000"))
> [1] 38.04 0.48 48.75 0.00 0.00
> > library(MASS)
> > system.time(write.matrix(e2000, "e2000", sep=",", 1))
>
> killed after 10 minutes of swapping.
>
> And now a smaller example:
>
> > dim(e2000s)
> [1] 100 1197
> > library(savetable)
> > system.time(save.table(e2000s, "e2000s"))
> [1] 0.45 0.00 0.56 0.00 0.00
> > system.time(write.table(e2000s, "e2000s"))
> [1] 31.21 0.11 38.99 0.00 0.00
> > library(MASS)
> > system.time(write.matrix(e2000s, "e2000s", sep=",", 1))
> [1] 4.01 0.66 5.45 0.00 0.00
>
> None of the functions started swapping now, but as you can see,
> save.table() is still around 10 times as fast as write.matrix().
> Examples are on my 128MB PII400 linux system and R 1.4.0.
>
> I am not sure if there is much interest for such a package, so I put
> it on my own website instead of CRAN
> (http://www.obs.ee/~siim/savetable_0.1.0.tar.gz). Any feedback is
> appreciated.
>
> Many thanks to Brian Ripley and the others who helped me with accessing R
> objects in C.
>
>
> Best wishes,
>
> Ott Toomet