[Rd] How allocate STRSXP outside of gc

Vadim Ogranovich vograno at evafunds.com
Thu Apr 14 19:57:13 CEST 2005


Yes, and space sharing also improves speed since gc() does not need to
collect so many objects.

I thought about more efficient formats for my data, but:
* ASCII is ubiquitous. You have grep, head, perl, etc. to work with it
* AFAIK, there is no industry-standard binary format with a mature
supporting C library (especially when the data needs to be compressed).
I considered HDF and netcdf.
* the programs that collect my data store it in ASCII. It is
advantageous to be able to read it directly from the original files. (I
have about 200G of these compressed)
* C code was able to read the data at a decent speed; it was R's
overhead that was causing problems. One source was mkChar, the other
was how chars are read from a connection. I detailed my findings in a
message to r-devel.

I tried to see if I could improve the original R code for IO, but for
various reasons decided that I wouldn't be able to accomplish this. In
the end I decided to write a custom R IO package which came close to the
speed of raw C code (the difference is largely due to the lookup
overhead).
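
The lookup-before-allocate idea discussed in this thread can be sketched
in plain C as follows. This is only an illustration under stated
assumptions, not the code from the package or from scan: make_string is
a hypothetical stand-in for an expensive allocator such as mkChar, and
str_cache, cached_string, CACHE_SLOTS and n_allocs are names invented
for the example. In real R code the cached CHARSXPs would also have to
be protected from the garbage collector, which the sketch ignores.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

#define CACHE_SLOTS 1024  /* hypothetical fixed size; must be a power of two */

/* Counts how often the expensive allocator runs (for illustration only). */
unsigned long n_allocs = 0;

/* Hypothetical stand-in for an expensive allocator such as mkChar. */
const char *make_string(const char *s) {
    size_t n = strlen(s) + 1;
    char *copy = malloc(n);
    if (copy == NULL) abort();
    memcpy(copy, s, n);
    n_allocs++;
    return copy;
}

/* Open-addressing hash table of strings already made; NULL marks an empty slot. */
typedef struct {
    const char *slot[CACHE_SLOTS];
    size_t used;
} str_cache;

/* FNV-1a style hash of a NUL-terminated string. */
size_t hash_str(const char *s) {
    size_t h = 2166136261u;
    for (; *s; s++) {
        h ^= (unsigned char)*s;
        h *= 16777619u;
    }
    return h;
}

/* Return a shared copy of s, calling make_string only on the first sighting. */
const char *cached_string(str_cache *c, const char *s) {
    assert(c != NULL && s != NULL);
    size_t i = hash_str(s) & (CACHE_SLOTS - 1);
    while (c->slot[i] != NULL) {
        if (strcmp(c->slot[i], s) == 0)
            return c->slot[i];            /* hit: reuse the existing string */
        i = (i + 1) & (CACHE_SLOTS - 1);  /* collision: linear probing */
    }
    /* Miss. Once the table is half full, stop caching and just allocate,
       which also guarantees the probe loop above always finds an empty slot. */
    if (c->used >= CACHE_SLOTS / 2)
        return make_string(s);
    c->slot[i] = make_string(s);
    c->used++;
    return c->slot[i];
}
```

With a zero-initialized str_cache, calling cached_string repeatedly on
the same text returns the same pointer and runs the allocator once, so
a column with few distinct values costs one hash and one strcmp per
item instead of one allocation per item.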

Thanks,
Vadim

> -----Original Message-----
> From: Prof Brian Ripley [mailto:ripley at stats.ox.ac.uk] 
> Sent: Thursday, April 14, 2005 12:02 AM
> To: Vadim Ogranovich
> Cc: Jan T. Kim; r-devel at stat.math.ethz.ch
> Subject: RE: [Rd] How allocate STRSXP outside of gc
> 
> On Wed, 13 Apr 2005, Vadim Ogranovich wrote:
> 
> > mkChar is a rather expensive call since it allocates a new R object.
> > For example in reading char data from a file it is often advantageous
> > to first try to look up an already made R string and only then use
> > mkChar. That is, the overhead of the lookup is usually smaller than
> > that of mkChar.
> 
> Yes (and that is one reason why scan in 2.1.0 uses lookups, space
> sharing being the other), but both are really fast and this only comes
> into play with hundreds of millions of items. (On my machine mkChar
> takes about 200 ns, hardly `rather expensive'.)  And if you have that
> much data, why not store it in a more efficient format?
> 
> -- 
> Brian D. Ripley,                  ripley at stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272866 (PA)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
>


