[R] fast mkChar

Vadim Ogranovich vograno at evafunds.com
Tue Jun 8 23:56:45 CEST 2004


I am no expert in memory management in R so it's hard for me to tell
what is and what is not doable. From reading the code of allocVector()
in memory.c I think that the critical part is to vectorize
CLASS_GET_FREE_NODE and use the vectorized version along the lines of
the code fragment below (taken from memory.c).

	if (node_class < NUM_SMALL_NODE_CLASSES) {
	    CLASS_GET_FREE_NODE(node_class, s); 

If this is possible then the rest is just a matter of code refactoring.

By vectorizing I mean writing a macro CLASS_GET_FREE_NODE2(node_class,
s, n) which in one go allocates n little objects of class node_class and
"inscribes" them into the elements of vector s, which is assumed to be
long enough to hold these objects.
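
To make the batching idea concrete, here is a minimal sketch in plain C.
It is a toy model and not the code in memory.c: node_t, pool_init(),
get_free_node() and get_free_nodes() are hypothetical names invented for
illustration, and the pool is a fixed static array rather than R's paged
heap. The point is only that the batch version checks availability once
for all n nodes instead of once per node.

```c
#include <assert.h>
#include <stddef.h>

/* Toy stand-in for an R node; NOT the real SEXPREC layout. */
typedef struct node {
    struct node *next;   /* free-list link */
    int node_class;      /* size class the node was handed out for */
} node_t;

#define POOL_SIZE 1024

static node_t pool[POOL_SIZE];
static node_t *free_list = NULL;
static size_t free_count = 0;

/* Thread every node of the static pool onto the free list. */
static void pool_init(void) {
    free_list = NULL;
    free_count = 0;
    for (size_t i = 0; i < POOL_SIZE; i++) {
        pool[i].next = free_list;
        pool[i].node_class = 0;
        free_list = &pool[i];
        free_count++;
    }
}

/* Scalar version: one availability check and one unlink per call. */
static node_t *get_free_node(int node_class) {
    if (free_list == NULL)
        return NULL;     /* a real allocator would run the collector here */
    node_t *s = free_list;
    free_list = s->next;
    free_count--;
    s->next = NULL;
    s->node_class = node_class;
    return s;
}

/* Hypothetical batch version: a single availability check covers all n
   nodes. Writes the nodes into s[0..n-1] and returns n, or returns 0
   without touching s if fewer than n nodes are free. */
static size_t get_free_nodes(int node_class, node_t **s, size_t n) {
    assert(s != NULL);
    if (n > free_count)
        return 0;
    for (size_t i = 0; i < n; i++) {
        node_t *p = free_list;
        free_list = p->next;
        p->next = NULL;
        p->node_class = node_class;
        s[i] = p;
    }
    free_count -= n;
    return n;
}
```

In this sketch the caller would call pool_init() once and then
get_free_nodes(node_class, s, n) with an array s of at least n pointers.
Whether the real macro can be batched this way depends on how it
interacts with the collector, which I have not verified.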

If this is doable then the only missing piece would be a new function
setChar(CHARSXP rstr, const char * cstr) which copies 'cstr' into 'rstr'
and (re)allocates the heap memory if necessary. Here setChar() is safe
since the s[i] are all brand new and thus are not shared with any other
object.
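
A rough sketch of the copy-and-grow behaviour I have in mind, again in
plain C and under heavy assumptions: toy_charsxp is a made-up struct that
owns a malloc'ed buffer, not a real CHARSXP, and set_char() is a
hypothetical name. A real version would have to go through R's allocator
instead of realloc().

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Toy stand-in for a CHARSXP: an owned, growable byte buffer. */
typedef struct {
    char  *data;      /* NUL-terminated contents, or NULL when empty */
    size_t capacity;  /* bytes allocated, including the terminator */
    size_t length;    /* strlen() of the stored string */
} toy_charsxp;

/* Copy cstr into rstr, growing the buffer only when it is too small.
   Returns 0 on success, -1 if the allocation fails (rstr is unchanged). */
static int set_char(toy_charsxp *rstr, const char *cstr) {
    assert(rstr != NULL && cstr != NULL);
    size_t need = strlen(cstr) + 1;
    if (need > rstr->capacity) {
        /* realloc(NULL, need) behaves like malloc(need) */
        char *p = realloc(rstr->data, need);
        if (p == NULL)
            return -1;
        rstr->data = p;
        rstr->capacity = need;
    }
    memcpy(rstr->data, cstr, need);   /* includes the terminator */
    rstr->length = need - 1;
    return 0;
}
```

Overwriting in place like this is only acceptable because each object is
freshly allocated and nothing else can hold a reference to it yet.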



> -----Original Message-----
> From: Peter Dalgaard [mailto:p.dalgaard at biostat.ku.dk] 
> Sent: Tuesday, June 08, 2004 1:23 PM
> To: Vadim Ogranovich
> Cc: R-Help
> Subject: Re: [R] fast mkChar
> 
> "Vadim Ogranovich" <vograno at evafunds.com> writes:
> 
> > Hi,
> >  
> > To speed up reading of large (few million lines) CSV files I am
> > writing custom read functions (in C). By timing various approaches I
> > figured out that one of the bottlenecks in reading character fields is
> > the mkChar() function which on each call incurs a lot of
> > garbage-collection-related overhead.
> >
> > I wonder if there is a "vectorized" version of mkChar, say
> > mkChar2(char **, int length) that converts an array of C strings to a
> > string vector, which somehow amortizes the gc overhead over the
> > entire array?
> >
> > If no such function exists, I'd appreciate any hint as to how to write
> > it.
> 
> The real issue here is that character vectors are implemented 
> as generic vectors of little R objects (CHARSXP type) that 
> each hold one string. Allocating all those objects is 
> probably what does you in.
> 
> The reason behind the implementation is probably that doing 
> it that way allows the mechanics of the garbage collector to 
> be applied directly (CHARSXPs are just vectors of bytes), but 
> it is obviously wasteful in terms of total allocation. If you 
> can think up something better, please say so (but remember 
> that the memory management issues are nontrivial).
> 
> -- 
>    O__  ---- Peter Dalgaard             Blegdamsvej 3  
>   c/ /'_ --- Dept. of Biostatistics     2200 Cph. N   
>  (*) \(*) -- University of Copenhagen   Denmark      Ph: (+45) 35327918
> ~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)             FAX: (+45) 35327907
> 
>