[Rd] data frame subset patch, take 2
Vladimir Dergachev
vdergachev at rcgardis.com
Wed Dec 13 19:03:21 CET 2006
On Wednesday 13 December 2006 6:01 am, Martin Maechler wrote:
>
> - Vladimir, have you verified your 'take2' against recent versions
> of R-devel?
Yes.
>
> - If they still work, could you re-post them to R-devel, this
> time using a proper MIME type,
> i.e. most probably one of
> application/x-tar
> application/x-compressed-tar
> application/x-gzip
>
> In case you don't know how to achieve this,
> I'd be interested to get it by "private" e-mail.
No problem. The old e-mail did have a mime type: "text/x-diff".
I am resending the patch - now compressed, hopefully it will get pass whatever
filters are in place.
With regard to speedups in R, here is my wish list - I would greatly
appreciate comments on what makes sense here or not, etc:
1. I greatly miss equivalents of Tcl append and lappend commands - not
the function performed by these commands but efficiency (they are O(1) on
average). Tcl easily handles lists with 1e6 components and strings of 10s of
megabytes in length.
2. It would be nice to have true hashed arrays in R (i.e. O(1) access
times). So far I have used named lists for this, but they are O(n):
> L<-list(); system.time(for(i in 1:10000)L[[paste(i)]]<-i);
[1] 2.864 0.004 2.868 0.000 0.000
> L<-list(); system.time(for(i in 1:20000)L[[paste(i)]]<-i);
[1] 11.789 0.216 12.004 0.000 0.000
3. Efficient manipulation of large numbers of strings. The big reason
character row.names are slow is because they require a large number of string
objects that slow down garbage collector.
This is possibly not a problem that has an easy solution, here are a
couple of approaches I have considered:
a) Inline strings - use a structure like
union {
struct {
unsigned char size;
char body[15];
} inlined_string; /* use this when size<16 */
struct {
unsigned char flag;
char reserved[7]; /* for 64 bit */
CHRSXP ptr;
} indirect_string; /* use this when flag=255 */
}
This basically turns small strings into an enum type stored
within a 128-bit integer. This would greatly decrease required number of
CHRSXP in many common cases (in particular for many rownames).
The biggest disadvantage is more complicated access to string
data. Also this does not solve the issue of how to deal with 1e6 long
strings - though I feel like 15 characters should be good enough for most
uses.
b) CHRSXPs are always leaf nodes. One could implement true reference counting
and create a separate garbage collector pool for them. This way one can rely
on reference counting to free string objects during normal operation, but
also keep track of the number of referenced strings during garbage collector
passes - and trigger string garbage collection passes (with a warning) when
the number of referenced strings is much smaller the number of objects in
string pool.
This gets rid of overhead that strings impose on garbage
collector. The disadvantage are very large changes to R code.
best
Vladimir Dergachev
-------------- next part --------------
A non-text attachment was scrubbed...
Name: subset.patch.2.diff.gz
Type: application/x-gzip
Size: 1474 bytes
Desc: not available
Url : https://stat.ethz.ch/pipermail/r-devel/attachments/20061213/33598b1d/attachment.gz
More information about the R-devel
mailing list