[Rd] (PR#8192) [ subscripting sometimes loses names
Tim Hesterberg
TimHesterberg at gmail.com
Sun Feb 1 18:25:50 CET 2009
>...
>Simon, no, the drop=FALSE argument has nothing to do with what
>Christian was talking about. The kind of thing he meant is PR# 8192,
>"Subject: [ subscripting sometimes loses names":
>
> http://bugs.r-project.org/cgi-bin/R/wishlist?id=8192
>
>In R, subscripting with "[" USUALLY retains names, but R has various
>edge cases where it (IMNSHO) inappropriately discards them. This
>occurs with both .Primitive("[") and "[.data.frame". This has been
>known for years, but I have not yet tried digging into R's
>implementation to see where and how the names are actually getting
>lost.
>
>Incidentally, versions of S-Plus since approximately S-Plus 6.0 back
>in 2001 show similar buggy edge case behavior. Older versions of
>S-Plus, c. S-Plus 3.3 and earlier, had the correct, name preserving
>behavior. I presume that the original Bell Labs S had correct
>name-preserving behavior, and then the S-Plus developers broke it
>sometime along the way.
(Later comments on the thread pointed out the difference between
x[,1] for matrices and data frames.)
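For readers who have not seen that difference, here is a minimal sketch in base R. The objects m and df are invented for illustration and do not come from the thread.

```r
# Hypothetical example objects; only base R is used.
m <- matrix(1:6, nrow = 3,
            dimnames = list(c("a", "b", "c"), c("x", "y")))
df <- as.data.frame(m)   # same data and row names, stored as a data frame

m_col  <- m[, 1]    # matrix column: the row names come along as names
df_col <- df[, 1]   # data frame column: a plain vector without names

print(names(m_col))
print(names(df_col))
```

The matrix result carries the row names; the data frame result does not, although the data frame itself still has them as row names.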
I rewrote the S-PLUS data frame code around then, to fix
various inconsistencies and improve efficiency.
This was probably my change, and I would do it again.
Note that the components of a data frame do not have names
attached to them; the row names are a separate object.
Extracting a component vector or matrix from a data frame should not
attach names to the result, because of:
* memory (attaching row names to an object can more than double the
size of the object),
* speed,
* some objects cannot take names, and attaching them could change
the class and other behavior of an object, and
* the names are usually/often (depending on the user) meaningless,
artifacts of an early design decision that all data frames have row names.
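The memory point can be checked with a rough sketch like the following. The vector length and the "row" prefix are arbitrary assumptions, and object.size() is only an estimate, so the exact ratio will vary.

```r
# Hypothetical sizes; object.size() gives an estimate, not an exact figure.
n <- 1e5
plain <- numeric(n)

named <- plain
names(named) <- paste0("row", seq_len(n))   # one distinct name per element

plain_size <- as.numeric(object.size(plain))
named_size <- as.numeric(object.size(named))

# Ratio of named to unnamed size
print(named_size / plain_size)
```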
Data frames differ from matrices in two ways that matter here:
* columns in matrices are all the same kind, and are simple objects
(numeric, etc.), whereas components of data frames can be nearly
arbitrary objects, and
* row names get added to a data frame whether a user wants them or not,
whereas row names on a matrix have to be specified.
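The second point is easy to see in a short sketch (m and df are made-up objects):

```r
# Neither object is given row names by the user.
m  <- matrix(1:4, nrow = 2)
df <- data.frame(x = 1:2, y = 3:4)

m_rn  <- rownames(m)    # NULL: a matrix has no row names unless you set them
df_rn <- rownames(df)   # "1" "2": the data frame reports row names anyway

print(m_rn)
print(df_rn)
```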
A historical note - unique row names on data frames were a design
decision made when people worked with small data frames, and are
convenient for small data frames. But they are a problem for large
data frames. I was writing for all users, not just those with small
data frames and meaningful names.
I like R's 'automatic' row names. This is a big help working with
huge data frames (and I do this often, at Google). But this doesn't
go far enough; subscripting and other operations sometimes convert the
automatic names to real names, and check/enforce uniqueness, which is
a big waste of time when working with large data frames. I'll comment
more on this in a new thread.
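A small sketch of what this can look like in base R. It assumes that .row_names_info() reports automatic row names with a negative count, as its help page describes; df and the row indices are invented for illustration, and the details may differ between R versions.

```r
df <- data.frame(x = 1:5)

# Negative count: row names are 'automatic' (stored compactly).
auto_info <- .row_names_info(df)

# After row subscripting, the row names are stored explicitly.
sub <- df[c(2, 4, 5), , drop = FALSE]
sub_info <- .row_names_info(sub)

# Repeating a row forces the row names to be made unique.
dup <- df[c(1, 1, 2), , drop = FALSE]
dup_names <- rownames(dup)

print(auto_info)
print(sub_info)
print(dup_names)
```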
Tim Hesterberg