[Rd] row.names in data.frame

Mon Apr 17 18:26:10 CEST 2006

On Mon, 17 Apr 2006, Martin Maechler wrote:

> Thanks a lot, Brian,
> I've been very happy with your proposal, but haven't yet looked
> at the details of the R-devel implementation.
> The NEWS entry only mentions *integer* rownames as new feature -
> which is accurate from the user's perspective, as you emphasize
> below.  It might be worth mentioning that the internal
> representation has become even more efficient than integers for
> the case of "1:n".

NEWS says

         The internal storage of row.names = 1:n just records 'n' for
         efficiency with very long vectors.

This bit is rather recent (both the item and the corresponding code).
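
A minimal sketch of what that item means in practice, assuming a current R
release in which the internal helper .row_names_info() is available in base
(the details of the 2006 R-devel code may well have differed; the data frame
here is made up for illustration):

```r
# Hypothetical example data: a 'tall and skinny' data frame with default row names.
n <- 1000000L
df <- data.frame(x = seq_len(n))

# What the user sees: character row names from the accessor,
# integer 1:n from the attribute.
rn_chr <- row.names(df)
rn_int <- attr(df, "row.names")

# What is stored: type = 0L returns the internal attribute as is,
# type = 2L the number of rows that it implies.
internal <- .row_names_info(df, type = 0L)
n_rows   <- .row_names_info(df, type = 2L)
```

The point of the sketch is that 'internal' stays a length-2 integer vector
starting with NA however large n is, while row.names() still hands back n
strings on request.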

BTW, this does not obviate the need for Matthew Dowle's data.table 
package, as quite a bit of time is spent manipulating row.names which can 
be saved if you know you do not need them, at the expense of not being 
allowed to use a lot of standard modelling functions.

Brian

>
> Martin
>
>>>>>> "BDR" == Prof Brian Ripley <ripley at stats.ox.ac.uk>
>>>>>>     on Mon, 17 Apr 2006 16:49:21 +0100 (BST) writes:
>
>    BDR> On Mon, 17 Apr 2006, Don MacQueen wrote:
>    >> This looks like a good proposal to me, from an end-user's
>    >> point of view.
>    >>
>    >> I have, from time to time, wished I could set row names
>    >> to NULL. Not for performance reasons, but because some
>    >> aspect of my data, in combination with how R handles row
>    >> names, was requiring me to explicitly manage them in
>    >> situations where I was otherwise making no use of
>    >> them. Admittedly, some of these occasions were quite a
>    >> few R versions ago, when row names were not as carefully
>    >> managed by R itself as they are now.
>    >>
>    >> Potential ramifications are not immediately obvious to
>    >> me, but for example, will rbind() of two data frames,
>    >> both of which have been assigned NULL row names, result
>    >> in a data frame with NULL row names? (Would it matter?)
>    >> What about one with NULL row names and one with non-NULL
>    >> row names?
>
>    BDR> In the user's perspective, there are no NULL row.names,
>    BDR> only integer or character ones.  (It did say *internal
>    BDR> representation*.)  If you rbind one with integer and
>    BDR> one with character, you get character in the result
>    BDR> (just as you do with c()).  There's actually code there
>    BDR> now that is supposed to work with 1:m and 1:n and give
>    BDR> you 1:(n+m), and that does work with integer row.names.
>
>    BDR> There was a snag with the proposal: zero-column data
>    BDR> frames do have a number of rows which is found from the
>    BDR> row.names.  So rather than encode as NULL, 1:n is
>    BDR> encoded as c(as.integer(NA), n), but the user will
>    BDR> never see that.
>
>    BDR> row.names(a_df) <- NULL sets the row.names to 1:n.
>
>    BDR> As of a few hours ago, a test version is in R-devel.
>    BDR> This passes the tests for all CRAN packages
>    BDR> (somewhat to my surprise, but that may in part be
>    BDR> because all existing data frames do have character
>    BDR> vector names).
>
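
The behaviour described in the paragraphs above can be checked with a short
sketch, assuming a current R release (the data frames 'a' and 'b' are made up
for illustration, and the 2006 test version may have behaved differently in
detail):

```r
# Hypothetical example data.
a <- data.frame(x = 1:2)                            # default (1:n) row names
b <- data.frame(x = 3:4, row.names = c("p", "q"))   # character row names

# Default + character: the result carries character row names.
ab <- rbind(a, b)

# Default + default: the result is numbered consecutively.
aa <- rbind(a, a)

# Assigning NULL resets the row names to 1:n.
row.names(b) <- NULL

# A zero-column data frame still knows its number of rows,
# which is why the row count has to be recorded somewhere.
z <- a[integer(0)]
z_dim <- dim(z)
```
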
>    >>  -Don
>    >>
>    >> At 8:29 PM +0100 4/14/06, Prof Brian Ripley wrote:
>    >>> We know from the White Book p.57 that the row names of a
>    >>> data frame `are never NULL and must be unique'.  R
>    >>> documents that row.names() returns a character vector,
>    >>> and in R (much more so than in S) a long character
>    >>> vector of short unique strings is expensive to store (I
>    >>> saw 72 bytes/row on a 64-bit machine for 1:1e6).
>    >>> [Incidentally, in the White Book the index page nos are
>    >>> all off by one for this item, and commonly elsewhere.
>    >>> It seems to be LaTeX indexing the page on which a para
>    >>> finishes.]
>    >>>
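
A rough way to see the storage cost mentioned above, assuming a current R
release: compare the size that object.size() reports for integer row numbers
with that of the same numbers as strings. object.size() only gives an
estimate, and the figures depend on platform and R version, so none are quoted
here; n is an arbitrary illustrative length.

```r
# Arbitrary illustrative length.
n <- 100000L

rn_int <- seq_len(n)              # row numbers as integers
rn_chr <- as.character(rn_int)    # the same row numbers as short unique strings

# Estimated sizes in bytes.
size_int <- as.numeric(object.size(rn_int))
size_chr <- as.numeric(object.size(rn_chr))

# Estimated cost per row of the character form, and how it
# compares with the integer form.
bytes_per_row <- size_chr / n
ratio <- size_chr / size_int
```
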
>    >>> Last time this came up Martin Maechler asked if we could
>    >>> not do it more efficiently, and reminded us recently.
>    >>> It would be fairly easy if everyone used the row.names()
>    >>> and row.names<-() accessor functions, but some packages
>    >>> (notably Design and Hmisc) access the attribute
>    >>> "row.names" directly (and what that is seems to be
>    >>> undocumented).
>    >>>
>    >>> I noticed that the White Book does not appear to say
>    >>> that the row names are character, and indeed says
>    >>>
>    >>> 'If all else fails the row names are just the row
>    >>> numbers.'
>    >>>
>    >>> and it seems the author of expand.grid() took that
>    >>> literally, for it used to assign integers to the row
>    >>> names.  However, the current S-PLUS help for both
>    >>> row.names and data.frame says row names are a character
>    >>> vector (and that row.names<-() coerces to character).
>    >>>
>    >>> We can certainly differentiate between the internal
>    >>> representation and the result of row.names().  Here
>    >>> is my idea:
>    >>>
>    >>> 1) The internal representation is either NULL, an
>    >>> integer vector or a character vector.
>    >>>
>    >>> 2) attr(x, "row.names") will always return either an
>    >>> integer vector or a character vector, using 1:nrow(x) if
>    >>> the internal representation is NULL.
>    >>>
>    >>> 3) row.names() will always return as.character(attr(x,
>    >>> "row.names")).
>    >>>
>    >>> 4) attr<- and row.names<- can set NULL, integer or
>    >>> character.
>    >>>
>    >>> 5) Row-indexing a data frame with NULL or integer
>    >>> representation will give an integer representation.
>    >>>
>    >>> This would appear to be completely back-compatible for
>    >>> those who only work via the accessor functions, and
>    >>> probably work with almost all package code that
>    >>> manipulates attributes directly.  Since the changes can
>    >>> be done almost entirely in C code, the performance hit
>    >>> should be negligible.
>    >>>
>    >>> The benefits will probably only be appreciable with
>    >>> `tall and skinny' data frames, as even 72 bytes per row
>    >>> is only going to buy you 9 numeric columns.  But that is,
>    >>> it seems, a common enough case to make this worthwhile.
>    >>>
>    >>> This would be a change aimed at 2.4.0, since we would
>    >>> need plenty of time both for testing and to alter code
>    >>> to make use of the more efficient representations.
>    >>>
>    >>> BTW, the maximum object length of 2^31 - 1 ensures that
>    >>> an integer representation of row numbers suffices.
>
>

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595