[Rd] [R] data.frame() size

Mon Dec 12 14:11:00 CET 2005

Hin-Tak Leung <hin-tak.leung at cimr.cam.ac.uk> writes:

> Prof Brian Ripley wrote:
> > Data frames have unique row names *by definition* (White Book p.57).
> 
> Yes - I happened to have the White Book on my desk (not mine...)
> - indeed, the first sentence on page 57 is (quoted verbatim; the
> "never" is in italics in the book, which I have marked with "*"
> before and after):
> 
>     If all else fails, the row names are just the row numbers. They
>     are *never* null and must be unique.
> 
> So patching data.frame.R is quite wrong. However, the rowname/colname
> overhead is definitely an issue for processing of large data sets,
> both for speed and amount of memory consumed. So it is probably best
> to extend the data.frame class and call it something else instead,
> for those who need to go that route.

Exactly. I recall hearing from the Insightful people at the DSC in
Seattle that something is going to happen with the rownames in S-PLUS,
or has already happened in the latest release, but I don't remember
exactly how they did it, or whether and how it relates to their
"big dataframe" code. We might want R to follow suit in this respect.

Other options might include doing something about the string storage
of rownames, which is quite wasteful in R: every string is an R
object, so a string vector is really a list of CHARSXP objects. Either
one could improve the internal storage format, or one could allow
rownames to be integers with "virtual string" semantics, so that
x["123",] still works.
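
As a rough sketch of both points, one could compare the storage of
integer and character versions of the same row labels, and check that
character indexing by row number behaves as expected. The size n and
the variable names are made up for illustration, and the exact byte
counts will depend on the R version and platform.

```r
# Illustrative only: n and all variable names are arbitrary choices.
n <- 100000L

int_names <- seq_len(n)               # a single integer vector
chr_names <- as.character(int_names)  # n separate CHARSXP strings

# Compare the storage reported for the two representations.
size_int <- as.numeric(object.size(int_names))
size_chr <- as.numeric(object.size(chr_names))
print(c(integer = size_int, character = size_chr))

# Indexing by the row number given as a string should keep working
# when the row names are just the row numbers.
df <- data.frame(x = rnorm(n))
row_123 <- df["123", , drop = FALSE]
print(row_123)
```

The character version should come out noticeably larger, which is the
overhead that integer rownames with string-like lookup would avoid.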

> (What I am doing is already called a different name so it isn't
> affected by this argument).
> 
> Hin-Tak

-- 
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907