[Rd] [R] data.frame() size
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Mon Dec 12 14:11:00 CET 2005
Hin-Tak Leung <hin-tak.leung at cimr.cam.ac.uk> writes:
> Prof Brian Ripley wrote:
> > Data frames have unique row names *by definition* (White Book p.57).
>
> Yes - I happened to have the White Book on my desk (not mine...)
> - indeed, the first sentence on page 57 is (quote verbatim, the
> "never" is in italic in the book, which I have added the "*" before
> and after):
>
> If all else fails, the row names are just the row numbers. They
> are *never* null and must be unique.
>
> So patching data.frame.R is quite wrong. However, the rowname/colname
> overhead is definitely an issue for processing of large data sets,
> both for speed and amount of memory consumed. So it is probably best
> to extend the data.frame class and call it something else instead,
> for those who needs to go that route.
Exactly. I recall from the Insightful people at the DSC in Seattle
that something is going to happen with the rownames in S-PLUS or has
happened in the latest release, but I don't remember exactly how they
did it, and if and how it had to do with their "big dataframe" code.
We might want R to follow suit in this respect.
Other options might include doing something about the string-storage
of rownames, which is quite wasteful in R (every string is an R
object, a string vector is really a list of CHARSXP objects). Either
one could improve on the internal storage format, or one could allow
rownames to be integers with semantics like "virtual strings" so that
x["123",] still works.
> (What I am doing is already called a different name so it isn't
> affected by this argument).
>
> Hin-Tak
>
>
>
--
O__ ---- Peter Dalgaard Øster Farimagsgade 5, Entr.B
c/ /'_ --- Dept. of Biostatistics PO Box 2099, 1014 Cph. K
(*) \(*) -- University of Copenhagen Denmark Ph: (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk) FAX: (+45) 35327907
More information about the R-devel
mailing list