[Rd] tabular data (was RE: [R] Removing "row.names")
David James <dj@research.bell-labs.com>
Fri, 9 Feb 2001 13:24:50 -0500 (EST)
Hi,
I agree that replacing data.frames for modeling functions would
be too painful. Also I agree with Thomas that new class(es) for
tabular data should not inherit from data.frame, and that data.frames
should conceptually inherit from some other base tabular data class.
At this point I'm not suggesting anything concrete --- I haven't
sorted it out in my own mind --- but I want to point out that (1) we
already have several tabular data classes (not one), including matrices,
arrays, contingency tables, data.frames, etc.; and that (2) I feel we'll
need to address some of the problems we get into when we use data.frames
inappropriately due to lack of other/better data structures. But I
don't think that one single class can truthfully and completely represent
all "tabular data".
For instance, interfaces to XML, spreadsheets, DBMS, etc., will further
expose some of the limitations of these existing objects.
Tables holding data from relational DBMS are an easy case: this class
should preserve the original data as much as possible, i.e., no coercing
into factors, no changing column names, no row names, but otherwise
very similar to data.frames. (Timothy Keitt has a more interesting concept,
"proxyTables", which raises some interesting issues: should proxyTables,
which "point" to remote relations in a DBMS, allow integer indexing? The
relational database model does not support it.)
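A minimal sketch of the "preserve the original data" idea, assuming an R
release whose data.frame() accepts the stringsAsFactors and check.names
arguments. The table contents and column names are made up for illustration,
and this only approximates the class described above, since a data.frame
still carries (automatic) row names:

```r
# Hypothetical rows, as they might come back from a DBMS query.
# The column names are deliberately not syntactic R names.
emp <- data.frame(
  "emp id"    = c(101L, 102L, 103L),
  "last name" = c("Smith", "Jones", "Smith"),
  stringsAsFactors = FALSE,  # keep character data as character
  check.names      = FALSE   # keep column names exactly as given
)

names(emp)                   # column names are unchanged
class(emp[["last name"]])    # character, not factor
```

A dedicated class would make these choices the default rather than something
the caller has to remember on every call.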
Is there something common to all these objects? Obviously they all
support indexing x[i,j, ...] plus the methods dim() and (possibly
NULL) dimnames(). S4 defines vectors as a (virtual) class in
terms of the indexing operation in exactly this way -- thus in S4 lists
are vectors, and so are logicals, characters, etc. We may be able to
group the various "tabular data" classes under such a virtual class,
and provide simple coercion facilities so that users can easily
fit, say, a linear model to data arriving as an XML document or
stored as a table in a DBMS.
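A rough sketch of how such a grouping might look with the S4-style tools in
R's methods package. The class name "tabular" and the generic asModelData()
are hypothetical names chosen for this illustration, not an existing API, and
only two of the classes mentioned above are included:

```r
library(methods)

# "tabular" is a hypothetical virtual class grouping two existing classes
# that already support x[i, j], dim() and dimnames().
setClassUnion("tabular", c("matrix", "data.frame"))

# Hypothetical generic: coerce any "tabular" object to the data.frame
# that the modeling functions expect.
setGeneric("asModelData", function(x, ...) standardGeneric("asModelData"))
setMethod("asModelData", "tabular", function(x, ...) as.data.frame(x, ...))

# A numeric matrix standing in for data obtained from an external source.
m <- cbind(y = c(1.1, 1.9, 3.2, 3.9), x = 1:4)

is(m, "tabular")                         # membership in the virtual class
fit <- lm(y ~ x, data = asModelData(m))  # fit a linear model after coercion
coef(fit)
```

A class for XML- or DBMS-backed tables would join the union and supply its
own asModelData() method; the modeling code itself would not change.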
David A. James
Statistics Research, Room 2C-253 Phone: (908) 582-3082
Bell Labs, Lucent Technologies Fax: (908) 582-3340
Murray Hill, NJ 09794-0636
----------------------------------------------------------------------------
> From: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
> MIME-Version: 1.0
> Content-Transfer-Encoding: 7bit
> Date: Thu, 8 Feb 2001 14:41:03 +0100
> To: David James <dj@research.bell-labs.com>
> Cc: Kurt.Hornik@ci.tuwien.ac.at, tlumley@u.washington.edu,
>     p.dalgaard@biostat.ku.dk, R-devel@r-project.org
> Subject: Re: [Rd] RE: [R] Removing "row.names"
>
> >>>>> David James writes:
>
> >> Date: Wed, 7 Feb 2001 09:33:12 -0800 (PST)
> >> From: Thomas Lumley <tlumley@u.washington.edu>
> >> To: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
> >> cc: Peter Dalgaard BSA <p.dalgaard@biostat.ku.dk>, R-devel@r-project.org
> >> Subject: Re: [Rd] RE: [R] Removing "row.names"
> >> MIME-Version: 1.0
> >>
> >> On Wed, 7 Feb 2001, Kurt Hornik wrote:
> >>
> >> > >>>>> Thomas Lumley writes:
> >> >
> >> > > On Wed, 7 Feb 2001, Kurt Hornik wrote:
> >> > >> >>>>> Peter Dalgaard BSA writes:
> >> > >>
> >> > >> > Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at> writes:
> >> > >> >> names(sampled) <- " "
> >> > >> >> and
> >> > >> >> dimnames(sampled)[[2]] <- " "
> >> > >> >>
> >> > >> >> happily introduce non-unique variable names in the data frame.
> >> > >> >>
> >> > >> >> Is the rule that row.names and names must be unique still on?
> >> > >> >>
> >> > >> >> Argh ...
> >> > >>
> >> > >> > Splus 3.4 dispatches on dimnames<-, but not on names<- with the
> >> > >> > following curious result:
> >> > >>
> >> > >> >> d <- data.frame(a=1:3,b=4:6)
> >> > >> >> names(d)<-c(" "," ")
> >> > >> >> d
> >> > >>
> >> > >> > 1 1 4
> >> > >> > 2 2 5
> >> > >> > 3 3 6
> >> > >> >> dimnames(d)[[1]] <- rep(" ",3)
> >> > >> > Error in "dimnames<-.data.frame"(d, .A0): column names must be unique
> >> > >> > Dumped
> >> > >>
> >> > >> > R dispatches similarly, but doesn't check the dimnames in
> >> > >> > dimnames<-.data.frame. It could do so quite easily. Just add
> >> > >>
> >> > >> > || any(duplicated(d[[1]])) || any(duplicated(d[[2]]))
> >> > >>
> >> > >> > at the appropriate spot.
> >> > >>
> >> > >> Thomas' view about what should be permitted seems to be different.
> >> >
> >> > > I wouldn't object to making it hard to create duplicated names(), but
> >> > > I think it would be a bad idea to have data.frame() make up unique
> >> > > names if it's given non-unique ones.
> >> >
> >> > Maybe `check.names' could also be used for uniqueness testing?
> >> >
> >> > In any case, I think we should specify what *exactly* a data frame is.
> >> >
> >>
> >> I think we should specify, and check.names is a logical way to
> >> allow/forbid non-unique columns.
> >>
> >> Having a new class would be messy: logically it shouldn't inherit from
> >> data.frame, data.frame should inherit from it, but that would be a real
> >> pain to set up.
> >>
>
> > Data frames were originally meant to be used in modeling functions.
> > The opening paragraph in Chapter 3 (Data for Models) in the White Book
> > says:
>
> > "This chapter describes the general structure for data that
> > will be used throughout the book. In particular, it introduces the
> > data frame, a class of objects to represent the data typically encountered
> > in fitting models."
>
> > However, data.frames may not be quite appropriate for representing
> > other types of tabular data (certainly a data.frame does not capture
> > the essence of, say, a "relational" table in the SQL sense, which
> > doesn't have the concept of row names). Several manifestations of
> > this problem are coercing character data to factors "at the drop of a
> > hat" (as someone wrote here or in s-news), the row.names issue now
> > being discussed, problems including general objects in the "cells" of
> > the data.frame, etc.
>
> > I think that the concept of a data.frame to represent data for fitting
> > models is fine, but we may (certainly I) have abused this concept. We
> > need other classes of tabular data objects in addition (not as a
> > replacement) to data.frames, together with coercion methods and
> > perhaps other utilities.
>
> Thomas had said that yes it would be nice to have something with less
> restrictions for modeling, but that it was uneconomical at least to
> introduce a new class that data.frame would then inherit from.
>
> I interpret your comment as suggesting that we introduce a new class for
> holding tabular data? Do you have specific ideas on this?
>
> -k