[Rd] tabular data (was RE: [R] Removing "row.names")

David James <dj@research.bell-labs.com>
Fri, 9 Feb 2001 13:24:50 -0500 (EST)


Hi,

I agree that replacing data.frames for modeling functions would 
be too painful.  Also I agree with Thomas that new class(es) for 
tabular data should not inherit from data.frame, and that data.frames 
should conceptually inherit from some other base tabular data class.  

At this point I'm not suggesting anything concrete (I haven't 
sorted it out in my own mind), but I want to point out that (1) we
already have several tabular data classes (not one), including matrices,
arrays, contingency tables, data.frames, etc.; and that (2) I feel we'll 
need to address some of the problems we get into when we use data.frames 
inappropriately for lack of other or better data structures. But I 
don't think that a single class can truthfully and completely represent 
all "tabular data".
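
As a small illustration of point (1), here is a sketch in base R that puts
four of the existing classes side by side. The object names and the toy
values are made up for the example.

```r
# Four existing "tabular" objects; all values are invented for illustration.
objs <- list(
  matrix     = matrix(1:6, nrow = 2),
  array      = array(1:24, dim = c(2, 3, 4)),
  table      = table(c("a", "b", "a"), c("x", "y", "x")),
  data.frame = data.frame(a = 1:3, b = 4:6)
)

# Number of dimensions reported by dim() for each object.
ranks <- vapply(objs, function(o) length(dim(o)), integer(1))

# None of the first three inherits from data.frame.
is_df <- vapply(objs, inherits, logical(1), what = "data.frame")
```

All four answer dim(), yet only the last is a data.frame, which is why
coercing everything to data.frame loses information.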

For instance, interfaces to XML, spreadsheets, DBMS, etc., will further 
expose some of the limitations of these existing objects.  
Tables holding data from a relational DBMS are an easy case: this class 
should preserve the original data as much as possible, i.e., no coercion 
into factors, no changes to column names, and no row names, but otherwise 
behave very much like data.frames.  (Timothy Keitt has a more interesting 
concept, "proxyTables", that raises some interesting issues: should 
proxyTables, which "point" to remote relations in a DBMS, allow integer 
indexing? The relational database model does not support it.)
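
To illustrate the DBMS case, here is a minimal sketch, assuming a recent R
in which data.frame() accepts the check.names and stringsAsFactors
arguments. The list `fetched` and its column names are hypothetical and
stand in for rows returned by a database driver.

```r
# Hypothetical result set; the column names are not syntactic R names.
fetched <- list("order id" = c(101L, 102L), "ship-to" = c("Oslo", "Lima"))

# Defaults: column names are passed through make.names().
mangled <- data.frame(fetched)
names(mangled)

# Closer to the source: keep names as given and keep strings as character.
kept <- data.frame(fetched, check.names = FALSE, stringsAsFactors = FALSE)
names(kept)
```

Even with both arguments set, the result still carries automatic row names,
so this only approximates a relational table.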

Is there something common to all these objects?  Obviously they all 
support indexing x[i,j, ...] plus the methods dim() and (possibly
NULL) dimnames().  S4 defines vectors as a (virtual) class in 
terms of the indexing operation in exactly this way -- thus in S4 lists 
are vectors, and so are logicals, characters, etc.  We may be able to
group the various "tabular data" classes under such a virtual class,
and provide simple coercion facilities so that users can easily
fit, say, a linear model to data arriving as an XML document or
stored in a table in a DBMS.
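
A rough sketch of what such a grouping could look like with the S4-style
tools in R's methods package. The class names "Tabular" and "DBTable", the
slot `columns`, and the toy data are all hypothetical; this is one possible
arrangement, not a worked-out design.

```r
library(methods)

# Hypothetical virtual class: anything offering x[i, j], dim() and dimnames().
setClass("Tabular", representation("VIRTUAL"))

# Hypothetical concrete class for data fetched from a DBMS:
# a plain list of equal-length columns, with no row names.
setClass("DBTable", contains = "Tabular", representation(columns = "list"))

setMethod("dim", "DBTable", function(x) {
  n <- if (length(x@columns) > 0) length(x@columns[[1]]) else 0L
  c(n, length(x@columns))
})

# Row names are always NULL, as in a relational table.
setMethod("dimnames", "DBTable", function(x) list(NULL, names(x@columns)))

# Indexing always returns another DBTable; `drop` is accepted but ignored.
setMethod("[", "DBTable", function(x, i, j, ..., drop = TRUE) {
  cols <- if (missing(j)) x@columns else x@columns[j]
  if (!missing(i)) cols <- lapply(cols, function(col) col[i])
  new("DBTable", columns = cols)
})

# Simple coercion so that modeling functions can be used unchanged.
setAs("DBTable", "data.frame", function(from)
  data.frame(from@columns, check.names = FALSE, stringsAsFactors = FALSE))

# Toy data, invented for the example.
tab <- new("DBTable", columns = list(x = c(1, 2, 3, 4), y = c(2.1, 3.9, 6.2, 7.8)))
sub <- tab[1:2, "x"]
fit <- lm(y ~ x, data = as(tab, "data.frame"))
```

The data.frame stays what it is for modeling; the other class only needs
to know how to turn itself into one.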

David A. James
Statistics Research, Room 2C-253            Phone:  (908) 582-3082       
Bell Labs, Lucent Technologies              Fax:    (908) 582-3340
Murray Hill, NJ 07974-0636
----------------------------------------------------------------------------
> From: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
> Date: Thu, 8 Feb 2001 14:41:03 +0100
> To: David James <dj@research.bell-labs.com>
> Cc: Kurt.Hornik@ci.tuwien.ac.at, tlumley@u.washington.edu,
>     p.dalgaard@biostat.ku.dk, R-devel@r-project.org
> Subject: Re: [Rd] RE: [R] Removing "row.names"
> 
> >>>>> David James writes:
> 
> >> Date: Wed, 7 Feb 2001 09:33:12 -0800 (PST)
> >> From: Thomas Lumley <tlumley@u.washington.edu>
> >> To: Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at>
> >> cc: Peter Dalgaard BSA <p.dalgaard@biostat.ku.dk>, R-devel@r-project.org
> >> Subject: Re: [Rd] RE: [R] Removing "row.names"
> >> 
> >> On Wed, 7 Feb 2001, Kurt Hornik wrote:
> >> 
> >> > >>>>> Thomas Lumley writes:
> >> > 
> >> > > On Wed, 7 Feb 2001, Kurt Hornik wrote:
> >> > >> >>>>> Peter Dalgaard BSA writes:
> >> > >> 
> >> > >> > Kurt Hornik <Kurt.Hornik@ci.tuwien.ac.at> writes:
> >> > >> >> names(sampled) <- " "
> >> > >> >> and
> >> > >> >> dimnames(sampled)[[2]] <- " "
> >> > >> >> 
> >> > >> >> happily introduce non-unique variable names in the data frame.
> >> > >> >> 
> >> > >> >> Is the rule that row.names and names must be unique still on?
> >> > >> >> 
> >> > >> >> Argh ...
> >> > >> 
> >> > >> > Splus 3.4 dispatches on dimnames<-, but not on names<- with the
> >> > >> > following curious result:
> >> > >> 
> >> > >> >> d <- data.frame(a=1:3,b=4:6)
> >> > >> >> names(d)<-c(" "," ")
> >> > >> >> d
> >> > >> 
> >> > >> > 1 1 4
> >> > >> > 2 2 5
> >> > >> > 3 3 6
> >> > >> >> dimnames(d)[[1]] <- rep(" ",3)  
> >> > >> > Error in "dimnames<-.data.frame"(d, .A0): column names must be unique
> >> > >> > Dumped
> >> > >> 
> >> > >> > R dispatches similarly, but doesn't check the dimnames in
> >> > >> > dimnames<-.data.frame. It could do so quite easily. Just add 
> >> > >> 
> >> > >> > || any(duplicated(d[[1]])) || any(duplicated(d[[2]]))
> >> > >> 
> >> > >> > at the appropriate spot.
> >> > >> 
> >> > >> Thomas' view about what should be permitted seems to be different.
> >> > 
> >> > > I wouldn't object to making it hard to create duplicated names(), but
> >> > > I think it would be a bad idea to have data.frame() make up unique
> >> > > names if it's given non-unique ones.
> >> > 
> >> > Maybe `check.names' could also be used for uniqueness testing?
> >> > 
> >> > In any case, I think we should specify what *exactly* a data frame is.
> >> >
> >> 
> >> I think we should specify, and check.names is a logical way to
> >> allow/forbid non-unique columns.  
> >> 
> >> Having a new class would be messy: logically it shouldn't inherit from
> >> data.frame, data.frame should inherit from it, but that would be a real
> >> pain to set up.
> >> 
> 
> > Data frames were originally meant to be used in modeling functions.
> > The opening paragraph in Chapter 3 (Data for Models) in the White Book
> > says:
>  
> >   "This chapter describes the general structure for data that
> >   will be used throughout the book.  In particular, it introduces the
> >   data frame, a class of objects to represent the data typically encountered
> >   in fitting models."
> 
> > However, data.frames may not be quite appropriate for representing
> > other types of tabular data (certainly a data.frame does not capture
> > the essence of, say, a "relational" table in the SQL sense, which
> > doesn't have the concept of row names).  Several manifestations of
> > this problem are coercing character data to factors "at the drop of a
> > hat" (as someone wrote here or in s-news), the row.names issue now
> > being discussed, problems including general objects in the "cells" of
> > the data.frame, etc.
> 
> > I think that the concept of a data.frame to represent data for fitting
> > models is fine, but we may (certainly I) have abused this concept.  We
> > need other classes of tabular data objects in addition (not as a
> > replacement) to data.frames, together with coercion methods and
> > perhaps other utilities.
> 
> Thomas had said that yes it would be nice to have something with less
> restrictions for modeling, but that it was uneconomical at least to
> introduce a new class that data.frame would then inherit from.
> 
> I interpret your comment as suggesting that we introduce a new class for
> holding tabular data?  Do you have specific ideas on this?
> 
> -k
