[Rd] RFC: type conversion in read.table

John Chambers jmc@research.bell-labs.com
Fri, 24 Aug 2001 09:18:17 -0400


Prof Brian Ripley wrote:
> 
> Currently read.table is rather limited in its type conversion.
> The algorithm is
> 
> 0) Read as character
> 1) Try to convert to numeric. If that works, quit
> 2) Convert to factor unless as.is.
> 
> I am thinking about adding more flexibility and more classes by the
> following two changes.
> 
> A) Anticipating the arrival of classes for all R objects, add an
> argument say `colClasses' that allows the user to specify the desired
> class for every column.  This could default to "auto", or NA if people
> think "auto" might be a relevant class name one day.
> 
> The effect would be equivalent to running
> 
> data[[i]] <- as(data[[i]], colClasses[i])
> 
> instead of
> 
> data[[i]] <- type.convert(data[[i]], as.is = as.is[i], dec = dec)
> 
> except that standard classes such as "numeric", "factor", "logical",
> "character" would be dispatched directly, and argument "dec" would be
> consulted where appropriate.
> 
> colClasses = "character" would suppress all conversions, which cannot
> currently be done.
> 
> B) Make the default "auto" option somewhat cleverer.  I am thinking of
> trying the following in turn
> 
> logical
> integer
> numeric
> complex
> factor   (only if !as.is[i] for backwards compatibility).
> 
> The `dec' option needs to be used for numeric/complex.
> 
> This would be done by a documented typeConvert function, and
> should normally be fast (just look at the first item to rule
> out much of the list).
> 
> This does mean that data frames would be much more likely to end up
> containing integer or logical variables (although they can now).
> I have already fixed model.frame/matrix to handle logical variables,
> and would need to check that they do handle integer variables.
> 
> Questions:
> 
> 1) Is this desirable?

Yes, definitely.  It also fits very well into the formal class idiom. 
Couple of suggestions below.
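
As a rough illustration of the conversion order proposed in (B), the
sketch below tries each type in turn on a character vector.  The name
auto_convert, the accepted logical spellings and the na.strings
handling are assumptions made for this example, not part of the
proposal, and the `dec' argument is left out for brevity.

```r
# Hypothetical helper: try each candidate type in the proposed order and
# keep the first one that converts every non-missing field.
auto_convert <- function(x, as.is = FALSE, na.strings = "NA") {
  x[x %in% na.strings] <- NA
  ok <- !is.na(x)

  # logical: only these spellings are accepted (an assumption of the sketch)
  if (all(x[ok] %in% c("TRUE", "FALSE", "T", "F")))
    return(as.logical(x))

  num <- suppressWarnings(as.numeric(x))
  if (!anyNA(num[ok])) {
    # integer: plain digit strings that fit in R's integer range
    if (all(grepl("^[-+]?[0-9]+$", x[ok])) &&
        all(abs(num[ok]) <= .Machine$integer.max))
      return(as.integer(x))
    return(num)
  }

  cplx <- suppressWarnings(as.complex(x))
  if (!anyNA(cplx[ok]))
    return(cplx)

  # factor only if !as.is, otherwise leave the strings alone
  if (as.is) x else factor(x)
}

cols <- list(flag = c("TRUE", "F", NA),
             n    = c("1", "2", "NA"),
             x    = c("1.5", "2"),
             z    = c("1+2i", "3"),
             g    = c("a", "b"))
sapply(lapply(cols, auto_convert), function(v) class(v)[1])
```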

> 
> 2) Are the names sensible?
> 
> 3) Is there any need to allow users to specify either the set of
>    classes used by "auto" or lists of classes on a column-specific
>    basis?

I think the most flexible way to get what you want is something like the
following.

The natural default for the colClasses argument is itself the name of a
class, though a "virtual" class in green book terminology.

I've been playing around with some data-frame related software mostly as
tests for the methods code (in SLanguage/SModels in the Omegahat tree).

The class used there for this purpose is called "dataVariable", meaning
anything that can conceptually be a variable in a data frame.  Actual
classes for variables extend this class, maybe trivially, maybe by some
method.

What's needed for the default here is essentially a method to coerce
class "character" to "dataVariable" (or whatever name one wants to
use).  When we are really using formal methods, this would be specified
by a call to setAs (green book, p307).  Then in effect
  data[[i]] <- as(data[[i]], colClasses[i])
applies in the default case as well.
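
A minimal sketch of that idea, using the methods package: a virtual
class stands in for "dataVariable", and a setAs method from
"character" supplies the default conversion.  Delegating to
type.convert with as.is = TRUE is an assumption of this example, and
the step of making the actual variable classes extend the virtual
class is omitted.

```r
library(methods)

# A virtual class: it has no instances of its own and serves here only
# as the target name for the default coercion.
setClass("dataVariable", representation("VIRTUAL"))

# Default conversion from the character fields to "some data variable".
# Using type.convert() here is an assumption made for this sketch.
setAs("character", "dataVariable",
      function(from) type.convert(from, as.is = TRUE))

# Character columns as they would arrive from the file.
data <- list(a = c("1", "2", "3"),
             b = c("1.5", "2.5", "NA"),
             c = c("x", "y", "z"))
colClasses <- rep("dataVariable", length(data))

# The same loop covers the default case and any user-supplied class.
for (i in seq_along(data))
  data[[i]] <- as(data[[i]], colClasses[i])

sapply(data, class)
```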

Users could specialize the default by overriding the setAs, but a
better way would be to define a new virtual class, with its own method
for coercion.  Users would then have essentially unlimited flexibility,
by supplying the name of that class in the colClasses argument.
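
For illustration, here is how a user-defined class with its own
coercion method might be supplied by name.  The class name "ymdDate"
and the sample data are hypothetical.  The example relies on the
colClasses argument of read.table in current versions of R, which
calls as() from package methods for class names it does not handle
itself.

```r
library(methods)

# Hypothetical user class: fields written as year-month-day dates.
setClass("ymdDate", representation("VIRTUAL"))
setAs("character", "ymdDate",
      function(from) as.Date(from, format = "%Y-%m-%d"))

txt <- "day value
2001-08-24 1.5
2001-08-25 2.5"

# The class name alone selects the user's conversion for that column.
df <- read.table(text = txt, header = TRUE,
                 colClasses = c("ymdDate", "numeric"))
str(df)
```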

> 4) Currently the default is to get something without much information
>    loss, and that would remain.  My intention is that if a class is
>    specified and conversion is not possible, the result would be
>    (mainly?) NAs.  Any problem with that?

As a default, seems fine.  When the user supplies a class, this implies
an as() method, which can then decide what to do in case of
problems--error, NA, or whatever.
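
As a sketch of how an as() method could choose its own failure
policy, the hypothetical class "strictNumeric" below signals an error
where a plain numeric conversion would quietly give NA.  The class
name and the wording of the message are assumptions for this example.

```r
library(methods)

# Hypothetical class whose coercion refuses fields it cannot convert.
setClass("strictNumeric", representation("VIRTUAL"))
setAs("character", "strictNumeric", function(from) {
  out <- suppressWarnings(as.numeric(from))
  bad <- is.na(out) & !is.na(from)   # fields that failed to convert
  if (any(bad))
    stop("cannot convert: ", paste(from[bad], collapse = ", "))
  out
})

fields <- c("1", "x")

# Lenient policy: unconvertible fields become NA.
lenient <- suppressWarnings(as.numeric(fields))

# Strict policy: the user's method decides to signal an error instead.
strict <- tryCatch(as(fields, "strictNumeric"),
                   error = function(e) conditionMessage(e))
```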

> Brian
> 
> --
> Brian D. Ripley,                  ripley@stats.ox.ac.uk
> Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
> University of Oxford,             Tel:  +44 1865 272861 (self)
> 1 South Parks Road,                     +44 1865 272860 (secr)
> Oxford OX1 3TG, UK                Fax:  +44 1865 272595
> 
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._

John

-- 
John M. Chambers                  jmc@bell-labs.com
Bell Labs, Lucent Technologies    office: (908)582-2681
700 Mountain Avenue, Room 2C-282  fax:    (908)582-3340
Murray Hill, NJ  07974            web: http://www.cs.bell-labs.com/~jmc