[Rd] RFC: type conversion in read.table
John Chambers
jmc@research.bell-labs.com
Fri, 24 Aug 2001 09:18:17 -0400
Prof Brian Ripley wrote:
>
> Currently read.table is rather limited in its type conversion.
> The algorithm is
>
> 0) Read as character
> 1) Try to convert to numeric. If that works, quit
> 2) Convert to factor if !as.is.
>
> I am thinking about adding more flexibility and more classes by the
> following two changes.
>
> A) Anticipating the arrival of classes for all R objects, add an
> argument say `colClasses' that allows the user to specify the desired
> class for every column. This could default to "auto", or NA if people
> think "auto" might be a relevant class name one day.
>
> The effect would be equivalent to running
>
> data[[i]] <- as(data[[i]], colClasses[i])
>
> instead of
>
> data[[i]] <- type.convert(data[[i]], as.is = as.is[i], dec = dec)
>
> except that standard classes such as "numeric", "factor", "logical",
> "character" would be dispatched directly, and argument "dec" would be
> consulted where appropriate.
>
> colClasses = "character" would suppress all conversions, which cannot
> currently be done.
>
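A minimal sketch of what proposal (A) might look like per column. The helper name convert_column and the sample data are hypothetical, made up for illustration; only the standard class names and the `dec' argument come from the proposal.

```r
# Hypothetical helper, not part of R: convert one character column to the
# class named in `cls`, dispatching the standard classes directly.
convert_column <- function(x, cls, dec = ".") {
  # Only the numeric branch consults `dec`.
  x_num <- if (dec == ".") x else sub(dec, ".", x, fixed = TRUE)
  switch(cls,
    character = x,                                   # suppress all conversion
    numeric   = suppressWarnings(as.numeric(x_num)), # failures become NA
    logical   = as.logical(x),
    factor    = factor(x),
    methods::as(x, cls)                              # any other class: as()
  )
}

# Made-up input: every column starts out as character.
raw <- list(id = c("001", "002"), price = c("1,5", "2,25"))
colClasses <- c("character", "numeric")
data <- Map(convert_column, raw, colClasses, MoreArgs = list(dec = ","))
```

With colClasses = "character" the id column keeps its leading zeros, which is the case the proposal says cannot currently be handled.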
> B) Make the default "auto" option somewhat cleverer. I am thinking of
> trying the following in turn
>
> logical
> integer
> numeric
> complex
> factor (only if !as.is[i] for backwards compatibility).
>
> The `dec' option needs to be used for numeric/complex.
>
> This would be done by a documented typeConvert function, and
> should normally be fast (just look at the first item to rule
> out much of the list).
>
> This does mean that data frames would be much more likely to end up
> containing integer or logical variables (although they can now).
> I have already fixed model.frame/matrix to handle logical variables,
> and would need to check that they do handle integer variables.
>
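The cascade in (B) could be sketched roughly as below. The name auto_convert is hypothetical (it is not the typeConvert function of the proposal), and for simplicity this version converts whole columns rather than inspecting the first item to rule out candidates early.

```r
# Hypothetical sketch of the proposed "auto" cascade:
# logical, integer, numeric, complex, then factor (unless as.is).
auto_convert <- function(x, as.is = FALSE, dec = ".") {
  # Accept a candidate only if every non-missing entry converted cleanly.
  clean <- function(y) !any(is.na(y) & !is.na(x))

  y <- as.logical(x)
  if (clean(y)) return(y)

  # Normalise the decimal mark before trying numeric and complex.
  z <- if (dec == ".") x else sub(dec, ".", x, fixed = TRUE)
  y <- suppressWarnings(as.numeric(z))
  if (clean(y)) {
    # Values written without a decimal mark or exponent become integer,
    # provided they fit in R's integer range.
    if (!any(grepl("[.eE]", z)) &&
        all(abs(y) <= .Machine$integer.max, na.rm = TRUE))
      return(as.integer(y))
    return(y)
  }

  y <- suppressWarnings(as.complex(z))
  if (clean(y)) return(y)

  # Nothing else fits: keep character if as.is, otherwise make a factor.
  if (as.is) x else factor(x)
}
```

Because each step is tried only after the previous one fails, a column such as c("1", "2", "3") ends up integer rather than numeric, which is the change in behaviour the paragraph above anticipates.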
> Questions:
>
> 1) Is this desirable?
Yes, definitely. It also fits very well into the formal class idiom.
A couple of suggestions below.
>
> 2) Are the names sensible?
>
> 3) Is there any need to allow users to specify either the set of
> classes used by "auto" or lists of classes on a column-specific
> basis?
I think the most flexible way to get what you want is something like the
following.
The natural default for the colClasses argument is itself the name of a
class, though a "virtual" class in green book terminology.
I've been playing around with some data-frame related software mostly as
tests for the methods code (in SLanguage/SModels in the Omegahat tree).
The class used there for this purpose is called "dataVariable", meaning
anything that can conceptually be a variable in a data frame. Actual
classes for variables extend this class, maybe trivially, maybe by some
method.
What's needed for the default here is essentially a method to coerce
class "character" to "dataVariable" (or whatever name one wants to
use). When we are really using formal methods, this would be specified
by a call to setAs (green book, p307). Then in effect

data[[i]] <- as(data[[i]], colClasses[i])

applies in the default case as well.
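As a rough illustration of the idea, assuming the methods package is available: "dataVariable" is modelled here as a bare virtual class, and the choice of type.convert as the body of the coercion is an assumption of this sketch, not something taken from the Omegahat code.

```r
library(methods)

# Assumption: "dataVariable" is a bare virtual class with no slots.
setClass("dataVariable", representation("VIRTUAL"))

# Default coercion from character: let type.convert() choose a concrete type.
setAs("character", "dataVariable",
      function(from) type.convert(from, as.is = FALSE))

# Made-up columns, all read as character.
data <- list(n = c("1", "2"), g = c("a", "b"))
colClasses <- c("dataVariable", "dataVariable")

# The same line serves the default case and the user-specified case.
for (i in seq_along(data))
  data[[i]] <- as(data[[i]], colClasses[i])
```

Here the first column comes back numeric and the second as a factor, without read.table needing a separate code path for the default.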
Users could specialize the default by overriding the setAs, but a
better way would be to define a new virtual class, with its own method
for coercion. Users would then have essentially unlimited flexibility,
by supplying the name of that class in the colClasses argument.
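For instance, a user-defined virtual class with its own coercion might look like the following sketch. The class name "surveyVariable" and its yes/no rule are invented for illustration only.

```r
library(methods)

# Hypothetical user-defined virtual class: yes/no columns become logical,
# everything else falls back to type.convert() without creating factors.
setClass("surveyVariable", representation("VIRTUAL"))

setAs("character", "surveyVariable", function(from) {
  if (all(from %in% c("yes", "no", NA))) {
    from == "yes"
  } else {
    type.convert(from, as.is = TRUE)
  }
})

answers <- as(c("yes", "no", "yes"), "surveyVariable")
comments <- as(c("fine", "too long"), "surveyVariable")
```

Supplying "surveyVariable" in colClasses would then select this rule for the chosen columns while leaving the default coercion untouched.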
> 4) Currently the default is to get something without much information
> loss, and that would remain. My intention is that if a class is
> specified and conversion is not possible, the result would be
> (mainly?) NAs. Any problem with that?
As a default, seems fine. When the user supplies a class, this implies
an as() method, which can then decide what to do in case of
problems--error, NA, or whatever.
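A small sketch of a coercion method that chooses to signal an error rather than return NAs. The class "checkedNumeric" is hypothetical, defined here only to have something to coerce to.

```r
library(methods)

# Hypothetical class: a numeric vector that refuses unparseable input.
setClass("checkedNumeric", contains = "numeric")

setAs("character", "checkedNumeric", function(from) {
  y <- suppressWarnings(as.numeric(from))
  # Entries that were present in the input but failed to convert.
  bad <- is.na(y) & !is.na(from)
  if (any(bad))
    stop("cannot convert to numeric: ",
         paste(unique(from[bad]), collapse = ", "))
  new("checkedNumeric", y)
})

ok <- as(c("1", "2.5"), "checkedNumeric")
res <- tryCatch(as(c("1", "x"), "checkedNumeric"), error = function(e) e)
```

Replacing the stop() call with a plain return of y would give the NA behaviour instead, so the policy lives entirely in the method.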
> Brian
>
> --
> Brian D. Ripley, ripley@stats.ox.ac.uk
> Professor of Applied Statistics, http://www.stats.ox.ac.uk/~ripley/
> University of Oxford, Tel: +44 1865 272861 (self)
> 1 South Parks Road, +44 1865 272860 (secr)
> Oxford OX1 3TG, UK Fax: +44 1865 272595
>
> -.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
> r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
> Send "info", "help", or "[un]subscribe"
> (in the "body", not the subject !) To: r-devel-request@stat.math.ethz.ch
> _._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._
John
--
John M. Chambers jmc@bell-labs.com
Bell Labs, Lucent Technologies office: (908)582-2681
700 Mountain Avenue, Room 2C-282 fax: (908)582-3340
Murray Hill, NJ 07974 web: http://www.cs.bell-labs.com/~jmc