[Rd] RFC: type conversion in read.table

Prof Brian Ripley ripley@stats.ox.ac.uk
Fri, 24 Aug 2001 08:30:44 +0100 (BST)


Currently read.table is rather limited in its type conversion.
The algorithm is

0) Read as character
1) Try to convert to numeric. If that works, quit
2) Convert to factor unless as.is.
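
For concreteness, the current steps can be sketched roughly as below.  This
is only an illustration: `convert_current' is a hypothetical name, not the
internal code, and missing entries are assumed to be NA already.

```r
# Hypothetical helper; a sketch of steps 0-2, not the actual internals.
convert_current <- function(x, as.is = FALSE) {
    x <- as.character(x)                      # 0) start from character
    num <- suppressWarnings(as.numeric(x))    # 1) attempt numeric conversion
    # Accept numeric only if no non-missing entry failed to convert
    if (!any(is.na(num) & !is.na(x))) return(num)
    if (as.is) x else factor(x)               # 2) factor unless as.is
}

str(convert_current(c("1", "2.5", NA)))               # numeric
str(convert_current(c("a", "b", "a")))                # factor
str(convert_current(c("a", "b", "a"), as.is = TRUE))  # character
```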

I am thinking about adding more flexibility and more classes by the
following two changes.


A) Anticipating the arrival of classes for all R objects, add an
argument, say `colClasses', that allows the user to specify the desired
class for every column.  This could default to "auto", or to NA if people
think "auto" might be a relevant class name one day.

The effect would be equivalent to running

data[[i]] <- as(data[[i]], colClasses[i])

instead of

data[[i]] <- type.convert(data[[i]], as.is = as.is[i], dec = dec)

except that standard classes such as "numeric", "factor", "logical",
"character" would be dispatched directly, and argument "dec" would be
consulted where appropriate.

colClasses = "character" would suppress all conversions, which cannot
currently be done.
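
A minimal sketch of what the per-column dispatch in (A) might look like,
assuming NA stands for "auto".  The name `convert_col' is hypothetical, and
`dec' is consulted only for the numeric case here.

```r
# Hypothetical helper illustrating proposal (A); not an existing function.
convert_col <- function(x, cls = NA, as.is = FALSE, dec = ".") {
    if (is.na(cls))   # NA plays the role of "auto": existing behaviour
        return(type.convert(x, as.is = as.is, dec = dec))
    switch(cls,
           # standard classes are dispatched directly
           character = as.character(x),
           numeric   = as.numeric(sub(dec, ".", x, fixed = TRUE)),
           logical   = as.logical(x),
           factor    = factor(x),
           methods::as(x, cls))   # any other class goes through as()
}

x <- c("1", "2", "3")
str(convert_col(x, "character"))   # left as character, no conversion
str(convert_col(x, "numeric"))
str(convert_col(c("1,5", "2"), "numeric", dec = ","))
```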


B) Make the default "auto" option somewhat cleverer.  I am thinking of
trying the following in turn:

logical
integer
numeric
complex
factor   (only if !as.is[i] for backwards compatibility).

The `dec' option needs to be used for numeric/complex.

This would be done by a documented typeConvert function, and
should normally be fast (just look at the first item to rule
out much of the list).
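
The cascade in (B) could be sketched along the following lines.  This is an
assumption-laden illustration: `auto_convert' is a hypothetical name, each
candidate is tried on the whole column rather than using the first-item
shortcut, and the handling of `dec' is simplified to a single substitution.

```r
# Hypothetical sketch of the "auto" cascade; not the proposed implementation.
auto_convert <- function(x, as.is = FALSE, dec = ".") {
    # A candidate is accepted only if it introduces no new NAs
    ok <- function(y) !any(is.na(y) & !is.na(x))
    y <- as.logical(x)                               # logical
    if (ok(y)) return(y)
    if (all(grepl("^[-+]?[0-9]+$", x[!is.na(x)]))) { # integer
        y <- suppressWarnings(as.integer(x))
        if (ok(y)) return(y)
    }
    xd <- sub(dec, ".", x, fixed = TRUE)   # map the decimal mark to "."
    y <- suppressWarnings(as.numeric(xd))            # numeric
    if (ok(y)) return(y)
    y <- suppressWarnings(as.complex(xd))            # complex
    if (ok(y)) return(y)
    if (as.is) x else factor(x)                      # factor only if !as.is
}

cols <- list(c("TRUE", "F"), c("1", "2"), c("1", "2.5"),
             c("1+2i", "3"), c("a", "b"))
sapply(cols, function(v) class(auto_convert(v)))
```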


This does mean that data frames would be much more likely to end up
containing integer or logical variables (although they already can).
I have already fixed model.frame and model.matrix to handle logical
variables, and would need to check that they also handle integer variables.


Questions:

1) Is this desirable?

2) Are the names sensible?

3) Is there any need to allow users to specify either the set of
   classes used by "auto" or lists of classes on a column-specific
   basis?

4) Currently the default is to get something without much information
   loss, and that would remain.  My intention is that if a class is
   specified and conversion is not possible, then the result would be
   (mainly?) NAs.  Any problem with that?
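
As a small illustration of point 4, this is how the existing base coercion
already behaves when an entry cannot be converted (the data are made up):

```r
x <- c("1", "two", "3")
# "two" cannot be converted; as.numeric() warns and yields NA for it
y <- suppressWarnings(as.numeric(x))
print(y)          # 1 NA 3
print(is.na(y))   # FALSE TRUE FALSE
```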


Brian

-- 
Brian D. Ripley,                  ripley@stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272860 (secr)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595
