[Rd] read.table() and NULL for colClasses

Wed Jul 28 22:12:56 CEST 2004

NULL is not a valid value for colClasses and I don't see why you thought
it was.  colClasses has to be character according to the documentation, so
"NULL" is allowed but not NULL.
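
A minimal sketch of why the two differ, using made-up class vectors (the names with_null and with_string are illustrative only): a character vector cannot hold NULL as an element, whereas the string "NULL" is an ordinary element.

```r
# c() silently drops NULL, so the vector ends up one element short.
with_null   <- c("numeric", NULL, "character")
# The string "NULL" is kept like any other character element.
with_string <- c("numeric", "NULL", "character")

n_null   <- length(with_null)
n_string <- length(with_string)

# NULL itself is not of type character.
null_is_character <- is.character(NULL)
```

With three columns in the file, with_null would therefore no longer line up one class per column.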

Your diff appears to be backwards for a patch.  A patch against the 
current R-devel sources is what is needed, including some regression 
tests.

On Wed, 28 Jul 2004, Henrik Bengtsson wrote:

> Hi,
> 
> is there a reason for not supporting NULL or "NULL" values for argument
> colClasses in read.table(), much like you can use NULL values for argument
> 'what' in scan()? This would help quite a bit when reading large data files
> where only a few columns are of interest. 

Is that a common enough case to make this worth the code complication,
given that scan() (or better, a DBMS) can be used?  The usual reason is
that R is maintained by a small and overworked team, and it is adding
complications that needs justification, not leaving them out.
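
As a rough sketch of the scan() route, assuming a small made-up input with three whitespace-separated fields per record (the data and the names id, skip and value are hypothetical): a NULL component in 'what' makes scan() skip that field.

```r
# Hypothetical input: three fields per record; only the first and third are wanted.
txt <- c("1 a 2.5", "2 b 3.5", "3 c 4.5")

con <- textConnection(txt)
# A NULL entry in 'what' tells scan() to skip the corresponding field.
fields <- scan(con,
               what = list(id = integer(0), skip = NULL, value = numeric(0)),
               quiet = TRUE)
close(con)

# Drop the skipped (NULL) components before building a data frame.
kept <- fields[!vapply(fields, is.null, logical(1))]
df <- as.data.frame(kept)
```

The skipped field is never stored, which is the memory saving the original request is after.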

> I've modified read.table() so that it also calls scan(what=...) with NULLs for
> the fields to be skipped. Here's the diff of readtable.R (from the
> R-1.9.1.tgz; 9,591,217 bytes):
> 
> diff readtable.new.R readtable.R
> 117,123d116
> <     # Skip NULL columns in scan()
> <     void <- sapply(colClasses, FUN=identical, "NULL") |
> <             sapply(colClasses, FUN=is.null)
> <     # If all (data) columns are NULL, return empty data frame.
> <     if (sum(!void) <= 1*rlabp)
> <       return(data.frame())
> <     what[void] <- list(NULL)
> 131c124
> <     nlines <- length(data[[which(!void)[1]]])
> ---
> >     nlines <- length(data[[1]])
> 161c154
> <     for (i in (1:cols)[!known & !void]) {
> ---
> >     for (i in 1:cols) {
> 171,178d163
> <     # Skipped row names equals row.names=NULL.
> <     if (rlabp) {
> <       if (void[1]) {
> <         row.names <- NULL
> <         data <- data[-1]
> <       }
> <       void <- void[-1]
> <     }
> 201,202d185
> <     # Remove NULL columns
> <     data[void] <- NULL
> 
> and a diff for read.table.Rd:
> 
> diff read.table.new.Rd read.table.Rd
> 102,104c102
> <     \code{NA} when \code{\link{type.convert}} is used.  Columns for
> <     which the value is \code{"NULL"} (or \code{NULL} in a list) are
> <     skipped. NB: \code{as} is
> ---
> >     \code{NA} when \code{\link{type.convert}} is used.  NB: \code{as} is
> 181,183c179
> <   the five atomic vector classes. Skipping columns with \code{"NULL"}
> <   (or \code{NULL}) will also require less memory.
> <
> ---
> >   the five atomic vector classes.
> 
> Note that setting a colClasses entry to "NULL" already has an effect, which
> I assume is unintentional. During the data conversion, which happens *after*
> scan() has read the data anyway, "NULL" will NULL a column via as(x,
> "NULL"), but unfortunately the wrong column. If the above modifications are
> not accepted, maybe add a warning for the latter?

That's not usage as documented so the effect is definitely unintentional.
We can't catch all misuses!

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595