[Rd] read.table() with quoted integers

Jens Oehlschlägel Jens.Oehlschlaegel at truecluster.com
Thu Oct 3 16:44:19 CEST 2013


I agree that quoted integer columns are not the most efficient way of 
delivering csv-files. However, the sad reality is that one receives such 
formats and still needs to read the data. Therefore it is not helpful to 
state that one should 'consider "character" to be the correct colClass 
in case an integer is surrounded by quotes'.

The philosophy of read.table.ffdf is to delegate the actual csv-parsing 
to a parse engine that is parametrized similarly to 'read.table'. It is 
not 'bad coding practice' but a conscious design decision to assume that 
the parse engine behaves consistently, which read.table does not yet do: 
it automatically recognizes a quoted integer column as 'integer', but 
when asked to explicitly interpret the column as 'integer' it refuses to 
do so. So there is nothing wrong with read.table.ffdf (but something can 
be improved about read.table). It is *not* the 'best solution [...] to 
rewrite read.table.ffdf()' given that it nicely imports such data; see 
the 4+1 ways to do so below.
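
To make the inconsistency concrete, here is a minimal sketch using plain 
read.csv on a throwaway file. The file contents and variable names are 
made up for illustration. The explicit-colClasses call is wrapped in 
tryCatch because its outcome depends on the R version: with the versions 
current at the time of writing it stops with an error, and I make no 
claim about how other versions behave.

```r
# Hypothetical demo file: second column holds quoted integers
tmp <- tempfile(fileext = ".csv")
writeLines(c('x,y', '1,"1"', '2,"2"', '3,"3"'), tmp)

# Automatic type detection: quotes are stripped, then the type is guessed
auto <- read.csv(tmp)
print(class(auto$y))

# Explicit request for 'integer': capture the error message, if any,
# instead of aborting the script
explicit <- tryCatch(
  read.csv(tmp, colClasses = c("integer", "integer")),
  error = function(e) conditionMessage(e)
)
print(explicit)

unlink(tmp)
```

If 'explicit' prints an error message rather than a data.frame, the parse 
engine has treated the same column differently in the two calls.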

Jens Oehlschlägel


# --- first create a csv file for demonstration ------------------------
require(ff)
file <- "test.csv"
path <- "c:/tmp"
n <- 1e2
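# type="cmd" forces double quotes on every platform (it is only the default on Windows)
d <- data.frame(x=1:n, y=shQuote(1:n, type="cmd"))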
write.csv(d, file=file.path(path,file), row.names=FALSE, quote=FALSE)

# --- how to do it with read.table.ffdf --------------------------------

# 1 let the parse engine ignore colClasses and hope for the best
fixedengine <- function(file, ..., colClasses=NA){
	read.csv(file, ...)
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10, 
FUN="fixedengine")
df

# 2 Suspend colClasses(=NA) for the quoted integer column only
df <- read.csv.ffdf(file=file.path(path,file), first.rows = 10, 
colClasses=c("integer", NA))
df

# 3 do your own type conversion using transFUN
#  after reading the problematic column as character
# Being able to inject regexps is quite powerful isn't it?
# Or error handling in case of varying column format!
custominterp <- function(d){
	d[[2]] <- as.integer(gsub('"', '', d[[2]]))
	d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10, 
colClasses=c("integer", "character"), FUN="read.csv", transFUN=custominterp)
df

# 4 do your own line parsing and type conversion
# Here you can even handle non-standard formats
#  such as varying number of columns
customengine <- function(file, header=TRUE, col.names, colClasses=NA, 
nrows=0, skip=0, fileEncoding="", comment.char = ""){
	l <- scan(file, what="character", nlines=nrows+header, skip=skip, 
fileEncoding=fileEncoding, comment.char = comment.char)
	s <- do.call("rbind", strsplit(l, ","))
	if (header){
		d <- data.frame(as.integer(s[-1,1]), as.integer(gsub('"','',s[-1,2])))
		names(d) <- s[1,]
	}else{
		d <- data.frame(as.integer(s[,1]), as.integer(gsub('"','',s[,2])))
	}
	if (!missing(col.names))
		names(d) <- col.names
	d
}
df <- read.table.ffdf(file=file.path(path,file), first.rows = 10, 
FUN="customengine")
df

# 5 use a parsing engine that can apply colClasses to quoted integers
# Unfortunately Henrik Bengtsson's readDataFrame does not work as a
#  parse engine for read.table.ffdf because read.table.ffdf expects
#  the parse engine to read successive chunks from a file connection
#  while readDataFrame only accepts a filename as input file spec.
# Yes it has 'skip', but using that would reread the file from scratch
#  for each chunk (O(N^2) costs)
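
The chunking requirement mentioned above can be illustrated with plain 
read.csv on an open file connection. This is only a sketch of the idea, 
not the actual read.table.ffdf code; the file and the chunk size are 
made up. Because the connection stays open, each call continues where 
the previous one stopped, so no line is read twice.

```r
# Hypothetical demo file with 10 data rows
tmp <- tempfile(fileext = ".csv")
write.csv(data.frame(x = 1:10, y = 11:20), tmp, row.names = FALSE)

con <- file(tmp, open = "r")   # open once, keep the read position

# First chunk consumes the header line plus 4 data rows
first <- read.csv(con, nrows = 4)

# Later chunks have no header; reuse the column names from the first chunk
second <- read.csv(con, nrows = 4, header = FALSE, col.names = names(first))
third  <- read.csv(con, nrows = 4, header = FALSE, col.names = names(first))

close(con)
unlink(tmp)

print(rbind(first, second, third))
```

A parse engine that only accepts a filename would have to be called with 
skip=4, skip=8, ... instead, scanning the file from the top each time.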


