[R] read.csv and write.csv filtering for very big data ?
ivo welch
ivo.welch at anderson.ucla.edu
Tue Jun 4 00:59:03 CEST 2013
dear R wizards---
I presume this is a common problem, so I thought I would ask whether
a solution already exists and, if not, suggest one. say a user has
a data set of x GB, where x is very big---say, greater than RAM.
fortunately, data often come sequentially in groups, and there is a
need to process contiguous subsets of them and write the results to a
new file. read.csv and write.csv only work on FULL data sets.
read.csv can skip n lines and read only m lines, but such a fixed
window can cut across the subsets. the useful solution here would be
a "filter" function that understands chunks:
filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
a chunk would not be exactly a factor, because the values of a normal
R factor can appear non-sequentially in the data frame. such a
filter.csv would make it very simple to work on large data
sets...almost SAS simple:
filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
function(d) colMeans(d))
or
filter.csv( pipe('bzcat infile.csv.bz2'), pipe("bzip2 -c >
results.csv.bz2"), "date", function(d) d[ !duplicated(d$date), ] ) ##
filter out observations that repeat an earlier date
or some reasonable variant of this.
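to make this concrete, here is a rough, untested sketch of what I have
in mind for filter.csv. it assumes the input is already sorted by the
chunk column and reads a fixed number of lines per block; the blocksize
argument is just a made-up tuning knob:

filter.csv <- function(in.csv, out.csv, chunk, FUNprocess, blocksize = 1e5) {
    incon  <- if (is.character(in.csv))  file(in.csv)  else in.csv
    outcon <- if (is.character(out.csv)) file(out.csv) else out.csv
    if (!isOpen(incon))  open(incon, "r")
    if (!isOpen(outcon)) open(outcon, "w")
    on.exit({ close(incon); close(outcon) })

    ## the first block also carries the header
    block  <- read.csv(incon, nrows = blocksize, stringsAsFactors = FALSE)
    cnames <- names(block)
    first.write <- TRUE

    emit <- function(d) {     ## process one complete chunk, append the result
        res <- FUNprocess(d)
        if (!is.data.frame(res)) res <- as.data.frame(as.list(res))
        write.table(res, outcon, sep = ",", row.names = FALSE,
                    col.names = first.write)
        first.write <<- FALSE
    }

    repeat {
        nxt <- tryCatch(read.csv(incon, header = FALSE, nrows = blocksize,
                                 col.names = cnames, stringsAsFactors = FALSE),
                        error = function(e) NULL)      ## NULL == end of input
        if (is.null(nxt) || nrow(nxt) == 0) {
            ## last block: every remaining chunk is complete
            for (v in unique(block[[chunk]])) emit(block[block[[chunk]] == v, ])
            break
        }
        block <- rbind(block, nxt)
        ## the last chunk value may continue in the next block, so hold it back
        last <- block[[chunk]] == block[[chunk]][nrow(block)]
        for (v in unique(block[[chunk]][!last])) emit(block[block[[chunk]] == v, ])
        block <- block[last, , drop = FALSE]
    }
    invisible(NULL)
}

the point is only that nothing but one block plus one possibly
unfinished chunk ever sits in memory at a time.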
now that I can have many small chunks, it would be nice if this were
thread-safe, so that
mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
with 'library(parallel)' could feed the FUNprocess calls for multiple
chunks to multiple cores and make sure that the processes don't step
on one another. (why did R not use a dot after "mc" for the parallel
lapply?) presumably, to keep it simple, mcfilter.csv would keep a
counter of read chunks and hold back writing a chunk until the next
chunk in sequence arrives.
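one simple way to get the ordered writes: hand mclapply a whole batch
of complete chunks at once; its results come back in input order, so a
single writer can just append them one after the other. a rough,
untested sketch of that step (emit.batch and chunk.list are names I am
making up; chunk.list would be the list of complete chunks carved out
of the current block, as in the filter.csv sketch above):

library(parallel)

## hypothetical helper for the inner step of mcfilter.csv: mclapply returns
## its results in the same order as its input, so the writer below stays
## sequential without an explicit chunk counter.
emit.batch <- function(chunk.list, FUNprocess, outcon, first.write,
                       mc.cores = detectCores()) {
    results <- mclapply(chunk.list, FUNprocess, mc.cores = mc.cores)
    for (res in results) {
        if (!is.data.frame(res)) res <- as.data.frame(as.list(res))
        write.table(res, outcon, sep = ",", row.names = FALSE,
                    col.names = first.write)
        first.write <- FALSE
    }
    first.write      ## hand the updated header flag back to the caller
}

(mclapply forks, so this only buys anything on unix-alikes; on windows
one would have to fall back on a parLapply cluster.)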
just a suggestion...
/iaw
----
Ivo Welch (ivo.welch at gmail.com)