[R] read.csv and write.csv filtering for very big data ?

Duncan Murdoch murdoch.duncan at gmail.com
Wed Jun 5 14:06:36 CEST 2013


On 13-06-05 12:08 AM, ivo welch wrote:
> thx, greg.
>
> chunk boundaries have meanings.  the reader needs to stop, and buffer one
> line when it has crossed to the first line beyond the boundary.  it is also
> a problem that read.csv no longer works with files---readLines then has to do
> the processing.  (starting read.csv over and over again with a different
> skip value is probably not a good idea for big files.)  it needs a lot of
> smarts to intelligently append to a data frame.  (if the input is a data
> matrix, this is much simpler, of course.)

As Greg said, you don't need to use the skip argument:  just don't close
the file, and continue reading from where you stopped on the previous run.
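
A minimal sketch of that idea for fixed-size chunks, assuming a small
made-up input file. The file names, the chunk size, and the copy-only
"processing" step are illustrative and not from the thread. The key point
is that both connections are opened explicitly, so read.csv picks up
where the previous call stopped:

```r
# Hypothetical demo file; the thread does not name one.
infile  <- tempfile(fileext = ".csv")
outfile <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:10, x = (1:10) / 2), infile, row.names = FALSE)

chunk_rows <- 4                      # assumed block size, for illustration only
incon  <- file(infile,  open = "r")  # opened explicitly, so read.csv leaves it open
outcon <- file(outfile, open = "w")

# The first call consumes the header line along with the first block.
chunk <- read.csv(incon, nrows = chunk_rows)
cols  <- names(chunk)
write.csv(chunk, outcon, row.names = FALSE)

repeat {
  # Later calls resume where the previous one stopped.  There is no header
  # left to read, so the column names are supplied by hand.  End of input
  # may show up as an error or as a zero-row data frame; both are handled.
  chunk <- tryCatch(
    read.csv(incon, nrows = chunk_rows, header = FALSE, col.names = cols),
    error = function(e) NULL
  )
  if (is.null(chunk) || nrow(chunk) == 0) break
  # Real processing would go here; this sketch just copies the rows across.
  write.table(chunk, outcon, sep = ",", row.names = FALSE, col.names = FALSE)
}

close(incon)
close(outcon)
```

Reading results back with read.csv(outfile) should give the same rows as
the input, since nothing is changed along the way.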

If you don't know the size of blocks in advance this is harder, but it's 
not really all that hard.  The logic would be something like this:

open the file
read the first block including the header
while not done:
    if you have a complete block with some extra lines at the end,
    extract them and save them, then process the complete block.
    Initialize the next block with the extra lines.

    if the block is incomplete, read some more and append it
    to what you saved.
end while
close the file
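
One possible arrangement of the logic above in R, as a sketch only. It
assumes the rows of each block are contiguous and share a value in a key
column, here a made-up "date" column. The demo data, read_rows, and the
process() function are hypothetical. Each pass reads some more rows,
appends them to what was saved, processes every block that is known to be
complete, and keeps the trailing rows for the next pass:

```r
# Hypothetical demo data: rows arrive grouped by a "date" column.
infile <- tempfile(fileext = ".csv")
write.csv(data.frame(date = rep(c("d1", "d2", "d3"), times = c(3, 5, 2)),
                     x    = 1:10),
          infile, row.names = FALSE)

read_rows <- 4     # rows fetched per read; an arbitrary choice for the sketch
results   <- list()
process   <- function(d) {
  data.frame(date = d$date[1], mean_x = mean(d$x), stringsAsFactors = FALSE)
}

con   <- file(infile, open = "r")
# The first read includes the header.
block <- read.csv(con, nrows = read_rows, stringsAsFactors = FALSE)
cols  <- names(block)

repeat {
  # Read some more and append it to what was saved.
  more <- tryCatch(
    read.csv(con, nrows = read_rows, header = FALSE, col.names = cols,
             stringsAsFactors = FALSE),
    error = function(e) NULL
  )
  at_end <- is.null(more) || nrow(more) == 0
  if (!at_end) block <- rbind(block, more)

  # Rows whose key differs from the last row's key belong to complete
  # blocks; at end of input everything that is left is complete.
  last_key <- block$date[nrow(block)]
  finished <- if (at_end) rep(TRUE, nrow(block)) else block$date != last_key

  keys <- block$date[finished]
  for (piece in split(block[finished, , drop = FALSE],
                      factor(keys, levels = unique(keys)))) {
    results[[length(results) + 1]] <- process(piece)
  }

  # The extra lines initialize the next block.
  block <- block[!finished, , drop = FALSE]
  if (at_end) break
}
close(con)

out <- do.call(rbind, results)
```

A block is only treated as complete once a row with a different key has
been seen after it, which is why the last block is handled at end of input.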

Duncan Murdoch

>
> exporting large input files to SQLite databases makes sense when the same
> file is used again and again, but probably not when it is a staged one-time
> processor.  the disk consumption is too big.
>
> the writer could become quasi-threaded by writing to multiple temp files
> and then concatenating at the end, but this would be a nasty
> solution...nothing like the parsimonious elegance and generality that a
> built-in R filter function could provide.
>
> ----
> Ivo Welch (ivo.welch at gmail.com)
>
>
>
> On Tue, Jun 4, 2013 at 2:56 PM, Greg Snow <538280 at gmail.com> wrote:
>
>> Some possibilities using existing tools.
>>
>> If you create a file connection and open it before reading from it (or
>> writing to it), then functions like read.table and read.csv ( and
>> write.table for a writable connection) will read from the connection, but
>> not close and reset it.  This means that you could open 2 files, one for
>> reading and one for writing, then read in a chunk, process it, write it
>> out, then read in the next chunk, etc.
>>
>> Another option would be to read the data into an ff object (ff package) or
>> into a database (SQLite for one) which could have the data accessed in
>> chunks, possibly even in parallel.
>>
>>
>> On Mon, Jun 3, 2013 at 4:59 PM, ivo welch <ivo.welch at anderson.ucla.edu> wrote:
>>
>>> dear R wizards---
>>>
>>> I presume this is a common problem, so I thought I would ask whether
>>> this solution already exists and if not, suggest it.  say, a user has
>>> a data set of x GB, where x is very big---say, greater than RAM.
>>> fortunately, data often come sequentially in groups, and there is a
>>> need to process contiguous subsets of them and write the results to a
>>> new file.  read.csv and write.csv only work on FULL data sets.
>>> read.csv has the ability to skip n lines and read only m lines, but
>>> this can cross the subsets.  the useful solution here would be a
>>> "filter" function that understands about chunks:
>>>
>>>     filter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>>>
>>> a chunk would not exactly be a factor, because normal R factors can be
>>> non-sequential in the data frame.  the filter.csv makes it very simple
>>> to work on large data sets...almost SAS simple:
>>>
>>>     filter.csv( pipe('bzcat infile.csv.bz2'), "results.csv", "date",
>>>                 function(d) colMeans(d) )
>>> or
>>>     ## filter out observations that have the same date again later
>>>     filter.csv( pipe('bzcat infile.csv.bz2'),
>>>                 pipe("bzip2 -c > results.csv.bz2"), "date",
>>>                 function(d) d[ !duplicated(d$date), ] )
>>>
>>> or some reasonable variant of this.
>>>
>>> now that I can have many small chunks, it would be nice if this were
>>> threadsafe, so
>>>
>>>     mcfilter.csv <- function( in.csv, out.csv, chunk, FUNprocess ) ...
>>>
>>> with 'library(parallel)' could feed multiple cores the FUNprocess, and
>>> make sure that the processes don't step on one another.  (why did R
>>> not use a dot after "mc" for parallel lapply?)  presumably, to keep it
>>> simple, mcfilter.csv would keep a counter of read chunks and block
>>> write chunks until the next sequential chunk in order arrives.
>>>
>>> just a suggestion...
>>>
>>> /iaw
>>>
>>> ----
>>> Ivo Welch (ivo.welch at gmail.com)
>>>
>>> ______________________________________________
>>> R-help at r-project.org mailing list
>>> https://stat.ethz.ch/mailman/listinfo/r-help
>>> PLEASE do read the posting guide
>>> http://www.R-project.org/posting-guide.html
>>> and provide commented, minimal, self-contained, reproducible code.
>>>
>>
>>
>>
>> --
>> Gregory (Greg) L. Snow Ph.D.
>> 538280 at gmail.com
>>