[R] read.csv and write.csv filtering for very big data ?

Duncan Murdoch murdoch.duncan at gmail.com
Wed Jun 5 16:40:28 CEST 2013


On 05/06/2013 10:32 AM, ivo welch wrote:
>
> I just tested read.csv with open file connections...it still works.  it
> can read one line at a time (without col.names and with nrows).  nice.
> it loses its type memory across reinvocations, but this is usually not
> a problem if one reads a few thousand lines inside a buffer function.
> this sort of function is useful only for big files anyway.

Surely you know the types of the columns?  If you specify them in advance,
read.table and its relatives will be much faster.
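
For instance, something along these lines (the file name, chunk size, and
column classes are only illustrative) reads from an open connection and
keeps the column types fixed across calls:

    ## open once; each read.csv call then continues where the last one stopped
    con <- file("infile.csv", open = "r")

    ## first chunk: consume the header and pin the column types down
    first <- read.csv(con, nrows = 10000, header = TRUE,
                      colClasses = c("character", "character", "numeric"))

    ## later chunks: no header line left, so reuse the names and classes
    nxt <- read.csv(con, nrows = 10000, header = FALSE,
                    col.names = names(first),
                    colClasses = c("character", "character", "numeric"))

    close(con)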

Duncan Murdoch

>
> is it possible to block write.csv across multiple threads in mclapply? 
>  or hook a single-thread function into the mclapply collector?
>
> /iaw
>
>
>
> On Wed, Jun 5, 2013 at 5:06 AM, Duncan Murdoch 
> <murdoch.duncan at gmail.com> wrote:
>
>     On 13-06-05 12:08 AM, ivo welch wrote:
>
>         thx, greg.
>
>         chunk boundaries have meaning.  the reader needs to stop and
>         buffer one line when it has crossed to the first line beyond
>         the boundary.  it is also a problem that read.csv no longer
>         works with file connections---readLines then has to do the
>         processing.  (starting read.csv over and over again with a
>         different skip is probably not a good idea for big files.)  it
>         needs a lot of smarts to intelligently append to a data frame.
>         (if the input is a data matrix, this is much simpler, of
>         course.)
>
>
>     As Greg said, you don't need to use skip:  just don't close
>     the file, and continue reading from where you stopped on the
>     previous run.
>
>     If you don't know the size of blocks in advance this is harder,
>     but it's not really all that hard.  The logic would be something
>     like this:
>
>     open the file
>     read the first block including the header
>     while not done:
>        if you have a complete block with some extra lines at the end,
>        extract them and save them, then process the complete block.
>        Initialize the next block with the extra lines.
>
>        if the block is incomplete, read some more and append it
>        to what you saved.
>     end while
>     close the file
>
>     Duncan Murdoch
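
Spelling that logic out in R might look roughly like the sketch below (the
grouping column "date", the chunk size, and the process() function are
placeholders; the input is assumed to be sorted by group, with no quoted
commas inside fields):

    con <- file("infile.csv", open = "r")
    hdr <- readLines(con, n = 1)
    nms <- strsplit(hdr, ",", fixed = TRUE)[[1]]
    grp <- match("date", nms)               # column that defines a block

    carry <- character(0)                   # lines held over from the last read
    repeat {
      fresh <- readLines(con, n = 10000)
      lines <- c(carry, fresh)
      if (length(lines) == 0) break
      keys <- sapply(strsplit(lines, ",", fixed = TRUE), `[`, grp)
      if (length(fresh) < 10000) {          # end of file: everything is complete
        complete <- lines
        carry <- character(0)
      } else {                              # hold back the trailing (possibly
        last <- keys == keys[length(keys)]  # incomplete) group for the next round
        complete <- lines[!last]
        carry <- lines[last]
      }
      if (length(complete) > 0) {
        block <- read.csv(text = complete, header = FALSE, col.names = nms)
        process(block)                      # placeholder for the real work
      }
    }
    close(con)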
>
>
>         exporting large input files to sqlite databases makes sense
>         when the same file is used again and again, but probably not
>         when it is a staged, one-time process.  the disk consumption
>         is too big.
>
>         the writer could become quasi-threaded by writing to multiple
>         temp files and then concatenating at the end, but this would
>         be a nasty solution...nothing like the parsimonious elegance
>         and generality that a built-in R filter function could provide.
>
>         ----
>         Ivo Welch (ivo.welch at gmail.com)
>
>
>
>         On Tue, Jun 4, 2013 at 2:56 PM, Greg Snow <538280 at gmail.com> wrote:
>
>             Some possibilities using existing tools.
>
>             If you create a file connection and open it before reading
>             from it (or writing to it), then functions like read.table
>             and read.csv (and write.table for a writable connection)
>             will read from the connection, but not close and reset it.
>             This means that you could open 2 files, one for reading and
>             one for writing, then read in a chunk, process it, write it
>             out, then read in the next chunk, etc.
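
A bare-bones version of that two-connection loop (file names, chunk size,
and the process() function are placeholders):

    incon  <- file("big_in.csv", open = "r")
    outcon <- file("big_out.csv", open = "w")

    hdr <- readLines(incon, n = 1)           # copy the header across once
    writeLines(hdr, outcon)
    nms <- strsplit(hdr, ",", fixed = TRUE)[[1]]

    repeat {
      chunk <- tryCatch(read.csv(incon, nrows = 10000, header = FALSE,
                                 col.names = nms),
                        error = function(e) NULL)   # read.csv errors at EOF
      if (is.null(chunk)) break
      write.table(process(chunk), outcon, sep = ",",
                  row.names = FALSE, col.names = FALSE)
      if (nrow(chunk) < 10000) break                # short chunk: we are done
    }

    close(incon); close(outcon)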
>
>             Another option would be to read the data into an ff object
>             (ff package) or into a database (SQLite for one), which
>             could have the data accessed in chunks, possibly even in
>             parallel.
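
A rough sketch of the SQLite route with the RSQLite package (the file
"big.db", table "big", and column "date" are made-up names):

    library(RSQLite)
    db <- dbConnect(SQLite(), "big.db")

    ## load the csv once, chunk by chunk, e.g. with a loop like the one above:
    ##   dbWriteTable(db, "big", chunk, append = TRUE)

    ## afterwards, pull back one group at a time
    dates <- dbGetQuery(db, "SELECT DISTINCT date FROM big")$date
    one   <- dbGetQuery(db, paste0("SELECT * FROM big WHERE date = '",
                                   dates[1], "'"))
    dbDisconnect(db)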
>
>
>             On Mon, Jun 3, 2013 at 4:59 PM, ivo welch
>             <ivo.welch at anderson.ucla.edu> wrote:
>
>                 dear R wizards---
>
>                 I presume this is a common problem, so I thought I
>                 would ask whether this solution already exists and, if
>                 not, suggest it.  say, a user has a data set of x GB,
>                 where x is very big---say, greater than RAM.
>                 fortunately, data often come sequentially in groups,
>                 and there is a need to process contiguous subsets of
>                 them and write the results to a new file.  read.csv and
>                 write.csv only work on FULL data sets.  read.csv has
>                 the ability to skip n lines and read only m lines, but
>                 this can cross the subsets.  the useful solution here
>                 would be a "filter" function that understands chunks:
>
>                     filter.csv <- function(in.csv, out.csv, chunk, FUNprocess) ...
>
>                 a chunk would not exactly be a factor, because normal R
>                 factors can be non-sequential in the data frame.
>                 filter.csv makes it very simple to work on large data
>                 sets...almost SAS simple:
>
>                     filter.csv(pipe('bzcat infile.csv.bz2'), "results.csv",
>                                "date", function(d) colMeans(d))
>                 or
>                     filter.csv(pipe('bzcat infile.csv.bz2'),
>                                pipe("bzip2 -c > results.csv.bz2"), "date",
>                                function(d) d[!duplicated(d$date), ])
>                     ## filter out observations that repeat an earlier date
>
>                 or some reasonable variant of this.
>
>                 now that I can have many small chunks, it would be nice
>                 if this were threadsafe, so
>
>                     mcfilter.csv <- function(in.csv, out.csv, chunk, FUNprocess) ...
>
>                 with 'library(parallel)' could feed the FUNprocess to
>                 multiple cores and make sure that the processes don't
>                 step on one another.  (why did R not use a dot after
>                 "mc" for parallel lapply?)  presumably, to keep it
>                 simple, mcfilter.csv would keep a counter of read
>                 chunks and block write chunks until the next sequential
>                 chunk in order arrives.
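
One way to get ordered output without explicit locking (a simpler workaround
than the counter mcfilter.csv describes): let the workers only compute and
let the parent do all the writing, since mclapply returns its results in
input order.  For a batch of chunks that fits in memory (chunks, FUNprocess,
outcon, and the core count are placeholders):

    library(parallel)
    results <- mclapply(chunks, FUNprocess, mc.cores = 4)  # parallel compute
    for (res in results)                                    # serial, in-order writes
      write.table(res, outcon, sep = ",",
                  row.names = FALSE, col.names = FALSE)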
>
>                 just a suggestion...
>
>                 /iaw
>
>                 ----
>                 Ivo Welch (ivo.welch at gmail.com)
>
>
>
>
>
>             --
>             Gregory (Greg) L. Snow Ph.D.
>             538280 at gmail.com
>
>
>
>
>


