[R] Reading big files in chunks-ff package

Jan van der Laan rhelp at eoos.dds.nl
Sun Mar 25 21:20:13 CEST 2012


The 'normal' way of doing that with ff is to first convert your csv file
completely to an ffdf object (which stores its data on disk, so it shouldn't
give any memory problems). You can then use the chunk routine (see ?chunk) to
divide your data into the required chunks.

Untested so may contain errors:

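library(ff)

# "..." stands for your own arguments (file, sep, header, and so on)
yourffdf <- read.table.ffdf(...)

# row ranges of at most 5 million rows each
chnks <- chunk(from=1, to=nrow(yourffdf), by=5E6, method='seq')

for (chnk in chnks) {
   # read this chunk's rows into memory as an ordinary data.frame
   data <- yourffdf[chnk, ]
   # do your thing with the data
   # clean up
   rm(data)
   gc()
}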
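
A slightly fuller sketch of the same loop, covering the "analyze a chunk, write
the result to a new csv, go on to the next chunk" workflow. The file names, the
columns (id, value), the filter and the chunk size of 10 rows are made up for
illustration; the small temporary csv only stands in for the real file.

```r
library(ff)

# Hypothetical stand-in for the real csv file
csv_in  <- tempfile(fileext = ".csv")
csv_out <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, value = (1:25) / 2), csv_in, row.names = FALSE)

# Convert the whole csv to an on-disk ffdf
big <- read.csv.ffdf(file = csv_in, header = TRUE)

# Row ranges of at most 10 rows each (use e.g. 5E6 for real data)
chunks <- chunk(from = 1, to = nrow(big), by = 10)

first <- TRUE
for (i in chunks) {
  # Only this chunk is pulled into memory, as a normal data.frame
  block <- big[i, ]

  # Placeholder analysis: keep the rows with value > 5
  result <- block[block$value > 5, , drop = FALSE]

  # Write the header once, then append the later chunks
  write.table(result, csv_out, sep = ",", row.names = FALSE,
              col.names = first, append = !first)
  first <- FALSE

  rm(block, result)
}
```

Appending to one output file keeps memory use bounded by the chunk size; writing
one output file per chunk would work just as well.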


If you want to process your csv file directly in chunks, you could also have a
look at the LaF package, especially the process_blocks routine, which does
exactly that. The manual vignette
(http://cran.r-project.org/web/packages/LaF/vignettes/LaF-manual.pdf)
contains some examples of how to do that.
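
As a rough idea of what that looks like (a sketch, not taken from the vignette):
the file, its columns (id, value) and the function sum_large are made up, and the
argument names are as I remember them from the LaF documentation, so check
?laf_open_csv and ?process_blocks before relying on them.

```r
library(LaF)

# Hypothetical stand-in for the real csv file
csv_in <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:25, value = (1:25) / 2), csv_in, row.names = FALSE)

# Describe the file; nothing is read yet. skip = 1 skips the header line.
laf <- laf_open_csv(csv_in,
                    column_types = c("integer", "double"),
                    column_names = c("id", "value"),
                    skip = 1)

# Called once per block: 'block' is a data.frame with the rows of that block,
# 'prev' is the value returned by the previous call (NULL on the first call).
# It should also cope with an empty block at the end of the file.
sum_large <- function(block, prev) {
  if (is.null(prev)) prev <- 0
  prev + sum(block$value[block$value > 5])
}

# Read the file 10 rows at a time (use a much larger nrows for real data)
total <- process_blocks(laf, sum_large, nrows = 10)
```

The function passed to process_blocks could just as well append each analyzed
block to an output csv instead of accumulating a single number.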

Jan



Quoting Mav <mastorvarela at gmail.com>:

> Thank you Jan
>
> My problem is the following:
> For instance, I have 2 files with different numbers of rows (15 million and 8
> million rows respectively).
> I would like to read the first one in chunks of 5 million rows each. However,
> between the first and second chunk, I would like to analyze those first 5
> million rows, write the analysis to a new csv, and then proceed to read
> and analyze the second chunk, and so on until the third chunk. With the
> second file, I would like to do the same: read the first chunk, analyze it,
> and continue to read the second and analyze it.
>
> Basically my problem is that I manage to read the files....but with so many
> rows...I cannot do any analyses (even filtering the rows) because of the RAM
> restrictions.
>
> Sorry if it is still not clear.
>
> Thank you


