[R] Suggestion for big files [was: Re: A comment about R:]

Thu Jan 5 17:09:47 CET 2006

> -----Original Message-----
> 
> [ronggui]
> 
> >R is weak when handling large data files.  I have a data file: 807 vars,
> >118519 obs., in CSV format.  Stata can read it in 2 minutes, but on
> >my PC R can hardly handle it.  My PC's CPU is 1.7G; RAM is 512M.
> 
> Just (another) thought.  I used to use SPSS, many, many years ago, on
> CDC machines, where the CPU had limited memory and no kind of paging
> architecture.  Files did not need to be very large to be too large.
> 
> SPSS had a feature that was useful back then: the ability to sample a
> big dataset directly at file read time, well before processing starts.
> Maybe something similar could help in R (that is, instead of reading
> the whole data into memory, _then_ sampling it.)
> 
> One can read records from a file, up to a preset number of them.  If the
> file happens to contain more records than that preset number (the number
> of records in the whole file is not known beforehand), already read
> records may be dropped at random and replaced by other records coming
> from the file being read.  If the random selection algorithm is properly
> chosen, it can be made so that all records in the original file have
> equal probability of being kept in the final subset.
> 
> If such a sampling facility were built right into the usual R reading
> routines (triggered by an extra argument, say), it could offer
> a compromise for processing large files, and also sometimes accelerate
> computations for big problems, even when memory is not at stake.
> 
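
One standard selection rule with the equal-probability property described above is usually called reservoir sampling (Algorithm R). Below is a minimal sketch of it, written in Python rather than R purely for illustration. The function name `reservoir_sample` and its arguments are hypothetical; nothing like this is claimed to exist in R's reading routines or in SPSS.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(records: Iterable[T], k: int,
                     rng: Optional[random.Random] = None) -> List[T]:
    """Keep at most k records from a stream whose length is not known in advance.

    Every record in the stream ends up in the result with equal probability.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Record number i+1 is kept with probability k / (i + 1);
            # when kept, it replaces a uniformly chosen earlier record.
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = record
    return reservoir
```

Because the function only iterates over its input, it could be handed an open file object (after skipping a header line, say), so that only k lines are ever held in memory at once. The order of the kept records is not the file order, so a sort would be needed if that matters.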

Since I often work with images and other large data sets, I have been thinking about a "BLOb" (binary large object--though it wouldn't necessarily have to be binary) package for R--one that would handle I/O for such creatures and only bring as much data into the R space as was actually needed.

So I see 3 possibilities:

1. The sort of functionality you describe is implemented in the R internals (by people other than me).
2. Some individuals (perhaps myself included) write such a package.
3. This thread fizzles out and we do nothing.

I guess I will wait to see what discussion, if any, ensues from this point before deciding which of these three options seems worth pursuing.

> --
> François Pinard   http://pinard.progiciels-bpi.ca