[R] Suggestion for big files [was: Re: A comment about R:]

Thu Jan 5 16:46:09 CET 2006

[ronggui]

>R is weak when handling large data files.  I have a data file: 807 vars,
>118519 obs., in CSV format.  Stata can read it in 2 minutes, but on
>my PC, R can hardly handle it.  My PC's CPU is 1.7G; RAM 512M.

Just (another) thought.  I used to use SPSS, many, many years ago, on 
CDC machines, where the CPU had limited memory and no kind of paging 
architecture.  A file did not need to be very large to be too large.

SPSS had a feature that was useful then: the ability to sample a big
dataset directly at file read time, well before processing starts.
Maybe something similar could help in R (that is, sampling while
reading, instead of reading the whole data into memory, _then_ sampling it.)

One can read records from a file, up to a preset number of them.  If the
file happens to contain more records than that preset number (the number 
of records in the whole file is not known beforehand), already read 
records may be dropped at random and replaced by other records coming 
from the file being read.  If the random selection algorithm is properly 
chosen, it can be made so that all records in the original file have 
equal probability of being kept in the final subset.
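
As a rough sketch of the scheme just described (written in Python rather
than R, purely for illustration): the function name `sample_records` and
its arguments are made up here, and the replacement rule is the one
usually called reservoir sampling, which may or may not be what SPSS did.
Record number n is kept with probability k/n and, when kept, overwrites a
uniformly chosen slot among the k already held.

```python
import io
import random


def sample_records(records, k, rng=None):
    """Return a uniform random sample of at most k items from an iterable.

    The total number of records need not be known beforehand; only the
    k retained records are held in memory at any time.
    """
    if rng is None:
        rng = random.Random()
    kept = []
    for n, record in enumerate(records, start=1):
        if n <= k:
            # Fill phase: the first k records are kept unconditionally.
            kept.append(record)
        else:
            # Draw j uniformly from 0..n-1.  With probability k/n it
            # falls below k, and the new record replaces slot j.
            j = rng.randrange(n)
            if j < k:
                kept[j] = record
    return kept


# Hypothetical usage: a small in-memory "file" with a header line.
data = io.StringIO("id,x\n" + "".join(f"{i},{i * i}\n" for i in range(1000)))
header = next(data)                  # keep the header apart from the records
subset = sample_records(data, 50, random.Random(1))
print(header.strip(), len(subset))
```

If the file holds fewer than k records, all of them come back in their
original order, so nothing special is needed for small files.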

If such a sampling facility were built right into the usual R reading 
routines (triggered by an extra argument, say), it could offer 
a compromise for processing large files, and also sometimes accelerate 
computations for big problems, even when memory is not at stake.

-- 
François Pinard   http://pinard.progiciels-bpi.ca