[R] Suggestion for big files [was: Re: A comment about R:]

Prof Brian Ripley ripley at stats.ox.ac.uk
Thu Jan 5 17:48:26 CET 2006


Another possibility is to make use of the several DBMS interfaces already 
available for R.  It is very easy to pull in a sample from one of those, 
and surely keeping such large data files as ASCII is not good practice.
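
To make the idea of pulling a sample from a DBMS concrete, here is a minimal sketch. It is written in Python with the standard sqlite3 module purely for illustration (in R one would go through one of the DBMS interface packages instead). The table name `obs`, its columns, and the helper `sample_rows` are all made up for the example, and `ORDER BY RANDOM()` is just one simple way to ask SQLite for a random subset, not necessarily the most efficient one.

```python
import sqlite3


def sample_rows(conn, table, n):
    """Return n rows picked at random by the database itself."""
    # Table names cannot be bound as SQL parameters, so `table` must come
    # from trusted code, never from user input.
    query = f"SELECT * FROM {table} ORDER BY RANDOM() LIMIT ?"
    return conn.execute(query, (n,)).fetchall()


# Hypothetical data: an in-memory table standing in for a large dataset.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE obs (id INTEGER PRIMARY KEY, x REAL)")
conn.executemany("INSERT INTO obs (x) VALUES (?)",
                 [(i * 0.5,) for i in range(10000)])
conn.commit()

# Only the sampled rows ever reach the client process.
sample = sample_rows(conn, "obs", 100)
print(len(sample))
```

The point of the design is that the full table stays in the database and only the sample is transferred into the analysis session.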

One problem with Francois Pinard's suggestion (the credit has got lost) is 
that R's I/O is not line-oriented but stream-oriented.  So selecting lines 
is not particularly easy in R.  That's a deliberate design decision, given 
the DBMS interfaces.

I rather thought that using a DBMS was standard practice in the R 
community for those using large datasets: it gets discussed rather often.

On Thu, 5 Jan 2006, Kort, Eric wrote:

>> -----Original Message-----
>>
>> [ronggui]
>>
>>> R is weak when handling large data files.  I have a data file with
>>> 807 variables and 118519 observations, in CSV format.  Stata can read
>>> it in 2 minutes, but on my PC R can hardly handle it.  My PC has a
>>> 1.7 GHz CPU and 512 MB of RAM.
>>
>> Just (another) thought.  I used to use SPSS, many, many years ago, on
>> CDC machines, where the CPU had limited memory and no kind of paging
>> architecture.  Files did not need to be very large to be too large.
>>
>> SPSS had a feature that was useful then: it could sample a big dataset
>> directly at file read time, well before processing started.  Maybe
>> something similar could help in R (that is, sampling while reading,
>> instead of reading the whole data into memory and _then_ sampling it).
>>
>> One can read records from a file, up to a preset amount of them.  If the
>> file happens to contain more records than that preset number (the number
>> of records in the whole file is not known beforehand), already read
>> records may be dropped at random and replaced by other records coming
>> from the file being read.  If the random selection algorithm is properly
>> chosen, it can be made so that all records in the original file have
>> equal probability of being kept in the final subset.
>>
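
The scheme described in the paragraph above is commonly known as reservoir sampling (the simplest variant is often called Algorithm R); the poster does not name it, so that identification is an assumption. A minimal sketch follows, in Python for illustration only. The function name `reservoir_sample` and the fake in-memory "file" are hypothetical, and nothing here reflects an existing R reading routine.

```python
import io
import random


def reservoir_sample(records, k, rng=random):
    """Keep a uniform random sample of k records from a stream of unknown length."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Record number i+1 is kept with probability k / (i + 1);
            # if kept, it replaces a uniformly chosen earlier record.
            j = rng.randint(0, i)  # both ends inclusive
            if j < k:
                reservoir[j] = record
    return reservoir


# Hypothetical input: 1000 lines read one at a time, as from a large file.
fake_file = io.StringIO("".join(f"record {i}\n" for i in range(1000)))
subset = reservoir_sample(fake_file, 10, random.Random(42))
print(len(subset))
```

Only k records are held in memory at any time, and the total number of records never needs to be known in advance. If the stream has fewer than k records, all of them are returned.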
>> If such a sampling facility were built right into the usual R reading
>> routines (triggered by an extra argument, say), it could offer
>> a compromise for processing large files, and also sometimes accelerate
>> computations for big problems, even when memory is not at stake.
>>
>
> Since I often work with images and other large data sets, I have been thinking about a "BLOb" (binary large object--though it wouldn't necessarily have to be binary) package for R--one that would handle I/O for such creatures and only bring as much data into the R space as was actually needed.
>
> So I see 3 possibilities:
>
> 1. The sort of functionality you describe is implemented in the R internals (by people other than me).
> 2. Some individuals (perhaps myself included) write such a package.
> 3. This thread fizzles out and we do nothing.
>
> I guess I will see what, if any, discussion ensues from this point to see which of these three options seems worth pursuing.
>
>> --
>> François Pinard   http://pinard.progiciels-bpi.ca

-- 
Brian D. Ripley,                  ripley at stats.ox.ac.uk
Professor of Applied Statistics,  http://www.stats.ox.ac.uk/~ripley/
University of Oxford,             Tel:  +44 1865 272861 (self)
1 South Parks Road,                     +44 1865 272866 (PA)
Oxford OX1 3TG, UK                Fax:  +44 1865 272595