[R] size limitations in R
Peter Dalgaard
p.dalgaard at biostat.ku.dk
Sat Sep 1 14:36:41 CEST 2007
Daniel Lakeland wrote:
> On Fri, Aug 31, 2007 at 01:31:12PM +0100, Fabiano Vergari wrote:
>
>
>> I am a SAS user currently evaluating R as a possible addition or
>> even replacement for SAS. The difficulty I have come across straight
>> away is R's apparent difficulty in handling relatively large data
>> files. Whilst I would not expect it to handle datasets with millions
>> of records, I still really need to be able to work with datasets with
>> 100,000+ records and 100+ variables. Yet, when reading a .csv file
>> with 180,000 records and about 200 variables, the software virtually
>> ground to a halt (I stopped it after 1 hour). Are there guidelines
>> or maybe a limitations document anywhere that helps me assess the
>> size
>>
>
> 180k records with 200 variables = 36 million entries, if they're
> numeric then they're doubles taking up 8 bytes, so 288 MB of RAM. This
> should be perfectly fine for R, as long as you have that much free
> RAM.
>
> However, the routines that read CSV and tabular delimited files are
> relatively inefficient for such large files.
>
> In order to handle large data files, it is better to use one of the
> database interfaces. My preference would be sqlite unless I already
> had the data on a mysql or other database server.
>
>
Yes. However, as an intermediate solution, note that much of the
inefficiency comes from storing the data as character vectors before
deciding what to do with them. Character vectors carry an overhead of
one SEXP per string stored, i.e. 20-28 bytes in addition to the actual
string. There are options for telling the read routines explicitly that
the data are numeric/integer/logical: 'colClasses' for read.table() and
'what' for scan(). Specifying these bypasses the intermediate character
storage.
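
As a rough sketch of the two options just mentioned (not a benchmark):
the small temporary CSV file and the column names 'id', 'x' and 'flag'
below are made up purely for illustration and merely stand in for the
large file discussed in this thread.

```r
# Illustrative data only: a tiny CSV written to a temporary file.
csv_file <- tempfile(fileext = ".csv")
demo <- data.frame(id   = 1:5,
                   x    = c(0.5, 1.5, 2.5, 3.5, 4.5),
                   flag = c(TRUE, FALSE, TRUE, TRUE, FALSE))
write.csv(demo, csv_file, row.names = FALSE)

# read.table(): declare the column types up front with 'colClasses',
# so the columns need not be held as character vectors first.
df <- read.table(csv_file, header = TRUE, sep = ",",
                 colClasses = c("integer", "numeric", "logical"))
str(df)

# scan(): 'what' is a list with one template value per field;
# skip = 1 steps over the header line.
cols <- scan(csv_file, sep = ",", skip = 1,
             what = list(id = integer(), x = numeric(), flag = logical()),
             quiet = TRUE)
str(cols)
```

For a file with 200 numeric columns, the analogous call would pass
colClasses = rep("numeric", 200). Giving 'nrows' as well should further
reduce memory use, according to ?read.table.
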
> The documentation for the packages RSQLite and SQLiteDF should be
> helpful, as well as the documentation for SQLite itself, which has a
> facility for efficiently importing CSV and similar files directly to a
> SQLite database.
>
> e.g.: http://netadmintools.com/art572.html
>
>
>
>
--
   O__  ---- Peter Dalgaard             Øster Farimagsgade 5, Entr.B
  c/ /'_ --- Dept. of Biostatistics     PO Box 2099, 1014 Cph. K
 (*) \(*) -- University of Copenhagen   Denmark          Ph:  (+45) 35327918
~~~~~~~~~~ - (p.dalgaard at biostat.ku.dk)                  FAX: (+45) 35327907