[R] Help with R

Christoph Lehmann christoph.lehmann at gmx.ch
Thu May 5 15:05:55 CEST 2005

>>heard that 'R' does not do a very good job at handling large datasets, is
>>this true? 
importing huge datasets into a data.frame, with e.g. a subsequent step of 
converting some columns into factors, may lead to memory trouble 
(probably due to the memory overhead incurred when building the factors). But 
we recently succeeded in importing 12 million data records stored in 
a MySQL database, using the RMySQL package. The procedure that led to 
success was:

0 define a data.frame 'data.total' large enough to hold the whole 
data set to be imported
in a loop do:
   1 import the data in chunks of e.g. 30000 records per chunk and save 
each chunk in a temporary data.frame 'data.chunk'
   2 do the conversion into factors and other preprocessing steps, such 
as data aggregation, for each single chunk in 'data.chunk' right 
after import
   3 copy the now preprocessed chunk into the appropriate rows of the 
data.frame 'data.total' defined at the beginning

4 the whole dataset is now imported and data.frame 'data.total' is ready 
for further computational steps
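The steps above can be sketched roughly as follows. This is a minimal 
illustration of the preallocate-and-fill pattern, not the actual import 
code: fetch_chunk() is a hypothetical stand-in for the RMySQL query 
(e.g. dbGetQuery() with LIMIT/OFFSET), and the table columns, levels, 
and sizes are made up for the example.

```r
n.total    <- 120000   # total number of records (assumed known in advance)
chunk.size <- 30000

## step 0: preallocate 'data.total' at its full final size
data.total <- data.frame(id    = integer(n.total),
                         group = factor(rep(NA, n.total),
                                        levels = c("a", "b")),
                         value = numeric(n.total))

## hypothetical placeholder for the RMySQL fetch, e.g.
## dbGetQuery(con, sprintf("SELECT id, grp, value FROM tbl
##                          LIMIT %d OFFSET %d", n, offset))
fetch_chunk <- function(offset, n) {
  data.frame(id    = seq_len(n) + offset,
             grp   = sample(c("a", "b"), n, replace = TRUE),
             value = rnorm(n),
             stringsAsFactors = FALSE)
}

offset <- 0
while (offset < n.total) {
  n <- min(chunk.size, n.total - offset)
  ## step 1: import one chunk into a temporary data.frame
  data.chunk <- fetch_chunk(offset, n)
  ## step 2: per-chunk preprocessing, e.g. conversion into factors
  data.chunk$grp <- factor(data.chunk$grp, levels = c("a", "b"))
  ## step 3: copy the chunk into its rows of the preallocated frame
  rows <- (offset + 1):(offset + n)
  data.total$id[rows]    <- data.chunk$id
  data.total$group[rows] <- data.chunk$grp
  data.total$value[rows] <- data.chunk$value
  offset <- offset + n
}
## step 4: data.total now holds the whole data set
```

The point of preallocating 'data.total' once is that only one chunk 
plus the final frame are ever held in memory; the factor conversion 
is done on 30000 rows at a time rather than on all 12 million at once.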

in a nutshell: preprocessing steps such as conversion into factors cause 
memory trouble, even for data sets which per se don't take too much 
memory - but done separately on smaller chunks of data, the import can be 
handled very efficiently in R. The 'team' of MySQL together with R is VERY 
powerful

