[R] Handling large data sets via scan()

Mulholland, Tom Tom.Mulholland at dpi.wa.gov.au
Fri Feb 4 08:26:49 CET 2005


I'm sure others with more experience will answer this, but for what it's worth my experience suggests that memory problems are more often down to how the data are handled than to the machine itself. I don't use Linux, so I can't comment specifically on your machine's capacity. However, R often needs to hold a copy of an object in memory while it is building the new version, so if a data.frame gets to 1.4GB there isn't much headroom left once an original and a copy have to exist at the same time. (I speculate that this is the case rather than asserting it.)
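
For example (a minimal sketch only; the file name, separator and column count are made up), you can watch that copy overhead yourself with object.size() and gc():

x <- scan("mydata.txt", what = as.list(rep(0, 600)), sep = "\t")
print(object.size(x), units = "Mb")   # size of the scanned list

gc()                                  # note current memory use
df <- as.data.frame(x)                # conversion makes at least one extra copy
print(object.size(df), units = "Mb")  # size of the resulting data.frame
gc()                                  # "max used" shows the peak hit during conversion

The gap between object.size() and the "max used" column of gc() is roughly the temporary copy R needed while building the data.frame.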

From a practical point of view, I assume that when you say you have 600 features you are not going to use each and every one of them in the models you may generate. So is it practical to limit the features to those you wish to use before creating a data.frame?
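
For instance (again only a sketch; the file name, separator and the idea that you want the first 10 fields are assumptions), scan() will skip a field entirely if the corresponding component of 'what' is NULL, so the unwanted columns never occupy memory at all:

wanted <- 10
what <- c(as.list(rep(0, wanted)),       # read these fields as numeric
          vector("list", 600 - wanted))  # NULL components: skip these fields
x <- scan("mydata.txt", what = what, sep = "\t")
x <- x[!sapply(x, is.null)]              # keep only the fields that were read
names(x) <- paste("V", seq_along(x), sep = "")
df <- as.data.frame(x)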

In short, if you really do need to work this way, I suggest you read the frequent posts on memory issues until you are either fully conversant with the memory behaviour of the machine you have, or you have found one of the many suggested workarounds, such as keeping the data in a database and pulling in only what you need via SQL. Searching for "large dataset" on Jonathan Baron's R site search gave over 400 hits: http://finzi.psych.upenn.edu/nmz.html
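
As one concrete version of that (a sketch only: the RSQLite package, the table name and the column names are my own assumptions, not something from the original post), you could load the flat file into an SQLite database once and then pull in just the columns a given model needs:

## Assumes the 150,000 x 600 table has already been imported into
## mydata.sqlite (e.g. with the sqlite3 command-line tool, or from R in
## chunks) as a table called "features" with columns y, x1, x2, ...
library(DBI)
library(RSQLite)

con <- dbConnect(SQLite(), dbname = "mydata.sqlite")

## Only the response and the few predictors used here ever enter R;
## the remaining ~590 columns stay on disk.
df <- dbGetQuery(con, "SELECT y, x1, x2, x3 FROM features")
fit <- lm(y ~ x1 + x2 + x3, data = df)

dbDisconnect(con)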

Tom

> -----Original Message-----
> From: Nawaaz Ahmed [mailto:nawaaz at inktomi.com]
> Sent: Friday, 4 February 2005 2:40 PM
> To: R-help at stat.math.ethz.ch
> Cc: nawaaz at yahoo-inc.com
> Subject: [R] Handling large data sets via scan()
> 
> 
> I'm trying to read in datasets with roughly 150,000 rows and 600
> features. I wrote a function using scan() to read it in (I have a 4GB
> Linux machine) and it works like a charm. Unfortunately, converting the
> scanned list into a data.frame using as.data.frame() causes the memory
> usage to explode (it can go from 300MB for the scanned list to 1.4GB for
> a data.frame of 30,000 rows) and it fails, claiming it cannot allocate
> memory (though it is still not close to the 3GB per-process limit on my
> Linux box - the message is "unable to allocate vector of size 522K").
> So I have three questions --
> 
> 1) Why is it failing even though there seems to be enough memory available?
> 
> 2) Why is converting it into a data.frame causing the memory usage to
> explode? Am I using as.data.frame() wrongly? Should I be using some
> other command?
> 
> 3) All the model fitting packages seem to want to use data.frames as
> their input. If I cannot convert my list into a data.frame, what can I
> do? Is there any way of getting around this?
> 
> Much thanks!
> Nawaaz
> 
> ______________________________________________
> R-help at stat.math.ethz.ch mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide!
> http://www.R-project.org/posting-guide.html