[R] Help/information required
dwinsemius at comcast.net
Mon Sep 18 17:58:36 CEST 2017
> On Sep 17, 2017, at 9:24 PM, Ajay Arvind Rao <AjayArvind.Rao at gmrgroup.in> wrote:
> We are using the open-source distribution of R to analyze data at our organization. The system configuration is as follows:
> * System configuration:
> o Operating System - Windows 7 Enterprise SP1, 64 bit (Desktop)
> o RAM - 8 GB
> o Processor - i5-6500 @ 3.2 Ghz
> * R Version:
> o R Studio 1.0.136
> o R 3.4.0
> While trying to merge two datasets, we received the following resource error message on running the code:
> Code: merg_data <- merge(x=Data_1Junto30Jun,y=flight_code,by.x="EB_FLNO1",by.y="EB_FLNO1",all.x = TRUE)
> Error Message: Error: cannot allocate vector of size 5.8 Gb
> Later we tried running the code differently, but the error remained:
> Code: merg_data <- sqldf("Select * from Data_1Junto30Jun as a inner join flight_code as b on a.EB_FLNO1=b.EB_FLNO1")
> Error Message: Error: cannot allocate vector of size 200.0 Mb
> We upgraded the RAM to 8 GB a couple of months ago. Can you let us know options to resolve the above issue without having to increase the RAM further? The sizes of the datasets are as follows:
> * Data_1Junto30Jun (513476 obs of 32 variables). Data size - 172033368 bytes / 172 MB
> * flight_code (478105 obs of 2 variables). Data size - 3836304 bytes / 4 MB
> Help with determining system requirements:
> Is there a way to determine the minimum system requirements (hardware and software)
There are some packages for working with data "out of memory". See bigmemory and the other "big*" packages. See also the data.table package, which has many satisfied users. There are also several packages for handling data through database connections; that would probably be the preferred method for your use case.
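As a hypothetical sketch (with toy stand-ins for your Data_1Junto30Jun and flight_code tables, whose real columns I can't see), the same left join can be done with data.table's keyed join, which is typically far more memory-efficient than base merge():

```r
library(data.table)

# toy stand-ins for the poster's two tables
Data_1Junto30Jun <- data.table(EB_FLNO1 = c("AI101", "AI102", "AI103"),
                               pax      = c(150, 160, 170))
flight_code      <- data.table(EB_FLNO1 = c("AI101", "AI103"),
                               code     = c("DOM", "INT"))

# set the join key on both tables
setkey(Data_1Junto30Jun, EB_FLNO1)
setkey(flight_code, EB_FLNO1)

# left join, analogous to merge(..., all.x = TRUE):
# keeps every row of Data_1Junto30Jun, NA where no flight_code match
merg_data <- flight_code[Data_1Junto30Jun]
```

Column names and values here are invented for illustration; the pattern (setkey on both tables, then `y[x]` for a left join of x) is the part that carries over.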
R objects are frequently copied on assignment and modification (R uses copy-on-modify semantics), and this means that you need, at a minimum, twice as much free memory as the object occupies, and in _contiguous_ chunks. You will often be fragmenting memory with other code and other out-of-R processes. Windows was in the past notorious for poor memory management; I don't know whether Windows 7 continued that tradition or whether later versions avoid the problem.
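You can watch copy-on-modify happen with base R's tracemem() (a small sketch; the vector here is arbitrary):

```r
x <- numeric(1e6)   # one million doubles, roughly 8 MB
tracemem(x)         # mark x so any copy of it is reported

y <- x              # no copy yet: x and y share the same storage
y[1] <- 42          # modifying y forces the copy; tracemem prints it
                    # at this moment ~16 MB are in use for the two vectors

x[1]                # x is untouched: still 0
```

For a 172 MB dataframe, a few such copies inside merge() plus the result itself can easily exceed the largest contiguous block available.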
A dataframe consumes about 8 bytes per row for each numeric column. Character vectors are stored via a global string cache, and factors store integer codes plus a table of levels, so the memory consumed depends on the degree of duplication of entries. Duplication also affects merge operations: a merge produces the Cartesian product of the matching rows, so if you merge two dataframes with heavily duplicated keys you will often get a message such as: "Error: cannot allocate vector of size 5.8 Gb"
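A tiny illustration (invented data) of how duplicated keys multiply rows: one key value appearing 2 times on one side and 3 times on the other yields 2 * 3 = 6 output rows, which is how a 172 MB table can demand a 5.8 GB result.

```r
# key "AI101" appears twice in x and three times in y
x <- data.frame(EB_FLNO1 = c("AI101", "AI101"), a = 1:2)
y <- data.frame(EB_FLNO1 = c("AI101", "AI101", "AI101"), b = 1:3)

m <- merge(x, y, by = "EB_FLNO1")
nrow(m)   # 6: the Cartesian product of the matching rows
```

Checking for duplicated keys (e.g. with `anyDuplicated(flight_code$EB_FLNO1)`) before merging is usually the first diagnostic step.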
The second error you cite suggests that much of your 8 GB of memory has become fragmented.
Most of this information should be available via searching in Rhelp or RSeek.
> depending on the size of the data, the way the data is loaded into R (directly from a server or from a flat file), and the type of analysis to be run?
The source of the data makes no difference, but I cannot comment on the type of analysis because that part of the question is too vague. (Aside from mentioning the Cartesian multiplication of merge results, which often trips up new users of database technology.)
> We have not been able to get any specific information related to this and are estimating the requirements through a trial and error method. Any information on this front will be helpful.
This suggests an impoverished ability for searching.
Alameda, CA, USA
'Any technology distinguishable from magic is insufficiently advanced.' -Gehm's Corollary to Clarke's Third Law