[R] solution design for a large scale (> 50G) R computing problem

jeffc hcen at andrew.cmu.edu
Sat Nov 7 17:17:00 CET 2009


Hi,

I am tackling a computing problem in R that involves large data. Both time
and memory issues need to be considered seriously. Below is the problem
description and my tentative approach. I would appreciate it if anyone could
share thoughts on how to solve this problem more efficiently.

I have 1001 multidimensional arrays -- A, B1, ..., B1000. A takes about
500MB in memory and each B_i takes about 100MB. I need to run an experiment
that evaluates a function f(A, B_i) for every B_i. f(A, B_i) does not modify
A or B_i during its evaluation, and the evaluations are independent across
i. I also need to design various evaluation functions, so these kinds of
experiments have to be performed often.
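
Concretely, the current serial version looks roughly like the sketch below
(f, the file names, and the object names are placeholders for my actual
code and data):

load("A.RData")                        # creates the array A (~500MB)
results <- vector("list", 1000)
for (i in 1:1000) {
  load(sprintf("B_%04d.RData", i))     # creates B, ~100MB read per iteration
  results[[i]] <- f(A, B)              # f never modifies A or B
  rm(B)
}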

My computing environment is a 64-bit Linux PC with 64GB of memory and 8
cores. My goal is to run multiple experiments quickly on the existing
equipment.

One possible approach is to run an R process that loads A and then use a
parallel library such as foreach with multicore to load each B_i and compute
f(A, B_i). The problems with this approach are that each time foreach spawns
a new worker process it has to 1) copy the whole A array and 2) read B_i
from disk into memory. Since f(A, B_i) does not modify A or B_i, would it be
possible in R to 1) share A across the worker processes and 2) use a
memory-mapped file to load each B_i (and perhaps A itself at the start)?
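
To make the question concrete, here is roughly what I have in mind: mclapply
from the multicore package forks the master process, so on Linux the children
should see A copy-on-write rather than copying it, and the bigmemory calls
are only my guess at how the memory-mapped part might look:

library(multicore)    # for mclapply; foreach + doMC would be similar
library(bigmemory)    # guessing at the memory-mapping; ff might also work

load("A.RData")       # A is loaded once in the master process;
                      # forked children share these pages copy-on-write

run_one <- function(i) {
  ## attach a file-backed big.matrix via its descriptor file instead of
  ## reading the whole object; the .bin/.desc files would be created once
  ## beforehand with read.big.matrix(..., backingfile=, descriptorfile=)
  B <- attach.big.matrix(sprintf("B_%04d.desc", i))
  f(A, B)             # f treats A and B as read-only
}

results <- mclapply(1:1000, run_one, mc.cores = 8)

I am not sure this is the right combination of tools (for one thing, my B_i
are multidimensional arrays rather than matrices, so bigmemory may not fit),
which is partly why I am asking.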

Any suggestions would be appreciated.

Jeff




