[R] Memory problem on a linux cluster using a large data set [Broadcast]
Liaw, Andy
andy_liaw at merck.com
Mon Dec 18 19:48:23 CET 2006
In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirement of RF in R), she might
consider the following:
- Use larger nodesize
- Use sampsize to control the size of bootstrap samples
Both of these have the effect of reducing the size of the trees grown
(see the sketch below). For a data set that large, growing smaller
trees may well make no practical difference.
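A minimal sketch of both controls, where x and y stand in for a
predictor matrix and a factor response (hypothetical names, and the
particular values are illustrative only):

    library(randomForest)
    rf <- randomForest(x, y,
                       ntree = 500,
                       nodesize = 50,       # larger terminal nodes => shallower trees
                       sampsize = 10000,    # smaller bootstrap samples => smaller trees
                       keep.forest = FALSE)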
Still, with data of that size, I'd say 64-bit is the better solution.
Cheers,
Andy
From: Martin Morgan
>
> Iris --
>
> I hope the following helps; I think you have too much data
> for a 32-bit machine.
>
> Martin
>
> Iris Kolder <iriskolder at yahoo.com> writes:
>
> > Hello,
> >
> > I have a large data set: 320,000 rows and 1000 columns. All the
> > data has the values 0, 1, 2.
>
> It seems like a single copy of this data set will be at least
> a couple of gigabytes; I think you'll have access to only 4
> GB on a 32-bit machine (see section 8 of the R Installation
> and Administration guide), and R will probably end up, even
> in the best of situations, making at least a couple of copies
> of your data. Probably you'll need a 64-bit machine, or
> figure out algorithms that work on chunks of data.
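> A back-of-the-envelope figure, assuming the data end up stored as
> doubles (8 bytes per value):
>
>     320000 * 1000 * 8 / 2^30   # ~2.4 GiB for a single copy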
>
> > on a Linux cluster with R version 2.1.0, which operates on a 32
>
> That version is quite old; since then, it seems R has become more
> attentive to big-data issues and to tracking down unnecessary
> memory copying.
>
> > "cannot allocate vector size 1240 kb". I've searched through
>
> Use traceback() or options(error=recover) to figure out where the
> failing allocation actually occurs.
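> For example:
>
>     traceback()                # after the error: show the call stack
>     options(error = recover)   # or browse the call frames when the error is raised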
>
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
>
> This makes a data.frame, and data frames have several features
> (e.g., automatic creation of row names on subsetting) that can be
> problematic in terms of memory use. It is probably better to use a
> matrix, for which:
>
> 'read.table' is not the right tool for reading large matrices,
> especially those with many columns: it is designed to read _data
> frames_ which may have columns of very different classes. Use
> 'scan' instead.
>
> (from the help page for read.table). I'm not sure of the details
> of the algorithms you'll invoke, but it might be a false economy to
> try to get scan to read in 'small' versions of the data (e.g.,
> integer rather than numeric) -- the algorithms might insist on
> numeric data, and then make a copy during coercion from your small
> version to numeric.
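> Still, a minimal sketch with scan(), assuming the file really is
> 1000 whitespace-separated values per line (the column count and
> integer storage here are assumptions):
>
>     SNP <- matrix(scan("file.txt", what = integer()),
>                   ncol = 1000, byrow = TRUE)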
>
> > SNP$total.NAs <- rowSums(is.na(SNP))  # count the NAs per row and add a column with the total
>
> This adds a column to the data.frame or matrix, probably
> causing at least one copy of the entire data. Create a
> separate vector instead, even though this unties the
> coordination between columns that a data frame provides.
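> For instance, reusing the names from your script:
>
>     total.NAs <- rowSums(is.na(SNP))   # a separate vector, no new column
>     SNP <- SNP[total.NAs < 46, ]       # subset using the external vector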
>
> > SNP <- t(as.matrix(SNP))  # transpose rows and columns
>
> This will also probably trigger a copy.
>
> > snp.na <- SNP
>
> R might be clever enough to figure out that this simple
> assignment does not trigger a copy. But it probably means
> that any subsequent modification of snp.na or SNP *will*
> trigger a copy, so avoid the assignment if possible.
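> If your build of R has memory profiling enabled (see ?tracemem),
> you can watch for the duplication yourself:
>
>     tracemem(SNP)
>     snp.na <- SNP         # plain assignment: no copy yet
>     snp.na[1, 1] <- NA    # modifying either object now forces the copy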
>
> > snp.roughfix <- na.roughfix(snp.na)
>
> > fSNP <- factor(snp.roughfix[, 1])  # Assigns factor to case control status
> >
> > snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
> >                          na.action=na.roughfix, ntree=500, mtry=10,
> >                          importance=TRUE, keep.forest=FALSE, do.trace=100)
>
> Now you're entirely in the hands of the randomForest package. If
> memory problems occur here, perhaps you'll have gained enough
> experience to point the package maintainer to the problem and
> suggest a possible solution.
>
> > set it should be able to cope with that amount. Perhaps someone has
> > tried this before in R, or is Fortran a better choice? I added my R
>
> If you mean a pure Fortran solution, including coding the
> random forest algorithm, then of course you have complete
> control over memory management. You'd still likely be limited
> to addressing 4 GB of memory.
>
>
> > I wrote a script to remove all the rows with more than 46 missing
> > values. This works perfectly on a smaller dataset, but the problem
> > arises when I try to run it on the larger data set: I get an error
> > "cannot allocate vector of size 1240 Kb". I've searched through
> > previous posts and found out that it might be because I'm running
> > it on a Linux cluster with R version 2.1.0, which operates on a
> > 32-bit processor. But I could not find a solution for this
> > problem. The cluster is a really fast one and should be able to
> > cope with these large amounts of data; the system's configuration
> > is: speed 3.4 GHz, memory 4 GByte. Is there a way to change the
> > settings or processor under R? I want to run the function
> > randomForest on my large data set; it should be able to cope with
> > that amount. Perhaps someone has tried this before in R, or is
> > Fortran a better choice? I added my R script down below.
> >
> > Best regards,
> >
> > Iris Kolder
> >
> > library(randomForest)                  # provides randomForest() and na.roughfix()
> >
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
> > SNP[SNP == 9] <- NA                    # recode missing values from 9 to NA
> > SNP$total.NAs <- rowSums(is.na(SNP))   # count NAs per row; add a column with the total
> > SNP <- SNP[SNP$total.NAs < 46, ]       # keep rows with no more than 5% (46) NAs
> > SNP$total.NAs <- NULL                  # remove the added column of NA counts
> > SNP <- t(as.matrix(SNP))               # transpose rows and columns
> > set.seed(1)
> >
> > snp.na <- SNP
> > snp.roughfix <- na.roughfix(snp.na)
> >
> > fSNP <- factor(snp.roughfix[, 1])      # assign factor to case control status
> >
> > snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
> >                          na.action=na.roughfix, ntree=500, mtry=10,
> >                          importance=TRUE, keep.forest=FALSE, do.trace=100)
> >
> > print(snp.narf)
> >
>
> --
> Martin T. Morgan
> Bioconductor / Computational Biology
> http://bioconductor.org
>