[R] Memory problem on a linux cluster using a large data set [Broadcast]
Liaw, Andy
andy_liaw at merck.com
Mon Dec 18 19:48:23 CET 2006
In addition to my off-list reply to Iris (pointing her to an old post of
mine that detailed the memory requirement of RF in R), she might
consider the following:
- Use larger nodesize
- Use sampsize to control the size of bootstrap samples
Both of these have the effect of reducing the size of the trees grown
(see the sketch below). For a data set that large, growing smaller
trees may well make no practical difference.
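A minimal sketch of both controls, where x and y stand in for a
predictor matrix and a factor response (hypothetical names, and the
particular values are illustrative only):

    library(randomForest)
    rf <- randomForest(x, y,
                       ntree = 500,
                       nodesize = 50,       # larger terminal nodes => shallower trees
                       sampsize = 10000,    # smaller bootstrap samples => smaller trees
                       keep.forest = FALSE)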
Still, with data of that size, I'd say 64-bit is the better solution.
Cheers,
Andy
From: Martin Morgan
>
> Iris --
>
> I hope the following helps; I think you have too much data
> for a 32-bit machine.
>
> Martin
>
> Iris Kolder <iriskolder at yahoo.com> writes:
>
> > Hello,
> >
> > I have a large data set: 320,000 rows and 1000 columns. All the
> > data has the values 0, 1, 2.
>
> It seems like a single copy of this data set will be at least
> a couple of gigabytes; I think you'll have access to only 4
> GB on a 32-bit machine (see section 8 of the R Installation
> and Administration guide), and R will probably end up, even
> in the best of situations, making at least a couple of copies
> of your data. Probably you'll need a 64-bit machine, or
> figure out algorithms that work on chunks of data.
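> A back-of-the-envelope figure, assuming the data end up stored as
> doubles (8 bytes per value):
>
>     320000 * 1000 * 8 / 2^30   # ~2.4 GiB for a single copy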
>
> > on a Linux cluster with R version 2.1.0, which operates on a 32
>
> That version is quite old; since then, it seems R has become more
> attentive to big-data issues and to tracking down unnecessary
> memory copying.
>
> > "cannot allocate vector size 1240 kb". I've searched through
>
> Use traceback() or options(error=recover) to figure out where the
> failing allocation actually occurs.
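> For example:
>
>     traceback()                # after the error: show the call stack
>     options(error = recover)   # or browse the call frames when the error is raised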
>
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
>
> This makes a data.frame, and data frames have several features
> (e.g., automatic creation of row names on subsetting) that can be
> problematic in terms of memory use. It is probably better to use a
> matrix, for which:
>
> 'read.table' is not the right tool for reading large matrices,
> especially those with many columns: it is designed to read _data
> frames_ which may have columns of very different classes. Use
> 'scan' instead.
>
> (from the help page for read.table). I'm not sure of the details
> of the algorithms you'll invoke, but it might be a false economy to
> try to get scan to read in 'small' versions of the data (e.g.,
> integer rather than numeric) -- the algorithms might insist on
> numeric data, and then make a copy during coercion from your small
> version to numeric.
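> Still, a minimal sketch with scan(), assuming the file really is
> 1000 whitespace-separated values per line (the column count and
> integer storage here are assumptions):
>
>     SNP <- matrix(scan("file.txt", what = integer()),
>                   ncol = 1000, byrow = TRUE)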
>
> > SNP$total.NAs <- rowSums(is.na(SNP))  # count the NAs per row and add a column with the total
>
> This adds a column to the data.frame or matrix, probably
> causing at least one copy of the entire data. Create a
> separate vector instead, even though this unties the
> coordination between columns that a data frame provides.
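> For instance, reusing the names from your script:
>
>     total.NAs <- rowSums(is.na(SNP))   # a separate vector, no new column
>     SNP <- SNP[total.NAs < 46, ]       # subset using the external vector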
>
> > SNP <- t(as.matrix(SNP))  # transpose rows and columns
>
> This will also probably trigger a copy.
>
> > snp.na <- SNP
>
> R might be clever enough to figure out that this simple
> assignment does not trigger a copy. But it probably means
> that any subsequent modification of snp.na or SNP *will*
> trigger a copy, so avoid the assignment if possible.
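> If your build of R has memory profiling enabled (see ?tracemem),
> you can watch for the duplication yourself:
>
>     tracemem(SNP)
>     snp.na <- SNP         # plain assignment: no copy yet
>     snp.na[1, 1] <- NA    # modifying either object now forces the copy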
>
> > snp.roughfix <- na.roughfix(snp.na)
>
> > fSNP <- factor(snp.roughfix[, 1])  # Assigns factor to case control status
> >
> > snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
> >                          na.action=na.roughfix, ntree=500, mtry=10,
> >                          importance=TRUE, keep.forest=FALSE, do.trace=100)
>
> Now you're entirely in the hands of the randomForest package. If
> memory problems occur here, perhaps you'll have gained enough
> experience to point the package maintainer to the problem and
> suggest a possible solution.
>
> > set it should be able to cope with that amount. Perhaps someone has
> > tried this before in R, or is Fortran a better choice? I added my R
>
> If you mean a pure Fortran solution, including coding the
> random forest algorithm, then of course you have complete
> control over memory management. You'd still likely be limited
> to addressing 4 GB of memory.
>
>
> > I wrote a script to remove all the rows with more than 46 missing
> > values. This works perfectly on a smaller dataset, but the problem
> > arises when I try to run it on the larger data set: I get an error
> > "cannot allocate vector of size 1240 Kb". I've searched through
> > previous posts and found out that it might be because I'm running
> > it on a Linux cluster with R version 2.1.0, which operates on a
> > 32-bit processor. But I could not find a solution for this
> > problem. The cluster is a really fast one and should be able to
> > cope with these large amounts of data; the system's configuration
> > is: speed 3.4 GHz, memory 4 GByte. Is there a way to change the
> > settings or processor under R? I want to run the function
> > randomForest on my large data set; it should be able to cope with
> > that amount. Perhaps someone has tried this before in R, or is
> > Fortran a better choice? I added my R script down below.
> >
> > Best regards,
> >
> > Iris Kolder
> >
> > library(randomForest)                  # provides randomForest() and na.roughfix()
> >
> > SNP <- read.table("file.txt", header=FALSE, sep="")  # read in data file
> > SNP[SNP == 9] <- NA                    # recode missing values from 9 to NA
> > SNP$total.NAs <- rowSums(is.na(SNP))   # count NAs per row; add a column with the total
> > SNP <- SNP[SNP$total.NAs < 46, ]       # keep rows with no more than 5% (46) NAs
> > SNP$total.NAs <- NULL                  # remove the added column of NA counts
> > SNP <- t(as.matrix(SNP))               # transpose rows and columns
> > set.seed(1)
> >
> > snp.na <- SNP
> > snp.roughfix <- na.roughfix(snp.na)
> >
> > fSNP <- factor(snp.roughfix[, 1])      # assign factor to case control status
> >
> > snp.narf <- randomForest(snp.roughfix[, -1], fSNP,
> >                          na.action=na.roughfix, ntree=500, mtry=10,
> >                          importance=TRUE, keep.forest=FALSE, do.trace=100)
> >
> > print(snp.narf)
> >
>
> --
> Martin T. Morgan
> Bioconductor / Computational Biology
> http://bioconductor.org
>