[R] Memory limits for MDSplot in randomForest package
Sam Albers
tonightsthenight at gmail.com
Fri Mar 23 20:30:36 CET 2012
Hello,
I am struggling to produce an MDS plot using the randomForest package
with a moderately large data set. My data set has one categorical
response variable, 7 predictor variables and just under 19000
observations. The proximity matrix has one row and one column per
observation, so it is approximately 19000 by 19000, which is quite
large. To train a random forest on a data set this large I have to use
my institution's high-performance computer.
Using this setup I was able to train a randomForest with the proximity
argument set to TRUE. At this point I wanted to construct an MDSplot
using the following:
MDSplot(nech.rf, nech.d$pd.fl, palette=c(1,2,3), pch=as.numeric(nech.d$pd.fl))
where "nech.rf" is the randomForest object and "nech.d$pd.fl" is the
classification factor. On the machine described below, this call has
been running for approximately 2 days, and I am not sure whether it
will ever finish.
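As a rough check on the memory side, the size of the proximity matrix can be estimated with a few lines of base R. This is only a back-of-the-envelope sketch: it assumes the matrix is stored as a dense matrix of 8-byte doubles, takes n from the "just under 19000 observations" figure above, and ignores the extra working copies that cmdscale() is likely to need on top of the matrix itself.

```r
# Estimated size of a dense n x n proximity matrix of doubles.
# n is an approximation taken from the observation count quoted above.
n <- 19000
bytes_per_double <- 8
prox_bytes <- n^2 * bytes_per_double
prox_gib <- prox_bytes / 1024^3
cat(sprintf("proximity matrix: about %.2f GiB\n", prox_gib))

# Cross-check the per-element cost on a small matrix that is cheap to allocate.
m <- matrix(0, 1000, 1000)
print(object.size(m), units = "Mb")
```

On these assumptions the matrix itself fits comfortably in 64 GB, which suggests that running time, rather than memory alone, may be the limiting factor.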
Can anyone recommend a way to tweak the MDSplot function to run a
little faster? I tried changing the cmdscale arguments (i.e.
eigenvalues) within the MDSplot function, but in tests on a much
smaller data set that did not seem to have any effect on the overall
running time. Failing that, could someone comment on whether I am
dreaming that this will ever finish?
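One common workaround for this kind of problem is to compute proximities and the MDS on a stratified subsample rather than on every observation. The sketch below is an illustration under stated assumptions, not a tested recipe for the real data: the data frame dat, the subsample size n_per_class, and the simulated predictors are all hypothetical stand-ins. It calls cmdscale() directly on 1 - proximity, which, as far as the package source shows, is the same computation MDSplot() performs internally.

```r
library(randomForest)

set.seed(17)

# Hypothetical stand-in for the real data: 2000 rows, 7 predictors, 2 classes.
n <- 2000
dat <- data.frame(matrix(runif(n * 7), ncol = 7))   # columns X1 ... X7
dat$cls <- factor(ifelse(dat$X1 + rnorm(n, sd = 0.2) > 0.5,
                         "CLASS-1", "CLASS-2"))

# Stratified subsample: at most n_per_class rows from each class.
n_per_class <- 150
idx <- unlist(lapply(split(seq_len(n), dat$cls), function(i) {
  i[sample.int(length(i), min(n_per_class, length(i)))]
}), use.names = FALSE)
sub <- dat[idx, ]

# Proximities are computed for the subsample only, so the matrix stays small.
sub.rf <- randomForest(x = sub[, 1:7], y = sub$cls, proximity = TRUE)

# Classical MDS on the dissimilarities, called directly so the result
# can be stored and re-plotted without repeating the decomposition.
sub.mds <- cmdscale(1 - sub.rf$proximity, k = 2, eig = TRUE)

plot(sub.mds$points, col = as.numeric(sub$cls), pch = as.numeric(sub$cls),
     xlab = "Dim 1", ylab = "Dim 2")
```

The trade-off is that the plot then describes a forest grown on the subsample, not on all observations. Repeating the procedure with a few different seeds gives some indication of how stable the picture is.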
This is probably the best computer that I will have access to, so I
was hoping to get this to run on it. If anyone reading the list has
experience with randomForest and large data sets, I would appreciate
any comments on my situation. Below the architecture information I
have constructed a dummy example to illustrate what I am doing, but
given the nature of the problem it does not completely reflect my
situation.
Any help would be much appreciated!
Thanks!
Sam
----
Computer specs and sessionInfo()
OS: Suse Linux
Memory: 64 GB
Processors: Intel Itanium 2, 64 x 1500 MHz
And:
> sessionInfo()
R version 2.6.2 (2008-02-08)
ia64-unknown-linux-gnu
locale:
LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C
attached base packages:
[1] stats graphics grDevices utils datasets methods base
other attached packages:
[1] randomForest_4.6-6
loaded via a namespace (and not attached):
[1] rcompgen_0.1-17
###
# Dummy Example
###
require(randomForest)
set.seed(17)
## Number of points
x <- 10
df <- rbind(
  data.frame(var1 = runif(x, 10, 50),
             var2 = runif(x, 2, 7),
             var3 = runif(x, 0.2, 0.35),
             var4 = runif(x, 1, 2),
             var5 = runif(x, 5, 8),
             var6 = runif(x, 1, 2),
             var7 = runif(x, 5, 8),
             cls  = factor("CLASS-2")),
  data.frame(var1 = runif(x, 10, 50),
             var2 = runif(x, -3, 3),
             var3 = runif(x, 0.1, 0.25),
             var4 = runif(x, 1, 2),
             var5 = runif(x, 5, 8),
             var6 = runif(x, 1, 2),
             var7 = runif(x, 5, 8),
             cls  = factor("CLASS-1"))
)
df.rf <- randomForest(y = df[, 8], x = df[, 1:7], proximity = TRUE, importance = TRUE)
MDSplot(df.rf, df$cls, k = 2, palette = c(1, 2, 3, 4), pch = as.numeric(df$cls))
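To get a feel for whether the full-size call could ever finish, one option is to time cmdscale() at a few increasing sizes and look at how the elapsed time grows. The sketch below uses only base R and makes a simplifying assumption: random Euclidean distances stand in for 1 - proximity, so only the growth pattern is informative, not the absolute numbers. The sizes are arbitrary illustrative choices.

```r
set.seed(17)

# Illustrative problem sizes; increase them as far as patience allows.
sizes <- c(200, 400, 800)

elapsed <- sapply(sizes, function(n) {
  # Random 7-column data as a stand-in for the real dissimilarities.
  d <- dist(matrix(runif(n * 7), ncol = 7))
  system.time(cmdscale(d, k = 2, eig = TRUE))[["elapsed"]]
})

print(data.frame(n = sizes, seconds = elapsed))
```

Because cmdscale() relies on a dense eigendecomposition, the time can be expected to grow roughly with the cube of n, so each doubling of n should cost on the order of eight times as long. Extrapolating from timings like these to about 19000 observations gives a crude estimate for the machine at hand.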