[R-sig-Geo] randomForests for mapping vegetation with big data

Tim Howard tghoward at gw.dec.state.ny.us
Tue Aug 1 14:22:13 CEST 2006


Alberto - 

I am doing the same with randomForest: we are working with 22,578 x
17,160 pixel GRIDs and 36 environmental variables. The big difference
is that our training data frames are nowhere near that size. In the
few cases where we do have a lot of points/rows (up to about 20,000),
we end up running fewer trees (down from 600 to 100).

I think it may be more appropriate to divide the work by running
fewer trees on the entire data set, if you can. Try running 50-100
trees on the 300,000-row data frame and, making sure you don't start
from the same random seed each time, keep running these smaller
forests until you reach the total number of trees you want. Then
combine them to make your prediction.
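
For example, something along these lines (a rough sketch; the data
frame 'train' and the response column 'veg' are placeholder names,
not from your data):

library(randomForest)

n.runs <- 10
forests <- vector("list", n.runs)
for (i in 1:n.runs) {
  set.seed(1000 + i)                        # a different seed each run
  forests[[i]] <- randomForest(veg ~ ., data = train,
                               ntree = 50)  # 10 x 50 = 500 trees total
}
rf.all <- do.call(combine, forests)         # merge into one forest

One caveat: as far as I know, combine() drops the out-of-bag error
summaries (err.rate, confusion) from the merged object, so you would
need to assess accuracy separately, e.g. with a held-out sample.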

The difficult, memory-intensive part for us is running the prediction
on our large GRIDs. For that we have to read the data for the 36
environmental variables (in ASCII format) line by line and write out
the prediction for each batch before reading in the next.
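
In outline it looks something like the sketch below (simplified; the
file names, the env1..env36 variable names, and the assumption of
standard ESRI ASCII grids with a 6-line header are all illustrative,
and rf.all is a previously trained forest such as the combined one
above):

library(randomForest)

var.files <- paste0("env", 1:36, ".asc")     # one ASCII grid per variable
cons <- lapply(var.files, file, open = "r")
out  <- file("prediction.asc", open = "w")

writeLines(readLines(cons[[1]], n = 6), out) # copy one header to the output
for (con in cons[-1]) readLines(con, n = 6)  # skip the remaining headers

repeat {
  rows <- lapply(cons, readLines, n = 1)     # one grid row from each file
  if (any(sapply(rows, length) == 0)) break  # end of the grids
  vals <- sapply(rows, function(x) scan(text = x, quiet = TRUE))
  newdata <- as.data.frame(vals)
  names(newdata) <- paste0("env", 1:36)      # must match the training names
  pred <- predict(rf.all, newdata)           # NODATA cells not handled here
  writeLines(paste(pred, collapse = " "), out)
}

invisible(lapply(cons, close))
close(out)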

Tim Howard



Date: Mon, 31 Jul 2006 12:17:34 +0200
From: "Alberto Ruiz Moreno" <aruiz at eeza.csic.es>
Subject: [R-sig-Geo] randomForests for mapping vegetation with big
	data
To: <r-sig-geo at stat.math.ethz.ch>
Message-ID:
	<306770A257EE3840A78215253AF233C815AC61 at CORREO.eeza.csic.es>
Content-Type: text/plain

Hi,
 
I'm trying to run randomForest in R to do vegetation suitability
maps.
 
I'm working with 1000x1000 pixel maps and 30 environmental variables.
 
My software is R v2.3.1 and randomForest 4.5-16, with 1 GB of RAM and
a 3 GB swap partition on a Linux or Windows machine (the problem is
the same in both configurations).
 
R aborts with a memory-limit error when I try to train randomForest
with big data frames: 300,000 rows x 30 columns and 500 trees.
 
I have spent a week tuning R's memory usage (I read dozens of
messages about it), but I think it is not an R misconfiguration;
rather, it is the large memory footprint of the randomForest library
implementation.
 
My conclusion is: I need to divide...
 
OK, to work around this memory error, I ran randomForest several
times with fewer rows in the training data, and used the combine()
function to join the forests.
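
In outline, I did something like this (a simplified sketch; 'train'
and 'veg' are placeholder names for my training data frame and
response):

library(randomForest)

## split the rows into 5 chunks and grow 100 trees on each chunk
chunks <- split(seq_len(nrow(train)), rep(1:5, length.out = nrow(train)))
rf.parts <- lapply(chunks, function(idx)
  randomForest(veg ~ ., data = train[idx, ], ntree = 100))
rf.all <- do.call(combine, rf.parts)      # 5 x 100 = 500 trees in total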
 
The question is...
 
Is this the right way to train randomForest with big data?
Is there another way?
How do you do it?
 

thanks...




