[R] randomForest parameters for image classification

Tue Nov 16 17:15:49 CET 2010

I have modified my code since asking my original question. The
classifier is now generated correctly (with a good, low error rate, as
expected). However, I am running into two issues: 

1) The prediction stage fails: I get only NAs when I try to run data
down the forest;
2) I run out of memory when generating the forest with more than 200
trees, because of the large block of memory already occupied by the
training data.

Here is my code:

library(raster)
library(randomForest)

# Set some user variables
fn = "image.pix"
outraster = "output.pix"
training_band = 2
validation_band = 1

# Get the training data
myraster = stack(fn)
training_class = subset(myraster, training_band)
training_class[training_class == 0] = NA
training_class = Which(training_class != 0, cells=TRUE)
training_data = extract(myraster, training_class)
training_response = as.factor(as.vector(training_data[,training_band]))
training_predictors = training_data[,3:nlayers(myraster)]
remove(training_data)

# Create and save the forest
# Runs out of memory with ntree > ~200
r_tree = randomForest(training_predictors, y=training_response,
                      ntree=200, keep.forest=TRUE)
remove(training_predictors, training_response)

# Classify the whole image
predictor_data = subset(myraster, 3:nlayers(myraster))
layerNames(predictor_data) = layerNames(myraster)[3:nlayers(myraster)]
# All NA!?
predictions = predict(predictor_data, r_tree, filename=outraster,
                      format="PCIDSK", overwrite=TRUE, progress="text",
                      type="response")
remove(predictor_data)
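
One way to narrow down the all-NA output might be to inspect a small block of pixel values before predicting. The sketch below uses base R only; check_predictors is a hypothetical helper (not part of raster or randomForest), and the small matrix is synthetic stand-in data. It reports layer names that do not match the training columns and how many pixels have a complete set of predictors.

```r
# Hypothetical helper: compares a block of pixel values against the
# column names that were used to train the forest.
check_predictors <- function(pixel_block, training_names) {
  block_names <- colnames(pixel_block)
  list(
    missing_in_block = setdiff(training_names, block_names),  # trained on, absent here
    extra_in_block   = setdiff(block_names, training_names),  # present here, not trained on
    same_order       = identical(block_names, training_names),
    na_fraction      = colMeans(is.na(pixel_block)),          # share of NA per layer
    complete_rows    = mean(complete.cases(pixel_block))      # share of pixels with no NA
  )
}

# Synthetic stand-in for a few image rows: layer "b3" is entirely NA,
# so no pixel has a complete set of predictors.
pixel_block <- cbind(b1 = c(0.2, 0.4, 0.1),
                     b2 = c(1.5, 1.1, 0.9),
                     b3 = c(NA, NA, NA))
report <- check_predictors(pixel_block, training_names = c("b1", "b2", "b4"))
print(report)
```

With the real data, pixel_block could presumably come from something like getValues(predictor_data, row=1, nrows=10), and training_names from colnames(training_predictors) saved before that object is removed. As far as I understand, raster's predict() with its default na.rm=TRUE returns NA for any cell that has an NA in at least one layer, so a single all-NA layer would be enough to blank the whole output; that is an assumption to verify, not a confirmed diagnosis.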

See also a thread I started on
http://stackoverflow.com/questions/4186507/rgdal-efficiently-reading-large-multiband-rasters
about improving the efficiency of collecting the training data.
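
On the memory problem (issue 2), one workaround that might be worth trying is to grow several small forests and merge them with combine() from the randomForest package, while limiting the cases drawn per tree with sampsize. The sketch below uses iris as a stand-in for the training data, and the chunk size, sampsize and nodesize values are arbitrary assumptions. Whether this actually avoids the out-of-memory error on the real image is untested.

```r
library(randomForest)

set.seed(42)

# Stand-in for training_predictors / training_response.
x <- iris[, 1:4]
y <- iris$Species

# Grow one small forest. sampsize caps the cases drawn for each tree and
# a larger nodesize gives smaller trees; both values here are arbitrary.
grow_chunk <- function(n_trees) {
  randomForest(x, y = y, ntree = n_trees, sampsize = 100,
               nodesize = 5, keep.forest = TRUE)
}

# Four chunks of 50 trees each, merged into a single 200-tree forest.
chunks <- lapply(rep(50, 4), grow_chunk)
big_forest <- do.call(randomForest::combine, chunks)

print(big_forest$ntree)
```

The merged object can be passed to predict() like any other forest, but combine() drops the out-of-bag error rate and confusion matrix, so accuracy would need to be checked against the validation band instead.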

Thanks, Benjamin

-----Original Message-----
From: Liaw, Andy [mailto:andy_liaw at merck.com] 
Sent: November 11, 2010 7:02 AM
To: Deschamps, Benjamin; r-help at r-project.org
Subject: RE: [R] randomForest parameters for image classification

Please show us the code you used to run randomForest, the output, as
well as what you get with other algorithms (on the same random subset
for comparison).  I have yet to see a dataset where randomForest does
_far_ worse than other methods.

Andy 

> -----Original Message-----
> From: r-help-bounces at r-project.org 
> [mailto:r-help-bounces at r-project.org] On Behalf Of Deschamps, Benjamin
> Sent: Tuesday, November 09, 2010 10:52 AM
> To: r-help at r-project.org
> Subject: [R] randomForest parameters for image classification
> 
> I am implementing an image classification algorithm using the
> randomForest package. The training data consists of 31000+ training
> cases over 26 variables, plus one factor response variable (the
> training class). The main issue I am encountering is very low overall
> classification accuracy (a lot of confusion between classes). However,
> I know from other classifications (including a regular decision tree
> classifier) that the training and validation data are sound and capable
> of producing good accuracies.
>
> Currently, I am using the default parameters (500 trees, mtry not set
> (default), nodesize = 1, replace=TRUE). Does anyone have experience
> using this with large datasets? Currently I need to randomly sample my
> training data because giving it the full 31000+ cases returns an out of
> memory error; the same thing happens with large numbers of trees. From
> what I read in the documentation, perhaps I do not have enough trees to
> fully capture the training data?
>
> Any suggestions or ideas will be greatly appreciated.
>
> Benjamin