[R-sig-eco] Random Forest classification on species counts

Fri May 31 09:33:20 CEST 2013

Hi there,
I think I can help a little with this. From the description of your problem, it seems that you want to run random forest in unsupervised mode; for that you have to omit the y in your formula (i.e. provide only your matrix of species abundances). Otherwise you are using species composition to predict the identity of each site, which is a very difficult classification problem: with one row per site, every site is a class of its own, and that is very likely the reason you are getting such a high error rate. Once you have done this, you should be able to get the proximities between the sites, which you can then use to classify your sites into clusters of similar species composition (as far as I know, the randomForest function does not deal with that part of the problem, so you might want to seek more help there).
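A minimal sketch of what that could look like, in case it is useful. The `abund` matrix below is simulated stand-in data, not the real counts, and the choice of average-linkage clustering with three groups is only an assumption for illustration; any clustering method that accepts a dissimilarity would do.

```r
library(randomForest)

set.seed(42)

# Hypothetical stand-in for the real data: 23 sites x 10 species of Poisson counts.
# In practice this would be the numeric species columns only, without the Site column.
abund <- matrix(rpois(23 * 10, lambda = 2), nrow = 23,
                dimnames = list(paste0("site", 1:23), paste0("sp", 1:10)))

# No y is supplied, so randomForest runs in unsupervised mode.
rf_unsup <- randomForest(x = abund, ntree = 500, proximity = TRUE)

# Proximity is the fraction of trees in which two sites share a terminal node.
# Convert it to a dissimilarity and pass it to an ordinary clustering routine.
site_dist <- as.dist(1 - rf_unsup$proximity)
site_tree <- hclust(site_dist, method = "average")
groups    <- cutree(site_tree, k = 3)  # k = 3 is arbitrary, for illustration only

table(groups)
```

The proximity matrix can also be passed to an ordination (for example `cmdscale(1 - rf_unsup$proximity)`) to look at the sites before committing to a number of clusters.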
Hope this helps!
Geno

-----Original Message-----
From: r-sig-ecology-bounces at r-project.org [mailto:r-sig-ecology-bounces at r-project.org] On Behalf Of Hall, Kyle
Sent: May-30-13 22:37
To: r-sig-ecology at r-project.org
Subject: [R-sig-eco] Random Forest classification on species counts

First time poster, please forgive me for errors.

I have a data set of 23 sites with counts of 145 different species from macroinvertebrate communities for a given year. Each species is represented at least once per site and there are a lot of 0's for some species. I have been applying a variety of vegan functions to the data set to get a better understanding of the structure, and I would like to classify sites based on species using randomForest. My thought is that this will give me a more understandable classification based on species that I can use to cluster my sites, and also show which species are of more importance in the classification.

Question 1. Am I barking up the wrong...tree (pun intended) with randomForest for this purpose?

With PCA two sites are typically separated from the rest but the other 21 sites show no discernible structure; spread like white noise over both axes.
When I perform NMDS I tend to get a shotgun pattern, with no tight groupings on the reduced axes. However, with Ward's clustering (in hclust) I do see some clusters plotting heavily to one side of an axis or another (albeit spread wide along the orthogonal axis).

Question 2. Is it possible that my data set just doesn't have enough structure to neatly classify Sites by species count or am I simply a newbie that is applying randomForest incorrectly?

Example of data structure:
Site      ABLA.MAL  ABLA.PAR  ACEN.SPP  ACRO.MEL
MC14A            1         1         2         0
MC17             4         2         0         0
MC22A            8         0         0         0
MC25            13         3         0         0
MC27             0         0         0         0
MC29A1           1         0         0         0
MC30A            1         0         0         0
MC31A            4         1         0         0
MC31B            4         0         0         0
MC33             8         0         0         0
MC38             7         0         0         0
MC40A           12         3         0         0
MC42             0         0         0         0
MC45             9         0         0         0
MC47A            0         0         0         0
MC49A            5         0         0         0
MC50             2         0         0         1
MC51            13         0         0         0
MC66             4         0         0         0
MY11B           13         1         0         0
MY13             0         0         0         0
MY7B             1         0         0         0
MY8              3         2         0         1
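For anyone wanting to reproduce the layout, here is a small sketch of how the excerpt could be held in R. Only the first five rows are typed in, and the object name `bugs` is made up for illustration. Keeping the site codes as row names rather than as a column means every remaining column is a species count.

```r
# First five rows of the excerpt, entered by hand for illustration.
bugs <- data.frame(
  ABLA.MAL = c(1, 4, 8, 13, 0),
  ABLA.PAR = c(1, 2, 0, 3, 0),
  ACEN.SPP = c(2, 0, 0, 0, 0),
  ACRO.MEL = c(0, 0, 0, 0, 0),
  row.names = c("MC14A", "MC17", "MC22A", "MC25", "MC27")
)

# With sites as row names, every column should be numeric.
all(sapply(bugs, is.numeric))

# Share of zero cells, a quick check on how sparse the table is.
mean(bugs == 0)
```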

This is my call to randomForest:

FY09BUGS.rF <- randomForest(Site~ .,data=FY09Bugs, ntree=500, mtry=sqrt(ncol(FY09Bugs)), replace=TRUE,importance=TRUE, proximity=TRUE, norm.votes=TRUE, keep.forest=TRUE, do.trace=100)

I am following the iris data example with my formula, but printing FY09BUGS.rF returns a 100% OOB error rate and the summary returns:

summary(FY09BUGS.rF)
                Length Class  Mode
call               11  -none- call
type                1  -none- character
predicted          23  factor numeric
err.rate        12000  -none- numeric
confusion         552  -none- numeric
votes             529  matrix numeric
oob.times          23  -none- numeric
classes            23  -none- character
importance       3625  -none- numeric
importanceSD     3480  -none- numeric
localImportance     0  -none- NULL
proximity         529  -none- numeric
ntree               1  -none- numeric
mtry                1  -none- numeric
forest             14  -none- list
y                  23  factor numeric
test                0  -none- NULL
inbag               0  -none- NULL
terms               3  terms  call

One concern I have is that the iris example does not appear to give a training data set and so I don't believe I have done that either. I feel like there is potential here but I can't seem to find the solution searching online so I put the questions to you! Thanks in advance for any assistance or constructive criticism.
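On the training-set concern: random forest scores each tree on the rows left out of that tree's bootstrap sample (the out-of-bag rows), so a separate held-out set is not needed for a first error estimate. A minimal base-R sketch, with made-up site labels, of what that implies when every row carries its own unique class label:

```r
set.seed(1)

n_sites <- 23
labels  <- paste0("site", seq_len(n_sites))  # hypothetical labels, one row per class

# One bootstrap sample, as drawn for each tree when replace = TRUE.
in_bag <- sample(n_sites, replace = TRUE)
oob    <- setdiff(seq_len(n_sites), in_bag)

# Fraction of rows left out of this one tree.
length(oob) / n_sites

# An out-of-bag row's label never occurs among the in-bag labels, so the tree
# has never seen that class and cannot predict it.
any(labels[oob] %in% labels[in_bag])  # FALSE
```

Since this holds for every tree, a classification forest with one row per class cannot get any out-of-bag prediction right, whatever the species data look like.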

Kyle

Kyle Hall
City of Charlotte Storm Water Services
Water Quality Modeler
600 East Fourth Street
Charlotte, NC 28202
704.336.4110
Fax: 704.353.0473
