[R-sig-eco] Random Forest classification on species counts
Gavin Simpson
gavin.simpson at ucl.ac.uk
Wed Jun 5 19:06:03 CEST 2013
On Thu, 2013-05-30 at 20:36 +0000, Hall, Kyle wrote:
> First time poster, please forgive me for errors.
>
> I have a data set of 23 sites with 145 different species counts for
> macroinvertebrate communities for a given year. Each species is
> represented at least once per site and there are a lot of 0's for some
> species. I have been applying a variety of vegan functions to the data
> set to get a better understanding of the structure and I would like to
> classify sites based on species using randomForest. My thought is that
> this will give me a more understandable classification based on
> species that I can use to cluster my sites and also see which species
> are of more importance in classification.
>
> Question 1. Am I barking up the wrong...tree (pun intended) with
> randomForest for this purpose?
Yes - I doubt unsupervised RF would give you anything more than you
could get from a suitably-chosen dissimilarity matrix or even ordination
to check if you actually have clusters.
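As a rough sketch of that dissimilarity-plus-ordination check: the code below uses made-up Poisson counts in place of the real 23 x 145 table, and the names counts, bray_curtis and pcoa are hypothetical. It computes Bray-Curtis dissimilarities by hand and runs a principal coordinates analysis with base R's cmdscale(). In practice vegan's vegdist() would normally supply the dissimilarities.

```r
# Synthetic counts standing in for the real site-by-species table (assumption)
set.seed(1)
counts <- matrix(rpois(23 * 10, lambda = 2), nrow = 23,
                 dimnames = list(paste0("S", 1:23), paste0("sp", 1:10)))

# Bray-Curtis dissimilarity between two vectors of counts
bray_curtis <- function(x, y) sum(abs(x - y)) / sum(x + y)

# Fill the full site-by-site dissimilarity matrix
n <- nrow(counts)
d <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray_curtis(counts[i, ], counts[j, ])
  }
}
bc <- as.dist(d)

# Principal coordinates analysis (classical MDS) on the dissimilarities
pcoa <- cmdscale(bc, k = 2, eig = TRUE)

# Site scores on the first two axes; plot these and look for groups
head(pcoa$points)
```

If the sites form real groups they should show up as separated clumps in a plot of pcoa$points; a diffuse cloud suggests there is little to classify.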
For supervised RF, even if you had a classification, you have far too few
sites/samples to warrant a machine learning tool.
> With PCA two sites are typically separated from the rest but the other
> 21 sites show no discernible structure; spread like white noise over
> both axes.
> When I perform NMDS I tend to get a shotgun look and there are not
> tight groupings on these reduced axes. However with Ward's clustering
> (in hclust) I do see some clusters plotting heavily to one side of an
> axis or another (albeit spread wide on the orthogonal axis).
Ward's clustering, as with any clustering, will find clusters - it
[Ward's method] tends to find compact, spherical ones IIRC and hence
often looks convincing. Your job is to demonstrate that the clustering
into k clusters explains more of the variance in the model than no
clustering. Simply eye-balling the dendrogram is not a solution to this.
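One possible way to make that comparison, offered only as a sketch: measure the proportion of total variance explained by a k-group Ward solution, then compare it with the same statistic from data in which each species column has been shuffled independently (destroying any site grouping) and re-clustered. The data here are synthetic, and var_explained, k = 3 and the 199 permutations are all illustrative assumptions.

```r
# Synthetic counts standing in for the real data (assumption)
set.seed(123)
x <- sqrt(matrix(rpois(23 * 10, lambda = 3), nrow = 23))  # sqrt-transformed counts

# Proportion of total sum of squares explained by cutting a Ward tree into k groups
var_explained <- function(x, k) {
  groups <- cutree(hclust(dist(x), method = "ward.D2"), k = k)
  total_ss <- sum(scale(x, scale = FALSE)^2)
  within_ss <- sum(sapply(split(as.data.frame(x), groups), function(g) {
    sum(scale(as.matrix(g), scale = FALSE)^2)
  }))
  1 - within_ss / total_ss
}

k <- 3
observed <- var_explained(x, k)

# Null distribution: shuffle each species column independently, then re-cluster
null <- replicate(199, var_explained(apply(x, 2, sample), k))

# Permutation p-value: how often does structureless data cluster as well?
p_value <- (sum(null >= observed) + 1) / (length(null) + 1)
p_value
```

Re-clustering each shuffled data set matters: a clustering always explains some variance, so the observed value has to be judged against what the same procedure extracts from data with no group structure.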
> Question 2. Is it possible that my data set just doesn't have enough
> structure to neatly classify Sites by species count or am I simply a
> newbie that is applying randomForest incorrectly?
With so few data I wouldn't bother with machine learning tools - they are
designed to work with hundreds or thousands of samples, or more.
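The 100% OOB error reported for the call quoted below appears to follow directly from using Site as the response: each of the 23 sites is then its own class with a single observation. Whenever a site is out-of-bag for a tree, its label is absent from that tree's bootstrap sample, so the tree cannot predict it. A base-R sketch of that bookkeeping, with simulated bootstrap draws and hypothetical object names (no randomForest needed):

```r
set.seed(42)
sites <- factor(paste0("S", 1:23))  # one class per site, as in Site ~ .
n_tree <- 500

can_predict <- logical(0)
for (b in seq_len(n_tree)) {
  inbag <- sample(seq_along(sites), replace = TRUE)  # bootstrap sample for one tree
  oob <- setdiff(seq_along(sites), inbag)            # sites left out of that sample
  # An OOB site can only be classified correctly if its label was seen in-bag
  can_predict <- c(can_predict, sites[oob] %in% sites[inbag])
}

# Fraction of OOB cases whose class was available to the tree
mean(can_predict)
```

Because every label is unique, an out-of-bag site's class is never in-bag, so the fraction is zero and every OOB prediction must be wrong regardless of ntree or mtry.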
HTH
G
> Example of data structure:
> Site   ABLA.MAL ABLA.PAR ACEN.SPP ACRO.MEL
> MC14A         1        1        2        0
> MC17          4        2        0        0
> MC22A         8        0        0        0
> MC25         13        3        0        0
> MC27          0        0        0        0
> MC29A1        1        0        0        0
> MC30A         1        0        0        0
> MC31A         4        1        0        0
> MC31B         4        0        0        0
> MC33          8        0        0        0
> MC38          7        0        0        0
> MC40A        12        3        0        0
> MC42          0        0        0        0
> MC45          9        0        0        0
> MC47A         0        0        0        0
> MC49A         5        0        0        0
> MC50          2        0        0        1
> MC51         13        0        0        0
> MC66          4        0        0        0
> MY11B        13        1        0        0
> MY13          0        0        0        0
> MY7B          1        0        0        0
> MY8           3        2        0        1
>
>
> This is my call to randomForest:
>
> FY09BUGS.rF <- randomForest(Site~ .,data=FY09Bugs, ntree=500, mtry=sqrt(ncol(FY09Bugs)), replace=TRUE,importance=TRUE, proximity=TRUE, norm.votes=TRUE, keep.forest=TRUE, do.trace=100)
>
> I am following the iris data example with my formula but the print data
> on FY09BUGS.rF returns 100% OOB error rate and the summary returns:
>
> summary(FY09BUGS.rF)
>                 Length Class  Mode
> call                11 -none- call
> type                 1 -none- character
> predicted           23 factor numeric
> err.rate         12000 -none- numeric
> confusion          552 -none- numeric
> votes              529 matrix numeric
> oob.times           23 -none- numeric
> classes             23 -none- character
> importance        3625 -none- numeric
> importanceSD      3480 -none- numeric
> localImportance      0 -none- NULL
> proximity          529 -none- numeric
> ntree                1 -none- numeric
> mtry                 1 -none- numeric
> forest              14 -none- list
> y                   23 factor numeric
> test                 0 -none- NULL
> inbag                0 -none- NULL
> terms                3 terms  call
> One concern I have is that the iris example does not appear to give a training data set and so I don't believe I have done that either. I feel like there is potential here but I can't seem to find the solution searching online so I put the questions to you! Thanks in advance for any assistance or constructive criticism.
>
> Kyle
>
>
> Kyle Hall .
> City of Charlotte Storm Water Services
> Water Quality Modeler
> 600 East Fourth Street
> Charlotte, NC 28202
> 704.336.4110
> Fax: 704.353.0473
>
--
Gavin Simpson, PhD [t] +1 306 337 8863
Adjunct Professor, Department of Biology [f] +1 306 337 2410
Institute of Environmental Change & Society [e] gavin.simpson at uregina.ca
523 Research and Innovation Centre [tw] @ucfagls
University of Regina
Regina, SK S4S 0A2, Canada