[R-sig-eco] Random Forest classification on species counts

Gavin Simpson gavin.simpson at ucl.ac.uk
Wed Jun 5 19:06:03 CEST 2013


On Thu, 2013-05-30 at 20:36 +0000, Hall, Kyle wrote:
> First time poster, please forgive me for errors.
> 
> I have a data set of 23 sites with 145 different species counts for
> macroinvertebrate communities for a given year. Each species is
> represented at least once per site and there are a lot of 0's for some
> species. I have been applying a variety of vegan functions to the data
> set to get a better understanding of the structure and I would like to
> classify sites based on species using randomForest. My thought is that
> this will give me a more understandable classification based on
> species that I can use to cluster my sites and also see which species
> are of more importance in classification.
> 
> Question 1. Am I barking up the wrong...tree (pun intended) with
> randomForest for this purpose?

Yes - I doubt unsupervised RF would give you anything more than you
could get from a suitably-chosen dissimilarity matrix or even ordination
to check if you actually have clusters.
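
As a rough illustration of that route, here is a minimal base-R sketch. It
assumes simulated Poisson counts in a hypothetical matrix `counts` standing in
for the real 23-site table, and Bray-Curtis as the "suitably-chosen"
dissimilarity (that choice is an assumption, as is the helper `bray()`). In
practice vegan's vegdist() would compute the same coefficient.

```r
# Simulated stand-in for the real data: 23 sites x 10 species (assumption).
set.seed(1)
counts <- matrix(rpois(23 * 10, lambda = 2), nrow = 23,
                 dimnames = list(paste0("S", 1:23), paste0("sp", 1:10)))

# Bray-Curtis dissimilarity between two count vectors (hypothetical helper).
bray <- function(a, b) sum(abs(a - b)) / sum(a + b)

# Fill the full site-by-site matrix, then convert it to a "dist" object.
n <- nrow(counts)
d <- matrix(0, n, n, dimnames = list(rownames(counts), rownames(counts)))
for (i in seq_len(n)) {
  for (j in seq_len(n)) {
    d[i, j] <- bray(counts[i, ], counts[j, ])
  }
}
d <- as.dist(d)

# Principal coordinates analysis (classical MDS) on the dissimilarities.
pcoa <- cmdscale(d, k = 2, eig = TRUE)

# To inspect the ordination for visible groups of sites:
# plot(pcoa$points, type = "n"); text(pcoa$points, labels = rownames(counts))
```

If the sites form real groups they should show up as separated clouds of
points in the ordination; a single diffuse cloud suggests there is little to
cluster.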

For supervised RF, even if you had a classification, you have far too few
sites/samples to warrant a machine learning tool.

> With PCA two sites are typically separated from the rest but the other
> 21 sites show no discernible structure; spread like white noise over
> both axes.
> When I perform NMDS I tend to get a shotgun look and there are not
> tight groupings on these reduced axes. However with Ward's clustering
> (in hclust) I do see some clusters plotting heavily to one side of an
> axis or another (albeit spread wide on the orthogonal axis).

Ward's clustering, as with any clustering, will find clusters - it
[Ward's method] tends to find compact, spherical ones IIRC and hence
often looks convincing. Your job is to demonstrate that the clustering
into k clusters explains more of the variance in the model than no
clustering. Simply eye-balling the dendrogram is not a solution to this.
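
One possible way to make that comparison, sketched in base R under stated
assumptions: simulated counts in a hypothetical matrix `counts`, a square-root
transformation, Euclidean distances, k = 3, and a null model that shuffles
each species column independently and then re-clusters. None of these choices
comes from the data at hand, and the helper `explained()` is made up for the
illustration; other null models would be equally defensible.

```r
# Simulated stand-in for the real data: 23 sites x 10 species (assumption).
set.seed(42)
counts <- matrix(rpois(23 * 10, lambda = 2), nrow = 23)

# Proportion of the total sum of squares explained by a k-group Ward partition.
explained <- function(x, k) {
  x <- sqrt(x)  # square-root transform of the counts (an assumed choice)
  grp <- cutree(hclust(dist(x), method = "ward.D2"), k = k)
  total <- sum(scale(x, scale = FALSE)^2)
  within <- sum(sapply(split(as.data.frame(x), grp), function(g) {
    sum(scale(as.matrix(g), scale = FALSE)^2)
  }))
  1 - within / total
}

obs <- explained(counts, k = 3)

# Null model: permute each species column independently, which removes any
# site grouping but keeps each species' count distribution, then re-cluster.
null <- replicate(199, explained(apply(counts, 2, sample), k = 3))

# Permutation p-value: how often does structureless data cluster as well?
p <- (sum(null >= obs) + 1) / (length(null) + 1)
```

Re-clustering each permuted data set matters: Ward's method always finds k
groups, so the observed statistic has to be compared with what the same
procedure extracts from data that have no structure.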

> Question 2. Is it possible that my data set just doesn't have enough
> structure to neatly classify Sites by species count or am I simply a
> newbie that is applying randomForest incorrectly?

With so few data I wouldn't bother with machine learning tools - they are
designed to work with hundreds and thousands or more samples.
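
On the 100% OOB error reported for the call quoted below: with Site as the
response, each of the 23 classes contains exactly one row, so whenever a site
is out-of-bag its class is missing from the data the tree was grown on and
cannot be predicted. The sketch below illustrates the same logic with
leave-one-out 1-nearest-neighbour classification in base R. It is a stand-in
for the out-of-bag procedure, not randomForest itself, and the data frame
`bugs` is simulated.

```r
# Simulated stand-in: one row per site, so one observation per class.
set.seed(7)
n_sites <- 23
bugs <- data.frame(Site = factor(paste0("S", seq_len(n_sites))),
                   matrix(rpois(n_sites * 10, lambda = 2), nrow = n_sites))

# Pairwise distances between sites on the species columns.
x <- as.matrix(bugs[, -1])
d <- as.matrix(dist(x))
diag(d) <- Inf  # a held-out site cannot be its own neighbour

# Predict each held-out site's label from its nearest remaining site.
pred <- bugs$Site[apply(d, 1, which.min)]

# The true label is never among the remaining sites, so every prediction
# is wrong and the error rate is exactly 1.
error_rate <- mean(pred != bugs$Site)
```

A supervised classifier needs a grouping variable with several sites per
class; the site identifier itself cannot play that role.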

HTH

G

> Example of data structure:
> Site      ABLA.MAL  ABLA.PAR  ACEN.SPP  ACRO.MEL
> MC14A            1         1         2         0
> MC17             4         2         0         0
> MC22A            8         0         0         0
> MC25            13         3         0         0
> MC27             0         0         0         0
> MC29A1           1         0         0         0
> MC30A            1         0         0         0
> MC31A            4         1         0         0
> MC31B            4         0         0         0
> MC33             8         0         0         0
> MC38             7         0         0         0
> MC40A           12         3         0         0
> MC42             0         0         0         0
> MC45             9         0         0         0
> MC47A            0         0         0         0
> MC49A            5         0         0         0
> MC50             2         0         0         1
> MC51            13         0         0         0
> MC66             4         0         0         0
> MY11B           13         1         0         0
> MY13             0         0         0         0
> MY7B             1         0         0         0
> MY8              3         2         0         1
> 
> 
> This is my call to randomForest:
> 
> FY09BUGS.rF <- randomForest(Site~ .,data=FY09Bugs, ntree=500, mtry=sqrt(ncol(FY09Bugs)), replace=TRUE,importance=TRUE, proximity=TRUE, norm.votes=TRUE, keep.forest=TRUE, do.trace=100)
> 
> I am following the iris data example with my formula, but printing FY09BUGS.rF returns a 100% OOB error rate and the summary returns:
> 
> summary(FY09BUGS.rF)
>                 Length Class  Mode
> call               11  -none- call
> type                1  -none- character
> predicted          23  factor numeric
> err.rate        12000  -none- numeric
> confusion         552  -none- numeric
> votes             529  matrix numeric
> oob.times          23  -none- numeric
> classes            23  -none- character
> importance       3625  -none- numeric
> importanceSD     3480  -none- numeric
> localImportance     0  -none- NULL
> proximity         529  -none- numeric
> ntree               1  -none- numeric
> mtry                1  -none- numeric
> forest             14  -none- list
> y                  23  factor numeric
> test                0  -none- NULL
> inbag               0  -none- NULL
> terms               3  terms  call
> 
> One concern I have is that the iris example does not appear to give a training data set and so I don't believe I have done that either. I feel like there is potential here but I can't seem to find the solution searching online so I put the questions to you! Thanks in advance for any assistance or constructive criticism.
> 
> Kyle
> 
> 
> Kyle Hall
> City of Charlotte Storm Water Services
> Water Quality Modeler
> 600 East Fourth Street
> Charlotte, NC 28202
> 704.336.4110
> Fax: 704.353.0473
> 

-- 
Gavin Simpson, PhD                          [t] +1 306 337 8863
Adjunct Professor, Department of Biology    [f] +1 306 337 2410
Institute of Environmental Change & Society [e] gavin.simpson at uregina.ca
523 Research and Innovation Centre          [tw] @ucfagls
University of Regina
Regina, SK S4S 0A2, Canada


