[R] Question on RandomForest in unsupervised mode
Irilenia Nobeli
irilenia.nobeli at kcl.ac.uk
Wed Jun 6 18:27:23 CEST 2007
Hi,
I attempted to run the randomForest() function on a dataset without
predefined classes. According to the manual, running randomForest
without a response variable/class labels should result in the
function assuming you are running in unsupervised mode. In this case,
I understand that all of my data is assigned to one class, while a
second, synthetic data set is generated and assigned to a second
class. The online manual suggests that an OOB misclassification error
of ~40% or more in this two-class problem would indicate that the
x-variables look like independent variables to random forests (and I
assume that in this case the proximities obtained from randomForest()
would not be informative for clustering).
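
The two-class setup described above can be sketched by hand, which
also gives an OOB error to compare against the ~40% figure. This is
only an illustration under an assumption: the synthetic class is built
here by resampling each column independently, which may differ in
detail from what randomForest() does internally. The names synth,
combined and oob_err are made up for the example.

```r
library(randomForest)

set.seed(1)
x <- iris[, -5]

# Assumption: build the synthetic class by sampling each column
# independently (with replacement). This keeps the marginal
# distributions but breaks the dependence between variables.
synth <- as.data.frame(lapply(x, function(col) sample(col, replace = TRUE)))

# Stack original and synthetic rows and label them as two classes.
combined <- rbind(x, synth)
y <- factor(rep(c("original", "synthetic"), each = nrow(x)))

# Ordinary supervised forest on the two-class problem.
rf <- randomForest(combined, y, ntree = 500)

# Confusion matrix and OOB error after the last tree.
print(rf$confusion)
oob_err <- rf$err.rate[rf$ntree, "OOB"]
print(oob_err)
```

An OOB error well below 40% here would suggest that the forest can
tell the original data from the synthetic data, i.e. that the
variables carry some dependence structure.
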
When I run randomForest() in the unsupervised mode my first problem
is that I get NULL entries for the confusion matrix and the err.rate,
but I suppose this is normal behaviour. My only information (apart
from the proximities, of course) seems to be the votes, from which I
can deduce whether the variables are meaningful or not. The second
problem is that, in my case, all my observations seem to get about
20-40% of their votes from class 1 and the rest from class 2 (i.e.
class 2 always "wins"). Assuming that class 1 was assigned to my
original data, I'd say this is rather surprising.
Initially I thought this was simply a problem of my data not being
meaningful, but I repeated the run with the "iris" example data and
I get more or less the same result.
I simply ran:
library(randomForest)
iris.urf <- randomForest(iris[,-5])
iris.urf$votes
and again most of the votes came from class 2, although here the
vote percentages are slightly more balanced than with my data
(approximately 40 to 60% most of the time).
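
A quick way to summarise the votes instead of reading the matrix by
eye is sketched below. It assumes, as above, that column 1 of the
votes matrix corresponds to the original data and column 2 to the
synthetic class; that mapping is an assumption, not something
confirmed here. The names votes and frac_class2_wins are made up for
the example.

```r
library(randomForest)

set.seed(1)
iris.urf <- randomForest(iris[, -5])

# Vote fractions per observation (rows sum to 1 with the default
# norm.votes = TRUE).
votes <- iris.urf$votes

# Distribution of the fraction of votes going to class 1.
print(summary(votes[, 1]))

# Share of observations for which class 2 gets more votes than class 1.
frac_class2_wins <- mean(votes[, 2] > votes[, 1])
print(frac_class2_wins)
```
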
Has anyone got experience with unsupervised randomForest() in R and
can explain to me why I'm observing this behaviour? In general, any
hints about pitfalls regarding random forests in unsupervised mode
would be very much appreciated.
Many thanks in advance,
Irilenia
-----------------------------
Irilenia (Irene) Nobeli
Randall Division of Cell and Molecular Biophysics
New Hunt's House (room 3.14)
King's College London, Guy's Campus
London, SE1 1UL
U.K.
irilenia.nobeli at kcl.ac.uk
+44(0)207-8486329