[R] randomForests and Y-scrambling on a small synthetic dataset
clayton.springer@pharma.novartis.com
Fri May 7 17:26:04 CEST 2004
Dear r-help,
The following dataset (generated with perl) has 20 observations of 100
independent variables (integers drawn uniformly
from [1:9]), split evenly between two classes (10 observations each).
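For readers who do not have the perl script, a comparable dataset can be built directly in R. This is only a minimal sketch: the seed, the object name `synthetic`, and the choice to put the class label in column `V101` (to match the session below) are assumptions, not the original generator.

```r
# Sketch of a purely random two-class dataset (assumed layout, not the
# original perl output): 20 rows, 100 integer predictors, label in V101.
set.seed(1)                       # hypothetical seed, for reproducibility only
n_obs <- 20
n_var <- 100

# integers drawn uniformly from 1..9, independent of the class label
x <- matrix(sample(1:9, n_obs * n_var, replace = TRUE), nrow = n_obs)
synthetic <- as.data.frame(x)     # columns are named V1 ... V100

# two classes of 10 observations each
synthetic$V101 <- rep(1:2, each = n_obs / 2)

# To mimic the file read below, one could write it out without headers:
# write.table(synthetic, "rf_input.dat", row.names = FALSE, col.names = FALSE)
```

Because the predictors are drawn without reference to `V101`, any apparent predictive power on such data is noise by construction.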
First I show some work, and then ask two questions at the end.
> data <- read.table ("rf_input.dat")
> library (randomForest)
# if we do randomForest one time it looks like this:
> rf <- randomForest (factor(V101) ~. ,data=data)
> rf$confusion
  1 2 class.error
1 5 5         0.5
2 4 6         0.4
# now we do it 100 times
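tnum <- numeric()
for (i in 1:100) {
  MT <- data$V101
  # fit on the 100 predictors only; column 101 holds the class
  MT.rf <- randomForest (factor(MT) ~ . ,data =data[-c(101)])
  # predict() without newdata returns the out-of-bag predictions;
  # count how many of the 20 agree with the true class
  number <- sum (predict(MT.rf) == MT, na.rm = TRUE)
  tnum <- c(tnum,number)
}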
# and get this distribution of results (about 13 correct out of 20)
> quantile (tnum,probs = seq (0,1,0.1),na.rm = T)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   9   11   12   12   13   13   13   14   14   15   17
# now let's permute (re-randomize) the classes and repeat 1000 times:
> library (gregmisc)
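tnum <- numeric()
for (i in 1:1000) {
  # Y-scrambling: shuffle the class labels, leave the predictors alone
  MT <- permute (data$V101)
  MT.rf <- randomForest (factor(MT) ~ . ,data =data[-c(101)])
  # out-of-bag predictions that match the scrambled labels
  number <- sum (predict(MT.rf) == MT, na.rm = TRUE)
  tnum <- c(tnum,number)
}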
# I get these results: the average is about 8 correct (out of 20), with 13
# correct falling at about the 95th percentile
> quantile (tnum,probs = seq (0,1,0.1),na.rm = T)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100%
   1    4    5    6    7    8    8    9   10   12   18
> quantile (tnum,probs = seq (0.9,1,0.01),na.rm = T)
 90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100%
  12   12   12   12   12   13   13   14   14   15   18
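Reading a percentile off the quantile table can also be expressed as an empirical permutation p-value. The following is a minimal sketch, not part of the original session: the helper name `perm_p_value` is hypothetical, the add-one correction is one common convention, and base R's `sample()` is used here as a stand-in for `permute()` from gregmisc.

```r
# Y-scrambling with base R: sample(x) returns a random permutation of x,
# so the class counts are unchanged and only the row assignment is shuffled.
labels <- rep(1:2, each = 10)
scrambled <- sample(labels)

# Hypothetical helper: fraction of permuted runs that score at least as
# well as the observed run, with an add-one correction so the estimate
# is never exactly zero.
perm_p_value <- function(observed, null_counts) {
  null_counts <- null_counts[!is.na(null_counts)]
  (1 + sum(null_counts >= observed)) / (1 + length(null_counts))
}

# Toy call with made-up numbers, only to show the arithmetic:
# two of the four null values are >= 3, so (1 + 2) / (1 + 4) = 0.6
toy_p <- perm_p_value(3, c(1, 2, 3, 4))
```

With the session above, the call would be something like `perm_p_value(13, tnum)`, where `tnum` holds the 1000 counts from the permuted runs.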
--------
My two questions:
Question 1: Naively I might have expected to get 10/20 for the Y-scrambled
examples, but instead I got 8/20. Why is that?
(Presumably it has something to do with randomForest training each tree on
only about 2/3 of the examples.)
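One way to probe the conjecture in parentheses is a small simulation that needs no forest at all. The sketch below is an assumption-laden stand-in, not randomForest itself: each "tree" is replaced by a rule that predicts the majority class of its bootstrap sample, and the seed and number of bootstrap rounds are arbitrary. It illustrates one possible contributing factor: an observation that is out-of-bag contributes nothing to its own class in the bootstrap sample, so with balanced classes the in-bag majority tends to be the other class.

```r
# Stand-in for a signal-free classifier scored on out-of-bag observations.
set.seed(42)                         # hypothetical seed
labels <- rep(1:2, each = 10)        # balanced classes, as in the post
n <- length(labels)
n_boot <- 2000                       # arbitrary number of bootstrap rounds

hits <- 0
total <- 0
for (b in seq_len(n_boot)) {
  in_bag <- sample(n, n, replace = TRUE)      # bootstrap sample of row indices
  oob <- setdiff(seq_len(n), in_bag)          # rows never drawn this round
  counts <- tabulate(labels[in_bag], nbins = 2)
  if (counts[1] == counts[2]) next            # skip ties, no majority class
  majority <- which.max(counts)
  hits <- hits + sum(labels[oob] == majority) # OOB rows the rule gets right
  total <- total + length(oob)
}

# share of out-of-bag rows whose class matches the in-bag majority;
# expected to fall below 0.5 for balanced classes
oob_accuracy <- hits / total
```

If this mechanism is at work, a below-chance out-of-bag score on scrambled labels would be a property of the evaluation scheme with small balanced classes rather than evidence about the data.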
Question 2: With my Y-scrambling exercise I seem to have demonstrated that
the original dataset is not random. Yet it
is random by construction. Is this just a fluke, or is something wrong
with my protocol?
thanks in advance,
Clayton