[R] randomForest and Y-scrambling on a small synthetic dataset

Fri May 7 17:26:04 CEST 2004

Dear r-help,

The following dataset (generated with perl) has 20 observations of 100 
independent variables (integers drawn uniformly from 1 to 9), split 
evenly between two classes (10 observations each).
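
For anyone without the perl script, a dataset of the same shape could be built directly in R. This is only a sketch: the function name make_data, its arguments, and the seed are illustrative assumptions, not the original generator. It assumes 20 rows in total (10 per class, matching the confusion matrix below) and stores the class label in V101.

```r
# Sketch: build a 20 x 101 data frame shaped like rf_input.dat.
# make_data and its arguments are hypothetical names, not the perl generator.
make_data <- function(n_per_class = 10, n_vars = 100, seed = 1) {
  set.seed(seed)
  n <- 2 * n_per_class
  # predictors: integers drawn uniformly from 1..9, unrelated to the class
  x <- matrix(sample(1:9, n * n_vars, replace = TRUE), nrow = n)
  d <- as.data.frame(x)                      # columns are named V1..V100
  # class label in the last column (V101 for the default n_vars)
  d[[paste0("V", n_vars + 1)]] <- rep(1:2, each = n_per_class)
  d
}

data <- make_data()
dim(data)
table(data$V101)
```

Because the predictors are drawn independently of the label, any apparent signal in such a dataset is noise by construction.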

First I show some work, and then ask two questions at the end.

> data <- read.table ("rf_input.dat")
> library (randomForest)
# if we do randomForest one time it looks like this:

> rf <- randomForest (factor(V101) ~. ,data=data)
> rf$confusion
  1 2 class.error
1 5 5         0.5
2 4 6         0.4

# now we do it 100 times 

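tnum <- numeric()

for (i in 1:100) {
   MT <- data$V101
   MT.rf <- randomForest (factor(MT) ~ . ,data =data[-c(101)])
   # count the correct out-of-bag predictions; unlike summary (...)[3],
   # sum () also works when every prediction is right or every one is wrong
   number <- sum (predict(MT.rf) == MT)
   tnum <- c(tnum,number)
}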


# and get this distribution of results (about 13 correct out of 20):
>  quantile (tnum,probs = seq (0,1,0.1),na.rm = T)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
   9   11   12   12   13   13   13   14   14   15   17 

# now let's permute (Y-scramble) the classes and repeat 1000 times:

> library (gregmisc)
tnum <- numeric()

for (i in 1:1000) {
   # permute () shuffles the class labels; sample (data$V101) in base R does the same
   MT <- permute (data$V101)
   MT.rf <- randomForest (factor(MT) ~ . ,data =data[-c(101)])
   # count the correct out-of-bag predictions
   number <- sum (predict(MT.rf) == MT)
   tnum <- c(tnum,number)
}

# I get these results: the average is about 8 correct (out of 20), with 13
# correct falling at about the 95th percentile

> quantile (tnum,probs = seq (0,1,0.1),na.rm = T)
  0%  10%  20%  30%  40%  50%  60%  70%  80%  90% 100% 
   1    4    5    6    7    8    8    9   10   12   18 
>  quantile (tnum,probs = seq (0.9,1,0.01),na.rm = T)
 90%  91%  92%  93%  94%  95%  96%  97%  98%  99% 100% 
  12   12   12   12   12   13   13   14   14   15   18 
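
The comparison between an unscrambled score and the scrambled distribution can also be written as an empirical p-value rather than read off the quantile table. The following is a minimal sketch under my own assumptions: perm_pvalue is a hypothetical helper name, the toy vector is made up for illustration and is not the tnum from the run above, and the +1 correction is one common convention rather than the only one.

```r
# Sketch: empirical one-sided p-value of an observed score against a
# permutation (Y-scrambled) null distribution.
# perm_pvalue is a hypothetical helper, not part of any package.
perm_pvalue <- function(observed, null_scores) {
  null_scores <- null_scores[!is.na(null_scores)]
  # the +1 counts the observed score as one of the permutations,
  # so the estimate can never be exactly zero
  (1 + sum(null_scores >= observed)) / (1 + length(null_scores))
}

# toy null distribution, NOT the tnum values from the run above
toy_null <- c(4, 5, 6, 7, 8, 8, 9, 10, 12, 13)
perm_pvalue(13, toy_null)
```

With the real results this would be called as perm_pvalue (13, tnum), where tnum holds the 1000 scrambled counts.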

--------

My two questions:

Question 1: Naively I might have expected to get 10/20 for the Y-scrambled 
examples, but instead I got 8/20.  Why is that?
(Presumably it has something to do with randomForest training each tree 
on only about 2/3 of the examples.)
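
To put a number on the "2/3": assuming each tree is grown on a bootstrap sample of the rows (sampling with replacement, which is the randomForest default), the expected fraction of distinct observations a tree sees is 1 - (1 - 1/n)^n. The sketch below evaluates that for n = 20 and checks it by simulation; in_bag_fraction is a hypothetical helper name. It only quantifies the presumption and does not by itself explain the 8/20.

```r
# Expected fraction of distinct observations in a bootstrap sample of size n.
# in_bag_fraction is a hypothetical helper name.
in_bag_fraction <- function(n) 1 - (1 - 1 / n)^n

in_bag_fraction(20)    # for the 20 observations in this dataset
1 - exp(-1)            # large-n limit of the same expression

# quick simulation check: draw 20 row indices with replacement and
# record what fraction of the 20 rows appears at least once
set.seed(1)
sim <- replicate(10000, length(unique(sample(20, replace = TRUE))) / 20)
mean(sim)
```

The remaining rows are out of bag for that tree, and those are the rows predict (MT.rf) scores when it is called without new data.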

Question 2: With my Y-scrambling exercise I seem to have demonstrated that 
the original dataset is not random, yet it is random by construction. 
Is this just a fluke, or is something wrong with my protocol?

thanks in advance,

Clayton