[Rd] rpart error with 0-frequency factor levels (with partial fix) (PR#1378)

Wed, 13 Mar 2002 18:05:42 +0100 (MET)

(I'm sending to r-bugs because rpart is one of the recommended packages and
is always installed.  I'm also sending it directly to Dr. Ripley, as the
maintainer.)

rpart working as a classifier does not work (produces no splits) when the
class indicator has no instances of one of the factor levels, as long as the
factor level is not the final level. I have at least a partial fix, which I
describe after the example.  I don't think the fix is ideal.  I also don't
know whether some of the other rpart methods might have a similar problem.

This may be an atypical situation, but I have come across this when using
rpart on a subset of a data frame that happens not to have any instances of
one level of a factor.

I've gotten identical results with R-1.4.1 on NT 4.0 and Mac OS X (Darwin
version, running as an inferior process under emacs).  System info for both
machines is below.

----------------------------------------------------------------------------
----------------------------------------------------------------------------
-

Here's an example.

The following is run immediately following startup in a completely empty
directory (no .RData, no .Rhistory).

> t1 <- data.frame(V1 = sample(c("A", "B"), 100, replace = T), V2 =
runif(100), V3 = runif(100))
> 
> levels(t1$V1)
[1] "A" "B"
> 
> t2 <- t1
> t2$V1 <- factor(t2$V1, levels = c("A", "B", "C"))
> t3 <- t2
> t3$V1[t3$V1 == "B"] <- "C"
> 
> table(t2$V1)

 A  B  C 
51 49  0 
> table(t3$V1)

 A  B  C 
51  0 49 
> 
> library(rpart)
> t1.rpart <- rpart(V1 ~ V2 + V3, data = t1, method = "class")
> t2.rpart <- rpart(V1 ~ V2 + V3, data = t2, method = "class")
> t3.rpart <- rpart(V1 ~ V2 + V3, data = t3, method = "class")
> 
> print(t1.rpart)
n= 100 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 100 49 A (0.5100000 0.4900000)  
   2) V3< 0.319873 33 13 A (0.6060606 0.3939394)  
     4) V3>=0.0740043 26  8 A (0.6923077 0.3076923) *
     5) V3< 0.0740043 7  2 B (0.2857143 0.7142857) *
   3) V3>=0.319873 67 31 B (0.4626866 0.5373134)  
     6) V2>=0.2482707 54 26 A (0.5185185 0.4814815)  
      12) V3< 0.7424548 31 13 A (0.5806452 0.4193548)  
        24) V3>=0.3740281 23  8 A (0.6521739 0.3478261) *
        25) V3< 0.3740281 8  3 B (0.3750000 0.6250000) *
      13) V3>=0.7424548 23 10 B (0.4347826 0.5652174)  
        26) V3>=0.9291285 8  3 A (0.6250000 0.3750000) *
        27) V3< 0.9291285 15  5 B (0.3333333 0.6666667) *
     7) V2< 0.2482707 13  3 B (0.2307692 0.7692308) *
> 
> print(t2.rpart)
n= 100 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 100 49 A (0.5100000 0.4900000)  
   2) V3< 0.319873 33 13 A (0.6060606 0.3939394)  
     4) V3>=0.0740043 26  8 A (0.6923077 0.3076923) *
     5) V3< 0.0740043 7  2 B (0.2857143 0.7142857) *
   3) V3>=0.319873 67 31 B (0.4626866 0.5373134)  
     6) V2>=0.2482707 54 26 A (0.5185185 0.4814815)  
      12) V3< 0.7424548 31 13 A (0.5806452 0.4193548)  
        24) V3>=0.3740281 23  8 A (0.6521739 0.3478261) *
        25) V3< 0.3740281 8  3 B (0.3750000 0.6250000) *
      13) V3>=0.7424548 23 10 B (0.4347826 0.5652174)  
        26) V3>=0.9291285 8  3 A (0.6250000 0.3750000) *
        27) V3< 0.9291285 15  5 B (0.3333333 0.6666667) *
     7) V2< 0.2482707 13  3 B (0.2307692 0.7692308) *
> 
> print(t3.rpart)
n= 100 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100 NaN A (NaN NaN NaN) *
> 
> 

If I recode the factor back down to 2 levels it works again:

> t4 <- t3
> t4$V1 <- factor(t4$V1, levels = c("A", "C"))
> table(t4$V1)

 A  C 
51 49 
> 
> t4.rpart <- rpart(V1 ~ V2 + V3, data = t4, method = "class")
> print(t4.rpart)
n= 100 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

 1) root 100 49 A (0.5100000 0.4900000)  
   2) V3< 0.319873 33 13 A (0.6060606 0.3939394)  
     4) V3>=0.0740043 26  8 A (0.6923077 0.3076923) *
     5) V3< 0.0740043 7  2 C (0.2857143 0.7142857) *
   3) V3>=0.319873 67 31 C (0.4626866 0.5373134)  
     6) V2>=0.2482707 54 26 A (0.5185185 0.4814815)  
      12) V3< 0.7424548 31 13 A (0.5806452 0.4193548)  
        24) V3>=0.3740281 23  8 A (0.6521739 0.3478261) *
        25) V3< 0.3740281 8  3 C (0.3750000 0.6250000) *
      13) V3>=0.7424548 23 10 C (0.4347826 0.5652174)  
        26) V3>=0.9291285 8  3 A (0.6250000 0.3750000) *
        27) V3< 0.9291285 15  5 C (0.3333333 0.6666667) *
     7) V2< 0.2482707 13  3 C (0.2307692 0.7692308) *

And, just to make sure I get the same result if it's the first factor level
missing (so it's not just a property of the middle level missing):

> t5 <- t3
> t5$V1[t5$V1 == "A"] <- "B"
> table(t5$V1)

 A  B  C 
 0 51 49 
> 
> 
> t5.rpart <- rpart(V1 ~ V2 + V3, data = t5, method = "class")
> 
> print(t5.rpart)
n= 100 

node), split, n, loss, yval, (yprob)
      * denotes terminal node

1) root 100 NaN A (NaN NaN NaN) *
> 
----------------------------------------------------------------------------
-----------------------------------
I tried to trace the problem, and took a look at the beginning of
rpart.class:

"rpart.class" <-
  function (y, offset, parms, wt) 
{
  if (!is.null(offset)) 
    stop("No offset allowed in classification models")
  fy <- as.factor(y)
  y <- as.integer(fy)
  numclass <- max(y[!is.na(y)])
...
}
I thought the problem might have to do with the fact that numclass would
register 3 for my examples t3 and t5, but 2 for my examples t2 and t4.  So I
tried substituting 

numclass <- length(unique(y[!is.na(y)])) 

to get the number of factors present rather than the largest integer code
used, but that didn't help.

Then I tried running nnet using this data, and it worked fine on all the
examples, with a warning about an empty level.  So I lifted some code from
nnet to make the beginning of rpart.class look like this:

"rpart.class" <-
  function (y, offset, parms, wt) 
{
  if (!is.null(offset)) 
    stop("No offset allowed in classification models")
  fy <- as.factor(y)
### here comes code from nnet
    lev <- levels(y)
  counts <-  table(fy)
  if (any(counts == 0)) {
      warning(paste("group(s)", paste(lev[counts == 0], 
                                      collapse = " "), "are empty"))
      fy <- factor(fy, levels = lev[counts > 0])
    }
### done with code from nnet
  y <- as.integer(fy)
  numclass <- max(y[!is.na(y)])
...
}

When I run this it works, but the factor gets recoded.  For many
applications that's fine -- It seems to work OK when I use (say) the
predicted factor as a factor, but I'm worried about what could happen if I
use as.integer(predict(rpart.obj)) eventually and compare it to something
with the original encoding.   Not good programming practice, but I wouldn't
be surprised to find that I did it somewhere.

So the ideal would be saving the coding and restoring it somewhere.  But it
looks to me like that would require saving this somewhere and restoring the
coding after rpart (if I restore it in rpart.class, I should just run into
the problems I had before).  And then I'm not sure whether the core group
wants to mess with rpart to this extent.

Thanks for the terrific software.  I hope this is helpful.

Matthew Wiener
Applied Computer Science and Mathematics Department
Merck Research Labs
Rahway, NJ  07065-0900
732-594-5303

--please do not edit the information below--

Version:
 platform = i386-pc-mingw32
 arch = x86
 os = Win32
 system = x86, Win32
 status = 
 major = 1
 minor = 4.1
 year = 2002
 month = 01
 day = 30
 language = R

Windows NT 4.0 (build 1381) Service Pack 6

Search Path:
 .GlobalEnv, package:rpart, package:ctest, Autoloads, package:base

--please do not edit the information below--

Version:
 platform = powerpc-apple-darwin5.2
 arch = powerpc
 os = darwin5.2
 system = powerpc, darwin5.2
 status = 
 major = 1
 minor = 4.1
 year = 2002
 month = 01
 day = 30
 language = R

Search Path:
 .GlobalEnv, package:rpart, package:ctest, Autoloads, package:base

-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-.-
r-devel mailing list -- Read http://www.ci.tuwien.ac.at/~hornik/R/R-FAQ.html
Send "info", "help", or "[un]subscribe"
(in the "body", not the subject !)  To: r-devel-request@stat.math.ethz.ch
_._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._._