[Rd] unexpected behavior of rpart 3.1-43, loss matrix

Liaw, Andy andy_liaw at merck.com
Tue May 5 15:39:42 CEST 2009


Just expressing MHO:  The algorithm cannot predict classes that never
appear in the training data, so any entries in the loss matrix related
to such classes are irrelevant w.r.t. the training data.  They should
be removed before the matrix is fed to rpart (or any other algorithm
that can make use of a loss matrix).  As I see it, it's the
responsibility of the data analyst to take care of such things.  The
current error message may not make it obvious what the problem is, but
if I were the developer, I would not write the code to accept such
mismatched input without issuing an error.
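
A minimal sketch of that cleanup, using the data frame from the report
quoted below.  It assumes the loss matrix carries the factor levels as
dimnames; the names loss_full and loss_used and the unit off-diagonal
losses are made up for illustration, not taken from the original post:

```r
# Data from the quoted report: six declared levels, four observed.
df <- data.frame(attr = 1:5,
                 class = factor(c(2, 3, 1, 5, 3), levels = 1:6))

# Hypothetical full 6 x 6 loss matrix (zero diagonal, unit loss elsewhere),
# labelled by level so it can be subset by name.
loss_full <- 1 - diag(6)
dimnames(loss_full) <- list(levels(df$class), levels(df$class))

# Drop the levels that never occur, then keep only the matching
# rows and columns of the loss matrix.
df$class <- droplevels(df$class)
keep <- levels(df$class)
loss_used <- loss_full[keep, keep]

# Fit only if rpart is installed; the 4 x 4 matrix now matches the
# four classes present in the data.
if (requireNamespace("rpart", quietly = TRUE)) {
  fit <- rpart::rpart(class ~ attr, data = df,
                      parms = list(loss = loss_used))
}
```

Subsetting by level name rather than by position keeps the rows and
columns aligned with the classes even when the missing levels are not
the last ones.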

Andy 

> -----Original Message-----
> From: r-devel-bounces at r-project.org 
> [mailto:r-devel-bounces at r-project.org] On Behalf Of Lars
> Sent: Thursday, April 30, 2009 12:43 PM
> To: r-devel at r-project.org
> Subject: [Rd] unexpected behavior of rpart 3.1-43, loss matrix
> 
> Hi,
> 
> I just noticed that rpart behaves unexpectedly when performing
> classification learning with a loss matrix specified.
> If the response variable y is a factor and not all levels of the
> factor occur in the observations, rpart exits with an error:
> 
> 
> > df=data.frame(attr=1:5,class=factor(c(2,3,1,5,3),levels=1:6))
> > rpart(class~attr,df,parms=list(loss=matrix(0,6,6)))
> Error in (get(paste("rpart", method, sep = ".")))(Y, offset, parms, wt) :
>   Wrong length for loss matrix
> 
> 
> note that while the levels of the factor range over 1:6, only
> levels 1, 2, 3 and 5 occur in the observed data.
> 
> the error is caused by the code of rpart.class:
> 
>  fy <- as.factor(y)
>  y <- as.integer(fy)
>  numclass <- max(y[!is.na(y)])
> ...
> 
> temp2 <- parms$loss
> if (length(temp2) != numclass^2)
>   stop("Wrong length for loss matrix")
> 
> 
> for the example, numclass is set to 5 instead of 6.
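
The arithmetic behind that message can be reproduced in base R without
rpart.  This is only a sketch of the quoted lines; the variable names
numclass_max and numclass_levels are introduced here for illustration:

```r
# Response from the example: levels 1..6 declared, 4 and 6 never observed.
y <- factor(c(2, 3, 1, 5, 3), levels = 1:6)

# What the quoted rpart.class lines compute: the largest integer code
# that actually occurs in the data.
codes <- as.integer(as.factor(y))
numclass_max <- max(codes[!is.na(codes)])

# The number of declared levels, which is what the 6 x 6 matrix assumes.
numclass_levels <- nlevels(y)

# The length check from the quoted code: 36 entries versus 5^2 = 25.
loss <- matrix(0, 6, 6)
mismatch <- length(loss) != numclass_max^2
```

Here numclass_max is 5 and numclass_levels is 6, so the check is TRUE
and the error is raised.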
> 
> 
> while for that small example it may be debatable whether or not
> numclass should be 6, consider a collection of data sets whose
> response variable has a fixed range of levels. It may then happen
> that, for some of the data sets, not all levels of the response
> variable occur. at the same time, it is desirable to use the same
> loss matrix when training a decision tree on each of them.
> 
> 
> having said that, i am very happy with the rpart package and with its
> high configurability.
> 
> best regards
> lars
> 