[R] rpart memory problem

Mon Mar 21 20:32:29 CET 2005

jenniferbecq at free.fr wrote:

> Hi everyone,
> 
> I have a problem using rpart (R 2.0.1 under Unix).
> 
> I have a large matrix (9271x7). My response variable is numeric and all
> my predictor variables are categorical (from 3 to 8 levels).

Your problem is the number of levels. Each level gives you roughly one dummy
variable, so the problem becomes really huge.

Uwe Ligges
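
For a sense of scale, an unordered factor with k levels can be divided into two non-empty groups in 2^(k-1) - 1 ways. The helper below (n_binary_splits is a name made up for this note) only counts those candidate partitions for the 3 to 8 levels mentioned in the question. It is a counting sketch, not a description of how rpart searches for splits internally.

```r
# Number of ways to split k unordered levels into two non-empty groups.
# (Hypothetical helper, for illustration only.)
n_binary_splits <- function(k) 2^(k - 1) - 1

# Candidate partitions for factors with 3 to 8 levels
sapply(3:8, n_binary_splits)
# 3 7 15 31 63 127
```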

> 
> Here is an example:
> 
> 
> > mydata[1:5,]
> 
>                   distance group3 group4 group5 group6 group7 group8
> pos_1    0.141836040224967      a      c      e      a      g      g
> pos_501  0.153605961621317      a      a      a      a      g      g
> pos_1001 0.152246705384699      a      c      e      a      g      g
> pos_1501 0.145563737522463      a      c      e      a      g      g
> pos_2001 0.143940027378837      a      c      e      e      g      g
> 
> When using rpart() as follows, the program runs for ages and, after a few hours,
> R is abruptly killed:
> 
> library(rpart)
> fit <- rpart(distance ~ ., data = mydata)
> 
> When I change the categorical variables into numeric values (e.g. a = 1, b = 2,
> c = 3, etc.), the program runs normally in a few seconds. But this is not
> what I want, because it splits my variables on thresholds such as "group7 > 4.5"
> (continuous) and not on level sets such as "group7 = a,b,d,f" or "c,e,g" (discrete).
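
The difference described here can be reproduced on a small scale. The sketch below uses made-up toy data (toy, toy_num and the simulated distance values are assumptions, not the poster's data) in which the response depends on membership in the level set {c, e, g}. With group7 stored as a factor, rpart can split on a set of levels. With integer codes it can only split on thresholds, so it needs several cuts to isolate the same groups.

```r
library(rpart)

set.seed(1)

# Hypothetical toy data: one 7-level categorical predictor
toy <- data.frame(
  group7 = factor(sample(letters[1:7], 500, replace = TRUE),
                  levels = letters[1:7])
)

# The response depends on a subset of levels, not on their order
toy$distance <- ifelse(toy$group7 %in% c("c", "e", "g"), 0.25, 0.15) +
  rnorm(500, sd = 0.01)

# Factor predictor: splits are expressed as sets of levels
fit_factor <- rpart(distance ~ group7, data = toy)

# Integer-coded predictor (a = 1, b = 2, ...): splits are thresholds
toy_num <- transform(toy, group7 = as.integer(group7))
fit_numeric <- rpart(distance ~ group7, data = toy_num)

print(fit_factor)
print(fit_numeric)
```

Comparing the two printed trees shows the level-set splits in the first fit and the "group7 < x" style splits in the second.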
> 
> Here is the result:
> 
> > fit
> 
> n= 9271
> 
> node), split, n, deviance, yval
>       * denotes terminal node
> 
>  1) root 9271 28.43239000 0.1768883
>    2) group7>=4.5 5830  4.87272700 0.1534626
>      4) group5< 5.5 5783  3.29538700 0.1520110
>        8) group5>=4.5 3068  0.68517040 0.1412967 *
>        9) group5< 4.5 2715  1.86003600 0.1641184 *
>      5) group5>=5.5 47  0.06597044 0.3320614 *
>    3) group7< 4.5 3441 14.93984000 0.2165781
>      6) group5< 1.5 1461  1.00414700 0.1906630 *
>      7) group5>=1.5 1980 12.23050000 0.2357002
>       14) group6>=2.5 1659  2.95395700 0.2090232
>         28) group3>=2.5 1315  1.65184200 0.1957505 *
>         29) group3< 2.5 344  0.18490260 0.2597607 *
>       15) group6< 2.5 321  1.99404400 0.3735729 *
> 
> 
> When I create a small data frame such as the example above, e.g.:
> 
> distance = rnorm(5,0.15,0.01)
> group3 = c("a","a","a","a","a")
> group4 = c("c","a","c","c","c")
> group5 = c("e","a","e","e","e")
> group6 = c("a","a","a","a","e")
> smalldata = data.frame(cbind(distance,group3,group4,group5,group6))
> 
> The program runs normally in a few seconds.
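
One thing worth checking in the construction above: cbind() on a mix of numeric and character vectors returns a character matrix, so after data.frame(cbind(...)) the distance column is no longer numeric (it becomes a factor or a character column, depending on the R version). The sketch below contrasts that with passing the vectors to data.frame() directly. The names via_cbind and direct are made up for this illustration.

```r
distance <- rnorm(5, 0.15, 0.01)
group3 <- c("a", "a", "a", "a", "a")
group4 <- c("c", "a", "c", "c", "c")

# cbind() first builds a character matrix, so 'distance' loses its numeric type
via_cbind <- data.frame(cbind(distance, group3, group4))

# Passing the vectors directly keeps 'distance' numeric;
# the predictors are made factors explicitly
direct <- data.frame(distance,
                     group3 = factor(group3),
                     group4 = factor(group4))

# Compare the column classes of the two data frames
sapply(via_cbind, class)
sapply(direct, class)
```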
> 
> Why does it work on the large dataset with only numeric values but not with
> categorical predictor variables?
>
> I have the impression that it considers my response variable also as a
> categorical variable and therefore it can't handle 9271 levels, which is quite
> normal. Is there a way to solve this problem ?
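
Whether the response really is stored as a factor cannot be told from the post, but it is quick to check with str(). The sketch below builds a made-up stand-in for mydata (an assumption, not the real data) in which distance is wrongly a factor, converts it back to numeric, and then asks rpart for a regression tree explicitly with method = "anova". Note that as.numeric() applied directly to a factor returns the integer level codes, hence the detour through as.character().

```r
library(rpart)

set.seed(42)

# Hypothetical stand-in for 'mydata', with the response wrongly stored as a factor
mydata <- data.frame(
  distance = factor(round(rnorm(200, 0.15, 0.01), 6)),
  group3   = factor(sample(c("a", "b", "c"), 200, replace = TRUE)),
  group7   = factor(sample(letters[1:7], 200, replace = TRUE))
)

# Inspect the column types: 'distance' shows up as a Factor here
str(mydata)

# Convert the factor labels back to numbers (not the level codes)
mydata$distance <- as.numeric(as.character(mydata$distance))
stopifnot(is.numeric(mydata$distance))

# Request a regression tree explicitly instead of relying on the response type
fit <- rpart(distance ~ ., data = mydata, method = "anova")
fit$method
```

If str(mydata) already reports distance as num on the real data, this conversion is unnecessary and the cause lies elsewhere.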
> 
> I thank you all for your time and help,
> 
> Jennifer Becq