[R] rpart question
Joshua Wiley
jwiley.psych at gmail.com
Mon Jan 16 03:10:57 CET 2012
Hi Amanda,
Sorry for the somewhat slow response (classes and research have been
chaotic). Below are details on what I looked at, with a few suggestions
at the end for what you can do.
To the general R community: summary.rpart() explicitly passes
drop = TRUE to `[`, even though that is already the default, which
makes me think the dropping may be important. It seems to cause
problems when the tree has only one node, though: a 1 x k matrix is
selected, and once its dimensions are dropped the result is a vector.
Could this be changed to drop = FALSE (fixing the case for one node)
without causing problems for other models?
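As a minimal illustration of the dropping behavior described above,
here is a sketch in base R. The matrices y and y2 and their values are
made up for the example (loosely modelled on a one-node classification
fit); they are not taken from rpart itself.

```r
## hypothetical 1 x 5 matrix standing in for yval2 of a one-node tree
y <- matrix(c(1, 17, 3, 0.85, 0.15), nrow = 1)

dropped <- y[1, , drop = TRUE]   # the default: dimensions are dropped
kept    <- y[1, , drop = FALSE]  # dimensions are kept

dim(dropped)  # NULL, a plain numeric vector
dim(kept)     # 1 5, still a matrix

## with two or more rows selected, drop = TRUE leaves a matrix,
## which is why the problem only shows up for a single node
y2 <- rbind(y, y)
dim(y2[1:2, , drop = TRUE])

## column indexing in the style of yval[, 1] needs the matrix form
kept[, 1]
res <- try(dropped[, 1], silent = TRUE)
inherits(res, "try-error")  # TRUE: incorrect number of dimensions
```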
Cheers,
Josh
## Read in example data
trial <- structure(list(ENROLL_YN = structure(c(1L, 1L, 1L, 1L, 2L, 1L,
1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("N",
"Y"), class = "factor"), MINORITY = c(0L, 0L, 1L, 0L, 0L, 0L,
0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L)), .Names =
c("ENROLL_YN",
"MINORITY"), class = "data.frame", row.names = c(8566L, 7657L,
3155L, 6429L, 8651L, 7973L, 6L, 5865L, 5878L, 5037L, 6950L, 9139L,
960L, 3058L, 7979L, 2465L, 4231L, 1529L, 7500L, 8248L))
require(rpart)
## fit the model
## no errors, suggesting the problem is not here
m <- rpart(ENROLL_YN ~ MINORITY, data = trial, method="class")
## this throws an error
## makes me think that either some summary information
## or the print/show methods are the cause
summary(m)
## look at the class of the model object
class(m)
## look at the methods for summary
methods(summary)
## poke in the source code for summary.rpart
## (note non exported function so using :::)
rpart:::summary.rpart
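## as an aside, a small sketch of two other ways to reach a
## non-exported S3 method without typing ::: yourself
## (getS3method() and getAnywhere() are both in the utils package;
## the name f is just a placeholder for this example)

```r
library(rpart)

## look the method up by generic and class
f <- getS3method("summary", "rpart")
is.function(f)

## same function object as the ::: lookup
identical(f, rpart:::summary.rpart)

## prints the source and reports where the function was found
getAnywhere("summary.rpart")
```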
## we already know from your traceback() output the code to look for
## x$functions$summary
## looking at the summary.rpart source
## x is the model object
## so....
m$functions$summary
## yval, the first argument evidently needs at least two dimensions
## and at least 2 columns
## back at the summary.rpart code, it looks like what is getting passed is
## else tprint <- x$functions$summary(ff$yval2[rows, , drop = TRUE],
## ff$dev[rows], ff$wt[rows], ylevel, digits)
## so what is ff? It is defined earlier as x$frame (where x is the model object)
m$frame$yval2
## is a 1 x 5 matrix
## look what happens when we select all of it with drop = TRUE
m$frame$yval2[, , drop = TRUE]
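## a self-contained sketch of the same check, comparing both settings
## of drop. The data frame d is a compact stand-in with the same
## ENROLL_YN x MINORITY counts as trial (not the original rows), and
## the names d, fit and y2 are placeholders for this example. The
## number of columns in yval2 may differ between rpart versions, so
## only the number of rows is relied on here.

```r
library(rpart)

## stand-in data: 17 "N" and 3 "Y", with MINORITY == 1 for 3 N and 1 Y
d <- data.frame(
  ENROLL_YN = factor(rep(c("N", "Y"), times = c(17, 3))),
  MINORITY  = c(rep(0L, 14), rep(1L, 3), 0L, 0L, 1L)
)
fit <- rpart(ENROLL_YN ~ MINORITY, data = d, method = "class")

nrow(fit$frame)            # number of nodes in the fitted tree

y2 <- fit$frame$yval2
is.matrix(y2)              # one row per node

dim(y2[, , drop = FALSE])  # dimensions survive
dim(y2[, , drop = TRUE])   # NULL when there is only one row
```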
## looking now at ?rpart.object where we learn that the frame element contains:
## Extra response information is in 'yval2', which contains the
## number of events at the node (poisson), or a matrix
## containing the fitted class, the class counts for each node
## and the class probabilities (classification). Also included
## in the frame are 'complexity', the complexity parameter at
## which this split will collapse, 'ncompete', the number of
## competitor splits retained, and 'nsurrogate', the number of
## surrogate splits retained.
## basically, the issue is, your model (at least in the example data)
## only has 1 node
## so the matrix has 1 row, and when drop = TRUE, this reduces yval2 to a vector
## which causes problems for the summary methods
## I am not familiar enough with rpart to say if this is as it should be
## or if perhaps a modification is in order
## for here and now, you can either just not use summary(),
## find a way to get more nodes,
## or create a copy of rpart:::summary.rpart where you change
## drop = TRUE to drop = FALSE
## around line 57 of the function. Call it something new (like rpartSummary2)
## then rpartSummary2(m) and it will work
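## one possible way to script that edit instead of changing the copy
## by hand. This is only a sketch under assumptions: it replaces
## every literal "drop = TRUE" in the deparsed body, not just the one
## near line 57, so check the result against your installed version.
## If the installed summary.rpart has no drop = TRUE, the replacement
## does nothing. The objects d, fit and src are placeholders; d is a
## stand-in with the same counts as trial.

```r
library(rpart)

## stand-in data and a one-node fit to try the copy on
d <- data.frame(
  ENROLL_YN = factor(rep(c("N", "Y"), times = c(17, 3))),
  MINORITY  = c(rep(0L, 14), rep(1L, 3), 0L, 0L, 1L)
)
fit <- rpart(ENROLL_YN ~ MINORITY, data = d, method = "class")

## copy the method; the copy keeps the rpart namespace as its
## environment, so internal helpers are still found
rpartSummary2 <- rpart:::summary.rpart

## rewrite the body as text, then parse it back into the copy
src <- deparse(body(rpartSummary2))
src <- gsub("drop = TRUE", "drop = FALSE", src, fixed = TRUE)
body(rpartSummary2) <- parse(text = src)[[1]]

rpartSummary2(fit)
```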
## I did this and got:
## > rpartSummary2(m)
## Call:
## rpart(formula = ENROLL_YN ~ MINORITY, data = trial, method = "class")
## n= 20
## CP nsplit rel error
## 1 0.01 0 1
## Node number 1: 20 observations
## predicted class=N expected loss=0.15
## class counts: 17 3
## probabilities: 0.850 0.150
On Wed, Jan 11, 2012 at 1:31 PM, Amanda Marie Elling <elling at stolaf.edu> wrote:
> Hi Josh,
> Thanks for getting back to us so fast!!
> We created a subset of 20 cases and still ran into the same issue, I have
> copied the code below along with the dput() and traceback() outputs.
>
>> trial=accept.students.n08[sample(1:5000,20),]
>> dput(trial[, c("ENROLL_YN", "MINORITY")])
> structure(list(ENROLL_YN = structure(c(1L, 1L, 1L, 1L, 2L, 1L,
> 1L, 1L, 1L, 1L, 1L, 1L, 1L, 2L, 1L, 1L, 1L, 1L, 2L, 1L), .Label = c("N",
> "Y"), class = "factor"), MINORITY = c(0L, 0L, 1L, 0L, 0L, 0L,
> 0L, 0L, 0L, 0L, 1L, 0L, 0L, 0L, 0L, 0L, 0L, 1L, 1L, 0L)), .Names =
> c("ENROLL_YN",
> "MINORITY"), class = "data.frame", row.names = c(8566L, 7657L,
> 3155L, 6429L, 8651L, 7973L, 6L, 5865L, 5878L, 5037L, 6950L, 9139L,
> 960L, 3058L, 7979L, 2465L, 4231L, 1529L, 7500L, 8248L))
>> fit_rpart2=rpart(trial$ENROLL_YN~trial$MINORITY, method="class")
>> summary(fit_rpart2)
> Call:
> rpart(formula = trial$ENROLL_YN ~ trial$MINORITY, method = "class")
> n= 20
>
> CP nsplit rel error
> 1 0.01 0 1
>
> Error in yval[, 1] : incorrect number of dimensions
>> traceback()
> 3: x$functions$summary(ff$yval2[rows, , drop = TRUE], ff$dev[rows],
> ff$wt[rows], ylevel, digits)
> 2: summary.rpart(fit_rpart2)
> 1: summary(fit_rpart2)
>
>> data.frame(trial$MINORITY,trial$ENROLL_YN)
> trial.MINORITY trial.ENROLL_YN
> 1 0 N
> 2 0 N
> 3 1 N
> 4 0 N
> 5 0 Y
> 6 0 N
> 7 0 N
> 8 0 N
> 9 0 N
> 10 0 N
> 11 1 N
> 12 0 N
> 13 0 N
> 14 0 Y
> 15 0 N
> 16 0 N
> 17 0 N
> 18 1 N
> 19 1 Y
> 20 0 N
>
> We are still unsure what the error is referring to. Thoughts?? Let us know
> if you need anything else. Thanks so much for your help!
>
> Amanda
>
>
> On Sun, Jan 8, 2012 at 7:41 PM, Joshua Wiley <jwiley.psych at gmail.com> wrote:
>>
>> Hi Amanda,
>>
>> Can you reproduce the error with a small subset of the data? If so,
>> could you send it to us? For instance if say 20 cases is sufficient,
>> you could send the output of dput() which pastes easily into the
>> console:
>>
>> dput(yourdata[, c("ENROLL_YN", "MINORITY")])
>>
>> You could also try calling traceback() after the error to get a bit
>> more diagnostics (and post those if they do not make any sense or help
>> you).
>>
>> Hope this helps,
>>
>> Josh
>>
>> On Sun, Jan 8, 2012 at 1:48 PM, Amanda Marie Elling <elling at stolaf.edu>
>> wrote:
>> > We are trying to make a decision tree using rpart and we are continually
>> > running into the following error:
>> >
>> >> fit_rpart=rpart(ENROLL_YN~MINORITY,method="class")
>> >> summary(fit_rpart)
>> > Call:
>> > rpart(formula = ENROLL_YN ~ MINORITY, method = "class")
>> > n= 5725
>> >
>> > CP nsplit rel error
>> > 1 0 0 1
>> > Error in yval[, 1] : incorrect number of dimensions
>> >
>> > ENROLL_YN is a categorical variable with two options- yes or no.
>> > MINORITY is also a categorical variable with two options- 0 or 1.
>> >
>> > We have confirmed that all variables are the same length and there are
>> > no NAs.
>> >
>> > Does anyone have any ideas that might help?? All thoughts would be
>> > appreciated, thanks!
>> >
>> > [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> > http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>>
>>
>> --
>> Joshua Wiley
>> Ph.D. Student, Health Psychology
>> Programmer Analyst II, Statistical Consulting Group
>> University of California, Los Angeles
>> https://joshuawiley.com/
>
>
--
Joshua Wiley
Ph.D. Student, Health Psychology
Programmer Analyst II, Statistical Consulting Group
University of California, Los Angeles
https://joshuawiley.com/