[R] Problem while predicting in regression trees

Max Kuhn mxkuhn at gmail.com
Tue May 10 00:22:30 CEST 2016


I've brought this up numerous times... you shouldn't use `predict.rpart`
(or whatever modeling function) from the `finalModel` object. That object
has no idea what was done to the data prior to its invocation.

The issue here is that `train(formula)` converts the factors to dummy
variables. `rpart` does not require that and the `finalModel` object has no
idea that that happened. Using `predict.train` works just fine so why not
use it?
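
To illustrate the conversion, here is a minimal base-R sketch. The toy data
frame `d` and its values are made up for illustration, and `model.matrix` is
used only as a stand-in for the encoding step; it is not meant to show exactly
what `train` does internally.

```r
# Toy data (hypothetical values, not the poster's data set)
d <- data.frame(
  sector = factor(c("Defense", "Hospitals", "Schools", "Hospitals")),
  capital_value = c(6.7, 5.8, 21.8, 24.2),
  project_delay = c(-323, 0, -60, 0)
)

# Expanding the formula turns the factor into 0/1 indicator columns
x <- model.matrix(project_delay ~ sector + capital_value, data = d)
dummy_names <- colnames(x)
print(dummy_names)

# The encoded matrix has a 'sectorHospitals' column; the raw data frame
# only has 'sector', which is why a model fitted on the encoded columns
# cannot find 'sectorHospitals' in the raw test data
print("sectorHospitals" %in% dummy_names)  # TRUE
print("sectorHospitals" %in% names(d))     # FALSE
```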

> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923 -1777.58333333333           -1217.3
                3                 3                 6                 3
-886.666666666667          -408.375            -375.7 -240.307692307692
                5                 1                 4                 5
-201.612903225806 -19.6071428571429  30.8083333333333              43.9
               30                72                66                 9
            151.5  209.647058823529
                6                28
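
For anyone wanting to reproduce the failure without the attached data, the
sketch below mimics the situation using only `rpart`. Everything in it is an
assumption made for illustration: the simulated data, the helper `encode()`,
and the use of `model.matrix` in place of whatever `train` does internally. It
fits a tree on dummy-encoded columns, shows that predicting from the raw data
frame fails, and shows that predicting works once the new data get the same
encoding, which is the bookkeeping `predict.train` takes care of.

```r
library(rpart)

set.seed(100)
n <- 200

# Simulated data (hypothetical; column names borrowed from the thread)
d <- data.frame(
  sector = factor(sample(c("Defense", "Hospitals", "Schools"), n, replace = TRUE)),
  capital_value = runif(n, 1, 100)
)
d$project_delay <- ifelse(d$sector == "Hospitals", -200, 0) + rnorm(n, sd = 10)

train_rows <- seq_len(150)
train_df <- d[train_rows, ]
test_df <- d[-train_rows, ]

# Hypothetical helper: dummy-encode the predictors and drop the intercept column
encode <- function(df) {
  as.data.frame(model.matrix(~ sector + capital_value, data = df)[, -1])
}

# Fit the tree on the encoded predictors, so it only knows the dummy columns
train_x <- encode(train_df)
train_x$project_delay <- train_df$project_delay
fit <- rpart(project_delay ~ ., data = train_x)

# The raw test data lack the dummy columns, so this call fails
raw_attempt <- try(predict(fit, newdata = test_df), silent = TRUE)
print(inherits(raw_attempt, "try-error"))

# With the same encoding applied to the new data, prediction works
pred <- predict(fit, newdata = encode(test_df))
sse <- sum((pred - test_df$project_delay)^2)
print(sse)
```

With a `train` object there is no need for a helper like `encode()`: calling
`predict(tr_m, newdata = testPFI)` as above is enough.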

On Mon, May 9, 2016 at 2:46 PM, Muhammad Bilal <
Muhammad2.Bilal at live.uwe.ac.uk> wrote:

> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
>
> #Loading the data frame
>
> pfi <- read.csv("pfi_data.csv")
>
> #Splitting the data into training and test sets
> #(sample.split() from caTools expects the outcome vector, not the data frame)
> split <- sample.split(pfi$project_delay, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
> + sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions *[This command raises the error]*
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #Displaying the SSE
> best_tree_pred.sse
>
> ...
>
> Many Thanks and
>
>
> Kind Regards
>
>
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bilal at live.uwe.ac.uk
>
>
> ------------------------------
> *From:* Max Kuhn <mxkuhn at gmail.com>
> *Sent:* 09 May 2016 17:22:22
> *To:* Muhammad Bilal
> *Cc:* Bert Gunter; r-help at r-project.org
>
> *Subject:* Re: [R] Problem while predicting in regression trees
>
> It is extremely difficult to tell what the issue might be without a
> reproducible example.
>
> The only thing that I can suggest is to use the non-formula interface to
> `train` so that you can avoid creating dummy variables.
>
> On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <
> Muhammad2.Bilal at live.uwe.ac.uk> wrote:
>
>> Hi Bert,
>>
>> Thanks for the response.
>>
>> I checked the datasets; however, the Hospitals level appears in both of
>> them. See the output below:
>>
>> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>>             sector count(*)
>> 1          Defense        9
>> 2        Hospitals      101
>> 3          Housing       32
>> 4           Others       99
>> 5 Public Buildings       39
>> 6          Schools      148
>> 7      Social Care       10
>> 8   Transportation       27
>> 9            Waste       26
>> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>>             sector count(*)
>> 1          Defense        5
>> 2        Hospitals       47
>> 3          Housing       11
>> 4           Others       44
>> 5 Public Buildings       18
>> 6          Schools       69
>> 7      Social Care        9
>> 8   Transportation        8
>> 9            Waste       12
>>
>> Anything else to try?
>>
>> --
>> Muhammad Bilal
>> Research Fellow and Doctoral Researcher,
>> Bristol Enterprise, Research, and Innovation Centre (BERIC),
>> University of the West of England (UWE),
>> Frenchay Campus,
>> Bristol,
>> BS16 1QY
>>
>> muhammad2.bilal at live.uwe.ac.uk
>>
>>
>> ________________________________________
>> From: Bert Gunter <bgunter.4567 at gmail.com>
>> Sent: 09 May 2016 01:42:39
>> To: Muhammad Bilal
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Problem while predicting in regression trees
>>
>> It seems that the data that you used for prediction contained a level
>> "Hospitals" for the sector factor that did not appear in the training
>> data (or maybe it's the other way round). Check this.
>>
>> Cheers,
>> Bert
>>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
>> <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
>> > Hi All,
>> >
>> > I have the following script, which raises an error at the last command.
>> > I am new to R and would appreciate some clarification on what is going wrong.
>> >
>> > #Creating the training and testing data sets
>> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
>> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
>> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
>> >
>> >
>> > #Structure of the trainPFI data frame
>> >> str(trainPFI)
>> > *******
>> > 'data.frame': 491 obs. of  16 variables:
>> >  $ project_id             : int  1 2 3 6 7 9 10 12 13 14 ...
>> >  $ project_lat            : num  51.4 51.5 52.2 51.9 52.5 ...
>> >  $ project_lon            : num  -0.642 -1.85 0.08 -0.401 -1.888 ...
>> >  $ sector                 : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
>> >  $ contract_type          : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
>> >  $ project_duration       : int  1826 3652 121 730 730 790 522 819 998 372 ...
>> >  $ project_delay          : int  -323 0 -60 0 0 0 -91 0 0 7 ...
>> >  $ capital_value          : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
>> >  $ project_delay_pct      : num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
>> >  $ delay_type             : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
>> >
>> > library(caret)
>> > library(e1071)
>> >
>> > set.seed(100)
>> >
>> > tr.control <- trainControl(method="cv", number=10)
>> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
>> >
>> > #Fitting the model using regression tree
>> > tr_m <- train(project_delay ~ project_lon + project_lat +
>> >               project_duration + sector + contract_type + capital_value,
>> >               data = trainPFI, method = "rpart",
>> >               trControl = tr.control, tuneGrid = cp.grid)
>> >
>> > tr_m
>> >
>> > CART
>> > 491 samples
>> > 15 predictor
>> > No pre-processing
>> > Resampling: Cross-Validated (10 fold)
>> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
>> > Resampling results across tuning parameters:
>> >   cp     RMSE      Rsquared
>> >   0.000  441.1524  0.5417064
>> >   0.001  439.6319  0.5451104
>> >   0.002  437.4039  0.5487203
>> >   0.003  432.3675  0.5566661
>> >   0.004  434.2138  0.5519964
>> >   0.005  431.6635  0.5577771
>> >   0.006  436.6163  0.5474135
>> >   0.007  440.5473  0.5407240
>> >   0.008  441.0876  0.5399614
>> >   0.009  441.5715  0.5401718
>> >   0.010  441.1401  0.5407121
>> > RMSE was used to select the optimal model using  the smallest value.
>> > The final value used for the model was cp = 0.005.
>> >
>> > #Fetching the best tree
>> > best_tree <- tr_m$finalModel
>> >
>> > Alright, all the aforementioned commands worked fine.
>> >
>> > However, the next command raises an error when the fitted model is
>> > used to make predictions:
>> > best_tree_pred <- predict(best_tree, newdata = testPFI)
>> > Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
>> >
>> > Can someone advise me on how to resolve this issue?
>> >
>> > Any help will be highly appreciated.
>> >
>> > Many Thanks and
>> >
>> > Kind Regards
>> >
>> > --
>> > Muhammad Bilal
>> > Research Fellow and Doctoral Researcher,
>> > Bristol Enterprise, Research, and Innovation Centre (BERIC),
>> > University of the West of England (UWE),
>> > Frenchay Campus,
>> > Bristol,
>> > BS16 1QY
>> >
>> > muhammad2.bilal at live.uwe.ac.uk
>> >
>> >
>> >
>> >         [[alternative HTML version deleted]]
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide
>> http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.
>>
>
>
