[R] Problem while predicting in regression trees
Max Kuhn
mxkuhn at gmail.com
Tue May 10 00:22:30 CEST 2016
I've brought this up numerous times: you shouldn't call `predict.rpart`
(or the predict method of whatever modeling function was used) on the
`finalModel` object. That object has no idea what was done to the data
before the modeling function was called.
The issue here is that `train(formula)` converts the factors to dummy
variables. `rpart` does not require that, and the `finalModel` object has no
idea that it happened. The tree was therefore fit on columns such as
`sectorHospitals`, which do not exist in `testPFI`. Using `predict.train`
works just fine, so why not use it?
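
To illustrate the conversion, here is a minimal sketch using base R's
`model.matrix`, which is roughly what the formula method of `train` does
internally. The `toy` data frame and its values are made up for illustration;
only the column names mirror the thread.

```r
# Toy data (hypothetical values); column names follow the thread.
toy <- data.frame(
  project_delay = c(-323, 0, -60, 0, 7, -91),
  sector        = factor(c("Defense", "Hospitals", "Schools",
                           "Hospitals", "Schools", "Defense")),
  capital_value = c(6.7, 5.8, 21.8, 24.2, 40.7, 10.7)
)

# Expand the factor into dummy variables and drop the intercept column.
x <- model.matrix(project_delay ~ sector + capital_value, data = toy)[, -1]

colnames(x)
# "sectorHospitals" "sectorSchools" "capital_value"

# The raw data has a single `sector` column and no `sectorHospitals`,
# which is why a model fit on `x` cannot be applied to it directly.
"sectorHospitals" %in% names(toy)
# FALSE
```

Calling `predict` on the `train` object avoids the mismatch because the same
expansion is applied to the new data before the tree sees it.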
> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923 -1777.58333333333           -1217.3
                3                 3                 6                 3
-886.666666666667          -408.375            -375.7 -240.307692307692
                5                 1                 4                 5
-201.612903225806 -19.6071428571429  30.8083333333333              43.9
               30                72                66                 9
            151.5  209.647058823529
                6                28
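
For anyone who wants to try this end to end, the following is a sketch of the
two approaches that sidestep the problem: predicting from the `train` object,
and fitting through the non-formula interface so that the factor is never
expanded. It uses the built-in `iris` data as a stand-in, since the PFI data
is not reproduced here. The object names (`fit_f`, `fit_xy` and so on) are
made up for the example, and it assumes the caret and rpart packages are
installed.

```r
library(caret)  # assumes caret (and rpart) are installed

set.seed(100)

# iris stands in for the PFI data: one factor and one numeric predictor.
in_train <- createDataPartition(iris$Sepal.Length, p = 0.7, list = FALSE)
train_df <- iris[in_train, ]
test_df  <- iris[-in_train, ]

ctrl <- trainControl(method = "cv", number = 5)
grid <- data.frame(cp = c(0.001, 0.01))

# Formula interface: Species is converted to dummy variables internally.
fit_f <- train(Sepal.Length ~ Species + Petal.Length, data = train_df,
               method = "rpart", trControl = ctrl, tuneGrid = grid)

# predict() on the train object repeats that conversion for the new data.
pred_f <- predict(fit_f, newdata = test_df)
sse_f  <- sum((pred_f - test_df$Sepal.Length)^2)

# Non-formula interface: the factor column is handed to rpart unchanged.
predictors <- c("Species", "Petal.Length")
fit_xy <- train(x = train_df[, predictors], y = train_df$Sepal.Length,
                method = "rpart", trControl = ctrl, tuneGrid = grid)

pred_xy <- predict(fit_xy, newdata = test_df[, predictors])
sse_xy  <- sum((pred_xy - test_df$Sepal.Length)^2)
```

In both cases the prediction goes through the `train` object rather than
`finalModel`, so any processing done at fit time is repeated at prediction
time.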
On Mon, May 9, 2016 at 2:46 PM, Muhammad Bilal <
Muhammad2.Bilal at live.uwe.ac.uk> wrote:
> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
>
> #Loading the data frame
>
> pfi <- read.csv("pfi_data.csv")
>
> #Splitting the data into training and test sets
> split <- sample.split(pfi, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
> + sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions *[This command raises the error]*
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #Displaying the SSE
> best_tree_pred.sse
>
> ...
>
> Many Thanks and
>
>
> Kind Regards
>
>
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bilal at live.uwe.ac.uk
>
>
> ------------------------------
> From: Max Kuhn <mxkuhn at gmail.com>
> Sent: 09 May 2016 17:22:22
> To: Muhammad Bilal
> Cc: Bert Gunter; r-help at r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It is extremely difficult to tell what the issue might be without a
> reproducible example.
>
> The only thing that I can suggest is to use the non-formula interface to
> `train` so that you can avoid creating dummy variables.
>
> On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal <
> Muhammad2.Bilal at live.uwe.ac.uk> wrote:
>
>> Hi Bert,
>>
>> Thanks for the response.
>>
>> I checked the datasets; however, the Hospitals level appears in both of
>> them. See the output below:
>>
>> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>>             sector count(*)
>> 1          Defense        9
>> 2        Hospitals      101
>> 3          Housing       32
>> 4           Others       99
>> 5 Public Buildings       39
>> 6          Schools      148
>> 7      Social Care       10
>> 8   Transportation       27
>> 9            Waste       26
>> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>>             sector count(*)
>> 1          Defense        5
>> 2        Hospitals       47
>> 3          Housing       11
>> 4           Others       44
>> 5 Public Buildings       18
>> 6          Schools       69
>> 7      Social Care        9
>> 8   Transportation        8
>> 9            Waste       12
>>
>> Anything else to try?
>>
>>
>> ________________________________________
>> From: Bert Gunter <bgunter.4567 at gmail.com>
>> Sent: 09 May 2016 01:42:39
>> To: Muhammad Bilal
>> Cc: r-help at r-project.org
>> Subject: Re: [R] Problem while predicting in regression trees
>>
>> It seems that the data that you used for prediction contained a level
>> "Hospitals" for the sector factor that did not appear in the training
>> data (or maybe it's the other way round). Check this.
>>
>> Cheers,
>> Bert
>>
>>
>> Bert Gunter
>>
>> "The trouble with having an open mind is that people keep coming along
>> and sticking things into it."
>> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>>
>>
>> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
>> <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
>> > Hi All,
>> >
>> > I have the following script, which raises an error at the last command.
>> > I am new to R and need some clarification on what is going wrong.
>> >
>> > #Creating the training and testing data sets
>> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
>> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
>> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
>> >
>> >
>> > #Structure of the trainPFI data frame
>> > > str(trainPFI)
>> > *******
>> > 'data.frame': 491 obs. of 16 variables:
>> >  $ project_id       : int  1 2 3 6 7 9 10 12 13 14 ...
>> >  $ project_lat      : num  51.4 51.5 52.2 51.9 52.5 ...
>> >  $ project_lon      : num  -0.642 -1.85 0.08 -0.401 -1.888 ...
>> >  $ sector           : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
>> >  $ contract_type    : chr  "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
>> >  $ project_duration : int  1826 3652 121 730 730 790 522 819 998 372 ...
>> >  $ project_delay    : int  -323 0 -60 0 0 0 -91 0 0 7 ...
>> >  $ capital_value    : num  6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
>> >  $ project_delay_pct: num  -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
>> >  $ delay_type       : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
>> >
>> > library(caret)
>> > library(e1071)
>> >
>> > set.seed(100)
>> >
>> > tr.control <- trainControl(method="cv", number=10)
>> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
>> >
>> > #Fitting the model using regression tree
>> > tr_m <- train(project_delay ~ project_lon + project_lat +
>> >               project_duration + sector + contract_type + capital_value,
>> >               data = trainPFI, method = "rpart",
>> >               trControl = tr.control, tuneGrid = cp.grid)
>> >
>> > tr_m
>> >
>> > CART
>> > 491 samples
>> > 15 predictor
>> > No pre-processing
>> > Resampling: Cross-Validated (10 fold)
>> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
>> > Resampling results across tuning parameters:
>> > cp RMSE Rsquared
>> > 0.000 441.1524 0.5417064
>> > 0.001 439.6319 0.5451104
>> > 0.002 437.4039 0.5487203
>> > 0.003 432.3675 0.5566661
>> > 0.004 434.2138 0.5519964
>> > 0.005 431.6635 0.5577771
>> > 0.006 436.6163 0.5474135
>> > 0.007 440.5473 0.5407240
>> > 0.008 441.0876 0.5399614
>> > 0.009 441.5715 0.5401718
>> > 0.010 441.1401 0.5407121
>> > RMSE was used to select the optimal model using the smallest value.
>> > The final value used for the model was cp = 0.005.
>> >
>> > #Fetching the best tree
>> > best_tree <- tr_m$finalModel
>> >
>> > All of the commands above worked fine.
>> >
>> > However, the next command raises an error when the fitted model is used
>> > to make predictions:
>> > best_tree_pred <- predict(best_tree, newdata = testPFI)
>> > Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
>> >
>> > Can someone guide me on how to resolve this issue?
>> >
>> > Any help will be highly appreciated.
>> >
>> > Many Thanks and
>> >
>> > Kind Regards
>> >
>> >
>> > ______________________________________________
>> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
>> > https://stat.ethz.ch/mailman/listinfo/r-help
>> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
>> > and provide commented, minimal, self-contained, reproducible code.