Muhammad Bilal
Muhammad2.Bilal at live.uwe.ac.uk
Tue May 10 12:17:26 CEST 2016
Many thanks Max for these valuable suggestions.
From: Max Kuhn <mxkuhn at gmail.com>
Sent: 09 May 2016 23:22:30
To: Muhammad Bilal
Cc: Bert Gunter; r-help at r-project.org
Subject: Re: [R] Problem while predicting in regression trees
I've brought this up numerous times... you shouldn't use `predict.rpart` (or whatever modeling function) from the `finalModel` object. That object has no idea what was done to the data prior to its invocation.
The issue here is that `train(formula)` converts the factors to dummy variables. `rpart` does not require that and the `finalModel` object has no idea that that happened. Using `predict.train` works just fine so why not use it?
> table(predict(tr_m, newdata = testPFI))

-2617.42857142857 -1786.76923076923 -1777.58333333333           -1217.3
                3                 3                 6                 3
-886.666666666667          -408.375            -375.7 -240.307692307692
                5                 1                 4                 5
-201.612903225806 -19.6071428571429  30.8083333333333              43.9
               30                72                66                 9
            151.5  209.647058823529
                6                28
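To illustrate the point about dummy variables, here is a minimal base-R sketch. It is not part of the original message: the `toy` data frame and its values are made up, and `model.matrix` is used only as a stand-in for the kind of expansion a formula interface performs. It shows how a factor column such as `sector` turns into columns like `sectorHospitals`, which do not exist in the raw data frame.

```r
# Toy data; column names mirror the thread, values are invented.
toy <- data.frame(
  project_delay = c(-323, 0, -60, 7, 12),
  sector = factor(c("Defense", "Hospitals", "Schools", "Hospitals", "Others")),
  capital_value = c(6.7, 5.8, 21.8, 24.2, 40.7)
)

# Expand the formula: every non-reference factor level becomes
# its own 0/1 column named <variable><level>.
mm <- model.matrix(project_delay ~ sector + capital_value, data = toy)
dummy_cols <- setdiff(colnames(mm), "(Intercept)")
print(dummy_cols)

# The raw data frame only has `sector`, not `sectorHospitals`, so a model
# fitted on the expanded columns cannot find that name in the raw data.
has_dummy_in_raw <- "sectorHospitals" %in% names(toy)
print(has_dummy_in_raw)
```

A model fitted on the expanded columns therefore needs the same expansion applied to new data, which is what calling `predict` on the `train` object takes care of.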
On Mon, May 9, 2016 at 2:46 PM, Muhammad Bilal wrote:
Please find the sample dataset attached along with R code pasted below to reproduce the issue.
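#Loading the required packages (sample.split comes from caTools)
library(caTools)
library(caret)
#Loading the data frame
pfi <- read.csv("pfi_data.csv")
#Splitting the data into training and test sets
#sample.split expects a vector of labels, so the outcome column is passed rather than the whole data frame
split <- sample.split(pfi$project_delay, SplitRatio = 0.7)
trainPFI <- subset(pfi, split == TRUE)
testPFI <- subset(pfi, split == FALSE)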
#Cross validating the decision trees
tr.control <- trainControl(method="repeatedcv", number=20)
cp.grid <- expand.grid(.cp = (0:10)*0.001)
tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + sector + contract_type + capital_value, data = trainPFI, method="rpart", trControl=tr.control, tuneGrid = cp.grid)
#Displaying the train results
#Fetching the best tree
best_tree <- tr_m$finalModel
#Plotting the best tree
#Using the best tree to make predictions [This command raises the error]
best_tree_pred <- predict(best_tree, newdata = testPFI)
#Calculating the SSE
best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
From: Max Kuhn <mxkuhn at gmail.com>
Sent: 09 May 2016 17:22:22
To: Muhammad Bilal
Cc: Bert Gunter; r-help at r-project.org
Subject: Re: [R] Problem while predicting in regression trees
It is extremely difficult to tell what the issue might be without a reproducible example.
The only thing that I can suggest is to use the non-formula interface to `train` so that you can avoid creating dummy variables.
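A rough sketch of what the non-formula interface could look like is given below. It is not from the original message: the synthetic `toy` data, the names `train_df`, `test_df`, `predictors` and `fit`, and the 5-fold setting are all assumptions made for illustration, and the code needs the caret and rpart packages installed.

```r
library(caret)  # method = "rpart" also requires the rpart package

set.seed(100)
n <- 200
# Synthetic stand-in for the thread's data; values are invented.
toy <- data.frame(
  project_duration = sample(100:4000, n, replace = TRUE),
  capital_value    = round(runif(n, 1, 100), 1),
  sector           = factor(sample(c("Defense", "Hospitals", "Schools"),
                                   n, replace = TRUE))
)
toy$project_delay <- ifelse(toy$sector == "Hospitals", 50, -20) +
  rnorm(n, sd = 30)

# Simple random 70/30 split by row index.
in_train <- sample(n, size = round(0.7 * n))
train_df <- toy[in_train, ]
test_df  <- toy[-in_train, ]
predictors <- c("project_duration", "capital_value", "sector")

# Non-formula interface: predictors as a data frame, outcome as a vector,
# so the factor column is handed over without dummy-variable expansion.
fit <- train(
  x = train_df[, predictors],
  y = train_df$project_delay,
  method = "rpart",
  trControl = trainControl(method = "cv", number = 5),
  tuneGrid = expand.grid(cp = (0:10) * 0.001)
)

# Predict through the train object rather than through fit$finalModel.
pred <- predict(fit, newdata = test_df[, predictors])
sse <- sum((pred - test_df$project_delay)^2)
```

Character columns (such as `contract_type` in the posted `str` output) would likely need converting to factors before being passed this way.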
On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal wrote:
Hi Bert,
Thanks for the response.
I checked the datasets; however, the Hospitals level appears in both of them. See the output below:
> sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
            sector count(*)
1          Defense        9
2        Hospitals      101
3          Housing       32
4           Others       99
5 Public Buildings       39
6          Schools      148
7      Social Care       10
8   Transportation       27
9            Waste       26
> sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
            sector count(*)
1          Defense        5
2        Hospitals       47
3          Housing       11
4           Others       44
5 Public Buildings       18
6          Schools       69
7      Social Care        9
8   Transportation        8
9            Waste       12
Anything else to try?
From: Bert Gunter
Sent: 09 May 2016 01:42:39
To: Muhammad Bilal
Cc: r-help at r-project.org
Subject: Re: [R] Problem while predicting in regression trees
It seems that the data that you used for prediction contained a level
"Hospitals" for the sector factor that did not appear in the training
data (or maybe it's the other way round). Check this.
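The check suggested above can be sketched in base R as follows. This is only an illustration: the two `sector` vectors are made up, and in practice they would be the `sector` columns of the training and test data frames.

```r
# Made-up sector values standing in for the training and test columns.
train_sector <- factor(c("Defense", "Schools", "Housing", "Schools"))
test_sector  <- factor(c("Hospitals", "Schools", "Defense"))

# Levels that actually occur in each set
# (droplevels() discards declared but unused levels).
train_levels <- levels(droplevels(train_sector))
test_levels  <- levels(droplevels(test_sector))

# Levels present in one set but missing from the other.
unseen_in_train <- setdiff(test_levels, train_levels)
unseen_in_test  <- setdiff(train_levels, test_levels)
print(unseen_in_train)
print(unseen_in_test)
```

Two empty results would mean that both sets contain the same levels.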
On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
> Hi All,
> I have the following script, which raises an error at the last command. I am new to R and would appreciate some clarification on what is going wrong.
> #Creating the training and testing data sets
> splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> testPFI <- subset(pfi_v3, splitFlag==FALSE)
> #Structure of the trainPFI data frame
>> str(trainPFI)
> *******
> 'data.frame': 491 obs. of 16 variables:
> $ project_id : int 1 2 3 6 7 9 10 12 13 14 ...
> $ project_lat : num 51.4 51.5 52.2 51.9 52.5 ...
> $ project_lon : num -0.642 -1.85 0.08 -0.401 -1.888 ...
> $ sector : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
> $ contract_type : chr "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
> $ project_duration : int 1826 3652 121 730 730 790 522 819 998 372 ...
> $ project_delay : int -323 0 -60 0 0 0 -91 0 0 7 ...
> $ capital_value : num 6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
> $ project_delay_pct : num -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
> $ delay_type : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
> library(caret)
> library(e1071)
> set.seed(100)
> tr.control <- trainControl(method="cv", number=10)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> #Fitting the model using regression tree
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration + sector + contract_type + capital_value, data = trainPFI, method="rpart", trControl=tr.control, tuneGrid = cp.grid)
> tr_m
> 491 samples
> 15 predictor
> No pre-processing
> Resampling: Cross-Validated (10 fold)
> Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> Resampling results across tuning parameters:
>   cp     RMSE      Rsquared
>   0.000  441.1524  0.5417064
>   0.001  439.6319  0.5451104
>   0.002  437.4039  0.5487203
>   0.003  432.3675  0.5566661
>   0.004  434.2138  0.5519964
>   0.005  431.6635  0.5577771
>   0.006  436.6163  0.5474135
>   0.007  440.5473  0.5407240
>   0.008  441.0876  0.5399614
>   0.009  441.5715  0.5401718
>   0.010  441.1401  0.5407121
> RMSE was used to select the optimal model using the smallest value.
> The final value used for the model was cp = 0.005.
> #Fetching the best tree
> best_tree <- tr_m$finalModel
> All of the commands above worked fine.
> However, the next command raises an error when the fitted model is used to make predictions:
> best_tree_pred <- predict(best_tree, newdata = testPFI)
> Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
> Can someone advise me on how to resolve this issue?
> Any help will be highly appreciated.
