[R] Problem while predicting in regression trees
William Dunlap
wdunlap at tibco.com
Mon May 9 21:27:14 CEST 2016
Why are you predicting from tr_m$finalModel instead of from tr_m?
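
The reason this matters: with the formula interface, `train` expands factors such as `sector` into dummy columns (hence the name `sectorHospitals` in the error quoted below). `tr_m$finalModel` is therefore an rpart object fitted to those columns, and it cannot find them in the raw `testPFI`. Predicting from the `train` object applies the same expansion to `newdata` first. Here is a minimal sketch of the difference, using the built-in `iris` data as a stand-in for the attached file. The data set, the single `cp` value and the 5-fold CV are assumptions made for illustration, not the original poster's setup.

```r
library(caret)   # train(), trainControl(); method = "rpart" also needs rpart installed

set.seed(100)

# Formula interface: the factor Species is expanded into dummy columns
# (Speciesversicolor, Speciesvirginica) before rpart sees the data.
fit <- train(Sepal.Length ~ ., data = iris, method = "rpart",
             trControl = trainControl(method = "cv", number = 5),
             tuneGrid = data.frame(cp = 0.01))

# Predicting from the train object repeats that expansion on newdata.
pred_ok <- predict(fit, newdata = iris)

# Predicting from the bare rpart object skips it, so the dummy columns
# are missing from newdata; capture the error instead of stopping.
pred_bad <- tryCatch(predict(fit$finalModel, newdata = iris),
                     error = function(e) e)

length(pred_ok)              # one prediction per row of iris
inherits(pred_bad, "error")  # TRUE when the raw-data prediction fails
```

In the script quoted below, the equivalent change would be predict(tr_m, newdata = testPFI).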
Bill Dunlap
TIBCO Software
wdunlap tibco.com
On Mon, May 9, 2016 at 11:46 AM, Muhammad Bilal
<Muhammad2.Bilal at live.uwe.ac.uk> wrote:
> Please find the sample dataset attached along with R code pasted below to
> reproduce the issue.
>
>
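> #Loading the required packages
> library(caTools)    # sample.split
> library(caret)      # train, trainControl
> library(rpart.plot) # prp
>
> #Loading the data frame
>
> pfi <- read.csv("pfi_data.csv")
>
> #Splitting the data into training and test sets
> #(sample.split expects the outcome vector, not the whole data frame)
> split <- sample.split(pfi$project_delay, SplitRatio = 0.7)
> trainPFI <- subset(pfi, split == TRUE)
> testPFI <- subset(pfi, split == FALSE)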
>
> #Cross validating the decision trees
> tr.control <- trainControl(method="repeatedcv", number=20)
> cp.grid <- expand.grid(.cp = (0:10)*0.001)
> tr_m <- train(project_delay ~ project_lon + project_lat + project_duration
> + sector + contract_type + capital_value, data = trainPFI, method="rpart",
> trControl=tr.control, tuneGrid = cp.grid)
>
> #Displaying the train results
> tr_m
>
> #Fetching the best tree
> best_tree <- tr_m$finalModel
>
> #Plotting the best tree
> prp(best_tree)
>
> #Using the best tree to make predictions [This command raises the error]
> best_tree_pred <- predict(best_tree, newdata = testPFI)
>
> #Calculating the SSE
> best_tree_pred.sse <- sum((best_tree_pred - testPFI$project_delay)^2)
>
> #Displaying the SSE
> best_tree_pred.sse
>
> ...
>
>
> Many Thanks and
>
>
> Kind Regards
>
>
>
> --
> Muhammad Bilal
> Research Fellow and Doctoral Researcher,
> Bristol Enterprise, Research, and Innovation Centre (BERIC),
> University of the West of England (UWE),
> Frenchay Campus,
> Bristol,
> BS16 1QY
>
> muhammad2.bilal at live.uwe.ac.uk
>
>
> ________________________________
> From: Max Kuhn <mxkuhn at gmail.com>
> Sent: 09 May 2016 17:22:22
> To: Muhammad Bilal
> Cc: Bert Gunter; r-help at r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It is extremely difficult to tell what the issue might be without a
> reproducible example.
>
> The only thing that I can suggest is to use the non-formula interface to
> `train` so that you can avoid creating dummy variables.
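
A rough sketch of what the non-formula interface looks like, again on the built-in `iris` data rather than the poster's file (the data set and tuning values are illustrative assumptions). Because `x` is passed as a data frame, the factor column reaches rpart unexpanded, so the final model can also be used on the raw predictors:

```r
library(caret)   # train(), trainControl(); method = "rpart" also needs rpart installed

set.seed(100)

# Predictors as a data frame (Species stays a factor) and outcome as a vector.
x <- iris[, c("Sepal.Width", "Petal.Length", "Petal.Width", "Species")]
y <- iris$Sepal.Length

fit_xy <- train(x = x, y = y, method = "rpart",
                trControl = trainControl(method = "cv", number = 5),
                tuneGrid = data.frame(cp = 0.01))

# Both the train object and the underlying rpart object accept raw predictors.
pred_train <- predict(fit_xy, newdata = x)
pred_final <- predict(fit_xy$finalModel, newdata = x)
```

Predicting from the `train` object remains the safer habit, since it also applies any pre-processing that `train` was asked to do.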
>
> On Mon, May 9, 2016 at 11:23 AM, Muhammad Bilal
> <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
> Hi Bert,
>
> Thanks for the response.
>
> I checked the datasets; however, the Hospitals level appears in both of
> them. See the output below:
>
> > sqldf("SELECT sector, count(*) FROM trainPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        9
> 2        Hospitals      101
> 3          Housing       32
> 4           Others       99
> 5 Public Buildings       39
> 6          Schools      148
> 7      Social Care       10
> 8   Transportation       27
> 9            Waste       26
> > sqldf("SELECT sector, count(*) FROM testPFI GROUP BY sector")
>             sector count(*)
> 1          Defense        5
> 2        Hospitals       47
> 3          Housing       11
> 4           Others       44
> 5 Public Buildings       18
> 6          Schools       69
> 7      Social Care        9
> 8   Transportation        8
> 9            Waste       12
>
> Anything else to try?
>
>
> ________________________________________
> From: Bert Gunter <bgunter.4567 at gmail.com>
> Sent: 09 May 2016 01:42:39
> To: Muhammad Bilal
> Cc: r-help at r-project.org
> Subject: Re: [R] Problem while predicting in regression trees
>
> It seems that the data that you used for prediction contained a level
> "Hospitals" for the sector factor that did not appear in the training
> data (or maybe it's the other way round). Check this.
>
> Cheers,
> Bert
>
>
> Bert Gunter
>
> "The trouble with having an open mind is that people keep coming along
> and sticking things into it."
> -- Opus (aka Berkeley Breathed in his "Bloom County" comic strip )
>
>
> On Sun, May 8, 2016 at 4:14 PM, Muhammad Bilal
> <Muhammad2.Bilal at live.uwe.ac.uk> wrote:
> > Hi All,
> >
> > I have the following script, which raises an error at the last command.
> > I am new to R and would like some clarification on what is going wrong.
> >
> > #Creating the training and testing data sets
> > splitFlag <- sample.split(pfi_v3, SplitRatio = 0.7)
> > trainPFI <- subset(pfi_v3, splitFlag==TRUE)
> > testPFI <- subset(pfi_v3, splitFlag==FALSE)
> >
> >
> > #Structure of the trainPFI data frame
> >> str(trainPFI)
> > *******
> > 'data.frame': 491 obs. of 16 variables:
> > $ project_id : int 1 2 3 6 7 9 10 12 13 14 ...
> > $ project_lat : num 51.4 51.5 52.2 51.9 52.5 ...
> > $ project_lon : num -0.642 -1.85 0.08 -0.401 -1.888 ...
> > $ sector : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
> > $ contract_type : chr "Turnkey" "Turnkey" "Turnkey" "Turnkey" ...
> > $ project_duration : int 1826 3652 121 730 730 790 522 819 998 372 ...
> > $ project_delay : int -323 0 -60 0 0 0 -91 0 0 7 ...
> > $ capital_value : num 6.7 5.8 21.8 24.2 40.7 10.7 70 24.5 60.5 78 ...
> > $ project_delay_pct : num -17.7 0 -49.6 0 0 0 -17.4 0 0 1.9 ...
> > $ delay_type : Ord.factor w/ 9 levels "7 months early & beyond"<..: 1 5 3 5 5 5 2 5 5 6 ...
> >
> > library(caret)
> > library(e1071)
> >
> > set.seed(100)
> >
> > tr.control <- trainControl(method="cv", number=10)
> > cp.grid <- expand.grid(.cp = (0:10)*0.001)
> >
> > #Fitting the model using regression tree
> > tr_m <- train(project_delay ~ project_lon + project_lat +
> >     project_duration + sector + contract_type + capital_value,
> >     data = trainPFI, method="rpart", trControl=tr.control,
> >     tuneGrid = cp.grid)
> >
> > tr_m
> >
> > CART
> > 491 samples
> > 15 predictor
> > No pre-processing
> > Resampling: Cross-Validated (10 fold)
> > Summary of sample sizes: 443, 442, 441, 442, 441, 442, ...
> > Resampling results across tuning parameters:
> >   cp     RMSE      Rsquared
> >   0.000  441.1524  0.5417064
> >   0.001  439.6319  0.5451104
> >   0.002  437.4039  0.5487203
> >   0.003  432.3675  0.5566661
> >   0.004  434.2138  0.5519964
> >   0.005  431.6635  0.5577771
> >   0.006  436.6163  0.5474135
> >   0.007  440.5473  0.5407240
> >   0.008  441.0876  0.5399614
> >   0.009  441.5715  0.5401718
> >   0.010  441.1401  0.5407121
> > RMSE was used to select the optimal model using the smallest value.
> > The final value used for the model was cp = 0.005.
> >
> > #Fetching the best tree
> > best_tree <- tr_m$finalModel
> >
> > All of the commands above worked fine.
> >
> > However, the next command raises an error when the fitted model is
> > used to make predictions:
> > best_tree_pred <- predict(best_tree, newdata = testPFI)
> > Error in eval(expr, envir, enclos) : object 'sectorHospitals' not found
> >
> > Can someone advise me on how to resolve this issue?
> >
> > Any help will be highly appreciated.
> >
> > Many Thanks and
> >
> > Kind Regards
> >
> >
> > [[alternative HTML version deleted]]
> >
> > ______________________________________________
> > R-help at r-project.org mailing list -- To UNSUBSCRIBE and more, see
> > https://stat.ethz.ch/mailman/listinfo/r-help
> > PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> > and provide commented, minimal, self-contained, reproducible code.