[R] Creating data frame of predicted and actual values in R for plotting
Muhammad Bilal
Muhammad2.Bilal at live.uwe.ac.uk
Wed May 11 01:45:50 CEST 2016
Hi All,
I have the following dataset:
> str(pfi_v3)
'data.frame': 714 obs. of 8 variables:
$ project_id : int 1 2 3 4 5 6 7 8 9 10 ...
$ project_lat : num 51.4 51.5 52.2 51.5 53.5 ...
$ project_lon : num -0.642 -1.85 0.08 0.126 -1.392 ...
$ sector : Factor w/ 9 levels "Defense","Hospitals",..: 4 4 4 6 6 6 6 6 6 6 ...
$ project_duration : int 1826 3652 121 520 1087 730 730 730 790 522 ...
$ project_delay : int -323 0 -60 0 0 0 0 0 0 -91 ...
$ capital_value : num 6.7 5.8 21.8 47.3 47 24.2 40.7 71.9 10.7 70 ...
$ contract_type : Factor w/ 2 levels "Lumpsum","Turnkey": 2 2 2 2 2 2 2 2 2 2 ...
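I'm using the following commands to create training and test sets:
library(caTools)  # provides sample.split()
# sample.split() expects a vector with one element per row, not the whole data frame
split <- sample.split(pfi_v3$project_delay, SplitRatio = 0.8)
trainPFI <- subset(pfi_v3, split == TRUE)
testPFI <- subset(pfi_v3, split == FALSE)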
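For reference, sample.split() from caTools takes a vector of labels (one element per row) and returns one logical flag per element. The same 80/20 idea can be sketched in base R as below. The toy data frame, the seed, and the names toy, in_train, train_toy and test_toy are assumptions for illustration only, not the real pfi_v3.

```r
# Toy stand-in for pfi_v3 (hypothetical values, for illustration only)
set.seed(42)
toy <- data.frame(project_id = 1:100,
                  project_delay = round(rnorm(100, mean = -50, sd = 100)))

# One logical flag per ROW: TRUE = training row, FALSE = test row
n_train  <- floor(0.8 * nrow(toy))
in_train <- seq_len(nrow(toy)) %in% sample(nrow(toy), n_train)

train_toy <- toy[in_train, ]
test_toy  <- toy[!in_train, ]
```

The key point is that the flag vector has as many elements as the data frame has rows, so every row lands in exactly one of the two sets.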
I am using several predictive models to estimate project delay.
The commands are given below:
1. Simple linear regression
lm_m <- lm(project_delay ~ project_lon +
             project_lat +
             project_duration +
             sector +
             contract_type +
             capital_value,
           data = trainPFI)
# predict from the model fitted above (lm_m)
lm_pred <- predict(lm_m, newdata = testPFI)
2. Regression tree
tree_m <- rpart(project_delay ~ project_lon +
                  project_lat +
                  project_duration +
                  sector +
                  contract_type +
                  capital_value,
                data = trainPFI)
# predict from the tree fitted above (tree_m)
tree_pred <- predict(tree_m, newdata = testPFI)
3. Cp-optimised regression tree
train_m <- train(project_delay ~ project_lon +
                   project_lat +
                   project_duration +
                   sector +
                   contract_type +
                   capital_value,
                 data = trainPFI,
                 method = "rpart",
                 trControl = tr.control, tuneGrid = cp.grid)
# predict from the tuned model fitted above (train_m)
train_pred <- predict(train_m, newdata = testPFI)
4. Random Forest
rf_m <- randomForest(project_delay ~ project_lon +
                       project_lat +
                       project_duration +
                       sector +
                       contract_type +
                       capital_value,
                     data = trainPFI,
                     importance = TRUE,
                     ntree = 2000)
rf_pred <- predict(rf_m, newdata = testPFI)
5. Conditional Forest
cf_m <- cforest(project_delay ~ project_lon +
                  project_lat +
                  project_duration +
                  sector +
                  contract_type +
                  capital_value,
                data = trainPFI,
                controls = cforest_unbiased(ntree = 2000, mtry = 3))
cf_pred <- predict(cf_m, testPFI, OOB = TRUE, type = "response")
That is it.
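The tr.control and cp.grid objects passed to train() in model 3 are not defined in the commands above. A minimal sketch of how such objects are commonly set up with caret follows. The 10-fold cross-validation and the range of cp values are assumptions, not the settings actually used.

```r
library(caret)  # provides trainControl() and train()

# Resampling scheme for tuning: 10-fold cross-validation (assumed)
tr.control <- trainControl(method = "cv", number = 10)

# Candidate complexity-parameter values for rpart (range is assumed);
# the column must be named "cp" to match the rpart tuning parameter
cp.grid <- expand.grid(cp = seq(0.001, 0.1, length.out = 100))
```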
Now I want to create a new data frame to combine the actual and predicted values such that the new frame has the following columns:
$project_id
$actual_delay
$lm_predicted_delay
$tree_predicted_delay
$train_predicted_delay
$rf_predicted_delay
$cf_predicted_delay
I want to use this data frame to draw a line chart comparing the predictions.
How can I achieve this?
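To make the target concrete, here is a rough sketch of the kind of data frame and chart meant, using toy stand-ins rather than the real objects. The names test_toy, lm_toy, tree_toy, results and value_cols and all the numbers are hypothetical. In the real case the inputs would be testPFI and the *_pred vectors above, which should line up row by row because each was predicted with testPFI as the new data.

```r
# Toy stand-ins (hypothetical): in the real case these would be
# testPFI, lm_pred, tree_pred, train_pred, rf_pred and cf_pred.
set.seed(1)
test_toy <- data.frame(project_id = 1:20,
                       project_delay = round(rnorm(20, -50, 100)))
lm_toy   <- test_toy$project_delay + rnorm(20, sd = 40)
tree_toy <- test_toy$project_delay + rnorm(20, sd = 60)

# One row per test project, one column per source of values.
# as.numeric() flattens predictions in case a predict() method
# returns a one-column matrix instead of a plain vector.
results <- data.frame(
  project_id           = test_toy$project_id,
  actual_delay         = test_toy$project_delay,
  lm_predicted_delay   = as.numeric(lm_toy),
  tree_predicted_delay = as.numeric(tree_toy)
)

# Line chart: one line per value column, project_id on the x axis
value_cols <- setdiff(names(results), "project_id")
matplot(results$project_id, as.matrix(results[, value_cols]),
        type = "l", lty = 1, col = seq_along(value_cols),
        xlab = "project_id", ylab = "delay")
legend("topright", legend = value_cols, lty = 1,
       col = seq_along(value_cols), cex = 0.7)
```

The remaining columns (train, rf and cf predictions) would be added to data.frame() in the same way.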
Any help will be highly appreciated.
Many Thanks and
Kind Regards
--
Muhammad Bilal
Research Fellow and Doctoral Researcher,
Bristol Enterprise, Research, and Innovation Centre (BERIC),
University of the West of England (UWE),
Frenchay Campus,
Bristol,
BS16 1QY
muhammad2.bilal at live.uwe.ac.uk