[R] Decision Tree Issue: Why does tree() not pick all variables for the nodes
nguy2952 University of Minnesota
nguy2952 at umn.edu
Wed Jun 6 04:35:27 CEST 2018
I am working on a project at my workplace and I am running into some
issues with my decision tree analysis. THIS IS NOT A HOMEWORK ASSIGNMENT.
Sample dataset
Sample dataset (all columns for the same 14 rows):

PRODUCT_SUB_LINE_DESCR  MAJOR_CATEGORY_DESCR  CUST_REGION_DESCR     Sales   QtySold  MFGCOST  MarginDollars  new_ProductName
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     209.97   3       134.55    72.72         no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     -76.15  -1       -44.85   -30.4          no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     275.6    2       162.5    109.84         no
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     138.7    1        81.25    55.82         no
SUNDRY                  PREVENTIVE            SOUTH CENTRAL REGION  226      2       136       87.28         no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION     115      1        68       45.64         no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION     210.7    2       136       71.98         no
SUNDRY                  SMALL EQUIP           NORTH CENTRAL REGION   29      1        18.85     9.77         no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   29      1        18.85     9.77         no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   46.32   2        37.7      7.86         no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  159.86   1       132.4     24.81         no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  441.3    2       264.8    171.2          no
SUNDRY                  COMPOSITE             OHIO VALLEY REGION    209.62   1       132.4     74.57         no
SUNDRY                  COMPOSITE             NORTH EAST REGION     209.62   1       132.4     74.57         no
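For anyone who wants to reproduce the shape of this data without the real file, a tiny stand-in data frame with the same column layout can be built in base R (values copied from a few of the rows above, purely illustrative):

```r
# Toy stand-in with the same columns as the sample dataset above.
toy <- data.frame(
  PRODUCT_SUB_LINE_DESCR = factor(c("SUNDRY", "SUNDRY", "SUNDRY")),
  MAJOR_CATEGORY_DESCR   = factor(c("SMALL EQUIP", "SMALL EQUIP", "PREVENTIVE")),
  CUST_REGION_DESCR      = factor(c("NORTH EAST REGION", "SOUTH EAST REGION",
                                    "SOUTH CENTRAL REGION")),
  Sales           = c(209.97, -76.15, 226),
  QtySold         = c(3L, -1L, 2L),
  MFGCOST         = c(134.55, -44.85, 136),
  MarginDollars   = c(72.72, -30.4, 87.28),
  new_ProductName = factor(c("no", "no", "no"))
)
str(toy)  # same column types as str(new_Dataset) below
```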
1) My tree has only two terminal nodes, and here is the summary that shows it:
>summary(tree_model)
Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes: 2
Residual mean deviance: 0 = 0 / 41140
Misclassification error rate: 0 = 0 / 41146
2) I created a new data frame that keeps only factors with fewer than
22 levels. There is one factor with 25 levels, but tree() does not
give an error, so I think the algorithm accepts 25 levels.
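As far as I know, tree() documents a limit of 32 levels for factor predictors, which would explain why the 25-level factor is accepted. A minimal base-R sketch of the column filtering (the toy frame and the `max_levels` name are made up for illustration):

```r
# Keep only columns tree() can use: numeric columns and factors with at
# most 32 levels (tree()'s documented limit for factor predictors).
max_levels <- 32
df <- data.frame(
  small_factor = factor(sample(letters[1:5], 50, replace = TRUE),
                        levels = letters[1:5]),
  big_factor   = factor(sample(sprintf("lvl%02d", 1:40), 50, replace = TRUE),
                        levels = sprintf("lvl%02d", 1:40)),
  x            = rnorm(50)
)
keep <- vapply(df, function(col) !is.factor(col) || nlevels(col) <= max_levels,
               logical(1))
df_ok <- df[, keep, drop = FALSE]
names(df_ok)  # big_factor (40 levels) has been dropped
```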
>str(new_Dataset)
'data.frame': 51433 obs. of 7 variables:
 $ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ MAJOR_CATEGORY_DESCR  : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 23 21 21 21 23 23 23 ...
 $ CUST_REGION_DESCR     : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 6 6 3 5 6 6 2 1 1 ...
 $ Sales                 : num 210 -76.2 275.6 138.7 226 ...
 $ QtySold               : int 3 -1 2 1 2 1 2 1 1 2 ...
 $ MFGCOST               : num 134.6 -44.9 162.5 81.2 136 ...
 $ MarginDollars         : num 72.7 -30.4 109.8 55.8 87.3 ...
3) Here is how I set up my analysis

library(tree)

# I chose product name as my target attribute (maybe that is why it
# appears at the root node?)
new_ProductName = factor(ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR ==
                                "PRIVATE LABEL", "yes", "no"))
data = data.frame(new_Dataset, new_ProductName)
set.seed(100)
train = sample(1:nrow(data), 0.8 * nrow(data))  # training row indices
training_data = data[train, ]    # training data
testing_data  = data[-train, ]   # testing data
# fit the tree model using the training data
tree_model = tree(new_ProductName ~ ., data = training_data)
summary(tree_model)
plot(tree_model)
text(tree_model, pretty = 0)
out = predict(tree_model)  # predict on the training data
# actuals
input.newproduct = as.character(training_data$new_ProductName)
# predicted class = column with the highest probability
pred.newproduct = colnames(out)[max.col(out, ties.method = "first")]
mean(input.newproduct != pred.newproduct)  # misclassification rate
# cross-validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass)  # run cross-validation
plot(cv_Tree)  # plot the CV results
plot(cv_Tree$size, cv_Tree$dev, type = "b")
# set 'best' to the size corresponding to the lowest value in the plot above
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)
out = predict(treePruneMod)  # predict the training data with the pruned tree
# predicted
pred.newproduct = colnames(out)[max.col(out, ties.method = "random")]
# calculate the misclassification error
mean(training_data$new_ProductName != pred.newproduct)
# predict the test data with the pruned tree
out = predict(treePruneMod, testing_data, type = "class")
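For what it is worth, the misclassification rate computed above can also be read off a confusion table; a minimal base-R sketch with made-up label vectors (not the real data):

```r
# Made-up actual and predicted labels, just to show the bookkeeping.
actual <- c("no", "no", "yes", "yes", "no")
pred   <- c("no", "yes", "yes", "yes", "no")

mean(actual != pred)           # direct misclassification rate: 0.2

tab <- table(actual, pred)     # confusion table
1 - sum(diag(tab)) / sum(tab)  # same rate from the table: 0.2
```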
4) I have never done this before; I watched a couple of YouTube videos and
started from there. I welcome advice, explanations, and criticism. Please
help me through this process; it has been challenging for me.
> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)
no yes
Handpieces 164 0
PRIVATE LABEL 0 14802
SUNDRY 36467 0
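The pattern in that table can be checked programmatically: if every level of PRODUCT_SUB_LINE_DESCR maps to exactly one class, then a single split on that variable already classifies perfectly, which would be consistent with a two-node tree and zero error. A small base-R sketch of the check (toy vectors, not the real data):

```r
# Toy version of the cross-tab above: the response is computed directly
# from one predictor, so each predictor level maps to exactly one class.
x <- factor(c("Handpieces", "PRIVATE LABEL", "PRIVATE LABEL",
              "SUNDRY", "SUNDRY", "SUNDRY"))
y <- factor(ifelse(x == "PRIVATE LABEL", "yes", "no"))
tab <- table(x, y)
rowSums(tab > 0)            # each row has exactly one non-zero entry
all(rowSums(tab > 0) == 1)  # TRUE: one variable separates the classes perfectly
```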
Best,
Hugh N
-------------- next part --------------
[Attachment: Rplot01.png, image/png, 13333 bytes]
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20180605/07f77314/attachment.png>