[R] Decision Tree Issue: Why does tree() not pick all variables for the nodes
nguy2952 University of Minnesota
nguy2952 at umn.edu
Wed Jun 6 04:35:27 CEST 2018
I am working on a project at my workplace and I am running into some
issues with my decision tree analysis. THIS IS NOT A HOMEWORK ASSIGNMENT.
Sample dataset
Sample dataset (all columns for the same 14 rows):

PRODUCT_SUB_LINE_DESCR  MAJOR_CATEGORY_DESCR  CUST_REGION_DESCR     Sales   QtySold  MFGCOST  MarginDollars  new_ProductName
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     209.97   3       134.55    72.72         no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     -76.15  -1       -44.85   -30.4          no
SUNDRY                  SMALL EQUIP           SOUTH EAST REGION     275.6    2       162.5    109.84         no
SUNDRY                  SMALL EQUIP           NORTH EAST REGION     138.7    1        81.25    55.82         no
SUNDRY                  PREVENTIVE            SOUTH CENTRAL REGION  226      2       136       87.28         no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION     115      1        68       45.64         no
SUNDRY                  PREVENTIVE            SOUTH EAST REGION     210.7    2       136       71.98         no
SUNDRY                  SMALL EQUIP           NORTH CENTRAL REGION   29      1        18.85     9.77         no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   29      1        18.85     9.77         no
SUNDRY                  SMALL EQUIP           MOUNTAIN WEST REGION   46.32   2        37.7      7.86         no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  159.86   1       132.4     24.81         no
SUNDRY                  COMPOSITE             NORTH CENTRAL REGION  441.3    2       264.8    171.2          no
SUNDRY                  COMPOSITE             OHIO VALLEY REGION    209.62   1       132.4     74.57         no
SUNDRY                  COMPOSITE             NORTH EAST REGION     209.62   1       132.4     74.57         no
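For anyone who wants to reproduce the shape of this data without the real file, a tiny stand-in data frame with the same column layout can be built in base R (values copied from a few of the rows above, purely illustrative):

```r
# Toy stand-in with the same columns as the sample dataset above.
toy <- data.frame(
  PRODUCT_SUB_LINE_DESCR = factor(c("SUNDRY", "SUNDRY", "SUNDRY")),
  MAJOR_CATEGORY_DESCR   = factor(c("SMALL EQUIP", "SMALL EQUIP", "PREVENTIVE")),
  CUST_REGION_DESCR      = factor(c("NORTH EAST REGION", "SOUTH EAST REGION",
                                    "SOUTH CENTRAL REGION")),
  Sales           = c(209.97, -76.15, 226),
  QtySold         = c(3L, -1L, 2L),
  MFGCOST         = c(134.55, -44.85, 136),
  MarginDollars   = c(72.72, -30.4, 87.28),
  new_ProductName = factor(c("no", "no", "no"))
)
str(toy)  # same column types as str(new_Dataset) below
```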
1) My tree has only two terminal nodes, and here is the summary that shows it:
>summary(tree_model)
Classification tree:
tree(formula = new_ProductName ~ ., data = training_data)
Variables actually used in tree construction:
[1] "PRODUCT_SUB_LINE_DESCR"
Number of terminal nodes: 2
Residual mean deviance: 0 = 0 / 41140
Misclassification error rate: 0 = 0 / 41146
2) I created a new data frame that keeps only factors with fewer than
22 levels. There is one factor with 25 levels, but tree() does not
give an error, so I think the algorithm accepts 25 levels.
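As far as I know, tree() documents a limit of 32 levels for factor predictors, which would explain why the 25-level factor is accepted. A minimal base-R sketch of the column filtering (the toy frame and the `max_levels` name are made up for illustration):

```r
# Keep only columns tree() can use: numeric columns and factors with at
# most 32 levels (tree()'s documented limit for factor predictors).
max_levels <- 32
df <- data.frame(
  small_factor = factor(sample(letters[1:5], 50, replace = TRUE),
                        levels = letters[1:5]),
  big_factor   = factor(sample(sprintf("lvl%02d", 1:40), 50, replace = TRUE),
                        levels = sprintf("lvl%02d", 1:40)),
  x            = rnorm(50)
)
keep <- vapply(df, function(col) !is.factor(col) || nlevels(col) <= max_levels,
               logical(1))
df_ok <- df[, keep, drop = FALSE]
names(df_ok)  # big_factor (40 levels) has been dropped
```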
>str(new_Dataset)
'data.frame': 51433 obs. of 7 variables:
 $ PRODUCT_SUB_LINE_DESCR: Factor w/ 3 levels "Handpieces","PRIVATE LABEL",..: 3 3 3 3 3 3 3 3 3 3 ...
 $ MAJOR_CATEGORY_DESCR  : Factor w/ 25 levels "AIR ABRASION",..: 23 23 23 23 21 21 21 23 23 23 ...
 $ CUST_REGION_DESCR     : Factor w/ 7 levels "MOUNTAIN WEST REGION",..: 3 6 6 3 5 6 6 2 1 1 ...
 $ Sales                 : num 210 -76.2 275.6 138.7 226 ...
 $ QtySold               : int 3 -1 2 1 2 1 2 1 1 2 ...
 $ MFGCOST               : num 134.6 -44.9 162.5 81.2 136 ...
 $ MarginDollars         : num 72.7 -30.4 109.8 55.8 87.3 ...
3) Here is how I set up my analysis

library(tree)

# I chose product name as my target attribute (maybe that is why it
# appears at the root node?)
new_ProductName = factor(ifelse(new_Dataset$PRODUCT_SUB_LINE_DESCR ==
                                "PRIVATE LABEL", "yes", "no"))
data = data.frame(new_Dataset, new_ProductName)
set.seed(100)
train = sample(1:nrow(data), 0.8 * nrow(data))  # training row indices
training_data = data[train, ]    # training data
testing_data  = data[-train, ]   # testing data
# fit the tree model using the training data
tree_model = tree(new_ProductName ~ ., data = training_data)
summary(tree_model)
plot(tree_model)
text(tree_model, pretty = 0)
out = predict(tree_model)  # predict on the training data
# actuals
input.newproduct = as.character(training_data$new_ProductName)
# predicted class = column with the highest probability
pred.newproduct = colnames(out)[max.col(out, ties.method = "first")]
mean(input.newproduct != pred.newproduct)  # misclassification rate
# cross-validation to see how much we need to prune the tree
set.seed(400)
cv_Tree = cv.tree(tree_model, FUN = prune.misclass)  # run cross-validation
plot(cv_Tree)  # plot the CV results
plot(cv_Tree$size, cv_Tree$dev, type = "b")
# set 'best' to the size corresponding to the lowest value in the plot above
treePruneMod = prune.misclass(tree_model, best = 9)
plot(treePruneMod)
text(treePruneMod, pretty = 0)
out = predict(treePruneMod)  # predict the training data with the pruned tree
# predicted
pred.newproduct = colnames(out)[max.col(out, ties.method = "random")]
# calculate the misclassification error
mean(training_data$new_ProductName != pred.newproduct)
# predict the test data with the pruned tree
out = predict(treePruneMod, testing_data, type = "class")
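For what it is worth, the misclassification rate computed above can also be read off a confusion table; a minimal base-R sketch with made-up label vectors (not the real data):

```r
# Made-up actual and predicted labels, just to show the bookkeeping.
actual <- c("no", "no", "yes", "yes", "no")
pred   <- c("no", "yes", "yes", "yes", "no")

mean(actual != pred)           # direct misclassification rate: 0.2

tab <- table(actual, pred)     # confusion table
1 - sum(diag(tab)) / sum(tab)  # same rate from the table: 0.2
```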
4) I have never done this before; I watched a couple of YouTube videos and
started from there. I welcome advice, explanations, and criticism. Please
help me through this process; it has been challenging for me.
> table(data$PRODUCT_SUB_LINE_DESCR, data$new_ProductName)
no yes
Handpieces 164 0
PRIVATE LABEL 0 14802
SUNDRY 36467 0
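The pattern in that table can be checked programmatically: if every level of PRODUCT_SUB_LINE_DESCR maps to exactly one class, then a single split on that variable already classifies perfectly, which would be consistent with a two-node tree and zero error. A small base-R sketch of the check (toy vectors, not the real data):

```r
# Toy version of the cross-tab above: the response is computed directly
# from one predictor, so each predictor level maps to exactly one class.
x <- factor(c("Handpieces", "PRIVATE LABEL", "PRIVATE LABEL",
              "SUNDRY", "SUNDRY", "SUNDRY"))
y <- factor(ifelse(x == "PRIVATE LABEL", "yes", "no"))
tab <- table(x, y)
rowSums(tab > 0)            # each row has exactly one non-zero entry
all(rowSums(tab > 0) == 1)  # TRUE: one variable separates the classes perfectly
```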
Best,
Hugh N
-------------- next part --------------
[Attachment: Rplot01.png, image/png, 13333 bytes]
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20180605/07f77314/attachment.png>