[R] Training nnet in two ways, trying to understand the performance difference - with (I hope!) commented, minimal, self-contained, reproducible code
Tony Breyal
tony.breyal at googlemail.com
Wed Feb 18 12:40:06 CET 2009
Dear all,
Objective: I am trying to learn about neural networks. I want to see
if I can train an artificial neural network model to discriminate
between spam and nonspam emails.
Problem: I created my own model (example 1 below) and got an error of
about 7.7%. I created the same model using the Rattle package (example
2 below, based on Rattle's log script) and got a much better error of
about 0.073%.
Question 1: I don't understand why the Rattle script gives a better
result. I must therefore be doing something wrong in my own script
(example 1) and would appreciate some insight :-)
Question 2: As Rattle gives a much better result, I would be happy to
use its R code instead of my own. How can I interpret its
predictions as being either 'spam' or 'nonspam'? I have looked
at the type='class' parameter in ?predict.nnet, but I believe it
doesn't apply to this situation.
Below I give commented, minimal, self-contained and reproducible code.
(If you ignore the output, it really is very few lines of code and
therefore minimal, I believe.)
## load library
>library(nnet)
## Load in spam dataset from package kernlab
>data(list = "spam", package = "kernlab")
>set.seed(42)
>my.sample <- sample(nrow(spam), 3221)
>spam.train <- spam[my.sample, ]
>spam.test <- spam[-my.sample, ]
## Example 1 - my own code
# train artificial neural network (nn1)
>( nn1 <- nnet(type~., data=spam.train, size=3, decay=0.1, maxit=1000) )
# predict spam.test dataset on nn1
> ( nn1.pr.test <- predict(nn1, spam.test, type='class') )
[1] "spam" "spam" "spam" "spam" "nonspam" "spam"
"spam"
[etc...]
# error matrix
>(nn1.test.tab<-table(spam.test$type, nn1.pr.test, dnn=c('Actual', 'Predicted')))
         Predicted
Actual    nonspam spam
  nonspam     778   43
  spam         63  496
# Calculate overall error percentage ~ 7.68%
>(nn1.test.perf <- 100 * (nn1.test.tab[2] + nn1.test.tab[3]) / sum(nn1.test.tab))
[1] 7.68116
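# As a sanity check (it is only my assumption that the two calculations are
# equivalent), I think the same figure can be obtained by comparing the
# predicted labels directly against the actual ones, without indexing the table:
> 100 * mean(nn1.pr.test != spam.test$type)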
## Example 2 - code based on Rattle's log script
# train artificial neural network (nn2)
>nn2<-nnet(as.numeric(type)-1~., data=spam.train, size=3, decay=0.1, maxit=1000)
# predict spam.test dataset on nn2.
# ?predict.nnet does have the parameter type='class', but I can't use
# that here as an option
> (nn2.pr.test <- predict(nn2, spam.test))
[,1]
3 0.984972396013
4 0.931149225918
10 0.930001139978
13 0.923271300707
21 0.102282256315
[etc...]
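# Regarding Question 2, my current guess (and it is only a guess) is that
# these numbers estimate P(spam), since as.numeric(type)-1 codes nonspam as 0
# and spam as 1 (assuming the factor levels are ordered nonspam, spam).
# If that is right, then cutting at 0.5 should give back class labels, e.g.:
> nn2.pr.class <- ifelse(nn2.pr.test[, 1] > 0.5, "spam", "nonspam")
> head(nn2.pr.class)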
# error matrix
> (nn2.test.tab <- round(100*table(nn2.pr.test, spam.test$type, dnn=c("Predicted", "Actual"))/length(nn2.pr.test)))
                      Actual
Predicted              nonspam spam
  -0.741896935969825         0    0
  -0.706473834678304         0    0
  -0.595327594045746         0    0
  [etc...]
# calculate overall error percentage. I am not sure how this line works,
# and I think it should be multiplied by 100. I got this from Rattle's
# log script.
> (function(x){return((x[1,2]+x[2,1])/sum(x))})(table(nn2.pr.test, spam.test$type, dnn=c("Predicted", "Actual")))
[1] 0.0007246377
# I'm guessing the above should be ~0.072%
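# If the 0.5 cutoff above is sensible, I would have expected the comparable
# error for example 2 to be worked out the same way as in example 1 (again,
# just my guess; I may be misreading Rattle's intent):
> (nn2.test.tab2 <- table(spam.test$type, nn2.pr.class, dnn=c('Actual', 'Predicted')))
> 100 * (nn2.test.tab2[2] + nn2.test.tab2[3]) / sum(nn2.test.tab2)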
I know the above probably seems complicated, but any help that can be
offered would be much appreciated.
Thank you kindly in advance,
Tony
OS = Windows Vista Ultimate, running R in admin mode
> sessionInfo()
R version 2.8.1 (2008-12-22)
i386-pc-mingw32
locale:
LC_COLLATE=English_United Kingdom.1252;LC_CTYPE=English_United Kingdom.1252;LC_MONETARY=English_United Kingdom.1252;LC_NUMERIC=C;LC_TIME=English_United Kingdom.1252
attached base packages:
[1] grid stats graphics grDevices utils datasets methods base
other attached packages:
[1] RGtk2_2.12.8 vcd_1.2-2 colorspace_1.0-0 MASS_7.2-45 rattle_2.4.8 nnet_7.2-45
loaded via a namespace (and not attached):
[1] tools_2.8.1