# [R] Nominal variables in SVM?

Achim Zeileis Achim.Zeileis at wu-wien.ac.at
Wed Aug 12 23:21:47 CEST 2009

```
On Wed, 12 Aug 2009, Noah Silverman wrote:

> Hi,
>
> to a more important question.
>
> What is the "best practice" way to feed nominal variables to an SVM?

As some of the previous posters have already indicated: The data structure
for storing categorical (including nominal) variables in R is a "factor".

Your comment about "truly nominal" is wrong. A character variable is a
character variable, not necessarily a categorical variable. Categorical
means that the answer falls into one of a finite number of known
categories, known as "levels" in R's "factor" class.

If you start out from character information:

x <- c("red", "red", "blue", "green", "blue")

You can turn it into a factor via:

x <- factor(x, levels = c("red", "green", "blue"))
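
# A small sketch (base R only) of what such a factor stores, rebuilding
# the x from above so the lines run on their own: the level labels plus
# one internal integer code per observation. The codes are only positions
# in levels(x), not values on a numeric scale.

x <- factor(c("red", "red", "blue", "green", "blue"),
            levels = c("red", "green", "blue"))
levels(x)        # the known categories, in the order given above
table(x)         # number of observations per level
as.integer(x)    # internal codes, i.e. positions in levels(x)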

R now knows how to do certain things with such a variable, e.g., it produces
useful summaries and knows how to encode it in regression problems:

model.matrix(~ x)

which seems to be what you asked for. Moreover, you don't need to call this
yourself: most regression functions in R will do that for you
(including svm() in "e1071" or ksvm() in "kernlab", among others).
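
# A minimal sketch (base R only) of what model.matrix() returns for the
# factor above. The object names "mm" and "mm_full" are just for
# illustration. With R's default treatment contrasts, the first level
# ("red") serves as the reference and every other level gets its own 0/1
# indicator column.

x <- factor(c("red", "red", "blue", "green", "blue"),
            levels = c("red", "green", "blue"))
mm <- model.matrix(~ x)
colnames(mm)     # intercept plus one indicator per non-reference level

# Dropping the intercept gives one indicator column per level, i.e. the
# red/green/blue 0/1 coding described in the question:

mm_full <- model.matrix(~ x - 1)
mm_full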

In short: Keep your categorical variables as "factor" columns in a
"data.frame" and use the formula interface of svm()/ksvm() and you are
fine.
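
# A hedged sketch of that workflow, not a definitive recipe. The data and
# all names (train, test, y, color, z, fit, pred) are made up for
# illustration, and the svm() call assumes that the "e1071" package is
# installed (it is skipped otherwise). The test factor is built with the
# levels of the training factor, so that "red" and "green" get the same
# encoding in both data sets even though "blue" does not occur in the
# test data.

train <- data.frame(
  y     = factor(c("a", "a", "b", "b", "a", "b")),
  color = factor(c("red", "blue", "green", "red", "blue", "green"),
                 levels = c("red", "green", "blue")),
  z     = c(1.2, 0.7, 3.1, 2.9, 1.0, 3.3)
)
test <- data.frame(
  color = factor(c("red", "green"), levels = levels(train$color)),
  z     = c(1.1, 3.0)
)

# Same levels, hence the same indicator columns in both design matrices:
colnames(model.matrix(~ color + z, data = train))
colnames(model.matrix(~ color + z, data = test))

if (requireNamespace("e1071", quietly = TRUE)) {
  # The formula interface does the dummy coding of "color" internally.
  fit  <- e1071::svm(y ~ color + z, data = train)
  pred <- predict(fit, newdata = test)
}
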
Z

> For example:
> color = ("red", "blue", "green")
>
> I could translate that into an index so I wind up with
> color= (1,2,3)
>
> But my concern is that the SVM will now think that the values are numeric in
> "range" and not discrete conditions.
>
> Another thought would be to create 3 binary variables from the single color
> variable, so I have:
>
> red = (0,1)
> blue = (0,1)
> green = (0,1)
>
> An example fed to the SVM would have one positive and two negative values to
> indicate the color value:
> i.e. for a blue example:
> red = 0, blue =1 , green = 0
>
> Or, do any of the SVM packages intelligently handle this internally so that I
> don't have to mess with it?  If so, do I need to be concerned about different
> "translation" of the data if the test data set isn't exactly the same as the
> training set?
> For example:
> training data  =  color ("red", "blue", "green")
> test data = color ("red", "green")
>
> How would I be sure that the "red" and "green" examples get encoded the same
> so that the SVM is accurate?
>
>
> -N