[R] Nominal variables in SVM?

Erik Iverson eiverson at NMDP.ORG
Wed Aug 12 23:33:00 CEST 2009


This is where a small, reproducible example will definitely help us discover your problem. 

-----Original Message-----
From: r-help-bounces at r-project.org [mailto:r-help-bounces at r-project.org] On Behalf Of Noah Silverman
Sent: Wednesday, August 12, 2009 4:29 PM
To: Achim Zeileis
Cc: r help
Subject: Re: [R] Nominal variables in SVM?

Thanks for all the suggestions.

My data was loaded from a CSV file with about 80 columns, 3 of which 
are nominal. I used no specific settings for the nominal columns.

Currently, if I call svm (e1071), I get an error about the nominal column.

Do I need to tell R to change the column to a factor?  i.e. foo$color <- 
factor(foo$color)
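
A minimal sketch of that conversion, under stated assumptions: the inline CSV text, the column names (size, color, label) and the vector nominal_cols are made up for illustration and are not from the actual data. Note that read.csv() converted strings to factors by default in R versions before 4.0.0; from 4.0.0 on the default is stringsAsFactors = FALSE, so an explicit conversion like this one is needed. The svm() call only runs if e1071 happens to be installed.

```r
# Hypothetical stand-in for the real CSV file; column names are invented.
csv_text <- "size,color,label
1.2,red,yes
0.7,blue,no
1.9,green,yes
0.4,blue,no
1.5,red,yes
0.9,green,no"

# stringsAsFactors = FALSE mimics nominal columns arriving as plain character.
foo <- read.csv(text = csv_text, stringsAsFactors = FALSE)

# Convert every nominal column to a factor in one step.
nominal_cols <- c("color", "label")
foo[nominal_cols] <- lapply(foo[nominal_cols], factor)

# Check the resulting column types.
str(foo)

# Optional: fit through the formula interface if e1071 is available.
if (requireNamespace("e1071", quietly = TRUE)) {
  fit <- e1071::svm(label ~ size + color, data = foo)
  print(fit)
}
```

With about 80 columns, listing the three nominal ones by name once and converting them together is probably less error-prone than converting each column by hand.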


On 8/12/09 2:21 PM, Achim Zeileis wrote:
> On Wed, 12 Aug 2009, Noah Silverman wrote:
>
>> Hi,
>>
>> The answers to my previous question about nominal variables have led 
>> me to a more important question.
>>
>> What is the "best practice" way to feed nominal variables to an SVM?
>
> As some of the previous posters have already indicated: The data 
> structure for storing categorical (including nominal) variables in R 
> is a "factor".
>
> Your comment about "truly nominal" is wrong. A character variable is a 
> character variable, not necessarily a categorical variable. 
> Categorical means that the answer falls into one of a finite number of 
> known categories, known as "levels" in R's "factor" class.
>
> If you start out from character information:
>
>   x <- c("red", "red", "blue", "green", "blue")
>
> You can turn it into a factor via:
>
>   x <- factor(x, levels = c("red", "green", "blue"))
>
> R now knows how to do certain things with such a variable, e.g., 
> produce useful summaries or handle it in regression 
> problems:
>
>   model.matrix(~ x)
>
> which seems to be what you asked for. Moreover, you don't need to call 
> this yourself: most regression functions in R will do that for you 
> (including svm() in "e1071" or ksvm() in "kernlab", among others).
>
> In short: Keep your categorical variables as "factor" columns in a 
> "data.frame" and use the formula interface of svm()/ksvm() and you are 
> fine.
> Z
>
>
>> For example:
>> color = ("red", "blue", "green")
>>
>> I could translate that into an index so I wind up with
>> color= (1,2,3)
>>
>> But my concern is that the SVM will now think that the values are 
>> numeric in "range" and not discrete conditions.
>>
>> Another thought would be to create 3 binary variables from the single 
>> color variable, so I have:
>>
>> red = (0,1)
>> blue = (0,1)
>> green = (0,1)
>>
>> An example fed to the SVM would have one positive and two negative 
>> values to indicate the color value:
>> i.e. for a blue example:
>> red = 0, blue =1 , green = 0
>>
>> Or do any of the SVM packages handle this internally 
>> so that I don't have to mess with it? If so, do I need to be 
>> concerned about a different "translation" of the data if the test data 
>> set doesn't contain exactly the same values as the training set?
>> For example:
>> training data  =  color ("red", "blue", "green")
>> test data = color ("red", "green")
>>
>> How would I be sure that the "red" and "green" examples get encoded 
>> the same so that the SVM is accurate?
>>
>> Thanks in advance!!
>>
>> -N
>>
>> ______________________________________________
>> R-help at r-project.org mailing list
>> https://stat.ethz.ch/mailman/listinfo/r-help
>> PLEASE do read the posting guide 
>> http://www.R-project.org/posting-guide.html
>> and provide commented, minimal, self-contained, reproducible code.
>>
>>

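On the quoted question about test data that lacks some of the training colors: one way to keep the encoding consistent, sketched here with base R only, is to fix the level set once and use it for both data sets. The object names (train_color, test_color, color_levels, mm_train, mm_test) are made up for illustration; this is a sketch of the idea, not a description of what svm() does internally.

```r
# Toy data: the test set never contains "blue".
train_color <- c("red", "blue", "green", "blue")
test_color  <- c("red", "green")

# Fix the level set once and reuse it for both data sets.
color_levels <- c("red", "green", "blue")
train <- data.frame(color = factor(train_color, levels = color_levels))
test  <- data.frame(color = factor(test_color,  levels = color_levels))

# Both design matrices now have the same columns in the same order;
# the column for the unused level "blue" is simply all zero in the test set.
mm_train <- model.matrix(~ color, data = train)
mm_test  <- model.matrix(~ color, data = test)
colnames(mm_train)
colnames(mm_test)

# Dropping the intercept gives one 0/1 indicator column per color.
model.matrix(~ color - 1, data = test)
```

Without shared levels, calling factor() separately on each data set would give the test factor only the levels "green" and "red", and the two encodings would no longer line up.
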