[R] SVM. How to use categorical attributes?

Alekseiy Beloshitskiy abeloshitskiy at velti.com
Wed Mar 28 11:21:05 CEST 2012


Thank you, Steve,

I was thinking about something like this. I am just not sure how efficient it is to use several thousand additional variables. The second problem will be the time needed to manage all this data in memory.

I posted a more concise description here:
http://stats.stackexchange.com/questions/25355/multi-value-categorical-attributes-how-r

Thank you,
-Alex

________________________________________
From: Steve Lianoglou [mailinglist.honeypot at gmail.com]
Sent: 27 March 2012 21:47
To: Alekseiy Beloshitskiy
Cc: r-help at r-project.org
Subject: Re: [R] SVM. How to use categorical attributes?

Hi,

On Tue, Mar 27, 2012 at 6:05 AM, Alekseiy Beloshitskiy
<abeloshitskiy at velti.com> wrote:
> Hi All,
>
> Here is the case. I want to build a classification model (SVM). Some of the variables for this model are categorical attributes that represent words (usually 3-10 words, the query for a Google search). For example:
> search_id | query_words                | .. | result
> ----------+----------------------------+----+-------
> 1         | how,to,grow,tree           | .. | 4
> 2         | smartfone,htc,buy,price    | .. | 7
> 3         | buy,house,realty,london    | .. | 6
> 4         | where,to,go,weekend,cinema | .. | 4
> ...
> As you can see, the words in a query are unordered, and the same word may occur in different queries. The total number of unique words across all queries is several thousand.
> The question is how to represent this variable (query_words) so that it can be used with an SVM.
>
> Thank you for any advice!

One approach is to wire up a "bag of words" type of design matrix.

That is to say the matrix has as many columns as there are unique
words. Each row is an observation (query), and the words that appear
in the query have a value of 1 (or you can count the number of times
each word appears).
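
To make the description above concrete, here is a minimal sketch of such a
design matrix in plain Python (chosen only for illustration; nothing in it
is specific to an SVM library). The four queries come from the table in the
original question; the function names (tokenize, build_vocabulary,
to_sparse_row, to_dense_row) are hypothetical helpers made up for this
example. Each row is first stored as a mapping of non-zero columns only,
which is one way to keep memory use low when there are several thousand
unique words but only 3-10 per query.

```python
from collections import Counter

# Example queries taken from the table in the original question.
queries = [
    "how,to,grow,tree",
    "smartfone,htc,buy,price",
    "buy,house,realty,london",
    "where,to,go,weekend,cinema",
]


def tokenize(query):
    """Split a comma-separated query into its words, dropping empty pieces."""
    return [w.strip() for w in query.split(",") if w.strip()]


def build_vocabulary(queries):
    """Map each unique word to a column index (sorted for reproducibility)."""
    words = sorted({w for q in queries for w in tokenize(q)})
    return {word: col for col, word in enumerate(words)}


def to_sparse_row(query, vocab, binary=True):
    """Return {column_index: value} holding only the non-zero entries.

    With binary=True a present word gets 1; otherwise it gets its count.
    Words missing from the vocabulary are ignored.
    """
    counts = Counter(w for w in tokenize(query) if w in vocab)
    return {vocab[w]: (1 if binary else n) for w, n in counts.items()}


def to_dense_row(sparse_row, n_cols):
    """Expand a sparse row into a full list of length n_cols."""
    row = [0] * n_cols
    for col, value in sparse_row.items():
        row[col] = value
    return row


vocab = build_vocabulary(queries)
sparse_rows = [to_sparse_row(q, vocab) for q in queries]
dense_rows = [to_dense_row(r, len(vocab)) for r in sparse_rows]
```

Here dense_rows has one row per query and one column per unique word. In
practice you would probably keep only the sparse form and hand it to a
sparse matrix type, since almost every entry of the dense matrix is zero.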

You can maybe get smarter and try to group like words together, but
... now you'll have two problems ...

Hope you have lots of data!

-steve

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact


