[R] svm of e1071 package

Wed Apr 7 01:16:33 CEST 2010

Dear Steve,

Thanks again for your help and reply. It was very useful and gave us some options. We will follow your suggestions and let you know how it goes.

Regards,
shyama
________________________________________
From: Steve Lianoglou [mailinglist.honeypot at gmail.com]
Sent: Tuesday, April 06, 2010 9:40 PM
To: Shyamasree Saha [shs]
Cc: r help
Subject: Re: [R] svm of e1071 package

Hi Shyama,

Don't forget to CC the r-help list in your discussions so that there
are more eyes on this problem, and others might potentially benefit
from the discussion.

Comments in line.

On Tue, Apr 6, 2010 at 4:06 PM, Shyamasree Saha [shs] <shs at aber.ac.uk> wrote:
> Dear Steve,
>
> Thanks a lot for your reply. As you suggested the kernlab and SparseM packages, we have now installed them and are reading about them. I am trying to answer your questions below, and I have also added a bit of code. Please let me know whether you need to know more and what your suggestions are.
>
> Thanks again for your help.
>
> Regards,
> Shyamasree
>
> R> .Machine$sizeof.pointer ## it should be 8
> Yes, it is indeed 8.

OK

>> * What type of kernel are you using? Have you tried different ones?
> Just tried the linear kernel, haven't tried with other kernels.

OK, let's stick with that for now.

>> * Are you doing classification or regression?
> We are doing multi-class classification. There are 11 classes.

Is it any better if you just do 1-vs-all?
Also (from your code at the end of the email) what if you train the
model with `probability=FALSE`?
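
For reference, e1071's svm handles multi-class problems with the
one-against-one approach, i.e. k(k-1)/2 binary classifiers, which is 55
for 11 classes; 1-vs-all needs only 11. Below is a rough sketch of
1-vs-all on top of svm with probability=FALSE. The toy data and the names
x, y and ova.scores are placeholders of mine, not your objects, and I
have not checked whether this is actually faster on your data.

```r
library(e1071)

set.seed(1)
# Toy data (assumption): 3 classes, each shifted along its own feature
n <- 90; p <- 20
y <- factor(rep(c("a", "b", "c"), each = n / 3))
x <- matrix(rnorm(n * p), nrow = n)
x[y == "a", 1] <- x[y == "a", 1] + 6
x[y == "b", 2] <- x[y == "b", 2] + 6
x[y == "c", 3] <- x[y == "c", 3] + 6

# One binary "class k vs. the rest" model per class; keep its decision values
ova.scores <- sapply(levels(y), function(k) {
  yk <- factor(y == k, levels = c(TRUE, FALSE))
  fit <- svm(x = x, y = yk, kernel = "linear", probability = FALSE)
  dv <- attr(predict(fit, x, decision.values = TRUE), "decision.values")
  # The column is named "<positive label>/<negative label>";
  # flip the sign when FALSE ended up as the positive label
  if (startsWith(colnames(dv)[1], "FALSE")) -dv[, 1] else dv[, 1]
})

# Predict the class whose one-vs-rest model gives the largest score
pred <- factor(levels(y)[max.col(ova.scores, ties.method = "first")],
               levels = levels(y))
table(truth = y, predicted = pred)
```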

>> * Is your data/feature matrix sparse? If so, are you passing libsvm a
>> SparseM matrix?
> Yes, the feature matrix is indeed very sparse. Just passing a matrix
> at the moment.
> Not sure how to define it as SparseM matrix.

R> library(SparseM)
R> ?as.matrix.csr
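
A minimal sketch of the conversion, on a small made-up binary matrix (x
and y here are toy stand-ins, not your data). svm accepts a matrix.csr
object directly; note that as.matrix.csr on a dense matrix still needs
the dense matrix in memory first, which at 35,500 x 52,058 doubles is
roughly 14.8 GB.

```r
library(SparseM)
library(e1071)

set.seed(1)
# Toy stand-in for the real feature matrix: mostly zeros, binary entries
n <- 60; p <- 200
x <- matrix(rbinom(n * p, 1, 0.05), nrow = n, ncol = p)
y <- factor(rep(c("a", "b", "c"), length.out = n))

# Compressed sparse row representation; only non-zero entries are stored
x.csr <- as.matrix.csr(x)

# Train and predict on the sparse object (sparse input is not scaled by svm)
fit <- svm(x = x.csr, y = y, kernel = "linear", probability = FALSE)
pred <- predict(fit, x.csr)
table(truth = y, predicted = pred)
```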

>> * Have you tried playing with some of the params in the svm call, like
>> the values for tolerance, epsilon, cost/nu/etc.
> No, have not played with these at all. What do you recommend?

Try increasing the tolerance from its default value (0.001 in e1071's
svm). A larger tolerance loosens the solver's stopping criterion, so it
converges sooner to a less-precise solution -- I haven't read the
source in a while, though, so test it.
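
One way to test it is to fit the same model at a few tolerance values and
compare run time, number of support vectors and training accuracy. This
is only a sketch on toy data; the tolerance values and the names x, y and
res are arbitrary choices of mine.

```r
library(e1071)

set.seed(1)
# Toy binary features with a noisy two-class label (assumption)
n <- 200; p <- 50
x <- matrix(rbinom(n * p, 1, 0.2), nrow = n)
y <- factor(ifelse(x[, 1] + x[, 2] + rnorm(n, sd = 0.5) > 0.5, "pos", "neg"))

tols <- c(0.001, 0.01, 0.1)   # 0.001 is the svm default
res <- t(sapply(tols, function(tol) {
  elapsed <- system.time(
    fit <- svm(x = x, y = y, kernel = "linear", tolerance = tol, scale = FALSE)
  )[["elapsed"]]
  c(tolerance = tol,
    elapsed.sec = elapsed,
    n.sv = fit$tot.nSV,                    # total number of support vectors
    train.acc = mean(fitted(fit) == y))    # accuracy on the training rows
}))
print(res)
```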

>> * Try an even smaller subset of your data (< 1.4 GB)
> It works fine with a much smaller subset, but we have not tried it with
> intermediate sizes.

OK

Can you give an idea of how long it takes for your call to `svm` to
return with different data sizes?
What do its memory stats look like?
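
Something along these lines would give rough numbers. It is a sketch on
toy data: the subset sizes are arbitrary, x and y are placeholders, and
the memory figure is just R's own "max used" counter from gc(), not what
the OS reports for the process.

```r
library(e1071)

set.seed(1)
# Toy stand-in data (assumption): sparse binary features, 4 classes
n <- 400; p <- 100
x <- matrix(rbinom(n * p, 1, 0.05), nrow = n)
y <- factor(sample(letters[1:4], n, replace = TRUE))

sizes <- c(100, 200, 400)
stats <- t(sapply(sizes, function(m) {
  idx <- sample(n, m)
  gc(reset = TRUE)    # reset the "max used" counters before the fit
  elapsed <- system.time(
    svm(x = x[idx, ], y = y[idx], kernel = "linear", scale = FALSE)
  )[["elapsed"]]
  mem <- gc()         # last column is "max used" in Mb (Ncells, Vcells)
  c(rows = m, elapsed.sec = elapsed, max.used.Mb = sum(mem[, ncol(mem)]))
}))
print(stats)
```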

>> * What is the dimensionality of your X matrix -- how many examples,
>> and how many features does each example have
> X matrix dimensionality: 35,500 rows x 52,058 cols . All features are
> binary.

I think that's quite large.

This might be a good reason to try liblinear as it is more appropriate
for large feature spaces and is made by the same libsvm folks:

http://www.csie.ntu.edu.tw/~cjlin/liblinear/
http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf

>> * Include sessionInfo() -- we don't know what version of R/e1071 etc.
> R version 2.10.1 (2009-12-14)
> e1071      "1.5-23"
> running on
> Linux version 2.6.32-303-ec2 (buildd at crested) (gcc version 4.4.3
> (Ubuntu 4.4.3-3ubuntu1) ) #7-Ubuntu SMP Wed Mar 10 11:23:24 UTC 2010
> on an m2.2xlarge amazon instance with
> 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2
> Compute Units each)
>
>> * There is a kernlab package that also implements the svm, try that.
> Thanks. Does kernlab implement libsvm as well? What is the difference
> between the two packages?

libsvm is at the core of kernlab as well, but it's used a bit differently.

>> * You can also try to precompute a kernel matrix and send that into
>> kernlab's ksvm function, maybe that helps?
> Any starting tips for this?

R> library(kernlab)
R> ?kernelMatrix
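
A small sketch of the precomputed-kernel route, on a toy two-class
problem (x and y are made up; your problem has 11 classes). Keep in mind
that the kernel matrix is n x n, so for 35,500 examples it takes roughly
10 GB as doubles.

```r
library(kernlab)

set.seed(1)
# Toy binary feature matrix with a two-class label (assumption)
n <- 100; p <- 300
x <- matrix(rbinom(n * p, 1, 0.05), nrow = n)
storage.mode(x) <- "double"
y <- factor(ifelse(x[, 1] + x[, 2] + x[, 3] > 0, "pos", "neg"))

# Precompute the n x n linear kernel once
K <- kernelMatrix(vanilladot(), x)

# ksvm dispatches on the kernelMatrix class, so no kernel argument is needed
fit <- ksvm(K, y, type = "C-svc", C = 1)
nSV(fit)     # number of support vectors
error(fit)   # training error
```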

>> Don't know, lots of things ... and you didn't provide any code, so
>> it's hard to figure out what's up.
>>
>> If your problem is really too huge, there are other svm
>> implementations you might consider looking into, such as Pegasos SVM,
>> liblinear, svm^perf, etc., depending on the problem you're trying to
>> solve.
> Which of these do you recommend for the problem at hand and the size
> of the matrix?

As mentioned above, you can try liblinear. There is no R wrapper, so
you can either write out the input files and run liblinear/train from
the command line, or you can try one of the wrappers from another
language (maybe you're familiar with Python?)

I reckon it wouldn't hurt for someone to make an R wrapper for
liblinear, though ...
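
To write out the input file, write.matrix.csr from e1071 produces the
sparse "label index:value" format that libsvm uses, and liblinear reads
the same format. A sketch on toy data follows; x, y, the file name and
the location of the liblinear binary are all assumptions of mine.

```r
library(SparseM)
library(e1071)

set.seed(1)
# Toy stand-in data (assumption)
n <- 30; p <- 50
x <- matrix(rbinom(n * p, 1, 0.1), nrow = n)
y <- factor(sample(c("a", "b", "c"), n, replace = TRUE))

train.file <- file.path(tempdir(), "train.txt")

# Write integer class codes as labels, one example per line
write.matrix.csr(as.matrix.csr(x), file = train.file, y = as.integer(y))
head(readLines(train.file), 3)

# Then, from a shell, in the directory where liblinear was built:
#   ./train train.txt model.txt
```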

>
>
>
> code:::
>
> svm_learn <- function(pClass) {
>   # logfile, ycln, idxtrn and idxtst are not defined here (presumably globals)
>   sink(logfile, append = TRUE)
>   print("In svm_learn function")
>   sink(NULL)
>   multi.svm <- svm(x = as.matrix(ycln[idxtrn, ]), y = as.factor(pClass)[idxtrn],
>                    kernel = 'linear', probability = TRUE)
>
>   # summary() alone prints nothing inside a function, so print it explicitly
>   print(summary(multi.svm))
>
>   # do prediction
>   svmpredtrn <- predict(multi.svm, newdata = as.matrix(ycln[idxtrn, ]), decision.values = TRUE)
>   svmpredtst <- predict(multi.svm, newdata = as.matrix(ycln[idxtst, ]), decision.values = TRUE)
>
>   # Check accuracy for training data:
>
>
>   # Check accuracy for testing data:
>
>   print("Finished svm_learn function")
>   # confusion tables: true class vs. predicted class
>   list(tabtrn = table(pClass[idxtrn], svmpredtrn), tabtst = table(pClass[idxtst], svmpredtst))
> }
>
>
>>
>> -steve
>>
>> --
>> Steve Lianoglou
>> Graduate Student: Computational Systems Biology
>> | Memorial Sloan-Kettering Cancer Center
>> | Weill Medical College of Cornell University
>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>

--
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact