[R] svm of e1071 package

Wed Apr 14 16:38:33 CEST 2010

Hello Steve, 
Thanks for your reply. Yes, I converted the input matrix to a sparse matrix (via SparseM). rbind is fine in our case, as we have to do it anyway. So, instead of calling rbind on dense matrices and converting the whole matrix at the end, we convert each chunk and append it to the big sparse matrix. No parameter tweaking was necessary. But as you mentioned, I would like to try read.matrix.csr and write.matrix.csr. This could save a lot of our time and resources.
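
For readers of the archive, a minimal sketch of the chunked conversion described above. The helper name `dense_chunks_to_csr` and the toy `load_chunk` function are hypothetical stand-ins (the real data would be read from disk one chunk at a time); only `as.matrix.csr` and `rbind` on matrix.csr objects come from SparseM.

```r
library(SparseM)

# Hypothetical helper: convert the data to matrix.csr one row chunk at a
# time, so that only a single chunk is ever held as a dense matrix.
dense_chunks_to_csr <- function(load_chunk, n_chunks) {
  sparse <- NULL
  for (i in seq_len(n_chunks)) {
    chunk <- as.matrix.csr(load_chunk(i))  # convert this chunk only
    sparse <- if (is.null(sparse)) chunk else rbind(sparse, chunk)
  }
  sparse
}

# Toy stand-in for the real data: 12 rows x 5 binary features,
# with exactly one nonzero entry per row.
dense <- matrix(0, nrow = 12, ncol = 5)
dense[cbind(1:12, rep(1:5, length.out = 12))] <- 1

# Toy loader returning rows 1-4, 5-8, 9-12 (assumed: real code reads a file).
load_chunk <- function(i) dense[((i - 1) * 4 + 1):(i * 4), , drop = FALSE]

sparse <- dense_chunks_to_csr(load_chunk, n_chunks = 3)
print(dim(sparse))
```

Growing the result with rbind copies the accumulated matrix on every step, which is acceptable for a modest number of chunks but may become slow with very many small ones.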

Thanks again for your help.

Regards
Shyama
________________________________________
From: Steve Lianoglou [mailinglist.honeypot at gmail.com]
Sent: Wednesday, April 14, 2010 2:49 PM
To: Shyamasree Saha [shs]
Cc: r help
Subject: Re: [R] svm of e1071 package

Hi Shyama,

On Tue, Apr 13, 2010 at 10:40 AM, Shyamasree Saha [shs] <shs at aber.ac.uk> wrote:
> Dear Steve,
>
> We have finally managed to run our code. The sparse matrix is helping a lot (I should say that without matrix.csr we would not have been able to do it). This time svm uses only a small amount of memory while running, but we could not use as.matrix.csr directly on our huge data. We had to divide the data into small chunks: we created a matrix.csr from each small chunk, removed the original object, loaded the next chunk, and used rbind to put them all together. We need to be very careful about how much data we load at a time. Thanks again for your kind help.

This is great news :-)

So, just to recap -- was the only thing you did to get this to work
converting your input matrix to a sparse matrix (via SparseM)?
No parameter tweaking necessary?

Would an alternative approach to creating the sparse matrix be more
helpful? You can, for instance, create the entire sparse matrix in one
shot, like:

R> m <- as.matrix.csr(0, nrow=100000, ncol=100000)

(with appropriate numbers for your nrow,ncol params)

Then you can load subsets of your data into it and skip the
chunked-`rbind` strategy .. is that easier?

Also, the e1071 package has read.matrix.csr and write.matrix.csr
functions that might help facilitate loading/saving your data matrices
in the future.
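
A small round-trip sketch of those two functions, under the assumption that the data is already a matrix.csr with a label vector. The toy matrix, labels and temporary file are made up for illustration.

```r
library(SparseM)
library(e1071)

# Toy data: 3 examples x 4 binary features, plus class labels.
dense <- rbind(c(1, 0, 0, 1),
               c(0, 1, 0, 0),
               c(0, 0, 1, 1))
x <- as.matrix.csr(dense)
y <- factor(c("a", "b", "a"))

# Write in the sparse "label index:value ..." text format.
path <- tempfile(fileext = ".dat")
write.matrix.csr(x, file = path, y = y)

# Read it back; with labels present the result is a list with $x and $y.
restored <- read.matrix.csr(path)
print(dim(restored$x))
print(restored$y)
```

Note that read.matrix.csr infers the number of columns from the largest column index in the file, so if the trailing columns are all zero, the `ncol` argument is needed to restore the original width.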

-steve

>
> Regards
> Shyama
> ________________________________________
> From: Steve Lianoglou [mailinglist.honeypot at gmail.com]
> Sent: Wednesday, April 07, 2010 4:00 PM
> To: Shyamasree Saha [shs]
> Cc: r help
> Subject: Re: [R] svm of e1071 package
>
> Hi,
>
> On Tue, Apr 6, 2010 at 7:16 PM, Shyamasree Saha [shs] <shs at aber.ac.uk> wrote:
>> Dear Steve,
>>
>> Thanks again for your help and reply. Your help was very useful and gave us some options. We will follow your suggestions and let you know how it goes.
>
> No problem ... yeah, please write back when you figure out what's up
> (or hit more roadblocks), I'd be curious to know what the solution is.
>
> Thanks,
> -steve
>
>>
>> Regards,
>> shyama
>> ________________________________________
>> From: Steve Lianoglou [mailinglist.honeypot at gmail.com]
>> Sent: Tuesday, April 06, 2010 9:40 PM
>> To: Shyamasree Saha [shs]
>> Cc: r help
>> Subject: Re: [R] svm of e1071 package
>>
>> Hi Shyama,
>>
>> Don't forget to CC the r-help list in your discussions so that there
>> are more eyes on this problem, and others might potentially benefit
>> from discussion.
>>
>> Comments in line.
>>
>> On Tue, Apr 6, 2010 at 4:06 PM, Shyamasree Saha [shs] <shs at aber.ac.uk> wrote:
>>> Dear Steve,
>>>
>>> Thanks a lot for your reply. As you suggested the kernlab and SparseM packages, we have now installed them and are reading about them. I am trying to answer your questions and have also added a bit of code. Please let me know whether you need to know more and what your suggestions are.
>>>
>>> Thanks again for your help.
>>>
>>> Regards,
>>> Shyamasree
>>>
>>> R> .Machine$sizeof.pointer ## it should be 8
>>> Yes, it is indeed 8.
>>
>> OK
>>
>>>> * What type of kernel are you using? Have you tried different ones?
>>> Just tried the linear kernel, haven't tried with other kernels.
>>
>> OK, let's stick with that for now.
>>
>>>> * Are you doing classification or regression?
>>> We are doing multi-class classification. There are 11 classes.
>>
>> Is it any better if you just do 1-vs-all?
>> Also (from your code at the end of the email) what if you train the
>> model with `probability=FALSE`?
>>
>>>> * Is your data/feature matrix sparse? If so, are you passing libsvm a
>>>> SparseM matrix?
>>> Yes, the feature matrix is indeed very sparse. Just passing a matrix
>>> at the moment.
>>> Not sure how to define it as SparseM matrix.
>>
>> R> library(SparseM)
>> R> ?as.matrix.csr
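
A minimal sketch of what that looks like end to end, assuming a small toy feature matrix in place of the real one: the dense matrix is converted with `as.matrix.csr` and the resulting matrix.csr object is passed straight to e1071's `svm`, which accepts that class as `x`.

```r
library(SparseM)
library(e1071)

# Toy binary features (made up): class "a" uses features 1-2, class "b" 3-4.
dense <- rbind(c(1, 1, 0, 0),
               c(1, 0, 0, 0),
               c(1, 1, 0, 0),
               c(0, 0, 1, 1),
               c(0, 0, 1, 0),
               c(0, 0, 1, 1))
labels <- factor(c("a", "a", "a", "b", "b", "b"))

# Convert once, then hand the sparse object to svm().
x <- as.matrix.csr(dense)
model <- svm(x, labels, kernel = "linear", scale = FALSE, probability = FALSE)

# New data for predict() should be sparse as well.
pred <- predict(model, x)
print(table(truth = labels, predicted = pred))
```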
>>
>>>> * Have you tried playing with some of the params in the svm call, like
>>>> the values for tolerance, epsilon, cost/nu/etc.
>>> No, have not played with these at all. What do you recommend?
>>
>> Try to increase (I think (maybe decrease??)) the tolerance from its
>> default value. Moving this in one direction or the other allows the
>> solver to converge to a less-precise solution -- haven't read the
>> source in a while, though, so test it.
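
For reference, e1071 documents `tolerance` as the tolerance of the termination criterion (default 0.001), so a larger value lets the solver stop earlier at a less precise solution. A rough sketch for comparing two settings follows; `iris` is only a built-in stand-in dataset here, not the data discussed in this thread, and 0.1 is an arbitrary value chosen for illustration.

```r
library(e1071)

# iris is a stand-in; substitute your own feature matrix and labels.
x <- as.matrix(iris[, 1:4])
y <- iris$Species

# Default termination tolerance versus a looser (larger) one.
fit_default <- svm(x, y, kernel = "linear", tolerance = 0.001)
fit_loose   <- svm(x, y, kernel = "linear", tolerance = 0.1)

# Training accuracy under each setting.
acc_default <- mean(predict(fit_default, x) == y)
acc_loose   <- mean(predict(fit_loose, x) == y)
print(c(default = acc_default, loose = acc_loose))
```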
>>
>>>> * Try an even smaller subset of your data (< 1.4 GB)
>>> It works fine with a much smaller subset but have not tried with
>>> intermediate sizes.
>>
>> OK
>>
>> Can you give an idea of how long it takes for your call to `svm` to
>> return with different data sizes?
>> What do its memory stats look like?
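
One way to collect such numbers, sketched on synthetic data: time `svm` with `system.time` on growing row subsets and check the size of the input with `object.size`. The data, subset sizes and variable names below are all made up for illustration.

```r
library(e1071)

# Synthetic binary features: 400 examples x 50 features (illustrative only).
set.seed(42)
n <- 400
p <- 50
y <- factor(rep(c("neg", "pos"), length.out = n))
x <- matrix(rbinom(n * p, 1, 0.1), nrow = n)
x[, 1] <- as.integer(y == "pos")  # one informative feature

# Elapsed training time for growing subsets of the rows.
sizes <- c(100, 200, 400)
timings <- sapply(sizes, function(k) {
  idx <- seq_len(k)
  system.time(
    svm(x[idx, ], y[idx], kernel = "linear", scale = FALSE)
  )[["elapsed"]]
})
print(data.frame(rows = sizes, seconds = timings))

# Memory held by the input matrix itself.
print(object.size(x), units = "Kb")
```

Calling `gc()` before and after training gives a rough picture of R's overall memory use, though it will not show memory allocated inside the underlying C++ library.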
>>
>>>> * What is the dimensionality of your X matrix -- how many examples,
>>>> and how many features does each example have
>>> X matrix dimensionality: 35,500 rows x 52,058 cols . All features are
>>> binary.
>>
>> I think that's quite large.
>>
>> This might be a good reason to try liblinear as it is more appropriate
>> for large feature spaces and is made by the same libsvm folks:
>>
>> http://www.csie.ntu.edu.tw/~cjlin/liblinear/
>> http://www.csie.ntu.edu.tw/~cjlin/papers/liblinear.pdf
>>
>>>> * Include sessionInfo() -- we don't know what version of R/e1071 etc.
>>> R version 2.10.1 (2009-12-14)
>>> e1071      "1.5-23"
>>> running on
>>> Linux version 2.6.32-303-ec2 (buildd at crested) (gcc version 4.4.3
>>> (Ubuntu 4.4.3-3ubuntu1) ) #7-Ubuntu SMP Wed Mar 10 11:23:24 UTC 2010
>>> on an m2.2xlarge amazon instance with
>>> 34.2 GB of memory, 13 EC2 Compute Units (4 virtual cores with 3.25 EC2
>>> Compute Units each)
>>>
>>>> * There is a kernlab package that also implements the svm, try that.
>>> Thanks. Does kernlab implement libsvm as well? What is the difference
>>> between the two packages?
>>
>> libsvm is at the core of kernlab as well, but it's used a bit differently
>>
>>>> * You can also try to precompute a kernel matrix and send that into
>>>> kernlab's ksvm function, maybe that helps?
>>> Any starting tips for this?
>>
>> R> library(kernlab)
>> R> ?kernelMatrix
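
A minimal sketch of the precomputed-kernel route, assuming a linear kernel (`vanilladot`) and a toy feature matrix. When predicting from a kernel matrix, kernlab expects the kernel between the new examples and the support vectors only, which is why `SVindex` appears below.

```r
library(kernlab)

# Toy binary features (made up) with two classes.
x <- rbind(c(1, 1, 0, 0),
           c(1, 0, 0, 0),
           c(1, 1, 0, 0),
           c(0, 0, 1, 1),
           c(0, 0, 1, 0),
           c(0, 0, 1, 1))
y <- factor(c("a", "a", "a", "b", "b", "b"))

# Precompute the full linear kernel matrix and train on it.
K <- kernelMatrix(vanilladot(), x)
model <- ksvm(K, y, type = "C-svc")

# For prediction, compute the kernel between the new rows (here the
# training rows again) and the support vectors only.
K_new <- kernelMatrix(vanilladot(), x, x[SVindex(model), , drop = FALSE])
pred <- predict(model, K_new)
print(table(truth = y, predicted = pred))
```

The kernel matrix has one row and column per training example, so for 35,500 examples it would be a dense 35,500 x 35,500 matrix, which needs to fit in memory.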
>>
>>>> Don't know, lots of things ... and you didn't provide any code, so
>>>> it's hard to figure out what's up.
>>>>
>>>> If your problem is really too huge, there are other svm
>>>> implementations you might consider looking into, such as Pegasos SVM,
>>>> liblinear, svm^perf, etc., depending on the problem you're trying to
>>>> solve.
>>> Which of these do you recommend for the problem at hand and the size
>>> of the matrix?
>>
>> As mentioned above, you can try liblinear. There is no R wrapper, so
>> you can either write out the input files and run liblinear/train from
>> the command line, or you can try one of the wrappers from another
>> language (maybe you're familiar with Python?)
>>
>> I reckon it wouldn't hurt for someone to make an R wrapper for
>> liblinear, though ...
>>
>>>
>>>
>>>
>>> code:::
>>>
>>> library(e1071)  # provides svm()
>>>
>>> # Relies on objects defined elsewhere: logfile, ycln (feature matrix),
>>> # and idxtrn / idxtst (training / test row indices).
>>> svm_learn <- function(pClass){
>>>   sink(logfile, append=TRUE)
>>>   print("In svm_learn function")
>>>   sink(NULL)
>>>   multi.svm <- svm(x=as.matrix(ycln[idxtrn, ]), y=as.factor(pClass)[idxtrn], kernel='linear', probability=TRUE)
>>>
>>>   # summary() is not auto-printed inside a function, so print it explicitly
>>>   print(summary(multi.svm))
>>>
>>>   # do prediction
>>>   svmpredtrn <- predict(multi.svm, newdata=as.matrix(ycln[idxtrn, ]), decision.values=TRUE)
>>>   svmpredtst <- predict(multi.svm, newdata=as.matrix(ycln[idxtst, ]), decision.values=TRUE)
>>>
>>>   print("Finished svm_learn function")
>>>   # Confusion tables (true class vs. prediction) for training and test data
>>>   list(tabtrn=table(pClass[idxtrn], svmpredtrn), tabtst=table(pClass[idxtst], svmpredtst))
>>> }
>>>
>>>
>>>>
>>>> -steve
>>>>
>>>> --
>>>> Steve Lianoglou
>>>> Graduate Student: Computational Systems Biology
>>>> | Memorial Sloan-Kettering Cancer Center
>>>> | Weill Medical College of Cornell University
>>>> Contact Info: http://cbio.mskcc.org/~lianos/contact
>>>