[R] predictive modeling and extremely large data

Wed Sep 7 17:19:13 CEST 2011

Hi,

On Wed, Sep 7, 2011 at 5:25 AM, Divyam <divyamurali13 at gmail.com> wrote:
> Hi,
>
> I am new to R, and here is what I am doing in it now. I am using a machine
> learning technique (SVM) to do predictive modeling. The data that I am using
> is bound to grow perpetually. What I want to know is this: say I initially
> fed a data set of 5000 data points to the SVM. The algorithm derives
> a certain intelligence (i.e., output) from these 5000 data points. I
> have an additional 10000 data points today. Now, if I remove the first
> 5000 data points and then feed in the additional 10000, I want the
> algorithm to also make use of the intelligence derived from the initial
> 5000 data points while evaluating the new delta data points (10000), and the
> end result to be an aggregated measure of all 15000 data points. This is
> important to me from an efficiency point of view. If there are any other
> packages in R that do the same (i.e., enable statistical models to learn
> continuously from past experience while deleting the prior data
> from which the intelligence was derived), kindly post about them. This will be
> of immense help to me.

I'm not sure that I understand what you mean ... maybe because some of
the terminology you are using is a bit nonstandard.

If you want the predictive model you build to be "immediately
effective" and learn from new data later, you can:

(1) Train an SVM on the data you have now (i.e., do it "offline"). Use
this for future/new data. At some point in the future, retrain your
SVM on all of the data you have available to you (or some subset of
it) -- again, offline. You can check whether your new SVM outperforms
your old one on the new data to find your point of diminishing
returns: the number of data points beyond which learning a new model
stops being worth the effort.

(2) You can look into "online learning" methods -- search Google for
online SVMs and other online methods that might interest you (if
you're not married to the SVM). For what it's worth, you mention
"extremely large data," but I'm not sure what you mean (certainly 10k
data points isn't that). If you *really* mean "big data," and you want
to explore online learning, take a look at Vowpal Wabbit:

http://hunch.net/~vw/code.html
https://github.com/JohnLangford/vowpal_wabbit

That's not R though.
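
To make the two options above concrete, here is a toy sketch in plain
Python (so not R, and not vw either). Everything in it is an assumption
made for illustration: the synthetic two-blob data, the names LinearSVM,
make_batch and accuracy, and the choice of a Pegasos-style stochastic
subgradient update as the linear SVM learner. It is not how any
particular package does it.

```python
import random


class LinearSVM:
    """Linear SVM fitted by Pegasos-style stochastic subgradient descent.

    Illustrative only. There is no bias term, for brevity: the synthetic
    data below is centred at the origin.
    """

    def __init__(self, n_features, lam=0.1):
        self.w = [0.0] * n_features
        self.lam = lam  # regularisation strength (assumed value)
        self.t = 0      # number of examples seen so far

    def decision(self, x):
        return sum(wi * xi for wi, xi in zip(self.w, x))

    def predict(self, x):
        return 1 if self.decision(x) >= 0 else -1

    def partial_fit(self, rows):
        """Update the model with new (x, y) rows, where y is -1 or +1."""
        for x, y in rows:
            self.t += 1
            eta = 1.0 / (self.lam * self.t)      # decaying step size
            violated = y * self.decision(x) < 1  # hinge loss is active
            # Shrink the weights (gradient of the regulariser) ...
            self.w = [(1 - eta * self.lam) * wi for wi in self.w]
            # ... and step towards the example if its margin was violated.
            if violated:
                self.w = [wi + eta * y * xi for wi, xi in zip(self.w, x)]
        return self


def make_batch(n, rng):
    """Synthetic two-class data: Gaussian blobs around (-1, -1) and (1, 1)."""
    rows = []
    for _ in range(n):
        y = rng.choice([-1, 1])
        rows.append(([rng.gauss(y, 1.0), rng.gauss(y, 1.0)], y))
    return rows


def accuracy(model, rows):
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


rng = random.Random(0)
first = make_batch(5000, rng)     # the data available initially
second = make_batch(10000, rng)   # the data that arrives later
held_out = make_batch(2000, rng)  # fresh data used only for evaluation

# Option (1): keep everything and retrain from scratch when new data arrives.
old_model = LinearSVM(2).partial_fit(first)
retrained = LinearSVM(2).partial_fit(first + second)

# Option (2): update one model batch by batch; each batch can be deleted
# once it has been consumed.
online = LinearSVM(2).partial_fit(first)
del first
online.partial_fit(second)

for name, m in [("old", old_model), ("retrained", retrained), ("online", online)]:
    print(f"{name:>9}: seen {m.t:5d} examples, accuracy {accuracy(m, held_out):.3f}")
```

Comparing "old" against "retrained" on the held-out data is the
diminishing-returns check from (1). Note that for this particular
learner, one pass over all 15000 points and the batch-by-batch updates
produce the same weights, which is exactly why the first 5000 points can
be thrown away in (2). A batch SVM solver that optimizes over the whole
data set at once would not have that property.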

The recent 1.0 release of the shogun-toolbox includes support for
online learning, too (with vw I believe):
http://www.shogun-toolbox.org/

It has R interfaces of different flavors, but it might be a bit painful
to use through them (I'm working on making a better one in my spare
time, but I haven't had much of that lately). If the features in shogun
strike your fancy, from what I understand the best-supported way to
use it is through its "python_modular" interface.

Hope that helps,
-steve

-- 
Steve Lianoglou
Graduate Student: Computational Systems Biology
 | Memorial Sloan-Kettering Cancer Center
 | Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact