[R] SVM probability output variation
Anders Carlsson
Anders.Carlsson at immun.lth.se
Wed Oct 21 19:05:37 CEST 2009
Hi again, and thank you Steve for your reply!
> Hi Anders,
>
> On Oct 21, 2009, at 8:49 AM, Anders Carlsson wrote:
>
> > Dear R:ers,
> >
> > I'm using the svm from the e1071 package to train a model with the
> > option "probabilities = TRUE". I then use "predict" with
> > "probabilities = TRUE" and get the probabilities for the data point
> > belonging to either class. So far all is well.
> >
> > My question is why I get different results each time I train the
> > model, although I use exactly the same data. The prediction seems to
> > be reproducible, but if I re-train the model, the probabilities vary
> > somewhat.
> >
> > Here, I have trained a model on exactly the same data five times.
> > When predicting using the different models, this is how the
> > probabilities vary:
>
> I'm not sure I'm following the example you're giving and the scenario
> you are describing.
I think you got it!
>
> > probabilities
> > Grp.0 Grp.1
> > 0.7077155 0.2922845
> > 0.7938782 0.2061218
> > 0.8178833 0.1821167
> > 0.7122203 0.2877797
>
> This seems fine to me: it looks like the probabilities of class
> membership for 4 examples (Note that Grp.0 + Grp.1 = 1).
>
Yes, within each run all was OK, but I was surprised that the probabilities varied to such a high degree between runs.
>
> > How can the predictions using the same training and test data vary
> > so much?
>
> I'm trying the code below several times (taken from the example), and
> the probabilities calculated from the call to prediction don't change
> much at all:
>
> R> library(e1071)
> R> data(iris)
> R> attach(iris)
> R> x <- subset(iris, select = -Species)
> R> y <- Species
>
> R> model <- svm(x, y, probability=TRUE)
> R> predict(model, x, probability=TRUE)
>
> To be fair, the probabilities aren't exactly the same, but the
> difference between two runs is really small:
>
> R> model <- svm(x, y, probability=TRUE)
> R> a <- predict(model, x, probability=TRUE)
>
> R> model <- svm(x, y, probability=TRUE)
> R> b <- predict(model, x, probability=TRUE)
>
> R> mean(abs(attr(a, 'probabilities') - attr(b, 'probabilities')))
> [1] 0.003215959
>
> Is this what you were talking about, or ... ?
Yes, exactly that. In your example, though, the variation seems to be a lot smaller. I'm guessing that has to do with the data.
If I instead output the decision values, the whole procedure is fully reproducible, i.e. the exact same values are returned when I retrain the model.
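The comparison between decision values and probabilities can be sketched roughly as follows. This is only an illustration on the iris data from the quoted example, not my own data set; the helper name fit_once is made up for this sketch, and the linear kernel is chosen to match my setup.

```r
library(e1071)

data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

# Hypothetical helper: train once and return both kinds of output
# for the training data.
fit_once <- function() {
  model <- svm(x, y, kernel = "linear", probability = TRUE)
  pred <- predict(model, x, decision.values = TRUE, probability = TRUE)
  list(dec  = attr(pred, "decision.values"),
       prob = attr(pred, "probabilities"))
}

a <- fit_once()
b <- fit_once()

# Largest absolute difference between the two fits, per output type.
max(abs(a$dec - b$dec))    # decision values
max(abs(a$prob - b$prob))  # class probabilities
```

Looking at the maximum rather than the mean difference makes it easier to see whether a few individual data points are responsible for most of the variation.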
I have no idea how the probabilities are calculated, but it seems to be in this step that the differences arise. In my case, I feel a bit hesitant to use them when they differ that much between runs (15% or so)...
If important, I use a linear kernel and don't tune the model in any way.
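A possible explanation, offered as an assumption rather than something confirmed in this thread: libsvm, which e1071 wraps, appears to obtain the probabilities by fitting a sigmoid to decision values from an internal cross-validation, and the split into folds is random. If that is right, and if the installed e1071 version draws its random numbers from R's generator (this may differ between releases, so treat it as unverified), then fixing the seed before training should make the probabilities repeatable. A minimal sketch on the iris data, with the helper name fit_probs made up for illustration:

```r
library(e1071)

data(iris)
x <- subset(iris, select = -Species)
y <- iris$Species

# Hypothetical helper: fix the seed, train, and return the class
# probabilities predicted for the training data.
fit_probs <- function(seed) {
  set.seed(seed)
  model <- svm(x, y, kernel = "linear", probability = TRUE)
  attr(predict(model, x, probability = TRUE), "probabilities")
}

p1 <- fit_probs(1)
p2 <- fit_probs(1)
p3 <- fit_probs(2)

# Compare fits made with the same seed and with different seeds.
max(abs(p1 - p2))  # same seed
max(abs(p1 - p3))  # different seeds
```

If the first difference is not zero, the seed is not reaching the fold assignment in that e1071 version, and the assumption above does not hold there.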
Thanks again!
/Anders
>
> -steve
>
> --
> Steve Lianoglou
> Graduate Student: Computational Systems Biology
> | Memorial Sloan-Kettering Cancer Center
> | Weill Medical College of Cornell University
> Contact Info: http://cbio.mskcc.org/~lianos/contact