[R] probabilities from predict.svm
Watling,James I
watlingj at ufl.edu
Thu Aug 19 18:31:22 CEST 2010
Hi Steve--
Thanks for your suggestions--I'll give this a shot, but I'm not sure the issue is the test/train split. A few more details: I used tune.svm to choose the values for cost and gamma. I spent quite a bit of time trying different combinations of parameters, and gamma = 1 with cost = 10000 gave the best results. This tuning was done using the entire dataset. The reason I don't think the problem lies in the test/train split is that I have written a randomization procedure that randomly selects training and testing subsets to fit the original model, and I get the "good" AUC values I mentioned consistently across alternative partitions of the full dataset into training and validation subsets.
It really feels like it is something about the probabilities themselves. Maybe a new attachment will help shed some light on the situation--these are the ASCII files being read directly into the GIS for visualization of the map. I made screenshots of the part of the file covering Florida & the southeast USA; the -9999 values are NA values defining the ocean, and the probabilities define the land surface. You can see the outline of Florida in both maps, so I know the probabilities are falling in the right place on the map. But in the first map the probabilities are all over the place; I have highlighted some cell values in North Carolina with a much higher probability than values in south Florida, where the crocodile actually occurs. The second map is the same thing, but with probabilities taken from the ASCII image produced using openmodeller; there the probabilities increase as you head south through the Florida peninsula, and there is strong spatial autocorrelation in the probabilities, as would be expected given the underlying climate predictors. The probabilities in the first image show no such spatial structure, which also does not make sense.
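One check that may be worth making here (a suggestion under assumptions, not something established in this thread) is which class each probability column refers to. In e1071, predict() on a model fitted with probability = TRUE returns the class probabilities as a matrix in attr(pred, "probabilities"), with one named column per class, and the column order may not follow the factor level order. The sketch below uses the built-in iris data as a stand-in for the presence/absence data; the data, the class names, and the cost value are illustrative only.

```r
library(e1071)

set.seed(1)
# Two-class stand-in for a presence/absence data frame (assumption)
dat <- droplevels(subset(iris, Species != "setosa"))
# Put "virginica" rows first so training order differs from factor level order
dat <- dat[order(dat$Species, decreasing = TRUE), ]

fit  <- svm(Species ~ ., data = dat, kernel = "radial",
            gamma = 1, cost = 10, probability = TRUE)
pred <- predict(fit, newdata = dat, probability = TRUE)
prob <- attr(pred, "probabilities")

colnames(prob)                  # shows which class each column refers to
p_pres <- prob[, "virginica"]   # select the class of interest by name, not position
```

Selecting the column by name rather than by index avoids writing the probability of the wrong class to the ASCII grid.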
Since I can't seem to figure out what's going on, I will try some alternative approaches to determining cost and gamma values.
Thanks again
James
-----Original Message-----
From: Steve Lianoglou [mailto:mailinglist.honeypot at gmail.com]
Sent: Thursday, August 19, 2010 11:39 AM
To: Watling,James I
Cc: r-help at lists.R-project.org
Subject: Re: [R] probabilities from predict.svm
On Thu, Aug 19, 2010 at 10:56 AM, Watling,James I <watlingj at ufl.edu> wrote:
> Hi Steve--
>
> Thanks for your interest in helping me figure this out. I think the problem has to do with the values of the probabilities returned from the use of the model to predict occurrence in a new dataframe.
Ok, so if you're sure this is the problem, and not, say, getting the
correct values for the predictor variables at a given point, then I'd
be a bit more thorough when building your model.
Originally you said:
> I have used a training dataset to train the model, and tested it against a validation data set with good results: AUC is high, and the confusion matrix indicates low commission and omission errors.
Maybe your originally "good" AUC was just a function of your train/test split?
Why not use all of your data and do something like 10 fold cross
validation to find:
(1) Your average accuracy over your folds
(2) The best value for your cost parameter (how did you pick cost=10000?)
(3) Or even the best kernel to use.
Doing (2) and (3) will likely be time consuming. To help with (2) you
might try looking at the svmpath package:
http://cran.r-project.org/web/packages/svmpath/index.html
It only works on 2-class classification problems, and (I think) using
a linear kernel (sorry, don't remember off hand, but it's written in
the package help and linked pubs).
You don't need to use svmpath, but then you'll need to define a "grid"
of C values (or maybe a 2d grid, if your svm + kernel combo has more
params) and train over these values ... takes lots of cpu time, but
not too much human time.
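A minimal sketch of such a grid search with 10-fold cross validation, assuming the e1071 package and using the built-in iris data as a two-class stand-in for the real data (the grid values are illustrative, not recommendations):

```r
library(e1071)

set.seed(1)
# Two-class stand-in for the real data frame (assumption)
dat <- droplevels(subset(iris, Species != "setosa"))

# Train over a 2d grid of gamma and cost values, scoring each pair
# by 10-fold cross validation
tuned <- tune.svm(
  Species ~ ., data = dat,
  kernel = "radial",
  gamma  = 10^(-3:0),
  cost   = 10^(0:4),
  tunecontrol = tune.control(sampling = "cross", cross = 10)
)

tuned$best.parameters    # gamma/cost pair with the lowest CV error
tuned$best.performance   # mean classification error across the folds
tuned$performances       # error and dispersion for every grid point
```

Looking at the full tuned$performances table, rather than only the best pair, shows whether the chosen cost sits at the edge of the grid, which would suggest widening it.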
Does that make sense?
--
Steve Lianoglou
Graduate Student: Computational Systems Biology
| Memorial Sloan-Kettering Cancer Center
| Weill Medical College of Cornell University
Contact Info: http://cbio.mskcc.org/~lianos/contact
-------------- next part --------------
A non-text attachment was scrubbed...
Name: SVM ASCII.pdf
Type: application/pdf
Size: 335101 bytes
Desc: SVM ASCII.pdf
URL: <https://stat.ethz.ch/pipermail/r-help/attachments/20100819/407c7e2c/attachment.pdf>