Georg Ruß research at georgruss.de
Fri Jan 7 11:28:46 CET 2011

On 06/01/11 23:10:59, Noah Silverman wrote:
> I have a data set with about 30,000 training cases and 103 variables.
> I've trained an SVM (using the e1071 package) for a binary classifier
> {0,1}.  The accuracy isn't great.  I used a grid search over the C and G
> parameters with an RBF kernel to find the best settings. [...]
> Can anyone suggest an approach to seek the ideal subset of variables for
> my SVM classifier?

The standard feature selection methods (forward/backward selection etc.)
are probably ruled out by the time it takes to train an SVM on all the
candidate subsets. What you could try instead is the following:

First, set up a hold-out split as used in cross-validation: divide your
data set into a training set and a test set (ratio 0.9 / 0.1 or so).

Second, train your SVM on the training set (try conservative parameters).

Third, have your trained SVM classify the test set and compute the
classification error.

Fourth, iterate over all variables and do the following:
  a) choose one variable and permute its values (only) in the test set
  b) have your trained SVM (from step 2) classify this test set and 
  measure the classification error
  c) repeat a) and b) a large number of times and average the errors, so
  that the result does not depend on a single permutation
  d) go to next variable

Fifth, you can get an impression of the importance that one variable has
by comparing the errors generated on the permuted test set for each
variable with the non-permuted test set classification error. If the
permutation of one variable drastically increases the classification
error, the variable is probably important.
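
A minimal sketch of steps three to five, written in Python with only the
standard library for illustration (the thread itself uses R and e1071). The
names error_rate, permutation_importance and toy_predict are hypothetical;
toy_predict stands in for the prediction function of the SVM trained in step
two, and the toy data is made up so that only the first variable matters.

```python
import random

def error_rate(predict, X, y):
    """Fraction of rows in X that predict() labels differently from y."""
    wrong = sum(1 for row, label in zip(X, y) if predict(row) != label)
    return wrong / len(y)

def permutation_importance(predict, X_test, y_test, n_repeats=30, seed=0):
    """Return the baseline test error and, per variable, the mean increase
    in test error when only that variable is permuted in the test set."""
    rng = random.Random(seed)
    baseline = error_rate(predict, X_test, y_test)
    importances = []
    for j in range(len(X_test[0])):
        column = [row[j] for row in X_test]
        increase = 0.0
        for _ in range(n_repeats):
            shuffled = column[:]
            rng.shuffle(shuffled)
            # Copy of the test set in which only column j is permuted.
            X_perm = [row[:j] + [v] + row[j + 1:]
                      for row, v in zip(X_test, shuffled)]
            increase += error_rate(predict, X_perm, y_test) - baseline
        importances.append(increase / n_repeats)
    return baseline, importances

# Toy data (assumption): two variables, the label depends only on the first.
data_rng = random.Random(1)
X_test = [[data_rng.random(), data_rng.random()] for _ in range(200)]
y_test = [1 if row[0] > 0.5 else 0 for row in X_test]

# Stand-in for the trained SVM: any function mapping one row to a label.
def toy_predict(row):
    return 1 if row[0] > 0.5 else 0

baseline, importances = permutation_importance(toy_predict, X_test, y_test)
print("baseline error:", baseline)
print("mean error increase per variable:", importances)
```

Since toy_predict ignores the second variable, permuting it leaves the error
unchanged, while permuting the first variable raises the error clearly. With
a real model the differences are rarely this clean, which is why the number
of repetitions matters.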

Sixth, repeat the random training/test split a number of times, so that
the result does not depend on a single split.
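
The repetition over random splits could look roughly like the following
sketch, again in plain Python for illustration only. The function
fit_centroids is a hypothetical toy learner (nearest class centroid) that
takes the place of the SVM training call; repeated_importance and the
synthetic data are assumptions as well. Note that the model is trained once
per split, and all permutations reuse that trained model.

```python
import random

def fit_centroids(X, y):
    """Toy stand-in for SVM training: returns a nearest-centroid predictor."""
    sums, counts = {}, {}
    for row, label in zip(X, y):
        acc = sums.setdefault(label, [0.0] * len(row))
        for j, v in enumerate(row):
            acc[j] += v
        counts[label] = counts.get(label, 0) + 1
    centroids = {c: [s / counts[c] for s in sums[c]] for c in sums}

    def predict(row):
        return min(centroids, key=lambda c: sum(
            (a - b) ** 2 for a, b in zip(row, centroids[c])))
    return predict

def error_rate(predict, X, y):
    return sum(predict(r) != t for r, t in zip(X, y)) / len(y)

def repeated_importance(fit, X, y, n_splits=5, n_repeats=10,
                        test_frac=0.1, seed=0):
    """Mean increase in test error per variable, averaged over random
    train/test splits and over permutations within each split."""
    rng = random.Random(seed)
    n, p = len(X), len(X[0])
    totals = [0.0] * p
    for _ in range(n_splits):
        idx = list(range(n))
        rng.shuffle(idx)
        n_test = max(1, int(round(test_frac * n)))
        test_idx, train_idx = idx[:n_test], idx[n_test:]
        X_tr, y_tr = [X[i] for i in train_idx], [y[i] for i in train_idx]
        X_te, y_te = [X[i] for i in test_idx], [y[i] for i in test_idx]
        predict = fit(X_tr, y_tr)          # the only training run per split
        base = error_rate(predict, X_te, y_te)
        for j in range(p):
            col = [row[j] for row in X_te]
            for _ in range(n_repeats):
                rng.shuffle(col)           # permute variable j only
                X_perm = [row[:j] + [v] + row[j + 1:]
                          for row, v in zip(X_te, col)]
                totals[j] += error_rate(predict, X_perm, y_te) - base
    return [t / (n_splits * n_repeats) for t in totals]

# Synthetic data (assumption): variable 0 separates the classes,
# variable 1 is pure noise.
data_rng = random.Random(42)
y = [1 if data_rng.random() < 0.5 else 0 for _ in range(300)]
X = [[2.0 * label + data_rng.gauss(0, 0.5), data_rng.gauss(0, 1)]
     for label in y]

importances = repeated_importance(fit_centroids, X, y)
print("mean error increase per variable:", importances)
```

The cost is n_splits training runs plus n_splits * n_repeats * (number of
variables) prediction runs on the small test set, which is the trade-off the
approach relies on.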

This is more of an ad-hoc approach and there are some pitfalls, but the
idea is easily explained and can also be carried over to any other
classification or regression model evaluated with cross-validation. The
computational burden in SVM is assumed to lie in the training and not in
the prediction step, and you only need a relatively low number of training
runs (one per repetition in the sixth step) here.

Research Assistant
Otto-von-Guericke-Universität Magdeburg
research at georgruss.de

