[R] classification for huge datasets: SVM yields memory troubles

Wed Dec 15 00:59:07 CET 2004

While it is true that the large number of variables relative to
the number of observations restricts what can be inferred,
the situation is not as hopeless as Bert seems to suggest.
If it were, attempts at the analysis of expression array data
would be a waste of time.  Methods developed for that
general area may well be relevant to other data where the
number of variables is similarly far larger than the number
of observations.

See Ambroise, C. and McLachlan, G.J. 2002.  Selection bias
in gene extraction on the basis of microarray gene-expression
data.  PNAS 99: 6562--6566.

This paper discusses some of the literature on the use of SVMs.

The selection bias that these authors discuss also affects
plots, even principal components and other ordination-based
plots, where features have been selected on the basis of their
ability to separate the observations into known groups.  I have
draft versions of code that addresses this selection bias as it
affects the plotting of graphs, which (along with a paper that
has been submitted for inclusion in a conference proceedings) I
am happy to make available to anyone who wants to experiment.
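
To make the selection bias concrete, here is a minimal sketch.  It
is not the draft code mentioned above, and it is written in Python
rather than R purely for illustration.  The assumptions are mine:
30 observations, 2000 pure-noise variables, two arbitrary groups of
15, a nearest-centroid classifier, and the hypothetical helper names
top_features, predict and loo_error.  Because the data are noise, an
honest error estimate should be close to chance.  Choosing the
features once on all 30 observations and then cross-validating will
typically report a much lower, and misleading, error rate.

```python
import random


def top_features(X, y, k):
    """Indices of the k features with the largest absolute
    difference between the two group means."""
    g0 = [row for row, lab in zip(X, y) if lab == 0]
    g1 = [row for row, lab in zip(X, y) if lab == 1]

    def score(j):
        m0 = sum(row[j] for row in g0) / len(g0)
        m1 = sum(row[j] for row in g1) / len(g1)
        return abs(m0 - m1)

    return sorted(range(len(X[0])), key=score, reverse=True)[:k]


def predict(train_X, train_y, x, feats):
    """Nearest-centroid prediction using only the chosen features."""
    def sq_dist_to_centroid(label):
        rows = [r for r, lab in zip(train_X, train_y) if lab == label]
        return sum(
            (x[j] - sum(r[j] for r in rows) / len(rows)) ** 2
            for j in feats
        )

    return 0 if sq_dist_to_centroid(0) <= sq_dist_to_centroid(1) else 1


def loo_error(X, y, k, select_inside):
    """Leave-one-out error rate.

    select_inside=False: features are chosen once from ALL the data,
    so the held-out point has already influenced the selection.
    select_inside=True: features are re-chosen within every fold,
    using the training rows only.
    """
    fixed = None if select_inside else top_features(X, y, k)
    errors = 0
    for i in range(len(X)):
        tr_X = X[:i] + X[i + 1:]
        tr_y = y[:i] + y[i + 1:]
        feats = top_features(tr_X, tr_y, k) if select_inside else fixed
        if predict(tr_X, tr_y, X[i], feats) != y[i]:
            errors += 1
    return errors / len(X)


rng = random.Random(1)      # fixed seed so the run is repeatable
n, p, k = 30, 2000, 10      # illustrative sizes, not from the thread
X = [[rng.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
y = [0] * 15 + [1] * 15     # group labels unrelated to X

biased = loo_error(X, y, k, select_inside=False)
honest = loo_error(X, y, k, select_inside=True)
print("selection outside cross-validation:", biased)
print("selection inside cross-validation: ", honest)
```

The same reasoning applies to plots: a scatterplot of scores built
from features chosen on all the data will tend to show separation
between the groups even when, as here, there is none to find.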

Another good place to look, as a starting point, may be
Gordon Smyth's LIMMA User's Guide.  This can be a bit
hard to find. With limma installed, type help.start().
After some time a browser window should open. Click on
Packages | limma | Overview | LIMMA User's Guide (pdf)

John Maindonald             email: john.maindonald at anu.edu.au
phone : +61 2 (6125)3473    fax  : +61 2(6125)5549
Centre for Bioinformation Science, Room 1194,
John Dedman Mathematical Sciences Building (Building 27)
Australian National University, Canberra ACT 0200.

On 14 Dec 2004, at 10:09 PM, r-help-request at stat.math.ethz.ch wrote:

> From: Berton Gunter <gunter.berton at gene.com>
> Date: 14 December 2004 9:23:08 AM
> To: "'Andreas'" <wolf.privat at gmx.de>, <r-help at stat.math.ethz.ch>
> Cc:
> Subject: RE: [R] classification for huge datasets: SVM yields
> memory troubles
>
>
> " I have a matrix with 30 observations and roughly 30000
> variables, ... <snipped>"
>
> Comment: This is ** not ** a "huge" data set -- it is a tiny one
> with a large number of covariates. The difference is: If it were
> truly huge, SVM and/or LDA or ... might actually be able to produce
> useful results. With so few data and so many variables, it is hard
> to see how any approach that one uses is not simply a fancy random
> number generator.
>