[BioC] classification method applied to microarrays (CMA package)

Tue Oct 27 12:59:24 CET 2009

An svm is a reasonable classifier that performs OK on microarray data
and usually requires no tuning of parameters, although the same is
true of many other classifiers.

To understand the GeneSelection method you first need to understand
cross validation (which takes place within the classification function).
Cross validation estimates the classification error by splitting
the data into many training and test set combinations. The model (your
svm) is built on the training set and then tested against the
test set to see how many classification errors are made.
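
To make the splitting concrete, here is a minimal sketch in plain Python
(not R, and not the CMA API). The function names are made up for
illustration, and a nearest-centroid rule stands in for the svm only to
keep the example self-contained.

```python
import random

def kfold_indices(n, k, seed=0):
    """Split sample indices 0..n-1 into k disjoint test folds."""
    idx = list(range(n))
    random.Random(seed).shuffle(idx)
    return [idx[i::k] for i in range(k)]

def fit_centroids(X, y):
    """Stand-in classifier (NOT an svm): mean expression profile per class."""
    cents = {}
    for label in set(y):
        rows = [x for x, lab in zip(X, y) if lab == label]
        cents[label] = [sum(col) / len(rows) for col in zip(*rows)]
    return cents

def predict_centroid(cents, x):
    """Assign x to the class whose centroid is closest (squared distance)."""
    return min(cents, key=lambda c: sum((a - b) ** 2 for a, b in zip(x, cents[c])))

def cv_error(X, y, k, fit, predict, seed=0):
    """Fraction of samples misclassified when each fold is held out once."""
    errors = 0
    for test in kfold_indices(len(y), k, seed):
        held_out = set(test)
        train = [i for i in range(len(y)) if i not in held_out]
        # Build the model on the training samples only.
        model = fit([X[i] for i in train], [y[i] for i in train])
        # Count mistakes on the samples the model has not seen.
        errors += sum(predict(model, X[i]) != y[i] for i in test)
    return errors / len(y)
```

Each sample ends up in a test set exactly once, so the returned value is
the overall proportion of held-out samples that were misclassified.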

If you choose GeneSelection (which you probably should) then the data
is reduced to a subset of features/genes based on a simple statistic.
However, not just one set of genes is selected: a separate set is
selected for every training set in the cross validation. Otherwise the
genes would be chosen using the test samples as well, and the estimated
svm misclassification error would likely be an underestimate.
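
As a rough illustration of selecting genes separately for every training
set, here is a plain Python sketch (again not the CMA API). The function
names and the simple t-like score are my own assumptions, and it assumes
exactly two classes.

```python
import random
from statistics import mean, pvariance

def t_like_score(values_a, values_b):
    """Simple two-class statistic: mean difference scaled by the spread."""
    spread = (pvariance(values_a) + pvariance(values_b)) ** 0.5
    return abs(mean(values_a) - mean(values_b)) / (spread + 1e-9)

def rank_genes(X, y, n_top):
    """Score genes (columns of X) on the given samples; return top indices."""
    labels = sorted(set(y))
    assert len(labels) == 2, "sketch assumes exactly two classes"
    scores = []
    for g, col in enumerate(zip(*X)):
        a = [v for v, lab in zip(col, y) if lab == labels[0]]
        b = [v for v, lab in zip(col, y) if lab == labels[1]]
        scores.append((t_like_score(a, b), g))
    return [g for _, g in sorted(scores, reverse=True)[:n_top]]

def gene_lists_per_training_set(X, y, k, n_top, seed=0):
    """One gene list per training set; held-out samples are never used."""
    idx = list(range(len(y)))
    random.Random(seed).shuffle(idx)
    lists = []
    for fold in range(k):
        held_out = set(idx[fold::k])
        train = [i for i in idx if i not in held_out]
        lists.append(rank_genes([X[i] for i in train],
                                [y[i] for i in train], n_top))
    return lists
```

The point of the design is that rank_genes only ever sees the training
samples of a given split, so the test samples play no part in choosing
the genes.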

So when you use the toplist function on your GeneSelection object you
will find a number of feature lists, none exactly the
same. The 'informative' genes are those that occur most frequently in
the toplists. You can examine the GeneSelection toplists before you run
the classification function, but you will obviously want to run the
classification function to check that the features are indeed
'informative'.
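
Counting how often each gene turns up across the lists is simple
bookkeeping. A small Python sketch, assuming you have already pulled the
per-training-set gene lists out into ordinary lists (the function name
and the gene identifiers in the usage note are placeholders, not real
results):

```python
from collections import Counter

def gene_frequencies(toplists):
    """Count in how many of the lists each gene appears, most frequent first."""
    # set() so a gene is counted at most once per list.
    counts = Counter(g for genes in toplists for g in set(genes))
    return counts.most_common()
```

For example, gene_frequencies([["g1", "g2"], ["g1", "g3"], ["g1", "g2"]])
puts "g1" first with a count of 3, followed by "g2" with 2 and "g3"
with 1.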

You can use the GeneSelection method that gives the lowest
cross-validation error. I'd start with limma, but if there is a
reasonable separation of classes then the different methods should
work similarly.

jeez I hope that is clear....

Stephen Henderson
UCL

On 27 Oct 2009, at 11:21, Juan Carlos Oliveros Collazos wrote:

> Dear all,
>
> I am starting to use the CMA package for classification of microarray
> samples.
>
> In particular, I want to know which genes are mainly responsible
> for separating about 60 lists of expression values into 2 groups
> that are already known. I understand that SVM is a good method to
> find the hyperplane that best separates the two groups, but what I
> need are the genes, not the hyperplane parameters.
>
> My questions are:
>
> To get a list of genes, should I use SVMs in some manner (or another
> classification method), or do I simply need to identify the
> "informative" genes using the GeneSelection function of the CMA package?
>
> If so, are the learning sets needed? Why?
>
> Any recommendation for choosing a gene selection method?
>
> Thanks in advance.
>
> best,
>
> Juan Carlos Oliveros
> CNB-CSIC, Madrid, Spain