[BioC] Machine learning, cross validation and gene selection

Vincent Carey stvjc at channing.harvard.edu
Wed Sep 1 17:48:27 CEST 2010


Traditionally the purpose of cross-validation is to reduce bias in
model appraisal.  The "resubstitution estimate" of classification
accuracy uses the training data to appraise the model derived from
those same training data, and is typically optimistic; this is the
subject of a substantial literature.  Cross-validation introduces a
series of partitions into training and test sets, so that a collection
of appraisals independent of the training data is obtained and then
summarized.
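To make the bias concrete, here is a toy illustration in R: the data
are simulated with no real signal, and the classifier is k-nearest
neighbours from the class package (all names below are illustrative):

library(class)   # knn() and knn.cv()

set.seed(1)
## toy data: 40 samples x 1000 genes, two groups, no real signal
x <- matrix(rnorm(40 * 1000), nrow = 40)
y <- factor(rep(c("A", "B"), each = 20))

## resubstitution: appraise on the very samples used for training
mean(knn(train = x, test = x, cl = y, k = 1) == y)  # 1.0; each sample is its own nearest neighbour

## leave-one-out cross-validation of the same classifier
mean(knn.cv(train = x, cl = y, k = 1) == y)         # near 0.5, as it should be with no signal
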
When the training process involves feature selection, the selection
must be repeated within each cross-validation step.  Clearly this
leads to a collection of chosen features that is likely to differ from
step to step.  The procedure implies no single "final" optimal
classifier, but surveying the features chosen at each step may provide
insight into commonly selected or informative features; a sketch of
the full procedure follows.
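Here is a minimal sketch of that procedure, with the gene selection
embedded in each leave-one-out fold.  It reuses the simulated x and y
above; the two-sample t-statistic filter and the 10-gene cutoff are
illustrative choices, not recommendations:

library(class)

ngenes <- 10
chosen <- vector("list", nrow(x))
pred   <- factor(rep(NA, nrow(x)), levels = levels(y))

for (i in seq_len(nrow(x))) {
  ## 1) hold out sample i as the test set
  xtr <- x[-i, ]
  ytr <- y[-i]
  ## 2) select genes using the training samples only
  tstat <- apply(xtr, 2, function(g) t.test(g ~ ytr)$statistic)
  top   <- order(abs(tstat), decreasing = TRUE)[1:ngenes]
  chosen[[i]] <- top
  ## 3-4) classify the held-out sample using only this fold's genes
  pred[i] <- knn(train = xtr[, top], test = x[i, top, drop = FALSE],
                 cl = ytr, k = 1)
}

## appraisal of the whole selection-plus-classification pipeline
mean(pred == y)

## survey how often each gene was chosen across the folds
head(sort(table(unlist(chosen)), decreasing = TRUE))
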
Random forests offers a variable importance measure derived from a
bootstrapping approach similar in some respects to cross-validation,
and the varSelRF package was discussed in recent list entries.  The
MLInterfaces package, and probably many others such as CMA, provides
tools to control and interpret cross-validation with embedded feature
selection.  Be careful what you wish for -- what exactly do you mean
by an 'optimal classifier'?
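
For the random-forest route, a minimal sketch with the randomForest
package, again on the simulated data above (varSelRF essentially wraps
an iterative backward elimination around this importance measure):

library(randomForest)

set.seed(2)
rf <- randomForest(x = x, y = y, importance = TRUE)

## type = 1: mean decrease in accuracy, computed on out-of-bag samples
imp <- importance(rf, type = 1)
head(sort(imp[, 1], decreasing = TRUE), 10)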

On Wed, Sep 1, 2010 at 10:55 AM, Daniel Brewer <daniel.brewer at icr.ac.uk> wrote:
> Hello,
>
> I am getting a bit confused about gene selection and machine learning
> and I was wondering if you could help me out.  I have a dataset that is
> classified into two groups and my aim is to get a small number of genes
> (10-20) in a gene signature that I will in theory be able to apply to
> other datasets to optimally classify the samples.  As I do not have
> separate test and training sets, I am using leave-one-out
> cross-validation to help
> determine the robustness.  I have read that one should perform gene
> selection for each split of the samples i.e.
>
> 1) Select one group as the test set
> 2) On the remainder select genes
> 3) Apply machine learning algorithm
> 4) Test whether the test set is correctly classified
> 5) Return to step 1
>
> If you do this, you might get different genes each time, so how do you
> get your "final" optimal gene classifier?
>
> Many thanks
>
> Dan
>
> --
> **************************************************************
> Daniel Brewer, Ph.D.
>
> Institute of Cancer Research
> Molecular Carcinogenesis
> Email: daniel.brewer at icr.ac.uk
> **************************************************************
>
> _______________________________________________
> Bioconductor mailing list
> Bioconductor at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/bioconductor
> Search the archives: http://news.gmane.org/gmane.science.biology.informatics.conductor
>


