BagBoosting for Tumor Classification with Gene Expression Data

 

Author: Marcel Dettling

Published: In Bioinformatics, December 12th, 2004

Motivation:
Microarray experiments are expected to contribute significantly to progress in cancer treatment by enabling a precise and early diagnosis. They create a need for class prediction tools that can deal with a large number of highly correlated input variables, perform feature selection, and provide class probability estimates that quantify the predictive uncertainty. A very promising solution is to combine the two ensemble schemes, bagging and boosting, into a novel algorithm called BagBoosting.

Results:
When bagging is used as a module in boosting, the resulting classifier consistently improves on both the predictive performance and the probability estimates of bagging and of boosting alone, on real and simulated gene expression data. This quasi-guaranteed improvement is obtained simply by investing more computing effort. The empirical advantage is also clearly present when BagBoosting is compared to several established class prediction tools for microarray data.
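To make the phrase "bagging as a module in boosting" concrete, here is a minimal sketch in plain Python. It is not the author's implementation and not the API of the boost package. It assumes squared-error boosting on labels coded as -1/+1, decision stumps as the base learner, and a fixed step size. All function names, parameters and the toy data are hypothetical; the published algorithm may differ in loss function, base learner and tuning.

```python
import random

def fit_stump(X, r):
    """Least-squares decision stump: returns (feature, threshold, left_mean, right_mean)."""
    n, p = len(X), len(X[0])
    best = None
    for j in range(p):
        values = sorted(set(row[j] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [r[i] for i in range(n) if X[i][j] <= t]
            right = [r[i] for i in range(n) if X[i][j] > t]
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            sse = sum((v - lm) ** 2 for v in left) + sum((v - rm) ** 2 for v in right)
            if best is None or sse < best[0]:
                best = (sse, j, t, lm, rm)
    if best is None:
        # Every feature is constant in this sample: fall back to a constant fit.
        m = sum(r) / n
        return (0, float("inf"), m, m)
    return best[1:]

def predict_stump(stump, x):
    j, t, lm, rm = stump
    return lm if x[j] <= t else rm

def fit_bagged_stumps(X, r, n_bags, rng):
    """Bagging step: fit one stump per bootstrap sample of (X, r)."""
    n = len(X)
    stumps = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx], [r[i] for i in idx]))
    return stumps

def predict_bagged(stumps, x):
    """Average the predictions of all stumps in one bag ensemble."""
    return sum(predict_stump(s, x) for s in stumps) / len(stumps)

def bagboost(X, y, n_rounds=20, n_bags=10, step=0.5, seed=0):
    """Boosting loop whose base learner in every round is a bagged stump."""
    rng = random.Random(seed)
    F = [0.0] * len(X)          # current additive fit on the training data
    rounds = []
    for _ in range(n_rounds):
        residuals = [yi - fi for yi, fi in zip(y, F)]
        bagged = fit_bagged_stumps(X, residuals, n_bags, rng)
        rounds.append(bagged)
        F = [fi + step * predict_bagged(bagged, x) for fi, x in zip(F, X)]
    return rounds, step

def predict(model, x):
    """Class label (-1 or +1) from the sign of the boosted score."""
    rounds, step = model
    score = sum(step * predict_bagged(b, x) for b in rounds)
    return 1 if score >= 0 else -1

# Hypothetical toy data: feature 0 separates the classes, feature 1 is noise.
X = [[0.1, 5.0], [0.2, 1.0], [0.3, 4.0], [0.4, 2.0],
     [1.1, 3.0], [1.2, 5.5], [1.3, 0.5], [1.4, 2.5]]
y = [-1, -1, -1, -1, 1, 1, 1, 1]

model = bagboost(X, y)
train_predictions = [predict(model, x) for x in X]
print(train_predictions)
```

The extra computing effort mentioned above is visible in the structure: each boosting round fits `n_bags` base learners instead of one, so the cost grows roughly by that factor.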

Software:
Software for several modified boosting algorithms and for the simulation of microarray data is available from CRAN as an R package called boost. A Windows binary version is also available. Please report any bugs and pitfalls to me.

Length: 11 pages

Reference: Bioinformatics (2004), Vol. 20, No. 18, pp. 3583-3593.

Download: The article is available online from the Bioinformatics webpage. A slightly outdated preprint is available as PDF (334k) and PS (327k). For reprint requests and further information, please contact me via e-mail.

Datasets:
For the purpose of comparison, you can download the preprocessed microarray gene expression datasets exactly as I used them in my empirical study. They are provided as R data files and contain both the expression matrix and the response variable. The available files are the Leukemia data (2010k), Colon data (970k), Prostate data (4809k), Lymphoma data (1951k), SRBCT data (1137k) and Brain data (1837k). For information about the origin and the preprocessing of the datasets, please read my paper about Supervised Clustering of Genes.

Related material:
Our first paper about boosting for tumor classification with gene expression data is available here.




Marcel Dettling, 20.04.2005