BagBoosting for Tumor Classification with Gene Expression Data
| Author: | Marcel Dettling |
| Published: | In Bioinformatics, December 12th, 2004 |
| Motivation: | Microarray experiments are expected to contribute
significantly to progress in cancer treatment by enabling a precise and
early diagnosis. They create a need for class prediction tools that can
deal with a large number of highly correlated input variables, perform
feature selection, and provide class probability estimates that quantify
the predictive uncertainty. A very promising solution is to combine the
two ensemble schemes, bagging and boosting, into a novel algorithm called
BagBoosting. |
| Results: | When bagging is used as a module in boosting, the
resulting classifier consistently improves on the predictive performance
and the probability estimates of both bagging and boosting alone, on real
and simulated gene expression data. This quasi-guaranteed improvement
requires nothing more than a greater computing effort. The empirical
advantage is also clearly present when BagBoosting is compared with
several established class prediction tools for microarray data. |
| Software: | Software for several modified boosting algorithms
and for simulating microarray data is available from CRAN as an R package
called boost. A Windows binary version is also available. Please report
any bugs and pitfalls to me. |
| Length: | 11 pages |
| Reference: | Bioinformatics (2004), Vol. 20, No. 18,
p. 3583-3593. |
| Download: | The article is available online from the Bioinformatics
webpage. A slightly outdated preprint is available as PDF (334k) and
PS (327k). For reprint requests and further information, please contact
me via e-mail. |
| Datasets: | For the purpose of comparison, you can download the
preprocessed microarray gene expression datasets exactly as I used them
in my empirical study. They are provided as R data files and contain
both the expression matrix and the response variable. The available files
are the Leukemia data (2010k), Colon data (970k), Prostate data (4809k),
Lymphoma data (1951k), SRBCT data (1137k) and Brain data (1837k). For
information about the origin and the preprocessing of the datasets,
please read my paper about Supervised Clustering of Genes. |
| Related material: | Our first paper about boosting for tumor classification
with gene expression data is available here.
|
| Back / Home | Marcel Dettling, 20.04.2005 |
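
To make the idea of "bagging as a module in boosting" from the Results summary more concrete, here is a minimal sketch in plain Python. It is not the code from the boost R package and does not reproduce the paper's exact settings. It assumes a two-class LogitBoost loop with decision stumps as the base learner, where each boosting iteration fits a bootstrap-averaged (bagged) set of stumps instead of a single one. All function names, the defaults `n_boost` and `n_bags`, and the clipping constants are illustrative assumptions.

```python
import math
import random


def fit_stump(X, z, w):
    """Weighted least-squares stump: returns (feature, threshold, left, right)."""
    rows = range(len(X))

    def wmean(idx):
        return sum(w[i] * z[i] for i in idx) / sum(w[i] for i in idx)

    const = wmean(rows)
    # Fallback when no feature offers a split: predict a constant everywhere.
    best, best_loss = (0, math.inf, const, const), math.inf
    for j in range(len(X[0])):
        values = sorted({row[j] for row in X})
        for t in values[1:]:  # rule "x < t", so both sides are non-empty
            left = [i for i in rows if X[i][j] < t]
            right = [i for i in rows if X[i][j] >= t]
            cl, cr = wmean(left), wmean(right)
            loss = sum(w[i] * (z[i] - cl) ** 2 for i in left)
            loss += sum(w[i] * (z[i] - cr) ** 2 for i in right)
            if loss < best_loss:
                best, best_loss = (j, t, cl, cr), loss
    return best


def fit_bagged_stumps(X, z, w, n_bags, rng):
    """Bagging step: fit one stump per bootstrap resample of the samples."""
    n = len(X)
    stumps = []
    for _ in range(n_bags):
        idx = [rng.randrange(n) for _ in range(n)]
        stumps.append(fit_stump([X[i] for i in idx],
                                [z[i] for i in idx],
                                [w[i] for i in idx]))
    return stumps


def predict_bag(stumps, x):
    """Average the predictions of all stumps in one bag."""
    return sum(cl if x[j] < t else cr for j, t, cl, cr in stumps) / len(stumps)


def _prob(F):
    """LogitBoost link p = 1 / (1 + exp(-2F)), clipped to avoid overflow."""
    F = max(-30.0, min(30.0, F))
    return 1.0 / (1.0 + math.exp(-2.0 * F))


def bagboost_fit(X, y, n_boost=10, n_bags=5, seed=0):
    """Boosting loop whose base learner is a bagged set of stumps (y in {0, 1})."""
    rng = random.Random(seed)
    F = [0.0] * len(X)
    ensemble = []
    for _ in range(n_boost):
        p = [_prob(f) for f in F]
        w = [max(pi * (1.0 - pi), 1e-6) for pi in p]           # working weights
        z = [max(-4.0, min(4.0, (yi - pi) / wi))               # working response
             for yi, pi, wi in zip(y, p, w)]
        bag = fit_bagged_stumps(X, z, w, n_bags, rng)
        ensemble.append(bag)
        F = [f + 0.5 * predict_bag(bag, x) for f, x in zip(F, X)]
    return ensemble


def bagboost_proba(ensemble, x):
    """Estimated probability that sample x belongs to class 1."""
    return _prob(sum(0.5 * predict_bag(bag, x) for bag in ensemble))
```

Under these assumptions, `bagboost_fit(X, y)` takes a list of expression profiles (one list of numbers per sample) and 0/1 class labels, and `bagboost_proba(model, x)` returns a class probability estimate for a new profile. Raising `n_bags` is where the "greater computing effort" enters: every boosting iteration then fits that many stumps instead of one.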