Boosting for Tumor Classification with Gene Expression Data


Authors: Marcel Dettling and Peter Buehlmann

Published: In Bioinformatics, June 12, 2003

Microarray experiments generate large datasets with expression values for thousands of genes but no more than a few dozen samples. Accurate supervised classification of tissue samples in such high-dimensional problems is difficult but often crucial for successful diagnosis and treatment. A promising way to meet this challenge is to use boosting in conjunction with decision trees.

We demonstrate that the generic boosting algorithm needs some modifications to become an accurate classifier in the context of gene expression data. In particular, we present a feature preselection method, a more robust boosting procedure and a new approach for multi-categorical problems. These modifications yield slight to drastic increases in performance and give competitive results on several publicly available datasets.
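The abstract does not spell out how the feature preselection works, so the following is only a minimal sketch under stated assumptions, not the method as implemented in the packages. It assumes a two-class problem and a rank-based two-sample score (a Mann-Whitney-style pair count) for each gene. The function names `rank_score` and `preselect_genes`, the gene-by-sample data layout and the toy data are all hypothetical and chosen purely for illustration.

```python
from typing import List, Sequence


def rank_score(values: Sequence[float], labels: Sequence[int]) -> int:
    """Count pairs (i in class 0, j in class 1) with values[j] < values[i]."""
    class0 = [v for v, y in zip(values, labels) if y == 0]
    class1 = [v for v, y in zip(values, labels) if y == 1]
    return sum(1 for a in class0 for b in class1 if b < a)


def preselect_genes(expression: Sequence[Sequence[float]],
                    labels: Sequence[int],
                    n_keep: int) -> List[int]:
    """Return the indices of the n_keep genes that best separate the classes.

    Assumed layout: expression[g][s] is the value of gene g in sample s.
    A score near 0 or near n0 * n1 means the two classes barely overlap,
    so genes are ranked by the distance of the score from the middle.
    """
    n0 = sum(1 for y in labels if y == 0)
    n1 = len(labels) - n0

    def separation(g: int) -> int:
        s = rank_score(expression[g], labels)
        return max(s, n0 * n1 - s)

    ranked = sorted(range(len(expression)), key=separation, reverse=True)
    return ranked[:n_keep]


# Toy data (invented): 3 genes, 6 samples, 3 samples per class.
expression = [
    [1.0, 1.2, 0.9, 5.0, 5.5, 4.8],  # gene 0: higher in class 1
    [2.0, 3.1, 2.5, 2.2, 3.0, 2.4],  # gene 1: no clear separation
    [7.0, 6.5, 7.2, 1.0, 0.8, 1.1],  # gene 2: higher in class 0
]
labels = [0, 0, 0, 1, 1, 1]
selected = preselect_genes(expression, labels, n_keep=2)
print(selected)
```

In a pipeline of this kind, only the selected genes would be passed on to the boosting step, which keeps the tree-based learner from searching through thousands of uninformative variables.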

The recommended option is to use the R package boost from CRAN, for which a Windows binary version is also available. Alternatively, you can work with the much older original, non-CRAN package LogitBoost, which contains an implementation of our modified boosting algorithm. It is available as a Linux/Unix (.tar.gz) version and as a precompiled Windows (.zip) version. Its manual (ps/pdf) contains a function index. The LogitBoost package requires the R package rpart, which provides software for decision trees. Windows users can also compile the package from source; instructions are available here (ps/pdf).

Length: 9 pages

Reference: Bioinformatics (2003), Vol. 19, No. 9, p. 1061-1069

Download: PDF(105k)

Marcel Dettling, 20.10.2003