Ph.D. Thesis
| Heading: | SUPERVISED LEARNING IN VERY HIGH
DIMENSIONAL PROBLEMS WITH APPLICATIONS TO MICROARRAY
DATA |
| Abstract: | In the past decade, the advent of efficient
genome sequencing tools and high-throughput experimental
biotechnology has led to enormous progress in the life
sciences. Among the most important innovations is the microarray
technology. It quantifies the expression of thousands of
genes simultaneously by measuring the hybridization from a tissue
of interest to probes on a small glass or plastic slide. The
characteristics of these data include a fair amount of random
noise, a predictor dimension in the thousands, and a sample size
in the dozens. A particular application of microarray technology is in cancer research, where the goal is a precise and early diagnosis of tumorous malignancies, allowing for a tailored treatment with fewer side effects and higher cure rates. The challenge for statistical research is to develop and adapt class prediction tools that work reliably in this very high-dimensional setting. The problem may be eased to some extent by the fact that the true underlying signal may be sparse, meaning that only a few genes contribute significantly to the outcome variation. This thesis contributes two papers that pursue the novel concept of finding predictive gene groups from microarray data. The concept is motivated by the biological assumption that a few latent gene expression signatures are most accurate for phenotype discrimination. We present two algorithms that are based on non-exhaustive but efficient greedy search heuristics, plus two statistically motivated, likelihood-based objective functions. The classification power of these parsimonious prediction models has been carefully evaluated; it is competitive and empirically confirms the benefit of these supervised grouping techniques. Two further chapters of this thesis focus on statistically motivated machine learning methods for class prediction with gene expression data. The first contribution is a tailored boosting algorithm that contradicts and clarifies the claim, made in several earlier publications, that boosting methods do not work well for microarray data. The second paper suggests a completely novel hybrid of the two ensemble methods bagging and boosting. This modification results in an algorithm that performs among the best of the machine learning methods. Moreover, the second paper presents some innovative ideas for measuring the influence of single genes, to support a biological interpretation of the prediction models.
The validity of these machine learning approaches has been confirmed by application to many real datasets and several simulation models. |
| Award: | My Ph.D. thesis was honored with the
Arthur-Linder-Prize for outstanding research work in the area
of biometry |
| Supervisor: |
Prof. Dr. Peter Bühlmann |
| Submission Date: | June 2004 |
| Reference: | Diss. ETH No. 15580 |
| Length: | 156 pages |
| Publications: |
All papers I published in 2004 and earlier
were written during the course of my Ph.D. thesis. You can find
them on my publications
page. |
| Download: | My
Ph.D. thesis is available as PS (2180k) and
PDF (1440k) |
| Supplement: | I
have a little article on the basics
of microarray analysis. It is written in German, and may thus not
be too helpful if you do not speak that language. |
| Marcel Dettling, 19.4.2005 |