Explaining Bagging
Peter Bühlmann and Bin Yu
May 2000
Abstract
Bagging is one of the most effective computationally intensive procedures
for improving unstable estimators or classifiers, and it is
especially useful for high-dimensional data problems.
Here we formalize the notion of instability and
derive theoretical results to explain the variance
reduction effect of bagging (or of its variants) in hard decision problems,
which include
estimation after testing in regression and decision trees for
continuous regression functions and classifiers. Hard decisions create
instability, and bagging is shown to smooth such hard decisions,
yielding smaller variance and mean squared error.
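As an illustration of how bagging smooths a hard decision, the following minimal sketch (in Python with NumPy; the function names fit_stump and bagged_stump are ours, not from the paper) fits a least-squares regression stump, whose split point is a hard decision, and averages it over bootstrap resamples:

import numpy as np

def fit_stump(x, y):
    # Least-squares regression stump: choose the split point that
    # minimizes the residual sum of squares (a hard decision).
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None, None, None)
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best[0]:
            best = (rss, (xs[i - 1] + xs[i]) / 2.0, left.mean(), right.mean())
    _, split, mu_left, mu_right = best
    return lambda t: np.where(t <= split, mu_left, mu_right)

def bagged_stump(x, y, n_boot=50, seed=0):
    # Bagging: average stumps fitted on bootstrap samples (drawn with
    # replacement); averaging smooths the hard split-point decision.
    rng = np.random.default_rng(seed)
    n = len(x)
    stumps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # bootstrap resample of the data
        stumps.append(fit_stump(x[idx], y[idx]))
    return lambda t: np.mean([s(t) for s in stumps], axis=0)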
These theoretical explanations motivate subagging, an
alternative aggregation scheme based on subsampling. It is computationally
cheaper but achieves approximately the same accuracy as bagging.
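A corresponding sketch of subagging, reusing fit_stump from the example above, replaces the bootstrap by subsampling without replacement (the subsample fraction frac = 0.5 is an illustrative choice, not a prescription from the paper):

def subagged_stump(x, y, n_sub=50, frac=0.5, seed=0):
    # Subagging: average stumps fitted on subsamples drawn *without*
    # replacement; each fit uses only frac * n points, so the scheme
    # is computationally cheaper than bagging.
    rng = np.random.default_rng(seed)
    n = len(x)
    m = max(2, int(frac * n))
    stumps = []
    for _ in range(n_sub):
        idx = rng.choice(n, size=m, replace=False)  # subsample of size m
        stumps.append(fit_stump(x[idx], y[idx]))
    return lambda t: np.mean([s(t) for s in stumps], axis=0)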
Moreover, our theory reveals improvements of first
order, in line with simulation studies; this contrasts with the
second-order explanation of Friedman and Hall (2000) for smooth
functionals, which does not cover the most popular base learner for bagging,
namely decision trees.
In particular, we obtain an asymptotic limiting distribution at the
cube-root rate for the split point when fitting piecewise constant
functions. Denoting the sample size by $n$, it follows that in a cylindrical
neighborhood of diameter $n^{-1/3}$ around the theoretically optimal split
point, the variance and mean squared error reduction
of subagging can be characterized
analytically. Because of this slow rate,
our reasoning also provides an explanation on the global scale, over the
whole covariate space, for decision trees with finitely many splits.
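As a hypothetical illustration of this neighborhood effect, the following Monte Carlo sketch (reusing fit_stump and subagged_stump from above; the step location 0.5, noise level, sample size, and grid are arbitrary choices of ours) compares the pointwise variance of a single stump with that of a subagged stump on a grid around the true split point, where the instability of the $n^{-1/3}$-rate split-point estimator is concentrated:

def simulate(n=200, reps=200, seed=1):
    # Estimate the pointwise prediction variance of a single stump and a
    # subagged stump on a grid of points around the true split at 0.5.
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.4, 0.6, 21)
    single, subag = [], []
    for _ in range(reps):
        x = rng.uniform(0.0, 1.0, size=n)
        y = (x > 0.5).astype(float) + 0.5 * rng.normal(size=n)  # step function plus noise
        single.append(fit_stump(x, y)(grid))
        subag.append(subagged_stump(x, y, n_sub=20)(grid))
    print("max pointwise variance, single stump:  ", np.var(single, axis=0).max())
    print("max pointwise variance, subagged stump:", np.var(subag, axis=0).max())

simulate()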