Explaining Bagging
Peter Bühlmann and Bin Yu
May 2000
Abstract
Bagging is one of the most effective computationally intensive procedures
for improving unstable estimators or classifiers, and it is
especially useful for high-dimensional data problems.
Here we formalize the notion of instability and
derive theoretical results to explain the variance
reduction effect of bagging (or of its variants) in hard decision problems,
which include
estimation after testing in regression and decision trees for
continuous regression functions and classifiers. Hard decisions create
instability, and bagging is shown to smooth such hard decisions,
yielding smaller variance and mean squared error.
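As an illustration of how bagging smooths a hard decision, the following minimal sketch (in Python with NumPy; the function names fit_stump and bagged_stump are ours, not from the paper) fits a least-squares regression stump, whose split point is a hard decision, and averages it over bootstrap resamples:

import numpy as np

def fit_stump(x, y):
    # Least-squares regression stump: choose the split point that
    # minimizes the residual sum of squares (a hard decision).
    order = np.argsort(x)
    xs, ys = x[order], y[order]
    best = (np.inf, None, None, None)
    for i in range(1, len(xs)):
        left, right = ys[:i], ys[i:]
        rss = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if rss < best[0]:
            best = (rss, (xs[i - 1] + xs[i]) / 2.0, left.mean(), right.mean())
    _, split, mu_left, mu_right = best
    return lambda t: np.where(t <= split, mu_left, mu_right)

def bagged_stump(x, y, n_boot=50, seed=0):
    # Bagging: average stumps fitted on bootstrap samples (drawn with
    # replacement); averaging smooths the hard split-point decision.
    rng = np.random.default_rng(seed)
    n = len(x)
    stumps = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)  # bootstrap resample of the data
        stumps.append(fit_stump(x[idx], y[idx]))
    return lambda t: np.mean([s(t) for s in stumps], axis=0)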
These theoretical explanations motivate subagging, an
alternative aggregation scheme based on subsampling. It is computationally
cheaper but achieves approximately the same accuracy as bagging.
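A corresponding sketch of subagging, reusing fit_stump from the example above, replaces the bootstrap by subsampling without replacement (the subsample fraction frac = 0.5 is an illustrative choice, not a prescription from the paper):

def subagged_stump(x, y, n_sub=50, frac=0.5, seed=0):
    # Subagging: average stumps fitted on subsamples drawn *without*
    # replacement; each fit uses only frac * n points, so the scheme
    # is computationally cheaper than bagging.
    rng = np.random.default_rng(seed)
    n = len(x)
    m = max(2, int(frac * n))
    stumps = []
    for _ in range(n_sub):
        idx = rng.choice(n, size=m, replace=False)  # subsample of size m
        stumps.append(fit_stump(x[idx], y[idx]))
    return lambda t: np.mean([s(t) for s in stumps], axis=0)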
Moreover, our theory reveals improvements of first
order, in line with simulation studies; this contrasts with the
second-order explanation of Friedman and Hall (2000) for smooth
functionals, which does not cover the most popular base learner for bagging,
namely decision trees.
In particular, we obtain an asymptotic limiting distribution at the
cube-root rate for the split point when fitting piecewise constant
functions. Denoting the sample size by $n$, it follows that in a cylindrical
neighborhood of diameter $n^{-1/3}$ around the theoretically optimal split
point, the variance and mean squared error reduction
of subagging can be characterized
analytically. Because of this slow rate,
our reasoning also provides an explanation on the global scale, over the
whole covariate space, for decision trees with finitely many splits.
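As a hypothetical illustration of this neighborhood effect, the following Monte Carlo sketch (reusing fit_stump and subagged_stump from above; the step location 0.5, noise level, sample size, and grid are arbitrary choices of ours) compares the pointwise variance of a single stump with that of a subagged stump on a grid around the true split point, where the instability of the $n^{-1/3}$-rate split-point estimator is concentrated:

def simulate(n=200, reps=200, seed=1):
    # Estimate the pointwise prediction variance of a single stump and a
    # subagged stump on a grid of points around the true split at 0.5.
    rng = np.random.default_rng(seed)
    grid = np.linspace(0.4, 0.6, 21)
    single, subag = [], []
    for _ in range(reps):
        x = rng.uniform(0.0, 1.0, size=n)
        y = (x > 0.5).astype(float) + 0.5 * rng.normal(size=n)  # step function plus noise
        single.append(fit_stump(x, y)(grid))
        subag.append(subagged_stump(x, y, n_sub=20)(grid))
    print("max pointwise variance, single stump:  ", np.var(single, axis=0).max())
    print("max pointwise variance, subagged stump:", np.var(subag, axis=0).max())

simulate()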