[R-sig-Geo] Sampling polygons for analysis in CART/randomForests

Wed Mar 9 16:01:03 CET 2005

Dear R - Geo community:
   This is perhaps a bit off topic (really a geostats question), but
we've been struggling with these questions for about 6 months now, and
it just reared its head again when I posed a question on one of ESRI's
discussion groups. I realized you would be much more experienced in this
issue and might be interested in offering advice. And this might be
something others are struggling with as well. 
   Background: We are creating predictive habitat distribution models
(a good synthesis paper is Guisan and Zimmerman 2000. Ecological
Modelling 135:147-186); using known locations of species to create a
model (and subsequent raster) that predicts the locations for additional
similar habitats. Many different algorithms have been used for actually
doing the predictive modeling (e.g. logistic regression, Classification
and regression trees [CART], ordination, maximum entropy), but in all
cases the user needs a full coverage of all the environmental variables
to be used and known locations for the element (species) being modeled.
Known locations must be attributed with values from all the
environmental layers. We've been using two different different
algorithms, one parametric (MaxEnt - separate package) and one
non-parametric (the RandomForests implementation of CART). I'm using the
R implementation of randomForest, but (embarrasingly) I'm currently
doing it by dumping the data to ascii and then reading the tables back
into R (by looping through them to keep the size manageable).  I saw the
recent post on this listserv
(https://stat.ethz.ch/pipermail/r-sig-geo/2005-March/000338.html) that
mentions Furlanello et al. (wow!) and someday we'll strive to integrate
in a similar way. 

This Furlanello paper is a good example of how we diverge and our
resulting dilemma. Furlanello et al. (and most others, as far as I can
tell, but please enlighten me if I'm wrong), begin with known POINT
locations. Point locations are easy to attribute with data from your
environmental layers.   We have good-quality POLYGON data and we want to
be sure the environmental variability captured by the polygon is
represented in the 'presence' environmental data we pass on for
analysis. How to best capture that full variability?  Note one of our
algorithms is parametric (MaxEnt) and one non-parametric
(randomForest).. and I have no problem with deviating from traditional
parametric sampling methodologies and leaving many formerly required
parametric assumptions behind (or is that a big mistake?). We've thought
long and hard about random sampling, regular sampling, and grabbing all
the points. 

We currently have about 36 environmental variables, we've resampled
them all to one size (30m). That meant bringing all the rasters derived
from 10m DEM up to 30 meter and all the coarser rasters down to 30m. I'm
being transparent about breaking the 'rules' here just so you know we
are aware of the issues surrounding scale. Our rasters are pretty big
(22578 rows x 17160 cols.... a single ascii is about 2GB).

Again, our goal is to, as fully as possible, represent the environment
within each polygon. For random (or regular) sampling; if you do a
consistent number of samples for every polygon, small polygons are
overweighted. If you sample on a per/area basis, larger polygons get
more 'weight'.  We settled on grabbing ALL the raster cells in each
polygon and sticking a point in the center of each cell and attributing
these points.  This  obviously blew up with larger polygons (much to
many cells), and we realized we'd need to go back to subsampling.   I
thought the best step would be to sample randomly without replacement so
that I'd get as full representation as possible (e.g. if a sample of 90
cells/points are requested, return 90 different cells/points). One
problem that's been pointed out is that I'm treating rasters as discrete
data when most are really a representation of continuous data and
interpolating between cell centers is more appropriate. Unfortunately I
don't have an easy way of doing this, but I'm trying to look into it.

My questions: 

Has anyone ever used *all* the cells in a polygon representation when
attributing data for later statistcal analyses?

Would sampling without replacement be an appropriate approach for
returning as much within polygon variability as possible?  If so, how
would you do it?

If randomly sampling, has anyone sampled relatively more cells in
smaller polygons, and relatively fewer cells in larger polygons, so that
you reach an asymptote in number sampled per polygon?

and related: If randomly sampling, do you tend to sample a static
number of cells per polygon or a per-area number of cells. 

Thank you for your time.
Tim Howard