[R-sig-Geo] Modeling areal data with lots of holes, islands

Wed Jul 22 21:10:21 CEST 2015

Hi All,

I am using R to model data with repeated measures and strong spatial
autocorrelation in OLS residuals.  Response and independent variables are
summarized at the county level (2300 counties across the contiguous US, each
measured once during each of four years).

There are lots of challenges with these data. First, the response is a
proportion, so I am looking at modeling methods that can accommodate a
beta distribution (otherwise I'll logit-transform the response, or ignore
the issue since residual distributions don't look too bad).  Second, the 4
repeated measures per county call for either random intercepts or a
temporal correlation structure.  Third, the strong spatial patterning of
residuals should probably be dealt with somehow.  The modeling tools I am
considering
include mgcv::gam, mgcv::gamm, gamm4::gamm4, INLA::inla, spTimer::spTGibbs,
etc.
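
To make the first two points concrete, here is a minimal sketch of a beta
regression with county-level random intercepts in mgcv::gam.  It runs on
simulated data, not the actual county data: the variable names (county,
year, x, y), the effect sizes, and the precision phi are all assumptions
made up for illustration.

```r
library(mgcv)

set.seed(42)

# Simulated stand-in for the real data: 50 hypothetical counties x 4 years
n_county <- 50
d <- expand.grid(county = factor(seq_len(n_county)), year = 1:4)
d$x <- rnorm(nrow(d))

# County-level random intercepts on the logit scale (assumed sd = 0.5)
u <- rnorm(n_county, sd = 0.5)
mu <- plogis(-0.5 + 0.8 * d$x + u[as.integer(d$county)])

# Beta-distributed proportion with mean mu and assumed precision phi
phi <- 20
d$y <- rbeta(nrow(d), mu * phi, (1 - mu) * phi)

# Beta regression with a random intercept per county (bs = "re")
fit <- gam(y ~ x + s(county, bs = "re"),
           family = betar(link = "logit"),
           data = d, method = "REML")

summary(fit)
```

This handles the proportion response and the repeated measures, but says
nothing yet about spatial structure.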

In order to choose the right tool, I need to decide on a way to specify the
spatial relationships.  This is areal data, so it seems most common and
correct to describe spatial relationships with graphs and employ
neighborhood-oriented modeling approaches (e.g., CAR models).  But this
data set is unusual in that I don't have data for all counties in the
contiguous US (roughly 2300 out of 3100).  There are a lot of holes and
islands in the map.  Conceptually, assigning neighbors seems odd when there
are big gaps between counties, especially when I suspect that the spatial
pattern in the data is due to continuous spatial processes.
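
A toy illustration of the neighbor problem, in base R only.  The 20 x 20
grid of cells is a hypothetical stand-in for counties, and rook adjacency
(shared edge) stands in for a contiguity-based neighbor definition; about
74% of cells are kept, mirroring roughly 2300 of 3100 counties.  With real
polygons one would more likely build the neighbor list with something like
spdep::poly2nb, which is not used here.

```r
set.seed(7)

# Hypothetical 20 x 20 lattice of areal units
grid <- expand.grid(col = 1:20, row = 1:20)

# Keep ~74% of units at random, leaving holes in the map
keep <- sample(nrow(grid), size = round(0.74 * nrow(grid)))
obs <- grid[keep, ]

# Rook adjacency: two cells are neighbors if their Manhattan distance is 1
adj <- as.matrix(dist(obs, method = "manhattan")) == 1

# Number of observed neighbors per unit; 0 means an isolated "island"
n_nb <- rowSums(adj)
table(n_nb)
sum(n_nb == 0)
```

Units with zero or one observed neighbor are the ones for which a CAR-type
neighborhood specification starts to feel arbitrary.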

I would prefer to model these data using a geostatistical approach, using
county centroids.  I know this is not ideal.  I have tried both approaches
and they both yield similar fixed effect estimates for independent
variables of interest.  But the geostatistical approach produces
better-fitting models, eliminating nearly all residual autocorrelation.
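
A minimal sketch of this kind of comparison, again on simulated data.  The
centroid coordinates (lon, lat), the spatial signal, and the choice of a
Gaussian process smooth (bs = "gp") with k = 50 are assumptions for
illustration, as is the hand-rolled Moran's I with inverse-distance weights
(morans_i is a hypothetical helper, not a library function).  A Gaussian
response and a single observation per location are used to keep it short.

```r
library(mgcv)

set.seed(1)

# Simulated centroids with a smooth spatial signal standing in for a
# continuous spatial process
n <- 300
d <- data.frame(lon = runif(n), lat = runif(n), x = rnorm(n))
d$y <- 0.5 * d$x + sin(2 * pi * d$lon) * cos(2 * pi * d$lat) +
  rnorm(n, sd = 0.3)

# Moran's I of z using inverse-distance weights between centroids
morans_i <- function(z, coords) {
  w <- 1 / as.matrix(dist(coords))
  diag(w) <- 0                      # no self-weights
  z <- z - mean(z)
  (length(z) / sum(w)) * sum(w * outer(z, z)) / sum(z^2)
}

# Non-spatial fit versus a fit with a Gaussian process smooth of centroids
fit_ols <- lm(y ~ x, data = d)
fit_gp  <- gam(y ~ x + s(lon, lat, bs = "gp", k = 50),
               data = d, method = "REML")

i_ols <- morans_i(resid(fit_ols), d[, c("lon", "lat")])
i_gp  <- morans_i(resid(fit_gp),  d[, c("lon", "lat")])
c(ols = i_ols, gp = i_gp)
```

If the spatial term is doing its job, the residual Moran's I from the
second fit should sit much closer to zero than the one from the first.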

So, finally, the question.  Is it reasonable to model these areal data
using a geostatistical approach given (1) there are lots of holes and
islands in the areal data, (2) I suspect the spatial patterning to be due
to continuous spatial processes, not adjacency of administrative
boundaries, (3) there are roughly 2300 counties in the analysis, where
variation in county size and shape is small compared to the continental
analysis extent, (4) a geostatistical model does a better job of removing
residual autocorrelation, and (5) I better understand the geostatistical
ways of specifying spatial relationships?  What are the practical
consequences of using geostatistical methods for areal data?  Are they
greater than the consequences of using odd neighborhood specifications?

Thanks for any advice you can offer.

Best,

Tim

P.S. Sorry if this is the wrong venue for this question.  Please let me
know if there is a better place to send it.
