[R-sig-Geo] Moran's I for grid data?
Roger Bivand
Roger.Bivand at nhh.no
Tue Feb 10 23:24:59 CET 2009
On Tue, 10 Feb 2009, Sandra Burmeier wrote:
> Hi,
>
> I have some questions concerning the calculation of Moran’s I for
> grid-based data.
>
> I’m a plant ecologist analysing the small-scale composition of
> populations of a certain plant species. That is, I’m intending to
> compare the spatial distribution of adult plants, juvenile plants,
> seedlings and seeds in the seed bank. I have sampled several populations
> in 1x4-m-plots which were subdivided by a 20x20-cm-grid, and my data are
> aggregated values for the grid cells (e.g. 5 adult plants in cell 1, 12
> plants in cell 2 and so on).
> The questions I would like to answer are the following:
> (1) Do adults, juveniles etc. are aggregated within the plots
> (visualisation of the data in a simple map strongly suggests they are)?
> (2) Is the level of aggregation (the average size of the clumps) the
> same for different life stages, i.e. do adults and juvenile clump on the
> same scale?
>
> My plan was to calculate Moran’s I values (to answer question 1) and
> draw individual spatial correlograms for the different life stages and
> then compare the graphs (to answer question 2) – but is that appropriate
> for my grid-based data?
>
> In case it is, I have some more questions. So far, I’ve used the package
> spdep, calculated Moran’s I using the function moran.test (using
> knn-objects with the nearest 2, 3 and 4 neighbours – any guidelines here
> on how many make sense?) and plotted the results using sp.correlogram:
> /xy.A2.07 <-cbind(A2.07$x,A2.07$y)/
> /xy.A2.07 <-matrix(xy.A2.07,ncol=2)/
> /xy.A2.07.knn <- knearneigh(xy.A2.07, k=2) # I’ve also tried k=3 and k=4/
> /xy.A2.07.nb <- knn2nb(xy.A2.07.knn)/
> /xy.A2.07.lw <- nb2listw(xy.A2.07.nb,style="B")/
> /moran.test(A2.07$Ara_bl,xy.A2.07.lw) /
> /corr.A2.07.bl <- sp.correlogram(xy.A2.07.nb, A2.07$Ara_bl, method =
> "I", order=6, zero.policy=TRUE, style="B")/
> /print.spcor(corr.A2.07.bl, "bonferroni")/
> /plot.spcor(corr.A2.07.bl)/
>
> Is there any possibility to define the lag-distances used by
> sp.correlogram? Or at least to find out what they are (in terms of the
> map units used, in my case centimetres)?
Sandra,
There are several questions here, none of which have obvious answers, so
they involve choices motivated by domain and discipline tradition
(GIScientists would call this "ontology"), as well as other knowledge of
features of the phenomena that may not have been measured as such (the
influence by distance attenuation is the key one here).
You are using aggregates, so the distances between the "points" within the
grid cells gets aggregated. Think of a cluster of 20 plants very close
together, but where 5 fall into each of 4 neighbouring grid cells - you'll
see that the grid cells may both be a typical way to represent the
phenomena, but may not be an ideal operationalisation. Shift the origin of
the grid by 10cm in each direction, and all the plants are in one cell. So
aggregation has its price. Had you had the point locations of the adults
and juveniles, you might rather have done a point pattern analysis,
probably a Kcross() in the spatstat terminology to see the
cross-interpoint distance relations between adults and juveniles. Since
you don't, you need to try to grasp the "support" of your observations
firmly.
It seems to be areal support - that is, the aggregated observations are
counts of phenomena by plot and grid cell. Are there covariates at the
plot and grid cell level - is this a hierarchical problem?
Since the grid cells are regular, you'd need to find a good reason to
choose anything other than first order contiguity (rook or queen style),
or equivalently distance with a threshold just greater than the edge
neighbour inter-centre distance for the rook style, or the corner
neighbour inter-centre distance for the queen style. That is, knearneigh()
is not appropriate here, I feel.
The confusion that exists in the use of the correlogram seems to relate to
its use in geostatistics, and other areas in which the data have point
support. If they do have point support, the inter-point distances can be
interpreted directly. If on the other hand, the observations have areal
support, using the centroids of the areas as "quasi" point support is
possible, say, for triangulation or other graph based methods, but becomes
problematic if distance is simply multiples of the resolution of a regular
grid (with areal support).
Cliff & Ord (full references given on ?sp.correlogram) felt strongly that
instead of trying to interpret "quasi" distance, it is better to examine
autocorrelation for orders of neighbours, first order, second order are
neighbours of neighbours, and so on. That is what sp.correlogram() does.
The ncf and pgirmess packages provide correlog() functions that are
distance based, but implicitly assume point support, or at least that the
variable isn't an aggregate count (I would argue).
Finally, the Moran test is based on an underlying simultaneous
autoregressive model, where the data generation process is driven by a
covariance matrix containing the crossproduct of the inverse of the
spatial operator, say (I - rho W). W are the spatial weights in matrix
form, and are typically very sparse. However, the inverse is dense, that
is, each observation *is* related to every other, but with a weight that
attenuates as one steps out through the graph of neighbours - under
certain conditions the inverse can be written as the sum of an infinite
power series:
(I - rho W)^-1 = sum rho^0 W^0 + rho^1 W^1 + rho^2 W^2 + rho^3 W^3 + ...
In that case, the underlying model for first order autocorrelation
may already include what the correlogram is trying to get hold of, but in
graph terms, not distance terms.
The Getis-Ord G statistic tries to get at this problem, but again runs up
against the areal/point support problem.
So in addition to point process analysis on the data before aggregation,
you could also arguably look at the analysis of regular (experimental)
plots (although agricultural trials would say measure crop yields per unit
area, not just counts, and relate them to experimental covariates, say
treatments).
Note that your aggregated data do seem to be counts anyway, so that the
assumption that their residuals are Normally distributed may be quite
brave. Unfortunately, there are few tried tests for autocorrelation in
count variables (or in the residuals from GLM).
In summary, your question: "but is that appropriate for my grid-based
data?" is central here. I don't think that there is a quick fix; getting
anywhere will need a good deal of thought, it certainly doesn't seem
trival to me. I would guess that plant ecologists might have a literature
giving accepted recipies for this, but are they the same recipies that
people doing agricultural trials would choose, or those that spatial
epidemiologists would prefer? The underlying problem may well be support,
that gridding and aggregating isn't necessarily a helpful way of
operationalising this kind of research question.
Hope this helps,
Roger
>
> I am new to both spatial statistics and R, so these questions may be
> rather trivial - my apologies if they are. However, I am really
> wondering whether my approach is appropriate for the kind of data I’m
> dealing with and, if it isn’t, which other method would be suitable.
>
> I would greatly appreciate any help!
>
> Many thanks,
> Sandra
>
>
--
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no
More information about the R-sig-Geo
mailing list