[R-sig-Geo] Moran's I for grid data?

Tue Feb 10 23:24:59 CET 2009

On Tue, 10 Feb 2009, Sandra Burmeier wrote:

> Hi,
>
> I have some questions concerning the calculation of Moran’s I for 
> grid-based data.
>
> I’m a plant ecologist analysing the small-scale composition of 
> populations of a certain plant species. That is, I’m intending to 
> compare the spatial distribution of adult plants, juvenile plants, 
> seedlings and seeds in the seed bank. I have sampled several populations 
> in 1x4-m-plots which were subdivided by a 20x20-cm-grid, and my data are 
> aggregated values for the grid cells (e.g. 5 adult plants in cell 1, 12 
> plants in cell 2 and so on).
> The questions I would like to answer are the following:
> (1) Do adults, juveniles etc. are aggregated within the plots 
> (visualisation of the data in a simple map strongly suggests they are)?
> (2) Is the level of aggregation (the average size of the clumps) the 
> same for different life stages, i.e. do adults and juvenile clump on the 
> same scale?
>
> My plan was to calculate Moran’s I values (to answer question 1) and 
> draw individual spatial correlograms for the different life stages and 
> then compare the graphs (to answer question 2) – but is that appropriate 
> for my grid-based data?
>
> In case it is, I have some more questions. So far, I’ve used the package 
> spdep, calculated Moran’s I using the function moran.test (using 
> knn-objects with the nearest 2, 3 and 4 neighbours – any guidelines here 
> on how many make sense?) and plotted the results using sp.correlogram:
> /xy.A2.07 <-cbind(A2.07$x,A2.07$y)/
> /xy.A2.07 <-matrix(xy.A2.07,ncol=2)/
> /xy.A2.07.knn <- knearneigh(xy.A2.07, k=2) # I’ve also tried k=3 and k=4/
> /xy.A2.07.nb <- knn2nb(xy.A2.07.knn)/
> /xy.A2.07.lw <- nb2listw(xy.A2.07.nb,style="B")/
> /moran.test(A2.07$Ara_bl,xy.A2.07.lw) /
> /corr.A2.07.bl <- sp.correlogram(xy.A2.07.nb, A2.07$Ara_bl, method = 
> "I", order=6, zero.policy=TRUE, style="B")/
> /print.spcor(corr.A2.07.bl, "bonferroni")/
> /plot.spcor(corr.A2.07.bl)/
>
> Is there any possibility to define the lag-distances used by 
> sp.correlogram? Or at least to find out what they are (in terms of the 
> map units used, in my case centimetres)?

Sandra,

There are several questions here, none of which have obvious answers, so 
they involve choices motivated by domain and discipline tradition 
(GIScientists would call this "ontology"), as well as other knowledge of 
features of the phenomena that may not have been measured as such (the 
influence by distance attenuation is the key one here).

You are using aggregates, so the distances between the "points" within the 
grid cells gets aggregated. Think of a cluster of 20 plants very close 
together, but where 5 fall into each of 4 neighbouring grid cells - you'll 
see that the grid cells may both be a typical way to represent the 
phenomena, but may not be an ideal operationalisation. Shift the origin of 
the grid by 10cm in each direction, and all the plants are in one cell. So 
aggregation has its price. Had you had the point locations of the adults 
and juveniles, you might rather have done a point pattern analysis, 
probably a Kcross() in the spatstat terminology to see the 
cross-interpoint distance relations between adults and juveniles. Since 
you don't, you need to try to grasp the "support" of your observations 
firmly.

It seems to be areal support - that is, the aggregated observations are 
counts of phenomena by plot and grid cell. Are there covariates at the 
plot and grid cell level - is this a hierarchical problem?

Since the grid cells are regular, you'd need to find a good reason to 
choose anything other than first order contiguity (rook or queen style), 
or equivalently distance with a threshold just greater than the edge 
neighbour inter-centre distance for the rook style, or the corner 
neighbour inter-centre distance for the queen style. That is, knearneigh() 
is not appropriate here, I feel.

The confusion that exists in the use of the correlogram seems to relate to 
its use in geostatistics, and other areas in which the data have point 
support. If they do have point support, the inter-point distances can be 
interpreted directly. If on the other hand, the observations have areal 
support, using the centroids of the areas as "quasi" point support is 
possible, say, for triangulation or other graph based methods, but becomes 
problematic if distance is simply multiples of the resolution of a regular 
grid (with areal support).

Cliff & Ord (full references given on ?sp.correlogram) felt strongly that 
instead of trying to interpret "quasi" distance, it is better to examine 
autocorrelation for orders of neighbours, first order, second order are 
neighbours of neighbours, and so on. That is what sp.correlogram() does. 
The ncf and pgirmess packages provide correlog() functions that are 
distance based, but implicitly assume point support, or at least that the 
variable isn't an aggregate count (I would argue).

Finally, the Moran test is based on an underlying simultaneous 
autoregressive model, where the data generation process is driven by a 
covariance matrix containing the crossproduct of the inverse of the 
spatial operator, say (I - rho W). W are the spatial weights in matrix 
form, and are typically very sparse. However, the inverse is dense, that 
is, each observation *is* related to every other, but with a weight that 
attenuates as one steps out through the graph of neighbours - under 
certain conditions the inverse can be written as the sum of an infinite 
power series:

(I - rho W)^-1 = sum rho^0 W^0 + rho^1 W^1 + rho^2 W^2 + rho^3 W^3 + ...

In that case, the underlying model for first order autocorrelation 
may already include what the correlogram is trying to get hold of, but in 
graph terms, not distance terms.

The Getis-Ord G statistic tries to get at this problem, but again runs up 
against the areal/point support problem.

So in addition to point process analysis on the data before aggregation, 
you could also arguably look at the analysis of regular (experimental) 
plots (although agricultural trials would say measure crop yields per unit 
area, not just counts, and relate them to experimental covariates, say 
treatments).

Note that your aggregated data do seem to be counts anyway, so that the 
assumption that their residuals are Normally distributed may be quite 
brave. Unfortunately, there are few tried tests for autocorrelation in 
count variables (or in the residuals from GLM).

In summary, your question: "but is that appropriate for my grid-based 
data?" is central here. I don't think that there is a quick fix; getting 
anywhere will need a good deal of thought, it certainly doesn't seem 
trival to me. I would guess that plant ecologists might have a literature 
giving accepted recipies for this, but are they the same recipies that 
people doing agricultural trials would choose, or those that spatial 
epidemiologists would prefer? The underlying problem may well be support, 
that gridding and aggregating isn't necessarily a helpful way of 
operationalising this kind of research question.

Hope this helps,

Roger

>
> I am new to both spatial statistics and R, so these questions may be 
> rather trivial - my apologies if they are. However, I am really 
> wondering whether my approach is appropriate for the kind of data I’m 
> dealing with and, if it isn’t, which other method would be suitable.
>
> I would greatly appreciate any help!
>
> Many thanks,
> Sandra
>
>

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no