[R-sig-Geo] Moran Test for Spatial Autocorrelation - Significance credible or due to misspecification of the model?

Wed Jul 20 10:23:47 CEST 2011

Julia,

>From you initial question I had assumed you had 91 sub-field observations but your latest response has me questioning if I understood the original question.  Are your 91 observations in a single field or do you have 91 separate fields across a larger geographic region? I've added another comment below.

>> Thank you for your comment below.
>> I have 91 areas across a large geopgraphic area. For each area, crop yield estimation was carried out in order to obtain a representative yield estimate.

Terry

Terry Griffin, Ph.D.
Associate Professor - Economics
University of Arkansas - Division of Agriculture

----- Original Message -----
From: "Julia Dechamps" <julia.dechamps at jesus.ox.ac.uk>
To: "Roger Bivand" <Roger.Bivand at nhh.no>
Cc: r-sig-geo at r-project.org
Sent: Tuesday, July 19, 2011 1:01:14 PM
Subject: Re: [R-sig-Geo] Moran Test for Spatial Autocorrelation - Significance credible or due to misspecification of the model?

Dear Roger,

I now used Google Maps to obtain (hopefully) more accurate coordinates of the areas. As it turns out, only for 44 out of the 91 areas, the coordinates seem to make sense/ were available. So I am now only looking at "mydata" which only contains 44 data points that are no longer representative of the original sample. This rather concerns me, but I can´t see a way out of it at the moment.

When using Delauny triangulation, I obtain the following nb object:
> mydata_nb2 <- tri2nb(coordinates(mydata_sp))
> summary(irrig_nb2)
Neighbour list object:
Number of regions: 44
Number of nonzero links: 242
Percentage nonzero weights: 12.5
Average number of links: 5.5
Link number distribution:

 3  4  5  6  7  8 10
 1  8 15 12  6  1  1
1 least connected region:
7 with 3 links
1 most connected region:
8 with 10 links

Using approach A and B from before for the construction of spatial weights I get the following test results:

# approach A:
# default: Moran's I test under randomisation:
# "greater" p-value = 6.438e-09
# "two.sided" p-value = 1.288e-08
# "less" p-value = 1

# approach B:
# binary: Moran's I test under randomisation:
# "greater" p-value = 1.874e-08
# "two-sided" p-value = 3.748e-08
# "less" p-value = 1

Having no experience in spatial statistics at all, I´m finding it difficult to envisage what spatial process (and hence what type of weights) would be appropriate for my data. I wouldn´t have thought that contagion would necessarily be appropriate (I might be misunderstanding contagion though, in which case I apologise): The grain under consideration is irrigated wheat and I wouldn´t expect yields to be influenced by similar agricultural practices spreading from one region to the next. I could imagine that nearby areas are influenced by similar weather phenomenons, however weather tends to be quite local, so maybe that´s not right either.

>>From Terry: one debate in agricultural yield data is whether the spatial error process or spatial lag process models are most appropriate; the spatial diagnostics sometimes point to a spatial lag process even when crop yield is the dependent variable which is in conflict with agronomic concepts. Based on conceptual reasons, I prefer to use the spatial error process model for these types of data where the contagion is in an underlying variable.

Another point to mention about my data is that my yields have already been smoothed prior to the averaging over years since they are aggregate yields from smaller subareas (for which no coordinates are available though). This might contribute to the origin of the apparent negative autocorrelation. When conducting a moran test for each of the 9 years separately (rather than for the average over the 9 years), the results do not change noticeably.

I am wondering whether I should maybe look for ways other than the moran test to get a rough idea whether spatial autocorrelation is present in the data or not...

Thank you very much for your time! I greatly appreciate your advice!

Regards,
Julia

________________________________________
From: Roger Bivand [Roger.Bivand at nhh.no]
Sent: 19 July 2011 15:15
To: Julia Dechamps
Cc: r-sig-geo at r-project.org
Subject: RE: [R-sig-Geo] Moran Test for Spatial Autocorrelation - Significance credible or due to misspecification of the model?

On Tue, 19 Jul 2011, Julia Dechamps wrote:

> Dear Roger,
>
> Thank you very much for your quick reply! Further to what you were asking ad 3.:
>
>> all.linked
> [1] 780.2701

Yes, this is the reason for the problems. I suggest improving the
coordinate measurements first, then constructing a graph-based neighbour
object, as your observations are very unevenly spaced. The construction of
the weights depends on the kind of spatial process you envision, which may
be contagion or perhaps diffusion in some form.

Note also that your temporal aggregation may smooth the data. I'm not
aware of appropriate tests for spatio-temporal processes, and that may be
what you really need. You may be aware that Cressie & Wikle (2011) has
appeared, and that space-time data is a very active area at the moment.

Roger

>
>> summary(mydata_nb)
> Neighbour list object:
> Number of regions: 91
> Number of nonzero links: 8168
> Percentage nonzero weights: 98.63543
> Average number of links: 89.75824
> Link number distribution:
>
> 87 88 89 90
> 4  3  4 80
> 4 least connected regions:
> 8 10 12 15 with 87 links
> 80 most connected regions:
> 1 3 4 5 6 9 11 13 14 16 17 18 19 20 22 23 24 25 27 28 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 51 52 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 with 90 links
>
> So yes, most points seem connected to eachother. I repeated the tests (approaches A-C) with a two-sided alternative as well as a one-sided less than alternative and got the following results:
>
> # approach A:
> # default: Moran's I test under randomisation:
> # "greater" p-value = 0.9967
> # "two.sided" p-value = 0.00668
> # "less" p-value = 0.00334
>
> # approach B:
> # binary: Moran's I test under randomisation:
> # "greater" p-value = 0.9968
> # "two-sided" p-value = 0.0064
> # "less" p-value = 0.0032
>
> # approach C:
> # inverse: Moran's I test under randomisation:
> # "greater" p-value < 2.2e-16
> # "two.sided" p-value < 2.2e-16
> # "less" p-value = 1
>
> Why might the fact that most of the areas are connected lead to significant negative autocorrelation tough?
> When aggregating the duplicated points (of which they are 7 in total) and repeating steps 2.-5. I end up with the following test results:
>
> # approach A:
> # default: Moran's I test under randomisation:
> # "greater" p-value = 0.5
> # "two.sided" p-value = 1
> # "less" p-value = 0.5
>
> # approach B:
> # binary: Moran's I test under randomisation:
> # "greater" p-value = NA
> # "two-sided" p-value = 1
> # "less" p-value = NA
>
> # approach C:
> # inverse: Moran's I test under randomisation:
> # "greater" p-value < 2.2e-16
> # "two.sided" p-value < 2.2e-16
> # "less" p-value = 1
>
> I think it is quite likely that the GPS coordinates I obtained are heavily approximated (the ares are quite large as they represent subdistricts in large states). I shall now try and obtain possibly more accurate coordinates for the 91 areas using the zoom on GoogleEarth...
>
> I appreciate any further comments or suggestions you might have.
>
> Regards,
> Julia
>
>
>
>
> ________________________________________
> From: Roger Bivand [Roger.Bivand at nhh.no]
> Sent: 19 July 2011 13:08
> To: Julia Dechamps
> Cc: r-sig-geo at r-project.org
> Subject: Re: [R-sig-Geo] Moran Test for Spatial Autocorrelation - Significance credible or due to misspecification of the model?
>
> On Tue, 19 Jul 2011, Julia Dechamps wrote:
>
>> Dear list,
>>
>> I am new to spatial statistics and am working on the following: I have a
>> dataset of annual grain yields over 9 years for each of 91 (irregularly
>> spaced) areas. To begin with I averaged the yields over the 9 years, so
>> that I have 91 data points, which I would now like to test for the
>> presence of spatial autocorrelation via a global Moran I test
>> (moran.test in package spdep).
>>
>> Here is how I proceeded (where in steps 2. onwards I am trying to follow
>> Roger Bivand et al: Applied Spatial Data Analysis with R, ch.9):
>>
>> 1. I obtained approximate geographical point coordinates (lat and long)
>> for the 91 areas using the online service
>> http://www.gpsvisualizer.com/geocoder/.
>>
>> 2. I created a spatial object:
>> myCRS <- CRS("+proj=longlat +ellps=WGS84")
>> mycoordmat <- matrix(c(mydata.agg$longitude,mydata.agg$latitude),ncol=2)
>> colnames(mycoordmat) <- c("long","lat")
>> areanames <- paste("SD",1:91,sep="")
>> rownames(mycoordmat) <- areanames
>> sum(duplicated(mycoordmat))             # 7
>> mydata_sp <- SpatialPoints(mycoordmat, proj4string=myCRS)
>> coords <- coordinates(mydata_sp)
>>
>> 3. I created a distance based nbd:
>> mydata_nb1 <- knn2nb(knearneigh(coords, k=1, longlat=TRUE))
>> all.linked <- max(unlist(nbdists(mydata_nb1, coords, longlat=TRUE)))
>> mydata_nb <- dnearneigh(coords, 0, all.linked)
>
> What is all.linked - is it a large distance (in km)? What does
> summary(mydata_nb) say? Are most points connected to each other?
>
>>
>> 4. I created spatial weights for the nb object:
>> # approach A: default style W:
>> mydata_lw_W <- nb2listw(mydata_nb, style="W")
>>
>> # approach A: binary style B:
>> mydata_lw_B <- nb2listw(mydata_nb, style="B")
>>
>> # approach C: weights proportional to inverse distance btw. points representing the areas
>> # (believing that the strength of the neighbour relationship, i.e. (??) autocorrelation attenuates with distance):
>> dsts <- nbdists(mydata_nb, coords, longlat=TRUE)
>> idw <- lapply(dsts, function(x) 1/x)
>> mydata_lw_idwW <- nb2listw(mydata_nb, glist=idw, style="W")
>> summary(unlist(mydata_lw_idwW$weights))
>>
>> 5. Here are the resulting p-values from the global Moran tests:
>> # approach A:
>> moran.test(mydata.agg$yr1to9avg, listw=irrig_lw_W)
>> # Moran's I test under randomisation: p-value = 0.9967
>>
>> # approach B:
>> moran.test(mydata.agg$yr1to9avg, listw=irrig_lw_B)
>> # Moran's I test under randomisation: p-value = 0.9968
>>
>
> The two above are actually significant negative autocorrelation, if you
> change alternative to other than "greater" - the default. I think that
> almost all observations are connected to each other, leading to this
> effect.
>
>> # approach C:
>> moran.test(mydata.agg$yr1to9avg, listw=irrig_lw_idwW)
>> # Moran's I test under randomisation: p-value < 2.2e-16
>>
>> Now my questions:
>>
>> Approach C seems to suggests strong evidence for rejecting the null of
>> no spatial autocorrelation, whereas the other two approaches show
>> absolutely no evidence for rejecting the null of no spatial
>> autocorrelation. Bivand et al. repeatedly warn that apparent spatial
>> autocorrelation may actually stem from misspecification of the model,
>> and I am wondering whether this is the case here. I do not have any
>> auxiliary variables such as spatially patterned ecological variables
>> that I could include in my model for the mean function. I can therefore
>> only really alter the choice of neighbour relationships and spatial
>> weights.
>
> No, see above.
>
>>
>> With regard to the neighbourhood relationships: I decided to choose
>> distance based neighbours such that all areas have at least one
>> neighbour (via all.linked). Contiguity-based or grid-based neighbours
>> are not appropriate for my data; with graph-based neighbours I run into
>> troubles because 7 of my coordinates are duplicated. Knn or higher order
>> neighbours ran into issues when I tried to implement them. In light of
>> these observations and my data, is my distance-based approach sensible?
>
> Beware of inverse distance when points are duplicated, the effects may be
> perverse. Either aggregate the duplicated points, or disambiguate them -
> move them apart. Your determination of position is likely to incur
> measurement error when the geocoder defaults to an approximation. Can you
> get more precise coordinates by zooming in on for example Google Earth,
> marking points, and dumping to KML? The KML can be read with readOGR() in
> rgdal. With disambiguated points, you can use graph-based neighbours.
>
> Hope this helps,
>
> Roger
>
>>
>> With regard to the assignment of spatial weights: I would have believed
>> that it is sensible to assume that the strength of potential spatial
>> autocorrelation in yields attenuates with distance between the areas
>> under consideration. I therefore favour approach C conceptually. However
>> Bivand et al. warn that if we know little about the assumed spatial
>> process we should avoid moving far from the binary representation of a
>> wight of unity for neighbours, which is what styles W and B (approaches
>> A and B) both implement as far as I understand? In light of these
>> observations and given the test results, which approach seems to be most
>> appropriate in my case?
>>
>> I would be extremely grateful for any advice you could give me.
>> Thank you!
>>
>> Julia
>>
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>
> --
> Roger Bivand
> Department of Economics, NHH Norwegian School of Economics,
> Helleveien 30, N-5045 Bergen, Norway.
> voice: +47 55 95 93 55; fax +47 55 95 95 43
> e-mail: Roger.Bivand at nhh.no
>
>

--
Roger Bivand
Department of Economics, NHH Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no

_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo