[R-sig-Geo] Local Moran p-values in Geoda, Python, and R
Roger Bivand
Roger.Bivand at nhh.no
Thu Mar 12 21:37:52 CET 2015
On Thu, 12 Mar 2015, William May wrote:
> Here's the R LISA map:
> http://willonrails.com/images/Lisa_map_R.png
>
> And here's the Geoda version:
> http://willonrails.com/images/Lisa_map_geoda.png
>
> I'm using Geoda 1.6.6. Since we were getting inconsistent answers
> between R and Geoda, the professor ran the analysis in Python to
> get a third opinion. I don't have the Python version of the map
> on me, but it's very similar to Geoda's.
>
> I was able to export Geoda's local I and p-values. I put up a zip file
> with all the data, R script, the maps, and the spatial weights files
> for Geoda:
> willonrails.com/images/lisa_spdep.zip
>
> The local Moran's I in R and Geoda are very similar -- on average they
> only differ by 0.000290044.
>
> The p-values, on average, differ by 0.1808931.
>
> In both I used 10 nearest neighbors for the weights. I did create the
> weights separately in R and Geoda. The Geoda weights are in the zip
> file. I'm new to coordinate projections, but here's the relevant part
> of the R script:
OK, so time for Cartography 010: your input data are in unprojected,
geographical coordinates, so you cannot measure the 10 nearest neighbours
using Euclidean distance, but rather using Great Circle distance. This
switches the neighbours of about 2000 of your 3000 counties. This isn't
the cause of the discrepancy.
Doing hist(nat2$LISA_P) shows that it only extends from 0 to 0.5, so
obviously is not the same as the values generated by spdep::localmoran (in
the 0 to 1 range), or even similar. In fact, it is hard to see what has
been done, and over 1500 counties are Pr < 0.05. Perhaps the upper half
has been folded down to the lower half?
So for GeoDa, we'd need to establish 1) are the probabilities two-sided or
greater than (Ii > E(Ii)); 2) have probabilities > 0.5 been folded back?
The functions in spdep use the "greater" alternative by default, and
"two.sided" is seldom used, but perhaps should be. Using the
p.adjust.method= argument (for example "bonferroni" or "fdr") and
accommodating only multiple comparisons among neighbours shifts most
p-values into insignificant ranges.
I've been looking at:
library(spdep)
library(maptools)
nat <- readShapeSpatial("NAT.SHP", ID="FIPSNO")
coords <- coordinates(nat)
IDs <- row.names(nat)
nat_10nnb_ll <- knn2nb(knearneigh(coords, k=10, longlat=TRUE),
row.names=IDs)
lisa_10nnb_ll <- localmoran(nat$HR60, nb2listw(nat_10nnb_ll, style="W"),
alternative="two.sided", p.adjust.method="fdr")
hist(lisa_10nnb_ll[,5])
where only a very few counties have a significant level of local spatial
autocorrelation at a 5% cutoff (295, still 10% of the total). Your
GeoDa results gave 1566 of the folded p-values < 5%, half of your
counties.
nat$PR_Ii <- lisa_10nnb_ll[,5]
spplot(nat, "PR_Ii", at=c(0,0.05,0.95,1+.Machine$double.eps),
col.regions=cm.colors(3), main="FDR-adjusted two.sided Ii")
shows a simplified map of the probabilities, dominated by counties not
showing local spatial autocorrelation.
It will also be easier to move the R nb object to GeoDa as a GAL file, the
ordering of the GWT file seems odd.
There are two main issues. One is that GeoDa uses permutations to generate
probabilities - there is no agreed position on this, and the references I
gave earlier suggest that permutating values for counties far from each
other may create a false impression of local similarity, hence positive
spatial autocorrelation. The second is the very high count of
"significant" results with your data - what is a hot spot when half of the
counties are hot?
One reason why permuting over all but i may be misleading is if many
values of the variable of interest are in one tail - HR60 is highly
skewed, with a minimum of 0, a maximum of 92, and an upper hinge of 6. So
a local clique of neighbours with values in the 20s will seem very unusual
in the permutation setting.
Finally, the global Moran's I for HR60 is highly significant, and this
indicates anyway that your mean model is mis-specified.
Hope this helps,
Roger
PS. I regularly respond to overenthusiastic usage of measures of this kind
when reviewing journal submissions. LISA are useful, but you do need to
think carefully about what you are actually trying to do with them.
Careful thematic mapping can be a very effective way of pointing up
differences in levels of a variable:
library(classInt)
cI <- classIntervals(nat$HR60, n=7, style="fisher")
fcI <- findColours(cI, pal=rev(bpy.colors(7)))
plot(nat, col=fcI, border="transparent")
legend("bottomleft", fill=attr(fcI, "palette"), legend=names(attr(fcI,
"table")), bty="n")
title(main="HR60")
or using a ColorBrewer palette. Of course, I have no idea what HR60
represents, and if it is a rate (homicide rate?), it should probably have
been represented through funnel plot or empirical Bayes methods anyway.
>
> ------------------------------------------------
> # nearest neighbors (nnb)
> coords <- coordinates(nat)
> IDs <- row.names(as(nat, "data.frame"))
> nat_10nnb <- knn2nb(knearneigh(coords, k=10), row.names=IDs)
> ------------------------------------------------
>
> 'nat' is an object of class SpatialPolygonsDataFrame, so if I
> understand correctly I don't need to set the 'longlat' argument in
> knearneigh. I also plotted a random selection of the counties in R,
> along with their neighboring counties from the weights matrix, and
> everything appeared to be working correctly.
>
> I also tried using Geoda's weights in R. Surprisingly, this map is
> even sparser (higher p-values) than the first map I made in R. I'm
> tempted to think I mixed up the data, but the map looks correct
> when I plot all counties regardless of p-value.
>
>
>
>
>
>
>
>
> ----------------------------------------
>> Date: Wed, 11 Mar 2015 19:28:11 +0100
>> From: Roger.Bivand at nhh.no
>> To: williamcmay at live.com
>> CC: r-sig-geo at r-project.org
>> Subject: Re: [R-sig-Geo] Local Moran p-values in Geoda, Python, and R
>>
>> On Wed, 11 Mar 2015, William May wrote:
>>
>>> I sent this to Roger Bivand earlier. Posting it here for
>>> posterity, and in case other people have more to add:
>>>
>>> ---------------------------------------------------------------------------
>>> Hi Roger,
>>>
>>> I'm taking a spatial econometrics class, and we've been making some
>>> LISA maps. We noticed that the p-values from the spdep package seem to
>>> be a lot different than the results from Geoda or from Python's PySAL
>>> package.
>>>
>>> I looked through the function definitions and noticed that Python, and
>>> I think Geoda, use simulations to estimate the p-value, while your
>>> localmoran function uses an equation.
>>>
>>> I attached the graphs from Geoda and from R, and my R code and the
>>> original dataset. [R code is added below.] Is this kind of
>>> difference normal?
>>
>> If you would put the figures on a website, we could compare them. One
>> possible reason for differences is the alternative= argument to
>> localmoran(). It's default is "greater", but you could try "two.sided" to
>> see whether this gives values closer to GeoDa. You don't say, by the way,
>> which version of GeoDa you are using (since you mention Python, probably
>> not legacy GeoDa), nor whether you can get the results out in tabular
>> form. Are you using the same spatial weights - both can read and write GAL
>> files. Are you aware that knearneigh() takes a longlat= argument, which
>> must be used if the coordinates are not projected. The p-values in
>> localmoran() are analytical ones from: Anselin, L. 1995. Local indicators
>> of spatial association, Geographical Analysis, 27, 93--115. As you
>> mention, in GeoDa they are by leave-one-out permutation bootstrapping.
>>
>> Roger
>>
>>>
>>> Is there a reason to prefer spdep's results over the Geoda results (or
>>> vice versa)?
>>>
>>> Thanks for any help,
>>>
>>> Will
>>> ---------------------------------------------------------------------------
>>>
>>>
>>> Here's his response:
>>>
>>> ---------------------------------------------------------------------------
>>> Please *do* write to the R-sig-geo list rather than to me directly -
>>> others can answer your question as well, perhaps better, and in a more
>>> timely way than I can. In addition, threads in the list can be
>>> searched in the archives, so others can avoid the same problem later.
>>>
>>> The only reliable test is localmoran.exact(), followed by
>>> localmoran.sad(), for reasons given in the references in their help
>>> pages. If you test more than one observation, remember to adjust the
>>> p-value for multiple comparisons. Sampling _i (all but i)
>>> overrepresents non-neighbours of i for a test on i. The formulae used
>>> are from the original paper and most often are similar to the
>>> permutation bootstrap values.
>>>
>>> Crucially, almost all use of local Moran's I (and local Geary and
>>> Getis) fall foul of the autocorrelation being related to a
>>> mis-specified mean model (omitted explanatory variables and/or omitted
>>> adjustment for global autocorrelation). Look at the references to
>>> localmoran.sad and localmoran.exact for more details, and at the
>>> relevant parts of Waller & Gotway (2004), Schabenberger & Gotway
>>> (2005), and in Bivand et al. (2008, 2013, ASDAR-book, ch. 9).
>>>
>>> LISA maps are extremely misleading when the mean model is
>>> mis-specified, and when the p-values are not adjusted for multiple
>>> comparisons. They look simple, but because adjustment for multiple
>>> comparisons is a subjective choice, you can almost choose the map you
>>> want.
>>>
>>> Roger
>>> ---------------------------------------------------------------------------
>>>
>>>
>>> And here's my R code:
>>>
>>> ---------------------------------------------------------------------------
>>> # make a LISA cluster map
>>>
>>> library(maptools)
>>> library(spdep)
>>>
>>> setwd('/home/will/classes/polisci_610/1/')
>>>
>>> # get the data
>>> nat <- readShapeSpatial("NAT.SHP", ID="FIPSNO")
>>> nat.data<-data.frame(nat)
>>> attach(nat.data)
>>>
>>> # nearest neighbors (nnb)
>>> coords<-coordinates(nat)
>>> IDs<-row.names(as(nat, "data.frame"))
>>> nat_10nnb<-knn2nb(knearneigh(coords, k=10), row.names=IDs)
>>>
>>> # "W" identifies row standardized weights
>>> nat_10nnb_w <- nb2listw(nat_10nnb, style="W")
>>>
>>> # can preset list_w object; if do this, remember to change if needed
>>> list_w <- nat_10nnb_w
>>>
>>> # LISA values
>>> lisa <- localmoran(HR60, list_w, zero.policy = T)
>>>
>>> # centers the variable of interest around its mean
>>> cDV <- HR60 - mean(HR60)
>>>
>>> # centers the local Moran's around the mean
>>> mI <- lisa[, 1]
>>> C_mI <- mI - mean(mI) # but we don't want to center it! Only the sign
>>> # matters.
>>>
>>> quadrant <- vector(mode="numeric",length=nrow(lisa))
>>> quadrant[cDV>0 & mI>0] <- 1
>>> quadrant[cDV <0 & mI>0] <- 2
>>> quadrant[cDV>0 & mI<0] <- 3
>>> quadrant[cDV <0 & mI<0] <- 4
>>>
>>> # set a statistical significance level for the local Moran's
>>> signif <- 0.05
>>>
>>> # places non-significant Moran's in the category "5"
>>> quadrant[lisa[, 5]> signif] <- 5
>>>
>>>
>>> # map
>>> png(file="Lisa_map_R.png", width = 680, res = 100)
>>> colors <- c("red", "blue", "lightpink", "skyblue2", rgb(.95, .95, .95))
>>> par(mar=c(0,0,1,0)) # sets margin parameters for plot space
>>> plot(nat, border="grey", col=colors[quadrant],
>>> main = "LISA Cluster Map, 1960 Homicides")
>>> legend("bottomright",legend=c("high-high","low-low","high-low","low-high"),
>>> fill=colors,bty="n",cex=0.7,y.intersp=1,x.intersp=1)
>>> dev.off()
>>> ---------------------------------------------------------------------------
>>>
>>> _______________________________________________
>>> R-sig-Geo mailing list
>>> R-sig-Geo at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>
>>
>> --
>> Roger Bivand
>> Department of Economics, Norwegian School of Economics,
>> Helleveien 30, N-5045 Bergen, Norway.
>> voice: +47 55 95 93 55; fax +47 55 95 91 00
>> e-mail: Roger.Bivand at nhh.no
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 91 00
e-mail: Roger.Bivand at nhh.no
More information about the R-sig-Geo
mailing list