[R] SPDEP Package, neighbours list for Moran's I on large grid dataset
Ben Haller
rhelp at sticksoftware.com
Wed Apr 20 15:05:20 CEST 2011
Laurent Jégou wrote:
> Hello list members, i'd like to calculate the Moran I on a large dataset,
> a 8640x3432 grid of values.
>
> When i try to create the neighbours list with the cell2nb function, on
> such a scale, R works for several hours and seems to crash. I'm using the last
> version (2.13), 64 bits, on a mac pro with 4go of memory.
>
> I think that i'm doing it wrong :-)
>
> I'd like to use a "moving window" weight matrix instead of a full scale
> one, but i was not lucky enough to find a solution in the docs.
I confronted this problem recently, and decided to roll my own. For a large dataset, an exhaustive computation of Moran's I is very time-consuming, as you say; but an estimate of it, based on a subset of pairs chosen randomly, seemed (for my data, at least) to converge quite nicely. Here's my code. It assumes a square grid, so you'll need to adapt it for your non-square grid, but that should be trivial. To determine the right number of pairs for a good estimate in my application, I called this function with various numbers of pairs, and plotted the estimate produced as a function of the number of pairs used, and observed good convergence by 200,000 pairs. You will probably want to do this yourself, for your dataset. 200,000 pairs took just a second or two to run, on my machine, so this is much, much faster than an exhaustive calculation. As a bonus, this code also gives you Geary's C. Oh, and the code below assumes a wraparound lattice (i.e. a torus, i.e. periodic boundaries), which may not be what you want; just get rid of the d_wraparound stuff and the following pmax() call if you want non-wraparound boundaries, and that should work fine, I think. Not sure what you mean by the moving window thing, so I leave that as an exercise for the reader. :->
I've never made an R package, and I'm not sure there's much demand for this code (is there?), so I'm presently unlikely to package it up. However, if someone who owns a package related to this problem would like to adopt this code, generalize it to non-square lattices, add a flag to choose periodic or non-periodic boundaries, etc., feel free to do so; I hereby place it in the public domain. I'd just like to hear about it if someone does this, and receive credit somewhere, that's all. I'm not super-good in R, either, so if there's a better way to do some things, like the somewhat hacked random index generation (trying to avoid floating point issues when generating a random integer), please let me know. I'm always interested in learning how to do things better.
And of course this code is provided without warranty, may have bugs, etc.
Enjoy!
Ben Haller
McGill University
correlation_stats <- function(p, n_pairs=200000)
{
# Compute Moran's I and Geary's C for the landscape p. This is tricky because the exact calculation
# would be far too time-consuming, as it involves comparison of every point to every other point.
# So I am trying to estimate it here with a small subset of all possible point comparisons.
p_size <- NROW(p) # 512
p_length <- length(p) # 262144
mean_p <- mean(p)
pv <- as.vector(p)
# select points and look up their values; for speed, points are selected by their vector index
p1ix <- floor(runif(n_pairs, min=0, max=p_length - 0.001))
p1x <- (p1ix %/% p_size) + 1
p1y <- (p1ix %% p_size) + 1
p1ix <- p1ix + 1
p2ix <- floor(runif(n_pairs, min=0, max=p_length - 0.001))
p2x <- (p2ix %/% p_size) + 1
p2y <- (p2ix %% p_size) + 1
p2ix <- p2ix + 1
v1 <- pv[p1ix] - mean_p
v2 <- pv[p2ix] - mean_p
# Calculate distances between point pairs, both directly and wrapping around
# The average direct distance is much smaller than the average wraparound distance,
# because points near the center vertically have a maximum direct distance near 256,
# but a minimum wraparound distance near 256. The Moran's I estimate from wraparound
# distances is different, as a result. Rupert recommends taking the shorter of the
# two distances, whichever it is, because that keeps us within the meaningful scale
# of our space, before it just starts repeating due to periodicity.
d_direct <- 1 / sqrt(((p1x - p2x) ^ 2) + ((p1y - p2y) ^ 2))
d_direct[is.infinite(d_direct)] <- 0
d_wraparound <- 1 / sqrt(((p1x - p2x) ^ 2) + ((p_size - abs(p1y - p2y)) ^ 2))
d_wraparound[is.infinite(d_wraparound)] <- 0
d <- pmax(d_direct, d_wraparound) # max because we want the min distance, and these are 1/distance
# precalculate some shared terms
sum_d <- sum(d)
sum_v1_sq <- sum(v1 ^ 2)
# Moran's I: -1 is perfect dispersion, 1 is perfect correlation, 0 is a random spatial pattern
M_I <- (n_pairs / sum_d) * (sum(d * v1 * v2) / ((sum_v1_sq + sum(v2 ^ 2)) / 2))
# (n_pairs / sum(d_direct)) * (sum(d_direct * v1 * v2) / ((sum(v1 ^ 2) + sum(v2 ^ 2)) / 2))
# (n_pairs / sum(d_wraparound)) * (sum(d_wraparound * v1 * v2) / ((sum(v1 ^ 2) + sum(v2 ^ 2)) / 2))
# Geary's C: 0 is positive spatial autocorrelation, 2 is negative spatial autocorrelation, 1 is a random spatial pattern
G_C <- ((n_pairs - 1) * sum(d * ((v1 - v2) ^ 2))) / (2 * sum_d * sum_v1_sq)
return(list(Moran_I=M_I, Geary_C=G_C))
}
More information about the R-help
mailing list