[R-sig-Geo] Memory problems with dnearneigh in spdep -- related data questions

Empty Empty phytophthorasb at yahoo.com
Fri Apr 12 23:29:36 CEST 2013

Thanks again, Roger. I will work on the rasters and learning the focal method for this.
In the interest of learning for beyond this particular analysis, I'm a bit confused about what I'm seeing in the data summaries and how to understand it. I realize understanding one's data is extremely fundamental, sorry.

The data I was using was from a points file generated in ArcGIS. I have another older Arc-generated points file of the same data that is not projected and proj4string NA, but everything else is the same
When I took an older raster of the data that I have and converted it to points in R (package: raster),
I get the following summary

Object of class SpatialPointsDataFrame
Coordinates:
min      max
x -17.52090 51.40407
y -34.82918 37.53746
Is projected: NA
proj4string : [NA]
Number of points: 36820887
Data attributes:
Min.   1st Qu.    Median      Mean   3rd Qu.      Max.
0.00      0.28      2.57     26.43     10.51 103100.00

So it is the same, except it isn't projected and it doesn't have that entry, NA's. Where the ones that come from ArcGIS have a very high number of NA's (all but 1 million of the points - 35m out of 36m).
Again it's all from a 1k raster.

1) Any ideas why the difference in NA's and if this might be part of my problem? And in general, what would one check for in the data to avoid this problem (if it is a problem)?
2) In the projected data set, it clearly says units=m (inside the proj4string entry), so if I do a distance threshold can I assume it is calculating in meters? If it isn't projected, is the default then lat lon and kilometers? In general, if I'm looking at a summary of data, how do I know in what units it is going to be interpreting a distance threshold (or something similar for any other function)?

Some weeks ago, I successfully ran the G* with dnearneigh on a subset of the data - just Ethiopia (the points file was generated from ArcGIS). The distance threshold I used was 30 (honestly I was just trying to see if it would run, there was no theoretical reason for 30). The output made sense and was similar to what my colleague got in ArcGIS for Ethiopia.

Summary of data:
Object of class SpatialPointsDataFrame
Coordinates:
min     max
coords.x1 -162966.9 1492490
coords.x2  377203.2 1642482
Is projected: TRUE
proj4string :
[+proj=utm +zone=37 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0]
Number of points: 1335221
Data attributes:
OBJECTID          POINTID    GRID_CODE
Min.   :      1   Min.   :0   Min.   :    0.00
1st Qu.: 333806   1st Qu.:0   1st Qu.:    5.32
Median : 667611   Median :0   Median :   15.25
Mean   : 667611   Mean   :0   Mean   :   59.38
3rd Qu.:1001416   3rd Qu.:0   3rd Qu.:   69.10
Max.   :1335221   Max.   :0   Max.   :72526.11

3) This data set doesn't have the NA's and it worked. Hmmm. But the units are in meters and I used a distance threshold of 30. For 1km data, shouldn't there have been no neighbors at 30m? Why did this work? or was it calculating kilometers instead?

Thanks for any insight,

Juliann

----- Original Message -----
From: Roger Bivand <Roger.Bivand at nhh.no>
To: Empty Empty <phytophthorasb at yahoo.com>
Cc: "r-sig-geo at r-project.org" <r-sig-geo at r-project.org>
Sent: Wednesday, April 10, 2013 12:10 AM
Subject: Re: [R-sig-Geo] Memory problems with dnearneigh in spdep

On Tue, 9 Apr 2013, Empty Empty wrote:

> Thanks so much, Roger, for your suggestions.
> I should have realized it was taking kilometers rather than meters!
>
> My summary(r) is:
> Object of class SpatialPointsDataFrame
> Coordinates:
>                 min      max
> coords.x1 -17.52090 51.40407
> coords.x2 -34.82918 37.53746
> Is projected: TRUE proj4string :
> [+proj=aea +lat_1=20 +lat_2=-23 +lat_0=0 +lon_0=25 +x_0=0 +y_0=0 +ellps=WGS84
> +datum=WGS84 +units=m +no_defs +towgs84=0,0,0]

Given your object summary, you need to check where your data came from. The bounding box is for geographical coordinates, but the declared coordinate reference system is projected in units of metres. So your 9.333 is 9.333m, and no neighbours will be found. The inclusion of NAs in gridded data represented as points is unnecessary, and all points not on land should be dropped before analysis begins.

I do suggest moving to a raster representation, and using focal methods in the raster package to generate the separate components of equation 14.3, p. 263-264 in Gettis & Ord (1996). Using focal methods defines the moving window as a matrix, here 10x10, moved over the raster and circumventing the creation of a weights object - create a new raster layer with \sum{j} w_{ij}(d) x_j values. W_i^* will be a constant, as will \bar{x}^*, and possibly the other starred terms too.

Hope this helps,

Roger

> Number of points: 36820887
> Data attributes:
>     POINTID           GRID_CODE         Min.   :      1    Min.   :     0.00   1st Qu.: 250000    1st Qu.:     0.28   Median : 500000    Median :     2.57   Mean   : 500000    Mean   :    26.43   3rd Qu.: 750000    3rd Qu.:    10.51   Max.   : 999999    Max.   :103104.70   NA's   :35820888
>
> It seems that there are a lot of NA's. (Everything in the square outside the continent's borders should be an NA, so maybe that's a reasonable number). It seemed right when I plotted it earlier.
>
> I ran it with 9.333 as the distance band (then I was going to try 1.5 next).
> nb<- dnearneigh(r, 0, 9.333)
>
> It used a constant amount of memory for the first ~90 minutes and then it increased at a constant rate for the next 20 hours
> until it was using virtually all the memory on the server and was stopped.
> Because the administrator had to kill my job, I wasn't able to get session info while the session was open. I'm not sure this is at all helpful, but when I re-started R this is the sessionInfo() :
> R version 2.15.3 (2013-03-01)
> Platform: x86_64-pc-linux-gnu (64-bit)
>
> locale:
>  [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C         LC_TIME=C            LC_COLLATE=C
>  [5] LC_MONETARY=C        LC_MESSAGES=C        LC_PAPER=C           LC_NAME=C
>  [9] LC_ADDRESS=C         LC_TELEPHONE=C       LC_MEASUREMENT=C     LC_IDENTIFICATION=C
> attached base packages:
> [1] stats     graphics  grDevices utils     datasets  methods   base
>
> loaded via a namespace (and not attached):
> [1] tools_2.15.3
>
> Thanks,
> Juliann
>
>
> ________________________________
> From: Roger Bivand <Roger.Bivand at nhh.no>
> To: Empty Empty <phytophthorasb at yahoo.com> Cc: "r-sig-geo at r-project.org" <r-sig-geo at r-project.org> Sent: Sunday, April 7, 2013 12:15 PM
> Subject: Re: [R-sig-Geo] Memory problems with dnearneigh in spdep
>
> I do prefer an affiliation, a real name, and a motivation; a
> yahoo or gmail address tells me nothing about what you might be expected
> to know. If I (or a list helper) cannot "place" you, it is much harder to
> know why you are having problems.
>
> You are also in breach of list rules because you are posting in HTML rather than just plain text. Removing the HTML will shorten your posting radically.
>
>
> On Sun, 7 Apr 2013, Empty Empty wrote:
>
>> Hi,
>>
>> I'm running into memory problems calculating Gi* using dnearneigh and wonder if someone has any suggestions for either my script or another work around.
>>
>> I have 1km data for Africa. My colleague calculated the appropriate distance band (9333m) using incremental spatial autocorrelation in ArcGIS but she was only able to run a subset of the data or with the data aggregated to 5km due to memory issues (fixed distance, Euclidean). I tried it in both ArcGIS (memory error after 45 hours) and R (Within 2 hours it was using more than 245GB on the server and then crashed). This is what I used
>
> Please always include the output of sessionInfo(). Here also include summary(r) output.
>
>>
>> library(spdep)
>>
>>
>>
>> coords<-coordinates(r)
>
> This is not needed.
>
>> nb9333<- dnearneigh(coords, 0,9333,longlat = F)
>
> nb9333<- dnearneigh(r, 0, 9333)
>
> would work, and would not override longlat=. I assume in fact that r is in longlat, and that you would then mean to use a km threshold of 9.333. If I'm right, then the memory usage is being complicated by your including all observations as neighbours of each observation.
>
> For a 1km grid, a threshold of 9333m means that each of the about 30M gridcells has 100 neighbours (rather strong smoothing). Is the threshold sensibly chosen? The neighbour object will be 30M by 100 by 4, 1.21e+10 (12G); the labels of the neighbour object at least the same; the weights 24G in addition, so nb9333.listw is about 50G. With a more sensible threshold (1.5km?), the sizes of the nb9333.listw weights object become more tractable, maybe about 5G.
>
> Is any of the data observed on this grid? For localG* (or localG) to make any sense, the variable must be observed at this resolution (satellite imagery?) - if interpolated, all bets are off.
>
> Finally, the output involves so many multiple comparisons that no statistical inference is possible under normal assumptions - where is the 3.886 coming from - is this an attempt to adjust for multiple comparisons?
>
> I'm unsure why anyone would do this, as you could use standard filtering techniques on imaging data. The raster package also has functions for computing the local Moran and Geary measures on RasterLayer objects.
>
> Hope this clarifies,
>
> Roger
>
>> nb9333.listw <- nb2listw(include.self(nb9333), style="B")???
>> G9333<-localG(r\$GRID_CODE,nb9333.listw)???
>>
>>
>> sig3 <- classIntervals(G9333, n=3, style="fixed",
>> ?????????????????????? fixedBreaks=c(min(G9333),-3.886, 3.886, max(G9333)))???? #I dont' think 3.886 is right
>> cs<-c('blue','white','red') # blue = below -3, white = intermediate, red = above 3
>> sigcol <- findColours(sig3, cs)?
>>
>> png("r.png")
>> plot(r, col=sigcol, axes=TRUE)
>> dev.off()
>>
>>     [[alternative HTML version deleted]]
>>
>>
>
> -- Roger Bivand
> Department of Economics, NHH Norwegian School of Economics,
> Helleveien 30, N-5045 Bergen, Norway.
> voice: +47 55 95 93 55; fax +47 55 95 95 43
> e-mail: Roger.Bivand at nhh.no
>

-- Roger Bivand
Department of Economics, NHH Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no