[R-sig-Geo] spdep neighbor generation and subsequent regression analysis
Jochen Albrecht
jochen at hunter.cuny.edu
Sun Nov 15 20:22:23 CET 2009
Hello:
At first sight, this is about read.gal[2,3] and read.gwt2nb, but in the
long run, it is about strategies for working with very large datasets.
Here is the background. With Robert Hijmans' support, we generated a
comprehensive database of world-wide greenhouse gas emissions (GHGs)
and a wide range of explanatory variables. The point file now contains
(depending on source) between 1.4 and 2.1 million locations, all on a
0.1 degree grid. We would like to run a bunch of spatial regression
models on this very large dataset. In the end, we would like to
determine which (set of) variable(s) have what kind of effect on GHGs in
what part of the world. The variables are physical, economic,
demographic, and geographic (e.g. distance from ocean) in nature.
This procedure usually starts with creating a spatial weights matrix,
which we tried in R, but it led to a seemingly endless process (we tried
it repeatedly on machines with 4 GB RAM and Xeon processors; it did not
bail, just kept running at about 50% CPU time using between 300 and 2100
MB of memory for more than a week until we killed the process).
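
Because all locations sit on a regular 0.1 degree grid, one possible way to avoid a distance search altogether is to derive neighbors from integer grid indices. The following is only a minimal sketch in Python (not spdep), under several assumptions that are not from the setup described above: the function name `grid_neighbors`, the `origin` parameter, and the choice of rook or queen contiguity are all hypothetical, coordinates are assumed to be exact multiples of the cell size relative to `origin`, and wrap-around at the date line is ignored.

```python
from typing import Dict, List, Sequence, Tuple


def grid_neighbors(
    points: Sequence[Tuple[float, float]],
    cell: float = 0.1,
    origin: Tuple[float, float] = (0.0, 0.0),
    queen: bool = False,
) -> Dict[int, List[int]]:
    """Map each point index to the sorted indices of its grid neighbors.

    Hypothetical helper: assumes every (lon, lat) lies on a regular grid
    with spacing `cell`, offset by `origin`. No date-line wrap-around.
    """
    # Turn coordinates into integer (column, row) keys; rounding absorbs
    # floating-point noise such as 0.30000000000000004.
    index: Dict[Tuple[int, int], int] = {}
    for i, (lon, lat) in enumerate(points):
        key = (round((lon - origin[0]) / cell), round((lat - origin[1]) / cell))
        index[key] = i

    # Rook contiguity: shared edges. Queen adds the four diagonal cells.
    offsets = [(1, 0), (-1, 0), (0, 1), (0, -1)]
    if queen:
        offsets += [(1, 1), (1, -1), (-1, 1), (-1, -1)]

    # One dictionary lookup per candidate neighbor, so the work grows
    # linearly with the number of points rather than quadratically.
    neighbors: Dict[int, List[int]] = {}
    for (col, row), i in index.items():
        found = [
            index[(col + dc, row + dr)]
            for dc, dr in offsets
            if (col + dc, row + dr) in index
        ]
        neighbors[i] = sorted(found)
    return neighbors
```

The resulting lists are the same kind of information a neighbor list holds (one vector of neighbor indices per location), so in principle they could be written out and converted, but whether that fits the spdep workflow here is untested.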
GeoDa ran for about six hours and then produced a file with a good 5
million records, 99% of which contained zero neighbors. This is where
the immediate question comes into play. The read.gal function did not
like the file produced by GeoDa. There is some GeoDa documentation that
suggests that we should use read.gal2 or read.gal3, but these are not
part of the spdep distribution, nor could I find them anywhere. As it
happens, the file generated had a .gwt extension, so I tried
read.gwt2nb. It seemed to accept the input but then completely killed
the whole R process (I kept screenshots just for Roger). My guess is
that (a) the matrix was too big, or (b) it was too sparse, or (c) it was
a corrupt product of GeoDa in the first place.
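
To help tell guesses (a) to (c) apart, the file could be inspected without loading it into R. The sketch below streams through the file once and tallies how many whitespace-separated fields each line has. It is a hedged illustration only: the function name `summarize_weights_file` is hypothetical, and the interpretation rests on my understanding (an assumption, not verified against this file) that a GWT body has three fields per line (origin, destination, distance), while a GAL body alternates a two-field line (id, neighbor count) with a line of neighbor ids.

```python
from collections import Counter
from typing import List, Tuple


def summarize_weights_file(path: str) -> Tuple[List[str], Counter]:
    """Return the header tokens and a tally of field counts per body line.

    Hypothetical diagnostic helper: it makes no attempt to validate the
    file, it only reports what the lines look like.
    """
    field_counts: Counter = Counter()
    with open(path, "r", encoding="utf-8", errors="replace") as handle:
        # The first line is treated as the header and reported separately.
        header = handle.readline().split()
        # Reading line by line keeps memory use flat, even for millions
        # of records. Blank lines are tallied under a field count of 0.
        for line in handle:
            field_counts[len(line.split())] += 1
    return header, field_counts
```

If nearly all body lines had three fields, the file would at least look GWT-like; a large share of blank or two-field lines would instead suggest GAL-style content or a damaged export.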
Which brings me back to the bigger picture and the following questions:
1) Is there something inherently wrong with our approach?
2) Can anybody think of alternative ways to create a spatial regression
model for the above-mentioned questions?
3) Would it be worthwhile to move to a Linux machine and recompile all
the different packages?
Cheers,
Jochen