[R-sig-Geo] spdep neighbor generation and subsequent regression analysis

Sun Nov 15 20:22:23 CET 2009

Hello:
At first sight, this is about read.gal[2,3] and read.gwt2nb, but in the 
long run, it is about strategies for working with very large datasets.
Here is the background. With Robert Hijmans' support, we generated a 
comprehensive database of world-wide greenhouse gas emissions (GHGs) 
and a wide range of explanatory variables. The point file now contains 
(depending on source) between 1.4 and 2.1 million locations, all on a 
0.1 degree grid. We would like to run a bunch of spatial regression 
models on this very large dataset. In the end, we would like to 
determine which (set of) variable(s) have what kind of effect on GHGs in 
what part of the world. The variables are physical, economic, 
demographic, and geographic (e.g. distance from ocean) in nature.
This procedure usually starts with creating a spatial weights matrix, 
which we tried in R, but this led to a seemingly endless process (we tried 
it repeatedly on machines with 4 GB RAM and Xeon processors; it did not 
bail, just kept running at about 50% CPU time using between 300 and 2100 
MB of memory for more than a week until we killed the process).
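One idea that might sidestep a slow distance-based search: since all points sit on a regular 0.1 degree grid, neighbours could in principle be found by integer cell lookup, which needs one pass over the points instead of pairwise comparisons. The following is only a minimal sketch in Python, not what we ran. The function name `grid_neighbours` and its parameters are hypothetical, it assumes coordinates are exact multiples of the cell size, and it ignores wrap-around at the date line.

```python
from typing import Dict, List, Sequence, Tuple


def grid_neighbours(
    points: Sequence[Tuple[float, float]],
    cell: float = 0.1,
    queen: bool = True,
) -> List[List[int]]:
    """Return, for each point, the sorted indices of its adjacent grid points.

    Assumption: every (lon, lat) pair lies on a regular grid with spacing `cell`.
    """
    # Convert coordinates to integer cell indices; round() absorbs
    # small floating-point noise such as 0.30000000000000004.
    keys = [(round(lon / cell), round(lat / cell)) for lon, lat in points]

    # Hash table from cell index to position in `points`.
    index: Dict[Tuple[int, int], int] = {k: i for i, k in enumerate(keys)}

    if queen:
        # Eight surrounding cells (edges and corners).
        offsets = [(dx, dy) for dx in (-1, 0, 1) for dy in (-1, 0, 1)
                   if (dx, dy) != (0, 0)]
    else:
        # Rook contiguity: the four cells sharing an edge.
        offsets = [(-1, 0), (1, 0), (0, -1), (0, 1)]

    neighbours: List[List[int]] = []
    for cx, cy in keys:
        found = [index[(cx + dx, cy + dy)] for dx, dy in offsets
                 if (cx + dx, cy + dy) in index]
        neighbours.append(sorted(found))
    return neighbours
```

Points with no occupied adjacent cell simply get an empty list, so the number of isolated locations can be counted before any weights object is built.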
GeoDA ran for about six hours and then produced a file with a good 5 
million records, 99% of which contained zero neighbors. This is where 
the immediate question comes into play. The read.gal function did not 
like the file produced by GeoDA. There is some GeoDA documentation that 
suggests that we should use read.gal2 or read.gal3 but these are not 
part of the spdep distribution, nor could I find them anywhere. As it 
happens, the file generated had a .gwt extension, so I tried 
read.gwt2nb. It seemed to accept the input but then completely killed 
the whole R process (I kept screenshots just for Roger). My guess is 
that (a) the matrix was too big, or (b) it was too sparse, or (c) it was 
a corrupt product of GeoDA in the first place.
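To tell these three guesses apart, the file could be summarised in a streaming pass before handing it to R. The sketch below is in Python and rests on an assumption about the layout: a header line whose second field is the number of observations, followed by one "from to distance" record per line. The function name `summarise_gwt` is hypothetical, and the layout should be checked against the actual file.

```python
from collections import Counter
from typing import Dict, Iterable, Optional


def summarise_gwt(lines: Iterable[str]) -> Dict[str, Optional[int]]:
    """Count links per origin ID without loading the whole file into memory.

    Assumed layout: first line is a header whose second field is the
    declared number of observations; every other non-empty line is
    'from_id to_id distance'.
    """
    it = iter(lines)
    header = next(it).split()
    declared = int(header[1]) if len(header) > 1 else None

    links: Counter = Counter()
    malformed = 0
    for line in it:
        parts = line.split()
        if not parts:
            continue  # skip blank lines
        if len(parts) != 3:
            malformed += 1
            continue
        try:
            float(parts[2])  # the distance must be numeric
        except ValueError:
            malformed += 1
            continue
        links[parts[0]] += 1

    return {
        "declared": declared,             # observations claimed by the header
        "with_links": len(links),         # distinct origin IDs that have links
        "n_links": sum(links.values()),   # total well-formed link records
        "malformed": malformed,           # records that did not parse
    }
```

Passing an open file handle (for example `with open("weights.gwt") as fh: summarise_gwt(fh)`, where the file name is a placeholder) reads one line at a time. A large gap between `declared` and `with_links` would point to sparsity, and a nonzero `malformed` count to a corrupt file.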
Which brings me back to the bigger picture and the following questions:
1) Is there something inherently wrong with our approach?
2) Can anybody think of alternative ways to create a spatial regression 
model for the above-mentioned questions?
3) Would it be worthwhile to move to a Linux machine and recompile all 
the different packages?
Cheers,
     Jochen