[R-sig-Geo] spdep neighbor generation and subsequent regression analysis

Roger Bivand Roger.Bivand at nhh.no
Mon Nov 16 19:57:53 CET 2009


On Sun, 15 Nov 2009, Jochen Albrecht wrote:

> Hello:
> At first sight, this is about read.gal (or read.gal2/read.gal3) and 
> read.gwt2nb, but in the long run it is about strategies for working with 
> very large datasets.
> Here is the background. With Robert Hijmans' support, we generated a 
> comprehensive database of world-wide greenhouse gas emissions (GHGs) and a 
> wide range of explanatory variables. The point file now contains (depending 
> on source) between 1.4 and 2.1 million locations, all on a 0.1 degree grid. 
> We would like to run a bunch of spatial regression models on this very large 
> dataset. In the end, we would like to determine which (set of) variable(s) 
> have what kind of effect on GHGs in what part of the world. The variables are 
> physical, economic, demographic, and geographic (e.g. distance from ocean) in 
> nature.

If this is like a machine learning problem, why not use such techniques? Do 
you have a realistic spatial process model? I suspect that very many of the 
input variables are themselves interpolated, so spatial dependence at any 
scale will probably be induced by the changes in support prior to analysis. 
The results of such an analysis would (or should) have large standard 
errors, so perhaps would not take you where you want to go. If you cannot 
handle the varying impacts of spatial scale in the data generating 
processes on both the left and right hand sides, any observed residual 
dependence will certainly be spurious (a red herring).

Could you try a small subsample across a natural experiment (a clear 
difference in treatment)? Then the difficulty of generating a large 
weights object would go away. It would also let you examine the error 
propagation/change of support problem - something you need to do if your 
results are to be taken seriously, and which would be intractable with 
many "observations".

If you need to generate neighbours for very large n, please do describe 
the functions used, as there are many ways of doing this:

> This procedure usually starts with creating a spatial weights matrix, which 
> we tried in R, but that led to an endless process (we tried it repeatedly on 
> machines with 4 GB of RAM and Xeon processors; it did not bail out, just 
> kept running at about 50% CPU using between 300 and 2100 MB of memory for 
> more than a week until we killed the process).

actually tells us nothing, as you haven't said how exactly you were doing 
this - presumably using point support and a distance criterion? Is the 
object a SpatialPixels object? Was the distance criterion sensible (see 
the GeoDa failure reported below - perhaps not)?
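For point support on a regular 0.1 degree grid, a distance threshold only 
slightly larger than the grid spacing already gives queen-type neighbours. 
A minimal sketch with fabricated coordinates (the threshold and k are 
assumptions to adapt; with longlat=TRUE the distance bounds would be great 
circle kilometres instead):

library(spdep)
# fabricated 0.1 degree lattice standing in for the real point file
coords <- as.matrix(expand.grid(x = seq(0, 5, by = 0.1),
                                y = seq(0, 5, by = 0.1)))
# distance-based neighbours: d2 must exceed the grid spacing, otherwise
# most points end up with zero neighbours (compare the GeoDa file below)
nb_d <- dnearneigh(coords, d1 = 0, d2 = 0.15)
# k-nearest neighbours guarantee k links per point, so no empty sets
nb_k <- knn2nb(knearneigh(coords, k = 4))
summary(card(nb_d))  # card() counts neighbours per observation

For data that really are a complete regular grid, cell2nb() builds rook or 
queen neighbours directly from the grid dimensions, with no distance search 
at all.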

> GeoDa ran for about six hours and then produced a file with a good 5 million 
> records, 99% of which contained zero neighbors. This is where the immediate 
> question comes into play. The read.gal function did not like the file 
> produced by GeoDa. There is some GeoDa documentation that suggests that we 
> should use read.gal2 or read.gal3, but these are not part of the spdep 
> distribution, nor could I find them anywhere. As it happens, the file 
> generated had a .gwt extension, so I tried read.gwt2nb. It seemed to accept 
> the input but then completely killed the whole R process (I kept screen 
> shots just for Roger). My guess is that (a) the matrix was too big, (b) it 
> was too sparse, or (c) it was a corrupt product of GeoDa in the first place.

Most likely that.
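If the file itself is suspect, check what read.gwt2nb() actually returns 
before trying to build weights from it - a sketch, where the file name and 
region ids are hypothetical and must match the key GeoDa wrote into the 
GWT header:

library(spdep)
ids <- as.character(seq_len(1400000))     # assumed integer keys 1:n
nb <- read.gwt2nb("ghg_points.gwt", region.id = ids)  # hypothetical file
table(card(nb) == 0)   # TRUE counts observations with no neighbours
# nb2listw() refuses empty neighbour sets unless zero.policy = TRUE
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)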

> Which brings me back to the bigger picture and the following questions:
> 1) Is there something inherently wrong with our approach?

See above.

> 2) Can anybody think of alternative ways to create a spatial regression model 
> for the above mentioned questions?

It can be done, but once you have control of the scale and process issues, 
there is nothing to stop you subsampling. If you go with a 1 degree grid, 
you shouldn't have trouble fitting a model (about 15000 on-land cells), 
though that may still be largish for applying, say, Bayesian Model 
Averaging, which might give you a feel for which variables are in play.
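As a sketch of getting the data to that scale with Robert Hijmans' raster 
package (object names fabricated, and fun = mean is only one possible 
aggregation choice; the change of support caveats above still apply):

library(raster)
r01 <- raster(nrows = 1800, ncols = 3600)    # global 0.1 degree grid
values(r01) <- runif(ncell(r01))             # fabricated values
r1 <- aggregate(r01, fact = 10, fun = mean)  # 10x10 blocks -> 1 degree

The resulting ~15000 on-land cells would then be of a size that model 
fitting functions such as errorsarlm() in spdep can handle.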

> 3) Would it be worthwhile to move onto a Linux machine and recompile all the 
> different packages?

When working with larger data sets, 64-bit Linux or OS X is still 
currently more viable than Windows, I believe.

Hope this helps,

Roger

> Cheers,
>    Jochen

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no


