[R-sig-Geo] read.gal() for large data sets
Roger Bivand
Roger.Bivand at nhh.no
Tue Aug 23 19:47:57 CEST 2011
On Tue, 23 Aug 2011, Juta Kawalerowicz wrote:
> Dear List,
>
> I am looking for a general strategy for the following problem. I have a
> large data set with 200 000 rows and 50 variables. Along the lines of
> Anselin's R workbook I have used GeoDa to create weights (a 200 Mb file),
> but when I try to read them into R using
>
> library(spdep)
> weights<-read.gal("weights.gal")
>
> it does not seem to work (or maybe I should wait for more than an
> hour?).
read.gal() is quite complicated inside, because the IDs used may not be
the integers 1:n, so it needs to read the data in and manipulate them a
good deal. I think that you also have very many neighbours: with a 200MB
file and integer IDs in 1:200000 taking about 7 characters each on
average, that works out to roughly 150 neighbours per observation. This
is far from sparse. I suggest generating the neighbour object directly in
R if you can, as the smoothing effect of such a large average number of
neighbours may be very powerful, and may not represent the spatial
process adequately. Depending on what you want to do with the data, you
may prefer a graph-based or kNN approach.
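A minimal sketch of the kNN route in spdep, assuming a two-column
coordinate matrix "coords" (a placeholder name for your point
coordinates) and an arbitrary choice of k = 6:

library(spdep)
knn <- knearneigh(coords, k = 6)   # k nearest neighbours of each point
nb <- knn2nb(knn)                  # convert to an nb neighbour object
lw <- nb2listw(nb, style = "W")    # row-standardised spatial weights
summary(nb)                        # average number of links will be k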
> My computer runs on an i7-2630QM CPU with 4 GB RAM. Any suggestions? In
> principle, could somebody advise me on the strategies for spatial
> analysis on large data sets?
4GB is not large: most newer machines can run in 64-bit mode and handle
much more without trouble, so this sounds like a standard laptop. I
don't think that there is an obvious answer to your question, as
approaches will vary greatly depending on what kind of analysis you want
to do, and at least partly on whether the data are planar or use
geographical (unprojected) coordinates, which forces the use of Great
Circle distances.
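A minimal sketch of that distinction, again assuming a placeholder
coordinate matrix "coords"; dnearneigh() takes its distance bounds in
map units for planar coordinates, and in kilometres along Great Circles
when longlat = TRUE:

library(spdep)
## planar (projected) coordinates: d1/d2 in map units, e.g. metres
nb_planar <- dnearneigh(coords, d1 = 0, d2 = 5000)
## geographical coordinates: Great Circle distances, d1/d2 in km
nb_gc <- dnearneigh(coords, d1 = 0, d2 = 5, longlat = TRUE)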
If your analysis is embarrassingly parallelisable and you have plenty of
memory, you can use all your cores at once; you need more memory because
each core uses the data and in most systems needs its own copy of part of
the data set. One copy of your data as a matrix is about 80MB in the R
workspace, which isn't large as such; the "lm" object from regressing one
column on the others is about 200MB, but can be made smaller. Whether,
say, Moran's I of 200000 observations tells you a great deal is another
matter; it depends on the problem you are analysing and how you have set
out your model.
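A minimal sketch of that memory arithmetic, using a simulated stand-in
"dat" for the real data; model = FALSE, x = FALSE and y = FALSE are
standard lm() arguments that stop the fitted object storing the model
frame, design matrix and response:

dat <- matrix(rnorm(200000 * 50), ncol = 50)
print(object.size(dat), units = "Mb")   # about 80 MB in the workspace
## dropping the stored copies shrinks the "lm" object considerably
fit <- lm(dat[, 1] ~ dat[, -1], model = FALSE, x = FALSE, y = FALSE)
print(object.size(fit), units = "Mb")
## embarrassingly parallel work on Unix-alikes (the parallel package
## ships with R >= 2.14); each worker copies the data it touches
library(parallel)
res <- mclapply(2:50, function(j) coef(lm(dat[, 1] ~ dat[, j])),
                mc.cores = 4)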
Hope this clarifies,
Roger
>
> Thanks,
> Juta
--
Roger Bivand
Department of Economics, NHH Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no