[R-sig-Geo] read.gal() for large data sets
Roger Bivand
Roger.Bivand at nhh.no
Tue Aug 23 19:47:57 CEST 2011
On Tue, 23 Aug 2011, Juta Kawalerowicz wrote:
> Dear List,
>
> I am looking for a general strategy for the following problem. I have a
> large data set with 200 000 rows and 50 variables. Along the lines of
> Anselin's R workbook I have used GeoDa to create weights (a 200 Mb file),
> but when I try to read them into R using
>
> library(spdep)
> weights<-read.gal("weights.gal")
>
> it does not seem to work (or maybe I should wait for more than an
> hour?).
read.gal() is quite complicated inside, because the IDs used may not be
the integers 1:n, so it needs to read the data in and manipulate them a
good deal. I think that you also have very many neighbours: with a 200MB
file and integer IDs in 1:200000 taking about 7 characters each on
average, that works out to roughly 150 neighbours per observation. This
is far from sparse. I suggest generating the neighbour object directly in
R if you can, as the smoothing effect of such a large average number of
neighbours may be very powerful, and may not represent the spatial
process adequately. Depending on what you want to do with the data, you
may prefer a graph-based or kNN approach.
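A minimal sketch of the kNN route in spdep, assuming a two-column
coordinate matrix "coords" (a placeholder name for your point
coordinates) and an arbitrary choice of k = 6:

library(spdep)
knn <- knearneigh(coords, k = 6)   # k nearest neighbours of each point
nb <- knn2nb(knn)                  # convert to an nb neighbour object
lw <- nb2listw(nb, style = "W")    # row-standardised spatial weights
summary(nb)                        # average number of links will be k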
> My computer runs on an i7-2630QM CPU with 4 GB RAM. Any suggestions? In
> principle, could somebody advise me on the strategies for spatial
> analysis on large data sets?
4GB is not large: most newer machines can run in 64-bit mode and handle
much more without trouble, so this sounds like a standard laptop. I
don't think that there is an obvious answer to your question, as
approaches will vary greatly depending on what kind of analysis you want
to do, and at least partly on whether the data are planar or use
geographical (unprojected) coordinates, which forces the use of Great
Circle distances.
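A minimal sketch of that distinction, again assuming a placeholder
coordinate matrix "coords"; dnearneigh() takes its distance bounds in
map units for planar coordinates, and in kilometres along Great Circles
when longlat = TRUE:

library(spdep)
## planar (projected) coordinates: d1/d2 in map units, e.g. metres
nb_planar <- dnearneigh(coords, d1 = 0, d2 = 5000)
## geographical coordinates: Great Circle distances, d1/d2 in km
nb_gc <- dnearneigh(coords, d1 = 0, d2 = 5, longlat = TRUE)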
If your analysis is embarrassingly parallelisable and you have plenty of
memory, you can use all your cores at once; you need more memory because
each core uses the data and in most systems needs its own copy of part of
the data set. One copy of your data as a matrix is about 80MB in the R
workspace, which isn't large as such; the "lm" object from regressing one
column on the others is about 200MB, but can be made smaller. Whether,
say, Moran's I of 200000 observations tells you a great deal is another
matter; it depends on the problem you are analysing and how you have set
out your model.
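A minimal sketch of that memory arithmetic, using a simulated stand-in
"dat" for the real data; model = FALSE, x = FALSE and y = FALSE are
standard lm() arguments that stop the fitted object storing the model
frame, design matrix and response:

dat <- matrix(rnorm(200000 * 50), ncol = 50)
print(object.size(dat), units = "Mb")   # about 80 MB in the workspace
## dropping the stored copies shrinks the "lm" object considerably
fit <- lm(dat[, 1] ~ dat[, -1], model = FALSE, x = FALSE, y = FALSE)
print(object.size(fit), units = "Mb")
## embarrassingly parallel work on Unix-alikes (the parallel package
## ships with R >= 2.14); each worker copies the data it touches
library(parallel)
res <- mclapply(2:50, function(j) coef(lm(dat[, 1] ~ dat[, j])),
                mc.cores = 4)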
Hope this clarifies,
Roger
>
> Thanks,
> Juta
--
Roger Bivand
Department of Economics, NHH Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no