[R-sig-Geo] point.in.polygon() on massive datasets

Roger Bivand Roger.Bivand at nhh.no
Thu Dec 13 22:25:58 CET 2007


On Thu, 13 Dec 2007, Edzer Pebesma wrote:

> overlay for points in polygons is a wrapper around point.in.polygon, the
> same that is used in gstat. What I miss in the R code (and C source
> code) is e.g. the checking on the bbox, whether the point is within the
> bounding box of a polygon, although the bbox is computed.

That's right - as it is there is an lapply() on the Polygon objects in 
each Polygons object with hole handling inside an lapply() on the Polygons 
objects in the SpatialPolygons object.

res <- lapply(slot(xx, "polygons"), bbox)

gets the bbox'es of the xx SpatialPolygons* object, and you could build a 
fast searcher in C with code in spdep/src/insiders.c:

int between(double x, double low, double up) {
 	if (x >= low && x <= up) return(1);
 	else return(0);
}

int pipbb(double pt1, double pt2, double *bbs) {
 	if ((between(pt1, bbs[0], bbs[2]) == 1) &&
 		(between(pt2, bbs[1], bbs[3]) == 1)) return(1);
 	else return(0);
}

in a compiled loop. You'd look to get a list 50M long with integer vectors 
for candidate members, then only do p-in-p for the candidate Polygons 
objects.

Roger

>
> It would of course speed things up increadibly when using tree indexes,
> either on the points or on the polygons, or preferably both, but this is
> currently not in the sp code. A nice student project!
> --
> Edzer
>
> Barry Rowlingson schrieb:
>> Markus Loecher wrote:
>>
>>> Dear all,
>>> I have a dataset of about 50 million lat/lon coordinates each of which falls into one of 550 polygons.
>>> I need to assign their memberships and have used point.in.polygon() for that purpose.
>>> However, the simple way of looping over the 50 million points clearly takes a looong time; 1 million points took about 3-4 days on a fast Linux server with lots of memory.
>>> Am I overlooking obvious ways of making this massive computation more efficient ? Would R trees help ?
>>> Should I try to compile the C code for point.in.polygon() (available from gstat) and run it outside R as a standalone executable ?
>>> I am already using apply() to mitigate the inefficiency of the for loop in R.
>>>
>>> Any help would be greatly appreciated,
>>>
>>>
>>
>>   Have you tried the 'overlay' functions from the sp package? Overlaying
>> points on polygons using those checks all the polygons for each point in
>> one go, so it may do some spatial tree optimising... You might have to
>> do your 50 million points in batches though...
>>
>> Barry
>>
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at stat.math.ethz.ch
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>
> _______________________________________________
> R-sig-Geo mailing list
> R-sig-Geo at stat.math.ethz.ch
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no




More information about the R-sig-Geo mailing list