[R-sig-Geo] speed problem with %over% function

Edzer Pebesma edzer.pebesma at uni-muenster.de
Thu Mar 1 16:03:08 CET 2012


Bastien, your report is very helpful.

I believe I managed to get overlay()'s efficiency back into the over()
methods. Because of the way the code is set up for over(), the
efficiency gain will propagate to all over() methods where the second
argument has attributes. Committed to svn on r-forge; will submit to
CRAN after all dependency tests passed.

Gregor's example now runs in less than 400 Mb RAM, in 22 secs.

Bastiens:
> ### The overlay() function
> system.time(plac.dans.contour1 <- overlay(grille.plac, shape) )
   user  system elapsed
  2.920   0.000   2.928
> system.time(plac.dans.contour2<-grille.plac%over%shape)
   user  system elapsed
  6.444   0.028   6.493
so there's still some overhead going on, but less dramatic.

The reason for deprecating overlay() is not so much that it doesn't
work, but its inconsistency: overlay(x,y) does the same as overlay(y,x)
for certain combinations x,y (notably points,polygons). Improving its
behaviour would break any script that depend on the inconsistency.

In contrast, over(x,y) consistently retrieves the index of y, or if y
has attributes the attributes of y, at spatial locations of x. Note that
that operation is asymmetrical: over(x,y) does not equal over(y,x) if x
and y differ.

Best regards,


On 02/29/2012 05:19 PM, Bastien.Ferland-Raymond at mrnf.gouv.qc.ca wrote:
> Thanks to Francis and Edzer for their responses.
> 
> I been working on my problem for the last 2 days and I managed to get some very interesting information.  I also prepared a reproducible example so people can try it.
> 
> I've found 3 functions that do the thing I want:
> 1) %over%
> 2) overlay()  (which is deprecated)
> 3) gIntersection()  (from the rgeos package)
> 
> So I've tried them all on my data to compare the speed between each of them and ArcGIS.  The punchline of my analysis is that %over% is the slowest (and by a lot) of all functions including ArcGIS.  overlay() is the fastest.
> 
> To get these results, here is a reproducible example:
> 
> ### first, download the shapes from:
> ### http://www.nceas.ucsb.edu/files/scicomp/demo/read-write-shapefiles.zip
> ### This is not my files, I took them from the example at:
> ### http://www.nceas.ucsb.edu/scicomp/usecases/ReadWriteESRIShapeFiles
> 
> ###  Use the rgdal library to read the files
>> library(rgdal)
>> shape <- readOGR("C:\\Documents and Settings\\ferba1\\Bureau\\shape","nw-counties")
> OGR data source with driver: ESRI Shapefile
> Source: "C:\Documents and Settings\ferba1\Bureau\shape", layer: "nw-counties"
> with 208 features and 8 fields
> Feature type: wkbPolygon with 2 dimensions
> 
> # choose only one territory to make the analysis easier
>> shape <- shape[shape at data$COUNTYNAME=="Lemhi",]
>> plot(shape, axes=T)
> 
>> etendu <- bbox(shape)
> 
> # this code is used to transfer km in degre-decimals. It's not necessary to
> # understand it for this problem.  The only important part is that you can
> # change "separation" to increase the density of the grid. "separation"
> # represent the distance between the grid point, in that example 0.5km.
>> separation <- 0.5
>> circ.minlat <- 2*pi*6378.16*cos(pi*etendu[2,1]/180)
>> pas.degre.lon <- circ.minlat/360
>> nb.degre.lon <- separation/pas.degre.lon
>> pas.degre.lat <- 111.3199
>> nb.degre.lat <- separation/pas.degre.lat
> ###
> 
> # make the grid
>> grille.plac <- GridTopology(etendu[,1],
> +                             cellsize=c(nb.degre.lon,nb.degre.lat),
> +                             cells.dim=(ceiling((etendu[,2]-etendu[,1])/c(nb.degre.lon,nb.degre.lat))+2))
> 
>> grille.plac <- coordinates(grille.plac)
> 
>> grille.plac <- SpatialPointsDataFrame( grille.plac,data.frame(plac_ID=1:nrow(grille.plac)), proj4string = CRS(proj4string(shape)))
> 
> 
>> points(grille.plac,col="blue", pch=".")
> 
> ### The overlay() function
>> system.time(plac.dans.contour1 <- overlay(grille.plac, shape) )
>    user  system elapsed
>    2.33    0.00    2.34
>> grille.plac1 <- grille.plac[!is.na(plac.dans.contour1),]
>> points(grille.plac1,col="red", pch=".")
> 
> ###  The %over% function
>> system.time(plac.dans.contour2<-grille.plac%over%shape)
>    user  system elapsed
>  953.72    2.33  960.80
>> grille.plac2 <- grille.plac[!is.na(plac.dans.contour2)[,1],]
>> points(grille.plac2,col="yellow", pch=".")
> 
> ###  The gIntersection() function
> library(rgeos)
>> system.time(grille.plac3 <- gIntersection(grille.plac, shape))
>    user  system elapsed
>   37.83    0.05   37.99
>> points(grille.plac3,col="green", pch=".")
> 
> ###  end of code
> 
> As you can see, the %over% function is the slowest with 961 seconds to run.  The overlay() function was the best with 2.34 seconds to run.  The gIntersection was in the middle with 38 seconds to run.  Finally, the intersection tool of ArcGIS took 20 seconds to do the same thing.  Those times seem to be context specific, as in my problem, gIntersection() was the slowest with 799 seconds, while %over% did the task in 250 seconds and overlay() in 119.  overlay() was still the fastest function.
> 
> So, it seems there clearly is room for amelioration of the %over% function.  And my next question is, why is overlay() deprecated? In the help, it says: "This function is deprecated, as it has some inconsistences." but I did not find any reference on what are those "inconsistences".  Going from overlay() to %over% seems more like a downgrade than a upgrade to me.  I implemented overlay() in my code instead of %over%, saving an incredible amount of time. However I'm a little stressed about it as it may cause problems or I may lose the function in a future version of the sp package.
> 
> So I guess anybody motivated to ameliorate the %over% function could start by understanding the difference between it and the overlay() function.  Sadly, I'm no programmer so I can' help here.
> 
> Best regards,
> Bastien Ferland-Raymond
> 
> 
> Date: Tue, 28 Feb 2012 11:38:04 -0500
> From: Francis Markham <francis.markham at anu.edu.au>
> 
> I've had examples in the past where using %over% from the sp package takes
> all available RAM (7GB) and several hours, while ArcGIS takes about 300MB
> and 5 minutes, so I would agree that there is plenty of room for
> improvement here. I'll try to give a reproducable example in the coming
> weeks when I return from travel.
> 
> This is a critical issue for me insofar as spatial joins are a routine
> procedure for me and but without reasonable speed for producedures such as
> this I cannot perform all my analysis in R, making for a more complex,
> error prone work flow and scuttling the possibility of "reproducable
> research."
> 
> On a related note, does the 'sp' package have an accessible bug tracker?
> I'd like to be able to contribute to improving this very useful package if
> at all possible, but I'm not sure where to begin.
> 
> Warm regards,
> 
> Francis Markham
> Research Associate
> Fenner School of Environment and Society
> Australian National University
> 
> On 27 February 2012 19:18, Edzer Pebesma <edzer.pebesma at uni-muenster.de>wrote:
> 
>> Dear Bastien,
>>
>> the %over% method was primary written to work, and it seems that it did.
>>
>> It is hard to tell why ArcGIS is so much faster without access to its
>> source code, and your data. Chances are that it uses a spatial index,
>> and R does not.
>>
>> Talking about spatial indexes, you might want to try %over% in package
>> rgeos (essentially after loading rgeos) as that seems to use a spatial
>> index. Providing a reproducible example is always more inviting for
>> someone on the list (possible me) to look deeper into this.
>>
>> If you want to obtain points on a regular grid within a polygon,
>> sp::spsample with type "regular" may be an alternative route. Package
>> raster may have more efficient routines for many things.
>>
>> On 02/27/2012 09:54 AM, Bastien.Ferland-Raymond at mrnf.gouv.qc.ca wrote:
>>> Dear geo-list,
>>>
>>> I'm just starting using R and the rgdal package to manage and create
>> some shapes and I am hitting some speed issues while using the %over%
>> function. R is significantly slower than ArcGIS.
>>>
>>> Here is what I'm doing (I've shorten the code to make it easier to
>> follow, I don't expect it to be reproducible).
>>>
>>> # I first load a shape giving the perimeter of the territory of interest
>>> terr <- readOGR("D:\\Couches GIS\\06152\\contour",
>> "contour_carto06152_Project")
>>>
>>> # I then want to make a grid of points.  The grid is very dense and
>> rectangular
>>> # to do so, I first make a gridtopology
>>> grille.plac.temp1 <- GridTopology(...)
>>>
>>> # and get all the coordinates from the grid
>>> grille.plac.temp2 <- coordinates(grille.plac.temp1)
>>>
>>> # and finally transform it to a SpatialPointsDataFrame object
>>> grille.plac <-
>> SpatialPointsDataFrame(grille.plac.temp2,data.frame(1:nrow(grille.plac.temp2)),proj4string
>> =CRS(proj4string(terr)))
>>>
>>> # The grid is very big as it is a rectangle in which you can fit the
>> territory shape
>>> # (which is of irregular shape). I then want to reduce it's size by
>> keeping only
>>> # to points from the grid that are in the territory.  I use the %over%
>> function to do
>>> # that.  This is the slow part.
>>> system.time(plac.dans.contour<-grille.plac%over%contour)
>>>    user  system elapsed
>>> 1420.23    9.57 1435.17
>>>
>>> # I then make a new shape with only the points from the grid that are in
>> the territory
>>> grille.plac.reduce <- grille.plac[as.vector(!is.na(plac.dans.contour)),]
>>>
>>>
>>> This code works fine and produce what I want it to produce. However, the
>> %over% functions takes 24 minutes to run, while ArcGIS can run it in less
>> than 1 minute with the intersect tool.  Is there a more efficient function
>> than %over% to do what I want?  Is it normal that ArcGIS is that much
>> faster than R in that case?
>>>
>>> Thanks for your help,
>>>
>>> Bastien Ferland-Raymond, M.Sc. Stat., M.Sc. Biol.
>>> Division des orientations et projets sp?ciaux
>>> Direction des inventaires forestiers
>>> Minist?re des Ressources naturelles et de la Faune du Qu?bec
>>>
>>> _______________________________________________
>>> R-sig-Geo mailing list
>>> R-sig-Geo at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>> --
>> Edzer Pebesma
>> Institute for Geoinformatics (ifgi), University of M?nster
>> Weseler Stra?e 253, 48151 M?nster, Germany. Phone: +49 251
>> 8333081, Fax: +49 251 8339763  http://ifgi.uni-muenster.de
>> http://www.52north.org/geostatistics      e.pebesma at wwu.de
>>
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
> 
>         [[alternative HTML version deleted]]
> 
> 

-- 
Edzer Pebesma
Institute for Geoinformatics (ifgi), University of Münster
Weseler Straße 253, 48151 Münster, Germany. Phone: +49 251
8333081, Fax: +49 251 8339763  http://ifgi.uni-muenster.de
http://www.52north.org/geostatistics      e.pebesma at wwu.de



More information about the R-sig-Geo mailing list