[R] Spatial join - optimizing code
Dan Davison
davison at stats.ox.ac.uk
Tue Sep 16 20:10:34 CEST 2008
Hi Monica,
I think the key to speeding this up is, for every point in 'track', to
compute the distance to all points in 'classif' 'simultaneously',
using vectorized calculations. Here's my function. On my laptop it's
about 160 times faster than the original for the case I looked at
(10,000 observations in track and 500 in classif). I get around 18
seconds for the 30,000 and 4,000 example (2 GHz processor running
Linux).
Dan
dist.merge2 <- function(x, y, xeast, xnorth, yeast, ynorth) {
    ## construct data frame d in which d[i,] contains information
    ## associated with the closest point in y to x[i,]
    xpos <- as.matrix(x[, c(xeast, xnorth)])
    ## one list element per point in x
    xposl <- lapply(seq_len(nrow(x)), function(i) xpos[i, ])
    ## transpose so that each column of ypos is one point in y
    ypos <- t(as.matrix(y[, c(yeast, ynorth)]))
    ## drop=FALSE keeps yinfo a data frame even if only one column remains
    yinfo <- y[, !colnames(y) %in% c(yeast, ynorth), drop=FALSE]
    get.match.and.dist <- function(point) {
        ## point is recycled down the columns of ypos
        sqdists <- colSums((point - ypos)^2)
        ind <- which.min(sqdists)
        c(ind, sqrt(sqdists[ind]))
    }
    match <- sapply(xposl, get.match.and.dist)
    cbind(xpos, mindist=match[2, ], yinfo[match[1, ], , drop=FALSE])
}
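
To see why the colSums line in get.match.and.dist gives the distances to all points at once, here is a small sketch on made-up numbers (the values of point and ypos below are hypothetical, chosen only so the arithmetic is easy to check by hand). Because ypos has one column per candidate point, R recycles the length-2 vector point down each column.

```r
## Hypothetical toy data: one query point and three candidate points.
point <- c(1, 2)                        # (east, north) of the query point
ypos  <- rbind(east  = c(4, 1, 7),      # 2 x 3 matrix: one column per candidate
               north = c(6, 3, 10))

## 'point' is recycled down the columns of 'ypos', so each column holds
## the east and north differences for one candidate.
sqdists <- colSums((point - ypos)^2)    # squared distances: 25, 1, 100
ind <- which.min(sqdists)               # index of the nearest candidate: 2
nearest <- c(index = ind, dist = sqrt(sqdists[ind]))
print(nearest)
```

The second candidate, at (1, 3), is the closest, at distance 1.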
It's marginally faster to convert xpos to a list followed by sapply as
I do here, than to leave it as a matrix and use apply to get the
matches.
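
If you want to check that comparison on your own machine, something along these lines should do it. This is only a sketch: the data are random and hypothetical (2,000 and 500 points, not the sizes discussed above), and the timings will vary with hardware and R version.

```r
set.seed(1)                                    # reproducible hypothetical data
xpos <- matrix(runif(2000 * 2), ncol = 2)      # 2000 query points, one per row
ypos <- t(matrix(runif(500 * 2), ncol = 2))    # 500 candidates, one per column

get.match.and.dist <- function(point) {
    sqdists <- colSums((point - ypos)^2)
    ind <- which.min(sqdists)
    c(ind, sqrt(sqdists[ind]))
}

## Variant 1: convert the rows to a list, then sapply
t.list <- system.time({
    xposl <- lapply(seq_len(nrow(xpos)), function(i) xpos[i, ])
    m1 <- sapply(xposl, get.match.and.dist)
})

## Variant 2: leave xpos as a matrix and apply over its rows
t.apply <- system.time(m2 <- apply(xpos, 1, get.match.and.dist))

## Both variants should find the same matches and distances
stopifnot(isTRUE(all.equal(c(m1), c(m2))))
print(c(list.sapply = t.list[["elapsed"]], apply = t.apply[["elapsed"]]))
```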
On Tue, Sep 16, 2008 at 04:23:33PM +0000, Monica Pisica wrote:
>
> Hi,
>
> A few days ago I asked about a spatial join on the minimum distance between 2 sets of points with coordinates and attributes in 2 different data frames.
>
> Simon Knapp sent code to do it when calculating distance on a sphere using lat, long coordinates, and I've changed his code to use Euclidean distances since my data had UTM coordinates.
>
> Typically one data frame has around 30 000 points and the classification data frame has around 4000 points, and the aim is to add to each point from the first data frame all the attributes from the second data frame of the point that is closest to it.
>
> On my PC (Dell OptiPlex GX620, x86-based PC, 4 GB RAM, 3192 MHz processor)
> it took quite a long time to do the join:
>
>    user  system elapsed
> 8166.07    2.98 8194.43
>
> Sys.info()
> sysname release
> "Windows" "XP"
> version nodename
> "build 2600, Service Pack 2"
> machine
> "x86"
> I am running R 2.7.1 patched.
> I wonder if any of you can suggest or help (or have time) in optimizing this code to make it run faster. My programming skills are not high enough to do it.
>
> Thanks,
>
> Monica
>
> #### code follows:
> #### x: a data frame with over 30000 points with UTM coordinates (xeast, xnorth)
> #### y: a data frame with over 4000 points with UTM coordinates (yeast, ynorth) and
> #### classification
> ### calculating Euclidean distance
>
> dist <- function(xeast, xnorth, yeast, ynorth) {
>     ((xeast-yeast)^2 + (xnorth-ynorth)^2)^0.5
> }
>
> ### doing the merge by location with minimum distance
>
> dist.merge <- function(x, y, xeast, xnorth, yeast, ynorth){
>     tmp <- t(apply(x[,c(xeast, xnorth)], 1, function(x, y){
>         dists <- apply(y, 1, function(x, y) dist(x[2],
>             x[1], y[2], y[1]), x)
>         cbind(1:nrow(y), dists)[dists == min(dists),,drop=FALSE][1,]
>     }
>     , y[,c(yeast, ynorth)]))
>     tmp <- cbind(x, min.dist=tmp[,2], y[tmp[,1],-match(c(yeast,
>         ynorth), names(y))])
>     row.names(tmp) <- NULL
>     tmp
> }
>
> #### code end
>
> ______________________________________________
> R-help at r-project.org mailing list
> https://stat.ethz.ch/mailman/listinfo/r-help
> PLEASE do read the posting guide http://www.R-project.org/posting-guide.html
> and provide commented, minimal, self-contained, reproducible code.
--
http://www.stats.ox.ac.uk/~davison