[R] finding euclidean proximate points in two datasets
David Winsemius
dwinsemius at comcast.net
Thu May 20 18:18:45 CEST 2010
On May 20, 2010, at 11:12 AM, Alexander Shenkin wrote:
> On 5/20/2010 9:18 AM, David Winsemius wrote:
>>
>> On May 20, 2010, at 10:02 AM, Alexander Shenkin wrote:
>>
>>> Hello all,
>>>
>>> I've been pouring through the various spatial packages, but
>>> haven't come
>>> across the right thing yet.
>>
>> There is a SIG for such questions.
>
> thanks - just joined it.
>
>>> Given a set of points in 2-d space X, i'm trying to find the
>>> subset of
>>> points in Y proximate to each point in X. Furthermore, the
>>> proximity
>>> threshold of each point in X differs (X$threshold). I've
>>> constructed
>>> this myself already, but it's horrificly slow with a dataset of 40k+
>>> points in one set, and a 700 in the other.
>>>
>>> A very inefficient example of what I'm looking for:
>>
>> Not really a reproducible example. If euclidean_dist is a function ,
>> then it is not one in any of the packages I have installed.
>
> it's not reproducible - i'll make a better effort to include
> reproducible code in the future. and that function is just one i
> would
> have written, but didn't want to clog the email with code. Anyway,
> here
> is a reproducible example:
>
> X = data.frame(x=c(1,2,3), y=c(2,3,1), threshold=c(1,2,4))
> Y = data.frame(x=c(5,2,3,4,2,5,2,3), y=c(5,2,2,4,1,2,3,1))
> proximate=list()
> i=1
> for (pt in 1:length(X$x)) {
> proximate[[i]] <- sqrt((X[pt,]$x - Y$x)^2 + (X[pt,]$y - Y$y)^2) >
> X[pt,]$threshold
> i=i+1
> }
> proximate
This might be a more R-ish way of approaching the problem. It is not
always the case that apply functions will be faster than looping, but
they may be more compact and they get you ready for thinking about
indexed solutions which would be generally faster.
> apply(X, 1, function(.x) { (.x[1] - Y$x )^2+(.x[2] - Y$y)^2
<.x[3]^2 } )
[,1] [,2] [,3]
[1,] FALSE FALSE FALSE
[2,] FALSE TRUE TRUE
[3,] FALSE TRUE TRUE
[4,] FALSE FALSE TRUE
[5,] FALSE FALSE TRUE
[6,] FALSE FALSE TRUE
[7,] FALSE TRUE TRUE
[8,] FALSE FALSE TRUE
Transposing lets you check that the results are the same as your
solution modulo yours and mine being slightly different classes:
> t( apply(X, 1, function(.x) { (.x[1] - Y$x )^2+(.x[2] - Y$y)^2
<.x[3]^2 } ) )
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8]
[1,] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[2,] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
[3,] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
> proximate
[[1]]
[1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE
[[2]]
[1] FALSE TRUE TRUE FALSE FALSE FALSE TRUE FALSE
[[3]]
[1] FALSE TRUE TRUE TRUE TRUE TRUE TRUE TRUE
>
>>>
>>> for (pt in X$idx) {
>>> proximity[i] = euclidian_dist(X[pt]$x, X[pt]$y, Y$x, Y$y) <
>>> X$threshold
>>> i = i+1
>>> }
>>>
>>
>> Have you considered first creating a subset of candidate points
>> that are
>> within "threshold" of each reference point on both coordinates. That
>> might sidestep a lot of calculations on points that are easily
>> eliminated on a single comparison. Then you could calculate distances
>> within that surviving subset of points. On average that should give
>> you
>> an over 50% "hit rate":
>>
>>> (4/3)*pi*0.5^3
>> [1] 0.5235988
>
> That's a nice idea. I'll still be waiting quite a while while my
> machine cranks, but not as long. Still - I suspect there would be
> much
> bigger gains if there were tailored packages. I'll re-post over on
> sig-geo about that. thanks.
>
>>> Perhaps crossdist() in spatstat is what I should use, and then
>>> code a
>>> comparison with X$threshold after the cross-distances are computed.
>>> However, I was wondering if there was another tool I should be
>>> considering. Any and all thoughts are very welcome. Thanks in
>>> advance.
>>>
>>> Thanks,
>>> Allie
>>> --
>>> Alexander Shenkin
>>> PhD Candidate
>>> School of Natural Resources and Environment
>>> University of Florida
David Winsemius, MD
West Hartford, CT
More information about the R-help
mailing list