[R] Applying a function on n nearest neighbours

Fri Oct 30 10:28:49 CET 2009

I'm having a problem where I have to apply a function to a subset of a 
variable, where the subset is defined by the n nearest neighbours of a 
second variable.

Here's an example applied to the 'iris' dataset:

$ head(iris)
  Sepal.Length Sepal.Width Petal.Length Petal.Width Species
1          5.1         3.5          1.4         0.2  setosa
2          4.9         3.0          1.4         0.2  setosa
3          4.7         3.2          1.3         0.2  setosa
4          4.6         3.1          1.5         0.2  setosa
5          5.0         3.6          1.4         0.2  setosa
6          5.4         3.9          1.7         0.4  setosa

For each row, I look at the value of Sepal.Length. I then find the 
n rows where the value of Sepal.Length is closest to that in the 
original row, and apply a function to the values of Sepal.Width in these 
rows (typically returning a scalar).

For example, setting n = 5 and calculating the mean on a slightly 
modified dataset, based on the first row (Sepal.Length ~= 5.1):

$ set.seed(1)
$ iris[,1:4]=iris[,1:4]+runif(150)/100
$ x=iris$Sepal.Length[1]
$ # order() gives row numbers sorted by distance; element 1 is the row itself
$ (pos=order(abs(iris$Sepal.Length-x))[2:6])
$ mean(iris$Sepal.Width[pos])

Now, I could easily use a 'for' loop or 'sapply' to do this for all 
rows, but I would think there is a better (and perhaps even faster?) 
way. Anyone know of a specific function in a package for this sort of 
thing?
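
For reference, the 'sapply' version I have in mind looks roughly like the sketch below. It assumes the jittered data from above, so that all Sepal.Length values are distinct and each row sorts first against itself. The names 'dat', 'n' and 'res' are just ones I picked for illustration.

```r
# Recreate the jittered data on a copy, so 'iris' itself stays untouched
set.seed(1)
dat <- iris
dat[, 1:4] <- dat[, 1:4] + runif(150) / 100

n <- 5
res <- sapply(seq_len(nrow(dat)), function(i) {
  # Distance from row i to every row, measured on Sepal.Length
  d <- abs(dat$Sepal.Length - dat$Sepal.Length[i])
  # Position 1 of the ordering is row i itself (distance 0), so skip it
  pos <- order(d)[2:(n + 1)]
  mean(dat$Sepal.Width[pos])
})
head(res)
```

This sorts all rows once per row, so it does roughly n_rows^2 * log(n_rows) work in total. That is fine for 150 rows, but it is the part I would hope a dedicated function could do better.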

Also note that this approach won't necessarily work on the 
unmodified dataset, where a number of rows have the same value of 
'Sepal.Length', so the original row won't necessarily come first in the 
'order' output. (Exactly how ties are broken when more than n 
observations are equally distant from the original row isn't 
very important, though. For example, using the ones with the lowest row 
numbers, or n random ones, would both be OK.)
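
One way to make this robust to ties is sketched below. It excludes the original row by its index instead of relying on its position in the ordering, and it breaks ties by lowest row number. 'nn_apply' is a hypothetical helper name, not a function from any package.

```r
# Hypothetical helper: for each element i of x, apply 'fun' to the values of y
# at the n elements whose x values are closest to x[i], excluding i itself.
nn_apply <- function(x, y, n, fun = mean) {
  stopifnot(length(x) == length(y), n < length(x))
  idx <- seq_along(x)
  sapply(idx, function(i) {
    others <- idx[-i]                 # drop the row itself by index
    d <- abs(x[others] - x[i])
    # Sort by distance, then by row number, so ties go to the lowest rows
    nearest <- others[order(d, others)[seq_len(n)]]
    fun(y[nearest])
  })
}

# Works on the unmodified data, ties included
res_ties <- nn_apply(iris$Sepal.Length, iris$Sepal.Width, n = 5)
head(res_ties)
```

Picking n random rows among the tied ones would only need a different second sort key, for example a random permutation of the row numbers in place of 'others'.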

-- 
Karl Ove Hufthammer