[R-sig-Geo] knearneigh() computing efficiency on large datasets

Pierre Roudier pierre.roudier at gmail.com
Mon Feb 1 01:31:22 CET 2010


Dear Edzer and Robert,

Thank you for your replies.

> Suggestions to improve this would be to use an index structure (for the
> generic case of possibly irregularly spread data) such as the PR-bucket
> quadtree used in package gstat (nsearch.c).

Yes, I thought about that, as I want to process both regularly and
irregularly gridded data. Unfortunately, my C experience is a bit too
limited for that. I'll keep looking for something; if I don't find
anything, I'll try to code it myself.
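
In the meantime, here is a rough sketch of the kind of stop-gap I have
in mind, using the RANN package (an R front end to the ANN kd-tree
library) rather than writing any C myself -- 'pts' and its coordinate
columns are just placeholders:

    library(RANN)   # kd-tree based k nearest neighbour search

    ## coordinates of the (possibly irregularly spaced) points,
    ## as a two-column matrix
    xy <- as.matrix(pts[, c("x", "y")])

    ## indices of the k nearest neighbours of every point;
    ## nn.idx[, 1] is the point itself, so ask for k + 1 and drop it
    k <- 8
    nn <- nn2(xy, k = k + 1)
    nb.idx <- nn$nn.idx[, -1]

This should scale much better than the exhaustive search, at the cost
of an extra dependency; whether the result can be fed cleanly into
knn2nb() is something I still have to check.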

> You can try one of the 'focal' functions of the raster package on
> R-Forge. Not particularly fast either, but probably much faster than
> what you have now.

Thanks Robert (and congrats for such a great package, BTW). I installed
the latest version and will certainly test it, but as I said I would
also like to process ungridded data, so I'm still thinking about a
more general solution.
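
For the gridded case, what I have tried so far with raster is roughly
the following (the argument names may differ between the R-Forge and
CRAN versions, and "dem.tif" is just a placeholder):

    library(raster)

    dem <- raster("dem.tif")

    ## maximum over the 3x3 window around each cell
    ## (the all-ones weights matrix defines the window)
    nbmax <- focal(dem, w = matrix(1, 3, 3), fun = max, na.rm = TRUE)

    ## a cell is a local maximum if it equals its neighbourhood maximum
    localmax <- dem == nbmax

Local minima would work the same way with fun = min.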

I was also wondering whether using RSAGA or spgrass6 could be a
solution here. Feedback, anyone?
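
The kind of thing I have in mind with spgrass6 would simply be to
delegate the moving-window step to r.neighbors, along these lines
(assuming a GRASS session has already been set up, e.g. with
initGRASS(), and that the DEM has been imported as a raster map called
"dem" -- the map names are placeholders, and depending on the spgrass6
version the parameters may be passed directly rather than as a list):

    library(spgrass6)

    ## 3x3 moving-window maximum computed by GRASS
    execGRASS("r.neighbors",
              parameters = list(input = "dem", output = "dem_max",
                                method = "maximum", size = 3L))

I have not benchmarked this against the pure R options, though, and it
obviously adds a GRASS dependency.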

Cheers,

Pierre



2010/1/29 Edzer Pebesma <edzer.pebesma at uni-muenster.de>:
> Pierre Roudier wrote:
>>
>> Dear list,
>>
>> I am processing gridded data such as DEMs, imported into R from GeoTIFF
>> files. Of course, such data is regularly gridded, but very often with
>> NA values. As part of some tests, I have to generate a
>> neighbourhood of the DEM so as to extract local extrema of the
>> layer. For this task, I wrote a function based on the knearneigh()
>> function from the spdep package, with k=4 or k=8, to generate and
>> analyse the neighbourhood of each point.
>>
>> Unfortunately, I often have to process big datasets (let's say grids
>> with between 50,000 and 350,000 points), and while that works,
>> knearneigh() takes *hours* to complete.
>>
>> Would anybody have any suggestion to improve the efficiency of this
>> step?
>>
>
> the algorithm (knn.c file in the spdep sources) seems to compute all
> distances between all grid cells and, for each grid cell, to find the
> nearest k by some kind of insertion sort. That makes it general for all
> kinds of spatial data points, including irregularly spread ones, but
> also makes it inefficient (O(n^2), with n the number of grid cells) for
> large data sets.
>
> Suggestions to improve this would be to use an index structure (for the
> generic case of possibly irregularly spread data) such as the PR-bucket
> quadtree used in package gstat (nsearch.c), or specifically for gridded data
> to write a special knearneigh() that exploits the grid indexing already
> present: you can explicitly address the nearest 4/8/12/... neighbours by
> using row/col indexes. In this latter case you'd probably want to rewrite
> the test function for the full grid, instead of looping over the cells.
>>
>> Thanks,
>>
>> Pierre
>>
>
> --
> Edzer Pebesma
> Institute for Geoinformatics (ifgi), University of Münster, Weseler Straße
> 253, 48151 Münster, Germany. Phone: +49 251 8333081, Fax: +49 251 8339763
>  http://ifgi.uni-muenster.de http://www.52north.org/geostatistics
>  e.pebesma at wwu.de
>
>
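
PS: regarding the second suggestion above (exploiting the grid indexing
directly), here is a rough sketch of what I understand it to mean,
assuming the DEM values are in a plain R matrix z with NA for missing
cells (the strict inequality for plateaus is an arbitrary choice):

    ## flag local maxima of a matrix z by comparing each cell with its
    ## 8 neighbours, obtained by shifting an NA-padded copy of the matrix
    local.maxima <- function(z) {
        nr <- nrow(z)
        nc <- ncol(z)
        zp <- matrix(NA, nr + 2, nc + 2)    # NA border around the grid
        zp[2:(nr + 1), 2:(nc + 1)] <- z
        res <- !is.na(z)                    # NA cells are never maxima
        for (di in -1:1) {
            for (dj in -1:1) {
                if (di == 0 && dj == 0) next
                nb <- zp[(2 + di):(nr + 1 + di), (2 + dj):(nc + 1 + dj)]
                res <- res & (is.na(nb) | z > nb)
            }
        }
        res
    }

This loops over the 8 offsets rather than over the cells, so it stays
vectorised over the full grid and avoids building the neighbour list
altogether.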


