[R-sig-Geo] Regionalisation of a large raster (a way to speed up spdep::skater?)

Thu Apr 7 11:47:15 CEST 2016

Hi everyone,

I want to perform regionalisation (i.e. spatially constrained
clustering) of a raster grid. The raster rectangle has ~90,000 pixels.
In it, I am only interested in a region within a specific, somewhat
convoluted, shape (i.e. the sea pixels, excluding land) which
represents ~35,000 pixels.

I have been able to convert pixels to polygons and remove a few odd
bits in order to compute a neighbourhood list representing one fully
connected graph, with spdep::poly2nb (there is probably a more clever
way for a large regular grid but this works reasonably quickly). From
it, I have computed the weights (based on 46 variables measured in
each pixel) and the minimum spanning tree. I then fed this MST and the
data to spdep::skater and asked it to resolve 7 clusters (a number based
on a priori knowledge). It has been running for ~10 hours on 12 cores
(Xeon CPUs at 3.40GHz), each using up to 5GB of RAM, and I have no
idea whether it is anywhere close to finishing.
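
To make the "regular grid" point concrete, here is a minimal sketch of
how the neighbourhood and the edge costs could be derived directly from
the grid mask, without going through polygons. It is written in plain
Python purely for illustration and is not what spdep does internally;
the function names, rook (4-neighbour) contiguity and Euclidean costs
are my assumptions.

```python
from math import sqrt

def grid_edges(mask):
    """Rook-contiguity edges between the True (sea) cells of a 2-D mask.

    Sea cells are numbered in row-major order; land cells get no index.
    """
    nrow, ncol = len(mask), len(mask[0])
    index = {}
    for r in range(nrow):
        for c in range(ncol):
            if mask[r][c]:
                index[(r, c)] = len(index)
    edges = []
    for (r, c), i in index.items():
        # Look only right and down so each undirected edge appears once.
        for dr, dc in ((0, 1), (1, 0)):
            j = index.get((r + dr, c + dc))
            if j is not None:
                edges.append((i, j))
    return index, edges

def edge_costs(edges, data):
    """Euclidean distance between the variable vectors of each linked pair."""
    return [sqrt(sum((a - b) ** 2 for a, b in zip(data[i], data[j])))
            for i, j in edges]

# Toy example: 3 x 3 grid with one land pixel (False) in the centre.
mask = [[True, True, True],
        [True, False, True],
        [True, True, True]]
index, edges = grid_edges(mask)
# Hypothetical data: 2 variables per sea pixel (46 in my real case).
data = [[float(k), 0.0] for k in range(len(index))]
costs = edge_costs(edges, data)
```

The edge list and costs would then be the input to the minimum spanning
tree step; building them this way is linear in the number of pixels.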

Eventually, I'll want to resolve from 5 to 12 clusters and compute
some a posteriori metrics to decide on the ideal number of clusters. I
can dedicate a few more cores but that will not be enough to speed it
up significantly.

Is there any clever way to speed spdep::skater up, that would maybe
exploit the fact that I am working on a regular grid?

I've thought about computing 100 to 200 clusters using regular k-means
(or pam) and then using those as input polygons to skater, but many
clusters end up as pixels scattered all over the place.
I've considered adding lat and lon to the data fed to the k-means at
this preliminary step to force spatial contiguity, but that becomes a
bit difficult to justify cleanly in a methods section of a paper, and
does not really work (regions are still scattered locally).
Of course I could reduce the resolution of the original data but that
would be a shame.
Finally, I've thought about:
1- run skater on low resolution data (few, large pixels)
2- group the central (large) pixels of each region as a polygon and
break apart pixels on the border into smaller pixels
3- compute average characteristics on these new pixels
4- re-run skater with large central polygons intact and smaller
pixel-polygons on the borders
and repeat this until the borders are well defined
But this involves quite a bit of coding and I am not really sure how
representative the mean characteristics would be for each large area.
Before embarking on this, I wanted to check whether another solution
existed.
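
For what it is worth, steps 2 and 3 of the scheme above could be
sketched roughly as follows. This is plain Python for illustration
only, under my own assumptions: region labels are stored in a 2-D list
with None for land, contiguity is rook (4-neighbour), and both function
names are hypothetical.

```python
def border_cells(labels):
    """Cells with at least one rook neighbour belonging to another region.

    `labels` is a 2-D list of region ids; None marks land and is ignored.
    """
    nrow, ncol = len(labels), len(labels[0])
    border = set()
    for r in range(nrow):
        for c in range(ncol):
            if labels[r][c] is None:
                continue
            for dr, dc in ((0, 1), (1, 0), (0, -1), (-1, 0)):
                rr, cc = r + dr, c + dc
                if 0 <= rr < nrow and 0 <= cc < ncol:
                    other = labels[rr][cc]
                    if other is not None and other != labels[r][c]:
                        border.add((r, c))
                        break
    return border

def block_mean(values, r0, c0, size):
    """Mean of the non-land values in a size x size block of fine pixels."""
    cells = [values[r][c]
             for r in range(r0, r0 + size)
             for c in range(c0, c0 + size)
             if values[r][c] is not None]
    return sum(cells) / len(cells) if cells else None

# Toy low-resolution result: two regions (0 and 1) and one land pixel.
labels = [[0, 0, 1, 1],
          [0, 0, 1, 1],
          [0, None, 1, 1]]
to_refine = border_cells(labels)      # large pixels to split up (step 2)

# Toy fine-resolution values for one variable in one 2 x 2 block (step 3).
fine = [[1.0, 3.0],
        [None, 5.0]]
mean_value = block_mean(fine, 0, 0, 2)
```

Pixels not in `to_refine` would be merged into one polygon per region,
and the loop would stop once the set of border pixels no longer changes.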

Does anyone have experience with ClusterPy
(http://www.rise-group.org/risem/clusterpy/), especially in terms of
speed? And which of its algorithms most resembles skater, or is the
most robust? (I understand the concepts behind skater, but I'm not
confident with the others.)

Thank you in advance. Sincerely,

Jean-Olivier Irisson
--
Université Pierre et Marie Curie
Laboratoire d'Océanographie de Villefranche
2 Quai de la Corderie, 06230 Villefranche-sur-Mer
Tel: +33 04 93 76 38 04
Mob: +33 06 21 05 19 90
http://www.obs-vlfr.fr/~irisson/
Send me large files at: http://www.obs-vlfr.fr/~irisson/upload/