[R-sig-Geo] GStat 'predict.c' parallel estimation

Edzer Pebesma edzer.pebesma at uni-muenster.de
Fri Aug 28 09:45:17 CEST 2015

On 08/28/2015 05:15 AM, Tim Peterson wrote:
> On 27/08/15 16:55, Edzer Pebesma wrote:
>> On 08/27/2015 02:45 AM, Tim Peterson wrote:
>>> Hi all,
>>> I'm considering editing the GStat kriging file 'predict.c' (while loop
>>> at line 151-165) for parallel estimation. I require parallel estimation
>>> for both kriging and indicator simulation and will be applying it to a
>>> region of ~30 million grid cells, which will be repeated for 300 time
>>> points.
>>> I appreciate that snow can be used for parallel kriging
>>> (https://r-forge.r-project.org/scm/viewvc.php/pkg/demo/snow.R?view=markup&revision=89&root=gstat)
>>> but, because of my computation burden, I require parallel calculation
>>> using Xeon Phi cards - hence I think the changes should be in
>>> 'predict.c'. Also, in looking at the GStat C-code I suspect that the
>>> global variables in 'glvars.c' may complicate the parallel calculations.
>>> So anyway, before I embark upon this task (most likely using openMP),
>>> I'd be grateful if anyone could share tips or thoughts (and my edits
>>> will be freely shared.)
>> I've always thought of these global variables as being constant during
>> program execution. R being non-thread safe, I'm not sure how well R
>> combines with embedded C code that uses threads.
> Thanks again Edzer. That's good to know about the global variables. Re
> the threaded C code working with R, my understanding was that the while
> loop undertaking the estimation is not interacting with R and so can be
> made parallel without (hopefully) influencing R. Does this sound
> reasonable?

Be careful: print_progress is called in the iteration over prediction
locations, and calls Rprintf(), not only to print progress, but also to
allow user interrupts; these need to be surpressed (can be done in
runtime, I believe). There are some R_CheckUserInterrupt()s, but not in
the prediction code.

>>> Also, some may be questioning the parallel estimation of each
>>> simulation. Considering that I'm estimating ~30 million cells, and will
>>> only be using ~70 cores per simulation, I am happy to accept the
>>> deficiency that recently estimated cells will not be used in the
>>> estimation of nearby cells within a concurrent parallel cycle. However,
>>> the estimates do need to be accessible to latter parallel cycles.
>> I can see that a shared memory approach could mess up the set of
>> previously simulated values ("msim" in msim.c), and that filling in
>> locally could be done in independent threads; yet, what do you do at the
>> boundaries of these regions?
> To clarify, the best approach I see for simulations is to randomly
> select, say, 10*"number of cores"  grid cells and undertake their
> parallel estimation. These solutions are then added back into the "msim"
> and another 10*"number of cores" grid cells are then selected and
> estimated. This would overcome the boundary effects but would mean that
> the cores may not be running at 100% all of the  time (ie all 10*"number
> of cores" cells would need to finish before the next batch could start)

I think that could work pretty well. Not only adding to msim, also
updating the search tree would have to be done in a single-threaded
synchronization event.

>> As an alternative to this approach (and the one you sketch below) you
>> could think about a circulant embedding algorithm that uses a parallel
>> version of the fft; see fields:sim.rf and the routines in package
>> RandomFields.
> Thanks. I didn't know about circulant embedding. However, my problem is
> high non-bi-Gaussian. The extreme values within the system (which is
> water table elevation) are significantly more correlated than 'average'
> values.
>>> Lastly, my backup option is to use openMP to parallelise the GSLib
>>> Fortran subroutines 'cokb3d.for' and 'sisim.for' (ie outside of R). This
>>> is mainly because the subroutines are fairly standalone and don't use
>>> pointers - and hence possibly less problematic to make parallel.
>>> Many thanks,
>>> Tim
>>> ----------------------
>>> Dr. Tim Peterson
>>> The Department of Infrastructure Engineering
>>> The University of Melbourne, 3010 Australia
>>> T: +61 3 8344 9950, M: +61 0438 385 937
>>> Dept. profile :
>>> http://www.ie.unimelb.edu.au/people/staff.php?person_ID=141135
>>> Research Gate : https://www.researchgate.net/profile/Tim_Peterson7
>>> Google Scholar:
>>> http://scholar.google.com.au/citations?user=kkYJLF4AAAAJ&hl=en&oi=ao
>>> _______________________________________________
>>> R-sig-Geo mailing list
>>> R-sig-Geo at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
> _______________________________________________
> R-sig-Geo mailing list
> R-sig-Geo at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo

Edzer Pebesma
Institute for Geoinformatics (ifgi),  University of Münster,
Heisenbergstraße 2, 48149 Münster, Germany; +49 251 83 33081
Journal of Statistical Software:   http://www.jstatsoft.org/
Computers & Geosciences:   http://elsevier.com/locate/cageo/
Spatial Statistics Society http://www.spatialstatistics.info

-------------- next part --------------
A non-text attachment was scrubbed...
Name: signature.asc
Type: application/pgp-signature
Size: 490 bytes
Desc: OpenPGP digital signature
URL: <https://stat.ethz.ch/pipermail/r-sig-geo/attachments/20150828/897f0e37/attachment.bin>

More information about the R-sig-Geo mailing list