[R-sig-Geo] Raster in parallel computing?

Jonathan Greenberg jgrn at illinois.edu
Wed Jan 8 20:13:03 CET 2014


Hi all:

Wanted to respond a bit to this thread.  rasterEngine is designed for
parallel processing of a single file (by chunking the inputs and passing
each chunk to a different worker).  I do have a parallel resample
algorithm on my radar, and I think I could adapt it for use with
rasterEngine.  However, this particular test case, with that many
files, needs a highly optimized system, so I think it would benefit from:

1) Using e.g. gdalwarp natively from GDAL, which should be a lot
faster than any resampling techniques I've seen in any software
package.  You can use gdalUtils for a within-R wrapper for the core
GDAL binaries:
install.packages("gdalUtils", repos="http://R-Forge.R-project.org")

Note that I just pushed version 0.2.0 to both r-forge and to CRAN.
Waiting to hear from CRAN, as soon as they accept it I'll make a
general announcement.

2) As Matteo mentioned, the "unit" of parallelization with that many
files should be the file itself (not within-file parallelization).  In
general, I recommend you develop your parallel code using foreach, and
then use whatever parallel backend you want.  "parallel" is now built
into core R, so you might want to stick with that as a backend
rather than snow/snowfall.  foreach is, in my opinion, conceptually
easier than the other parallel packages, and is also very flexible
(you can move your code from a single, multicore machine to an OpenMPI
cluster with little effort).  This will execute only one resampling
per worker at a time, so having multiple files open shouldn't be an
issue (since if you have 8 workers, you'll only have 8 files open at a
time).
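
As a rough sketch of point 1, here is one way to drive gdalwarp from R
for a single file.  To avoid guessing at wrapper argument names, this
calls the gdalwarp binary directly with system2() rather than going
through gdalUtils; the file names and the helper gdalwarp_args() are
hypothetical, and the call is only attempted if gdalwarp is on the PATH
and the input exists.

```r
# Build the command-line arguments for resampling one file with gdalwarp.
# gdalwarp_args() is a hypothetical helper, not part of any package.
gdalwarp_args <- function(src, dst, res = 1, method = "bilinear") {
  c("-tr", res, res,   # target resolution (x, y) in georeferenced units
    "-r", method,      # resampling method
    "-overwrite",      # replace dst if it already exists
    shQuote(src), shQuote(dst))
}

src <- "input.tif"         # hypothetical input file
dst <- "output_1deg.tif"   # hypothetical output file
args <- gdalwarp_args(src, dst)

# Only run the binary if it is installed and the input is present.
if (nzchar(Sys.which("gdalwarp")) && file.exists(src)) {
  status <- system2("gdalwarp", args)
}
```

The same argument vector can be built once per file inside whatever
parallel loop you end up using.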

To be clear, we're talking about the difference between:
(what rasterEngine does):
fileA -> fileA_chunk1 -> worker1
        -> fileA_chunk2 -> worker2

(what Matteo and I are proposing):
fileA -> worker1
fileB -> worker2
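
The file-per-worker layout could look roughly like the following with
foreach.  This is a minimal sketch, assuming the raster, foreach and
doParallel packages are installed; the three tiny input rasters, the
file names, the 10x10 target grid and the choice of 2 workers are all
made up for illustration, and raster's native .grd format is used so
that no GDAL driver is needed.

```r
library(raster)
library(parallel)
library(foreach)
library(doParallel)

# Hypothetical inputs: three small rasters written to temporary files.
in_files <- vapply(1:3, function(i) {
  r <- raster(nrow = 3, ncol = 3)
  r[] <- seq_len(9) + 9 * (i - 1)
  f <- file.path(tempdir(), paste0("in_", i, ".grd"))
  writeRaster(r, f, overwrite = TRUE)
  f
}, character(1))
out_files <- file.path(tempdir(), sub("^in_", "out_", basename(in_files)))

# Register a "parallel" backend with 2 workers (adjust to your machine).
cl <- makeCluster(2)
registerDoParallel(cl)

# One file per task: each worker opens, resamples and writes a single file.
done <- foreach(src = in_files, dst = out_files,
                .packages = "raster", .combine = c) %dopar% {
  target <- raster(nrow = 10, ncol = 10)   # illustrative target grid
  resample(raster(src), target, method = "bilinear",
           filename = dst, overwrite = TRUE)
  dst
}

stopCluster(cl)
```

Swapping the backend (e.g. to an MPI cluster) should only change the
makeCluster/register lines, not the foreach loop itself.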

Be aware that resampling is I/O intensive, so parallel processing the
files may not give you much (or any) benefit, if you are on a single
hard drive.  Even minor tweaks like reading from one drive and writing
to another may generate some additional speed.
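
One way to check whether parallelism helps on your hardware is to time
a small subset serially and in parallel before committing to the full
run.  The sketch below uses only the base "parallel" package;
process_one() is a hypothetical stand-in that just writes and re-reads
a temporary file, so substitute your real per-file step.

```r
library(parallel)

# Hypothetical stand-in for the per-file step: write a file, read it back.
process_one <- function(i) {
  f <- tempfile(fileext = ".rds")
  saveRDS(runif(1e5), f)
  x <- readRDS(f)
  unlink(f)
  length(x)
}

subset_ids <- 1:8   # a small sample of the full job

# Serial timing.
t_serial <- system.time(
  res_serial <- lapply(subset_ids, process_one)
)[["elapsed"]]

# Parallel timing with 2 workers.
cl <- makeCluster(2)
t_parallel <- system.time(
  res_parallel <- parLapply(cl, subset_ids, process_one)
)[["elapsed"]]
stopCluster(cl)

cat("serial:", t_serial, "s; parallel:", t_parallel, "s\n")
```

If the two timings are close, the job is probably limited by the disk
rather than by the CPU.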

If you've never learned parallel computing in R, I HIGHLY recommend
working through:
http://trg.apbionet.org/euasiagrid/docs/parallelR.notes.pdf

This is the best tutorial I've seen on parallel computing in R.
Wherever it says library("snow") I'd recommend switching that with
library("parallel"), and pay attention to the foreach discussion.
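
For example, a snow-style cluster call written against "parallel"
instead might look like this (a toy sketch; the 2 workers and the
squaring function are placeholders):

```r
library(parallel)   # instead of library("snow")

cl <- makeCluster(2)                               # start 2 local workers
squares <- parLapply(cl, 1:4, function(i) i^2)     # apply over the workers
stopCluster(cl)                                    # always shut workers down

unlist(squares)
```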

--j



On Wed, Jan 8, 2014 at 12:00 PM, Camilo Mora <cmora at dal.ca> wrote:
> Thank you Roger,
>
> Yes, I should have mentioned that I have looked extensively into this
> question. Jonathan from "spatial.tools" also gave me some advice, as his
> tool does not currently run the function "resample".
>
> Thanks again,
>
> Camilo
>
>
>
>
> Quoting Roger Bivand <Roger.Bivand at nhh.no>:
>
>> Did you notice the thread started yesterday that appears to meet your
>> need:
>>
>> https://stat.ethz.ch/pipermail/r-sig-geo/2014-January/020156.html
>>
>> It is always a good idea to look at the list archives, a search on:
>>
>> "list:R-sig-geo raster parallel"
>>
>> gives a number of potentially interesting hits. You could then preface
>> your posting by saying that you have already tried some possible solutions,
>> and would like help with them.
>>
>> Roger
>>
>> On Wed, 8 Jan 2014, Camilo Mora wrote:
>>
>>> Hi everyone,
>>>
>>> I am using the package "raster" to interpolate a large number of rasters
>>> (~1 million) of different resolutions to a common 1-degree resolution grid,
>>> and I wonder whether it is possible to do this in parallel.
>>>
>>> My code (example below) works like a charm, but it will take 30 days to
>>> process all the rasters. Sadly, the process only uses one core of my
>>> computer. Is there a way to run this code (example below) in
>>> parallel?
>>>
>>> Thanks,
>>>
>>> Camilo
>>>
>>> ####TEST CODE######
>>> library(raster)
>>>
>>> #creates 3 test rasters
>>> a <- raster(nrow=3, ncol=3)
>>> a[] <- 1:9
>>>
>>> b <- raster(nrow=3, ncol=3)
>>> b[] <- 10:18
>>>
>>> c <- raster(nrow=3, ncol=3)
>>> c[] <- 19:27
>>>
>>> #concatenates the rasters
>>> d<-brick(a,b,c)
>>>
>>> #creates a raster at a different resolution
>>> s <- raster(nrow=10, ncol=10)
>>>
>>> #interpolates data from the brick to the new resolution
>>> s <- resample(d, s, method='bilinear')
>>>
>>> _______________________________________________
>>> R-sig-Geo mailing list
>>> R-sig-Geo at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>>
>> --
>> Roger Bivand
>> Department of Economics, Norwegian School of Economics,
>> Helleveien 30, N-5045 Bergen, Norway.
>> voice: +47 55 95 93 55; fax +47 55 95 95 43
>> e-mail: Roger.Bivand at nhh.no
>>
>>
>>



-- 
Jonathan A. Greenberg, PhD
Assistant Professor
Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
Department of Geography and Geographic Information Science
University of Illinois at Urbana-Champaign
259 Computing Applications Building, MC-150
605 East Springfield Avenue
Champaign, IL  61820-6371
Phone: 217-300-1924
http://www.geog.illinois.edu/~jgrn/
AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007


