[R-sig-Geo] Raster in parallel computing?

Robert J. Hijmans r.hijmans at gmail.com
Wed Jan 8 20:37:45 CET 2014


Jonathan,

Thanks for that useful guidance. You write that "resampling is I/O
intensive, so parallel processing the files may not give you much (or
any) benefit". I do not think that is true. Compared to other
functions, resample is relatively slow because it needs to match all
the old cells to the new cells, which is -- in its current simple
implementation -- rather CPU intensive.

By using a Stack (or Brick), the number of computations needed is
strongly reduced, because the cell matching is done once for all
layers rather than once per raster. So I think the first step to this
problem is to group sets of rasters and stack these. After that, the
built-in parallel support might be more efficient than the foreach
approach that I showed, but I would not bet on that. For a simple way
to improve I/O, see ?rasterOptions (increase chunksize and maxmemory),
but be conservative with that when also using parallel computing.
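
A minimal sketch of the grouping idea, assuming the rasters are already
in memory as a list. The helper geom_key and the option values are
illustrative assumptions, not recommendations; tune them to your data
and to the number of workers.

```r
library(raster)

# Illustrative values only; keep them modest when several workers
# run at once, since each worker uses its own share of memory.
rasterOptions(chunksize = 1e8, maxmemory = 1e9)

# Three test rasters sharing one geometry, plus one that differs
a  <- raster(nrow = 3, ncol = 3); a[]  <- 1:9
b  <- raster(nrow = 3, ncol = 3); b[]  <- 10:18
cc <- raster(nrow = 3, ncol = 3); cc[] <- 19:27
e  <- raster(nrow = 6, ncol = 6); e[]  <- 1:36
rasters <- list(a, b, cc, e)

# Hypothetical helper: a text key describing the grid of a raster,
# so that rasters with identical grids fall into the same group
geom_key <- function(r) {
  paste(nrow(r), ncol(r), xmin(r), xmax(r), ymin(r), ymax(r), sep = "_")
}
groups <- split(rasters, sapply(rasters, geom_key))

# Target grid
target <- raster(nrow = 10, ncol = 10)

# Stack each group and resample the whole stack in a single call
resampled <- lapply(groups, function(g) {
  resample(stack(g), target, method = "bilinear")
})
```

Each element of resampled then holds all layers of one group on the
target grid.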

Robert


On Wed, Jan 8, 2014 at 11:13 AM, Jonathan Greenberg <jgrn at illinois.edu> wrote:
> Hi all:
>
> Wanted to respond a bit to this thread.  rasterEngine is designed for
> parallel processing a single file (via chunking the inputs and passing
> each chunk to a different worker).  I do have a parallel resample
> algorithm on my radar, and I think I could adapt it for use with
> rasterEngine.  However, this particular test case, with that many
> files, needs a highly optimized system, so I think it would benefit from:
>
> 1) Using e.g. gdalwarp natively from GDAL, which should be a lot
> faster than any resampling techniques I've seen in any software
> package.  You can use gdalUtils for a within-R wrapper for the core
> GDAL binaries:
> install.packages("gdalUtils", repos="http://R-Forge.R-project.org")
>
> Note that I just pushed version 0.2.0 to both r-forge and to CRAN.
> Waiting to hear from CRAN, as soon as they accept it I'll make a
> general announcement.
>
> 2) As Matteo mentioned, the "unit" of parallelization with that many
> files should be the file itself (not within-file parallelization).  In
> general, I recommend you develop your parallel code using foreach, and
> then use whatever parallel backend you want.  "parallel" is now built
> into the core R, so you might want to stick with that as a backend
> rather than snow/snowfall.  foreach is, in my opinion, conceptually
> easier than the other parallel packages, and is also very flexible
> (you can move your code from a single, multicore machine to an OpenMPI
> cluster with little effort).  This will only execute the resampling
> one file per worker, so multiple files being open shouldn't be an issue
> (since if you have 8 workers, you'll only have 8 files open at a
> time).
>
> To be clear, we're talking about the difference between:
> (what rasterEngine does):
> fileA -> fileA_chunk1 -> worker1
>         -> fileA_chunk2 -> worker2
>
> (what Matteo and I are proposing):
> fileA -> worker1
> fileB -> worker2
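
The file-per-worker layout above could be sketched roughly as follows.
This uses parLapply from the "parallel" package rather than foreach, and
the test files, the two-worker cluster, and the helper resample_one are
all assumptions made for illustration.

```r
library(parallel)
library(raster)

# Write three small test rasters to temporary files (stand-ins for real inputs)
in_files <- vapply(1:3, function(i) {
  r <- raster(nrow = 3, ncol = 3)
  r[] <- (1:9) + 9 * (i - 1)
  f <- file.path(tempdir(), paste0("in_", i, ".grd"))
  writeRaster(r, f, overwrite = TRUE)
  f
}, character(1))

# Hypothetical helper: resample one file and write the result to disk.
# It loads raster itself because it runs in a separate worker process.
resample_one <- function(f) {
  library(raster)
  target <- raster(nrow = 10, ncol = 10)
  out <- file.path(dirname(f), paste0("out_", basename(f)))
  resample(raster(f), target, method = "bilinear",
           filename = out, overwrite = TRUE)
  out
}

# One file per worker at a time; two workers is an arbitrary choice
cl <- makeCluster(2)
out_files <- unlist(parLapply(cl, in_files, resample_one))
stopCluster(cl)
```

Returning only the output file names keeps the data passed back from
each worker small.
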
>
> Be aware that resampling is I/O intensive, so parallel processing the
> files may not give you much (or any) benefit, if you are on a single
> hard drive.  Even minor tweaks like reading from one drive and writing
> to another may generate some additional speed.
>
> If you've never learned parallel computing in R, I HIGHLY recommend
> working through:
> http://trg.apbionet.org/euasiagrid/docs/parallelR.notes.pdf
>
> This is the best tutorial I've seen on parallel computing in R.
> Wherever it says library("snow") I'd recommend switching that with
> library("parallel"), and pay attention to the foreach discussion.
>
> --j
>
>
>
> On Wed, Jan 8, 2014 at 12:00 PM, Camilo Mora <cmora at dal.ca> wrote:
>> Thank you Roger,
>>
>> Yes I should have mentioned that I have looked extensively over this
>> question. Curiously, Jonathan from "spatial.tools" provided me with some
>> advice as his tool at the current time does not run the function "resample".
>>
>> Thanks again,
>>
>> Camilo
>>
>>
>>
>>
>> Quoting Roger Bivand <Roger.Bivand at nhh.no>:
>>
>>> Did you notice the thread started yesterday that appears to meet your
>>> need:
>>>
>>> https://stat.ethz.ch/pipermail/r-sig-geo/2014-January/020156.html
>>>
>>> It is always a good idea to look at the list archives, a search on:
>>>
>>> "list:R-sig-geo raster parallel"
>>>
>>> gives a number of potentially interesting hits. You could then preface
>>> your posting by saying that you have already tried some possible solutions,
>>> and would like help with them.
>>>
>>> Roger
>>>
>>> On Wed, 8 Jan 2014, Camilo Mora wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> I am using the package "raster" to interpolate a large number of rasters
>>>> (~1 million) of different resolutions to a single 1-degree resolution grid,
>>>> and I wonder whether it is possible to do this in parallel.
>>>>
>>>> My code (example below) works like a charm, but it will take 30 days to
>>>> process all the rasters. Sadly, the process only uses one core of my
>>>> computer. Is there a way to run this code (example below) in
>>>> parallel?
>>>>
>>>> Thanks,
>>>>
>>>> Camilo
>>>>
>>>> ####TEST CODE######
>>>> library (raster)
>>>>
>>>> #creates 3 test rasters
>>>> a <- raster(nrow=3, ncol=3)
>>>> a[] <- 1:9
>>>>
>>>> b <- raster(nrow=3, ncol=3)
>>>> b[] <- 10:18
>>>>
>>>> c <- raster(nrow=3, ncol=3)
>>>> c[] <- 19:27
>>>>
>>>> #concatenates the rasters
>>>> d<-brick(a,b,c)
>>>>
>>>> #creates a raster at a different resolution
>>>> s <- raster(nrow=10, ncol=10)
>>>>
>>>> #interpolates data from the brick to the new resolution
>>>> s <- resample(d, s, method='bilinear')
>>>>
>>>> _______________________________________________
>>>> R-sig-Geo mailing list
>>>> R-sig-Geo at r-project.org
>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>
>>>
>>> --
>>> Roger Bivand
>>> Department of Economics, Norwegian School of Economics,
>>> Helleveien 30, N-5045 Bergen, Norway.
>>> voice: +47 55 95 93 55; fax +47 55 95 95 43
>>> e-mail: Roger.Bivand at nhh.no
>>>
>>>
>>>
>>
>
>
>
> --
> Jonathan A. Greenberg, PhD
> Assistant Professor
> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory
> Department of Geography and Geographic Information Science
> University of Illinois at Urbana-Champaign
> 259 Computing Applications Building, MC-150
> 605 East Springfield Avenue
> Champaign, IL  61820-6371
> Phone: 217-300-1924
> http://www.geog.illinois.edu/~jgrn/
> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007