[R-sig-Geo] loops in rasterEngine
Jonathan Greenberg
jgrn at illinois.edu
Thu Mar 13 17:41:24 CET 2014
There are typically diminishing returns with larger and larger
"chunks", but there is also a lower limit on file size and chunk size
below which parallel processing doesn't help much. However, if you are
only reading and writing once, I don't see much of an advantage to
loading everything into memory at once, since calc()/focal() and
rasterEngine() all do the same thing (read from the file once, process
it, and write the output), just in smaller pieces. If you are using the
same input file over and over again, then I do see it being helpful. As
a general rule, we try to use chunking in raster processing because it
becomes almost infinitely scalable (almost :).
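
As a rough illustration of the chunked read/process/write pattern described above, here is a minimal sketch using the raster package's block functions. The small in-memory raster and the "multiply by 10" step are placeholders chosen for the example, not anything taken from calc() or rasterEngine() internals.

```r
library(raster)

# Hypothetical input: a small raster standing in for a large file on disk.
r <- raster(nrows = 100, ncols = 100)
values(r) <- sample(1:5, ncell(r), replace = TRUE)

out <- raster(r)        # empty raster with the same geometry as the input
bs  <- blockSize(r)     # suggested row blocks: bs$row, bs$nrows, bs$n

# Open the output file (native .grd format), then process block by block.
out <- writeStart(out, filename = tempfile(fileext = ".grd"), overwrite = TRUE)
for (k in seq_len(bs$n)) {
  v   <- getValues(r, row = bs$row[k], nrows = bs$nrows[k])  # read one chunk
  out <- writeValues(out, v * 10, bs$row[k])                 # process, write
}
out <- writeStop(out)
```

Only one block of rows is held in memory at a time, which is why the approach scales to rasters much larger than RAM.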
One issue that I've run into with R is that I've found it REALLY hard
to do a memory profile of a function -- if I could figure out the
memory footprint of a function (its MAX memory usage during its
execution), I could auto-optimize the chunk size. Right now, I use a
conservative multiplier of the number of input bands divided by the
number of workers in the cluster, but this doesn't account for memory
spikes within the function itself. Perhaps Robert can chime in on any
tricks for auto-optimizing the chunk size he's come across?
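
One rough way to estimate the peak memory of a single call, using only base R, is to reset the garbage collector's "max used" counters before the call and read them afterwards. This is a sketch under stated assumptions, not a full profiler: peak_mb is a hypothetical helper name, the counters are only updated when the garbage collector runs, and they cover only memory managed by R in the current process (not C libraries or cluster workers).

```r
# Rough peak-memory estimate (in Mb) for one call, using only base R.
peak_mb <- function(fun, ...) {
  # Sum the "(Mb)" column that follows the named column in gc()'s matrix.
  mb_after <- function(g, name) sum(g[, which(colnames(g) == name) + 1])
  before <- gc(reset = TRUE)   # collect, then reset the "max used" counters
  fun(...)                     # run the function being profiled
  after <- gc()                # collect again and read the counters
  mb_after(after, "max used") - mb_after(before, "used")
}

# Hypothetical workload: allocate and sum a 1e7-element numeric vector.
peak <- peak_mb(function(n) sum(numeric(n)), 1e7)
print(peak)
```

Running the function on one representative chunk and scaling the result could give a starting point for choosing a chunk size, though memory use does not always scale linearly with the number of cells.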
--j
On Thu, Mar 13, 2014 at 11:26 AM, Boulanger, Yan
<Yan.Boulanger at rncan-nrcan.gc.ca> wrote:
> Could it be that system.time in this case depends strongly on how much data is placed in RAM? In my case, I'm far from being memory limited (RAM = 192 GB) and most of the time, it's faster to put everything in memory and then process it. The major limiting factor for speed here is I/O.
>
> yan
>
> Yan Boulanger, Chercheur scientifique / Research scientist
> Ressources Naturelles Canada, Canadian Forest Service
> Centre de Foresterie des Laurentides
> 1055, rue du P.E.P.S.
> C.P. 10380, succ. Sainte-Foy
> Québec (Québec) Canada
> G1V 4C7
> Tel. : +1 418 649-6859
>
>
> -----Original Message-----
> From: jgrn307 at gmail.com [mailto:jgrn307 at gmail.com] On Behalf Of Jonathan Greenberg
> Sent: 13 March 2014 12:18
> To: Alex Zvoleff
> Cc: Boulanger, Yan; r-sig-geo at r-project.org
> Subject: Re: [R-sig-Geo] loops in rasterEngine
>
> Yan:
>
> Looks like you are getting great help with this -- I want to echo Alex's note that rasterEngine is not a catchall -- for REALLY simple processes you'll get better performance using calc() or using FEWER workers (which may seem counterintuitive). I'm submitting a paper this week showing that a function that just multiplies a raster by 10 ran faster than calc() only when using 4 workers (sfQuickInit(cpus=4)) (vs. calc's 1), and was slower than calc with fewer or more workers. As a rule, rasterEngine, at present, is slower than calc when operating in sequential mode.
>
> Now, as an important note, if you grab the latest spatial.tools from r-forge, I have added a feature that will return multiple rasters at once, which seems like what you want to do. You'll want to return a list-of-arrays (each component will be written to its own raster) and make sure you specify the output filenames (the components will be matched against the output filenames). This may result in a significant speedup because you read each raster only once and return all the outputs together, whereas the example above reads and writes the rasters for every i.
>
> --j
>
> On Thu, Mar 13, 2014 at 9:06 AM, Alex Zvoleff <azvoleff at conservation.org> wrote:
>> On Wed, Mar 12, 2014 at 11:29 PM, Boulanger, Yan
>> <Yan.Boulanger at rncan-nrcan.gc.ca> wrote:
>>> Actually, I have several rasters of more than 440 000 000 pixels
>>> (MODIS covering all of Canada) and I have a 32-core machine, so I
>>> would like to take advantage of it! ;-)
>>>
>>> Time is money (really?!!)
>>
>> As mentioned earlier, I would be careful about using rasterEngine for
>> this kind of task. It may actually slow you down. I would recommend
>> testing on smaller subsets to determine your gains (or losses) from
>> doing this type of calculation in parallel versus sequentially. While
>> I have seen great speed increases for CPU intensive calculations from
>> using rasterEngine, it sounds like your processing is heavily IO
>> intensive. I am not sure 32 cores will help you unless you have a very
>> fast disk or RAID array.
>>
>> Alex
>>
>>>
>>> Thanks again!
>>> yan
>>>
>>>
>>> From: Forrest Stevens [mailto:forrest at ufl.edu]
>>> Sent: 12 March 2014 22:25
>>> To: Boulanger, Yan
>>> Cc: r-sig-geo at r-project.org
>>> Subject: Re: [R-sig-Geo] loops in rasterEngine
>>>
>>> Hi Yan, I would be surprised if rasterEngine() were worth the overhead for such a simple process, though, admittedly, Jonathan Greenberg might have more information on the topic. This is the approach I would take for such an operation without using rasterEngine():
>>>
>>>
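>>> ## One output raster per zone: cells equal to i become 1, all others 0
>>> ## (input is the original zones raster, Safranyik_zones_1961_1990)
>>> for (i in 1:5) {
>>>   assign(paste("Safranyik_zones_1961_1990b_", i, sep=""),
>>>     Safranyik_zones_1961_1990 == i)
>>> }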
>>>
>>>
>>> To do it using rasterEngine(), this is the function definition that I would use. This of course requires that you've already created a cluster using one of the various supported parallel backends; otherwise you'll gain nothing from the parallel processing.
>>>
>>>
>>> require("spatial.tools")
>>> require("doParallel")  # registerDoParallel(); also attaches parallel for makeCluster()
>>>
>>> ## Begin a parallel cluster and register it with foreach:
>>> ## The number of nodes/cores to use in the cluster
>>> cpus = 2
>>> cl <- makeCluster(spec = cpus, type = "PSOCK", methods = FALSE)
>>> ## Register the cluster with foreach:
>>> registerDoParallel(cl)
>>>
>>> ## Or use the following, quick and dirty way:
>>> #sfQuickInit(cpus=2)
>>>
>>> fun_zone <- function( zones, i, ...) {
>>> return(zones == i)
>>> }
>>>
>>> ## Input is the original zones raster, Safranyik_zones_1961_1990
>>> for (j in 1:5) {
>>>   assign(paste("Safranyik_zones_1961_1990b_", j, sep=""),
>>>     rasterEngine(zones=Safranyik_zones_1961_1990, args=list("i"=j),
>>>       fun=fun_zone))
>>> }
>>>
>>> stopCluster(cl)
>>> #sfQuickStop()
>>>
>>>
>>> Hope this helps,
>>> Forrest
>>>
>>> --
>>> Forrest R. Stevens
>>> Ph.D. Candidate, QSE3 IGERT Fellow
>>> Department of Geography
>>> Land Use and Environmental Change Institute University of Florida
>>> http://www.clas.ufl.edu/users/forrest
>>>
>>> On Wed, Mar 12, 2014 at 8:51 PM, Boulanger, Yan <Yan.Boulanger at rncan-nrcan.gc.ca> wrote:
>>> Hi folks,
>>>
>>> I guess I have a lot to learn about writing functions, but I'm stuck when using rasterEngine. It seems that it should be very easy to do, but I'm missing something, apparently... I have a raster, Safranyik_zones_1961_1990, with integer values from 1 to 5. I would like to create five rasters whose value is 1 where the raster Safranyik_zones_1961_1990 is equal to "i", and NA otherwise. I would like to run everything in a loop. Here's what I thought would be ok.
>>>
>>> fun_zone <- function(Safranyik_zones,i,...) {
>>>   Safranyik_zonesb <- Safranyik_zones
>>>   Safranyik_zonesb[] <- NA
>>>   Safranyik_zonesb[Safranyik_zones == i] <- 1
>>>   return(Safranyik_zonesb)
>>> }
>>>
>>> for (i in 1:5){
>>>   Safranyik_zones_1961_1990b <-
>>>     rasterEngine(Safranyik_zones=Safranyik_zones_1961_1990,i=i,
>>>       fun=fun_zone)
>>>   assign(paste("Safranyik_zones_1961_1990b_",i,
>>>     sep=""),Safranyik_zones_1961_1990b[[1]])
>>> }
>>>
>>> Of course, it says that "i" is missing:
>>>
>>>> Error in Safranyik_zones == i : 'i' is missing
>>>
>>> Any help?
>>>
>>> Thanks in advance,
>>>
>>> Yan
>>>
>>> _______________________________________________
>>> R-sig-Geo mailing list
>>> R-sig-Geo at r-project.org
>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>>
>>
>> --
>> Alex Zvoleff
>> Postdoctoral Associate
>> Tropical Ecology Assessment and Monitoring (TEAM) Network Conservation
>> International
>> 2011 Crystal Dr. Suite 500, Arlington, Virginia 22202, USA
>> Tel: +1-703-341-2749, Fax: +1-703-979-0953, Skype: azvoleff
>> http://www.teamnetwork.org | http://www.conservation.org
>>
>
>
>
> --
> Jonathan A. Greenberg, PhD
> Assistant Professor
> Global Environmental Analysis and Remote Sensing (GEARS) Laboratory Department of Geography and Geographic Information Science University of Illinois at Urbana-Champaign
> 259 Computing Applications Building, MC-150
> 605 East Springfield Avenue
> Champaign, IL 61820-6371
> Phone: 217-300-1924
> http://www.geog.illinois.edu/~jgrn/
> AIM: jgrn307, MSN: jgrn307 at hotmail.com, Gchat: jgrn307, Skype: jgrn3007