[R-sig-Geo] slow computation progress for calc function

Stephen Stewart stephen.stewart85 at gmail.com
Tue Jun 25 15:19:07 CEST 2019


I have used the slice approach with success, using complex functions with
multiple outputs on 12,000+ layers (approx. 3,000 x 3,000 cells) loaded
into chunked NetCDF files on a desktop machine, so this should work.

This was all done using the ncdf4 and raster packages. There is some work
involved in setting up the input/output NetCDF files, though. The trick
was to select a chunking strategy that minimises row-wise read times
through the time series, then extract the slices for each row into a
matrix using the ncdf4 package and run apply() with your custom functions.
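
As a minimal sketch of that pattern (the file name "predictions.nc", the
variable name "pred", and its lon/lat/time dimension order are invented
for illustration; adjust to your file):

library(ncdf4)

nc <- nc_open("predictions.nc")
nx <- nc$dim$lon$len
ny <- nc$dim$lat$len
nt <- nc$dim$time$len

for (j in seq_len(ny)) {
  # read one spatial row across the full time series: an nx x nt matrix
  slab <- ncvar_get(nc, "pred", start = c(1, j, 1), count = c(nx, 1, nt))
  # each matrix row is one cell's time series, so apply over rows
  res <- apply(slab, 1, function(x) {
    y <- rle(as.numeric(x))
    if (any(y$values == 0)) max(y$lengths[y$values == 0]) else 0L
  })
  # write res into row j of a pre-created output file with ncvar_put()
}
nc_close(nc)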

The majority of the overhead will be in read/write if you're using rle
with one output. I suspect clusterR / calc will be a lot faster on a
chunked NetCDF as well... I've seen some huge speed improvements before,
but that was a special case with fewer layers and more computationally
expensive functions.

In any case, stacking 8,000 separate rasters is going to be very slow to
process in R unless you use something like NetCDF.
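
As an illustration of the file setup involved (the dimension names, sizes,
and chunk shape below are assumptions; this chunking favours fast row-wise
reads through time at the cost of slower layer-wise writes):

library(ncdf4)

dim_x <- ncdim_def("lon", "degrees_east", seq_len(3000))
dim_y <- ncdim_def("lat", "degrees_north", seq_len(3000))
dim_t <- ncdim_def("time", "days since 2000-01-01", seq_len(8000))
# one chunk = one spatial row x the whole time series
pred  <- ncvar_def("pred", "1", list(dim_x, dim_y, dim_t),
                   prec = "integer", chunksizes = c(3000, 1, 8000))
nc <- nc_create("predictions.nc", pred, force_v4 = TRUE)
# then write each of the 8,000 input rasters into its time slice, e.g.
# ncvar_put(nc, pred, t(as.matrix(r_i)), start = c(1, 1, i),
#           count = c(3000, 3000, 1))   # r_i is a hypothetical layer
nc_close(nc)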



On Tue., 25 Jun. 2019, 9:45 pm Roger Bivand, <Roger.Bivand using nhh.no> wrote:

> On Tue, 25 Jun 2019, Sara Shaeri via R-sig-Geo wrote:
>
> > Hi Barry, yes, all of them are running at near 100% usage.
> >
> > Sara
> >
> >    On Tuesday, June 25, 2019, 9:17:05 PM GMT+10, Barry Rowlingson
> >    <b.rowlingson using lancaster.ac.uk> wrote:
> >
> > On Tue, Jun 25, 2019 at 2:32 AM Sara Shaeri via R-sig-Geo
> > <r-sig-geo using r-project.org> wrote:
> >
> > interflood <- clusterR(all_predictions, calc,
> >                        args = list(fun = function(x) {
> >                          y <- rle(as.numeric(x))
> >                          max(y$lengths[y$values == 0])
> >                        }))
> >
> > If I understand this correctly, you are trying to find the length of
> > the longest run of zeroes in each pixel stack?
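> >
> > For example, on a toy series (values invented for illustration):
> >
> > x <- c(1, 0, 0, 0, 1, 0, 0, 1)
> > y <- rle(as.numeric(x))
> > max(y$lengths[y$values == 0])   # 3, the longest run of zeroes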
>
> This is how I read it too: finding the longest run of zeroes (no flood?)
> among 8,000 layers. This means that each raster cell is independent. I
> assume that all_predictions is not trying to fit into memory (how many
> copies across the cluster?). I believe GRASS reads and writes by raster
> row by default, so it would just iterate through the raster row by row.
>
> I suspect that the clusterR() framework is not what you need; this
> should be feasible on a laptop (data: 2M cells x 8K layers x INT4 ~ 64
> GB) by stepping through in blocks, shouldn't it? One row is at most 64
> MB. Read a row for the whole stack, updating the rle's on read, hold one
> output row until all layers are processed, then write that row as INT4.
> Try GRASS?
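>
> A minimal sketch of that row-by-row idea with the raster package,
> assuming all_predictions is a RasterStack/RasterBrick and the output
> filename is arbitrary:
>
> library(raster)
> out <- raster(all_predictions)              # single-layer template
> out <- writeStart(out, "interflood.tif", datatype = "INT4S")
> for (i in 1:nrow(all_predictions)) {
>   m <- getValues(all_predictions, row = i)  # cells-in-row x layers
>   v <- apply(m, 1, function(x) {
>     y <- rle(as.numeric(x))
>     if (any(y$values == 0)) max(y$lengths[y$values == 0]) else 0L
>   })
>   out <- writeValues(out, v, i)
> }
> out <- writeStop(out)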
>
> Just thinking aloud, the underlying problem is needing to slice cell-wise
> through the array.
>
> Roger
>
> > You need to find out where
> > the bottleneck is: are all your beginCluster(30) CPU cores running at
> > near 100% usage? If not, there's a memory or disk bottleneck, which
> > would need a different optimisation strategy than one aimed at CPU
> > usage. Barry
> >
> >
>
> --
> Roger Bivand
> Department of Economics, Norwegian School of Economics,
> Helleveien 30, N-5045 Bergen, Norway.
> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
> https://orcid.org/0000-0003-2392-6140
> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>



