[R-sig-Geo] Faster way to get raster average?
Michael Sumner
mdsumner at gmail.com
Wed May 6 07:22:29 CEST 2015
Sometimes it can be best to use the NetCDF tools more directly, you can
approach the speed of the already in memory case by slurping in the array
and reshaping to a matrix (like Robert's example, but I think my row/col
specification is correct):
library(raster)
cmip <- brick(nc=150, nr=114, nl=1872)
cmip <- setValues(cmip, matrix(rep(1:17100, 1872), nc=1872))
writeRaster(cmip, "example.nc")
## in memory from our example
system.time(r1 <- cellStats(cmip, mean, na.rm = TRUE))
# user system elapsed
# 0.08 0.00 0.07
## from disk
cmipdisk <- brick("example.nc")
system.time(r2 <- cellStats(cmipdisk, mean, na.rm = TRUE))
# user system elapsed
# 4.89 0.34 5.24
## in memory, but from disk (so yes we are IO-bound)
system.time(r3 <- .colMeans(getValues(cmipdisk), nrow(cmipdisk) *
ncol(cmipdisk), nlayers(cmipdisk), na.rm=TRUE))
# user system elapsed
# 4.71 0.31 5.03
## avoid raster completely is faster, even with apply
system.time(r4 <- apply(get.var.ncdf(open.ncdf("example.nc"), "variable"),
3, mean, na.rm = TRUE))
# user system elapsed
# 2.56 0.19 2.75
#
## reshape to avoid apply (second fastest to already in memory)
system.time(r5 <- .colMeans(matrix(get.var.ncdf(open.ncdf("example.nc"),
"variable"), nrow = prod(dim(cmip)[1:2])), nrow(cmipdisk) *
ncol(cmipdisk), nlayers(cmipdisk), na.rm=FALSE))
# user system elapsed
# 1.25 0.10 1.36
#
max(abs(r1 - r2))
#[1] 0
max(abs(r2 - r3))
#[1] 0
max(abs(r3 - r4))
#[1] 0
max(abs(r4 - r5))
#[1] 0
You can replace ncdf with ncdf4 or RNetCDF with a slight change of
open/getvar functions.
Cheers, Mike.
On Wed, 6 May 2015 at 14:28 Robert J. Hijmans <r.hijmans at gmail.com> wrote:
> This suggest that most of the time is spend on reading data from disk.
> It could be more efficient to do this in one step.
>
> m <- .colMeans(getValues(cmip), nrow(cmip), ncol(cmip), na.rm=FALSE)
>
> I do not think there is much more you can do beyond that --- although
> a solid state hard disk could help.
>
> ( I am guessing that na.rm=FALSE is tiny bit faster, and that you do
> not need it to be TRUE. )
>
> Robert
>
> On Tue, May 5, 2015 at 6:48 PM, Thiago V. dos Santos
> <thi_veloso at yahoo.com.br> wrote:
> > Hi all,
> > I am working with some terabytes of CMIP5 climate files. Each file is a
> netcdf with multiple layers (timesteps) representing monthly data.
> > For each file, I need to extract the average value of the raster and put
> all values in a data frame. This is my current approach:---------------
> > library(raster)
> > # make up some datacmip <- brick(nc=150, nr=114, nl=1872)cmip <-
> setValues(cmip, matrix(rep(1:17100, 1872), nc=1872))
> > # get mean values (area average) as data framescmip.mean <-
> as.data.frame(cellStats(cmip, mean, na.rm=T))---------------
> > which works pretty fast in this example:
> >> system.time(as.data.frame(cellStats(cmip, mean, na.rm=T)))
> > user system elapsed
> > 0.069 0.012 0.081
> > However, the calculation with my actual data is substantially slower:
> >> system.time(as.data.frame(cellStats(cmip, mean, na.rm=T)))
> > user system elapsed
> > 4.600 1.105 5.704
> > Since I will have to deal with thousands of files, here comes my
> question: is there a faster way to get a the average value of a raster ?
> > Many thanks,
> > --
> > Thiago V. dos Santos
> > PhD student
> > Land and Atmospheric Science
> > University of Minnesota
> >
> http://www.laas.umn.edu/CurrentStudents/MeettheStudents/ThiagodosSantos/index.htm
> > Phone: (612) 323 9898
> > [[alternative HTML version deleted]]
> >
> > _______________________________________________
> > R-sig-Geo mailing list
> > R-sig-Geo at r-project.org
> > https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>
> _______________________________________________
> R-sig-Geo mailing list
> R-sig-Geo at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>
[[alternative HTML version deleted]]
More information about the R-sig-Geo
mailing list