[R-sig-Geo] subs (raster) on large dataset

Lyndon Estes lestes at princeton.edu
Fri May 20 19:48:37 CEST 2011


Hi Robert,

My apologies for an even longer delay this time around.  I have gotten
back to this now and followed your suggestions.  Responses are
interspersed in (shortened) text of previous messages.

> That is *horrible*. I am not sure what is going on here. This is my
> rather empirical test to see if you have enough RAM for a given
> computation (and to work with chunks if not). Although it is somewhat
> inefficient, I have never seen anything crazy like this. I wonder if
> you have insufficient RAM and that virtual RAM is created on-disk as
> the object grows. I may lower the default maxmemory setting because of
> this. You can also set it with setOptions. You can also do
> setOptions(todisk=TRUE)  before using this function (and set it back
> afterwards, to force the function to use chunks (canProcessInMemory
> will return FALSE before doing the memory consuming test).

I used the setOptions function as you have suggested by embedding it
in my function (full code pasted at the foot of this email):

# Force to write to disk
if(raster:::.toDisk() != TRUE) {
   setOptions(todisk = TRUE)
   cat("We don't want memory problems--forcing write to disk.\n")
}

With this option, my code completed in 33 minutes (with the largest
raster being ~1 GB).  I also reran my code with toDisk = FALSE, and
watched R memory usage via top.  As you suspected, memory was
constantly increasing, both actual and virtual. I believe I had enough
RAM available at the time (~2 GB), so there seems to be some problem
there, but perhaps this is an issue for R-Sig-Mac?

> all raster functions that start with an "." are hidden but are
> accessible via raster:::
> e.g.
> raster:::.maxmemory

Thanks, I guess I should have already known that...

> In some functions that is true, but not in this one. Even if you
> supply a filename, the whole thing will be done in memory if possible,
> and the only at the end will the resulting raster data be written to
> disk. Perhaps it would be safer if it would always work as you
> expected.

It's easy enough for me to write the toDIsk = TRUE option into
functions, so thanks again for this tip.

Lastly, there is one small thing that I noticed while working on this
issue, based on the function that follows.  I noticed that in a case
such as this:

    cat("Substituting...")
    subs(spatialmaster, CSMtable, by = LUfield, which = outfields[i],
subsWithNA = TRUE, filename =
                   out.nm, datatype = type, overwrite = TRUE, progress = "text")

That the progress bar and cat don't play together, in that cat is not
displayed. I am not sure why, but perhaps it is simply being
overwritten in the console by the progress bar?  I opted to turn off
the progress bar in favor of the messages produced by cat in this
case. This isn't something I am terribly concerned with, but I thought
I would mention it in case it is of interest.

Thanks again for all your help.

Cheers, Lyndon



spatCSM <- function(spatialmaster, resamplegrid, res.factor, CSMtable,
LUfield, outfields, outnames, type) {
# Creates spatial output grids from results produced by runCSM
function. Path needs to be set in advance
# Args:
#   spatialmaster: A grid defining the location of each spatial units
#   resamplegrid: An optional grid defining the resolution to which
results should be resampled
#   resamplefactor: Factor by which to aggregate (e.g. 10 times
current pixel size, the default if not
#     specified)
#   CSMtable: The output table of CSM statistics generated by runCSM
#   LUfield: The field in CSMtable containing the spatial unit codes
that match values in spatialmaster grid
#   outfields: A vector of column names in CSMtable for which gridded
outputs are wanted, e.g. yield, CV yield
#   outnames: A vector of names for writing output grids, one for each
specifed grid
#   type: Output datatype for grid, e.g. INT2S
# Returns: R format raster grids, saved to disk, of mean yield and
other optional grids

  library(raster)
  if(missing(spatialmaster) | missing(CSMtable) | missing(LUfield) |
     missing(outfields) | missing(type)) {
     stop("Missing parameter, check function list.")
  }
  if(length(outfields) != length(outnames)) {
     stop("Number of names for output grids does not match number of
specified output grids")
  }

  cat("Running...")

  # Force to write to disk
  if(raster:::.toDisk() != TRUE) {
     setOptions(todisk = TRUE)
     cat("We don't want memory problems--forcing write to disk.\n")
  }

  for(i in 1:length(outfields)) {

    out.nm <- paste(outnames[i], ".grd", sep = "")
    cat("Substituting...")
    subs(spatialmaster, CSMtable, by = LUfield, which = outfields[i],
subsWithNA = TRUE, filename =
                   out.nm, datatype = type, overwrite = TRUE)#,
progress = "text")

    if(!missing(resamplegrid)) {
      cat("Aggregating...")
      if(missing(res.factor)) {
        res.factor <- 10
      }
      subs.g <- raster(out.nm)
      out.nm2 <- paste("agg.", out.nm, sep = "")
      aggregate(subs.g, res.factor, fun = mean, na.rm = TRUE,
                filename = out.nm2, datatype = type, overwrite =
TRUE)#, progress = "text")
      subs.g.agg <- raster(out.nm2)
      out.nm3 <- paste(outnames[i], ".",
as.integer(res(resamplegrid)[1]), ".grd", sep = "")
      cat("Resampling...")
      resample(subs.g.agg, resamplegrid, method = "bilinear", filename
= out.nm3, datatype = type,
               overwrite = TRUE)#, progress = "text")
    }
  }
  cat("Done.")
  setOptions(todisk = FALSE)  # Return to default option allowing
raster processing in memory
}



More information about the R-sig-Geo mailing list