[R-sig-Geo] subs (raster) on large dataset
Lyndon Estes
lestes at princeton.edu
Thu Apr 14 16:02:41 CEST 2011
Dear List,
I am using a function to assign the results of a crop model simulation
to a raster, which I then aggregate, and finally resample to the
resolution of another grid. This involves the subs function of raster,
and the raster whose values I am replacing is very large (ncells =
263411880). Despite this, I have successfully run subs on this same
dataset in the past. This time, however, it is running for over 9
hours without completing (because I kill R in order get other work
done), and I am on a MacBook Pro with a 2.4 Ghz Core 2 Duo processor
and 4 GB of ram, with R 2.12.2 and raster v 1.8-9. I am thus trying to
figure out if there is something wrong with my coding. First some
code/results, followed by my specific questions:
Here's the function:
spatCSM <- function(spatialmaster, resamplegrid, res.factor, CSMtable,
LUfield, outfields, outnames, type) {
# Creates spatial output grids from results produced by runCSM. Path
needs to be set in advance
# Args:
# spatialmaster: A grid defining the location of each unique land
unit (climate-soil combination)
# resamplegrid: An optional grid defining the resolution to which
results should be resampled
# resamplefactor: Factor by which to aggregate (e.g. 10 times
current pixel size, the default if not
# specified)
# CSMtable: The output table of CSM statistics generated by runCSM
# LUfield: The field in CSMtable containing the land unit codes that
match values in spatialmaster grid
# outfields: A vector of column names in CSMtable for which gridded
outputs are wanted, e.g. yield, CV yield
# outnames: A vector of names for writing output grids, one for each
specifed grid
# type: Output datatype for grid, e.g. INT2S
# Returns: R format raster grids, saved to disk, of mean yield and
other optional grids
library(raster)
if(missing(spatialmaster) | missing(CSMtable) | missing(LUfield) |
missing(outfields) | missing(type)) {
stop("Missing parameter, check function list.")
}
if(length(outfields) != length(outnames)) {
stop("Number of names for output grids does not match number of
specified output grids")
}
for(i in 1:length(outfields)) {
#CSMtable[, outfields[i]] <- as.numeric(CSMtable[, outfields[i]])
# Convert to numeric
out.nm <- paste(outnames[i], ".gri", sep = "")
subs(spatialmaster, CSMtable, by = LUfield, which = outfields[i],
subsWithNA = TRUE, filename =
out.nm, datatype = type, overwrite = TRUE)
if(!missing(resamplegrid)) {
cat("Aggregating")
if(missing(res.factor)) {
res.factor <- 10
}
subs.g <- raster(out.nm)
out.nm2 <- paste("agg.", out.nm, sep = "")
aggregate(subs.g, res.factor, fun = mean, na.rm = TRUE,
filename = out.nm2, datatype = type,
overwrite = TRUE)
subs.g.agg <- raster(out.nm2)
out.nm3 <- paste(outnames[i], ".",
as.integer(res(resamplegrid)[1]), ".gri", sep = "")
resample(subs.g.agg, resamplegrid, method = "bilinear", filename
= out.nm3, datatype = type,
overwrite = TRUE)
}
}
}
# Function call
spatCSM(spatialmaster = ltq.g, resamplegrid = base.g2, res.factor =
10, CSMtable = sum.tab.c,
LUfield = "FID", outfields = c("y.mn"), outnames = c("f.yld"),
type = "INT2S")
Inputs
ltq.g
class : RasterLayer
dimensions : 15170, 17364, 263411880 (nrow, ncol, ncell)
resolution : 92.73961, 92.73961 (x, y)
extent : -733488.1, 876842.5, -3806988, -2400128 (xmin, xmax, ymin, ymax)
projection : +proj=aea +lat_1=-18 +lat_2=-32 +lat_0=0 +lon_0=24
+x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
+towgs84=0,0,0
values : lt.quin.3f.grd
min value : 1
max value : 136050
base.g2
class : RasterLayer
dimensions : 1651, 1937, 3197987 (nrow, ncol, ncell)
resolution : 926.6254, 926.6254 (x, y)
extent : -869378.3, 925495.2, -3887679, -2357820 (xmin, xmax, ymin, ymax)
projection : +proj=aea +lat_1=-18 +lat_2=-32 +lat_0=0 +lon_0=24
+x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
+towgs84=0,0,0
values : in memory
min value : 1
max value : 1
sum.tab.c
FID CR WSTA.... SOIL_ID... YEARS pre.mn y.mn y.med y.cv y.10
1 16276 MZ AAB_7921 LTAE003301 1980-1998 439 1275 1265 69 0
2 16135 MZ AAB_7921 LTAE003303 1980-1998 422 2071 2295 67 111
3 16820 MZ AAA_7921 LTAE003303 1980-1998 427 2083 2374 66 123
4 16948 MZ AAA_7921 LTAE003304 1980-1998 458 3320 3828 35 1198
5 16149 MZ AAB_7921 LTAE003304 1980-1998 453 3277 3780 37 1086
6 16247 MZ AAB_7921 LTAE003305 1980-1998 423 1420 1645 74 0
7 17414 MZ AAC_7921 LTAE003305 1980-1998 419 1467 1637 68 79
8 17017 MZ AAA_7921 LTAH001401 1980-1998 452 3378 3598 29 1808
9 17016 MZ AAA_7921 LTAH001403 1980-1998 453 3250 3579 32 1507
10 17015 MZ AAA_7921 LTAH001404 1980-1998 425 1607 1918 75 0
Note that I have been testing this function with only a very small
input table providing the values to be substituted (sum.tab.c). This
is because the crop model itself takes a long while to run for each
location provided in ltq.g.
My first question is whether the small table is causing a problem? I
figured it wouldn't, since subsWithNA is set to TRUE. But perhaps it
does.
If that is unlikely, then have I somehow coded this in a way that R is
trying to perform the subs entirely with memory (e.g. because I have
placed subs within a function)?
If neither of these are the case, then I will appreciate any pointers
for how I can make this work more quickly.
I apologize for not providing a fully reproducible example, but the
grid in question is 1 GB, and since the size of it seems to be the
main issue (a version of this grid cut down to the area covered by
sum.tab.c works fine), I don't want to burden anyone with such a large
file.
Thanks in advance for any advice you can provide.
Cheers, Lyndon
--
Lyndon Estes
Research Associate
Woodrow Wilson School
Princeton University
+1-609-258-2392 (o)
+1-609-258-6082 (f)
+1-202-431-0496 (m)
lestes at princeton.edu
More information about the R-sig-Geo
mailing list