[R-sig-Geo] subs (raster) on large dataset

Robert J. Hijmans r.hijmans at gmail.com
Wed Apr 27 06:30:49 CEST 2011


On Tue, Apr 19, 2011 at 12:19 PM, Lyndon Estes <lestes at princeton.edu> wrote:
> Dear Robert,
>
> Thanks for your response on this (sorry for my slow feedback, I had
> another project intervene).

Hi Lyndon,

likewise :); sorry about that.

>
> I have gone through the subs code now:
>
> Using my raster ltq.g:
>
> class       : RasterLayer
> dimensions  : 15170, 17364, 263411880  (nrow, ncol, ncell)
> resolution  : 92.73961, 92.73961  (x, y)
> extent      : -733488.1, 876842.5, -3806988, -2400128  (xmin, xmax, ymin, ymax)
> projection  : +proj=aea +lat_1=-18 +lat_2=-32 +lat_0=0 +lon_0=24
> +x_0=0 +y_0=0 +ellps=WGS84 +datum=WGS84 +units=m +no_defs
> +towgs84=0,0,0
> values      : lt.quin.3f.grd
> min value   : 1
> max value   : 136050
>
> r <- raster(ltq.g)
> tr <- blockSize(r)  # n = 267, nrows = 57, size = 57
>
> From this test, I get quite variable speeds:
>
> i <- 1
> dong <- Sys.time()
> v <- getValues(ltq.g, row = tr$row[i], nrows = tr$size)
> x <- localmerge(v, y, TRUE)
> length(x)  # 989748 = ncells in 57 (nrow) * 17364 (ncol)
> ding <- Sys.time()
> ding-dong
> # Times:
> # Times:
> # 2.374584 mins, 49.45746 seconds, 44.11 seconds
> # After computer restart: 10.97272 seconds, 14.53 seconds
>
> The last two times are much better, so perhaps I was having some
> performance bottlenecks on my computer.  Do these seem reasonable to
> you, given my computer (2.4 GHz Core 2, 4 GB RAM Macbook Pro, 64-bit
> R-2.12.2)?
>
> I did catch something potentially important here:
>
> if (canProcessInMemory(x, 3)) {
>            v <- localmerge(getValues(x), y, subsWithNA)
>            r <- setValues(r, v)
>            if (filename != "") {
>                r <- writeRaster(r, filename = filename, ...)
>            }
>            return(r)
>
> Testing this on its own:
>
> canProcessInMemory(ltq.g, 3)
>
> Takes anywhere from 7.6 minutes (after computer restart) to 13.2
> minutes (last night before restart), and returns "TRUE".

That is *horrible*. I am not sure what is going on here. This is my
rather empirical test to see whether you have enough RAM for a given
computation (and to work with chunks if not). Although it is somewhat
inefficient, I have never seen anything crazy like this. I wonder if
you have insufficient RAM and virtual memory is being created on disk
as the object grows. I may lower the default maxmemory setting because
of this. You can also set it with setOptions(). And you can run
setOptions(todisk=TRUE) before using this function (and set it back
afterwards) to force the function to use chunks; canProcessInMemory
will then return FALSE before doing the memory-consuming test.
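To put a number on that: for a raster of your size, the trial matrix that
canProcessInMemory allocates is far larger than your 4 GB of RAM, which
would explain the swapping. A back-of-the-envelope calculation (my reading
of the code quoted above; the exact constants may differ between versions):

```r
## Approximate size of the trial matrix in canProcessInMemory()
## (n = 3 copies, 10% margin, 8 bytes per double; my reading of the
## code quoted above -- exact constants may differ between versions)
ncells <- 263411880                  # ncell(ltq.g)
n      <- 3                          # single layer: 3 + (nlayers - 1)
values <- round(1.1 * ncells) * n    # elements in matrix(0.1, ncol = n, ...)
gib    <- values * 8 / 1024^3        # doubles are 8 bytes each
round(gib, 1)                        # ~6.5 GiB, well over 4 GB of RAM
```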


> Breaking out
> the relevant bits of code for canProcessInMemory:
>
> n <- 3 + (nlayers(ltq.g) - 1)
> cells <- round(1.1 * ncell(ltq.g))
> #if ((cells * n) > .maxmemory()) {
> #        return(FALSE)
> #    }
> # Note: I couldn't figure out how to access .maxmemory()--it keeps telling me
> # "Error: could not find function ".maxmemory"", so presumably this is
> something
> # internal to raster that I am not smart enough to figure out how to
> access, so I used this:

All raster functions whose names start with a "." are hidden, but they
are accessible via raster:::
e.g.
raster:::.maxmemory
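The same trick works for any package; getFromNamespace() is an equivalent
route. Illustrated here with an internal function from utils (just an
example object, nothing raster-specific):

```r
## Hidden (non-exported) objects can be reached with ":::" or,
## equivalently, with getFromNamespace(); shown with an internal
## function from the utils package
f1 <- utils:::format.object_size
f2 <- getFromNamespace("format.object_size", "utils")
identical(f1, f2)   # TRUE
```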


>
> ((cells * n) > as.integer(1e+09))  # 1e+09 is my memory limit.
>  > TRUE
>
> So it seems my raster might be getting processed entirely in memory.
> The slowness of canProcessInMemory then
> comes from this line:
>
> r <- try(matrix(0.1, ncol = n, nrow = cells), silent = TRUE)
>
> There is this line at the beginning of canProcessInMemory that seems
> to me important, but I get the same problem as with .maxmemory(), in
> that I can't figure out how to see its code or what it does:
>
>  if (.toDisk()) {
>        return(FALSE)
>    }
>
> Based on its name, I assume it checks whether the parent function is
> being directed to process onto the disk, i.e. a filename is supplied.
> Is this correct?

No, it checks whether the option "todisk" is set to TRUE or FALSE. It
does not depend on whether a filename was supplied (a temporary
filename will be assigned if not).
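To illustrate the order of the checks (a toy sketch, not raster's actual
code): the option is tested before the expensive trial allocation, which
is why setting todisk=TRUE makes the function return quickly.

```r
## Toy version (not raster's actual code) of the order of checks in
## canProcessInMemory(): the "todisk" option short-circuits everything
can_process_toy <- function(cells, n, todisk = FALSE, maxmem = 1e9) {
  if (todisk) return(FALSE)              # cheap early exit, as in .toDisk()
  if (cells * n > maxmem) return(FALSE)  # cheap size check
  r <- try(matrix(0.1, ncol = n, nrow = cells), silent = TRUE)  # expensive
  !inherits(r, "try-error")
}

can_process_toy(1e6, 3)                 # TRUE: small enough to try
can_process_toy(1e6, 3, todisk = TRUE)  # FALSE, without allocating anything
```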

> And, if the answer is yes, is this further
> assumption also correct?
>
> A raster function (e.g. subs) run with an output filename and without
> assignment to an object must necessarily write to disk.  e.g.
>
> subs(x, y, by = "foo", which = "bar", subsWithNA = TRUE, filename =
> out, datatype = type, overwrite = TRUE)
>

It does not matter whether you assign or not (and you should normally assign).

> If this last assumption is correct, then it seems that something is
> going wrong (or perhaps something I have mis-specified in my code)
> such that this large file is being handled in memory when it should be
> processed in chunks to disk.

In some functions that is true, but not in this one. Even if you
supply a filename, the whole thing will be done in memory if possible,
and only at the end will the resulting raster data be written to disk.
Perhaps it would be safer if it always worked the way you expected.
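For reference, the substitution step itself (localmerge) is essentially a
match() lookup against the two-column table. A base-R sketch of the idea
(my approximation, not the actual internal code):

```r
## Sketch of what subs()/localmerge does per chunk: look each cell value
## up in a two-column table (a base-R approximation of the idea, not the
## actual internal code)
localmerge_sketch <- function(v, y, subsWithNA = TRUE) {
  i   <- match(v, y[[1]])          # table row for each cell value
  out <- y[[2]][i]                 # substituted values; NA where unmatched
  if (!subsWithNA) {
    out[is.na(i)] <- v[is.na(i)]   # keep originals where there is no match
  }
  out
}

v <- c(1, 2, 3, 2, NA)
y <- data.frame(id = c(1, 2), newval = c(10, 20))
localmerge_sketch(v, y)                      # 10 20 NA 20 NA
localmerge_sketch(v, y, subsWithNA = FALSE)  # 10 20  3 20 NA
```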


>
> I hope this makes some sense, but I'll appreciate any points of advice on this.
>
> Thanks, Lyndon
>
> p.s. the progress option doesn't work for me on subs, either "text" or
> "window". Is this because I am mac, or is it indicative of something
> wrong regarding the installation?
>

In the majority of functions, the progress bar only works when
processing is done in chunks (otherwise there really is only one step).


>
> format    : raster
> datatype  : FLT8S
> overwrite : FALSE
> progress  : none
> timer     : FALSE
> chunksize : 1e+06
> maxmemory : 1e+09
> tmpdir    : /var/folders/Xc/XcJpMYIWFJm5ANTLUN-U2U+++TM/-Tmp-/R_raster_tmp/
> setfileext: TRUE
>
>
>
> On Thu, Apr 14, 2011 at 1:43 PM, Robert Hijmans <r.hijmans at gmail.com> wrote:
>> Lyndon,
>>
>> I have no idea what might be causing this. I would suggest to track down
>> where exactly this happens, by going, line by line, through the code of
>> 'subs'
>>
>> getMethod("subs", c("RasterLayer", "data.frame"))
>>
>> Presumably, something is not moving here:
>>
>> for (i in 1:tr$n) {
>>     v <- getValues(x, row = tr$row[i], nrows = tr$size)
>>     r <- writeValues(r, localmerge(v, y, subsWithNA),  tr$row[i])
>>     pbStep(pb)
>> }
>>
>> so you can try
>>
>>  i <- 1
>>  v <- getValues(x, row = tr$row[i], nrows = tr$size)
>>  x <- localmerge(v, y, subsWithNA)
>>
>> and if that last step is indeed very slow, have a step by step look at
>> "localmerge".
>>
>> You can then also see whether it matters if you change the "chunksize"
>> and "maxmemory" options (?setOptions). These will affect the size of "v" in
>> the code above, and perhaps there is a non-linear effect on the performance
>> of merge.
>>
>> Robert
>>
>>
>>
>> _______________________________________________
>> R-sig-Geo mailing list
>> R-sig-Geo at r-project.org
>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>
>
>
>
> --
> Lyndon Estes
> Research Associate
> Woodrow Wilson School
> Princeton University
> +1-609-258-2392 (o)
> +1-609-258-6082 (f)
> +1-202-431-0496 (m)
> lestes at princeton.edu
>


