[R-sig-Geo] raster[] slow on large rasters
Michael Sumner
mdsumner at gmail.com
Mon Oct 3 01:57:19 CEST 2016
rgdal::GDALinfo(filename) will show the block size. Here's an example of
the timings difference I'm talking about.
library(raster)
r <- disaggregate(raster(volcano), fact = 20)
writeRaster(r, "supersamp.tif", options = c("TILED=YES"))
rtif <- raster("supersamp.tif")
rmem <- readAll(rtif)
bmem <- brick(rmem)
cell <- sort(sample(ncell(rtif), 100))
library(rbenchmark)
benchmark(tif = extract(rtif, cell),
rmem = extract(rmem, cell),
bmem = extract(bmem, cell))
test replications elapsed relative user.self sys.self user.child sys.child
3 bmem 100 0.01 1 0.01 0.00 NA NA
2 rmem 100 0.15 15 0.12 0.02 NA NA
1 tif 100 12.76 1276 12.35 0.39 NA NA
You could group your cell numbers into tiles and use raster::crop to get
each block in turn, though presumably the block handling functions in
raster are better suited to that. I haven't used those, I tend to deal with
really long time series data sets and reasonable spatial sizes, rather
than really massive spatial grids like this.
I don't know if you can get individual values by cell number via rgdal,
without extracting an entire SpatialPixels object - in which case I reckon
brick/crop is a better way.
Otherwise you might use GDAL itself to do the extractions, but I haven't
done that myself.
I'll try the block tools in raster later if I get a chance, that's my first
guess at what will be best here. If you are familiar with the cell
abstractions in raster you can do a lot of work before hitting the raster
source at all, this can be really handy for organizing things the right
way.
Cheers, Mike.
On Mon, 3 Oct 2016 at 10:05 Kenny Bell <kmb56 at berkeley.edu> wrote:
> I am unsure if the file is tiled - how do I find this out? I am finding
> that sampling the cells directly and using `[` is also slow, though not as
> slow as sampleRandom. Is that what you meant by "index cell extract"?
>
> Using readAll isn't going to work as it reads in very slowly and is large
> (~100GB).
>
> On Sun, Oct 2, 2016 at 3:54 PM, Michael Sumner <mdsumner at gmail.com> wrote:
>
> Is the file tiled? Raster's extract is slow then because it scans line by
> line rather than by tile. The only fix I know is to readAll into memory or
> write to a new untiled file. At any rate you might as well sample the cell
> numbers more directly and use index cell extract instead of sampleRandom
>
> On Mon, 3 Oct 2016, 09:40 Kenny Bell <kmb56 at berkeley.edu> wrote:
>
> Is an approach that could improve this is to arrange the locations to
> collect into contiguous blocks inside raster:::.readCellsGDAL and read them
> in block by block?
>
> On Sun, Oct 2, 2016 at 3:32 PM, Kenny Bell <kmb56 at berkeley.edu> wrote:
>
> No substantial difference, no.
>
> cdl <- brick("Data/CDL/2015_30m_cdls/2015_30m_cdls.img")
> system.time(raster::sampleRandom(cdl, size = 100))
> # user system elapsed
> # 4.16 21.32 25.50
> system.time(cdl[random_pts$row_1D[1:100]])
> # user system elapsed
> # 1.33 5.36 6.69
>
> cdl <- raster("Data/CDL/2015_30m_cdls/2015_30m_cdls.img")
> system.time(raster::sampleRandom(cdl, size = 100))
> # user system elapsed
> # 4.07 21.34 25.46
> system.time(cdl[random_pts$row_1D[1:100]])
> # user system elapsed
> # 1.20 4.97 6.17
>
>
>
> On Sun, Oct 2, 2016 at 2:47 PM, Michael Sumner <mdsumner at gmail.com> wrote:
>
> Try creating it as a single layer brick, does it make a difference?
>
> Cheers, Mike
>
> On Mon, 3 Oct 2016, 08:26 Kenny Bell <kmb56 at berkeley.edu> wrote:
>
> I am trying to sample points from a large RasterLayer (~100GB if read into
> memory).
>
> raster::sampleRandom relies on raster raster:::.readCellsGDAL which seems
> to loop through rows, read in entire columns using rgdal::getRasterData,
> and subset those columns in R.
>
> Sampling 100000 pts from this raster is only a few per column, so this
> isn't efficient.
>
> Using my own random numbers with `[` also relies on
> raster:::.readCellsGDAL.
>
> Does anyone have a suggestion for a better practice?
>
> The raster is public so this code should be reproducible:
>
> download:
> ftp://ftp.nass.usda.gov/download/res/2015_30m_cdls.zip
>
> cdl <- raster("2015_30m_cdls/2015_30m_cdls.img")
> raster::sampleRandom(cdl, size = 100000) # slow
>
> Cheers,
> Kenny
>
> [[alternative HTML version deleted]]
>
> _______________________________________________
> R-sig-Geo mailing list
> R-sig-Geo at r-project.org
> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>
> --
> Dr. Michael Sumner
> Software and Database Engineer
> Australian Antarctic Division
> 203 Channel Highway
> Kingston Tasmania 7050 Australia
>
>
>
>
> --
> Kendon Bell
> Email: kmb56 at berkeley.edu
> Phone: (510) 612-3375
>
> Ph.D. Candidate
> Department of Agricultural & Resource Economics
> University of California, Berkeley
>
>
>
>
> --
> Kendon Bell
> Email: kmb56 at berkeley.edu
> Phone: (510) 612-3375
>
> Ph.D. Candidate
> Department of Agricultural & Resource Economics
> University of California, Berkeley
>
> --
> Dr. Michael Sumner
> Software and Database Engineer
> Australian Antarctic Division
> 203 Channel Highway
> Kingston Tasmania 7050 Australia
>
>
>
>
> --
> Kendon Bell
> Email: kmb56 at berkeley.edu
> Phone: (510) 612-3375
>
> Ph.D. Candidate
> Department of Agricultural & Resource Economics
> University of California, Berkeley
>
--
Dr. Michael Sumner
Software and Database Engineer
Australian Antarctic Division
203 Channel Highway
Kingston Tasmania 7050 Australia
[[alternative HTML version deleted]]
More information about the R-sig-Geo
mailing list