[R-sig-Geo] raster[] slow on large rasters

Michael Sumner mdsumner at gmail.com
Mon Oct 3 04:04:30 CEST 2016


Hey great, this been on my never-get-to-todo list for ages.

I can commit it to raster but I'd want to do a few tests first. Can you
provide some?

I'll include some tests when I commit it, but there's no actual test suite
as far as I know.

Just FYI, the source is on r-forge via Subversion and the author is Robert
Hijmans, but a couple of us also have write access:

https://r-forge.r-project.org/projects/raster/

Cheers, Mike.

On Mon, 3 Oct 2016 at 11:04 Kenny Bell <kmb56 at berkeley.edu> wrote:

I have implemented an improvement that works for my context. Is this the
right place to suggest code improvements? This implements the approach
where there is one call to rgdal::getRasterData per contiguous block. Is
this the right place to make code change suggestions?

system.time(sampleRandom(cdl, 100))
> system.time(sampleRandom(cdl, 100))
   user  system elapsed
   0.20    0.08    0.61

It changes raster:::.readCellsGDAL to:

.readCellsGDAL <- function(x, cells, layers) {

nl <- nlayers(x)
if (nl == 1) {
if (inherits(x, 'RasterLayer')) {
layers <- bandnr(x)
} else {
layers <- 1
}
}
laysel <- length(layers)
colrow <- matrix(ncol=2+laysel, nrow=length(cells))
colrow[,1] <- colFromCell(x, cells)
colrow[,2] <- rowFromCell(x, cells)
colrow[,3] <- NA
colrow <- colrow[order(colrow[,2], colrow[,1]),]
# This is one if contiguous, something else if not (except for the end of a
row)
diffrowcol <- diff(colrow[,2]) + diff(colrow[,1])
# Block numbers
blocknums <- cumsum(c(TRUE, diffrowcol != 1))
nc <- x at ncols
con <- rgdal::GDAL.open(x at file@name, silent=TRUE)
if (laysel == 1) {
for (blocknum in unique(blocknums)) {
 block_lgl <- blocknum == blocknums
 offs <- c(colrow[block_lgl,2][1] - 1, colrow[block_lgl, 1][1] - 1)
v <- rgdal::getRasterData(con, offset=offs, region.dim=c(1,
sum(block_lgl)), band = layers)
colrow[block_lgl, 3] <- v
}
} else {
for (i in 1:length(rows)) {
thisrow <- colrow[colrow[,2] == rows[i], , drop=FALSE]
if (nrow(thisrow) == 1) {
offs <- c(rows[i]-1, thisrow[,1]-1)
v <- as.vector( rgdal::getRasterData(con, offset=offs, region.dim=c(1, 1)) )
colrow[colrow[,2]==rows[i], 2+(1:laysel)] <- v[layers]

} else {
offs <- c(rows[i]-1, 0)
v <- rgdal::getRasterData(con, offset=offs, region.dim=c(1, nc))
v <- do.call(cbind, lapply(1:nl, function(i) v[,,i]))
colrow[colrow[,2]==rows[i], 2+(1:laysel)] <- v[thisrow[,1], layers]
}
}
}
rgdal::closeDataset(con)
colnames(colrow)[2+(1:laysel)] <- names(x)[layers]
colrow[, 2+(1:laysel)]
}


On Sun, Oct 2, 2016 at 4:05 PM, Kenny Bell <kmb56 at berkeley.edu> wrote:

I am unsure if the file is tiled - how do I find this out? I am finding
that sampling the cells directly and using `[` is also slow, though not as
slow as sampleRandom. Is that what you meant by "index cell extract"?

Using readAll isn't going to work as it reads in very slowly and is large
(~100GB).

On Sun, Oct 2, 2016 at 3:54 PM, Michael Sumner <mdsumner at gmail.com> wrote:

Is the file tiled? Raster's extract is slow then because it scans line by
line rather than by tile. The only fix I know  is to readAll into memory or
write to a new untiled file. At any rate you might as well sample the cell
numbers more directly and use index cell extract instead of sampleRandom

On Mon, 3 Oct 2016, 09:40 Kenny Bell <kmb56 at berkeley.edu> wrote:

Is an approach that could improve this is to arrange the locations to
collect into contiguous blocks inside raster:::.readCellsGDAL and read them
in block by block?

On Sun, Oct 2, 2016 at 3:32 PM, Kenny Bell <kmb56 at berkeley.edu> wrote:

No substantial difference, no.

cdl <- brick("Data/CDL/2015_30m_cdls/2015_30m_cdls.img")
system.time(raster::sampleRandom(cdl, size = 100))
#   user  system elapsed
#   4.16   21.32   25.50
system.time(cdl[random_pts$row_1D[1:100]])
#   user  system elapsed
#   1.33    5.36    6.69

cdl <- raster("Data/CDL/2015_30m_cdls/2015_30m_cdls.img")
system.time(raster::sampleRandom(cdl, size = 100))
#   user  system elapsed
#   4.07   21.34   25.46
system.time(cdl[random_pts$row_1D[1:100]])
#   user  system elapsed
#   1.20    4.97    6.17



On Sun, Oct 2, 2016 at 2:47 PM, Michael Sumner <mdsumner at gmail.com> wrote:

Try creating it as a single layer brick, does it make a difference?

Cheers, Mike

On Mon, 3 Oct 2016, 08:26 Kenny Bell <kmb56 at berkeley.edu> wrote:

I am trying to sample points from a large RasterLayer (~100GB if read into
memory).

raster::sampleRandom relies on raster raster:::.readCellsGDAL which seems
to loop through rows, read in entire columns using rgdal::getRasterData,
and subset those columns in R.

Sampling 100000 pts from this raster is only a few per column, so this
isn't efficient.

Using my own random numbers with `[` also relies on raster:::.readCellsGDAL.

Does anyone have a suggestion for a better practice?

The raster is public so this code should be reproducible:

download:
ftp://ftp.nass.usda.gov/download/res/2015_30m_cdls.zip

cdl <- raster("2015_30m_cdls/2015_30m_cdls.img")
raster::sampleRandom(cdl, size = 100000) # slow

Cheers,
Kenny

        [[alternative HTML version deleted]]

_______________________________________________
R-sig-Geo mailing list
R-sig-Geo at r-project.org
https://stat.ethz.ch/mailman/listinfo/r-sig-geo

-- 
Dr. Michael Sumner
Software and Database Engineer
Australian Antarctic Division
203 Channel Highway
Kingston Tasmania 7050 Australia




-- 
Kendon Bell
Email: kmb56 at berkeley.edu
Phone: (510) 612-3375

Ph.D. Candidate
Department of Agricultural & Resource Economics
University of California, Berkeley




-- 
Kendon Bell
Email: kmb56 at berkeley.edu
Phone: (510) 612-3375

Ph.D. Candidate
Department of Agricultural & Resource Economics
University of California, Berkeley

-- 
Dr. Michael Sumner
Software and Database Engineer
Australian Antarctic Division
203 Channel Highway
Kingston Tasmania 7050 Australia




-- 
Kendon Bell
Email: kmb56 at berkeley.edu
Phone: (510) 612-3375

Ph.D. Candidate
Department of Agricultural & Resource Economics
University of California, Berkeley




-- 
Kendon Bell
Email: kmb56 at berkeley.edu
Phone: (510) 612-3375

Ph.D. Candidate
Department of Agricultural & Resource Economics
University of California, Berkeley

-- 
Dr. Michael Sumner
Software and Database Engineer
Australian Antarctic Division
203 Channel Highway
Kingston Tasmania 7050 Australia

	[[alternative HTML version deleted]]



More information about the R-sig-Geo mailing list