[R-sig-Geo] raster[] slow on large rasters

Kenny Bell kmb56 at berkeley.edu
Mon Oct 3 02:04:31 CEST 2016


I have implemented an improvement that works for my context. Is this the
right place to suggest code improvements? This implements the approach
where there is one call to rgdal::getRasterData per contiguous block. Is
this the right place to make code change suggestions?

system.time(sampleRandom(cdl, 100))
> system.time(sampleRandom(cdl, 100))
   user  system elapsed
   0.20    0.08    0.61

It changes raster:::.readCellsGDAL to:

.readCellsGDAL <- function(x, cells, layers) {

nl <- nlayers(x)
if (nl == 1) {
if (inherits(x, 'RasterLayer')) {
layers <- bandnr(x)
} else {
layers <- 1
}
}
laysel <- length(layers)
colrow <- matrix(ncol=2+laysel, nrow=length(cells))
colrow[,1] <- colFromCell(x, cells)
colrow[,2] <- rowFromCell(x, cells)
colrow[,3] <- NA
colrow <- colrow[order(colrow[,2], colrow[,1]),]
# This is one if contiguous, something else if not (except for the end of a
row)
diffrowcol <- diff(colrow[,2]) + diff(colrow[,1])
# Block numbers
blocknums <- cumsum(c(TRUE, diffrowcol != 1))
nc <- x at ncols
con <- rgdal::GDAL.open(x at file@name, silent=TRUE)
if (laysel == 1) {
for (blocknum in unique(blocknums)) {
 block_lgl <- blocknum == blocknums
 offs <- c(colrow[block_lgl,2][1] - 1, colrow[block_lgl, 1][1] - 1)
v <- rgdal::getRasterData(con, offset=offs, region.dim=c(1,
sum(block_lgl)), band = layers)
colrow[block_lgl, 3] <- v
}
} else {
for (i in 1:length(rows)) {
thisrow <- colrow[colrow[,2] == rows[i], , drop=FALSE]
if (nrow(thisrow) == 1) {
offs <- c(rows[i]-1, thisrow[,1]-1)
v <- as.vector( rgdal::getRasterData(con, offset=offs, region.dim=c(1, 1)) )
colrow[colrow[,2]==rows[i], 2+(1:laysel)] <- v[layers]

} else {
offs <- c(rows[i]-1, 0)
v <- rgdal::getRasterData(con, offset=offs, region.dim=c(1, nc))
v <- do.call(cbind, lapply(1:nl, function(i) v[,,i]))
colrow[colrow[,2]==rows[i], 2+(1:laysel)] <- v[thisrow[,1], layers]
}
}
}
rgdal::closeDataset(con)
colnames(colrow)[2+(1:laysel)] <- names(x)[layers]
colrow[, 2+(1:laysel)]
}


On Sun, Oct 2, 2016 at 4:05 PM, Kenny Bell <kmb56 at berkeley.edu> wrote:

> I am unsure if the file is tiled - how do I find this out? I am finding
> that sampling the cells directly and using `[` is also slow, though not as
> slow as sampleRandom. Is that what you meant by "index cell extract"?
>
> Using readAll isn't going to work as it reads in very slowly and is large
> (~100GB).
>
> On Sun, Oct 2, 2016 at 3:54 PM, Michael Sumner <mdsumner at gmail.com> wrote:
>
>> Is the file tiled? Raster's extract is slow then because it scans line by
>> line rather than by tile. The only fix I know  is to readAll into memory or
>> write to a new untiled file. At any rate you might as well sample the cell
>> numbers more directly and use index cell extract instead of sampleRandom
>>
>> On Mon, 3 Oct 2016, 09:40 Kenny Bell <kmb56 at berkeley.edu> wrote:
>>
>>> Is an approach that could improve this is to arrange the locations to
>>> collect into contiguous blocks inside raster:::.readCellsGDAL and read them
>>> in block by block?
>>>
>>> On Sun, Oct 2, 2016 at 3:32 PM, Kenny Bell <kmb56 at berkeley.edu> wrote:
>>>
>>>> No substantial difference, no.
>>>>
>>>> cdl <- brick("Data/CDL/2015_30m_cdls/2015_30m_cdls.img")
>>>> system.time(raster::sampleRandom(cdl, size = 100))
>>>> #   user  system elapsed
>>>> #   4.16   21.32   25.50
>>>> system.time(cdl[random_pts$row_1D[1:100]])
>>>> #   user  system elapsed
>>>> #   1.33    5.36    6.69
>>>>
>>>> cdl <- raster("Data/CDL/2015_30m_cdls/2015_30m_cdls.img")
>>>> system.time(raster::sampleRandom(cdl, size = 100))
>>>> #   user  system elapsed
>>>> #   4.07   21.34   25.46
>>>> system.time(cdl[random_pts$row_1D[1:100]])
>>>> #   user  system elapsed
>>>> #   1.20    4.97    6.17
>>>>
>>>>
>>>>
>>>> On Sun, Oct 2, 2016 at 2:47 PM, Michael Sumner <mdsumner at gmail.com>
>>>> wrote:
>>>>
>>>>> Try creating it as a single layer brick, does it make a difference?
>>>>>
>>>>> Cheers, Mike
>>>>>
>>>>> On Mon, 3 Oct 2016, 08:26 Kenny Bell <kmb56 at berkeley.edu> wrote:
>>>>>
>>>>>> I am trying to sample points from a large RasterLayer (~100GB if read
>>>>>> into
>>>>>> memory).
>>>>>>
>>>>>> raster::sampleRandom relies on raster raster:::.readCellsGDAL which
>>>>>> seems
>>>>>> to loop through rows, read in entire columns using
>>>>>> rgdal::getRasterData,
>>>>>> and subset those columns in R.
>>>>>>
>>>>>> Sampling 100000 pts from this raster is only a few per column, so this
>>>>>> isn't efficient.
>>>>>>
>>>>>> Using my own random numbers with `[` also relies on
>>>>>> raster:::.readCellsGDAL.
>>>>>>
>>>>>> Does anyone have a suggestion for a better practice?
>>>>>>
>>>>>> The raster is public so this code should be reproducible:
>>>>>>
>>>>>> download:
>>>>>> ftp://ftp.nass.usda.gov/download/res/2015_30m_cdls.zip
>>>>>>
>>>>>> cdl <- raster("2015_30m_cdls/2015_30m_cdls.img")
>>>>>> raster::sampleRandom(cdl, size = 100000) # slow
>>>>>>
>>>>>> Cheers,
>>>>>> Kenny
>>>>>>
>>>>>>         [[alternative HTML version deleted]]
>>>>>>
>>>>>> _______________________________________________
>>>>>> R-sig-Geo mailing list
>>>>>> R-sig-Geo at r-project.org
>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>>>
>>>>> --
>>>>> Dr. Michael Sumner
>>>>> Software and Database Engineer
>>>>> Australian Antarctic Division
>>>>> 203 Channel Highway
>>>>> Kingston Tasmania 7050 Australia
>>>>>
>>>>>
>>>>
>>>>
>>>> --
>>>> Kendon Bell
>>>> Email: kmb56 at berkeley.edu
>>>> Phone: (510) 612-3375
>>>>
>>>> Ph.D. Candidate
>>>> Department of Agricultural & Resource Economics
>>>> University of California, Berkeley
>>>>
>>>
>>>
>>>
>>> --
>>> Kendon Bell
>>> Email: kmb56 at berkeley.edu
>>> Phone: (510) 612-3375
>>>
>>> Ph.D. Candidate
>>> Department of Agricultural & Resource Economics
>>> University of California, Berkeley
>>>
>> --
>> Dr. Michael Sumner
>> Software and Database Engineer
>> Australian Antarctic Division
>> 203 Channel Highway
>> Kingston Tasmania 7050 Australia
>>
>>
>
>
> --
> Kendon Bell
> Email: kmb56 at berkeley.edu
> Phone: (510) 612-3375
>
> Ph.D. Candidate
> Department of Agricultural & Resource Economics
> University of California, Berkeley
>



-- 
Kendon Bell
Email: kmb56 at berkeley.edu
Phone: (510) 612-3375

Ph.D. Candidate
Department of Agricultural & Resource Economics
University of California, Berkeley

	[[alternative HTML version deleted]]



More information about the R-sig-Geo mailing list