[R-sig-Geo] Current options for creating/querying vector data WITHOUT loading them into memory?

Roger Bivand Roger.Bivand at nhh.no
Sat Jan 18 08:31:17 CET 2014

On Fri, 17 Jan 2014, Tim Keitt wrote:

> On Fri, Jan 17, 2014 at 1:22 PM, Roger Bivand <Roger.Bivand at nhh.no> wrote:
>> On Fri, 17 Jan 2014, Jonathan Greenberg wrote:
>>> Across all vector formats, which do you think would be a good
>>> intermediate between in-memory Spatial* and PostGIS?  I'd put a few
>>> stipulations:
>>> 1) The format should be open source and supported by existing APIs
>>> (OGR/rgeos)
>>> 2) It should be portable (file-based)
>>> 3) It should be "scalable" (able to support arbitrarily large vector
>>> databases)
>> Could I ask for a range of use cases? The sp classes are designed for
>> statistical analysis, so in general some hundreds of thousands of
>> observations/features should suffice amply. The use cases should
>> demonstrate which kinds of objects and functionalities are thought
>> necessary. The fact that there is lots of data doesn't mean that it is all
>> needed for analysis or inference, or even visualization, I think?
>> Have you considered interfacing the OGR utilities from the system() call
>> to subset features/fields?
>> I think that 2) - file-based - is moot, if there is that much data, it
>> needs to be in a database system, possibly with an OGR driver, which OGR
>> utilities could access.
>> Have you considered Terralib (now 4, the development version 5 will be
>> closer to GDAL/OGR)? My intuition is that this is a viable solution.
>> We really also need to accommodate space-time objects in any significant
>> revision, I think - or at least prepare object structures that are
>> forward-looking with regard to temporal data.
>> I have asked several times for volunteers to rewrite rgdal::readOGR
>> (without anyone stepping forward), because it is fairly inefficient, and
>> should support SQL queries introduced in GDAL/OGR from 1.8. Supporting the
>> OGR SQL dialect means that all drivers support queries on FID and field
>> values.
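
A minimal sketch of the subsetting idea above, written in Python purely for illustration. It builds an ogr2ogr command line that applies an OGR SQL query on FID and a field value, so that only the matching features and fields are copied out of the source. The file names, the layer name, and the field names (parcels.shp, owner, area_ha) are hypothetical, and the command is printed rather than executed.

```python
import shlex
import subprocess


def build_ogr2ogr_subset(src, dst, layer, fields, where, driver="ESRI Shapefile"):
    """Return an ogr2ogr argument list that copies only the selected
    fields and features, using the OGR SQL dialect."""
    sql = "SELECT {} FROM {} WHERE {}".format(", ".join(fields), layer, where)
    return ["ogr2ogr", "-f", driver, dst, src, "-dialect", "OGRSQL", "-sql", sql]


def run_subset(cmd):
    """Run the command; requires the GDAL/OGR utilities on the PATH."""
    subprocess.run(cmd, check=True)


# Hypothetical data source and attribute names, for illustration only.
cmd = build_ogr2ogr_subset(
    src="parcels.shp",
    dst="parcels_subset.shp",
    layer="parcels",
    fields=["FID", "owner"],
    where="FID < 1000 AND area_ha > 5",
)

# Print a shell-safe version of the command instead of running it.
print(" ".join(shlex.quote(part) for part in cmd))
```

The printed command string could equally be handed to system() in R; the subset file it writes would then be small enough to read into sp objects in the usual way.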
>> Within the next four years, I will be giving up maintenance of rgdal and
>> rgeos (possibly other packages too). I can help, but users do not deserve
>> key packages potentially compromised by the health and poor responsiveness
>> of an emeritus. Forward planning is needed for others to take on these
>> responsibilities before it becomes a matter of urgency. The pool of active
>> developers must be enlarged this year.
> Roger,
> Thank you for your maintenance efforts!
> I've drifted towards postgis/C++ over time in my own work, but am now 
> developing some courses around R. I anticipate being fairly active with 
> R development going forward. I have a full rewrite of the OGR io bits 
> that I will make available soon. It works really well when your data are 
> in postgis or any other OGR format.


This is very positive, thanks! When you are ready (or even before!), I'd 
strongly encourage others to get to know more of the rgdal/rgeos 
internals, and developments in the underlying software and standards. I'll 
be speaking at OGRS in Helsinki in June (http://2014.ogrs-community.org); 
could we use that as a tentative time frame (especially if interacting 
with others in the open source geospatial communities may be helpful)? 
Should we try to put an RFC together (and put it on R-forge, for example)?

I have considered using the OGC/GEOS representation under a thin "new" sp, 
but couldn't see how to avoid having at least one representation of 
geometries in memory. The sp <-> GEOS bridge in rgeos is there and sort-of 
works (the classes don't map exactly), but involves a lot of conversion. 
As OGR can link to GEOS, it might make sense to consider merging the 
packages. I couldn't see how to approach the elegance of your external 
pointer code for low-level GDAL interaction - pointing to an open GDAL 
object, but then regular grids have sparse geometries.


>> Roger
>>> Cheers!
>>> --j
>>> On Thu, Jan 16, 2014 at 2:49 PM, Tim Keitt <tkeitt at utexas.edu> wrote:
>>>> On Thu, Jan 16, 2014 at 1:40 PM, Jonathan Greenberg <jgrn at illinois.edu>
>>>> wrote:
>>>>> I've wondered if it would be possible to do something like what Robert
>>>>> did with the raster() package, where the analysis (read/write) was
>>>>> being done on-demand on the data rather than entirely in-memory.
>>>>> It is, of course, much harder to come up with elegant solutions for
>>>>> vector data than for raster data, but I think some basic
>>>>> functionality would be great.  Perhaps spatialite as a backbone? You
>>>>> can easily install the sqlite executable via the RSQLite package, and
>>>>> there is a now-abandoned but available code base at
>>>>> http://cran.r-project.org/web/packages/SQLiteMap/ (I spoke to the
>>>>> developer, who said he won't be updating it) that might allow for a
>>>>> relatively easy cross-platform install of the spatialite addon.
>>>>> Something that would fill in the gap between the Spatial* classes
>>>>> (which won't scale to large datasets) and PostGIS (which has much
>>>>> more complex installation requirements)?
>>>>> How does spatialite perform in terms of large queries?  I imagine not
>>>>> as well as PostGIS, but does it at least scale memory-wise on most
>>>>> standard queries?
>>>> I've not used it. Generally sqlite is faster than postgresql but not as
>>>> reliable. I just don't want to learn another syntax variation. Utilizing
>>>> spatial indices in spatialite, for example, requires explicit modification
>>>> of your SQL queries. The planner does not use the indices automatically,
>>>> as it does in postgresql. But it's a very useful tool, as you can do
>>>> everything out of a single file on disk.
>>>> THK
>>>>> --j
>>>>> On Thu, Jan 16, 2014 at 1:14 PM, Tim Keitt <tkeitt at utexas.edu> wrote:
>>>>>> On Thu, Jan 16, 2014 at 1:09 PM, Barry Rowlingson
>>>>>> <b.rowlingson at lancaster.ac.uk> wrote:
>>>>>>> Well, back when I wrote 'rmap' I abstracted out the storage of the
>>>>>>> data from the data object... So your object in R could represent a
>>>>>>> subset of a shapefile, and the code only grabbed that chunk of the
>>>>>>> shapefile when it was needed, for example to plot (the R object was
>>>>>>> basically the name of the shapefile plus a selection vector).
>>>>>>> Then we threw that code out and sp classes were born!
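
The design described above (an object that is basically a file name plus a selection vector, with data read only when needed) can be sketched roughly as follows. This is an illustration in Python under stated assumptions, not the rmap code: the LazySubset class is hypothetical, and a small CSV file stands in for the shapefile.

```python
import csv
import os
import tempfile


class LazySubset:
    """A file name plus a selection vector; rows are read only on demand."""

    def __init__(self, path, selection=None):
        self.path = path
        # None means "all rows"; otherwise a set of zero-based row indices.
        self.selection = None if selection is None else frozenset(selection)

    def subset(self, indices):
        """Return a narrower view without touching the file."""
        wanted = frozenset(indices)
        if self.selection is not None:
            wanted &= self.selection
        return LazySubset(self.path, wanted)

    def rows(self):
        """Stream the selected rows from disk, one at a time."""
        with open(self.path, newline="") as handle:
            for index, row in enumerate(csv.DictReader(handle)):
                if self.selection is None or index in self.selection:
                    yield row


# Build a small stand-in data file (hypothetical columns id, x, y).
tmp = tempfile.NamedTemporaryFile("w", suffix=".csv", delete=False, newline="")
with tmp:
    writer = csv.writer(tmp)
    writer.writerow(["id", "x", "y"])
    for i in range(10):
        writer.writerow([i, i * 1.0, i * 2.0])

# Two successive selections are combined before any data are read.
view = LazySubset(tmp.name).subset([2, 5, 7]).subset([5, 7, 9])
picked = [row["id"] for row in view.rows()]
print(picked)
os.remove(tmp.name)
```

Subsetting only manipulates the index set, so the cost of reading is deferred until something such as plotting actually asks for the rows.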
>>>>>>>  I've often thought about restoring some of this kind of
>>>>>>> functionality, but R's object-oriented classes just frustrate me. It's
>>>>>>> not so simple to build a superclass of sp class objects. Or maybe it
>>>>>>> is now? For some value of 'simple'...
>>>>>>>  Suppose you had a gigantic spatialite db - if you want to work with
>>>>>>> it spatially (mapping, rgeos) you are going to have to get the bits
>>>>>>> you need into main memory, so the simplest is just to load selections
>>>>>>> into sp-class objects. Is that already possible with the OGR
>>>>>>> spatialite driver? Can you also load subsets of shapefiles using some
>>>>>>> SQL passed to the OGR shapefile driver?
>>>>>>>  What would you want to do on whole-dataset objects of this class?
>>>>>>> Would you want to do the processing on the database if possible (if
>>>>>>> it's PostGIS or Spatialite)? Or have an automatic chunking procedure
>>>>>>> for operations that don't need the whole database at once, such as
>>>>>>> finding centroids of polygons?
>>>>>>> Hmmm thoughts thoughts thoughts and no action :( Sorry!
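
As a rough illustration of the chunking idea for operations that do not need the whole dataset at once, the sketch below computes polygon centroids from an SQLite table a few rows at a time. It is written in Python with the standard sqlite3 module and rests on several assumptions: the polygons table and its ring column are hypothetical, rings are stored as plain "x y,x y,..." text rather than as SpatiaLite geometry blobs, and an in-memory database stands in for a file on disk.

```python
import sqlite3


def ring_centroid(points):
    """Area-weighted centroid of a simple ring (shoelace formula).
    Assumes a non-degenerate ring, i.e. non-zero area."""
    if points[0] != points[-1]:
        points = points + [points[0]]  # close the ring if needed
    area2 = cx = cy = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cross = x0 * y1 - x1 * y0
        area2 += cross
        cx += (x0 + x1) * cross
        cy += (y0 + y1) * cross
    return cx / (3.0 * area2), cy / (3.0 * area2)


def chunked_centroids(conn, chunk_size=1000):
    """Yield (id, cx, cy), fetching at most chunk_size rows at a time."""
    cur = conn.execute("SELECT id, ring FROM polygons ORDER BY id")
    while True:
        rows = cur.fetchmany(chunk_size)
        if not rows:
            break
        for fid, ring in rows:
            pts = [tuple(map(float, p.split())) for p in ring.split(",")]
            yield (fid,) + ring_centroid(pts)


# Demo data: five 2 x 2 squares whose lower-left corner is (i, i).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE polygons (id INTEGER PRIMARY KEY, ring TEXT)")
for i in range(5):
    ring = f"{i} {i},{i + 2} {i},{i + 2} {i + 2},{i} {i + 2},{i} {i}"
    conn.execute("INSERT INTO polygons VALUES (?, ?)", (i, ring))
conn.commit()

results = list(chunked_centroids(conn, chunk_size=2))
for fid, cx, cy in results:
    print(fid, cx, cy)
```

Only one chunk of rows is held in memory at any time, so the same loop would apply to a table far larger than RAM, provided each result can be written out or aggregated as it is produced.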
>>>>>> Barry,
>>>>>> I'll have more to say on this in a couple of weeks.
>>>>>> THK
>>>>>>> Barry
>>>>>>> On Thu, Jan 16, 2014 at 6:52 PM, Jonathan Greenberg <jgrn at illinois.edu> wrote:
>>>>>>>> r-sig-geo'ers:
>>>>>>>> As vector datasets are getting a lot larger, there is a limitation
>>>>>>>> with the Spatial* formats in that they must be loaded into main
>>>>>>>> memory.  I was curious what folks who have been dealing with massive
>>>>>>>> vector files have come up with working within the R environment?  Has
>>>>>>>> anyone played around with file geodatabases or spatialite formats
>>>>>>>> (for instance)?  How are you creating/querying the data?
>>>>>>>> Thanks!
>>>>>>>> --j
