[R-sig-Geo] best practice for reading large shapefiles?

Edzer Pebesma edzer.pebesma at uni-muenster.de
Wed Apr 27 08:18:26 CEST 2016



On 26/04/16 22:33, Vinh Nguyen wrote:
> On Tue, Apr 26, 2016 at 1:12 PM, Roger Bivand <Roger.Bivand at nhh.no> wrote:
>> On Tue, 26 Apr 2016, Vinh Nguyen wrote:
>>
>>> Would loading the shapefile into postgresql first and then use readOGR
>>> to read from postgres be a recommended approach?  That is, would the
>>> bottleneck still occur?  Thank you.
>>
>>
>> Most likely, as both use the respective OGR drivers. With data this size,
>> you'll need a competent platform (probably Linux, say 128 GB RAM) as
>> everything is held in memory. I find it hard to grasp what the point of
>> doing this might be - visualization won't work, as none of the considerable
>> detail that these files certainly contain will be visible. Can you put the
>> lot into an SQLite file and access the attributes as SQL queries? I don't
>> see the analysis or statistics here.
>>
> 
> - I can't tell from your response whether you are recommending PostGIS
> as an approach or not.  Could you clarify?

Roger said the bottleneck would most likely still occur, but couldn't
make much of a recommendation because you had not revealed the purpose
of reading this data in R.

> 
> - I am working on a Windows server with 64 GB RAM, so not too weak,
> especially for files that are only a few GB in size.  Again, I'm not
> sure whether the job halted or is still running, just rather slowly.
> I've killed it for now, as the memory usage still had not grown after
> a few hours.

Reports that certain things do not work are often helpful and lead to
improvements in the software. With your report, however, we can't really
do much.

> 
> - Yes, the shapes are quite granular and many in quantity.  The use
> case was not to visualize them all at once.  Wanted a master file so
> that when I get a data set of interest, I could intersect the two and
> then subset the areas of interest (eg, within a state or county).
> Then visualize/analyze from there.  The master shapefile was meant to
> make it easy (reading in one file) as opposed to deciding which
> shapefile to read in depending on the project.

Using PostGIS for this use case may make sense, since PostGIS creates
and stores spatial indexes with its geometry data and does everything
in the database, rather than in memory. In R, you would probably do the
intersections with rgeos::gIntersects, which creates a spatial index on
the fly but does not store that index. Only experimentation can tell you
the magnitude of the difference.
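
As a rough illustration, here is a minimal sketch of that workflow: read
the area of interest and the (large) master layer from PostGIS with
rgdal::readOGR, then subset the master layer with rgeos::gIntersects.
The database name, layer names and connection details below are made up
for the example; adapt them to your own setup.

  library(rgdal)   # readOGR(), uses the GDAL/OGR PostgreSQL driver
  library(rgeos)   # gIntersects(), gUnaryUnion()

  ## hypothetical connection string and layer names
  dsn <- "PG:dbname=plss host=localhost user=postgres"

  ## the area of interest (a county, say) is small and cheap to read
  aoi  <- readOGR(dsn, layer = "study_area")

  ## the master layer is large; in practice you would restrict it first
  ## (e.g. by a state/county attribute) rather than pull it in whole
  plss <- readOGR(dsn, layer = "plss_sections")

  ## dissolve the AOI to a single geometry, then test every master
  ## feature against it; gIntersects() builds its spatial index on the
  ## fly for this call only
  aoi_union <- gUnaryUnion(aoi)
  keep <- as.vector(gIntersects(plss, aoi_union, byid = TRUE))
  plss_sub <- plss[keep, ]

Pushing the intersection into PostGIS itself (ST_Intersects on indexed
geometry columns) and reading back only the matching rows would avoid
loading the master layer into R at all; whether that pays off for your
data is, again, something only experimentation will tell.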

> 
> - I just looked back at the 30 PLSS zip files, and they provide shapes
> at 3 levels of granularity.  I went with the smallest.  I just
> realized that the mid-size one would be sufficient for now, which
> results in a 138 MB dbf and a 501 MB shp.  I have been attempting to
> read this in for ~30 minutes now, and assume it will read in fine
> after some time.  I will respond to this thread if that is not the
> case.
> 
(see my 2nd comment)

Best regards,
-- 
Edzer Pebesma
Institute for Geoinformatics  (ifgi),  University of Münster
Heisenbergstraße 2, 48149 Münster, Germany; +49 251 83 33081
Journal of Statistical Software:   http://www.jstatsoft.org/
Computers & Geosciences:   http://elsevier.com/locate/cageo/
Spatial Statistics Society http://www.spatialstatistics.info
