[R-sig-Geo] split() function on SpatialPolygonsDataFrame increases file size

Edzer Pebesma edzer.pebesma at uni-muenster.de
Tue Mar 22 23:10:58 CET 2016


I can't confirm this in a simple example:

> library(sp)
> x = GridTopology(c(0,0), c(1,1), c(100,100))
> p = as(x, "SpatialPolygons")
> p$z = 1:10000
> object.size(p)
35763352 bytes
> xx = split(p, rep(1:4, len = length(p)))
> object.size(xx)
35812992 bytes
> xx = split(p, rep(1:8, len = length(p)))
> object.size(xx)
35825792 bytes

In case this persists, please consider sharing data that reproduce this
with me off-line.
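
One way to narrow this down before sharing data is to compare the size
of the whole object with the size of the list and of each piece, and
then to break one piece down by slot. The following is only a sketch
under assumptions: the 20 x 20 grid, the column z and cores = 4 are
stand-ins chosen for illustration, not the original data or setup.

```r
library(sp)

# Hypothetical stand-in data: a 20 x 20 grid of square polygons
# with a single attribute column z.
g  <- GridTopology(c(0, 0), c(1, 1), c(20, 20))
sp <- as(g, "SpatialPolygons")
p  <- SpatialPolygonsDataFrame(sp,
        data.frame(z = seq_len(length(sp))), match.ID = FALSE)

cores <- 4  # assumed number of workers
xx <- split(p, rep(1:cores, len = length(p)))

# Size in bytes of the whole object, of the list, and of each piece
whole  <- as.numeric(object.size(p))
total  <- as.numeric(object.size(xx))
pieces <- sapply(xx, function(x) as.numeric(object.size(x)))

print(c(whole = whole, list = total, ratio = total / whole))
print(pieces)

# Per-slot sizes of the first piece, to see which part carries the bytes
by_slot <- sapply(slotNames(xx[[1]]),
                  function(s) as.numeric(object.size(slot(xx[[1]], s))))
print(by_slot)
```

If the ratio is close to 1 here but not for the real data, the per-slot
sizes of one piece should indicate whether the attribute table or the
polygon geometry accounts for the growth.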


On 22/03/16 22:44, Michael Sumner wrote:
> On Wed, 23 Mar 2016, 05:19 Luke Macaulay <luke.macaulay at gmail.com> wrote:
> 
>> I have a large SpatialPolygonsDataFrame of 500,000 polygons named sub
>> that I am splitting into 8 separate objects using split() to perform
>> multicore processing on.
>>
>> xx <- split(sub, rep(1:cores, len = nrow(sub@data)))
>>
>> The original file size in R's environment shows 4 GB, but after the
>> split, the list size increases to 7 GB, which seems like a really big
>> increase.
>>
>> Is this normal?  I wonder if there's increased file size due to the
>> reproduction of polygon borders and vertices that were previously
>> shared in the unsplit data, or is something else going on?
> 
> 
> 
> These objects never share vertices. If you think that can help, there are
> ways to store these objects as tables that remove redundancy.
> 
> Can you set up a clear, reproducible demonstration? I think advice
> here needs much more info, particularly on what kind of shapes your
> polygons are and what the processing is meant to do.
> 
> Cheers, Mike
> 
> 
> 
> 
>> I suspect that the split files are retaining some of the entire
>> dataset's characteristics, but I'm not sure.
>>
>> I thought this post
>> (
>> http://stackoverflow.com/questions/29137914/r-split-function-size-increase-issue
>> )
>> would solve my problem: the poster split by a numeric ID variable that
>> was used as an index in the created list, leading to the creation of
>> many empty lists. But I'm not splitting on a column, and after trying
>> the split in various ways, including trying to split on a created
>> column that was a factor, I still have the same problem.
>>
>> The problem ultimately is that when I try to process this on multiple
>> cores, I max out my memory.
>>
>> Much thanks,
>> Luke
>>

-- 
Edzer Pebesma
Institute for Geoinformatics  (ifgi),  University of Münster
Heisenbergstraße 2, 48149 Münster, Germany; +49 251 83 33081
Journal of Statistical Software:   http://www.jstatsoft.org/
Computers & Geosciences:   http://elsevier.com/locate/cageo/
Spatial Statistics Society http://www.spatialstatistics.info
