[R-sig-Geo] large shapefiles; zip code data

Roger Bivand Roger.Bivand at nhh.no
Thu Mar 12 14:53:25 CET 2009


On Thu, 12 Mar 2009, Ben Fissel wrote:

> Hello,
>
> I am attempting to fit a CAR count-data model of the Besag-York-Mollié
> form to US zip code data (entire US minus Hawaii and Alaska).  However, I'm
> running out of memory reading in the zip code data.  The zip code data I
> obtained from the census at
> http://www2.census.gov/geo/tiger/TIGER2008/tl_2008_us_zcta500.zip and are
> shapefiles.  I've allocated 4GB of memory to R, which is the max my OS
> (Vista) will give it.  Despite this, when I attempt to load the shapefiles I
> run out of memory using readOGR or readShapePoly.  I had a similar problem
> in Stata and worked around it by reading in the shapefiles for the lower 48
> states, http://www.census.gov/geo/www/cob/z52000.html, separately and
> concatenating them together, relabeling the IDs in the process.  I'm trying
> to do the same thing in R, but relabeling the IDs is not as straightforward
> for me given my novice R programming ability.  Luckily I found a little help
> at
> http://help.nceas.ucsb.edu/R:_Spatial#Understanding_spatial_data_formats_in_R
> which I adapted to my code.

Ben,

In addition to the answers you've already had, you could look at the 
spRbind() and spChFIDs() methods in maptools. spChFIDs() lets you 
manipulate the IDs (to make them unique, for example), and spRbind() 
sticks the objects together. I've used the combination for assembling US 
census tracts (about 68,000 of them), so a larger task than the one you 
face. I then ran poly2nb() in spdep on the output, which completed, 
although it needed a lot of time.
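
A rough, untested sketch of that approach (the layer names follow your 
zt*_d00 naming scheme, and prefixing the polygon IDs with the state code 
is just one way to guarantee uniqueness):

library(rgdal)     # readOGR()
library(maptools)  # spChFIDs(), spRbind()

statelist <- c("01", "04", "05")  # extend to all 48 state codes
merged <- NULL
for (st in statelist) {
  layer <- paste("zt", st, "_d00", sep = "")
  sp_st <- readOGR(".", layer)
  # prefix the polygon IDs with the state code so they remain
  # unique across states before combining
  sp_st <- spChFIDs(sp_st, paste(st, row.names(sp_st), sep = "_"))
  merged <- if (is.null(merged)) sp_st else spRbind(merged, sp_st)
}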

There is a detailed example of this on the ASDAR book website, 
http://www.asdar-book.org; see the code examples for Chapter 5, though 
there it is just assembling county data for three US states.
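
On your question further down about the weights matrix: spdep does not 
store a dense n x n matrix. poly2nb() returns an nb object, a list of 
integer vectors of neighbour indices, and nb2listw() keeps that sparse 
form. Continuing from the merged object in the sketch above, roughly:

library(spdep)

nb <- poly2nb(merged)  # queen contiguity by default; slow but feasible
summary(nb)
# row-standardised weights; zero.policy in case some zips have no neighbours
lw <- nb2listw(nb, style = "W", zero.policy = TRUE)
# for the BYM model via R2WinBUGS, nb2WB() gives the adj/weights/num vectors
bugs_nb <- nb2WB(nb)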

Hope this helps,

Roger

>
> spatdata <- readOGR(".", "zt01_d00")
> #spatdata <- readShapePoly("zt01_d00")
> names(spatdata)[3] <- "ZT00_D00"
> names(spatdata)[4] <- "ZT00_D00_I"
>
> for (j in 2:2){    # just loop over one file until I get it to work
>   filename <- paste("zt", statelist[j], "_d00", sep = "")  # statelist is a
> vector of the form statelist <- c("01","04",...,"56") with numbers that
> correspond to the 48 state shapefiles
>
>   spatdf <- readOGR(".", filename)
> # spatdf <- readShapePoly(filename)
>   names(spatdf)[3] <- "ZT00_D00"
>   names(spatdf)[4] <- "ZT00_D00_I"
>   mergedata <- rbind(spatdata@data, spatdf@data)
>   mergepolys <- c(spatdata@polygons, spatdf@polygons)
>   mergepolysp <-
> SpatialPolygons(mergepolys,proj4string=CRS(proj4string(spatdf)))
>   rm("spatdata","spatdf","filename")
>
>   for (i in 1: length(mergepolys)){
>     sNew = as.character(i)
>     mergepolys[i]@ID = sNew
>   }
>   ID <- c(as.character(1:length(mergepolys)))
>   mergedataID <- cbind(ID,mergedata)
>   spatdata <- SpatialPolygonsDataFrame(mergepolysp,data =
> mergedataID,match.ID = FALSE)
>   rm("mergepolys","mergedata","mergepolysp","mergedataID","ID")
>
>   gc()
> }
>
> However, in the for loop over "i" I get an error when trying to relabel the
> IDs: "Error in validObject(.Object) :  invalid class "SpatialPolygons"
> object: non-unique Polygons ID slot values".  I've tried a number of
> different ways to change the IDs in 'mergepolys' but haven't been successful
> yet.
>
> Ultimately, I just want to get the shapefiles into R so I can identify
> contiguous zip codes for the spatial regression.  Whether I get this by
> loading in one big zip code shapefile or concatenating 48 state files is
> irrelevant to me.  Perhaps the census shapefiles have superfluous data that
> I can get rid of to free up memory and still achieve my objective; I don't
> know enough about shapefiles and how R reads them to know what I can throw
> away.  Maybe I'm going about this all wrong.  Thank you for any help and/or
> suggestions that you can provide.
>
> After getting the shapefiles in, I plan to identify contiguous zip codes and
> use R2WinBUGS to fit the model as outlined in "Applied Spatial Data Analysis
> with R".  However, given the memory issues I'm having, I am concerned that
> forming the spatial weights matrix won't be possible; will R try to store
> this as an n x n matrix?  Furthermore, I have about 50+ other covariates
> that I need to merge in with the zip code data, which are going to take up
> memory as well.  Simply put, is the memory bottleneck just in the
> function(s) loading the shapefiles, or am I going to have trouble fitting
> this model with the covariates in R?
>
> I've seen the thread "mapping by zip codes"
> https://stat.ethz.ch/pipermail/r-sig-geo/2009-March/005194.html, which
> provides very useful information but hasn't helped me get around the
> problems I'm having.
>
> I've tried to be complete yet concise.  If there is any other information
> you need please let me know.
>
> Thanks for any help and/or suggestions you can provide.
>
> -Ben
>

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no


