[R-sig-Geo] merging data with SpatialPolygonsDataFrame

Dylan Beaudette dylan.beaudette at gmail.com
Fri Jan 25 22:05:34 CET 2008


On Friday 25 January 2008, Roger Bivand wrote:
> On Thu, 24 Jan 2008, Dylan Beaudette wrote:
> > On Thursday 24 January 2008, Roger Bivand wrote:
> >> On Thu, 24 Jan 2008, Dylan Beaudette wrote:
> >>> Hi Roger. Thanks for contributing some answers to this.
> >>>
> >>> I was recently working with a colleague on developing some sample
> >>> exercises for new students. Since joining new attribute data to a GIS
> >>> layer's table is a very common operation we included some samples on
> >>> how to do this within R. You have hinted at some possible ways to do it
> >>> above, but do you have a 'best practices' approach to doing this using
> >>> 'sp' methods and objects?
> >>>
> >>> For example:
> >>>
> >>> # contains an attribute col named 'veg_code'
> >>> veg <- readOGR(something.shp)
> >>>
> >>> # code meanings: indexed by 'veg_code'
> >>> codes <- read.dbf(table.dbf)
> >>>
> >>> # what is the best way to join up the attributes in 'veg' with the rows
> >>> in 'codes' ?
> >>
> >> Hi Dylan,
> >>
> >> This is a different question, but I won't break it out of this thread
> >> yet.
> >>
> >> You are doing look-up on codes to give labels to veg$veg_code, right?
> >>
> >> veg$veg_code are integer indices to codes$V1 (say V1, I don't know what
> >> it is). If length(unique(veg$veg_code)) == length(codes$V1), and
> >> sort(unique(veg$veg_code)) is 1:length(codes$V1), you should think of
> >> the factor as your friend:
> >>
> >> veg$veg_code_factor <- factor(veg$veg_code,
> >> labels=as.character(codes$V1))
> >>
> >> If not, you need another layer using perhaps order() or match() on the
> >> matching substring of codes$V1 to find out which value of veg$veg_code
> >> should have which label in as.character(codes$V1). Alternatively use the
> >> levels= argument to factor().
> >>
> >> Something like:
> >>
> >> set.seed(1)
> >> veg_code <- rpois(100, 4)
> >> table(veg_code)
> >> V1 <- paste("code", 0:10)
> >> V1
> >> levs <- 0:10
> >> veg_code_factor <- factor(veg_code, levels=levs, labels=V1)
> >> table(veg_code_factor, veg_code)
> >>
> >> No merging or messing with veg itself is needed, apart from adding a
> >> single extra factor column. The factor abstraction is a great strength
> >> of the S language.
> >>
> >> Have I misunderstood you?
> >>
> >> Roger
> >
> > I think so.
> >
> > I was (trying to) describe the process of joining, either 1:1 or many:1,
> > the att table associated with an sp object and some other data frame.
> >
> > merge() seems to work fine, but the order of the rows are different from
> > the original data frame attached to the sp object.

Thanks for the detailed response Roger.

> My experience is that merge() is often a false friend, because of object
> coercion and sorting, as you suggest. The rules for SpatialLinesDataFrame
> and SpatialPolygonsDataFrame are simple, and SpatialPointsDataFrame (and
> by extension SpatialPixelsDataFrame) can be made as simple. The first two
> constructor functions take a match.ID= argument, default TRUE, which
> matches the geometry ID - the ID slot in the component Lines or Polygons
> objects - to the data.frame row.name. For SpatialPointsDataFrame(), the
> same argument exists, but is only used if the coordinate matrix has row
> names to be matched to the data.frame row names. If it has been used, the
> data slot row names are the geometry IDs.
>
> From there, it gets harder. merge() takes lots of arguments, and a lot of
> trial and error is needed to reach a satisfactory result, that is a single
> data frame with row names matching the geometry IDs (not necessarily in
> the right order, but the same set of IDs).
>
> > My question was on the best way to 'update' the data frame attached to an
> > sp object, based on the results from a merge() with some other data.
>
> For one to one, there isn't really an alternative to extracting the data
> slot, do the merge, check the geometry IDs against the output data frame
> row names, and re-construct the object. That is, essentially where this
> thread began.
>
> If there is less data than geometries, merge() should fill out with NAs.
> If there is more data than geometries, the output data frame will need
> subsetting. Both match(), order(), and %in% are very handy here.
>
> For one geometry list to many - see reshape() first, for example to
> flatten a "tall" set of space/time observations so that the time values
> become new attribute columns. This could be met stations stacked by
> station and date, which need widening to stations by date*attributes
>
> If need be, the IDs of the geometries can be changed too, see
> ?"spChFIDs-methods" in maptools.
>
> Any closer?

I think that you have answered my questions and reaffirmed my thoughts on how 
to proceed. I am mostly interested in 1:1 and 1:many joins (geom:attr), and I 
think that the following approach should work in the general case:

# read vector data:
v <- readOGR(dsn='...', layer='...')

# safer approach to 1:1 or 1:many (geom:atts) joins
# add a sorting id for later use
v at data$sorting_id <- 1:nrow(v at data)

# make a copy of the original table
orig.table <- v at data

# make the table with 'joined' data
new.table <- merge(x=orig.table, y=veg_codes, by.x='CODE', by.y='code')

# re-order this table based on the original row order
new.table.ordered <- new.table[order(new.table$sorting_id), ]

# restore the origina row names
row.names(new.table.ordered) <- row.names(orig.table)

# replace the data table
v at data <- new.table.ordered


This should ensure that the merged rows are in the original order, and have 
the original row.names.

Does that sound correct?

Thanks,

Dylan



> Roger
>
> > does that help?
> >
> > cheers,
> >
> > Dylan
> >
> >>> As of now we are using merge to replace the dataframe slot of the
> >>> original file. We first re-order the results from merge to match the
> >>> original row ordering:
> >>>
> >>>
> >>> # an example file:
> >>> veg <- readOGR(dsn='ArcGISLabData/BrownsPond/', layer='vegevector')
> >>>
> >>> # some example codes
> >>> veg_codes <- data.frame(code=1:4, meaning=c('code 1','code 2','code
> >>> 3','code 4'))
> >>>
> >>> # join the original data table with the veg codes table
> >>> combined <- merge(x=veg at data, y=veg_codes, by.x='CODE', by.y='code')
> >>>
> >>>
> >>> # overwrite the original data frame with the combined version
> >>> # note that the original order needs to be restored
> >>> # since the original data was sorted on 'ID', we can use that to
> >>> restore # the correct order in the 'combined' dataframe:
> >>> v at data <- combined[order(combined$ID),]
> >>>
> >>>
> >>> In summary, is there a safer or preferred way to do this?
> >>>
> >>> thanks,
> >>>
> >>> Dylan



-- 
Dylan Beaudette
Soil Resource Laboratory
http://casoilresource.lawr.ucdavis.edu/
University of California at Davis
530.754.7341




More information about the R-sig-Geo mailing list