[R-sig-Geo] Is a simple feature -friendly version of spdep being developed?

Tue May 16 10:25:04 CEST 2017

On Tue, 16 May 2017, Tiernan Martin wrote:

>> Should the weights be
>> squirreled away inside the object flowing down the pipe?
>
> I'd propose that the that functions be written in such a way that the
> neighbors, weights, and test results can be stored in list cols within the
> data.frame.
>
> This way multiple tests (and their inputs) can be stored in a single,
> easily comprehensible, rectangular data object.
>
> Here's some more pretend code to illustrate that concept with regard to sf:
>
>  nc %>%
>    group_by(NAME) %>%
>    nest %>%
>    mutate(
>           geometry = map(.x = data, ~.x[['geometry']]),
>           neighbors = map(.x = data, .f = my_nb), # creates a list col of
> the neighbors
>           weights = map(.x = data, .f = add_weights), # creates a weights
> object (class=wt, is this a sparse matrix in spdep?)
>           test_moran = map2(.x = data, .y =  weights, .f = my_moran), #
> creates list of Moran's I and sample kurtosis of x
>           test_geary = map2(.x = data, .y = weights, .f = my_geary) #
> creates list of Geary's C and sample kurtosis of x
>               )
>
> #> # A tibble: 100 x 7
> #>           NAME              data         geometry  neighbors  weights
> test_moran test_geary
> #>        <fctr>            <list>           <list>     <list>   <list>
> <list>     <list>
> #>  1        Ashe <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt>
> <list [2]> <list [2]>
> #>  2   Alleghany <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt>
> <list [2]> <list [2]>
> #>  3       Surry <tibble [1 x 14]> <simple_feature> <list [4]> <S3: wt>
> <list [2]> <list [2]>
> #>  4   Currituck <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt>
> <list [2]> <list [2]>
> #>  5 Northampton <tibble [1 x 14]> <simple_feature> <list [3]> <S3: wt>
> <list [2]> <list [2]>
> #>  6    Hertford <tibble [1 x 14]> <simple_feature> <list [6]> <S3: wt>
> <list [2]> <list [2]>
> #>  7      Camden <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt>
> <list [2]> <list [2]>
> #>  8       Gates <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt>
> <list [2]> <list [2]>
> #>  9      Warren <tibble [1 x 14]> <simple_feature> <list [5]> <S3: wt>
> <list [2]> <list [2]>
> #> 10      Stokes <tibble [1 x 14]> <simple_feature> <list [2]> <S3: wt>
> <list [2]> <list [2]>
> #> # ... with 90 more rows
>
> Examples of this nested data.frame / list col approach can be found here:
> http://r4ds.had.co.nz/many-models.html#nested-data

Thanks, interesting but really scary. You get numeric output for local 
Moran's I and local Geary, but what will happen next?

All of these tests assume a correctly specified mean model. Without this 
assumption being met (absence of trends, absence of global spatial 
autocorrelation in the local test case, absence of missing right-hand-side 
variables and absence of incorrect functional forms for these, ...), a map 
pretending to invite "inference" of "hotspots" etc. - collections of 
observations with similar values of the local statistic, will most often 
be highly misleading. This isn't only about whether it is possible to do 
technically, but also about not making it simpler than reason would 
suggest. With a correct mean model, there may not be any (local) spatial 
autocorrelation left. See ?localmoran.sad and ?localmoran.exact, and their 
references, and McMillen (2003) (the McSpatial author). I won't even 
mention the need to adjust p.values for multiple comparisons ...

>
>> Extending this to adding weights to sf objects may be going too far. It
>> seems slightly forced to - say - attach a sparse matrix to an sf geometry
>> column as an attribute, or to add a neighbour nested list column when
>> writing subsetting protection is very far from obvious. Like time series
>> data (not time stamped observations, but real ordered time series),
>> spatial data have implicit order that is tidy in a different way than tidy
>> (graph dependencies between observation rows). Maybe mapped features with
>> attached attribute data are "tidier" than a table, because they would
>> preserve their relative positions?
>
> This seems like the crux of the problem. Are you saying that the orthogonal
> "tidyness" of the tidyverse is fundamentally different from the underlying
> relationships of spatial data, and therefore there's a limit to the
> usefulness of tidyverse approach to spatial analysis?

Right, this is central and open. Spatial objects can be tidy in the 
tidyverse sense - sf shows this can be done. The open question is whether 
this extends to the analysis of the data. From looking around, it seems 
that time series and graphs suffer from the same kinds of questions, and 
it will be interesting to see what stars turns up for spatio-temporal 
data:

https://github.com/edzer/stars

I see that stars wisely does not include trip/trajectory data structures, 
but analysis might involve reading the array-based representation of 
spatio-temporal environmental drivers for an ensemble of buffers around 
trajectories (say animal movement), then modelling.

Maybe: "everything should be as tidy as it can be but not tidier"?

Roger

>
> On Mon, May 15, 2017 at 5:17 AM Roger Bivand <Roger.Bivand at nhh.no> wrote:
>
>> On Sun, 14 May 2017, Tiernan Martin wrote:
>>
>>> You’re right about one thing: those of us introduced to the pipe in our
>>> formative years find it hard to escape (everything is better here inside
>>> the pipe - right?) ;)
>>
>> Pipes can be great if they don't involve extra inputs at a later stage in
>> the pipe, but as:
>>
>> https://github.com/thomasp85/tidygraph
>>
>> suggests, it isn't very easy, and may be forced.
>>
>>>
>>> Thank you for your very informative response. Before reading it, I had a
>>> vague inkling that matrices were part of the challenge of adapting these
>>> tools to work with simple features, but I also thought perhaps all this
>>> issue needed was a little nudge from a user like me. Your response made
>> it
>>> clear that there is a series of technical decisions that need to be made,
>>> and in order to do that I imagine some exploratory testing is in order.
>> Of
>>> course, it is possible that the data.frame + list-col foundation is just
>>> not well-suited to the needs of spatial weighting/testing tools, but I
>>> suspect that there is some valuable insight to be gained from exploring
>>> this possibility further. And as I stated in my first post, going
>> forward I
>>> expect more users will be interested in seeing this sort of integration
>>> happen.
>>
>> I think that there are two tasks - one to create neighbour objects from sf
>> objects, then another to see whether the costs of putting the spatial
>> weights into a data.frame (or even more challenging, a tibble) to test for
>> spatial autocorrelation or model spatial dependence. Should the weights be
>> squirreled away inside the object flowing down the pipe?
>>
>> sf_object %>% add_weights(...) -> sf_object_with_weights
>> moran.test(sf_object_with_weights, ...)
>> geary.test(sf_object_with_weights, ...)
>>
>> or
>>
>> sf_object %>% add_weights(...) %>% moran.test(sf_object_with_weights, ...)
>> -> output
>>
>> all in one (output is a htest object). It gets more complex in:
>>
>> lm.morantest(lm(formula, sf_object), sf_object_with_weights)
>>
>> as we need the weights and the lm output object.
>>
>>>
>>>> How should we go about this technically? spdep is on R-Forge and is
>> happy
>>>> there. Its present functionality has to be maintained as it stands, it
>>> has
>>>> too many reverse dependencies to break. Should it be re-written from
>>>> scratch (with more sparse matrices internally for computation)?
>>>> ...
>>>> We'd need proof of concept with realistically sized data sets (not yet
>> NY
>>>> taxis, but maybe later ...). spdep started as spweights, sptests and
>>>> spdep, and the first two got folded into the third when stable. If
>>> weights
>>>> are the first thing to go for, sfweights is where to go first (and port
>>>> the STRtree envelope intersections for contiguities). It could define
>> the
>>>> new classes, and tests and modelling would use them.
>>>
>>> Sounds to me like this package would need to be built from the ground up,
>>> possibly following a similar path to the `sp` development process as you
>>> mentioned. Maybe that presents a challenge with regards to resources
>> (i.e.,
>>> funding, time, etc.). Perhaps project this is a candidate for future
>>> proposals to the R Consortium or other funding sources. I hope that this
>>> post is useful in documenting user interest in seeing the `sf` protocol
>>> integrated into other spatial tools and workflows, and I look forward to
>>> hearing others' thoughts on the matter.
>>
>> Making spatial weights from sf objects is just expanding spdep, and is
>> feasible. There is no point looking for funding, all it needs is knowledge
>> of how things have been done in spdep. Contiguity, knn, distance,
>> graph-based neighbours are all feasible.
>>
>> Extending this to adding weights to sf objects may be going too far. It
>> seems slightly forced to - say - attach a sparse matrix to an sf geometry
>> column as an attribute, or to add a neighbour nested list column when
>> writing subsetting protection is very far from obvious. Like time series
>> data (not time stamped observations, but real ordered time series),
>> spatial data have implicit order that is tidy in a different way than tidy
>> (graph dependencies between observation rows). Maybe mapped features with
>> attached attribute data are "tidier" than a table, because they would
>> preserve their relative positions?
>>
>> Roger
>>
>>>
>>> Thanks again,
>>>
>>> Tiernan
>>>
>>>
>>>
>>>
>>>
>>> On Sat, May 13, 2017 at 1:39 PM Roger Bivand <Roger.Bivand at nhh.no>
>> wrote:
>>>
>>>> On Fri, 12 May 2017, Tiernan Martin wrote:
>>>>
>>>>> Is anyone thinking about creating an adaptation of the `spdep` package
>>>> that
>>>>> expects sf-class inputs and works well in a pipeline?
>>>>
>>>> I assume that "this is not a pipeline" is a joke for insiders (in the
>>>> pipe)?
>>>>
>>>> No, there is your issue on the sfr github repository that is relevant
>> for
>>>> contiguous neighbours, but not beyond that:
>>>>
>>>> https://github.com/edzer/sfr/issues/234
>>>>
>>>> An sf is a data.frame, and as such should "just work", like "Spatial"
>>>> objects have, in formula/data settings. The problem is (of course) the
>>>> weights matrix.
>>>>
>>>> Should it be a list column (each row has a list nesting two lists, first
>>>> indices, second non-zero weights), or remain separate as it has been for
>>>> 20 years, or become a column-oriented representation (Matrix package) -
>> a
>>>> nested list like a list column or a listw obeject is row-oriented. I had
>>>> started thinking about using a sparse column-oriented representation,
>> but
>>>> have not implemented functions accepting them instead of listw objects.
>>>>
>>>> I am very cautious about creating classes for data that were data.frame,
>>>> then sf, and then have the weights built-in. In the simple case it would
>>>> work, but all you have to do is re-order the rows and the link between
>> the
>>>> neighbour ids and row order breaks down; the same applies to subsetting.
>>>>
>>>> The problems to solve first are related the workflows, and easiest to
>> look
>>>> at in the univariate case (Moran, Geary, join-count, Mantel, G, ...) for
>>>> global and local tests. I think that all or almost all of the NSE verbs
>>>> will cause chaos (mutate, select, group) once weights have been created.
>>>> If there is a way to bind the neighour IDs to the input sf object rows,
>> a
>>>> list column might be possible, but we have to permit multiple such
>> columns
>>>> (Queen, Rook, ...), and ensure that subsetting and row-reordering keep
>> the
>>>> graph relations intact (and their modifications, like
>> row-standardisation)
>>>>
>>>> If you've seen any examples of how tidy relates to graphs (adjacency
>> lists
>>>> are like nb objects), we could learn from that.
>>>>
>>>> How should we go about this technically? spdep is on R-Forge and is
>> happy
>>>> there. Its present functionality has to be maintained as it stands, it
>> has
>>>> too many reverse dependencies to break. Should it be re-written from
>>>> scratch (with more sparse matrices internally for computation)?
>>>>
>>>> Your example creates weights on the fly for a local G* map, but has
>> n=100,
>>>> not say n=90000 (LA census blocks). Using the sf_ geos based methods
>> does
>>>> not use STRtrees, which cut the time needed to find contigious
>> neighbours
>>>> from days to seconds. We ought to pre-compute weights, but this messes
>> up
>>>> the flow of data, because a second lot of data (the weights) have to
>> enter
>>>> the pipe, and be matched with the row-ids.
>>>>
>>>> We'd need proof of concept with realistically sized data sets (not yet
>> NY
>>>> taxis, but maybe later ...). spdep started as spweights, sptests and
>>>> spdep, and the first two got folded into the third when stable. If
>> weights
>>>> are the first thing to go for, sfweights is where to go first (and port
>>>> the STRtree envelope intersections for contiguities). It could define
>> the
>>>> new classes, and tests and modelling would use them.
>>>>
>>>> Thanks for starting this discussion.
>>>>
>>>> Roger
>>>>
>>>>>
>>>>> I understand that there is skepticism about the wisdom of adopting the
>>>>> “tidyverse” principles throughout the R package ecosystem, and I share
>>>> the
>>>>> concern that an over-reliance on any single paradigm could reduce the
>>>>> resilience and diversity of the system as a whole.
>>>>>
>>>>> That said, I believe that the enthusiastic adoption of the `sf` package
>>>> and
>>>>> the package's connections with widely-used tidyverse packages like
>>>> `dplyr`
>>>>> and `ggplot2` may result in increased demand for sf-friendly spatial
>>>>> analysis tools. As an amateur who recently started using R as my
>> primary
>>>>> GIS tool, it seems like the tidyverse's preference for dataframes, S3
>>>>> objects, list columns, and pipeline workflows would be well-suited to
>> the
>>>>> field of spatial analysis. Are there some fundamental reasons why the
>>>>> `spdep` tools cannot (or should not) be adapted to the tidyverse
>>>> "dialect"?
>>>>>
>>>>> Let me put the question in the context of an actual analysis: in
>> February
>>>>> 2017, the pop culture infovis website The Pudding (
>> https://pudding.cool/
>>>> )
>>>>> published an analysis of regional preferences for Oscar-nominated films
>>>> in
>>>>> the US (https://pudding.cool/2017/02/oscars_so_mapped/). A few days
>> ago,
>>>>> the author posted a tutorial explaining the method of “regional
>>>> smoothing”
>>>>> used to create the article’s choropleths (
>>>>> https://pudding.cool/process/regional_smoothing/).
>>>>>
>>>>> The method relies on several `spdep` functions (
>>>>>
>>>>
>> https://github.com/polygraph-cool/smoothing_tutorial/blob/master/smoothing_tutorial.R
>>>> ).
>>>>> In the code below, I provide reprex with a smaller dataset included in
>>>> the
>>>>> `sf` package:
>>>>>
>>>>> library(sf)
>>>>> library(spdep)
>>>>>
>>>>> nc <- st_read(system.file("shape/nc.shp", package = "sf"))  # North
>>>>> Carolina counties
>>>>> nc_shp <- as(nc,'Spatial')
>>>>>
>>>>> coords <- coordinates(nc_shp)
>>>>> IDs<-row.names(as(nc_shp, "data.frame"))
>>>>>
>>>>> knn5 <- knn2nb(knearneigh(coords, k = 5), row.names = IDs)  # find the
>>>>> nearest neighbors for each county
>>>>> knn5 <- include.self(knn5)
>>>>>
>>>>> localGvalues <- localG(x = as.numeric(nc_shp at data$NWBIR74), listw =
>>>>> nb2listw(knn5, style = "B"), zero.policy = TRUE) # calculate the G
>> scores
>>>>> localGvalues <- round(localGvalues,3)
>>>>>
>>>>> nc_shp at data$LOCAL_G <- as.numeric(localGvalues)
>>>>>
>>>>> p1 <- spplot(nc_shp, c('NWBIR74'))
>>>>> p2 <- spplot(nc_shp, c('LOCAL_G'))
>>>>> plot(p1, split=c(1,1,2,2), more=TRUE)
>>>>> plot(p2, split=c(1,2,2,2), more=TRUE)
>>>>>
>>>>> Here’s what I imagine that would look like in a tidyverse pipeline
>>>> (please
>>>>> note that this code is for illustrative purposes and will not run):
>>>>>
>>>>> library(tidyverse)
>>>>> library(purrr)
>>>>> library(sf)
>>>>> library(sfdep) # this package doesn't exist (yet)
>>>>>
>>>>> nc <- st_read(system.file("shape/nc.shp", package = "sf"))
>>>>>
>>>>> nc_g <-
>>>>>  nc %>%
>>>>>  mutate(KNN = map(.x = geometry, ~ sfdep::st_knn(.x, k = 5,
>> include.self
>>>> =
>>>>> TRUE)),  # find the nearest neighbors for each county
>>>>>         NB_LIST = map(.x = KNN, ~ sfdep::st_nb_list(.x, style =
>> 'B')),  #
>>>>> make a list of the neighbors using the binary method
>>>>>         LOCAL_G = sfdep::st_localG(x = NWBIR74, listw = NB_LIST,
>>>>> zero.policy = TRUE),  # calculate the G scores
>>>>>         LOCAL_G = round(LOCAL_G,3))
>>>>>
>>>>> We can see that the (hypothetical) tidyverse version reduces the amount
>>>> of
>>>>> intermediate objects and wraps the creation of the G scores into a
>> single
>>>>> code chunk with clear steps.
>>>>>
>>>>> I'd be grateful to hear from the users and developers of the `spdep`
>> and
>>>>> `sf` packages about this topic!
>>>>>
>>>>> Tiernan Martin
>>>>>
>>>>>       [[alternative HTML version deleted]]
>>>>>
>>>>> _______________________________________________
>>>>> R-sig-Geo mailing list
>>>>> R-sig-Geo at r-project.org
>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>
>>>> --
>>>> Roger Bivand
>>>> Department of Economics, Norwegian School of Economics,
>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>> voice: +47 55 95 93 55 <+47%2055%2095%2093%2055>
>> <+47%2055%2095%2093%2055>; e-mail:
>>>> Roger.Bivand at nhh.no
>>>> Editor-in-Chief of The R Journal,
>> https://journal.r-project.org/index.html
>>>> http://orcid.org/0000-0003-2392-6140
>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>
>>
>> --
>> Roger Bivand
>> Department of Economics, Norwegian School of Economics,
>> Helleveien 30, N-5045 Bergen, Norway.
>> voice: +47 55 95 93 55 <+47%2055%2095%2093%2055>; e-mail:
>> Roger.Bivand at nhh.no
>> Editor-in-Chief of The R Journal, https://journal.r-project.org/index.html
>> http://orcid.org/0000-0003-2392-6140
>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>

-- 
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand at nhh.no
Editor-in-Chief of The R Journal, https://journal.r-project.org/index.html
http://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en