[R-sig-Geo] Spatial Autocorrelation Estimation Method

Roger Bivand Roger.Bivand using nhh.no
Thu Nov 7 10:13:33 CET 2019


On Thu, 7 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Many thanks for your help.
>
> I have an additional question:
>
> Is it possible to create a "separate" lw (nb2listw) (with different 
> rownumbers) from my data set? For now, I am taking my data set and 
> merging with the sf object polygon_nyc with the function 
> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create 
> a huge n x n matrix (depending on the size of my data set).
>
> Taking the polygon_nyc alone and turning it into a lw (weights list) 
> object gives only n = 177.
>
> Of course running
>
> spatialreg::lagsarlm(formula = model, data = listings_sample, 
> listw = polygon_nyc_lw, tol.solve = 1.0e-10)
>
> does not work ("Input data and weights have different dimensions").
>
> Is the only option to take my data set, merge it with my polygon_nyc (by 
> zipcode) and then create the weights list lw? Or is there another option?

I think we are getting more clarity. You do not know the location of the 
lettings beyond their zipcode. You do know the boundaries of the zipcode 
areas, and can create a neighbour object from these boundaries. You then 
want to treat all the lettings in a zipcode area i as neighbours, and 
additionally lettings in zipcode areas neighbouring i as neighbours of 
lettings in i. This is the data structure that motivated the 
spdep::nb2blocknb() function:

https://r-spatial.github.io/spdep/reference/nb2blocknb.html

Try running the examples to get a feel for what is going on.
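
An untested sketch of what that might look like with your objects
(polygon_nyc and listings from your code below; nb_zip, nb_listings and
lw_listings are just illustrative names, and every letting's zipcode is
assumed to match one of the polygon zipcodes):

library(spdep)

# Zipcode-level queen contiguity, using the zipcodes as region ids so that
# each letting's block membership can be matched against them
nb_zip <- poly2nb(polygon_nyc, queen = TRUE,
                  row.names = as.character(polygon_nyc$zipcode))

# All lettings in zipcode i become neighbours of each other, and also of
# lettings in zipcodes neighbouring i (no self-neighbours are added)
nb_listings <- nb2blocknb(nb_zip, ID = as.character(listings$zipcode))

lw_listings <- nb2listw(nb_listings, style = "W", zero.policy = TRUE)

# lw_listings now has the same dimension as listings, so it can be passed
# to spatialreg::lagsarlm() without merging the polygons into the data set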

I feel that most of the variability will vanish in the very large numbers 
of neighbours, over-smoothing the outcomes. If you do not have locations 
for the lettings themselves, I don't think you can make much progress.

You could try a linear mixed model (or gam with a spatially structured 
random effect) with a temporal and a spatial random effect. See the HSAR 
package, articles by Dong et al., and maybe 
https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither 
this nor Dong et al. handle spatio-temporal settings. MRF spatial random 
effects at the zipcode level might be a way forward, together with an IID 
random effect at the same level (equivalent to self-neighbours).
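
An untested sketch of the MRF + IID idea with mgcv (not code from this
thread; bedrooms and date_num are placeholder covariates, zipcode2 is a
copy of the zipcode factor, and the contiguity list must be the plain
queen list without include.self()):

library(mgcv)

# The mrf smooth needs a neighbour list named by the factor levels, with
# no self-neighbours in it
nb_mrf <- polygon_nyc_nb
names(nb_mrf) <- as.character(polygon_nyc$zipcode)

listings$zipcode  <- factor(listings$zipcode, levels = names(nb_mrf))
listings$zipcode2 <- listings$zipcode  # copy, so the IID term is a separate smooth

# With ~2 million rows, bam() rather than gam() is the realistic choice
fit <- bam(log_price ~ bedrooms + s(date_num, bs = "cr") +
             s(zipcode,  bs = "mrf", xt = list(nb = nb_mrf)) +  # structured spatial
             s(zipcode2, bs = "re"),                            # IID zipcode effect
           data = listings, method = "fREML", discrete = TRUE)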

Hope this helps,

Roger

>
> Best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <Roger.Bivand using nhh.no>
> Sent: Wednesday, November 6, 2019 15:07
> To: Robert R
> Cc: r-sig-geo using r-project.org
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Tue, 5 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>> plain text.
>>
>> I will give a better context for my desired outcome.
>>
>> I am taking Airbnb's listings information for New York City available
>> on: http://insideairbnb.com/get-the-data.html
>>
>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>> Listings/ folder. When importing all these 54 files into one single data
>> set, I create a new "date_compiled" variable/column.
>>
>> In total, after the data cleansing process, I have a little more than 2
>> million observations.
>
> You have repeat lettings for some, but not all properties. So this is at
> best a very unbalanced panel. For those properties with repeats, you may
> see temporal movement (trend/seasonal).
>
> I suggest (strongly) taking a single borough or even zipcode with some
> hundreds of properties, and working from there. Do not include the
> observation as its own neighbour; perhaps identify repeats and handle them
> specially (create or use a property ID). Unbalanced panels may also create
> a selection bias issue (why are some properties only listed sometimes?).
>
> So, although promising, this isn't simple, and getting to a hedonic model
> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
> necessarily trust OLS output either, partly because of the repeat property
> issue.
>
> Roger
>
>>
>> I created 54 timedummy variables for each time period available.
>>
>> I want to estimate using a hedonic spatial timedummy model the impact of
>> a variety of characteristics which potentially determine the daily rate
>> on Airbnb listings through time in New York City (e.g. characteristics
>> of the listing such as number of bedrooms, whether the host is professional,
>> proximity to downtown (New York City Hall) and nearest subway station
>> from the listing, income per capita, etc.).
>>
>> My dependent variable is price (log price, common in the related
>> literature for hedonic prices).
>>
>> The OLS model is done.
>>
>> For the spatial model, I am assuming that hosts, when deciding the
>> pricing of their listings, take not only into account its structural and
>> location characteristics, but also the prices charged by near listings
>> with similar characteristics - spatial autocorrelation is then present,
>> at least spatial dependence is present in the dependent variable.
>>
>> As I wrote in my previous post, I was planning to include each observation
>> as its own neighbour (self-neighbours).
>>
>> Parts of my code can be found below:
>>
>> ########
>>
>> ## packages
>>
>> packages_install <- function(packages){
>>   new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>   if (length(new.packages))
>>     install.packages(new.packages, dependencies = TRUE)
>>   sapply(packages, require, character.only = TRUE)
>> }
>>
>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>> packages_install(packages_required)
>>
>> # Working directory
>> setwd("C:/Users/User/R")
>>
>>
>>
>> ## shapefile_us
>>
>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>
>> # Columns removal
>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>
>> # Column rename: ZCTA5CE10
>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>
>> # Column class change: zipcode
>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>
>>
>>
>> ## polygon_nyc
>>
>> # Zip code not available in shapefile: 11695
>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>
>>
>>
>> ## weight_matrix
>>
>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>
>> # Include neighbour itself as a neighbour
>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>
>> # Weights to each neighboring polygon
>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>
>>
>>
>> ## listings
>>
>> # Data import
>> files <- list.files(path="Listings/", pattern=".csv", full.names=TRUE)
>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>> listings <- mapply(cbind, listings, date_compiled = names(listings), SIMPLIFY = FALSE)
>> listings <- listings %>% bind_rows
>>
>> # Characters removal
>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled)
>> listings$price <- gsub("\\$", "", listings$price)
>> listings$price <- gsub(",", "", listings$price)
>>
>>
>>
>> ## timedummy
>>
>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>> timedummy <- gsub("-", "_", timedummy)
>>
>>
>>
>> ## OLS regression
>>
>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>
>> ########
>>
>> Some of my id's repeat in multiple time periods.
>>
>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>
>> Now I want to apply the hedonic model with the timedummy variables.
>>
>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>
>> Again, thank you very much for the help provided until now.
>>
>> Best regards,
>> Robert
>>
>> ________________________________________
>> From: Roger Bivand <Roger.Bivand using nhh.no>
>> Sent: Tuesday, November 5, 2019 15:30
>> To: Robert R
>> Cc: r-sig-geo using r-project.org
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Tue, 5 Nov 2019, Robert R wrote:
>>
>>> I have a large pooled cross-section data set. I would like to
>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>> for now that spatial dependence is present in both the dependent
>>> variable and the error term. My data set is over a period of 4 years,
>>> monthly data (54 periods). To this end, I've created a time dummy
>>> variable for each time period. I also created a weight matrix using the
>>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>>> to estimate my model, which involves a really big data set. Basically,
>>> my model is as follows: y = γD + ρW1y + Xβ + λW2u + ε
>>>
>>> My questions are:
>>>
>>> 1) My spatial weight matrix for the whole data set will probably be an
>>> enormous matrix with submatrices for each time period. I don't think it
>>> would be possible to calculate this. What I would like to know is a way
>>> to estimate each time dummy/period separately (to compare different
>>> periods alone). How can I do it?
>>>
>>> 2) Which package should I use: spdep or splm?
>>>
>>> Thank you and best regards,
>>> Robert
>>
>> Please do not post HTML, only plain text. Almost certainly your model
>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>> untried). What is your cross-sectional size? Using sparse Kronecker
>> products, the "enormous" matrix may not be very big. Does it make any
>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>> of the covariates time-varying? Please provide motivation and use area,
>> preferably with affiliation (your email and user name are not
>> informative) - this feels like a real estate problem, probably wrongly
>> specified. You should use splm if time makes sense in your case, but if it
>> really doesn't, simplify your approach, as much of the data will be
>> subject to very large temporal autocorrelation.
>>
>> If this is a continuation of your previous question about using
>> self-neighbours, be aware that you should not use self-neighbours in
>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>
>> Roger
>>
>>>
>>>       [[alternative HTML version deleted]]
>>>
>>
>> --
>> Roger Bivand
>> Department of Economics, Norwegian School of Economics,
>> Helleveien 30, N-5045 Bergen, Norway.
>> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
>> https://orcid.org/0000-0003-2392-6140
>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>
>
> --
> Roger Bivand
> Department of Economics, Norwegian School of Economics,
> Helleveien 30, N-5045 Bergen, Norway.
> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
> https://orcid.org/0000-0003-2392-6140
> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>

-- 
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

