[R-sig-Geo] Spatial Autocorrelation Estimation Method

Roger Bivand Roger.Bivand at nhh.no
Wed Nov 6 15:07:59 CET 2019


On Tue, 5 Nov 2019, Robert R wrote:

> Dear Roger,
>
> Thank you for your reply. I disabled HTML; my e-mails should now be in 
> plain text.
>
> I will give a better context for my desired outcome.
>
> I am taking Airbnb's listings information for New York City available 
> on: http://insideairbnb.com/get-the-data.html
>
> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09) 
> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a 
> Listings/ folder. When importing all these 54 files into one single data 
> set, I create a new "date_compiled" variable/column.
>
> In total, after the data cleansing process, I have a little more than 2 
> million observations.

You have repeat lettings for some, but not all properties. So this is at 
best a very unbalanced panel. For those properties with repeats, you may 
see temporal movement (trend/seasonal).

I suggest (strongly) taking a single borough or even zipcode with some 
hundreds of properties, and working from there. Do not include the 
observation as its own neighbour; perhaps identify repeats and handle them 
specially (create or use a property ID). Unbalanced panels may also create 
a selection bias issue (why are some properties only listed sometimes?).
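
A minimal sketch of that first step (assuming the cleaned listings data 
frame has "id" and "zipcode" columns; "10025" is only an example zip code):

library(dplyr)

# How unbalanced is the panel? Distribution of appearances per property id
table(table(listings$id))

# Work on a single zip code first
listings_sub <- listings %>% filter(zipcode == "10025")

# Keep an explicit property identifier so repeats can be handled specially
listings_sub <- listings_sub %>% mutate(property_id = as.character(id))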

So although this is promising, it isn't simple, and getting to a hedonic 
model may be hard, but not (just) because of spatial autocorrelation. I 
wouldn't necessarily trust the OLS output either, partly because of the 
repeat-property issue.
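
One quick check before going further is Moran's I on the OLS residuals, 
sketched here assuming ols_fit is an lm() fit on the single-zip-code 
subset and lw_model is a listw object whose units match those observations 
one-to-one:

library(spdep)

# Moran's I test for spatial autocorrelation in the OLS residuals
lm.morantest(ols_fit, listw = lw_model, zero.policy = TRUE)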

Roger

>
> I created a time dummy variable for each of the 54 time periods available.
>
> I want to estimate, using a hedonic spatial timedummy model, the impact 
> of a variety of characteristics which potentially determine the daily 
> rate of Airbnb listings through time in New York City (e.g. 
> characteristics of the listing such as the number of bedrooms, whether 
> the host is professional, proximity to downtown (New York City Hall) and 
> to the nearest subway station, income per capita, etc.).
>
> My dependent variable is price (log price, common in the related 
> literature for hedonic prices).
>
> The OLS model is done.
>
> For the spatial model, I am assuming that hosts, when deciding the 
> pricing of their listings, take into account not only their structural 
> and location characteristics, but also the prices charged by nearby 
> listings with similar characteristics - spatial autocorrelation is then 
> present, at least as spatial dependence in the dependent variable.
>
> As I wrote in my previous post, I was willing to include each observation 
> as its own neighbour.
>
> Parts of my code can be found below:
>
> ########
>
> ## packages
>
> packages_install <- function(packages){
>   new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>   if (length(new.packages))
>     install.packages(new.packages, dependencies = TRUE)
>   sapply(packages, require, character.only = TRUE)
> }
>
> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
> packages_install(packages_required)
>
> # Working directory
> setwd("C:/Users/User/R")
>
>
>
> ## shapefile_us
>
> # Shapefile zips import and Coordinate Reference System (CRS) transformation
> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>
> # Columns removal
> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>
> # Column rename: ZCTA5CE10
> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>
> # Column class change: zipcode
> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>
>
>
> ## polygon_nyc
>
> # Zip code not available in shapefile: 11695
> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>
>
>
> ## weight_matrix
>
> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>
> # Include neighbour itself as a neighbour
> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>
> # Weights to each neighboring polygon
> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
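
For the modelling weights themselves, a version without self-neighbours 
(they are only needed for the local G_i^* statistic mentioned further 
down) could be sketched from the same objects:

# Queen contiguity neighbours, no self-neighbours, for use in model fitting
polygon_nyc_nb_model <- poly2nb(polygon_nyc, queen = TRUE)
lw_model <- nb2listw(polygon_nyc_nb_model, style = "W", zero.policy = TRUE)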
>
>
>
> ## listings
>
> # Data import
> files <- list.files(path="Listings/", pattern="\\.csv$", full.names=TRUE)
> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
> listings <- mapply(cbind, listings, date_compiled = names(listings))
> listings <- listings %>% bind_rows
>
> # Characters removal
> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
> listings$date_compiled <- gsub(".csv", "", listings$date_compiled)
> listings$price <- gsub("\\$", "", listings$price)
> listings$price <- gsub(",", "", listings$price)
>
>
>
> ## timedummy
>
> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
> timedummy <- paste(timedummy, sep = "", collapse = " + ")
> timedummy <- gsub("-", "_", timedummy)
>
>
>
> ## OLS regression
>
> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>
> ########
>
> Some of my ids repeat in multiple time periods.
>
> I use NYC's zip codes to left-join my data with zip-code-specific neighborhood characteristics, such as income per capita, etc.
>
> Now I want to apply the hedonic model with the timedummy variables.
>
> Do you know how to proceed? 1) Which package should I use (spdep or splm)? 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
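
On 2), the weight matrix is built on polygon_nyc, so the listings only 
need the zip code as a join key; on 1), splm is the panel counterpart of 
spdep's cross-sectional tools. A rough sketch under strong assumptions 
(the listings are matched or aggregated so that lw_model lines up 
one-to-one with the cross-sectional units, the panel is made balanced, and 
x1 + x2 stands in for the covariates):

library(splm)

# Join zip-code level attributes to the listings by the zipcode key
listings <- dplyr::left_join(listings, sf::st_drop_geometry(polygon_nyc),
                             by = "zipcode")

# Pooled spatial lag specification; spatial.error = "b" or "kkp" would add
# a spatial error process instead of, or alongside, the lag
fm <- log_price ~ x1 + x2
fit <- spml(fm, data = listings, index = c("id", "date_compiled"),
            listw = lw_model, model = "pooling",
            lag = TRUE, spatial.error = "none")
summary(fit)

splm generally expects a balanced panel, which is exactly where the 
repeat-listing issue above bites.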
>
> Again, thank you very much for the help provided until now.
>
> Best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <Roger.Bivand using nhh.no>
> Sent: Tuesday, November 5, 2019 15:30
> To: Robert R
> Cc: r-sig-geo using r-project.org
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> On Tue, 5 Nov 2019, Robert R wrote:
>
>> I have a large pooled cross-section data set. I would like to
>> estimate/regress using spatial autocorrelation methods. I am assuming
>> for now that spatial dependence is present in both the dependent
>> variable and the error term. My data set is over a period of 4 years,
>> monthly data (54 periods). To this end, I've created a time dummy
>> variable for each time period. I also created a weight matrix using the
>> functions "poly2nb" and "nb2listw". Now I am trying to figure out a way
>> to estimate my model, which involves a really big data set. Basically,
>> my model is as follows: y = γD + ρW1y + Xβ + λW2u + ε. My questions
>> are: 1) My spatial weight matrix for the whole data set will probably
>> be an enormous matrix with submatrices for each time period itself. I
>> don't think it would be possible to calculate this. What I would like
>> to know is a way to estimate each time dummy/period separately (to
>> compare different periods alone). How to do it? 2) Which package should
>> I use: spdep or splm? Thank you and best regards, Robert
>
> Please do not post HTML, only plain text. Almost certainly your model
> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
> untried). What is your cross-sectional size? Using sparse Kronecker
> products, the "enormous" matrix may not be very big. Does it make any
> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
> of the covariates time-varying? Please provide motivation and use area,
> preferably with affiliation (your email and user name are not
> informative) - this feels like a real estate problem, probably wrongly
> specified. You should use splm if time makes sense in your case, but if it
> really doesn't, simplify your approach, as much of the data will be
> subject to very large temporal autocorrelation.
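
To make the Kronecker point concrete, a small sketch (assuming W is the 
N x N zip-code weights matrix, lw_model as above, and 54 time periods):

library(Matrix)
library(spdep)

# Sparse N x N weights from the listw object, then a block-diagonal
# NT x NT version; only the non-zero weights are stored
W <- as(lw_model, "CsparseMatrix")
W_NT <- kronecker(Diagonal(54), W)
object.size(W_NT)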
>
> If this is a continuation of your previous question about using
> self-neighbours, be aware that you should not use self-neighbours in
> modelling, they are only useful for the Getis-Ord local G_i^* measure.
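
For completeness, self-neighbours and the local G_i^* fit together like 
this in spdep (a sketch, with x a numeric variable observed on the 
zip-code polygons, e.g. a median listing price per polygon):

# Binary weights including each polygon as its own neighbour give G_i^*
nb_star <- include.self(polygon_nyc_nb_model)
G_star <- localG(x, nb2listw(nb_star, style = "B"))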
>
> Roger
>
>

-- 
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en

