[R-sig-Geo] Spatial Autocorrelation Estimation Method
Robert R
usercatch at outlook.com
Sun Dec 8 21:55:42 CET 2019
Dear Roger,
Thank you for your answer. Regarding the sparse matrix, you're right - I tested creating one, as follows:
#####
listings_nb <- listings %>% spdep::poly2nb(queen = TRUE) %>% spdep::include.self()
# is.symmetric.nb(listings_nb)
nb_B <- spdep::nb2listw(neighbours = listings_nb, style="B", zero.policy = FALSE)
B <- as(nb_B , "CsparseMatrix")
all(B == t(B))
nb_B1 <- spdep::mat2listw(as(B, "dgTMatrix"))
format(object.size(nb_B), units = "Mb")
# 85.6 Mb
format(object.size(nb_B1), units = "Mb")
# 85.6 Mb
#####
Both objects have the same size, unlike in this example: https://cran.r-project.org/web/packages/spdep/vignettes/nb_igraph.html
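For comparison, the size of the sparse matrix itself can also be checked (a quick follow-up sketch using the objects above):
#####
format(object.size(B), units = "Mb")
Matrix::nnzero(B)   # number of stored (non-zero) entries
#####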
To sum up the data that I have: a monthly "snapshot" of Airbnb listings/ads for NYC, incl. lat, lon, price (per night), id, and some characteristics of the listing/ad such as number of rooms, bedrooms, guests included, etc., over a period of 54 months. The data was taken from: http://insideairbnb.com/get-the-data.html (listings.csv.gz)
For the OLS, I used a pooled OLS with time-dummy fixed effects (date_compiled, the date on which each "snapshot" of NYC Airbnb listings was compiled), because many of my observations (listing ids) do not repeat across many periods. Also, many listings changed room_type at least once during the period analyzed (the 3 room_type categories are: entire home/apt, private room, shared room).
I am now trying a random-intercepts three-level multilevel model, where _id_ (level 1) is nested within _zipcode_ (level 2), which in turn is nested within _borough_ (level 3). So the grouping is id:(zipcode:borough).
lme4::lmer(log_price ~ factor(room_type) + bedrooms + bathrooms + guests_included + minimum_nights + distance_downtown + distance_subway + number_of_reviews + review_scores_cleanliness + professional_host + host_is_superhost + is_business_travel_ready + offense_misdemeanor + offense_felony + income_per_capita + factor(date_compiled) + (1 | borough / zipcode / id), data = listings)
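For reference, lme4 expands (1 | borough / zipcode / id) into explicit nested terms; a sketch of an equivalent specification, declaring the factors up front (same listings data assumed):
#####
listings$room_type     <- factor(listings$room_type)
listings$date_compiled <- factor(listings$date_compiled)

# (1 | borough / zipcode / id) is equivalent to the explicit nesting below
lme4::lmer(log_price ~ room_type + bedrooms + bathrooms + guests_included +
             minimum_nights + distance_downtown + distance_subway +
             number_of_reviews + review_scores_cleanliness + professional_host +
             host_is_superhost + is_business_travel_ready + offense_misdemeanor +
             offense_felony + income_per_capita + date_compiled +
             (1 | borough) + (1 | borough:zipcode) + (1 | borough:zipcode:id),
           data = listings)
#####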
Roger, do you think it is okay to use factor(room_type) (for the 3 room types) and factor(date_compiled) for the dates when the NYC Airbnb listings/ads were extracted?
Thank you and best regards,
Robert
________________________________________
From: Roger Bivand <Roger.Bivand using nhh.no>
Sent: Wednesday, December 4, 2019 09:07
To: Robert R
Cc: r-sig-geo using r-project.org
Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
On Wed, 4 Dec 2019, Robert R wrote:
> Dear Roger,
>
> Again, thank you for your answer. What do you mean by "zip code random
> effect"? You mean I should use in plm the model "random"?
>
> regression_re <- plm(formula = model, data = listings, model = "random",
> index = c("id", "date_compiled"))
No, obviously not, your data are not a balanced panel. I mean a multilevel
model, where the <200 zip codes cluster the data, and where a zip code
level IID RE will almost certainly do a better job than dummies. An
MRF/ICAR RE might be an extension.
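A minimal sketch of that kind of extension with mgcv (assuming a zipcode factor in the data and the polygon_nyc zip code polygons used elsewhere in this thread):

library(mgcv)
library(spdep)

# Zip code contiguity graph; names must match the levels of the zipcode factor
nb <- poly2nb(polygon_nyc, row.names = polygon_nyc$zipcode, queen = TRUE)
names(nb) <- attr(nb, "region.id")
listings$zipcode <- factor(listings$zipcode, levels = names(nb))

# bs = "re" would give an IID zip code RE; bs = "mrf" adds the ICAR/MRF structure
m_mrf <- gam(log_price ~ bedrooms + guests_included + minimum_nights +
               s(zipcode, bs = "mrf", xt = list(nb = nb)),
             data = listings, method = "REML")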
>
> And any other methodology in dealing with large weight matrices in
> spatialreg::lagsarlm?
Please refer to Bivand et al. (2013) referred to in the package. Probably
the weights would need to be symmetric and very sparse.
I still think that you should focus on a small subset of the data and on
improving the signal-to-noise ratio before trying to scale up.
Roger
>
> Thank you and best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <Roger.Bivand using nhh.no>
> Sent: Wednesday, November 27, 2019 13:53
> To: Robert R
> Cc: r-sig-geo using r-project.org
> Subject: SV: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Yes, this is expected, since the neighbours within a single zip code block form a dense matrix, and there will be multiple such matrices. (15000^2)*8 is 1.8e+09 bytes, so such a dense matrix will max out your RAM. There is no way to look at block neighbours in that format without subsetting your data (think train/test); use a zip code random effect instead. I would certainly drop all attempts to examine spatial dependency until you get an aspatial multilevel hedonic model working.
>
> Roger
>
> --
> Roger Bivand
> Norwegian School of Economics
> Helleveien 30, 5045 Bergen, Norway
> Roger.Bivand using nhh.no
>
>
> ________________________________________
> Fra: Robert R <usercatch using outlook.com>
> Sendt: tirsdag 26. november 2019 21.04
> Til: Roger Bivand
> Kopi: r-sig-geo using r-project.org
> Emne: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Dear Roger,
>
> Thank you for your e-mail. Actually there is less noise than it seems. Rental prices are daily rental prices, and I have an extract of all Airbnb listings' daily prices once a month for a period of 4 years. Each listing's information contains the lat, lon, number of bedrooms, category (entire home/apt, shared room or private room), etc.
>
> One question regarding the spdep::nb2blocknb function: it runs super fast with up to n = 1000, and always crashes my R session with n = 15000 or so. Is there an alternative to solve this problem?
>
> Thank you and best regards,
> Robert
>
> ________________________________________
> From: Roger Bivand <Roger.Bivand using nhh.no>
> Sent: Tuesday, November 26, 2019 20:48
> To: Robert R
> Cc: r-sig-geo using r-project.org
> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>
> Sorry for late reply, am indisposed and unable to help further. I feel
> that there is so much noise in your data (differences in offers, rental
> lengths, repeats or not, etc.), that you will certainly have to subset
> vigorously first to isolate response cases that are comparable. What you
> are trying to disentangle are the hedonic components in the bundle where
> you just have price as response, but lots of other bundle characteristics
> on the right hand side (days, etc.). I feel you'd need to try to get to a
> response index of price per day per rental area or some such. I'd
> certainly advise examining responses to a specific driver (major concert
> or sports event) to get a feel for how the market responds, and return to
> spatial hedonic after finding an approach that gives reasonable aspatial
> outcomes.
>
> Roger
>
> On Sun, 17 Nov 2019, Robert R wrote:
>
>> Dear Roger,
>>
>> Thank you for your message and sorry for my late answer.
>>
>> Regarding the number of listings (lettings) for my data set (2,216,642 observations), each listing contains an individual id:
>>
>> unique ids: 180,004
>> time periods: 54 (2015-01 to 2019-09)
>> number of ids that appear only once: 28,486 (of 180,004 ids) (15.8%)
>> number of ids that appear/repeat 2-10 times: 82,641 (of 180,004 ids) (45.9%)
>> number of ids that appear/repeat 11-30 times: 46,465 (of 180,004 ids) (25.8%)
>> number of ids that appear/repeat 31-54 times: 22,412 (of 180,004 ids) (12.5%)
>>
>> Important to note is that hosts can change the room_type (between entire home/apt, private room and shared room) while keeping the same listing id number. In my data, the number of unique ids that at some point changed the room_type is 7,204.
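>>
>> The repeat counts above can be reproduced with a small dplyr summary (a
>> sketch, assuming the full listings data frame; drop the geometry first if it
>> is an sf object):
>>
>> library(dplyr)
>> listings %>%
>>   count(id, name = "n_periods") %>%
>>   mutate(bin = cut(n_periods, breaks = c(0, 1, 10, 30, 54),
>>                    labels = c("once", "2-10", "11-30", "31-54"))) %>%
>>   count(bin)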
>>
>> --
>>
>> For the OLS model, I was using only a fixed effect model, where each time period (date_compiled) (54 in total) is a time dummy.
>>
>> plm::plm(formula = model, data = listings, model = "pooling", index = c("id", "date_compiled"))
>>
>>
>> --
>> Osland et al. (2016) (https://doi.org/10.1111/jors.12281) use a spatial fixed effects (SFE) hedonic model, where each defined neighborhood zone in the study area is represented by dummy variables.
>>
>> Dong et al. (2015) (https://doi.org/10.1111/gean.12049) outline four model specifications to accommodate geographically hierarchical data structures: (1) groupwise W and fixed regional effects; (2) groupwise W and random regional effects; (3) proximity-based W and fixed regional effects; and (4) proximity-based W and random regional effects.
>> --
>>
>> I created a new column/variable containing the borough where the zipcode is found (Manhattan, Brooklyn, Queens, Bronx, Staten Island).
>>
>> If I understood it right, the (two-level) Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) considers spatial relations at the (lower) individual level (geographical coordinates - in my case, the listing location) and at the (higher) group level (territorial units - in my case, zipcodes).
>>
>> According to Bivand et al. (2017): "(...) W is a spatial weights matrix. The HSAR model may also be estimated without this component.". So, in this case I estimate the Hierarchical Spatial Simultaneous Autoregressive Model (HSAR) on a "one-level" basis only, i.e., at the higher level.
>>
>> HSAR::hsar(model, data = listings, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>
>> (Where the "model" formula contains the 54 time dummy variables)
>>
>> Do you think I can proceed with this model? I was able to calculate it.
>>
>> If I remove all observations/rows with NAs in any of the chosen variables, 884,183 observations remain. If I were to create a W matrix for HSAR::hsar, I would have a gigantic 884,183 by 884,183 matrix. This is the reason why I set W = NULL.
>>
>>
>> Thank you and best regards
>>
>> ________________________________________
>> From: Roger Bivand <Roger.Bivand using nhh.no>
>> Sent: Monday, November 11, 2019 11:31
>> To: Robert R
>> Cc: r-sig-geo using r-project.org
>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>
>> On Sun, 10 Nov 2019, Robert R wrote:
>>
>>> Dear Roger,
>>>
>>> Again, thank you for your answer. I read the material provided and
>>> decided that Hierarchical Spatial Autoregressive (HSAR) could be the
>>> right model for me.
>>>
>>> I indeed have the precise latitude and longitude information for all my
>>> listings for NYC.
>>>
>>> I created a stratified sample (group = zipcode) with 22172 (1%) of my
>>> observations called listings_sample and tried to replicate the hsar
>>> model, please see below.
>>>
>>> For now W = NULL, because otherwise I would have a 22172 x 22172 matrix.
>>
>> Unless you know definitely that you want to relate the response to its
>> lagged value, you do not need this. Do note that the matrix is very
>> sparse, so could be fitted without difficulty with ML in a cross-sectional
>> model.
>>
>>>
>>> You recommended then to introduce a Markov random field (MRF) random
>>> effect (RE) at the zipcode level, but I did not understand it so well.
>>> Could you elaborate a little more?
>>>
>>
>> Did you read the development in
>> https://doi.org/10.1016/j.spasta.2017.01.002? It is explained there, and
>> includes code for fitting the Beijing housing parcels data set from HSAR
>> with many other packages (MCMC, INLA, hglm, etc.). I guess that you should
>> try to create a model that works on a single borough, using the zipcodes
>> in that borough as a proxy for unobserved neighbourhood effects. Try for
>> example using lme4::lmer() with only a zipcode IID random effect, see if
>> the hedonic estimates are similar to lm(), and leave adding an MRF RE
>> (with for example mgcv::gam() or hglm::hglm()) until you have a working
>> testbed. Then advance step-by-step from there.
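>>
>> A minimal sketch of that comparison (assuming a single-borough subset named
>> 'sub' of the listings data and a deliberately short formula 'f0'):
>>
>> library(lme4)
>> f0 <- log_price ~ bedrooms + guests_included + minimum_nights
>> m_lm   <- lm(f0, data = sub)
>> m_lmer <- lmer(update(f0, . ~ . + (1 | zipcode)), data = sub)
>> cbind(lm = coef(m_lm), lmer = fixef(m_lmer))  # hedonic estimates side by side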
>>
>> You still have not said how many repeat lettings you see - it will affect
>> the way you specify your model.
>>
>> Roger
>>
>>> ##############
>>> library(spdep)
>>> library(HSAR)
>>> library(dplyr)
>>> library(splitstackshape)
>>>
>>>
>>> # Stratified sample per zipcode (size = 1%)
>>> listings_sample <- splitstackshape::stratified(indt = listings, group = "zipcode", size = 0.01)
>>>
>>> # Removing zipcodes from polygon_nyc which are not observable in listings_sample
>>> polygon_nyc_listings <- polygon_nyc %>% filter(zipcode %in% c(unique(as.character(listings_sample$zipcode))))
>>>
>>>
>>> ## Random effect matrix (N by J)
>>>
>>> # N: 22172
>>> # J: 154
>>>
>>> # Arrange listings_sample by zipcode (ascending)
>>> listings_sample <- listings_sample %>% arrange(zipcode)
>>>
>>> # Count number of listings per zipcode
>>> MM <- listings_sample %>% st_drop_geometry() %>% group_by(zipcode) %>% summarise(count = n()) %>% as.data.frame()
>>> # sum(MM$count)
>>>
>>> # N by J nulled matrix creation
>>> Delta <- matrix(data = 0, nrow = nrow(listings_sample), ncol = dim(MM)[1])
>>>
>>> # Group index (1..J) for each observation, repeated according to the zipcode counts
>>> Uid <- rep(c(1:dim(MM)[1]), MM[,2])
>>>
>>> for(i in 1:dim(MM)[1]) {
>>> Delta[Uid==i,i] <- 1
>>> }
>>> rm(i)
>>>
>>> Delta <- as(Delta,"dgCMatrix")
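>>>
>>> # A sketch of an equivalent one-step sparse construction (avoids allocating
>>> # the dense N by J matrix first; assumes the same listings_sample as above):
>>> Delta <- Matrix::sparse.model.matrix(~ 0 + factor(zipcode), data = listings_sample)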
>>>
>>>
>>> ## Higher-level spatial weights matrix or neighbourhood matrix (J by J)
>>>
>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>> polygon_nyc_nb <- poly2nb(polygon_nyc_listings, row.names = polygon_nyc_listings$zipcode, queen = TRUE)
>>>
>>> # Include neighbour itself as a neighbour
>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>
>>> # Spatial weights matrix for nb
>>> polygon_nyc_nb_matrix <- nb2mat(neighbours = polygon_nyc_nb, style = "W", zero.policy = NULL)
>>> M <- as(polygon_nyc_nb_matrix,"dgCMatrix")
>>>
>>>
>>> ## Fit HSAR SAR upper level random effect
>>> model <- as.formula(log_price ~ guests_included + minimum_nights)
>>>
>>> betas = coef(lm(formula = model, data = listings_sample))
>>> pars = list(rho = 0.5, lambda = 0.5, sigma2e = 2.0, sigma2u = 2.0, betas = betas)
>>>
>>> m_hsar <- hsar(model, data = listings_sample, W = NULL, M = M, Delta = Delta, burnin = 5000, Nsim = 10000, thinning = 1, parameters.start = pars)
>>>
>>> ##############
>>>
>>> Thank you and best regards
>>> Robert
>>>
>>> ________________________________________
>>> From: Roger Bivand <Roger.Bivand using nhh.no>
>>> Sent: Friday, November 8, 2019 13:29
>>> To: Robert R
>>> Cc: r-sig-geo using r-project.org
>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>
>>> On Fri, 8 Nov 2019, Robert R wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Thank you for your answer.
>>>>
>>>> I successfully used the function nb2blocknb() for a smaller dataset.
>>>>
>>>> But for a dataset of over 2 million observations, I get the following
>>>> error: "Error: cannot allocate vector of size 840 Kb".
>>>
>>> I don't think the observations are helpful. If you have repeat lets in the
>>> same property in a given month, you need to handle that anyway. I'd go for
>>> making the modelling exercise work (we agree that this is not panel data,
>>> right?) on a small subset first. I would further argue that you need a
>>> multi-level approach rather than spdep::nb2blocknb(), with a zipcode IID
>>> RE. You could very well take (stratified) samples per zipcode to represent
>>> your data. Once that works, introduce an MRF RE at the zipcode level,
>>> where you do know relative position. Using SARAR is going to be a waste of
>>> time unless you can geocode the letting addresses. A multi-level approach
>>> will work. Having big data in your case with no useful location
>>> information per observation is just adding noise and over-smoothing, I'm
>>> afraid. The approach used in https://doi.org/10.1016/j.spasta.2017.01.002
>>> will work, also when you sample the within zipcode lets, given a split
>>> into training and test sets, and making CV possible.
>>>
>>> Roger
>>>
>>>>
>>>> I am expecting that at least 500,000 observations will be dropped due
>>>> to the lack of values for the chosen variables in the regression model, so
>>>> probably I will filter and remove the observations/rows that will not be
>>>> used anyway - do you know if there is any package that does this
>>>> automatically, given the variables/columns chosen by me?
>>>>
>>>> Or would you recommend me another approach to avoid the above mentioned
>>>> error?
>>>>
>>>> Thank you and best regards,
>>>> Robert
>>>>
>>>> ________________________________________
>>>> From: Roger Bivand <Roger.Bivand using nhh.no>
>>>> Sent: Thursday, November 7, 2019 10:13
>>>> To: Robert R
>>>> Cc: r-sig-geo using r-project.org
>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>
>>>> On Thu, 7 Nov 2019, Robert R wrote:
>>>>
>>>>> Dear Roger,
>>>>>
>>>>> Many thanks for your help.
>>>>>
>>>>> I have an additional question:
>>>>>
>>>>> Is it possible to create a "separate" lw (nb2listw) (with different
>>>>> rownumbers) from my data set? For now, I am taking my data set and
>>>>> merging with the sf object polygon_nyc with the function
>>>>> "merge(polygon_nyc, listings, by=c("zipcode" = "zipcode"))", so I create
>>>>> a huge n x n matrix (depending on the size of my data set).
>>>>>
>>>>> Taking the polygon_nyc alone and turning it to a lw (weights list)
>>>>> object has only n = 177.
>>>>>
>>>>> Of course running
>>>>>
>>>>> spatialreg::lagsarlm(formula = model, data = listings_sample,
>>>>> listw = polygon_nyc_lw, tol.solve = 1.0e-10)
>>>>>
>>>>> does not work ("Input data and weights have different dimensions").
>>>>>
>>>>> Is the only option to take my data set, merge it with my polygon_nyc (by
>>>>> zipcode) and then create the weights list lw? Or is there another option?
>>>>
>>>> I think we are getting more clarity. You do not know the location of the
>>>> lettings beyond their zipcode. You do know the boundaries of the zipcode
>>>> areas, and can create a neighbour object from these boundaries. You then
>>>> want to treat all the lettings in a zipcode area i as neighbours, and
>>>> additionally lettings in zipcode areas neighbouring i as neighbours of
>>>> lettings in i. This is the data structure that motivated the
>>>> spdep::nb2blocknb() function:
>>>>
>>>> https://r-spatial.github.io/spdep/reference/nb2blocknb.html
>>>>
>>>> Try running the examples to get a feel for what is going on.
>>>>
>>>> I feel that most of the variability will vanish in the very large numbers
>>>> of neighbours, over-smoothing the outcomes. If you do not have locations
>>>> for the lettings themselves, I don't think you can make much progress.
>>>>
>>>> You could try a linear mixed model (or gam with a spatially structured
>>>> random effect) with a temporal and a spatial random effect. See the HSAR
>>>> package, articles by Dong et al., and maybe
>>>> https://doi.org/10.1016/j.spasta.2017.01.002 for another survey. Neither
>>>> this nor Dong et al. handle spatio-temporal settings. MRF spatial random
>>>> effects at the zipcode level might be a way forward, together with an IID
>>>> random effect at the same level (equivalent to self-neighbours).
>>>>
>>>> Hope this helps,
>>>>
>>>> Roger
>>>>
>>>>>
>>>>> Best regards,
>>>>> Robert
>>>>>
>>>>> ________________________________________
>>>>> From: Roger Bivand <Roger.Bivand using nhh.no>
>>>>> Sent: Wednesday, November 6, 2019 15:07
>>>>> To: Robert R
>>>>> Cc: r-sig-geo using r-project.org
>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>
>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>
>>>>>> Dear Roger,
>>>>>>
>>>>>> Thank you for your reply. I disabled HTML; my e-mails should be now in
>>>>>> plain text.
>>>>>>
>>>>>> I will give a better context for my desired outcome.
>>>>>>
>>>>>> I am taking Airbnb's listings information for New York City available
>>>>>> on: http://insideairbnb.com/get-the-data.html
>>>>>>
>>>>>> I save every listings.csv.gz file available for NYC (2015-01 to 2019-09)
>>>>>> - in total, 54 files/time periods - as a YYYY-MM-DD.csv file into a
>>>>>> Listings/ folder. When importing all these 54 files into one single data
>>>>>> set, I create a new "date_compiled" variable/column.
>>>>>>
>>>>>> In total, after the data cleansing process, I have a little more than 2
>>>>>> million observations.
>>>>>
>>>>> You have repeat lettings for some, but not all properties. So this is at
>>>>> best a very unbalanced panel. For those properties with repeats, you may
>>>>> see temporal movement (trend/seasonal).
>>>>>
>>>>> I suggest (strongly) taking a single borough or even zipcode with some
>>>>> hundreds of properties, and working from there. Do not include the
>>>>> observation as its own neighbour; perhaps identify repeats and handle them
>>>>> specially (create or use a property ID). Unbalanced panels may also create
>>>>> a selection bias issue (why are some properties only listed sometimes?).
>>>>>
>>>>> So this, although promising, isn't simple, and getting to a hedonic model
>>>>> may be hard, but not (just) because of spatial autocorrelation. I wouldn't
>>>>> necessarily trust OLS output either, partly because of the repeat property
>>>>> issue.
>>>>>
>>>>> Roger
>>>>>
>>>>>>
>>>>>> I created 54 timedummy variables for each time period available.
>>>>>>
>>>>>> I want to estimate using a hedonic spatial timedummy model the impact of
>>>>>> a variety of characteristics which potentially determine the daily rate
>>>>>> on Airbnb listings through time in New York City (e.g. characteristics
>>>>>> of the listing such as number of bedrooms, whether the host is professional,
>>>>>> proximity to downtown (New York City Hall) and nearest subway station
>>>>>> from the listing, income per capita, etc.).
>>>>>>
>>>>>> My dependent variable is price (log price, common in the related
>>>>>> literature for hedonic prices).
>>>>>>
>>>>>> The OLS model is done.
>>>>>>
>>>>>> For the spatial model, I am assuming that hosts, when deciding the
>>>>>> pricing of their listings, take not only into account its structural and
>>>>>> location characteristics, but also the prices charged by near listings
>>>>>> with similar characteristics - spatial autocorrelation is then present,
>>>>>> at least spatial dependence is present in the dependent variable.
>>>>>>
>>>>>> As I wrote in my previous post, I was willing to consider the neighbor
>>>>>> itself as a neighbor.
>>>>>>
>>>>>> Parts of my code can be found below:
>>>>>>
>>>>>> ########
>>>>>>
>>>>>> ## packages
>>>>>>
>>>>>> packages_install <- function(packages){
>>>>>> new.packages <- packages[!(packages %in% installed.packages()[, "Package"])]
>>>>>> if (length(new.packages))
>>>>>> install.packages(new.packages, dependencies = TRUE)
>>>>>> sapply(packages, require, character.only = TRUE)
>>>>>> }
>>>>>>
>>>>>> packages_required <- c("bookdown", "cowplot", "data.table", "dplyr", "e1071", "fastDummies", "ggplot2", "ggrepel", "janitor", "kableExtra", "knitr", "lubridate", "nngeo", "plm", "RColorBrewer", "readxl", "scales", "sf", "spdep", "stargazer", "tidyverse")
>>>>>> packages_install(packages_required)
>>>>>>
>>>>>> # Working directory
>>>>>> setwd("C:/Users/User/R")
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## shapefile_us
>>>>>>
>>>>>> # Shapefile zips import and Coordinate Reference System (CRS) transformation
>>>>>> # Shapefile download: https://www2.census.gov/geo/tiger/GENZ2018/shp/cb_2018_us_zcta510_500k.zip
>>>>>> shapefile_us <- sf::st_read(dsn = "Shapefile", layer = "cb_2018_us_zcta510_500k")
>>>>>>
>>>>>> # Columns removal
>>>>>> shapefile_us <- shapefile_us %>% select(-c(AFFGEOID10, GEOID10, ALAND10, AWATER10))
>>>>>>
>>>>>> # Column rename: ZCTA5CE10
>>>>>> setnames(shapefile_us, old=c("ZCTA5CE10"), new=c("zipcode"))
>>>>>>
>>>>>> # Column class change: zipcode
>>>>>> shapefile_us$zipcode <- as.character(shapefile_us$zipcode)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## polygon_nyc
>>>>>>
>>>>>> # Zip code not available in shapefile: 11695
>>>>>> polygon_nyc <- shapefile_us %>% filter(zipcode %in% zips_nyc)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## weight_matrix
>>>>>>
>>>>>> # Neighboring polygons: list of neighbors for each polygon (queen contiguity neighbors)
>>>>>> polygon_nyc_nb <- poly2nb((polygon_nyc %>% select(-borough)), queen=TRUE)
>>>>>>
>>>>>> # Include neighbour itself as a neighbour
>>>>>> # for(i in 1:length(polygon_nyc_nb)){polygon_nyc_nb[[i]]=as.integer(c(i,polygon_nyc_nb[[i]]))}
>>>>>> polygon_nyc_nb <- include.self(polygon_nyc_nb)
>>>>>>
>>>>>> # Weights to each neighboring polygon
>>>>>> lw <- nb2listw(neighbours = polygon_nyc_nb, style="W", zero.policy=TRUE)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## listings
>>>>>>
>>>>>> # Data import
>>>>>> files <- list.files(path="Listings/", pattern=".csv", full.names=TRUE)
>>>>>> listings <- setNames(lapply(files, function(x) read.csv(x, stringsAsFactors = FALSE, encoding="UTF-8")), files)
>>>>>> listings <- mapply(cbind, listings, date_compiled = names(listings))
>>>>>> listings <- listings %>% bind_rows
>>>>>>
>>>>>> # Characters removal
>>>>>> listings$date_compiled <- gsub("Listings/", "", listings$date_compiled)
>>>>>> listings$date_compiled <- gsub(".csv", "", listings$date_compiled)
>>>>>> listings$price <- gsub("\\$", "", listings$price)
>>>>>> listings$price <- gsub(",", "", listings$price)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## timedummy
>>>>>>
>>>>>> timedummy <- sapply("date_compiled_", paste, unique(listings$date_compiled), sep="")
>>>>>> timedummy <- paste(timedummy, sep = "", collapse = " + ")
>>>>>> timedummy <- gsub("-", "_", timedummy)
>>>>>>
>>>>>>
>>>>>>
>>>>>> ## OLS regression
>>>>>>
>>>>>> # Pooled cross-section data - Randomly sampled cross sections of Airbnb listings price at different points in time
>>>>>> regression <- plm(formula=as.formula(paste("log_price ~ #some variables", timedummy, sep = "", collapse = " + ")), data=listings, model="pooling", index="id")
>>>>>>
>>>>>> ########
>>>>>>
>>>>>> Some of my id's repeat in multiple time periods.
>>>>>>
>>>>>> I use NYC's zip codes to left join my data with the neighborhood zip code specific characteristics, such as income per capita to that specific zip code, etc.
>>>>>>
>>>>>> Now I want to apply the hedonic model with the timedummy variables.
>>>>>>
>>>>>> Do you know how to proceed? 1) Which package to use (spdep/splm)?; 2) Do I have to join the polygon_nyc (by zip code) to my listings data set, and then calculate the weight matrix "lw"?
>>>>>>
>>>>>> Again, thank you very much for the help provided until now.
>>>>>>
>>>>>> Best regards,
>>>>>> Robert
>>>>>>
>>>>>> ________________________________________
>>>>>> From: Roger Bivand <Roger.Bivand using nhh.no>
>>>>>> Sent: Tuesday, November 5, 2019 15:30
>>>>>> To: Robert R
>>>>>> Cc: r-sig-geo using r-project.org
>>>>>> Subject: Re: [R-sig-Geo] Spatial Autocorrelation Estimation Method
>>>>>>
>>>>>> On Tue, 5 Nov 2019, Robert R wrote:
>>>>>>
>>>>>>> I have a large pooled cross-section data set. I would like to
>>>>>>> estimate/regress using spatial autocorrelation methods. I am assuming
>>>>>>> for now that spatial dependence is present in both the dependent
>>>>>>> variable and the error term. My data set is over a period of 4 years,
>>>>>>> monthly data (54 periods). For this purpose, I've created a time dummy
>>>>>>> variable for each time period. I also created a weight matrix using
>>>>>>> the functions "poly2nb" and "nb2listw". Now I am trying to figure out
>>>>>>> a way to estimate my model, which contains a really big data set.
>>>>>>> Basically, my model is as follows: y = Dδ + ρW1y + Xβ + λW2u + ε
>>>>>>> My questions are: 1) My spatial weight matrix for the whole data set
>>>>>>> will probably be an enormous matrix with submatrices for each time
>>>>>>> period itself. I don't think it would be possible to calculate this.
>>>>>>> What I would like to know is a way to estimate each time dummy/period
>>>>>>> separately (to compare different periods alone). How to do it?
>>>>>>> 2) Which package to use: spdep or splm? Thank you and best regards, Robert
>>>>>>
>>>>>> Please do not post HTML, only plain text. Almost certainly your model
>>>>>> specification is wrong (SARAR/SAC is always a bad idea if alternatives are
>>>>>> untried). What is your cross-sectional size? Using sparse kronecker
>>>>>> products, the "enormous" matrix may not be very big. Does it make any
>>>>>> sense using time dummies (54 x N x T will be mostly zero anyway)? Are most
>>>>>> of the covariates time-varying? Please provide motivation and use area
>>>>>> (preferably with affiliation; your email and user name are not
>>>>>> informative) - this feels like a real estate problem, probably wrongly
>>>>>> specified. You should use splm if time makes sense in your case, but if it
>>>>>> really doesn't, simplify your approach, as much of the data will be
>>>>>> subject to very large temporal autocorrelation.
>>>>>>
>>>>>> If this is a continuation of your previous question about using
>>>>>> self-neighbours, be aware that you should not use self-neighbours in
>>>>>> modelling, they are only useful for the Getis-Ord local G_i^* measure.
>>>>>>
>>>>>> Roger
>>>>>>
>>>>>>>
>>>>>>>
>>>>>>> _______________________________________________
>>>>>>> R-sig-Geo mailing list
>>>>>>> R-sig-Geo using r-project.org
>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>>>
>>>>>> --
>>>>>> Roger Bivand
>>>>>> Department of Economics, Norwegian School of Economics,
>>>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>>>> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
>>>>>> https://orcid.org/0000-0003-2392-6140
>>>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>>>
>>>>>
>>>>> --
>>>>> Roger Bivand
>>>>> Department of Economics, Norwegian School of Economics,
>>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>>> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
>>>>> https://orcid.org/0000-0003-2392-6140
>>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>>
>>>>
>>>> --
>>>> Roger Bivand
>>>> Department of Economics, Norwegian School of Economics,
>>>> Helleveien 30, N-5045 Bergen, Norway.
>>>> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
>>>> https://orcid.org/0000-0003-2392-6140
>>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>>
>>>
>>> --
>>> Roger Bivand
>>> Department of Economics, Norwegian School of Economics,
>>> Helleveien 30, N-5045 Bergen, Norway.
>>> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
>>> https://orcid.org/0000-0003-2392-6140
>>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>>
>>
>> --
>> Roger Bivand
>> Department of Economics, Norwegian School of Economics,
>> Helleveien 30, N-5045 Bergen, Norway.
>> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
>> https://orcid.org/0000-0003-2392-6140
>> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>>
>
> --
> Roger Bivand
> Department of Economics, Norwegian School of Economics,
> Helleveien 30, N-5045 Bergen, Norway.
> voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
> https://orcid.org/0000-0003-2392-6140
> https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
>
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en