[R-sig-Geo] Running huge dataset with dnearneigh

Jiawen Ng
Sat Jun 29 01:36:22 CEST 2019


Dear Roger,

How can we deal with a huge dataset when using dnearneigh?

Here is my code:

d <- dnearneigh(spdf, 0, 22000)
all_listw <- nb2listw(d, style = "W")

where the spdf object is in the British National Grid CRS (+init=epsg:27700)
and has 227,973 observations/points. The distance threshold of 22,000 m was
chosen from a training set of 214 observations, and the spdf object contains
both the training set and the test set.
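
For illustration, one common way a threshold like this can be chosen (this is
only a sketch, not the exact code I ran; train_spdf stands for the 214-point
training subset) is to take the maximum first-nearest-neighbour distance, so
that every training point has at least one neighbour:

# train_spdf: hypothetical name for the 214-point training subset of spdf
k1 <- knn2nb(knearneigh(coordinates(train_spdf), k = 1))
max(unlist(nbdists(k1, coordinates(train_spdf))))  # candidate distance threshold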

I am using a Mac with a 2.3 GHz Intel Core i5 processor and 8 GB of memory.
While dnearneigh was running on all observations, the R session was using
about 6.9 of the 8 GB of memory and roughly 98% CPU, although another
indicator showed the machine as about 60% idle. After the command had been
running for a day, RStudio reported that the connection to the R session
could not be established, so I aborted the whole process. I think the problem
is the size of the dataset and perhaps the limitations of my laptop's specs.
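
As a rough back-of-envelope check (the study-area size here is my own
assumption, not a measurement): if the 227,973 points are spread over
something like the whole of Great Britain, roughly 230,000 km^2, a 22 km
radius would catch on the order of 1,500 neighbours per point, i.e. several
hundred million links in total, which would on its own account for a large
share of the 8 GB:

n    <- 227973
area <- 230000 * 1e6            # assumed study-area size in m^2 (~ Great Britain)
r    <- 22000                   # distance threshold in m
avg_nb <- n / area * pi * r^2   # average neighbours per point, ~1,500
c(avg_neighbours = avg_nb, total_links = n * avg_nb)  # ~3.4e8 links before any weights are computed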

Do you have any advice on how to build a neighbours list with dnearneigh for
227,973 observations efficiently and reliably? Also, do you foresee any
problems in the next steps, especially when I use the neighbourhood listw
object as an input for fitting and predicting with the spatial lag/error
models? (see code below)

model <- spatialreg::lagsarlm(rest_formula, data = train, listw = train_listw)
model_pred <- predict(model, newdata = test, listw = all_listw)

I expect the prediction step may take some time, since my test set consists
of 227,973 - 214 = 227,759 observations.

Here are some solutions that I have thought of:

1. Interpolate the test set of 227,759 points onto a more manageable spatial
pixel data frame with a cell size of perhaps 10,000 m by 10,000 m, which
would give around 4,900 points. Instead of 227,759 observations, I could then
build the listw object from just the 4,900 grid points plus the 214 training
points, and predict on only 4,900 observations.

2. Get hold of a better-performing machine through cloud computing, such as
AWS EC2, and run the commands and models there.

3. Parallel computing using the parallel package in R (although I am not sure
whether dnearneigh itself can be parallelised); see the sketch after this
list for one alternative I have been considering.
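
To make option 3 concrete, here is the kind of alternative I have been
considering (only a sketch under my own assumptions: it swaps dnearneigh for
dbscan::frNN, a fixed-radius kd-tree search, and builds the nb object by
hand; crds and the other names are illustrative):

library(dbscan)

crds <- coordinates(spdf)         # coordinate matrix in metres (EPSG:27700)
fr <- frNN(crds, eps = 22000)     # fixed-radius search; self-matches are excluded
nb <- lapply(fr$id, function(i) {
  if (length(i) == 0L) 0L else sort(as.integer(i))  # 0L is the spdep marker for "no neighbours"
})
class(nb) <- "nb"
attr(nb, "region.id") <- as.character(seq_along(nb))
all_listw <- nb2listw(nb, style = "W", zero.policy = TRUE)

Even if the search itself is fast, the result would still hold hundreds of
millions of neighbour indices, so the memory question above does not go away.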

I believe option 1 would be the most manageable, but I am not sure how much
it would affect the accuracy of the predictions, since interpolating the
dataset introduces an extra layer of estimation into the prediction (a rough
sketch of what I mean by option 1 follows below). I am also grappling with
the trade-off between accuracy and computation time, so if option 2 or 3 can
offer a reasonable computation time (1-2 hours) I would forgo option 1.
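
For completeness, a minimal sketch of the grid I have in mind for option 1
(assuming spdf can be converted with sf::st_as_sf and that clipping to the
convex hull of the points is acceptable; the object names are illustrative):

library(sf)

pts  <- st_as_sf(spdf)                                          # points in EPSG:27700
hull <- st_convex_hull(st_union(pts))                           # rough outline of the study area
grd  <- st_make_grid(hull, cellsize = 10000, what = "centers")  # 10 km cell centres
grd  <- grd[lengths(st_intersects(grd, hull)) > 0]              # keep centres inside the outline
length(grd)                                                     # should be in the region of 4,900 points
# the listw for prediction would then be built from these centres plus the 214 training points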

What do you think? Is it possible to make a neighbourhood listw object out
of 227,973 observations efficiently?

Thank you for reading to the end! Apologies for the lengthy message; I just
wanted to describe fully what I am facing, and I hope I haven't missed
anything crucial.

Thank you so much once again!

jiawen
