[R-sig-Geo] Running huge dataset with dnearneigh

Roger Bivand Roger.Bivand at nhh.no
Sun Jun 30 17:57:32 CEST 2019


On Sat, 29 Jun 2019, Jiawen Ng wrote:

> Dear Roger,

Postings go to the whole list ...

>
> How can we deal with a huge dataset when using dnearneigh?
>

First, why distance neighbours? What is the support of the data, point or 
polygon? If polygon, contiguity neighbours are preferred. If not, and the 
intensity of observations is similar across the whole area, distance may 
be justified, but if the intensity varies, some observations will have 
very many neighbours. In that case, unless you have a clear ecological or 
environmental reason to believe that a known distance threshold binds, it 
is not a good choice.
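
For illustration only (the object names "polys" and "pts" are 
hypothetical, and this is a sketch rather than a recipe), the usual 
alternatives look like this:

library(spdep)
# polygon support: contiguity neighbours
nb_q <- poly2nb(polys, queen = TRUE)
# point support with varying intensity: k-nearest neighbours keep the
# neighbour count constant everywhere
nb_k <- knn2nb(knearneigh(pts, k = 6))
# a distance band only makes sense if the threshold is justified on
# subject-matter grounds and the intensity is roughly constant
nb_d <- dnearneigh(pts, d1 = 0, d2 = 22000)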

> Here is my code:
>
> d <- dnearneigh(spdf,0, 22000)
> all_listw <- nb2listw(d, style = "W")
>
> where the spdf object is in the british national grid CRS:
> +init=epsg:27700, with 227,973 observations/points. The distance of 22,000
> was decided by a training set that had 214 observations and the spdf object
> contains both the training set and the testing set.
>

This is questionable. You train on 214 observations - does their areal 
intensity match that of the whole data set? If they were chosen at random, 
you run into the spatial sampling problems discussed in:

https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author

Are 214 training observations representative of 227,973 prediction sites? 
Do you only observe the response for those 214, with the response 
unobserved everywhere else? What are the data, what are you trying to do, 
and why? This is not a sensible setting for models using weights matrices 
for prediction (I think), because we do not in general have estimates of 
the prediction error.

> I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
> memory. My laptop showed that when dnearneigh command was run on all
> observations, around 6.9 out of 8GB was used by the rsession and that the
> %CPU used by the rsession was stated to be around 98%, although another
> indicator showed that my computer was around 60% idle. After running the
> command for a day, rstudio alerted me that the connection to the rsession
> could not be established, so I aborted the entire process altogether. I
> think the problem here may be the size of the dataset and perhaps the
> limitations of my laptop specs.
>

On planar data, there is no good reason for this, as each observation is 
treated separately: distances are found and sorted, and those under the 
threshold are kept. It will undoubtedly slow down if there are more than a 
few neighbours within the threshold, but I have already covered the 
inadvisability of defining neighbours in that way.

Using an rtree might help, but you get hit badly if there are many 
neighbours within the threshold you have chosen anyway.

On most 8GB machines running a modern OS, you do not have more than 3-4GB 
available for work, so something was swapping on your laptop.
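
If you must have a distance band on this many points, one possibility (an 
untested sketch, with "pts" a hypothetical sf POINT object in EPSG:27700) 
is to let sf's spatial index do the distance search and then convert the 
result to an nb object:

library(sf)
library(spdep)
sgbp <- st_is_within_distance(pts, pts, dist = 22000)
# drop self-matches; any no-neighbour entries would then need to be
# recoded as 0L to follow the nb convention
nb <- lapply(seq_along(sgbp), function(i) setdiff(sgbp[[i]], i))
class(nb) <- "nb"
attr(nb, "region.id") <- as.character(seq_along(nb))
summary(card(nb))  # look at the counts before building any weights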

> Do you have any advice on how I can go about making a neighbours list with
> dnearneigh for 227,973 observations in a successful and efficient way?
> Also, would you foresee any problems in the next steps, especially when I
> will be using the neighbourhood listw object as an input in fitting and
> predicting using the spatial lag/error models? (see code below)
>
> model <-  spatialreg::lagsarlm(rest_formula, data=train, train_listw)
> model_pred <- spatialreg::predict.sarlm(model, test, all_listw)
>

Why would using a spatial lag model make sense? Why are you suggesting 
this model - do you have a behavioural reason for why only the spatially 
lagged response should be included?

Why do you think that this is sensible? You are predicting at over 1,000 
sites for every observation you have - this is not what the prediction 
methods are written for. Most involve inverting an n x n matrix - did you 
refer to Goulard et al. (2017) to get a good understanding of the 
underlying methods?
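
If you do go down this route, at least read Goulard et al. (2017) and the 
predict method's help page first. A sketch of what the interface looks 
like (object names come from your code; the settings are only indicative, 
not a recommendation):

model <- spatialreg::lagsarlm(rest_formula, data = train,
                              listw = train_listw)
# pred.type selects among the Goulard et al. predictors; the default
# "TS" (trend + signal) avoids the out-of-sample best-prediction types
# that need dense n x n solves and will not scale to 227,759 sites
pred <- predict(model, newdata = test, listw = all_listw,
                pred.type = "TS")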

> I think the predicting part may take some time, since my test set consists
> of 227,973 - 214 observations = 227,759 observations.
>
> Here are some solutions that I have thought of:
>
> 1. Interpolate the test set point data of 227,759 observations over a more
> manageable spatial pixel dataframe with cell size of perhaps 10,000m by
> 10,000m which would give me around 4900 points. So instead of 227,759
> observations, I can make the listw object based on just 4900 + 214 training
> points and predict just on 4900 observations.

But what are you trying to do? Are the observations output areas? House 
sales? If you are not filling in missing areal units (the Goulard et al. 
case), couldn't you simply use geostatistical methods, which seem to match 
your support better and can be fitted and used for prediction with a local 
neighbourhood? While you are at it, you could switch to INLA with SPDE, 
which interposes a mesh like the one you suggest. But in that case, beware 
of the mesh choice issue discussed in:

https://doi.org/10.1080/03610926.2018.1536209
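
As a rough sketch of that local-neighbourhood geostatistical route using 
gstat (the column and object names here are made up, as are the initial 
variogram values):

library(sf)
library(gstat)
v  <- variogram(response ~ 1, train_sf)
vm <- fit.variogram(v, vgm(psill = 1, model = "Exp",
                           range = 20000, nugget = 0.1))
# nmax restricts each prediction to its nearest observations, so the
# 227,759 prediction points never require a global solve
kr <- krige(response ~ 1, train_sf, test_sf, model = vm, nmax = 50)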

>
> 2. Get hold of better performance machines through cloud computing such as
> AWS EC2 services and try running the commands and models there.
>

What you need is better methods, not money wasted on hardware as a 
service.

> 3. Parallel computing using the parallel package from r (although I am not
> sure whether dnearneigh can be parallelised).
>

This could easily be implemented if it were really needed, which I don't 
think it is; a better understanding of the methods lets one do more with 
less.

> I believe option 1 would be the most manageable but I am not sure how and
> by how much this would affect the accuracy of the predictions as
> interpolating the dataset would be akin to introducing more estimations in
> the prediction. However, I am also grappling with the trade-off between
> accuracy and computation time. Hence, if options 2 and 3 can offer a
> reasonable computation time (1-2 hours) then I would forgo option 1.
>
> What do you think? Is it possible to make a neighbourhood listw object out
> of 227,973 observations efficiently?

Yes, but only if the numbers of neighbours are very small. Look in Bivand 
et al. (2013) to see the use of some fairly large n, but only with few 
neighbours for each observation. You seem to be getting average neighbour 
counts in the thousands, which makes no sense.
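
A cheap way to check this before committing to the full run (a sketch; 
adjust the subsample size to taste):

library(spdep)
set.seed(1)
idx  <- sample(nrow(spdf), 5000)
nb_s <- dnearneigh(spdf[idx, ], d1 = 0, d2 = 22000)
summary(card(nb_s))
# multiply the mean count by 227973/5000 for a rough full-data figure;
# if that lands in the thousands, the 22 km threshold is not workable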

>
> Thank you for reading to the end! Apologies for writing a lengthy one, just
> wanted to fully describe what I am facing, I hope I didn't miss out
> anything crucial.
>

Long is OK, but there is no motivation here for why you want to make 200K 
predictions from 200 observations with point support (?) using weights 
matrices.

Hope this clarifies,

Roger

> Thank you so much once again!
>
> jiawen
>

-- 
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand at nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en
