[R-sig-Geo] Running huge dataset with dnearneigh

Roger Bivand Roger.Bivand using nhh.no
Tue Jul 2 11:20:20 CEST 2019


Follow-up: maybe read: https://geocompr.robinlovelace.net/location.html 
for a geomarketing case.

Roger

On Tue, 2 Jul 2019, Roger Bivand wrote:

> On Tue, 2 Jul 2019, Jiawen Ng wrote:
>
>>  Dear Roger,
>>
>>  Thanks for your reply and explanation!
>>
>>  I am just exploring the aspect of geodemographics in store locations.
>>  There
>>  are many factors that can be considered, as you have highlighted!
>
> OK, so I suggest choosing a modest-sized case until a selection of working 
> models emerges. Once you reach that stage, you can return to scaling up. I 
> think you need much more data on the customer behaviour around the stores you 
> use to train your models, particularly customer flows associated with actual 
> purchases. Firms used to collect this through loyalty programmes and cards, 
> but that data is not open, so you'd need proxies, which, say, city bike data 
> will not give you.
>
> Geodemographics (used for direct mailing as a marketing tool) have largely 
> been eclipsed by profiling in social media, with the exception of segments 
> without social media profiles. This is because postcode or OA profiling is 
> often too noisy, and hence expensive, because there are many false hits. 
> Retail is interesting but very multi-faceted; some personal services, by 
> contrast, are more closely related to population because they are hard to 
> digitise.
>
> Hope this helps,
>
> Roger
>
>>
>>  Thank you so much for taking the time to write back to me! I will study
>>  and
>>  consider your advice! Thank you!
>>
>>  Jiawen
>>
>>  On Mon, 1 Jul 2019 at 19:12, Roger Bivand <Roger.Bivand using nhh.no> wrote:
>>
>>>  On Mon, 1 Jul 2019, Jiawen Ng wrote:
>>>
>>>>  Dear Roger,
>>>>
>>>>  Thank you so much for your detailed response and pointing out potential
>>>>  pitfalls! It has prompted me to re-evaluate my approach.
>>>>
>>>>  Here is the context: I have some stores' sales data (this is my training
>>>>  set of 214 points), and I would like to find out where best to set up new
>>>>  stores in the UK. I am using a geodemographics approach to do this:
>>>>  perform a regression of sales against census data, then predict sales on
>>>>  UK output areas (by centroids) and finally identify new areas with
>>>>  location-allocation models. As the stores are points, this has led me to
>>>>  define UK output areas by their population-weighted centroids, thus
>>>>  resulting in the prediction being by points rather than by areas. Tests
>>>>  (like Moran's I and Lagrange multiplier tests) for spatial relationships
>>>>  among the points in my training set were significant, hence this has led
>>>>  me to implement some spatial models (specifically spatial lag, error and
>>>>  Durbin models) to account for the spatial relationships in the data.
>>>
>>>  I'm afraid that my retail geography is not very up to date, but also that
>>>  your approach is most unlikely to yield constructive results.
>>>
>>>  Most retail stores are organised in large chains, and so optimise costs
>>>  between wholesale and retail. Independent retail stores depend crucially
>>>  on access to wholesale stores, so in any case cannot locate without regard
>>>  to supply costs. Some service activities without wholesale dependencies
>>>  are less tied.
>>>
>>>  Most chains certainly behave strategically with regard to each other,
>>>  sometimes locating toe-to-toe to challenge a competing chain
>>>  (Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
>>>  competing chain locations to establish a local monopoly (think
>>>  Hotelling).
>>>
>>>  Population density doesn't express demand, especially unmet demand, at
>>>  all well. Think of food deserts - maybe plenty of people but little
>>>  disposable income. Look at the food desert literature, or the US food
>>>  stamp literature.
>>>
>>>  Finally (all bad news), retail is not only challenged by location
>>>  shifting from high streets to malls, but critically by online shopping,
>>>  which, once the buyer is engaged at a proposed price, shifts the cost
>>>  structures to logistics, to complete the order at the highest margin
>>>  including returns. That only marginally relates to population density.
>>>
>>>  So you'd need more data than you have, a model that explicitly handles
>>>  competition between chains as well as market gaps, and some way of
>>>  handling online leakage to move forward.
>>>
>>>  If population density were a proxy for accessibility (most often it
>>>  isn't), it might look like the beginnings of a model, but most often we
>>>  don't know what bid-rent surfaces look like, and different activities
>>>  sort differently across those surfaces.
>>> 
>>>>
>>>>  I am quite unsettled and unclear as to which neighbourhood definition to
>>>>  go for, actually. I thought of IDW at first as I thought this would
>>>>  summarise each point's relationship with its neighbours very precisely,
>>>>  thus making the predictions more accurate. Upon your advice (don't use
>>>>  IDW or other general weights for predictions), I decided not to use IDW,
>>>>  and changed it to dnearneigh instead (although now I am questioning
>>>>  myself on the definition of what is meant by general weights. Perhaps I
>>>>  am misunderstanding the definition of general weights, if dnearneigh is
>>>>  still considered to be a 'general weights' method). Why is the use of
>>>>  IDW not advisable, however? Is it due to computational reasons? Also,
>>>>  why would having thousands of neighbours make no sense? Apologies for
>>>>  asking so many questions, I'd just like to really understand the
>>>>  concepts!
>>>> 
>>>
>>>  The model underlying spatial regressions using neighbours tapers
>>>  dependency as the pairwise elements of (I - \rho W)^{-1} (conditional)
>>>  and [(I - \rho W)(I - \rho W')]^{-1} (simultaneous; see Wall 2004). These
>>>  are NxN dense matrices. (I - \rho W) is typically sparse, and under
>>>  certain conditions leads to (I - \rho W)^{-1} = \sum_{i=0}^{\infty}
>>>  \rho^i W^i, the sum of a power series in \rho and W. \rho is typically
>>>  bounded above by 1, so \rho^i declines as i increases. This dampens
>>>  \rho^i W^i, so that higher-order neighbours (larger i) influence an
>>>  observation less and less. So in the general case IDW is simply
>>>  replicating what simple contiguity gives you anyway, and the sparser W is
>>>  (within reason), the better. Unless you really know that the physics,
>>>  chemistry or biology of your system gives you a known systematic
>>>  relationship like IDW, you may as well stay with contiguity.
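>>>
>>>  A tiny numerical sketch of that tapering (nothing here depends on your
>>>  data; it just uses a 5 x 5 grid and an assumed \rho of 0.5):
>>>
>>>  library(spdep)                      # cell2nb(), nb2mat()
>>>  nb <- cell2nb(5, 5)                 # rook contiguity on a 5 x 5 grid
>>>  W <- nb2mat(nb, style = "W")        # row-standardised weights matrix
>>>  rho <- 0.5
>>>  exact <- solve(diag(25) - rho * W)  # (I - rho W)^{-1}
>>>  # truncated power series: sum_{i=0}^{k} rho^i W^i
>>>  pow <- function(k) Reduce(`+`, lapply(0:k, function(i)
>>>    rho^i * (if (i == 0) diag(25) else Reduce(`%*%`, rep(list(W), i)))))
>>>  max(abs(exact - pow(10)))           # already tiny: higher powers add little
>>>
>>>  The contributions of W^i die away quickly, which is why a sparse,
>>>  contiguity-style W is usually enough.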
>>>
>>>  However, this isn't any use in solving a retail location problem at all.
>>>
>>>>  I believe that both the train and test set have varying intensities. I
>>>>  was weighing the different neighbourhood methods: dnearneigh,
>>>>  knearneigh, using IDW etc., and I felt like each method would have its
>>>>  disadvantages -- it's difficult to pinpoint which neighbourhood
>>>>  definition would be best. If one were to go for knearneigh, for example,
>>>>  results may not be fair due to the inhomogeneity of the points -- for
>>>>  instance, point A's nearest neighbours may be within a few hundred
>>>>  kilometres while point B's may be in the thousands. I feel like the
>>>>  choice of any neighbourhood definition can be highly debatable... What
>>>>  do you think?
>>>> 
>>>
>>>  When in doubt, use contiguity for polygons and similar graph-based
>>>  methods for points. Try to keep the graphs planar (as few intersecting
>>>  edges as possible - rule of thumb).
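>>>
>>>  For point support, a minimal sketch of such graph-based definitions in
>>>  spdep (assuming coords is a two-column matrix of your point coordinates -
>>>  the name is illustrative):
>>>
>>>  library(spdep)
>>>  del <- tri2nb(coords)                               # Delaunay triangulation
>>>  gab <- graph2nb(gabrielneigh(coords), sym = TRUE)   # Gabriel graph
>>>  rel <- graph2nb(relativeneigh(coords), sym = TRUE)  # relative neighbour graph
>>>  soi <- graph2nb(soi.graph(del, coords))             # sphere of influence
>>>  sapply(list(del, gab, rel, soi), function(x) mean(card(x)))
>>>
>>>  The Gabriel and relative neighbour graphs are subgraphs of the Delaunay
>>>  triangulation, so they stay close to planar and keep neighbour counts
>>>  small.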
>>> 
>>>
>>>>  After analysing my problem again, I think that predicting by output
>>>>  areas (points) would be best for my case, as I would have to make use of
>>>>  the population data after building the model. Interpolating census data
>>>>  of the output area (points) would cause me to lose that information.
>>>> 
>>>
>>>  Baseline: this is not going anywhere constructive, and simply approaching
>>>  retail location in this way is unhelpful - there is far too little
>>>  information in your model.
>>>
>>>  If you really must, first find a fully configured retail model with the
>>>  complete data set needed to replicate the results achieved, and use that
>>>  to benchmark how far your approach succeeds in reaching a similar result
>>>  for that restricted area. I think that you'll find that the retail model
>>>  is much more successful, but if not, there is less structure in
>>>  contemporary retail than I thought.
>>>
>>>  Best wishes,
>>>
>>>  Roger
>>>
>>>>  Thank you for the comments and the advice so far, I would greatly
>>>>  welcome and appreciate additional feedback!
>>>>
>>>>  Thank you so much once again!
>>>>
>>>>  Jiawen
>>>> 
>>>>
>>>>  On Sun, 30 Jun 2019 at 16:57, Roger Bivand <Roger.Bivand using nhh.no> wrote:
>>>>
>>>>>  On Sat, 29 Jun 2019, Jiawen Ng wrote:
>>>>>
>>>>>>  Dear Roger,
>>>>>
>>>>>  Postings go to the whole list ...
>>>>> 
>>>>>>
>>>>>>  How can we deal with a huge dataset when using dnearneigh?
>>>>>> 
>>>>>
>>>>>  First, why distance neighbours? What is the support of the data, point
>>>>>  or polygon? If polygon, contiguity neighbours are preferred. If not,
>>>>>  and the intensity of observations is similar across the whole area,
>>>>>  distance may be justified, but if the intensity varies, some
>>>>>  observations will have very many neighbours. In that case, unless you
>>>>>  have a clear ecological or environmental reason for knowing that a
>>>>>  known distance threshold binds, it is not a good choice.
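>>>>>
>>>>>  A quick check on any candidate neighbour object - a sketch, assuming nb
>>>>>  is whatever dnearneigh() returned:
>>>>>
>>>>>  library(spdep)
>>>>>  summary(card(nb))     # distribution of neighbour counts per observation
>>>>>  sum(card(nb) == 0)    # how many observations have no neighbours at all
>>>>>
>>>>>  If the upper end of that distribution runs into the thousands, the
>>>>>  distance threshold is doing you no favours.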
>>>>>
>>>>>>  Here is my code:
>>>>>>
>>>>>>  d <- dnearneigh(spdf, 0, 22000)
>>>>>>  all_listw <- nb2listw(d, style = "W")
>>>>>>
>>>>>>  where the spdf object is in the British National Grid CRS
>>>>>>  (+init=epsg:27700), with 227,973 observations/points. The distance of
>>>>>>  22,000 was chosen from a training set of 214 observations, and the
>>>>>>  spdf object contains both the training set and the testing set.
>>>>>> 
>>>>>
>>>>>  This is questionable. You train on 214 observations - does their areal
>>>>>  intensity match that of the whole data set? If chosen at random, you
>>>>>  run into the spatial sampling problems discussed in:
>>>>>
>>>>>  https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author
>>>>>
>>>>>  Are 214 observations for training representative of 227,973 prediction
>>>>>  sites? Do you only have observations on the response for 214, and an
>>>>>  unobserved response otherwise? What are the data, what are you trying
>>>>>  to do and why? This is not a sensible setting for models using weights
>>>>>  matrices for prediction (I think), because we do not have estimates of
>>>>>  the prediction error in general.
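>>>>>
>>>>>  A rough way to compare intensities - a sketch only, assuming train_xy
>>>>>  and all_xy are coordinate matrices for the training points and for all
>>>>>  prediction sites (illustrative names):
>>>>>
>>>>>  library(spdep)
>>>>>  nn1 <- function(xy) unlist(nbdists(knn2nb(knearneigh(xy, k = 1)), xy))
>>>>>  summary(nn1(train_xy))  # nearest-neighbour distances, training set
>>>>>  summary(nn1(all_xy))    # nearest-neighbour distances, all sites
>>>>>
>>>>>  If those two distributions are very different, a single distance
>>>>>  threshold cannot suit both sets.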
>>>>>
>>>>>>  I am using a Mac, with a processor of 2.3 GHz Intel Core i5 and 8 GB
>>>>>>  memory. My laptop showed that when the dnearneigh command was run on
>>>>>>  all observations, around 6.9 out of 8 GB was used by the rsession and
>>>>>>  that the %CPU used by the rsession was stated to be around 98%,
>>>>>>  although another indicator showed that my computer was around 60%
>>>>>>  idle. After running the command for a day, RStudio alerted me that the
>>>>>>  connection to the rsession could not be established, so I aborted the
>>>>>>  entire process altogether. I think the problem here may be the size of
>>>>>>  the dataset and perhaps the limitations of my laptop specs.
>>>>>> 
>>>>>
>>>>>  On planar data, there is no good reason for this, as each observation
>>>>>  is treated separately, finding and sorting distances, and choosing
>>>>>  those under the threshold. It will undoubtedly slow down if there are
>>>>>  more than a few neighbours within the threshold, but I have already
>>>>>  covered the inadvisability of defining neighbours in that way.
>>>>>
>>>>>  Using an rtree might help, but you get hit badly if there are many
>>>>>  neighbours within the threshold you have chosen anyway.
>>>>>
>>>>>  On most 8GB hardware and modern OS, you do not have more than 3-4GB for
>>>>>  work. So something was swapping on your laptop.
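>>>>>
>>>>>  If a distance criterion really is needed at this scale, one sketch -
>>>>>  assuming pts is an sf point object in EPSG:27700, and using sf's
>>>>>  spatial index rather than spdep - would be:
>>>>>
>>>>>  library(sf)
>>>>>  library(spdep)
>>>>>  sgbp <- st_is_within_distance(pts, dist = 22000)  # indexed search
>>>>>  nb <- lapply(seq_along(sgbp), function(i) {
>>>>>    x <- setdiff(sgbp[[i]], i)                      # drop self-matches
>>>>>    if (length(x) == 0L) 0L else as.integer(x)      # 0L marks no neighbours
>>>>>  })
>>>>>  class(nb) <- "nb"
>>>>>  summary(card(nb))
>>>>>
>>>>>  But the neighbour counts will be just as large as before - the indexing
>>>>>  only speeds up finding them.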
>>>>>
>>>>>>  Do you have any advice on how I can go about making a neighbours list
>>>>>>  with dnearneigh for 227,973 observations in a successful and efficient
>>>>>>  way? Also, would you foresee any problems in the next steps,
>>>>>>  especially when I will be using the neighbourhood listw object as an
>>>>>>  input in fitting and predicting with the spatial lag/error models?
>>>>>>  (see code below)
>>>>>>
>>>>>>  model <- spatialreg::lagsarlm(rest_formula, data = train,
>>>>>>    listw = train_listw)
>>>>>>  model_pred <- predict(model, newdata = test, listw = all_listw)
>>>>>> 
>>>>>
>>>>>  Why would using a spatial lag model make sense? Why are you suggesting
>>>>>  this model - do you have a behavioural argument for why only the
>>>>>  spatially lagged response should be included?
>>>>>
>>>>>  Why do you think that this is sensible? You are predicting 1000 times
>>>>>  for each observation - this is not what the prediction methods are
>>>>>  written for. Most involve inverting an NxN matrix - did you refer to
>>>>>  Goulard et al. (2017) to get a good understanding of the underlying
>>>>>  methods?
>>>>>
>>>>>>  I think the predicting part may take some time, since my test set
>>>>>>  consists of 227,973 - 214 observations = 227,759 observations.
>>>>>>
>>>>>>  Here are some solutions that I have thought of:
>>>>>>
>>>>>>  1. Interpolate the test set point data of 227,759 observations over a
>>>>>>  more manageable spatial pixel data frame with a cell size of perhaps
>>>>>>  10,000m by 10,000m, which would give me around 4900 points. So instead
>>>>>>  of 227,759 observations, I can make the listw object based on just
>>>>>>  4900 + 214 training points and predict on just 4900 observations.
>>>>>
>>>>>  But what are you trying to do? Are the observations output areas? House
>>>>>  sales? If you are not filling in missing areal units (the Goulard et
>>>>>  al. case), couldn't you simply use geostatistical methods, which seem
>>>>>  to match your support better, and which can be fitted and used for
>>>>>  prediction with a local neighbourhood? While you are doing that, you
>>>>>  could switch to INLA with SPDE, which interposes a mesh like the one
>>>>>  you suggest. But in that case, beware of the mesh choice issue in:
>>>>>
>>>>>  https://doi.org/10.1080/03610926.2018.1536209
>>>>> 
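>>>>>  On the geostatistical route, a minimal sketch of local kriging with
>>>>>  gstat - assuming an sf training object train with a response sales and
>>>>>  a prediction object newd; all names are illustrative:
>>>>>
>>>>>  library(gstat)
>>>>>  library(sf)
>>>>>  v  <- variogram(sales ~ 1, train)            # empirical variogram
>>>>>  vm <- fit.variogram(v, vgm("Exp"))           # fit an exponential model
>>>>>  # local kriging: only the nmax nearest observations enter each prediction
>>>>>  kr <- krige(sales ~ 1, train, newdata = newd, model = vm, nmax = 50)
>>>>>
>>>>>  The nmax argument is what keeps the prediction local and cheap, even
>>>>>  with a couple of hundred thousand prediction sites.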
>>>>>>
>>>>>>  2. Get hold of better-performing machines through cloud computing,
>>>>>>  such as AWS EC2 services, and try running the commands and models
>>>>>>  there.
>>>>>> 
>>>>>
>>>>>  What you need are methods, not wasted money on hardware as a service.
>>>>>
>>>>>>  3. Parallel computing using the parallel package from R (although I
>>>>>>  am not sure whether dnearneigh can be parallelised).
>>>>>> 
>>>>>
>>>>>  This could easily be implemented if it were really needed, which I
>>>>>  don't think it is; better methods understanding lets one do more with
>>>>>  less.
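>>>>>
>>>>>  For completeness, a sketch of how the distance search could be chunked
>>>>>  and run in parallel - assuming the sf-based search above and an sf
>>>>>  point object pts (illustrative; mclapply forks, so macOS/Linux only):
>>>>>
>>>>>  library(parallel)
>>>>>  library(sf)
>>>>>  idx <- split(seq_len(nrow(pts)), cut(seq_len(nrow(pts)), 8))  # 8 chunks
>>>>>  res <- mclapply(idx, function(i)
>>>>>    st_is_within_distance(pts[i, ], pts, dist = 22000), mc.cores = 4)
>>>>>  sgbp <- do.call(c, res)   # plain list of neighbour index vectors
>>>>>
>>>>>  But as above, the bottleneck is the number of neighbours the threshold
>>>>>  produces, not the search itself.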
>>>>>
>>>>>>  I believe option 1 would be the most manageable, but I am not sure how
>>>>>>  and by how much this would affect the accuracy of the predictions, as
>>>>>>  interpolating the dataset would be akin to introducing more estimation
>>>>>>  into the prediction. However, I am also grappling with the trade-off
>>>>>>  between accuracy and computation time. Hence, if options 2 and 3 can
>>>>>>  offer a reasonable computation time (1-2 hours) then I would forgo
>>>>>>  option 1.
>>>>>>
>>>>>>  What do you think? Is it possible to make a neighbourhood listw object
>>>>>>  out of 227,973 observations efficiently?
>>>>>
>>>>>  Yes, but only if the numbers of neighbours are very small. Look in
>>>>>  Bivand et al. (2013) to see the use of some fairly large n, but only
>>>>>  with few neighbours for each observation. You seem to be getting
>>>>>  average neighbour counts in the thousands, which makes no sense.
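>>>>>
>>>>>  For comparison, a k-nearest-neighbour definition keeps the counts small
>>>>>  by construction - a sketch, assuming coords holds the coordinates of
>>>>>  all 227,973 points (illustrative name):
>>>>>
>>>>>  library(spdep)
>>>>>  knn <- knearneigh(coords, k = 5)   # efficient nearest-neighbour search
>>>>>  nb  <- knn2nb(knn, sym = TRUE)     # symmetrise the k-nearest relation
>>>>>  all_listw <- nb2listw(nb, style = "W")
>>>>>  summary(card(nb))                  # counts stay close to k
>>>>>
>>>>>  That said, the choice of k still needs a justification, just as a
>>>>>  distance threshold does.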
>>>>> 
>>>>>>
>>>>>>  Thank you for reading to the end! Apologies for writing a lengthy one;
>>>>>>  I just wanted to fully describe what I am facing. I hope I didn't miss
>>>>>>  out anything crucial.
>>>>>> 
>>>>>
>>>>>  Long is OK, but there is no motivation here for why you want to make
>>>>>  200K predictions from 200 observations with point support (?) using
>>>>>  weights matrices.
>>>>>
>>>>>  Hope this clarifies,
>>>>>
>>>>>  Roger
>>>>>
>>>>>>  Thank you so much once again!
>>>>>>
>>>>>>  jiawen
>>>>>>
>>>>
>>
>
>

-- 
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand using nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en


