[R-sig-Geo] Running huge dataset with dnearneigh
Roger Bivand
Roger.Bivand at nhh.no
Tue Jul 2 11:20:20 CEST 2019
Follow-up: maybe read: https://geocompr.robinlovelace.net/location.html
for a geomarketing case.
Roger
On Tue, 2 Jul 2019, Roger Bivand wrote:
> On Tue, 2 Jul 2019, Jiawen Ng wrote:
>
>> Dear Roger,
>>
>> Thanks for your reply and explanation!
>>
>> I am just exploring the aspect of geodemographics in store locations.
>> There
>> are many factors that can be considered, as you have highlighted!
>
> OK, so I suggest choosing a modest sized case until a selection of working
> models emerges. Once you reach that stage, you can return to scaling up. I
> think you need much more data on the customer behaviour around the stores you
> use to train your models, particularly customer flows associated with actual
> purchases. Firms used to do this through loyalty programmes and cards, but
> this data is not open, so you'd need proxies, which, say, city-bike data
> will not give you.
>
> Geodemographics (used for direct mailing as a marketing tool) have largely
> been eclipsed by profiling in social media with the exception of segments
> without social media profiles. This is because postcode or OA profiling is
> often too noisy and so is expensive because there are many false hits. Retail
> is interesting but very multi-faceted; some personal services are more
> closely related to population, as they are hard to digitise.
>
> Hope this helps,
>
> Roger
>
>>
>> Thank you so much for taking the time to write back to me! I will study
>> and
>> consider your advice! Thank you!
>>
>> Jiawen
>>
>> On Mon, 1 Jul 2019 at 19:12, Roger Bivand <Roger.Bivand at nhh.no> wrote:
>>
>>> On Mon, 1 Jul 2019, Jiawen Ng wrote:
>>>
>>>> Dear Roger,
>>>>
>>>> Thank you so much for your detailed response and pointing out potential
>>>> pitfalls! It has prompted me to re-evaluate my approach.
>>>>
>>>> Here is the context: I have some stores' sales data (this is my
>>>> training set of 214 points), and I would like to find out where best to
>>>> set up new stores in the UK. I am using a geodemographics approach to do
>>>> this: perform a regression of sales against census data, then predict
>>>> sales on UK output areas (by centroids) and finally identify new areas
>>>> with location-allocation models. As the stores are points, this has led
>>>> me to define UK output areas by their population-weighted centroids,
>>>> thus resulting in prediction at points rather than by areas. Tests (like
>>>> Moran's I and the Lagrange multiplier) for spatial relationships among
>>>> the points in my training set were significant, hence this has led me to
>>>> implement some spatial models (specifically spatial lag, error and
>>>> Durbin models) to account for the spatial relationships in the data.
>>>
>>> I'm afraid that my retail geography is not very up to date, but also that
>>> your approach is most unlikely to yield constructive results.
>>>
>>> Most retail stores are organised in large chains, so they optimise costs
>>> between wholesale and retail. Independent retail stores depend crucially
>>> on access to wholesale stores, so in any case they cannot locate without
>>> regard to supply costs. Some service activities without wholesale
>>> dependencies are less tied.
>>>
>>> Most chains certainly behave strategically with regard to each other,
>>> sometimes locating toe-to-toe to challenge a competing chain
>>> (Carrefour/Tesco or their local shop variants), sometimes avoiding nearby
>>> competing chain locations to establish a local monopoly (think
>>> Hotelling).
>>>
>>> Population density doesn't express demand, especially unmet demand, well
>>> at all. Think food deserts - maybe plenty of people but little disposable
>>> income. Look at the food desert literature, or the US food stamp
>>> literature.
>>>
>>> Finally (all bad news), retail is not only challenged by location
>>> shifting from high streets to malls, but critically by online shopping,
>>> which, once the buyer is engaged at a proposed price, shifts the cost
>>> structure to logistics, to complete the order at the highest margin
>>> including returns. That only marginally relates to population density.
>>>
>>> So you'd need more data than you have, a model that explicitly handles
>>> competition between chains as well as market gaps, and some way of
>>> handling online leakage to move forward.
>>>
>>> If population density were a proxy for accessibility (most often it
>>> isn't), it might look like the beginnings of a model, but most often we
>>> don't know what bid-rent surfaces look like, and then, most often,
>>> different activities sort differently across those surfaces.
>>>
>>>>
>>>> I am quite unsettled and unclear as to which neighbourhood definition
>>>> to go for, actually. I thought of IDW at first as I thought this would
>>>> summarise each point's relationship with its neighbours very precisely,
>>>> thus making the predictions more accurate. Upon your advice (don't use
>>>> IDW or other general weights for predictions), I decided not to use
>>>> IDW, and changed it to dnearneigh instead (although now I am questioning
>>>> myself on the definition of what is meant by general weights. Perhaps I
>>>> am understanding the definition of general weights wrongly, if
>>>> dnearneigh is still considered to be a 'general weights' method.) Why is
>>>> the use of IDW not advisable, however? Is it due to computational
>>>> reasons? Also, why would having thousands of neighbours make no sense?
>>>> Apologies for asking so many questions, I'd just like to really
>>>> understand the concepts!
>>>>
>>>
>>> The model underlying spatial regressions using neighbours tapers
>>> dependency as the pairwise elements of (I - \rho W)^{-1} (conditional)
>>> and [(I - \rho W)(I - \rho W')]^{-1} (simultaneous; see Wall 2004). These
>>> are NxN dense matrices. (I - \rho W) is typically sparse, and under
>>> certain conditions leads to
>>> (I - \rho W)^{-1} = \sum_{i=0}^{\infty} \rho^i W^i,
>>> the sum of a power series in \rho and W. \rho is typically bounded above
>>> (< 1), so \rho^i declines as i increases. This dampens \rho^i W^i, so
>>> that neighbours at higher orders i influence an observation less and
>>> less. So in the general case IDW is simply replicating what simple
>>> contiguity gives you anyway. So the sparser W is (within reason), the
>>> better. Unless you really know that the physics, chemistry or biology of
>>> your system gives you a known systematic relationship like IDW, you may
>>> as well stay with contiguity.
>>>
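>>> To make the tapering concrete, here is a minimal sketch (the 10 x 10
>>> lattice, rho = 0.7 and the particular calls are illustrative choices,
>>> not anything taken from your data):
>>>
>>> library(spdep)
>>> nb  <- cell2nb(10, 10, type = "queen")  # toy lattice, queen contiguity
>>> W   <- nb2mat(nb, style = "W")          # row-standardised weights matrix
>>> rho <- 0.7                              # assumed value, bounded above by 1
>>> M   <- solve(diag(100) - rho * W)       # (I - rho W)^{-1}, dense 100 x 100
>>> # influence of cell 1 on cells further along the lattice tapers off,
>>> # even though W comes from simple contiguity, not IDW:
>>> round(M[1, 2:6], 4)
>>>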
>>> However, this isn't any use in solving a retail location problem at all.
>>>
>>>> I believe that both the train and test sets have varying intensities. I
>>>> was weighing the different neighbourhood methods: dnearneigh,
>>>> knearneigh, using IDW etc., and I felt like each method would have its
>>>> disadvantages -- it's difficult to pinpoint which neighbourhood
>>>> definition would be best. If one were to go for knearneigh, for example,
>>>> results may not be fair due to the inhomogeneity of the points -- for
>>>> instance, point A's nearest neighbours may be within a few hundred
>>>> kilometres while point B's may be within a few thousand. I feel like the
>>>> choice of any neighbourhood definition can be highly debatable... What
>>>> do you think?
>>>>
>>>
>>> When in doubt, use contiguity for polygons and similar graph-based
>>> methods for points. Try to keep the graphs planar (as few intersecting
>>> edges as possible - a rule of thumb).
>>>
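>>> For example, a sketch using graph-based neighbours for points (the
>>> random coordinates xy are only a stand-in for your centroids):
>>>
>>> library(spdep)
>>> set.seed(1)
>>> xy <- cbind(runif(500, 0, 7e5), runif(500, 0, 1.3e6))   # placeholder coords
>>> nb_tri <- tri2nb(xy)                                    # Delaunay triangulation
>>> nb_gab <- graph2nb(gabrielneigh(xy), sym = TRUE)        # Gabriel graph (planar subgraph)
>>> nb_soi <- graph2nb(soi.graph(nb_tri, xy), sym = TRUE)   # sphere-of-influence graph
>>> summary(card(nb_gab))    # a handful of neighbours each, not thousands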
>>>
>>>> After analysing my problem again, I think that predicting by output
>>>> areas
>>>> (points) would be best for my case as I would have to make use of the
>>>> population data after building the model. Interpolating census data of
>>> the
>>>> output area (points) would cause me to lose that information.
>>>>
>>>
>>> Baseline, this is not going anywhere constructive, and simply approaching
>>> retail location in this way is unhelpful - there is far too little
>>> information in your model.
>>>
>>> If you really must, first find a fully configured retail model with the
>>> complete data set needed to replicate the results achieved, and use that
>>> to benchmark how far your approach succeeds in reaching a similar result
>>> for that restricted area. I think that you'll find that the retail model
>>> is much more successful, but if not, there is less structure in
>>> contemporary retail than I thought.
>>>
>>> Best wishes,
>>>
>>> Roger
>>>
>>>> Thank you for the comments and the advice so far, I would greatly
>>> welcome
>>>> and appreciate additional feedback!
>>>>
>>>> Thank you so much once again!
>>>>
>>>> Jiawen
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>>
>>>> On Sun, 30 Jun 2019 at 16:57, Roger Bivand <Roger.Bivand at nhh.no> wrote:
>>>>
>>>>> On Sat, 29 Jun 2019, Jiawen Ng wrote:
>>>>>
>>>>>> Dear Roger,
>>>>>
>>>>> Postings go to the whole list ...
>>>>>
>>>>>>
>>>>>> How can we deal with a huge dataset when using dnearneigh?
>>>>>>
>>>>>
>>>>> First, why distance neighbours? What is the support of the data, point
>>>>> or polygon? If polygon, contiguity neighbours are preferred. If not,
>>>>> and the intensity of observations is similar across the whole area,
>>>>> distance may be justified, but if the intensity varies, some
>>>>> observations will have very many neighbours. In that case, unless you
>>>>> have a clear ecological or environmental reason for knowing that a
>>>>> known distance threshold binds, it is not a good choice.
>>>>>
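>>>>> One quick diagnostic (a sketch; d stands for the dnearneigh() result
>>>>> in your code below) is to look at how many neighbours the distance
>>>>> band actually produces:
>>>>>
>>>>> library(spdep)
>>>>> summary(card(d))   # card() gives the neighbour count per observation
>>>>> # with a 22 km band over 227,973 points, expect counts in the thousands
>>>>>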
>>>>>> Here is my code:
>>>>>>
>>>>>> d <- dnearneigh(spdf, 0, 22000)        # all points within 22,000 m
>>>>>> all_listw <- nb2listw(d, style = "W")  # row-standardised weights
>>>>>>
>>>>>> where the spdf object is in the British National Grid CRS
>>>>>> (+init=epsg:27700), with 227,973 observations/points. The distance of
>>>>>> 22,000 m was decided from a training set of 214 observations, and the
>>>>>> spdf object contains both the training set and the testing set.
>>>>>>
>>>>>
>>>>> This is questionable. You train on 214 observations - does their areal
>>>>> intensity match that of the whole data set? If chosen at random, you
>>>>> run into the spatial sampling problems discussed in:
>>>>>
>>>>> https://www.sciencedirect.com/science/article/pii/S0304380019302145?dgcid=author
>>>>>
>>>>> Are 214 observations for training representative of 227,973 prediction
>>>>> sites? Do you only have observations on the response for 214, and an
>>>>> unobserved response otherwise? What are the data, what are you trying
>>>>> to do and why? This is not a sensible setting for models using weights
>>>>> matrices for prediction (I think), because we do not have estimates of
>>>>> the prediction error in general.
>>>>>
>>>>>> I am using a Mac with a 2.3 GHz Intel Core i5 processor and 8 GB of
>>>>>> memory. My laptop showed that when the dnearneigh command was run on
>>>>>> all observations, around 6.9 of the 8 GB was used by the R session and
>>>>>> the %CPU used by the R session was around 98%, although another
>>>>>> indicator showed that my computer was around 60% idle. After running
>>>>>> the command for a day, RStudio alerted me that the connection to the R
>>>>>> session could not be established, so I aborted the entire process
>>>>>> altogether. I think the problem here may be the size of the dataset
>>>>>> and perhaps the limitations of my laptop specs.
>>>>>>
>>>>>
>>>>> On planar data, there is no good reason for this, as each observation
>>>>> is treated separately, finding and sorting distances, and choosing
>>>>> those under the threshold. It will undoubtedly slow down if there are
>>>>> more than a few neighbours within the threshold, but I have already
>>>>> covered the inadvisability of defining neighbours in that way.
>>>>>
>>>>> Using an rtree might help, but you get hit badly if there are many
>>>>> neighbours within the threshold you have chosen anyway.
>>>>>
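>>>>> If you do need the full distance-band list, one route that exploits a
>>>>> spatial index is sf; a sketch only, where pts_sf stands for an assumed
>>>>> sf POINT layer in EPSG:27700 (not an object from your mail), and which
>>>>> should avoid the full all-pairs distance scan in recent sf/GEOS:
>>>>>
>>>>> library(sf)
>>>>> library(spdep)
>>>>> hits <- st_is_within_distance(pts_sf, pts_sf, dist = 22000)
>>>>> nb <- lapply(seq_along(hits), function(i) {
>>>>>   x <- setdiff(hits[[i]], i)              # drop the self-match
>>>>>   if (length(x) == 0L) 0L else as.integer(sort(x))
>>>>> })
>>>>> class(nb) <- "nb"
>>>>> attr(nb, "region.id") <- as.character(seq_along(nb))
>>>>> # the neighbour lists themselves still stay huge with a 22 km band
>>>>>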
>>>>> On most 8GB hardware and modern OS, you do not have more than 3-4GB for
>>>>> work. So something was swapping on your laptop.
>>>>>
>>>>>> Do you have any advice on how I can go about making a neighbours list
>>>>>> with dnearneigh for 227,973 observations in a successful and efficient
>>>>>> way? Also, would you foresee any problems in the next steps, especially
>>>>>> when I will be using the neighbourhood listw object as an input in
>>>>>> fitting and predicting using the spatial lag/error models? (see code
>>>>>> below)
>>>>>>
>>>>>> # fit on the 214 training points, then predict at the remaining sites
>>>>>> model <- spatialreg::lagsarlm(rest_formula, data = train, listw = train_listw)
>>>>>> model_pred <- predict(model, newdata = test, listw = all_listw)
>>>>>>
>>>>>
>>>>> Why would using a spatial lag model make sense? Why are you suggesting
>>>>> this model - do you have a behavioural reason for why only the
>>>>> spatially lagged response should be included?
>>>>>
>>>>> Why do you think that this is sensible? You are predicting 1000 times
>>>>> for each observation - this is not what the prediction methods are
>>>>> written for. Most involve inverting an n x n matrix - did you refer to
>>>>> Goulard et al. (2017) to get a good understanding of the underlying
>>>>> methods?
>>>>>
>>>>>> I think the predicting part may take some time, since my test set
>>>>>> consists of 227,973 - 214 = 227,759 observations.
>>>>>>
>>>>>> Here are some solutions that I have thought of:
>>>>>>
>>>>>> 1. Interpolate the test set point data of 227,759 observations over a
>>>>>> more manageable spatial pixel data frame with a cell size of perhaps
>>>>>> 10,000 m by 10,000 m, which would give me around 4,900 points. So
>>>>>> instead of 227,759 observations, I can make the listw object based on
>>>>>> just 4,900 + 214 training points and predict on just 4,900
>>>>>> observations.
>>>>>
>>>>> But what are you trying to do? Are the observations output areas? House
>>>>> sales? If you are not filling in missing areal units (the Goulard et
>>>>> al. case), couldn't you simply use geostatistical methods, which seem
>>>>> to match your support better, and which can be fitted and can predict
>>>>> using a local neighbourhood? While you are doing that, you could switch
>>>>> to INLA with SPDE, which interposes a mesh like the one you suggest.
>>>>> But in that case, beware of the mesh choice issue in:
>>>>>
>>>>> https://doi.org/10.1080/03610926.2018.1536209
>>>>>
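>>>>> As a sketch of the geostatistical route (gstat here is only for
>>>>> illustration; train_sf, oa_sf and the intercept-only formula are
>>>>> assumptions, not your objects):
>>>>>
>>>>> library(sf)
>>>>> library(gstat)
>>>>> v  <- variogram(sales ~ 1, train_sf)   # sample variogram of the 214 points
>>>>> vm <- fit.variogram(v, vgm("Exp"))     # fit an exponential model
>>>>> # nmax restricts prediction to a local neighbourhood, so a couple of
>>>>> # hundred thousand prediction sites remain feasible
>>>>> pred <- krige(sales ~ 1, locations = train_sf, newdata = oa_sf,
>>>>>               model = vm, nmax = 50)
>>>>>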
>>>>>>
>>>>>> 2. Get hold of better-performing machines through cloud computing,
>>>>>> such as AWS EC2 services, and try running the commands and models
>>>>>> there.
>>>>>>
>>>>>
>>>>> What you need are methods, not wasted money on hardware as a service.
>>>>>
>>>>>> 3. Parallel computing using the parallel package from R (although I am
>>>>>> not sure whether dnearneigh can be parallelised).
>>>>>>
>>>>>
>>>>> This could easily be implemented if it were really needed, which I
>>>>> don't think it is; a better understanding of methods lets one do more
>>>>> with less.
>>>>>
>>>>>> I believe option 1 would be the most manageable, but I am not sure how
>>>>>> and by how much this would affect the accuracy of the predictions, as
>>>>>> interpolating the dataset would be akin to introducing more estimation
>>>>>> into the prediction. However, I am also grappling with the trade-off
>>>>>> between accuracy and computation time. Hence, if options 2 and 3 can
>>>>>> offer a reasonable computation time (1-2 hours), then I would forgo
>>>>>> option 1.
>>>>>>
>>>>>> What do you think? Is it possible to make a neighbourhood listw object
>>>>>> out of 227,973 observations efficiently?
>>>>>
>>>>> Yes, but only if the numbers of neighbours are very small. Look in
>>>>> Bivand et al. (2013) to see the use of some fairly large n, but only
>>>>> with few neighbours for each observation. You seem to be getting
>>>>> average neighbour counts in the thousands, which makes no sense.
>>>>>
>>>>>>
>>>>>> Thank you for reading to the end! Apologies for writing a lengthy one,
>>>>> just
>>>>>> wanted to fully describe what I am facing, I hope I didn't miss out
>>>>>> anything crucial.
>>>>>>
>>>>>
>>>>> Long is OK, but there is no motivation here for why you want to make
>>>>> 200K predictions from 200 observations with point support (?) using
>>>>> weights matrices.
>>>>>
>>>>> Hope this clarifies,
>>>>>
>>>>> Roger
>>>>>
>>>>>> Thank you so much once again!
>>>>>>
>>>>>> jiawen
>>>>>>
>>>>>
>>>>
>>>
>>
>
>
--
Roger Bivand
Department of Economics, Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; e-mail: Roger.Bivand at nhh.no
https://orcid.org/0000-0003-2392-6140
https://scholar.google.no/citations?user=AWeghB0AAAAJ&hl=en