[R-sig-Geo] efficient code/function for rectangular SP weight Matrix and gwr

Roger Bivand Roger.Bivand at nhh.no
Thu May 24 00:03:07 CEST 2007


On Wed, 23 May 2007, Sam Field wrote:

> Roger Bivand wrote:
> > On Tue, 22 May 2007, Sam Field wrote:
> >
> >   
> >> In the event that others are following this thread, I have included 
> >> some R code for constructing a rectangular weight matrix from two sets 
> >> of coordinates (in feet). The first set of coords (nhood_pts[,1] and 
> >> nhood_pts[,2]) comes from a regular grid; the second set refers to the 
> >> locations of patients (combo2$X and combo2$Y). I don't provide the data, 
> >> only the code below. The original data consists of 1000 grid cells and 
> >> 30000+ patients. It takes 30 minutes on my fairly speedy computer to 
> >> complete the first step - extracting pairwise distances less than a 
> >> given bandwidth (D). I store the neighbor ids and counts in separate 
> >> lists. The rest of the code calculates kernel weights based on these 
> >> distances. The code is not efficient from any number of standpoints, but 
> >> storing these distances in a list rather than a matrix avoided running 
> >> into memory allocation problems. Here is the code.
> >>     
> >
> > Would spDistsN1() in sp have helped to speed things up a little? Looping
> > over the 1k grid points and passing in the 30k patient points shouldn't
> > be very demanding (34 seconds on an oldish box, more if you need Great
> > Circle distances). For heavyweight things, quadtrees would be a better
> > choice; I have an unpublished package for approximate nearest neighbours.
> >
> > Roger
> >
> >   
> The short answer is yes.  The whole calculation went from 30 minutes to 
> under 30 seconds!
> 
> Thanks, Roger! R is well documented, but these lists are invaluable!

Please believe me that developers also search the lists - finding a 
fruitful and complete thread can really help!
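
For the archives, a rough and untested sketch of that replacement, keeping
the object names from your code (I assume nhood_pts is a two-column matrix
of grid coordinates and combo2$X/combo2$Y the patient coordinates):

library(sp)

D <- 2*5280                               # bandwidth in feet (2 miles)
pat <- cbind(combo2$X, combo2$Y)          # patient coordinates as a matrix
n.grid <- nrow(nhood_pts)

w_raw.id <- vector("list", n.grid)        # neighbour ids per grid point
w <- vector("list", n.grid)               # row-standardised kernel weights
for (i in 1:n.grid) {
  # distances from grid point i to all patients in a single call
  d <- spDistsN1(pat, c(nhood_pts[i, 1], nhood_pts[i, 2]))
  nb <- which(d < D)
  w_raw.id[[i]] <- nb
  kw <- (1 - (d[nb]/D)^2)^2               # bisquare kernel
  w[[i]] <- kw/sum(kw)
}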

> 
> I am using these weights in a logistic GWR.  The bootstrapping step took 
> 3 days.  Are there functions that someone has written in R for logistic 
> GWR (or any other GLM for that matter)?  I need a function that allows 
> for an offset term in the linear predictor.  What I am using now works 
> (looping through glm()), but on this problem it takes a very, very long 
> time.
> 

There is a very preliminary ggwr() in the spgwr package, but it doesn't pass 
the offset= argument through. It should, though, pass a + offset(var) term 
through inside the formula - could you try that? It uses spDistsN1() 
internally, so it should run reasonably fast. gwr() has a cryptic and 
experimental cl= argument for a snow cluster, but ggwr() hasn't been 
modified to suit computing on a cluster (yet). Would that help (for 
non-Windows users, I believe)?
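
Something like the following untested call is what I have in mind; the
variable names are only placeholders (case a 0/1 outcome, age and sex
standing in for covariates, off_lp a pre-computed offset on the linear
predictor scale):

library(spgwr)

fit <- ggwr(case ~ age + sex + offset(off_lp),
            data = combo2,
            coords = cbind(combo2$X, combo2$Y),
            bandwidth = 2*5280,            # in coordinate units (feet)
            gweight = gwr.bisquare,
            family = binomial)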

Best wishes,

Roger

> 
> Sam
> 
> apologies to Roger for sending this directly.
> 

Thanks for remembering!

> 
> >> # define bandwidth: 2 miles, in feet
> >> D <- 2*5280
> >>
> >> # Step 1: for each grid point, store the distance to every patient that
> >> # lies within the bandwidth D (NA otherwise)
> >> n.grid <- length(nhood_pts[,1])
> >> w_raw <- vector("list", n.grid)
> >> for (i in 1:n.grid) {
> >>   w_raw[[i]] <- rep(NA, length(combo2$X))
> >> }
> >>
> >> system.time(
> >>   for (i in 1:n.grid) {
> >>     for (j in 1:length(combo2$X)) {
> >>       d.ij <- sqrt((nhood_pts[,1][i] - combo2$X[j])^2 +
> >>                    (nhood_pts[,2][i] - combo2$Y[j])^2)
> >>       if (d.ij < D) w_raw[[i]][j] <- d.ij
> >>     }
> >>   }
> >> )
> >>
> >> # Step 2: keep the neighbour ids (patients within D) and their distances
> >> w_raw.id <- vector("list", n.grid)
> >> for (i in 1:n.grid) {
> >>   w_raw.id[[i]] <- which(!is.na(w_raw[[i]]))
> >> }
> >> for (i in 1:n.grid) {
> >>   w_raw[[i]] <- w_raw[[i]][w_raw.id[[i]]]
> >> }
> >>
> >> # Step 3: neighbour counts per grid point
> >> w_raw.count <- rep(1, n.grid)
> >> for (i in 1:n.grid) {
> >>   w_raw.count[i] <- length(w_raw[[i]])
> >> }
> >>
> >> # Step 4: bisquare kernel weights, row-standardised
> >> w_weights <- vector("list", n.grid)
> >> for (i in 1:n.grid) {
> >>   w_weights[[i]] <- (1 - (w_raw[[i]]/D)^2)^2
> >> }
> >>
> >> w <- vector("list", n.grid)
> >> for (i in 1:n.grid) {
> >>   w[[i]] <- w_weights[[i]]/sum(w_weights[[i]])
> >> }
> >>
> >>
> >>
> >>
> >>
> >> Roger Bivand wrote:
> >>     
> >>> On Fri, 11 May 2007, Stéphane Dray wrote:
> >>>
> >>>   
> >>>       
> >>>> Hi Sam,
> >>>>
> >>>> I think that this question is quite general and could interest other 
> >>>> people, including me, with very different aims. I have developed a 
> >>>> method to look for relationships between two data sets that have 
> >>>> been sampled over the same area but at different locations. In my 
> >>>> example, the two samples are two polygon layers. In this approach, I 
> >>>> compute a rectangular weighting matrix where each weight corresponds to 
> >>>> the area of intersection between polygons of the two layers. I also 
> >>>> used the matrix form to store these weights (my data set was very small 
> >>>> compared to yours). I remember that Roger was also interested in these 
> >>>> rectangular weights in another context. Here we have two different problems:
> >>>> - how to compute such weights
> >>>> - how to store them.
> >>>>
> >>>> For the first point, I think that the solution is different for each 
> >>>> method/application. We could develop/extend the classical tools for 
> >>>> square weights (one set of spatial units) to rectangular weights (two 
> >>>> sets of spatial units).
> >>>> For the second point, it would probably be interesting to define a class 
> >>>> of objects in spdep. nb objects are lists, and I think that a list would 
> >>>> also be the solution for rectangular neighborhoods.
> >>>>
> >>>> Consider two sets of spatial units (A and B) with na and nb units 
> >>>> respectively. We could store the neighbors in a list of length 2. The 
> >>>> first element of this list is a list of length na, whose j-th element is 
> >>>> a vector of the neighbors of the j-th unit of layer A; these neighbors 
> >>>> are spatial units of layer B. The second element of the global list is a 
> >>>> list of length nb where each element is again a vector of neighbors, 
> >>>> this time units of layer A.
> >>>>
> >>>> I think that we have to think about a class of objects that could be 
> >>>> useful for everybody dealing with this kind of rectangular weights. If 
> >>>> this class is properly defined (second point), we could then develop 
> >>>> tools to construct this kind of neighborhood (first point). A possible 
> >>>> extension to more than two data sets could also be taken into account in 
> >>>> this discussion.
> >>>>
> >>>>     
> >>>>         
> >>> I would welcome input on this. I'm looking at an alternative weights 
> >>> representation through classes in the Matrix package, which is evolving 
> >>> fast, and which seems to be promising. If the dimnames slot is used to 
> >>> hold the region.id values, it might be possible to make progress.
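
As a rough illustration of that idea (an untested sketch with made-up
values, not an implemented representation), a sparse rectangular weights
matrix from the Matrix package could carry the region.id values in its
dimnames:

library(Matrix)
W <- sparseMatrix(i = c(1, 1, 3), j = c(2, 4, 1), x = c(0.6, 0.4, 1),
                  dims = c(3, 4),
                  dimnames = list(c("a1", "a2", "a3"),
                                  c("b1", "b2", "b3", "b4")))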
> >>>
> >>> Best wishes,
> >>>
> >>> Roger
> >>>
> >>>   
> >>>       
> >>>> Cheers,
> >>>>
> >>>>
> >>>> Sam Field wrote:
> >>>>     
> >>>>         
> >>>>> List,
> >>>>>
> >>>>> I need to create a rectangular spatial weight matrix for a set of n and 
> >>>>> m objects. I quickly run into memory allocation problems when 
> >>>>> constructing the full matrix in a single pass. I am looking for a more 
> >>>>> efficient way of doing this. There appear to be efficient procedures in 
> >>>>> spdep for constructing SQUARE spatial weight matrices (e.g. 
> >>>>> dnearneigh()). Are there analogous procedures for constructing 
> >>>>> distance-based weights between two different point patterns? I am doing 
> >>>>> this in preparation for implementing an approximate geographically 
> >>>>> weighted logistic regression procedure. I was thinking about using a 
> >>>>> resampling procedure as an inferential frame - perhaps I might get some 
> >>>>> feedback. This is what I was going to do.
> >>>>>
> >>>>> I have a point pattern of 30,000 diabetic people based on where they 
> >>>>> lived during a 2-year period. During that period, approximately 4% of 
> >>>>> them developed diabetes. I am interested in isolating the impact of 
> >>>>> ecological factors on the geographic variation of the disease, so it is 
> >>>>> necessary to control for the spatial clustering of individual-level risk 
> >>>>> factors associated with the disease (diabetes).
> >>>>>
> >>>>> Step 1: Estimate a logistic regression using the full sample and predict 
> >>>>> diabetes incidence using individual-level covariates (i.e. who developed 
> >>>>> diabetes over the two-year period).
> >>>>>
> >>>>> Step 2: Estimate a weighted logit model at each location (grid cell). The 
> >>>>> observations would be the people (not the geographic units) and the 
> >>>>> weights would be kernel weights based on distance. The model would 
> >>>>> contain only a single freely estimated parameter, the intercept, but it 
> >>>>> would also contain an offset term. For each patient, the offset term 
> >>>>> would simply be an evaluation of the linear predictor of the global 
> >>>>> model estimated above (based on the observed covariate values), but 
> >>>>> without the intercept. This would effectively fix the estimates of the 
> >>>>> patient-level coefficients to their global values, requiring only a 
> >>>>> local estimate of the intercept. My hope is that I could interpret 
> >>>>> geographic variability in the intercept as evidence for a "location 
> >>>>> effect" net of the patient composition or "risk profile" at a particular 
> >>>>> location. It would probably make sense to center the X variables so that 
> >>>>> the intercept is interpretable and estimated in a region of the response 
> >>>>> plane where there is plenty of data. I would let the other covariates 
> >>>>> vary as well, but I doubt the model could be estimated in large portions 
> >>>>> of the study area because of sparse data.
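
As an untested sketch, Step 2 for a single fit point might look like the
following, assuming d holds the distances from that point to all patients
(e.g. from spDistsN1()), eta the global linear predictor evaluated without
its intercept, and case the 0/1 outcome:

D <- 2*5280
kw <- ifelse(d < D, (1 - (d/D)^2)^2, 0)   # bisquare kernel weights
# intercept-only weighted logit with the global linear predictor as offset;
# glm() will warn about non-integer successes with fractional weights
fit.i <- glm(case ~ 1 + offset(eta), family = binomial, weights = kw)
local.intercept <- coef(fit.i)[1]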
> >>>>>
> >>>>> Step 3: If I were going to do inference on the location-specific 
> >>>>> intercepts, I would generate a sampling distribution at each location by 
> >>>>> resampling from the global model and repeating Step 2 for each randomly 
> >>>>> drawn sample. This would give me a local sampling distribution of 
> >>>>> intercept estimates at each location, and I could compare it to the 
> >>>>> single one generated from the observed data. The global model represents 
> >>>>> a kind of null because the intercept is fixed at its global value and 
> >>>>> geographic variability is driven entirely by the spatial clustering of 
> >>>>> patient-level factors.
> >>>>>
> >>>>>
> >>>>> thanks!
> >>>>>
> >>>>> Sam
> >>>>>

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no



