[R-sig-Geo] Holdout Sampling Adaptive Bandwidth SPGWR

Wed Sep 4 11:08:07 CEST 2013

On Tue, 3 Sep 2013, Paul Bidanset wrote:

> Thank you very much for the example and the clarification. My hold out test
> is random. The vector provided by gw.adapt() allows me to see the bandwidth
> size for each point. Is there a way to see each regression point's
> bandwidth size with the correct format you just showed me?

I wouldn't say "correct", rather feasible. Note that the local
coefficients used for prediction are calculated from the gSRDF1 data,
weighted by the distances from the gSRDF1 positions to each gSRDF2
position; the gSRDF2 covariate data are then used to predict. Try:

library(spgwr)
example(georgia)
par("ask"=FALSE)
set.seed(1)
s <- sample(nrow(gSRDF), 100)
gSRDF1 <- gSRDF[s,]
plot(gSRDF1, col="orange")
gSRDF2 <- gSRDF[!(1:nrow(gSRDF) %in% s),]
plot(gSRDF2, col="brown", add=TRUE)
bwsel <- gwr.sel(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov +
   PctBlack, data=gSRDF1, adapt=TRUE, method="aic", longlat=TRUE)
# adding longlat=TRUE to be sure, here the geographical coordinates are
# known from the input data object, distances here in km
bws <- gw.adapt(coordinates(gSRDF1), coordinates(gSRDF2), quant=bwsel,
   longlat=TRUE)
bws
# shows the adaptive bandwidths, which cannot be passed through the 
# bandwidth= argument, which is only for a single fixed value
model1 <- gwr(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov +
   PctBlack, data=gSRDF1, adapt=bwsel, hatmatrix=TRUE, longlat=TRUE)
model1
PredictionsOfNewData  <- gwr(PctBach ~ TotPop90 + PctRural + PctEld +
   PctFB + PctPov + PctBlack, data=gSRDF1, fit.points=gSRDF2, adapt=bwsel,
   prediction=TRUE, fittedGWRobject=model1, se.fit=TRUE, longlat=TRUE)
PredictionsOfNewData
plot(gSRDF2$PctBach, PredictionsOfNewData$SDF$pred)
library(plotrix)
plotCI(1:nrow(PredictionsOfNewData$SDF), PredictionsOfNewData$SDF$pred,
   uiw=2*PredictionsOfNewData$SDF$pred.se, xlab="test counties")
points(1:nrow(PredictionsOfNewData$SDF), gSRDF2$PctBach, pch=16)
summary(gSRDF2$PctBach - PredictionsOfNewData$SDF$pred)
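
The bws vector printed above holds one bandwidth per gSRDF2 regression
point. As a rough illustration of why these differ from point to point,
here is a minimal base-R sketch that takes the distance to the k-th nearest
data point as the bandwidth, with k set by the adaptive proportion. This is
only the nearest-neighbour idea on simulated planar coordinates; it is not
the algorithm gw.adapt() uses, and knn_bandwidth, q and the simulated
points are all assumptions made for the example.

```r
# Illustration only: k-th nearest-neighbour distance as an adaptive bandwidth.
# Simulated planar coordinates, NOT the georgia data and NOT spgwr's algorithm.
set.seed(1)
dp <- cbind(runif(100), runif(100))  # simulated data point coordinates
fp <- cbind(runif(5), runif(5))      # simulated fit (holdout) point coordinates
q <- 0.2                             # assumed adaptive proportion
k <- ceiling(q * nrow(dp))           # number of data points per local kernel

# Hypothetical helper: distance from fit point p to its k-th nearest data point
knn_bandwidth <- function(p, dp, k) {
  d <- sqrt((dp[, 1] - p[1])^2 + (dp[, 2] - p[2])^2)  # Euclidean distances
  sort(d)[k]
}

# One bandwidth per fit point: wider where data points are sparse
bw_sketch <- apply(fp, 1, knn_bandwidth, dp = dp, k = k)
bw_sketch
```

Fit points in sparse parts of the data get larger values, which is the
behaviour a single fixed bandwidth= value cannot express.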

Hope this helps,

Roger

PS.

# the lm case
lm1 <- lm(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov +
   PctBlack, data=gSRDF1)
lmpred <- predict(lm1, gSRDF2, se.fit=TRUE)
plot(gSRDF2$PctBach, lmpred$fit)
summary(gSRDF2$PctBach - lmpred$fit)
plotCI(1:nrow(gSRDF2), lmpred$fit, uiw=2*lmpred$se.fit, xlab="test counties")
points(1:nrow(gSRDF2), gSRDF2$PctBach, pch=16)
# maybe also see errorest in ipred
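
To put the GWR and lm holdout residuals on the same footing, the two
summary() calls can be condensed into a few error measures. A minimal
sketch follows; holdout_errors is a hypothetical helper, not part of spgwr,
and the numbers are made up for illustration.

```r
# Hypothetical helper: summarise holdout prediction errors.
holdout_errors <- function(observed, predicted) {
  stopifnot(length(observed) == length(predicted))
  res <- observed - predicted
  c(bias = mean(res),           # mean error
    mae  = mean(abs(res)),      # mean absolute error
    rmse = sqrt(mean(res^2)))   # root mean squared error
}

# Made-up values for illustration only
obs  <- c(10, 12, 8, 15)
pred <- c(11, 10, 8, 14)
err <- holdout_errors(obs, pred)
print(err)
```

With the objects created above, the same helper could be called as
holdout_errors(gSRDF2$PctBach, PredictionsOfNewData$SDF$pred) and
holdout_errors(gSRDF2$PctBach, lmpred$fit).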

>
>
> On Tue, Sep 3, 2013 at 3:54 PM, Roger Bivand <Roger.Bivand at nhh.no> wrote:
>
>> On Fri, 30 Aug 2013, Roger Bivand wrote:
>>
>>  On Fri, 30 Aug 2013, Paul Bidanset wrote:
>>>
>>>  Thank you. I'd like to subset into a specific county. Should there be
>>>> further partitioning from that level?
>>>>
>>>>
>>> No idea. Please re-create your scenario by subsetting georgia and the
>>> coordinates to suit.
>>>
>>>
>> library(spgwr)
>> example(georgia)
>> gSRDF1 <- gSRDF[1:100,]
>> gSRDF2 <- gSRDF[101:159,]
>>
>> bwsel <- gwr.sel(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov +
>>   PctBlack, data=gSRDF1, adapt=TRUE, method="aic")
>>
>> model1 <- gwr(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov +
>>   PctBlack, data=gSRDF1, adapt=bwsel, hatmatrix=TRUE)
>> PredictionsOfNewData  <- gwr(PctBach ~ TotPop90 + PctRural + PctEld +
>> PctFB + PctPov + PctBlack, data=gSRDF1, fit.points=gSRDF2, adapt=bwsel,
>>   prediction=TRUE, fittedGWRobject=model1)
>> plot(gSRDF2$PctBach, PredictionsOfNewData$SDF$pred)
>>
>> with the development version of spgwr on R-forge; with the released
>> version the polygons of gSRDF2 cause an error. Note your confusion about
>> adapt= in gwr(): if set as adapt=TRUE, this means adapt=1, which includes
>> all the observations in the kernel, setting a very broad bandwidth. Never
>> call gw.adapt(); it isn't a user-level function, but is exposed for
>> exploring the inadequacies of GWR as a method. I would have appreciated an
>> answer wrt. whether your held out test set is random or clustered, but
>> here I've just subsetted the data in the simplest way.
>>
>> Roger
>>
>>
>>
>>  Roger
>>>
>>>
>>>> On Fri, Aug 30, 2013 at 10:19 AM, Roger Bivand <Roger.Bivand at nhh.no>
>>>> wrote:
>>>>
>>>>  On Fri, 30 Aug 2013, Paul Bidanset wrote:
>>>>>
>>>>>  Alrighty then!
>>>>>
>>>>>>
>>>>>>
>>>>> Thanks. Now make this your case by subsetting georgia in a way that
>>>>> matches your case (all counties west of x?, random set?), and we may be
>>>>> getting closer. In the geographical partition, the fit points are all a
>>>>> long way from the data points; in the random case, they aren't grouped
>>>>> in the same way. You may also need to run the model twice, passing the
>>>>> fitted model (fit.points == data.points) through to the next stage, but
>>>>> I'm unsure about that.
>>>>>
>>>>> Roger
>>>>>
>>>>>
>>>>>  Say I create this adaptive bandwidth model using the original dataset
>>>>>> "georgia"
>>>>>>
>>>>>> coords = cbind(georgia$x, georgia$y)
>>>>>> bwsel <- gwr.sel(PctBach ~ TotPop90 + PctRural + PctEld + PctFB +
>>>>>> PctPov +
>>>>>> PctBlack, data=georgia, adapt=TRUE, coords, gweight=gwr.Gauss, method =
>>>>>> "aic" )
>>>>>> bw1 <- gw.adapt(coords, coords, quant=bwsel)
>>>>>> model1 <- gwr(PctBach ~ TotPop90 + PctRural + PctEld + PctFB + PctPov +
>>>>>> PctBlack, data=georgia, bw=b1, coords, hatmatrix=T)
>>>>>> model 1
>>>>>>
>>>>>> Suppose I receive an updated data set (same dependent and independent
>>>>>> variables) and I wish to test the above model1's ability to predict the
>>>>>> dependent variable of these new data points. If this were a basic lm
>>>>>> regression in R, I would use the "predict()" command. I wish to better
>>>>>> understand how I would do so using a GWR model. I found the below
>>>>>> procedure, but I would like to know first if it is capable of
>>>>>> accomplishing this task, and secondly, if I am specifying it correctly.
>>>>>> It seems to me that this procedure, as it stands, doesn't take into
>>>>>> account the appropriate bandwidths for the new data, say,
>>>>>> "georgiaNewData":
>>>>>>
>>>>>> PredictionsOfNewData  <- gwr(PctBach ~ TotPop90 + PctRural + PctEld +
>>>>>> PctFB
>>>>>> + PctPov + PctBlack, data=gSRDF, adapt=TRUE, gweight=gwr.Gauss, method
>>>>>> =
>>>>>> "aic",  bandwidth=bw1,
>>>>>> predictions=TRUE, fit.points=georgiaNewData)
>>>>>> PredictionsOfNewData
>>>>>>
>>>>>> Thanks in advance for guidance and insight...
>>>>>>
>>>>>>
>>>>>> On Fri, Aug 30, 2013 at 9:01 AM, Roger Bivand <Roger.Bivand at nhh.no>
>>>>>> wrote:
>>>>>>
>>>>>>> Provide a reproducible code example of your problem using a built-in
>>>>>>> data set. No reproducible example, no response, as I cannot guess (and
>>>>>>> likely nobody else can either) what your specific misunderstanding is.
>>>>>>> Code using, for example, the Georgia data set in the package. You seem
>>>>>>> to be assuming that you understand how GWR works; I don't think that
>>>>>>> you do, so you have to show what you mean in code.
>>>>>>>
>>>>>>> Roger
>>>>>>>
>>>>>>>
>>>>>>> On Fri, 30 Aug 2013, Paul Bidanset wrote:
>>>>>>>
>>>>>>>  Roger,
>>>>>>>
>>>>>>>
>>>>>>>> I think all I would like to know is if it is possible to apply a
>>>>>>>> calibrated GWR model to a hold-out sample, and if so, what the most
>>>>>>>> accurate way to do so is. I understand the pitfalls of GWR but would
>>>>>>>> like to learn as much as I can before progressing to the next spatial
>>>>>>>> methodology I learn in R.
>>>>>>>>
>>>>>>>>
>>>>>>>> On Fri, Aug 30, 2013 at 3:37 AM, Roger Bivand <Roger.Bivand at nhh.no>
>>>>>>>> wrote:
>>>>>>>>
>>>>>>>>  Paul, Luis,
>>>>>>>>
>>>>>>>>
>>>>>>>>> I suspect that your speculations are completely wrong-headed. Please
>>>>>>>>> provide a reproducible example with a built-in data set, so that
>>>>>>>>> there is at least minimal clarity in what you are guessing. Note in
>>>>>>>>> addition that GWR as a technique should not be used for anything
>>>>>>>>> other than exploration of possible mis-specification in the
>>>>>>>>> underlying model with the given data, as patterning in coefficients
>>>>>>>>> is induced by GWR for simulated covariates with no pattern.
>>>>>>>>>
>>>>>>>>> Roger
>>>>>>>>>
>>>>>>>>>
>>>>>>>>> On Fri, 30 Aug 2013, Luis Guerra wrote:
>>>>>>>>>
>>>>>>>>>>> Thank you Luis. When calibrating the adaptive model, using
>>>>>>>>>>> adapt=TRUE in the bandwidth selection created the proportion you
>>>>>>>>>>> speak of, which then allowed me to create a bandwidth matrix using
>>>>>>>>>>> gw.adapt. However, this has not worked for me with holdout samples.
>>>>>>>>>>> Have you had success in this regard?
>>>>>>>>>>>
>>>>>>>>>>>  Now I get what you mean. Let's show an example:
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>  bw <- gwr.sel(var ~ var1, data=yourdata, adapt=TRUE)
>>>>>>>>>> m <- gwr(var~var1, data=yourdata, adapt=bw, fit.points=newdata)
>>>>>>>>>>
>>>>>>>>>> So an adaptive bandwidth (bw) is calculated based on "yourdata",
>>>>>>>>>> while you are fitting "newdata" later on using that previously found
>>>>>>>>>> bw. I had not thought about it previously. Let's see whether someone
>>>>>>>>>> else can help you (us).
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>> I do not know the intended influence of these "fit.points". I would
>>>>>>>>>>> think that new localized regressions are not calculated, as we're
>>>>>>>>>>> testing the model and previous data points' ability to predict for
>>>>>>>>>>> these new ones, but I could be wrong. My current method, however, is
>>>>>>>>>>> producing much poorer results with the holdouts, which I am fairly
>>>>>>>>>>> sure is related to my inability to incorporate the new points'
>>>>>>>>>>> necessary bandwidths.
>>>>>>>>>>>
>>>>>>>>>> Coming back to the previously created example, imagine that
>>>>>>>>>> "newdata" is a single point that you want to fit. Imagine now that
>>>>>>>>>> "yourdata" is a sample with 1000 cases. Then you are getting 1000
>>>>>>>>>> models with 1000 different intercepts and 1000 different beta values
>>>>>>>>>> to adjust var1, right? Which of all these parameters do you use for
>>>>>>>>>> fitting "newdata"? And something else: what would happen with
>>>>>>>>>> "newdata" if it is far enough away from "yourdata" and we were using
>>>>>>>>>> a fixed bandwidth?
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>  On Aug 29, 2013 8:56 PM, "Luis Guerra" <luispelayo84 at gmail.com>
>>>>>>>>>> wrote:
>>>>>>>>>>
>>>>>>>>>>
>>>>>>>>>>   Dear Paul,
>>>>>>>>>>>
>>>>>>>>>>>
>>>>>>>>>>>> I am dealing with this kind of problem right now, and if I am not
>>>>>>>>>>>> wrong, when you want to apply an adaptive bandwidth, you should
>>>>>>>>>>>> introduce a value for the "adapt" parameter instead of for the
>>>>>>>>>>>> "bandwidth" parameter. This value will be between 0 and 1 and
>>>>>>>>>>>> indicates the proportion of cases around your regression point that
>>>>>>>>>>>> should be included to estimate each local model. So depending on the
>>>>>>>>>>>> number of points around each case, the model will use a different
>>>>>>>>>>>> bandwidth for each point to be fitted.
>>>>>>>>>>>>
>>>>>>>>>>>> Related to your question, do you know what the influence is of the
>>>>>>>>>>>> data introduced in the "data" parameter on the data to be fitted
>>>>>>>>>>>> (introduced in the "fit.points" parameter)? I mean, you have to
>>>>>>>>>>>> obtain new local models (one for each point to be fitted), so I do
>>>>>>>>>>>> not understand whether the "data" parameter is used somehow...
>>>>>>>>>>>>
>>>>>>>>>>>> Best regards,
>>>>>>>>>>>>
>>>>>>>>>>>> Luis
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>> On Fri, Aug 30, 2013 at 1:26 AM, Paul Bidanset <
>>>>>>>>>>>> pbidanset at gmail.com
>>>>>>>>>>>>
>>>>>>>>>>>>  wrote:
>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>
>>>>>>>>>>>>>   Hi Folks,
>>>>>>>>>>>>
>>>>>>>>>>>>
>>>>>>>>>>>>> I was curious if anyone has had experience applying an SPGWR model
>>>>>>>>>>>>> with an adaptive bandwidth matrix to a holdout or validation sample.
>>>>>>>>>>>>> I am using the "fit.points" command, which does not seem to allow
>>>>>>>>>>>>> for a new bandwidth calibrated around the holdout sample's XY
>>>>>>>>>>>>> coordinates. Any direction would be greatly appreciated. I am also
>>>>>>>>>>>>> open to other viable methods.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Cheers,
>>>>>>>>>>>>>
>>>>>>>>>>>>> Paul
>>>>>>>>>>>>>
>>>>>>>>>>>>>         [[alternative HTML version deleted]]
>>>>>>>>>>>>>
>>>>>>>>>>>>> _______________________________________________
>>>>>>>>>>>>> R-sig-Geo mailing list
>>>>>>>>>>>>> R-sig-Geo at r-project.org
>>>>>>>>>>>>> https://stat.ethz.ch/mailman/listinfo/r-sig-geo
>>>>>>>>>>>>>
>>>>>>>>>>>>>           [[alternative HTML version deleted]]
>>>>>>>>>>>>
>>>>>>>>>>>>

-- 
Roger Bivand
Department of Economics, NHH Norwegian School of Economics,
Helleveien 30, N-5045 Bergen, Norway.
voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no