[R-sig-Geo] error message when running errorsarlm

Roger Bivand Roger.Bivand at nhh.no
Fri May 23 08:42:56 CEST 2008


On Thu, 22 May 2008, Roger Bivand wrote:

> On Thu, 22 May 2008, evans324 at umn.edu wrote:
>
>>  On May 22 2008, Roger Bivand wrote:
>> 
>> > > >   Does that mean that you get a sensible lambda for your model now - 
>> > > >   the line search leads somewhere other than a boundary of the 
>> > > >   interval?
>> > > 
>> > >   I apologize for being unclear. I actually upgraded R and updated 
>> > >   packages, then ran errorsarlm with method="Matrix" and got the same 
>> > >   error messages I'd had previously (i.e., the search led to the 
>> > >   boundary of the interval). I then tried your other suggestion and 
>> > >   used method="spam" and got a result with no error messages.
>> > 
>> >  But we do not know why the two are not the same (they should be), so I 
>> >  would still not trust the outcome. I would be interested in off-list 
>> >  access to the data being used - I think that there is some issue with 
>> >  the scaling of the variable values. Do you see the same difference using 
>> >  spautolm(), which is effectively the same as errorsarlm(), but with a 
>> >  different internal structure?
>>
>>  I do see the same difference using spautolm() and get no error messages
>>  using it. I'll send you the data separately and would appreciate your
>>  opinion on them.

To close the part of the thread about differences between the spam and 
Matrix methods, I can report that on Linux (both 2GB and 1GB 32-bit), 
there is no difference for the original model using the 2500m distance 
criterion, and both line searches reach a lambda of 0.9165158. The same 
applies to Windows 32-bit. All runs used R 2.7.0 with R's generic BLAS, 
spdep 0.4-21, Matrix 0.999375-9, and spam 0.13-3.
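
For anyone wanting to reproduce the check, a minimal sketch (the model 
formula, data frame and listw object here are placeholders, not the 
originals):

library(spdep)
fit_M <- errorsarlm(price ~ age + lotsize, data = house_df, listw = lw,
                    method = "Matrix")
fit_s <- errorsarlm(price ~ age + lotsize, data = house_df, listw = lw,
                    method = "spam")
c(Matrix = fit_M$lambda, spam = fit_s$lambda)  # the two should agree closely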

The sparser weights cases described below run with adequate speed; the 
semi-dense weights (average number of neighbours 280) run more slowly, with 
most of the time spent making sure that the weights are exactly symmetric 
- currently in the R functions similar.listw() and listw2U(), both of which 
will be re-written to hand off the time-consuming parts to compiled C.
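
For reference, listw2U() returns the exactly symmetric 0.5 * (W + t(W)) 
form, and similar.listw() a symmetric matrix similar to an intrinsically 
asymmetric (for example row-standardised) one; a short sketch, with lw a 
placeholder listw object:

lw_U <- listw2U(lw)        # exactly symmetric: weights 0.5 * (w_ij + w_ji)
lw_s <- similar.listw(lw)  # symmetric matrix similar to asymmetric lw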

The data were 10055 house prices, the objective being to fit a hedonic 
regression. Conclusion: using a sparser neighbour representation is 
advisable; the computational problems could not be reproduced, but looked 
initially like a package version issue - Matrix is moving fast, and spdep 
is trying to keep up with it.

Roger

>
> Heather:
>
> OK, thanks. On first inspection, the choice of a distance criterion for 
> neighbours seems to be part of the problem. Using:
>
> library(spdep)  # knearneigh(), knn2nb() and make.sym.nb() are in spdep
> nb_k5 <- knn2nb(knearneigh(coordinates(rd), k=5))  # 5 nearest neighbours
> nb_k5s <- make.sym.nb(nb_k5)  # coerce the neighbour list to symmetric
>
> where rd is the SpatialPointsDataFrame object, with many fewer neighbours 
> than your 2500m or 3000m criteria, gives results from "Matrix" and "spam" 
> that are identical, and most likely what you are after. These weights are the 
> 5 nearest neighbours coerced to symmetric, so all have at least 5 and the 
> largest number of neighbours is 12 (your 2500m criterion had a mean number of 
> neighbours of 280, maximum 804). If you can live without your choice of 
> neighbours (which in some settings may be getting pretty close to your market 
> segment dummies), I'd advise using something much sparser (but symmetric). 
> The sparser weights matrices also increase the speed dramatically.
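>
> As a sketch of the comparison (rd as above; the 2500m criterion is rebuilt 
> here only to inspect the neighbour counts):
>
> nb_d <- dnearneigh(coordinates(rd), 0, 2500)  # semi-dense distance weights
> summary(card(nb_k5s))  # neighbour counts after symmetrisation: 5 to 12
> summary(card(nb_d))    # mean around 280, maximum 804, as noted above
> lw <- nb2listw(nb_k5s, style="W")  # row-standardised weights for the fit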
>
> If you look at the bottom of ?bptest.sarlm, you'll see a cheap and totally 
> untested way of adjusting the output SEs, but please don't believe what it 
> does, because it is treating the lambda value as known, not estimated. A 
> guess at the remaining heterogeneity would be an age-by-maintenance 
> interaction: older houses will vary in value by maintenance, and probably 
> also by neighbourhood?
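>
> As a sketch, the test itself ("fit" a placeholder name for a fitted sarlm 
> object):
>
> bptest.sarlm(fit)  # Breusch-Pagan test for heteroskedasticity in the fit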
>
> Hope this helps,
>
> Roger
>> 
>> > > >   There are different traditions. Econometricians and some others in 
>> > > >   social science try to trick the standard errors by "magic", while 
>> > > >   epidemiologists (and crime people) typically use case weights - 
>> > > >   that is model the heteroscedasticity directly. spautolm() can 
>> > > >   include such case weights. I don't think that there is any 
>> > > >   substantive and reliable theory for adjusting the SE, that is 
>> > > >   theory that doesn't appeal to assumptions we already know don't 
>> > > >   hold. Sampling from the posterior gives a handle on this, but is 
>> > > >   not simple, and doesn't really suit 10K observations.
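>> > > > 
>> > > >   A sketch of the case-weights form (names hypothetical; w would 
>> > > >   be a vector of case weights chosen to model the error variance):
>> > > > 
>> > > >   fit_w <- spautolm(price ~ age, data = house_df, listw = lw,
>> > > >                     weights = w)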
>> > > > 
>> > >   Can you explain "magic" a little further? I'm running this for a 
>> > >   professor who is a bit nervous about black box techniques and I'd 
>> > >   like to be able to offer him a good explanation. I think he'll just 
>> > >   have me calculate White's standard errors and ignore spatial 
>> > >   autocorrelation if I can't be clearer.
>> > > 
>> > 
>> >  If this is all your "professor" can manage, please replace/educate! The 
>> >  model is fundamentally misspecified, and neither "magicing" the standard 
>> >  errors, nor just fitting a simultaneous autoregressive error model will 
>> >  let you make fair decisions on the "significance" or otherwise of the 
>> >  right-hand side variables, which I suppose is the object of the 
>> >  exercise?
>> >
>>  I agree here, but haven't been able to get much advice on this. I
>>  appreciate your input.
>> 
>> >  (Looking at Johnston & DiNardo (1997), pp. 164-166, it looks as if 
>> >  White's SE only help asymptotically (in Prof. Ripley's well-known 
>> >  remark, asymptotics are a foreign country with spatial data), and not in 
>> >  finite samples, and their performance is unknown if the residuals are 
>> >  autocorrelated, which is the case here).
>> 
>> >  The vast number of observations is no help either, because they 
>> >  certainly introduce heterogeneity that has not been controlled for. Is 
>> >  this a grid of global species occurrence data, by any chance? Which RHS 
>> >  variables are covering for differences in environmental drivers? Or is 
>> >  there a better reason for using many observations (instead of careful 
>> >  data collection) than just their being available?
>> >
>>  This is a hedonic regression with a goal of eliciting economic values for
>>  different percentages of tree cover on parcels and in the local
>>  neighborhood as capitalized in home sale prices. We're using all 2005
>>  residential sales from Ramsey and Dakota counties in Minnesota, USA as our
>>  observations. This gives us sales from most study area regions and for all
>>  months. I'll send you a description of the RHS variables with the dataset.
>> 
>> >  More observations do not mean more information if meaningful differences 
>> >  across the observations are not captured by included variables (with the 
>> >  correct functional form). Have you tried GAM with flexible functional 
>> >  forms on the RHS variables and s(x,y) on the (point) locations of the 
>> >  observations?
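>> >
>> >  A sketch of that suggestion using mgcv (variable names hypothetical):
>> >
>> >  library(mgcv)
>> >  fit_gam <- gam(log(price) ~ s(tree_cover) + s(age) + s(x, y),
>> >                 data = house_df)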
>>
>>  I haven't tried this, but will look into it.
>> 
>> >  You are not alone in your plight, but if the inferences matter, then 
>> >  it's better to be cautious, irrespective of the "professor".
>> >
>>  Thanks very much for your help.
>>
>>  Regards,
>>  Heather
>>
>>  --- Heather Sander
>>  Ph.D. Candidate:  Conservation Biology
>>  Office:  305 Ecology & 420 Blegen
>>  Mail:  University of Minnesota
>>  Dept. of Geography
>>  414 Social Science Bldg.
>>  267 19th Ave. S.
>>  Minneapolis, MN 55455
>>  USA

-- 
Roger Bivand
Economic Geography Section, Department of Economics, Norwegian School of
Economics and Business Administration, Helleveien 30, N-5045 Bergen,
Norway. voice: +47 55 95 93 55; fax +47 55 95 95 43
e-mail: Roger.Bivand at nhh.no



